OKD Docker Image is stuck – Operation not possible

Apr 30, 2021 | Docker

On OpenShift as well as OKD, a Docker image can get stuck while loading. A severe bug in the CRI-O engine leaves the OKD Docker image in an invalid and unusable state. There are discussions about timeouts while pulling the images from the Docker registry or about overly long file names in the CRI-O storage layer. But in fact, the image is simply stuck: its binary content is not available, and every further attempt to load it fails with “error reserving ctr name”.
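To confirm that a node is hitting this condition, a minimal sketch (assuming CRI-O runs as the systemd unit crio on the affected worker node and that Python is available there, for example via “oc debug node/<node>”) could scan the recent CRI-O journal for the error message:

```python
import subprocess

# Assumption: CRI-O logs to the systemd journal under the unit name "crio"
# on the affected OKD worker node.
STUCK_MARKERS = (
    "error reserving ctr name",
    "pod sandbox with name",  # the "... already exists" variant of the conflict
)

def find_stuck_image_errors(since: str = "1 hour ago") -> list[str]:
    """Return recent CRI-O journal lines that indicate a stuck image."""
    out = subprocess.run(
        ["journalctl", "-u", "crio", "--since", since, "--no-pager"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines()
            if any(marker in line for marker in STUCK_MARKERS)]

if __name__ == "__main__":
    for line in find_stuck_image_errors():
        print(line)
```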

Bug leaves image in an inconsistent state – OKD Image is stuck

The CRI-O bug leaves the Docker image in a half-loaded, inconsistent state. The image name is reserved, but the binary layer contents are incomplete. As a result, the runtime knows the image but cannot use it. The message “pod sandbox with name … already exists” indicates this conflict.
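To see whether a conflicting sandbox reservation already exists for a given pod, a small sketch (assuming crictl is installed on the worker node and talks to the local CRI-O socket; check_sandbox_conflict is an illustrative helper, not part of any tool) could query CRI-O directly:

```python
import json
import subprocess

def check_sandbox_conflict(pod_name: str) -> bool:
    """Check whether CRI-O already holds a pod sandbox for this pod name.

    A sandbox that exists although the pod cannot start matches the
    "pod sandbox with name ... already exists" conflict described above.
    """
    out = subprocess.run(
        ["crictl", "pods", "--name", pod_name, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    sandboxes = json.loads(out).get("items", [])
    for sandbox in sandboxes:
        print(f"sandbox {sandbox['id'][:12]} for {pod_name}: {sandbox.get('state')}")
    return bool(sandboxes)
```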

OKD Image is stuck – What favors the error?

There are several theories about the origin of the error. Some discussions point to overly long path names when the image layers are stored in the file system. Another theory sees network latency or slow Docker registries as the cause. During my tests, the bug occurred only sporadically: some clusters ran into it, while others were not affected.

Solving the stuck image

When a Docker image gets stuck, it becomes unavailable to the worker node. In addition, you may see increasing network traffic caused by repeated image pull retries. In some cases, deleting the affected pod or the containing namespace solved the problem. In another case, however, neither deleting the pod nor rebooting the worker node helped: the affected node could no longer run the image at all, and unfortunately we had to reinstall the worker node from scratch.
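The cleanup steps that worked in the milder cases can be scripted. The following sketch only illustrates that order of operations (delete the pod, then remove the half-loaded image from the node with crictl so the next pull starts clean); the namespace, pod name, and image reference are placeholders, and in the worst case the node still has to be reinstalled:

```python
import subprocess

def delete_pod(namespace: str, pod: str) -> None:
    # First attempt: let Kubernetes recreate the pod from scratch.
    subprocess.run(["oc", "delete", "pod", pod, "-n", namespace], check=True)

def remove_stuck_image(image_ref: str) -> None:
    # Second attempt, run on the affected worker node (e.g. via
    # "oc debug node/<node>"): drop the half-loaded image from CRI-O
    # storage so the next pull starts clean.
    subprocess.run(["crictl", "rmi", image_ref], check=False)

if __name__ == "__main__":
    # Placeholder values -- replace with the pod and image that are stuck.
    delete_pod("my-namespace", "my-stuck-pod")
    remove_stuck_image("quay.io/example/app:1.0")
```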

When the bug becomes a trouble amplifier

The problems boil over when they reinforce each other. In one incident, a hardware malfunction removed a worker node from the cluster. The Kubernetes failover worked as designed and restarted the missing pods on another worker node. As a consequence, Kubernetes began pulling all the missing images. This massive image loading caused network latency, which in turn may favor the bug. Unfortunately, several images ran into pull retries, and the Docker registry activated its pull rate limit.
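One way to notice such an amplification early is to watch for pods that are stuck in image pull errors across the whole cluster. A small monitoring sketch (assuming cluster-admin access via oc; the interpretation threshold is an assumption, not a fixed rule) could look like this:

```python
import json
import subprocess

PULL_ERROR_REASONS = {"ErrImagePull", "ImagePullBackOff"}

def count_pull_errors() -> int:
    """Count containers cluster-wide that are waiting on failed image pulls."""
    out = subprocess.run(
        ["oc", "get", "pods", "--all-namespaces", "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    count = 0
    for pod in json.loads(out)["items"]:
        for status in pod.get("status", {}).get("containerStatuses", []):
            reason = status.get("state", {}).get("waiting", {}).get("reason")
            if reason in PULL_ERROR_REASONS:
                count += 1
    return count

if __name__ == "__main__":
    errors = count_pull_errors()
    # Many simultaneous pull errors hint at a pull storm or a registry
    # rate limit rather than a single broken image.
    print(f"{errors} containers are waiting on failed image pulls")
```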

Conclusion

This is a severe bug. Imagine it occurring in production: loading container images is so essential that such problems turn operations into a gamble. A different post reports another Docker image-related problem.