Bug leaves image in an inconsistent state – OKD Image is stuck
The CRI-O bug leaves the Docker image in a half-loaded, inconsistent state. The image name is already reserved, but the binary layer contents are incomplete. As a result, the runtime knows the image but cannot use it. The message “pod sandbox with name … already exists” indicates this conflict.
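To see how often the conflict shows up and on which node, you can scan collected log or event lines for the message quoted above. The following is only a minimal sketch: the function name, the `(source, text)` input format, and the sample lines are assumptions made for illustration, not real cluster output.

```python
import re
from collections import Counter

# Pattern built from the message quoted above. The elided middle part is
# matched loosely because the sandbox name differs from pod to pod.
SANDBOX_CONFLICT = re.compile(r"pod sandbox with name .* already exists")


def count_sandbox_conflicts(lines):
    """Count conflict messages per source; each item is a (source, text) pair."""
    hits = Counter()
    for source, text in lines:
        if SANDBOX_CONFLICT.search(text):
            hits[source] += 1
    return hits


# Hypothetical sample input, not real log output.
sample = [
    ("worker-1", 'error: pod sandbox with name "example" already exists'),
    ("worker-1", "image pulled"),
    ("worker-2", 'error: pod sandbox with name "other" already exists'),
    ("worker-1", 'error: pod sandbox with name "example" already exists'),
]
print(count_sandbox_conflicts(sample))
```

A node whose count keeps growing is a likely candidate for a stuck image.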
OKD Image is stuck – What favors the error?
There are several theories about the origin of the error. Some discussions point to overly long path names when the image layers are stored in the file system. Another theory sees network latency or slow Docker registries as the cause. During my tests, the bug occurred only sporadically: some clusters ran into it, while others were not affected.
Solving the stuck image
When the Docker image gets stuck, it becomes unavailable on the worker node. In addition, you may see increasing network traffic because of repeated attempts to reload the image. In some cases, deleting the affected pod or the namespace containing it solved the problem. In another case, however, neither deleting the pod nor rebooting the worker node helped, and the affected worker node could no longer run the image. Unfortunately, we had to reinstall that worker node from scratch.
When the bug becomes a trouble amplifier
The problems escalate when they reinforce each other. In one case, a hardware malfunction removed a worker node from the cluster. The Kubernetes failover worked well and started the missing pods on a new worker node. As a consequence, Kubernetes started loading all the missing images. The massive image loading caused network latency, which may favor the bug. Unfortunately, several images ran into retries, and the Docker registry activated its pull rate limit.
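The amplification can be made tangible with simple arithmetic. The sketch below assumes a registry that allows a fixed number of pulls per time window; all names and numbers are hypothetical and serve only to show how retries push a failover over the limit.

```python
def total_pulls(images, stuck_images, retries_per_stuck_image):
    """Pull requests caused by one failover: one per image plus the retries."""
    return images + stuck_images * retries_per_stuck_image


def hits_rate_limit(images, stuck_images, retries_per_stuck_image, pull_limit):
    """True if the pulls within one rate-limit window exceed the allowed number."""
    return total_pulls(images, stuck_images, retries_per_stuck_image) > pull_limit


# Hypothetical numbers for illustration only.
# A clean failover of 60 images stays below a limit of 100 pulls.
print(hits_rate_limit(images=60, stuck_images=0, retries_per_stuck_image=0, pull_limit=100))
# Five stuck images with ten retries each push the same failover over the limit.
print(hits_rate_limit(images=60, stuck_images=5, retries_per_stuck_image=10, pull_limit=100))
```

Under these assumptions, a handful of stuck images is enough to turn an otherwise harmless failover into a rate-limit problem.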
Conclusion
This bug is severe. Imagine it occurring in production. Loading container images is so essential that such problems turn operation into a gamble. A different post reports another Docker image-related problem.