On March 2, 2022, we discovered and fixed an intermittent bug that was affecting some users deploying apps on US-Central-1 Google Cloud.
Cause: The ingress controller stopped routing traffic to the Kubernetes service cluster IP.
Customers Affected: Customers deploying apps on US-Central-1 Google Cloud.
Solution: We set an annotation on the ingress controller that is meant to direct traffic to the Kubernetes service. That annotation had changed and was no longer doing what it was designed to do.
Instead, the ingress controller was sending traffic directly to the Pod IPs. This sounds like it should work fine, but occasionally a pod that is scaled down (e.g., during a deployment) does not have its IP removed from the ingress controller in a timely manner. In this situation the stale IP can linger for an additional 1 to 4 minutes. This results in 502 responses, because traffic is sent to the IP of a pod that is no longer online.
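To make the failure mode concrete, here is a minimal sketch in Python of what happens while a stale Pod IP is still in the ingress controller's upstream list. The function names, the example IPs, and the round-robin distribution are assumptions made for illustration only; they are not taken from our ingress controller.

```python
def response_status(target_ip, live_pod_ips):
    """Return 200 if the target pod is still running, 502 otherwise."""
    return 200 if target_ip in live_pod_ips else 502


def round_robin_statuses(upstream_ips, live_pod_ips, request_count):
    """Simulate request_count requests spread round-robin over upstream_ips.

    upstream_ips is what the ingress controller still believes is valid;
    live_pod_ips is what is actually running. Round-robin is an assumption.
    """
    return [
        response_status(upstream_ips[i % len(upstream_ips)], live_pod_ips)
        for i in range(request_count)
    ]


# Hypothetical example: the pod at 10.0.0.2 was scaled down during a
# deployment, but its IP has not yet been removed from the upstream list.
upstreams = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
live = {"10.0.0.1", "10.0.0.3"}
statuses = round_robin_statuses(upstreams, live, 6)
```

In this sketch, every request that lands on the stale IP gets a 502 until the IP is removed, which matches the intermittent pattern described above.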
The updated annotation ensures the ingress controller points to the Kubernetes service cluster IP instead of directly to the Pod IPs. The Kubernetes service is then responsible for sending traffic to the active pods. This is how we designed the system to begin with; the ingress controller annotation changed recently, and the change went undetected.
To prevent this from going unnoticed in the future, we added a monitor that checks that the ingress controllers are pointing to the cluster IPs. Additionally, we are looking at how to leverage our logging system to detect this situation and alert us proactively.
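The core of such a check can be sketched as follows. This is an illustrative Python outline, not our production monitor: the function name, the backend names, and the input shapes are hypothetical, and how the upstream list is read from the ingress controller is left out because it depends on the controller in use.

```python
def find_non_cluster_ip_upstreams(upstreams_by_backend, cluster_ip_by_backend):
    """Return backends whose upstreams are not the service cluster IP.

    upstreams_by_backend: backend name -> list of upstream IPs the ingress
        controller is currently configured to send traffic to.
    cluster_ip_by_backend: backend name -> expected service cluster IP.
    The result maps each offending backend to its unexpected IPs.
    """
    violations = {}
    for backend, upstream_ips in upstreams_by_backend.items():
        expected = cluster_ip_by_backend.get(backend)
        unexpected = sorted(ip for ip in upstream_ips if ip != expected)
        if unexpected:
            violations[backend] = unexpected
    return violations


# Hypothetical data: app-a is routed through its cluster IP as intended,
# while app-b is routed directly to Pod IPs and should raise an alert.
upstreams = {
    "app-a": ["10.96.0.10"],
    "app-b": ["10.244.1.7", "10.244.2.9"],
}
cluster_ips = {"app-a": "10.96.0.10", "app-b": "10.96.0.20"}
violations = find_non_cluster_ip_upstreams(upstreams, cluster_ips)
```

An empty result means every backend is pointing at its cluster IP; any non-empty result is the condition that should trigger an alert.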
If you have any questions about this error, or would like us to investigate a similar issue, please reach out to our Support team so we can help.