[2022-09-01]: ClickHouse not responding after Terraform deployment

Date

2022-09-01 11:15 CEST

Summary

The events API started returning 500 errors during read/write/delete operations.
Why?
The ClickHouse instance was unresponsive and the Cloud Run service could not connect to it.
Why?
Terraform changes were applied, including adjustments to the instance template, which left the instance unresponsive.
Why?
Potentially a glitch on the GCP side: at the same time, a few Scheduler Jobs were failing even when started manually.
Why?
Recovery went smoothly, without any intervention on our side, once GCP changed the instance.

Authors

Rafał Kuć

Impact

The issue impacted the following functionalities:

Ongoing search history read/write

Detection

Edmundas pointed out in one of the threads that the APIs were returning 500 errors.

Resolution

We investigated the issue and noticed that, around 2 hours after the initial deployment, the virtual machine running ClickHouse changed its IP address, after which everything started working again. No further action was needed.
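A reachability probe along the following lines could help confirm whether the ClickHouse instance is answering during a similar incident. This is a minimal sketch, not the check we used: it assumes the ClickHouse HTTP interface is exposed on its default port 8123 and relies on the `/ping` endpoint, which replies with `Ok.`. The function name `clickhouse_is_alive` and the host passed to it are hypothetical.

```python
import http.client
import urllib.request


def clickhouse_is_alive(host: str, port: int = 8123, timeout: float = 2.0) -> bool:
    """Return True if the ClickHouse HTTP interface answers /ping with 'Ok.'.

    Assumption: the HTTP interface is enabled and reachable on `port`
    (8123 is the ClickHouse default). Any connection error, timeout,
    or non-200 response is treated as "not alive".
    """
    url = f"http://{host}:{port}/ping"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace").strip()
            return response.status == 200 and body == "Ok."
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, refused connections and timeouts are all OSError subclasses.
        return False
```

Calling `clickhouse_is_alive("10.0.0.5")` (a placeholder address) from a host in the same network as the Cloud Run connector would show whether the failure is at the instance or further up the stack.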

Timeline

The timeline of events related to the issue:

Time	Description
2022-09-01 11:15:00	Terraform Apply was accepted and applied
2022-09-01 11:30:00	Issues with the API endpoints were reported
2022-09-01 11:35:00	Investigation started
2022-09-01 11:58:00	The outage was reported in the `#-outages` Slack channel
2022-09-01 13:30:00	The instance IP changed and the issues were resolved
2022-09-01 13:31:00	The issue was reported as resolved

Action Items (optional)

Action items:

#4179

Lessons Learned (optional)

We need more visibility so that we can react faster to 500 errors coming from the API. In #4179 we will add alerts that notify us of such errors, so we no longer have to check the metrics and logs manually.
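The kind of alert condition described above can be sketched as a simple error-rate check over a window of recent responses. This is only an illustration of the idea, not what #4179 implements: the function names, the 5% threshold, and the minimum request count are all assumptions chosen for the example.

```python
from typing import Iterable


def server_error_rate(status_codes: Iterable[int]) -> float:
    """Fraction of responses in the window with a 5xx status code."""
    codes = list(status_codes)
    if not codes:
        return 0.0
    errors = sum(1 for code in codes if 500 <= code <= 599)
    return errors / len(codes)


def should_alert(
    status_codes: Iterable[int],
    threshold: float = 0.05,   # assumed: alert when 5% or more of requests fail
    min_requests: int = 20,    # assumed: ignore windows with too little traffic
) -> bool:
    """Decide whether the window of status codes warrants an alert."""
    codes = list(status_codes)
    if len(codes) < min_requests:
        return False
    return server_error_rate(codes) >= threshold
```

The `min_requests` guard is there so that a single failed request during a quiet period does not page anyone; in this incident the API returned 500 errors on every read/write/delete, so any reasonable threshold would have fired.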
