[2022-09-01]: ClickHouse not responding after Terraform deployment
Date
2022-09-01 11:15 CEST
Summary
- The events API started returning 500 errors during read/write/delete operations.
- Why?
- The ClickHouse instance was unresponsive and the Cloud Run service could not connect to it.
- Why?
- Terraform changes were applied, including adjustments to the instance template of the unresponsive instance.
- Why?
- Possibly a glitch on the GCP side: at the same time, a few Scheduler Jobs were failing even when started manually.
- Why?
- Recovery went smoothly, without any intervention on our side, once GCP changed the instance.
Authors
Rafał Kuć
Impact
The issue impacted the following functionalities:
- Ongoing search history read/write
Detection
Edmundas brought up in one of the threads that the APIs were returning 500 errors.
Resolution
We investigated the issue and noticed that, about 2 hours after the initial deployment, the virtual machine on which ClickHouse runs changed its IP address, after which everything started working again. No further action was needed on our side.
Timeline
The timeline of the events related to the issue, in table form:
Time (CEST) | Description |
---|---|
2022-09-01 11:15:00 | Terraform apply is accepted and applied |
2022-09-01 11:30:00 | Issues with the API endpoints are reported |
2022-09-01 11:35:00 | Investigation starts |
2022-09-01 11:58:00 | The outage is reported in the #-outages Slack channel |
2022-09-01 13:30:00 | The instance IP changes and the issues are resolved |
2022-09-01 13:31:00 | The issue is reported as resolved |
Action Items (optional)
Action items:
Lessons Learned (optional)
We need more visibility so that we can react faster to 500 errors coming from the API. In #4179 we will add alerts that notify us, so that we no longer have to check the metrics and logs manually.
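
As a rough illustration of the kind of alert meant here, the sketch below checks the share of 5xx responses in a sliding time window. It is not the implementation planned in #4179: the class name `ErrorRateAlert`, the window length, the threshold, and the minimum request count are all hypothetical values chosen for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ErrorRateAlert:
    """Hypothetical sliding-window check for HTTP 5xx responses."""

    window_seconds: float = 300.0  # assumed: look at the last 5 minutes
    threshold: float = 0.05        # assumed: alert above 5% server errors
    min_requests: int = 20         # assumed: ignore windows with little traffic
    _events: deque = field(default_factory=deque)  # (timestamp, is_error) pairs

    def record(self, timestamp: float, status_code: int) -> None:
        """Store one response and drop entries older than the window."""
        self._events.append((timestamp, 500 <= status_code <= 599))
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def should_alert(self) -> bool:
        """Return True when the 5xx share in the window exceeds the threshold."""
        total = len(self._events)
        if total < self.min_requests:
            return False
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total > self.threshold
```

In practice such a rule would more likely live in the monitoring system as an alerting policy on the API's response-code metric than in application code; the sketch only shows the logic a policy of this kind would encode.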