[2023-03-22]: Issues with distributed tables in ClickHouse
Date
2023-03-22 18:00 CEST
Summary
- The Project History page was not displayed for the connected repositories.
- Why?
- Because ClickHouse was not returning the correct data.
- Why?
- The ZooKeeper nodes in the virtual file system were not present.
- Why?
- ClickHouse was deleting them.
- Why?
- Because the writes to ZooKeeper were failing.
- Why?
- Because of the number of files ZooKeeper had to process.
- Why?
- Because events were not batched and the event queue was large.
Detailed Summary
The Problem
During the Product Showcase call, we noticed that events were coming in, but the data was not rendered in the Project History and the Code Digests.
The Cause
The failure during the demo was related to ZooKeeper and ClickHouse. In technical terms, ClickHouse was removing the ZooKeeper virtual nodes, which made the inserts fail. We tried to quickly recreate the materialized views during the demo, but ClickHouse kept deleting them, so the same issue kept reappearing.
Problem Description
The problem was related to the disk on ZooKeeper. We think the issue was caused by the disk filling up after the events of Thursday, 16 March, when parsing failures started filling up the ZooKeeper disk. ZooKeeper couldn't process the events and files fast enough, which led to failures and to nodes being removed from its virtual file system. This happened because of a lack of processing resources and potential issues with disk space.
The deleted files were the ZooKeeper log files, which can be found in the /data/version-2 directory in the ZooKeeper container file system.
Authors
Rafał Kuć
Impact
The issue impacted the following functionalities:
- Project History and Code Digests
Detection
The application was not working during the Product Showcase Call.
Resolution
The actions that were taken to resolve the issue include:
- Clearing the ZooKeeper disk
- Replacing the ZooKeeper virtual machine with a more powerful node (from f1-micro to e2-small)
- Replacing both ClickHouse Nodes
- Purging the Pub/Sub queue
- Recreating the views in ClickHouse
- Introducing batching for writes of old events
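
The batching step can be illustrated with a minimal Python sketch. It is not the code that was deployed: the names `batched`, `write_in_batches`, and `insert_batch`, as well as the default batch size of 1000, are hypothetical and chosen only for illustration. The idea is to group queued events so that the store receives a few large inserts instead of one insert per event.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(events: Iterable[T], max_batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most max_batch_size events, preserving order."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    batch: List[T] = []
    for event in events:
        batch.append(event)
        if len(batch) >= max_batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield batch


def write_in_batches(
    events: Iterable[T],
    insert_batch: Callable[[List[T]], None],  # hypothetical: performs one insert per call
    max_batch_size: int = 1000,  # assumed value, tune for the real workload
) -> int:
    """Send events to insert_batch in groups and return the number of inserts made."""
    inserts = 0
    for batch in batched(events, max_batch_size):
        insert_batch(batch)
        inserts += 1
    return inserts
```

With this sketch, 2500 queued events and a batch size of 1000 result in three calls to `insert_batch` instead of 2500 single-event writes.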
Timeline
The timeline of events related to the issue:
Time | Description |
---|---|
2023-03-22 18:00:00 | Issue identified during Product Showcase |
2023-03-22 18:10:00 | ClickHouse materialized views recreated |
2023-03-22 18:15:00 | Problem appeared again |
2023-03-22 19:00:00 | ZooKeeper instance upgraded and restarts started |
2023-03-22 22:30:00 | Majority of the issues dealt with |
2023-03-22 23:30:00 | Issue with old events identified |
2023-03-23 06:00:00 | Additional investigation |
2023-03-23 07:00:00 | ClickHouse materialized views recreated |
2023-03-23 09:00:00 | Investigation with Paweł and Piotr |
2023-03-23 10:00:00 | Queue purging |
2023-03-23 10:00:00 | ClickHouse materialized views recreated once again |
2023-03-23 15:00:00 | Finished investigation |
Lessons Learned
We need more visibility so that we can react quickly to issues related to CPU utilization, disk usage, queue lag, etc. Alerts will be added.
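
As an illustration of the kind of disk usage check such an alert could be built on, here is a minimal Python sketch. It is not the planned alerting setup: the function names and the 80% threshold are assumptions made for this example, and a real deployment would more likely rely on the monitoring stack already in place.

```python
import shutil


def disk_usage_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the file system containing path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def should_alert(used_fraction: float, threshold: float = 0.8) -> bool:
    """Return True when usage reaches the threshold (0.8 is an assumed default)."""
    return used_fraction >= threshold


if __name__ == "__main__":
    # "/data" is assumed here as the ZooKeeper data mount; adjust to the real path.
    fraction = disk_usage_fraction("/data")
    if should_alert(fraction):
        print(f"Disk usage at {fraction:.0%}, above the alert threshold")
```

Keeping the threshold comparison separate from the measurement makes the same check reusable for other metrics, such as queue lag.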