
[2023-03-22]: Issues with distributed tables in ClickHouse

Date

2023-03-22 18:00 CEST

Summary

  • The Project History page was not displayed for connected repositories.
  • Why?
  • Because ClickHouse was not returning the correct data.
  • Why?
  • The ZooKeeper nodes in its virtual file system were missing.
  • Why?
  • ClickHouse was deleting them.
  • Why?
  • Because the writes to ZooKeeper were failing.
  • Why?
  • Because of the number of files ZooKeeper had to process.
  • Why?
  • Because events were not batched and the event queue was large.

Detailed Summary

The Problem

During the Product Showcase call, we noticed that events were coming in, but the data was not rendered in Project History and Code Digests.

The Cause

The failure during the demo was related to ZooKeeper and ClickHouse. In more technical terms, ClickHouse was removing the ZooKeeper virtual nodes, which made the inserts fail. We tried to quickly recreate the materialized views during the demo, but ClickHouse kept deleting them, so the same issue reappeared.

Problem Description

The problem was related to the ZooKeeper disk. We think the issue was caused by the disk filling up after the events of Thursday, 16 March, when parsing failures started to fill the ZooKeeper disk. ZooKeeper could not process the events and files fast enough, so writes failed and nodes were removed from its virtual file system. This happened because of a lack of processing resources and potential issues with disk space.

The deleted files were the ZooKeeper log files, which are located in the /data/version-2 directory of the ZooKeeper container file system.
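
To see how much space the files in that directory occupy, a small script can total them by type. This is only a sketch: it assumes the standard ZooKeeper naming scheme, in which transaction logs are named log.<zxid> and snapshots snapshot.<zxid>, and the function name zookeeper_data_size is hypothetical. The path /data/version-2 is the one mentioned above and has to be readable from wherever the script runs.

```python
import os
from typing import Dict


def zookeeper_data_size(data_dir: str) -> Dict[str, int]:
    """Return total bytes per file kind found directly in data_dir.

    Assumes ZooKeeper's usual file naming: "log.<zxid>" for
    transaction logs and "snapshot.<zxid>" for snapshots.
    """
    totals = {"log": 0, "snapshot": 0, "other": 0}
    for entry in os.scandir(data_dir):
        if not entry.is_file():
            continue  # skip subdirectories
        if entry.name.startswith("log."):
            kind = "log"
        elif entry.name.startswith("snapshot."):
            kind = "snapshot"
        else:
            kind = "other"
        totals[kind] += entry.stat().st_size
    return totals


if __name__ == "__main__":
    path = "/data/version-2"  # directory named in this report
    if os.path.isdir(path):
        for kind, size in zookeeper_data_size(path).items():
            print(f"{kind}: {size / 1024 / 1024:.1f} MiB")
```

Running this periodically, or before and after a cleanup, gives a rough picture of whether transaction logs or snapshots are responsible for the disk growth.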

Authors

Rafał Kuć

Impact

The issue impacted the following functionalities:

  • Project History and Code Digests

Detection

The issue was noticed when the application did not work during the Product Showcase call.

Resolution

The actions that were taken to resolve the issue include:

  • Clearing the ZooKeeper disk
  • Replacing the ZooKeeper virtual machine with a more powerful one, moving from f1-micro to e2-small
  • Replacing both ClickHouse nodes
  • Purging the Pub/Sub queue
  • Recreating the views in ClickHouse
  • Introducing batching for writing old events
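
The batching mentioned in the last item can be sketched as follows. This is a minimal illustration, not the code that was deployed: the names batched and write_in_batches, the insert_batch callback, and the batch size of 1000 are all assumptions. The idea is that writing many events per insert reduces the number of separate writes, and therefore the coordination work that ends up in ZooKeeper.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(events: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most batch_size events, preserving order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[T] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


def write_in_batches(
    events: Iterable[T],
    insert_batch: Callable[[List[T]], None],
    batch_size: int = 1000,  # hypothetical default, tune for the workload
) -> int:
    """Call insert_batch once per batch and return the number of calls."""
    calls = 0
    for batch in batched(events, batch_size):
        insert_batch(batch)  # e.g. one INSERT with many rows
        calls += 1
    return calls
```

With this helper, 2500 queued events and a batch size of 1000 result in three insert calls instead of 2500 single-row writes. In a real consumer, insert_batch would wrap whatever client performs the ClickHouse insert.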

Timeline

The timeline of the events related to the issue:

Time                 Description
2023-03-22 18:00:00  Issue identified during Product Showcase
2023-03-22 18:10:00  ClickHouse materialized views recreated
2023-03-22 18:15:00  Problem appeared again
2023-03-22 19:00:00  ZooKeeper instance upgraded and restarts started
2023-03-22 22:30:00  Majority of the issues dealt with
2023-03-22 23:30:00  Issue with old events identified
2023-03-23 06:00:00  Additional investigation
2023-03-23 07:00:00  ClickHouse materialized views recreated
2023-03-23 09:00:00  Investigation with Paweł and Piotr
2023-03-23 10:00:00  Queue purging
2023-03-23 10:00:00  ClickHouse materialized views recreated once again
2023-03-23 15:00:00  Finished investigation

Lessons Learned

We need more visibility so that we can react quickly to issues related to CPU utilization, disk usage, queue lag, and similar signals. Alerts will be added.
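
As an illustration of the simplest form such an alert could take, the sketch below checks how full a disk is and compares it with a threshold. It is not the alerting setup that will be used: the function names and the 80% threshold are assumptions, and a production setup would more likely rely on a monitoring system than on a standalone script.

```python
import shutil


def disk_usage_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the disk holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def should_alert(used_fraction: float, threshold: float = 0.8) -> bool:
    """Return True when usage is at or above the (hypothetical) threshold."""
    return used_fraction >= threshold


if __name__ == "__main__":
    fraction = disk_usage_fraction("/")
    if should_alert(fraction):
        print(f"ALERT: disk is {fraction:.0%} full")
    else:
        print(f"OK: disk is {fraction:.0%} full")
```

The same threshold pattern applies to the other signals listed above, such as CPU utilization or queue lag, once a function that reads the current value is available.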