2024-09-03: ETL Jobs filled the analyticsdb

Date

2024-09-07 09:00 WEST

Summary

AnalyticsDB started to fill up with a huge amount of data. Went from 10GB used to 760GB in 4h.
Why ?
ETL SQL batch jobs seem to produce too many results
Why ?
A SQL query was producing too many rows

Authors

@zaibon

Impact

The information about the impact of the issue including:

affected infrastructure elements
- analyticsdb disk size grew from 10GB to 670GB
affected product features
- All webapp feature depends on the data produced by the ETL pipelines.
affected users
- This incident was limited to QA environment. No customers were affected.

Root Cause

The cause is 2 folds:

The SQL used to create the repo_commit_digest data-product had a bug that made it generate multiples millions of rows instead of thousands or rows.
The scheduling system used (dataflow + cron scheduler) to run those queries did not allow to ensure only one instance of this query ever run at the same time.

The combination of these two fact created a cascade effect on the database forcing it to create too many temporary files that eventually filled all the disks of the database instance.

Trigger

Some new version of the ETL pipeline were deployed on Friday 06 September 2024. The incident started to appear nearly immediately after all the ETL jobs were started. Around 19:00 WEST on the 06 September 2024.

Detection

On 07 September 2024 06:30 WEST @prog8 detected a huge rise in the size of the analyticsDB.

Screenshot of the database system graphs at the time of the incident. graphs_1 graphs_2 graphs_3 graphs_4 graphs_5 graphs_6 graphs_7

Resolution

The final resolution resolution is still in progress.

Extracting all the SQL queries running in dataflow out.
Building a new tool that allows us to execute SQL queries simply

Timeline

The timeline of the events related to the issue in form of the table:

2024-09-7 9:00AM WEST: @zaibon increase the disk size of the database to 670GB
9h33: max size and size increase to 670, connection possible again

Time	Description
2022-09-07 06:30:00 WEST	@prog detect the size of the database increased out of control
2022-09-07 06:43:00 WEST	@prog stops all the ETL jobs writing to the database
2022-09-07 09:00:00 WEST	@zaibon increase the database disk size of the database to 700GB to allow the database to recover
2022-09-07 09:33:00 WEST	database has increased disk size and the connection to the database is possible again

Action Items (optional)

The follow-up items for the issue with GitHub issues linked.

Retrospective (optional)

The section on the retrospective related to the issue.

Lessons Learned (optional)

Do not push code on Friday and before the author goes on holiday
We need better testing practices before merging code

What went well (optional)

We got slack alert of the increase in size of the database
Team managed to stop the increase of the database size in a timely manner

Date​

Summary​

Authors​

Impact​

Root Cause​

Trigger​

Detection​

Resolution​

Timeline​

Action Items (optional)​

Retrospective (optional)​

Lessons Learned (optional)​

What went well (optional)​