[2024-04-26]: Dataflow Watermark Issue
Date
2024-04-09 - we experience the first lag
2024-04-09 - we start rolling back the code changes
2024-04-11 - we continue experiencing the lag in random pipelines
2024-04-24 - we force upgrade grpcio to the unaffected 1.62.2 version and the issues stop
Summary
- We started observing issues with certain data products not being created in a timely manner and lag alerts being produced.
- Why?
- There was lag in random Dataflow pipelines in QA and Production environments.
- Why?
- The watermark was not progressing properly.
- Why?
- The SDK harness stops responding, causing the SDK worker to be considered unresponsive and shut down.
- Why?
- There is a bug in Apache Beam when using the affected grpcio library version.
Authors
Rafał Kuć
Impact
The impact of the issue included:
- Affected Infrastructure Elements: Dataflow pipelines.
- Affected Product Features: Digests, reports, SCM comments, user activity, dashboards.
- Affected Users: Users of the affected product features.
Resolution
- Force upgrade of grpcio to the unaffected 1.62.2 version in #11655.
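A guard along these lines could help keep the pin from silently regressing, for example when run as a startup or CI check. It is only a minimal sketch: the helper names are hypothetical, and treating every version below 1.62.2 as suspect is an assumption, since this document only states that 1.62.2 is unaffected and does not list the affected range.

```python
from importlib import metadata

# From this postmortem: 1.62.2 is the version after which the lag stopped.
# Assumption: anything older is treated as suspect (the exact affected
# range is not stated in this document).
MIN_SAFE_GRPCIO = (1, 62, 2)


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '1.62.2' into (1, 62, 2), dropping suffixes such as 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def is_grpcio_version_ok(version: str) -> bool:
    """Return True when the version is at or above the known-good release."""
    return parse_version(version) >= MIN_SAFE_GRPCIO


def check_installed_grpcio() -> None:
    """Raise if the installed grpcio is missing or older than the known-good release."""
    try:
        installed = metadata.version("grpcio")
    except metadata.PackageNotFoundError as exc:
        raise RuntimeError("grpcio is not installed") from exc
    if not is_grpcio_version_ok(installed):
        raise RuntimeError(
            f"grpcio {installed} is older than the known-good 1.62.2"
        )
```

Calling `check_installed_grpcio()` before launching a pipeline would fail fast instead of letting a stale dependency resolution reintroduce the lag.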
Timeline
The timeline of the major events related to the issue in the form of a table:
Time | Description |
---|---|
2024-04-09 | We experience the first issue with the watermark progression in the pipelines |
2024-04-09 | We add a reshuffle to support further investigation |
2024-04-09 | We remove session windows from data products |
2024-04-10 | A dedicated (#etl-issues-firefight) Slack channel is created to help diagnose the issue |
2024-04-10 | We roll back to the code state before introducing backfilling |
2024-04-10 | We downgrade Apache Beam to 2.54.0 |
2024-04-10 | We move code back to pipeline split and bring back old service accounts |
2024-04-10 | We bring back recent ETL changes |
2024-04-15 | We update dependencies - pydantic, pytest and black |
2024-04-16 | We test running the jobs on the default network |
2024-04-17 | We add bigger disks and harness cache to Dataflow workers |
2024-04-17 | We add micro buffering |
2024-04-18 | We change the logic to test an always-emit strategy |
2024-04-18 | We disable Dataflow Prime |
2024-04-24 | Piotr finds the grpcio bug on the Apache Beam issues list |
2024-04-24 | We force upgrade grpcio to the unaffected 1.62.2 version |
2024-04-26 | We observe Dataflow QA pipelines working without issues for 48 hours |
Action Items (optional)
- Upgrade to Apache Beam 2.56.0 once available.
Lessons Learned (optional)
- We need to test new Apache Beam versions for longer.
- We need to create a production release before upgrading to a newer Apache Beam version.
- We should reach out to support and to the Beam Community earlier with the issue description.
- We should monitor Apache Beam issues with the Python and Dataflow tags, alongside the release notes and known issues.
What went well (optional)
- Team efforts related to root cause analysis investigation.
- Correlation of the SDK-related errors with the lag appearing and stopping.
- We also challenged our code changes early by reverting the latest big changes.
What went wrong (optional)
- The investigation was long due to the randomness of the watermark issue.
- The team should have reached out to the Apache Beam issues list earlier.