
[2024-04-26]: Dataflow Watermark Issue

Date

  • 2024-04-09 - we experience the first lag
  • 2024-04-09 - we start rolling back the code changes
  • 2024-04-11 - we continue experiencing the lag in random pipelines
  • 2024-04-24 - we force upgrade grpcio to the unaffected 1.62.2 version and the issues stop

Summary

  • We started observing issues with certain data products not being created in a timely manner and lag alerts being produced.
  • Why?
  • There was lag in random Dataflow pipelines in QA and Production environments.
  • Why?
  • The watermark was not progressing properly.
  • Why?
  • The SDK harness stopped responding, so the SDK worker was considered unresponsive and was shut down.
  • Why?
  • Apache Beam has a bug that is triggered when it runs with the affected grpcio library version.
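
The "watermark was not progressing" symptom from the chain above can be illustrated with a minimal sketch. It assumes watermark readings are sampled periodically as (observation time, watermark) pairs. The function names and the 15-minute threshold are hypothetical and are not part of the Dataflow or Apache Beam APIs.

```python
from datetime import datetime, timedelta, timezone


def watermark_lag(watermark: datetime, now: datetime) -> timedelta:
    """Lag is how far the watermark trails behind the wall clock."""
    return now - watermark


def is_stalled(samples, min_stall: timedelta) -> bool:
    """Return True if the watermark has not advanced for at least min_stall.

    samples: list of (observed_at, watermark) pairs ordered by observed_at.
    """
    if not samples:
        return False
    last_seen, last_watermark = samples[-1]
    stalled_since = last_seen
    # Walk backwards while the watermark is the same as in the latest sample.
    for observed_at, watermark in reversed(samples):
        if watermark < last_watermark:
            break
        stalled_since = observed_at
    return last_seen - stalled_since >= min_stall


t0 = datetime(2024, 4, 9, 12, 0, tzinfo=timezone.utc)
step = timedelta(minutes=10)

# Watermark frozen at t0 while the wall clock moves on (hypothetical data).
frozen = [(t0, t0), (t0 + step, t0), (t0 + 2 * step, t0)]
# Watermark trailing the wall clock by a constant minute (healthy pipeline).
healthy = [(t0 + i * step, t0 + i * step - timedelta(minutes=1)) for i in range(3)]

print(is_stalled(frozen, timedelta(minutes=15)))
print(is_stalled(healthy, timedelta(minutes=15)))
```

A stalled watermark shows up as lag that grows at wall-clock speed, which is why the problem surfaced as lag alerts before the SDK harness was identified as the cause.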

Authors

Rafał Kuć

Impact

The impact of the issue:

  • Affected Infrastructure Elements: Dataflow pipelines.
  • Affected Product Features: Digests, reports, SCM comments, user activity, dashboards.
  • Affected Users: Users of the affected product features.

Resolution

  • We force upgraded grpcio to the unaffected 1.62.2 version in #11655.
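
A startup guard along the following lines could help keep the pin from regressing. This is a minimal sketch, not the change made in #11655. It assumes that any grpcio version below 1.62.2 should be treated as suspect, which is a simplification because this report does not list the exact affected range. All function names are hypothetical.

```python
from importlib import metadata

# Assumption: 1.62.2 is the first version treated as safe (see Resolution).
MIN_SAFE_GRPCIO = (1, 62, 2)


def parse_version(version: str) -> tuple:
    """Turn '1.62.2' into (1, 62, 2), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as '1.60' to three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_safe_grpcio(version: str, minimum: tuple = MIN_SAFE_GRPCIO) -> bool:
    """Return True if the version is at or above the assumed safe minimum."""
    return parse_version(version) >= minimum


def check_installed_grpcio() -> None:
    """Raise if the installed grpcio is older than the assumed safe minimum."""
    try:
        installed = metadata.version("grpcio")
    except metadata.PackageNotFoundError:
        return  # grpcio is not installed in this environment; nothing to check
    if not is_safe_grpcio(installed):
        raise RuntimeError(f"grpcio {installed} is older than 1.62.2")


print(is_safe_grpcio("1.62.1"), is_safe_grpcio("1.62.2"))
```

Calling `check_installed_grpcio()` at pipeline launch would turn a silently stalling pipeline into an immediate failure, at the cost of blocking launches whenever the dependency drifts.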

Timeline

The timeline of the major events related to the issue:

Time | Description
2024-04-09 | We experience the first issue with the watermark progression in the pipelines
2024-04-09 | We add a reshuffle to support further investigation
2024-04-09 | We remove session windows from data products
2024-04-10 | A dedicated Slack channel (#etl-issues-firefight) is created to help diagnose the issue
2024-04-10 | We roll back to the code state before introducing backfilling
2024-04-10 | We downgrade Apache Beam to 2.54.0
2024-04-10 | We move code back to pipeline split and bring back old service accounts
2024-04-10 | We bring back recent ETL changes
2024-04-15 | We update dependencies: pydantic, pytest and black
2024-04-16 | We test running the jobs on the default network
2024-04-17 | We add bigger disks and harness cache to Dataflow workers
2024-04-17 | We add micro buffering
2024-04-18 | We change the logic to test an always-emit strategy
2024-04-18 | We disable Dataflow Prime
2024-04-24 | Piotr finds the grpcio bug on the Apache Beam issues list
2024-04-24 | We force upgrade grpcio to the unaffected 1.62.2 version
2024-04-26 | We observe Dataflow QA pipelines working without issues for 48 hours

Action Items (optional)

  • Upgrade to Apache Beam 2.56.0 once available.

Lessons Learned (optional)

  • We should test new Apache Beam versions for longer before rolling them out.
  • We need to create a production release before upgrading to a newer Apache Beam version.
  • We should reach out to support and to the Beam Community earlier with the issue description.
  • We should monitor Apache Beam issues tagged Python and Dataflow, in addition to the release notes and known issues.

What went well (optional)

  • The team worked together on the root cause analysis.
  • We correlated the SDK errors with the lag appearing and stopping.
  • We challenged our own code changes early by reverting the last big changes.

What went wrong (optional)

  • The investigation was long because the watermark issue appeared at random.
  • The team should have reached out to the Apache Beam issues list earlier.