[2023-11-21]: Dataflow Pipeline Scale-Up Issue
Date
2023-11-14 20:45 CEST - issue started
2023-11-14 21:00 CEST - jobs killed
2023-11-21 12:00 CEST - fix deployed
Summary
- Two pipelines, user commit digest and user active branch, scaled up (to 100 nodes) in the qa environment.
- Why?
- System lag for these pipelines grew indefinitely.
- Why?
- Recent updates on the GCP side (old code versions also showed the same issue) combined with our own code changes broke our error handling, and the pipelines got stuck.
- Why?
- It seems related to newly used parts of the Beam API (a new trigger type), plus missing type hints that were not passed properly through the error handling and a few other transforms.
- Why?
- We had not yet had time to implement type hints properly, and we do not have integration tests yet. The failing error handling derailed the investigation, as the failure is most likely caused by a change on the Dataflow side. Using the new trigger on Dataflow failed with a non-obvious, unrelated message, which also looks like a bug on the Dataflow side. An illustrative type-hint sketch follows this list.
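As an illustration of the type-hint gap mentioned above, the sketch below shows how explicit output type hints can be attached to a Beam Python step. The pipeline, step names and field names are made up for this example and are not the actual user commit digest code.

```python
import apache_beam as beam
from typing import Dict, Tuple

# Minimal sketch (assumed names, not the real pipeline): explicit type hints
# keep Beam's inferred types from collapsing to Any, so downstream steps such
# as an error-handling branch see concrete key/value types.
with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "CreateEvents" >> beam.Create([{"user_id": "u1", "sha": "abc123"}])
        | "KeyByUser"
        >> beam.Map(lambda event: (event["user_id"], event)).with_output_types(
            Tuple[str, Dict[str, str]]
        )
        | "CountPerUser" >> beam.combiners.Count.PerKey()
        | "Print" >> beam.Map(print)
    )
```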
Authors
Piotr Wiśniowski, Rafał Kuć
Impact
- Affected Infrastructure Elements: Dataflow pipelines for user commit digest and user active branch.
- Affected Product Features: Data processing for user commit digest and user active branch.
- Affected Users: No direct impact on end-users, but internal monitoring and downstream processes were affected.
Resolution
- Manually killed the jobs to avoid costs.
- Investigated whether reverting the recent changes would fix the issue.
- Investigated turning off error handling.
- Added a maximum number of workers per environment in Terraform and the template.py file (see the sketch after this list).
- Investigated adding more type hints and refactoring the job graph.
- Analyzed the latest code changes, which showed that a new trigger was used that we had not previously used on Dataflow.
- Tested and merged the fix to qa. The fix changes the trigger to a different one and will result in slightly increased CPU usage (an illustrative trigger sketch is included at the end of Additional Information).
- Detailed monitoring.
- Future: track metrics for vertical scaling of Dataflow.
- Future: set up a cost limits policy.
- Future: consider integration tests for the parts of the Beam API we use.
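A minimal sketch of the worker cap, assuming pipeline options are built in Python roughly like this; the environment names and limits below are examples, not the values actually set in Terraform and template.py.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Example values only; the real per-environment caps live in Terraform and
# template.py. Capping max_num_workers keeps a stuck job from scaling out
# to every available node again.
MAX_WORKERS = {"qa": 4, "prod": 16}

def build_pipeline_options(env: str) -> PipelineOptions:
    return PipelineOptions(
        runner="DataflowRunner",
        max_num_workers=MAX_WORKERS[env],
        autoscaling_algorithm="THROUGHPUT_BASED",
    )
```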
Timeline
| Time | Description |
|---|---|
| 2023-11-14 20:15:00 | New deployments with the changes |
| 2023-11-14 20:30:00 | Alerts about system lag fired |
| 2023-11-14 21:05:00 | Jobs started to scale up |
| 2023-11-14 21:15:00 | Jobs were killed manually |
| 2023-11-15 12:00:00 | Investigation started |
| 2023-11-24 11:00:00 | Fix deployed |
Action Items
- Track metrics for vertical scaling of Dataflow.
- Set up a cost limits policy.
- Plan and implement integration tests for the parts of the Beam API we use.
Retrospective
Lessons Learned
- We should test earlier whether past code versions still work; this could put the investigation on the right track faster.
- We should not assume that everything in the Beam API works on Dataflow.
What went well
- Alerting and quick action on our side to limit costs.
What went wrong
- The investigation was long because there was little to no error information.
- Old code versions also stopped working, which was a surprise; we could have tested this earlier.
Additional Information
Raw detailed events logged during the investigation:
- A few PRs with refactors of user commit digest and user active branch landed. This included changing the coders for keys from plain tuples of strings to the Row class, so that fields have names (a key-coder sketch is included at the end of this section), and changing the windowing strategy for a few steps to ensure consistency in case of duplicated or unordered events.
- The running pipelines were killed, as the deployed steps were not compatible with the previously running pipelines.
- The deployment succeeded and the pipelines came back up.
- The two pipelines started running and managed to begin processing data, but output data freshness started to grow.
- Alerts on ETL system lag fired and we started monitoring the pipelines closely.
- After looking at the logs, a warning was found about the coders used for keys. We tried a few other options (NamedTuple and pydantic BaseModel), but in the end only a plain raw tuple did not trigger this warning.
- After 30-60 min of the pipelines running, Dataflow started to spawn all possible instances.
- We manually killed the jobs to avoid costs.
- Some errors were also noticed on the downstream repo-branch-digest, repo-release-digest and repo-notable-event-stats pipelines, most likely related to a bad coder used for the payload.
- Pipelines with the changed coders were deployed (and the previous ones killed, as changing a coder is a backward-incompatible change).
- The downstream repo-branch-digest, repo-release-digest and repo-notable-event-stats pipelines started to operate normally.
- The same behavior of growing system lag was seen for user commit digest and user active branch, but with no warning about the coder.
- 40 min after deployment, the user-commit-digest pipeline started spawning all possible instances.
- This was late in the evening and I did not have permissions on GCP to kill jobs, so I needed to notify Rafał even though he was offline.
- Rafał got back to work and we again killed the job manually.
- Meanwhile we found some errors in the worker logs of these jobs but were unable to understand them (no verbose message). Example: https://console.cloud.google.com/logs/query;query=resource.type%3D%22dataflow_step%22%0Aresource.labels.job_id%3D%222023-11-14_12_27_37-8685115990442826858%22%0AlogName%3D%2528%22projects%2Farchipelo-qa%2Flogs%2Fdataflow.googleapis.com%252Fworker%22%20OR%20%22projects%2Farchipelo-qa%2Flogs%2Fdataflow.googleapis.com%252Fworker-startup%22%2529;cursorTimestamp=2023-11-14T20:38:13.434135735Z;aroundTime=2023-11-14T20:27:37.685Z;duration=PT15M?project=archipelo-qa
- We decided to park the problem until the next day and keep user commit digest and user active branch down until this is investigated and fixed.
- The total cost of these two pipelines for 14 Nov is ~$33, compared to ~$8 per day during normal operation. The two pipelines with all nodes up were running for ~1h in total.
- Investigation indicated that the errors are caused by the Firestore save (batch step) and result in the system latency.
- Investigation showed that with error handling turned off, the pipeline works normally (a dead-letter sketch of such error handling is included at the end of this section).
- Old versions of the pipelines (about a month old), with both new and old dependencies, also suffer from the same problem.
- We also checked whether windowing had any effect on the error handling or this error; there was no correlation.
- We decided to:
  - Turn off error handling for now, as it is the only way to have the pipelines up and running.
  - Add a maximum number of workers per environment in Terraform and the template.py file.
  - Also track metrics for vertical scaling of Dataflow.
  - Set up a cost limits policy.
- We had a discussion about providing emergency credentials, but decided that adding a maximum number of workers is enough for now.
- It seems that Dataflow had some changes in the meantime that affected how our error handling works.
- A PR was created turning off error handling and adding the maximum number of workers.
- More testing resulted in finding another issue with user-commit-digest.
- Working on adding more type hints for the pipeline resulted in the errors no longer showing up.
- Turning off error handling, adding a few type hints and improving the job graph by using more nesting helped, and the job is no longer stuck.
- Merging the fix.
- More investigation using old code with error handling turned off showed that the lag issue was introduced in 8bb143aba2e8db1675882cf7265dcdef572cf9f0.
- Testing the fix of changing the trigger (an illustrative trigger sketch follows below).
- Merging the fix.
- Deployment.
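Key-coder sketch referenced in the events above. The field names are made up for illustration; the point is the shape of the key. Grouping steps need a deterministic key coder, and plain string tuples are handled by Beam's standard coders, while the Row/NamedTuple/pydantic variants produced the key-coder warning in our runs.

```python
import apache_beam as beam
from typing import Dict, Tuple

def key_as_tuple(event: Dict[str, str]) -> Tuple[Tuple[str, str], Dict[str, str]]:
    # Plain tuple-of-strings key: the only variant that did not log the
    # key-coder warning for us.
    return ((event["user_id"], event["repo"]), event)

def key_as_row(event: Dict[str, str]):
    # Named fields are more readable, but as a grouping key this shape
    # triggered the coder warning in our pipelines.
    return (beam.Row(user_id=event["user_id"], repo=event["repo"]), event)
```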
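Dead-letter sketch referenced in the events above. This is the generic Beam pattern with made-up names (SaveWithDeadLetter, a stubbed write call), not the exact error handling we turned off; it only shows the shape of routing failing elements to a side output instead of letting them block the step.

```python
import apache_beam as beam
from apache_beam import pvalue

def save_to_firestore(element):
    # Stub so the sketch is self-contained; the real step does batch writes.
    pass

class SaveWithDeadLetter(beam.DoFn):
    """Writes an element and routes failures to a dead-letter output."""

    DEAD_LETTER = "dead_letter"

    def process(self, element):
        try:
            save_to_firestore(element)
            yield element
        except Exception as exc:  # route the failure instead of re-raising
            yield pvalue.TaggedOutput(self.DEAD_LETTER, (element, repr(exc)))

# Usage: split the output into saved elements and dead-lettered failures.
# results = batch | beam.ParDo(SaveWithDeadLetter()).with_outputs(
#     SaveWithDeadLetter.DEAD_LETTER, main="saved")
# saved, failed = results.saved, results[SaveWithDeadLetter.DEAD_LETTER]
```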
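Trigger sketch referenced in the events above. The report does not name the trigger that failed on Dataflow or the one it was replaced with, so the trigger and window size below are only illustrative; the point is that the choice is a single argument to WindowInto, and timer-based triggers like this one tend to cost slightly more CPU.

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (
    AccumulationMode,
    AfterProcessingTime,
    Repeatedly,
)

# Illustrative windowing/trigger configuration; not the actual triggers
# involved in the incident.
with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "Create" >> beam.Create([("u1", 1), ("u2", 1)])
        | "Window"
        >> beam.WindowInto(
            window.FixedWindows(60),
            trigger=Repeatedly(AfterProcessingTime(30)),
            accumulation_mode=AccumulationMode.DISCARDING,
        )
        | "CountPerKey" >> beam.combiners.Count.PerKey()
        | "Print" >> beam.Map(print)
    )
```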