[2023-11-21]: Dataflow Pipeline Scale-Up Issue
Date
2023-11-14 20:45 CEST - issue started
2023-11-14 21:00 CEST - jobs killed
2023-11-21 12:00 CEST - fix deployed
Summary
- Two pipelines, user commit digest and user active branch, scaled up (to 100 nodes) in the qa environment.
- Why?
- System lag for these pipelines grew indefinitely.
- Why?
- Recent updates on the GCP side (old code versions also showed the same issue) combined with our own code changes broke our error handling, and the pipelines got stuck.
- Why?
- It seems related to newly used parts of the Beam API (a new trigger type), plus missing type hints that were not passed properly through the error handling and a few other transforms.
- Why?
- We had not yet had time to implement type hints properly, and we do not have integration tests yet. The failing error handling derailed the investigation, as the failure is most likely caused by a change on the Dataflow side. Using the new trigger on Dataflow failed with a non-obvious, unrelated message, which also looks like a bug on the Dataflow side. An illustrative type-hint sketch follows this list.
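As an illustration of the type-hint gap mentioned above, the sketch below shows how explicit output type hints can be attached to a Beam Python step. The pipeline, step names and field names are made up for this example and are not the actual user commit digest code.

```python
import apache_beam as beam
from typing import Dict, Tuple

# Minimal sketch (assumed names, not the real pipeline): explicit type hints
# keep Beam's inferred types from collapsing to Any, so downstream steps such
# as an error-handling branch see concrete key/value types.
with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "CreateEvents" >> beam.Create([{"user_id": "u1", "sha": "abc123"}])
        | "KeyByUser"
        >> beam.Map(lambda event: (event["user_id"], event)).with_output_types(
            Tuple[str, Dict[str, str]]
        )
        | "CountPerUser" >> beam.combiners.Count.PerKey()
        | "Print" >> beam.Map(print)
    )
```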
Authors
Piotr Wiśniowski, Rafał Kuć
Impact
- Affected Infrastructure Elements: Dataflow pipelines for user commit digest and user active branch.
- Affected Product Features: Data processing for user commit digest and user active branch.
- Affected Users: No direct impact on end-users, but internal monitoring and downstream processes were affected.
Resolution
- Manually killed the jobs to avoid costs.
- Investigated whether reverting the recent changes would fix the issue.
- Investigated turning off error handling.
- Added a maximum number of workers per environment in Terraform and the template.py file (see the sketch after this list).
- Investigated adding more type hints and refactoring the job graph.
- Analyzed the latest code changes, which showed that a new trigger was used that we had not previously used on Dataflow.
- Tested and merged the fix to qa. The fix changes the trigger to a different one and will result in slightly increased CPU usage (an illustrative trigger sketch is included at the end of Additional Information).
- Detailed monitoring.
- Future: track metrics for vertical scaling of Dataflow.
- Future: set up a cost limits policy.
- Future: consider integration tests for the parts of the Beam API we use.
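A minimal sketch of the worker cap, assuming pipeline options are built in Python roughly like this; the environment names and limits below are examples, not the values actually set in Terraform and template.py.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Example values only; the real per-environment caps live in Terraform and
# template.py. Capping max_num_workers keeps a stuck job from scaling out
# to every available node again.
MAX_WORKERS = {"qa": 4, "prod": 16}

def build_pipeline_options(env: str) -> PipelineOptions:
    return PipelineOptions(
        runner="DataflowRunner",
        max_num_workers=MAX_WORKERS[env],
        autoscaling_algorithm="THROUGHPUT_BASED",
    )
```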
Timeline
| Time | Description |
|---|---|
| 2023-11-14 20:15:00 | New deployments with the changes |
| 2023-11-14 20:30:00 | Alerts about system lag fired |
| 2023-11-14 21:05:00 | Jobs started to scale up |
| 2023-11-14 21:15:00 | Jobs were killed manually |
| 2023-11-15 12:00:00 | Investigation started |
| 2023-11-24 11:00:00 | Fix deployed |
Action Items
- Track metrics for vertical scaling of Dataflow.
- Set up a cost limits policy.
- Plan and implement integration tests for the parts of the Beam API we use.
Retrospective
Lessons Learned
- We should test earlier whether past code versions still work; this could put the investigation on the right track faster.
- We should not assume that everything in the Beam API works on Dataflow.
What went well
- Alerting and quick action on our side to limit costs.
What went wrong
- The investigation was long because there was little to no error information.
- Old code versions also stopped working, which was a surprise; we could have tested this earlier.
Additional Information
Raw detailed events logged during the investigation:
- A few PRs with refactors of user commit digest and user active branch landed. This included changing the coders for keys from plain tuples of strings to the Row class, so that fields have names (a key-coder sketch is included at the end of this section), and changing the windowing strategy for a few steps to ensure consistency in case of duplicated or unordered events.
- The running pipelines were killed, as the deployed steps were not compatible with the previously running pipelines.
- The deployment succeeded and the pipelines came back up.
- The two pipelines started running and managed to begin processing data, but output data freshness started to grow.
- Alerts on ETL system lag fired and we started monitoring the pipelines closely.
- After looking at the logs, a warning was found about the coders used for keys. We tried a few other options (NamedTuple and pydantic BaseModel), but in the end only a plain raw tuple did not trigger this warning.
- After 30-60 min of the pipelines running, Dataflow started to spawn all possible instances.
- We manually killed the jobs to avoid costs.
- Some errors were also noticed on the downstream repo-branch-digest, repo-release-digest and repo-notable-event-stats pipelines, most likely related to a bad coder used for the payload.
- Pipelines with the changed coders were deployed (and the previous ones killed, as changing a coder is a backward-incompatible change).
- The downstream repo-branch-digest, repo-release-digest and repo-notable-event-stats pipelines started to operate normally.
- The same behavior of growing system lag was seen for user commit digest and user active branch, but with no warning about the coder.
- 40 min after deployment, the user-commit-digest pipeline started spawning all possible instances.
- This was late in the evening and I did not have permissions on GCP to kill jobs, so I needed to notify Rafał even though he was offline.
- Rafał got back to work and we again killed the job manually.
- Meanwhile we found some errors in the worker logs of these jobs but were unable to understand them (no verbose message). Example: https://console.cloud.google.com/logs/query;query=resource.type%3D%22dataflow_step%22%0Aresource.labels.job_id%3D%222023-11-14_12_27_37-8685115990442826858%22%0AlogName%3D%2528%22projects%2Farchipelo-qa%2Flogs%2Fdataflow.googleapis.com%252Fworker%22%20OR%20%22projects%2Farchipelo-qa%2Flogs%2Fdataflow.googleapis.com%252Fworker-startup%22%2529;cursorTimestamp=2023-11-14T20:38:13.434135735Z;aroundTime=2023-11-14T20:27:37.685Z;duration=PT15M?project=archipelo-qa
- We decided to park the problem until the next day and keep user commit digest and user active branch down until this is investigated and fixed.
- The total cost of these two pipelines for 14 Nov is ~$33, compared to ~$8 per day during normal operation. The two pipelines with all nodes up were running for ~1h in total.
- Investigation indicated that the errors are caused by the Firestore save (batch step) and result in the system latency.
- Investigation showed that with error handling turned off, the pipeline works normally (a dead-letter sketch of such error handling is included at the end of this section).
- Old versions of the pipelines (about a month old), with both new and old dependencies, also suffer from the same problem.
- We also checked whether windowing had any effect on the error handling or this error; there was no correlation.
- We decided to:
  - Turn off error handling for now, as it is the only way to have the pipelines up and running.
  - Add a maximum number of workers per environment in Terraform and the template.py file.
  - Also track metrics for vertical scaling of Dataflow.
  - Set up a cost limits policy.
- We had a discussion about providing emergency credentials, but decided that adding a maximum number of workers is enough for now.
- It seems that Dataflow had some changes in the meantime that affected how our error handling works.
- A PR was created turning off error handling and adding the maximum number of workers.
- More testing resulted in finding another issue with user-commit-digest.
- Working on adding more type hints for the pipeline resulted in the errors no longer showing up.
- Turning off error handling, adding a few type hints and improving the job graph by using more nesting helped, and the job is no longer stuck.
- Merging the fix.
- More investigation using old code with error handling turned off showed that the lag issue was introduced in 8bb143aba2e8db1675882cf7265dcdef572cf9f0.
- Testing the fix of changing the trigger (an illustrative trigger sketch follows below).
- Merging the fix.
- Deployment.
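Key-coder sketch referenced in the events above. The field names are made up for illustration; the point is the shape of the key. Grouping steps need a deterministic key coder, and plain string tuples are handled by Beam's standard coders, while the Row/NamedTuple/pydantic variants produced the key-coder warning in our runs.

```python
import apache_beam as beam
from typing import Dict, Tuple

def key_as_tuple(event: Dict[str, str]) -> Tuple[Tuple[str, str], Dict[str, str]]:
    # Plain tuple-of-strings key: the only variant that did not log the
    # key-coder warning for us.
    return ((event["user_id"], event["repo"]), event)

def key_as_row(event: Dict[str, str]):
    # Named fields are more readable, but as a grouping key this shape
    # triggered the coder warning in our pipelines.
    return (beam.Row(user_id=event["user_id"], repo=event["repo"]), event)
```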
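Dead-letter sketch referenced in the events above. This is the generic Beam pattern with made-up names (SaveWithDeadLetter, a stubbed write call), not the exact error handling we turned off; it only shows the shape of routing failing elements to a side output instead of letting them block the step.

```python
import apache_beam as beam
from apache_beam import pvalue

def save_to_firestore(element):
    # Stub so the sketch is self-contained; the real step does batch writes.
    pass

class SaveWithDeadLetter(beam.DoFn):
    """Writes an element and routes failures to a dead-letter output."""

    DEAD_LETTER = "dead_letter"

    def process(self, element):
        try:
            save_to_firestore(element)
            yield element
        except Exception as exc:  # route the failure instead of re-raising
            yield pvalue.TaggedOutput(self.DEAD_LETTER, (element, repr(exc)))

# Usage: split the output into saved elements and dead-lettered failures.
# results = batch | beam.ParDo(SaveWithDeadLetter()).with_outputs(
#     SaveWithDeadLetter.DEAD_LETTER, main="saved")
# saved, failed = results.saved, results[SaveWithDeadLetter.DEAD_LETTER]
```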
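Trigger sketch referenced in the events above. The report does not name the trigger that failed on Dataflow or the one it was replaced with, so the trigger and window size below are only illustrative; the point is that the choice is a single argument to WindowInto, and timer-based triggers like this one tend to cost slightly more CPU.

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (
    AccumulationMode,
    AfterProcessingTime,
    Repeatedly,
)

# Illustrative windowing/trigger configuration; not the actual triggers
# involved in the incident.
with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "Create" >> beam.Create([("u1", 1), ("u2", 1)])
        | "Window"
        >> beam.WindowInto(
            window.FixedWindows(60),
            trigger=Repeatedly(AfterProcessingTime(30)),
            accumulation_mode=AccumulationMode.DISCARDING,
        )
        | "CountPerKey" >> beam.combiners.Count.PerKey()
        | "Print" >> beam.Map(print)
    )
```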