Handling of data between our environments

Timeline

Time                 Description
2022-11-08 19:00:01  Created

Status: Proposed

Context

Currently, we do not have any mechanism or policy to synchronize data between our environments. As a result:

  • The QA environment cannot be used to estimate or test the performance of newly developed code
  • The QA environment does not have recent enough data to run meaningful tests

Moreover, our ML experimentation currently takes place in the Dev environment, which creates several problems:

  • Generated model objects have to be moved from Dev to QA and Prod manually or with custom code before they can be tested and deployed
  • Some experimentation data lives in Production, and we depend on manually downloading it to local computers for analysis. As a result, we lose data lineage and have no way of governing the data
  • The rest of the experimentation data is stored in the gs://archipelo-ml-datasets/ GCS bucket in the Dev environment, where it can easily be tampered with and where it is easy to introduce chaos into the data structure
  • Costly labeled data lives in the Dev environment, where it can easily be tampered with or lost

Decision

  1. Create a new environment variable, DATA_FETCH_FACTOR.

    • this variable limits the amount of data fetched from 3rd parties by data extraction or data collection code (see the sketch after this list)
      • Production would fetch every data row
      • QA would fetch every 5th data row
      • Dev would fetch every 20th data row
    • the variable's purpose is to affect only the data extraction/collection steps in data pipelines that fetch data from 3rd parties in the target architecture
    • all other steps, such as data aggregation/transformation, must not take this variable into account
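    • a minimal Python sketch of how an extraction step could honor this variable, assuming it is an integer row stride (Prod=1, QA=5, Dev=20); the helper and client names below are illustrative, not an agreed implementation:

      import os
      from typing import Iterable, Iterator

      # DATA_FETCH_FACTOR comes from this decision; the integer-stride
      # encoding (Prod=1, QA=5, Dev=20) is an assumption, not a settled format.
      DATA_FETCH_FACTOR = int(os.environ.get("DATA_FETCH_FACTOR", "1"))

      def limit_fetched_rows(rows: Iterable[dict]) -> Iterator[dict]:
          """Yield every DATA_FETCH_FACTOR-th row fetched from a 3rd party.

          Only extraction/collection steps should apply this; aggregation and
          transformation steps must ignore DATA_FETCH_FACTOR entirely.
          """
          for index, row in enumerate(rows):
              if index % DATA_FETCH_FACTOR == 0:
                  yield row

      # Illustrative usage inside a hypothetical extraction step:
      # for row in limit_fetched_rows(third_party_client.fetch_all()):
      #     raw_bucket_writer.write(row)
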
  2. Create a data synchronization tool that can copy/overwrite a portion of the data files from source environment buckets to the same locations in target environment buckets (from Prod to QA and from QA to Dev). The tool should be able to:

    • read all Parquet, JSONL, and JSON files from a specific partition (path with file wildcards) of a bucket
    • deterministically pick a random, configurable portion of the read data based on the value of one or more columns (see the sampling sketch after this list)
    • save the subset of the data by overwriting the same (smaller) files in the target environment
    • the number, location, and names of saved files should stay the same; only the bucket name may change
    • the tool should run on a bi-weekly schedule with a configuration specifying the buckets and paths to sync
    • it should be possible to run the tool via TF as part of a new environment setup, on demand, with a configuration specifying the buckets, paths, and executions (date ranges) to sync
    • other binary files should be copied as they are
    • if a data catalog is integrated, the tool should also add the new partitions' metadata to the target environment's data catalog
    • the tool should also be able to copy a portion of the data from Cloud SQL in the same way as with Parquet files
    • ideally, files that did not change should not be copied
    • sample configuration for each copy action:

      - source: "gs://archipelo-models-prod"
        target: "gs://archipelo-models-qa"
        path: "/**/*.[bin|json]"
        factor: 1.0
      - source: "archipelo-prod:us-central1:ardb-prod-912e743a"
        target: "archipelo-qa:us-central1:ardb-qa-f4b82898"
        path: "golang_package"
        factor: 0.2
        key: "id"
  3. Set up proper ML infrastructure in the Production environment to enable ML experimentation, since Data Scientists should be users of production data.

    • create a dedicated bucket for ML experimentation/analysis. Note: in the future, files older than one month will be cleaned from this bucket for legal reasons (GDPR).
    • create a new data science Google group that collects all internal users interested in our data.
    • grant read/write/list permissions on this bucket to all accounts in the data science Google group (see the permissions sketch after this list).
    • grant read permissions on all other data-related buckets (especially ETL buckets) to the data science Google group.
    • Vertex AI permissions?
    • JupyterLab?
    • integrate Doccano (our labeling tool) with the ETL process so that labeled data automatically lands in the ETL production buckets (not in the ML experimentation bucket)
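    • a sketch of how the bucket permissions could be granted with the google-cloud-storage client; the bucket and group names are placeholders, and the roles shown (objectAdmin on the experimentation bucket, objectViewer on read-only buckets) are an assumption to confirm:

      from google.cloud import storage

      def grant_group_role(bucket_name: str, group_email: str, role: str) -> None:
          """Add an IAM binding for a Google group on one bucket."""
          client = storage.Client()
          bucket = client.bucket(bucket_name)
          policy = bucket.get_iam_policy(requested_policy_version=3)
          policy.bindings.append({"role": role, "members": {f"group:{group_email}"}})
          bucket.set_iam_policy(policy)

      # Placeholder bucket and group names; the real ones come out of this setup.
      grant_group_role("archipelo-ml-experimentation", "data-science@archipelo.com",
                       "roles/storage.objectAdmin")   # read/write/list on the experimentation bucket
      grant_group_role("archipelo-etl-prod", "data-science@archipelo.com",
                       "roles/storage.objectViewer")  # read-only on other data buckets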

Consequences

  • Data management will be easier
  • Data will stay synchronized between the environments
  • Performance estimations on the QA environment will be possible
  • Processing/storage costs will be kept under control
  • ML experimentation and MLOps activities will be easier
  • No production data will be lost