Handling of data between our environments

Timeline

Time                 Description
2022-11-08 19:00:01  Created

Status: Proposed

Context

Currently, we do not have any mechanism or policy to synchronize data between our environments. As a result:

  • The QA environment cannot be used to estimate or test the performance of newly developed code
  • The QA environment does not have recent enough data to run meaningful tests

Moreover, our ML experimentation currently takes place in the Dev environment, which creates several problems:

  • Generated model objects have to be moved from Dev to QA and Prod manually or with custom code before they can be tested and deployed
  • Some experimentation data lives in Production, and we depend on manually downloading it to local computers for analysis. As a result, we lose data lineage and have no way of governing the data
  • The rest of the experimentation data is stored in the gs://archipelo-ml-datasets/ GCS bucket in the Dev environment, where it can easily be tampered with and where it is easy to introduce chaos into the data structure
  • Costly labeled data lives in the Dev environment, where it can easily be tampered with or lost

Decision

  1. Create a new environment variable, DATA_FETCH_FACTOR.

    • this variable limits the amount of data fetched from 3rd parties by data extraction or data collection code (see the sketch after this list)
      • Production would fetch every data row
      • QA would fetch every 5th data row
      • Dev would fetch every 20th data row
    • the variable's purpose is to affect only the data extraction/collection steps in data pipelines that fetch data from 3rd parties in the target architecture
    • all other steps, such as data aggregation/transformation, must not take this variable into account
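    • a minimal Python sketch of how an extraction step could honor this variable, assuming it is an integer row stride (Prod=1, QA=5, Dev=20); the helper and client names below are illustrative, not an agreed implementation:

      import os
      from typing import Iterable, Iterator

      # DATA_FETCH_FACTOR comes from this decision; the integer-stride
      # encoding (Prod=1, QA=5, Dev=20) is an assumption, not a settled format.
      DATA_FETCH_FACTOR = int(os.environ.get("DATA_FETCH_FACTOR", "1"))

      def limit_fetched_rows(rows: Iterable[dict]) -> Iterator[dict]:
          """Yield every DATA_FETCH_FACTOR-th row fetched from a 3rd party.

          Only extraction/collection steps should apply this; aggregation and
          transformation steps must ignore DATA_FETCH_FACTOR entirely.
          """
          for index, row in enumerate(rows):
              if index % DATA_FETCH_FACTOR == 0:
                  yield row

      # Illustrative usage inside a hypothetical extraction step:
      # for row in limit_fetched_rows(third_party_client.fetch_all()):
      #     raw_bucket_writer.write(row)
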
  2. Create a data synchronization tool that can copy/overwrite a portion of the data files from source environment buckets to the same locations in target environment buckets (from Prod to QA and from QA to Dev). The tool should be able to:

    • read all Parquet, JSONL, and JSON files from a specific partition (path with file wildcards) of a bucket
    • deterministically pick a random, configurable portion of the read data based on the value of one or more columns (see the sampling sketch after this list)
    • save the subset of the data by overwriting the same (smaller) files in the target environment
    • the number, location, and names of saved files should stay the same; only the bucket name may change
    • the tool should run on a bi-weekly schedule with a configuration specifying the buckets and paths to sync
    • it should be possible to run the tool via TF as part of a new environment setup, on demand, with a configuration specifying the buckets, paths, and executions (date ranges) to sync
    • other binary files should be copied as they are
    • if a data catalog is integrated, the tool should also add the new partitions' metadata to the target environment's data catalog
    • the tool should also be able to copy a portion of the data from Cloud SQL in the same way as with Parquet files
    • ideally, files that did not change should not be copied
    • sample configuration for each copy action:

      - source: "gs://archipelo-models-prod"
        target: "gs://archipelo-models-qa"
        path: "/**/*.[bin|json]"
        factor: 1.0
      - source: "archipelo-prod:us-central1:ardb-prod-912e743a"
        target: "archipelo-qa:us-central1:ardb-qa-f4b82898"
        path: "golang_package"
        factor: 0.2
        key: "id"
  3. Set up proper ML infrastructure in the Production environment to enable ML experimentation, since Data Scientists should be users of production data.

    • create a dedicated bucket for ML experimentation/analysis. Note: in the future, files older than one month will be cleaned from this bucket for legal reasons (GDPR).
    • create a new data science Google group that collects all internal users interested in our data.
    • grant read/write/list permissions on this bucket to all accounts in the data science Google group (see the permissions sketch after this list).
    • grant read permissions on all other data-related buckets (especially ETL buckets) to the data science Google group.
    • Vertex AI permissions?
    • JupyterLab?
    • integrate Doccano (our labeling tool) with the ETL process so that labeled data automatically lands in the ETL production buckets (not in the ML experimentation bucket)
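    • a sketch of how the bucket permissions could be granted with the google-cloud-storage client; the bucket and group names are placeholders, and the roles shown (objectAdmin on the experimentation bucket, objectViewer on read-only buckets) are an assumption to confirm:

      from google.cloud import storage

      def grant_group_role(bucket_name: str, group_email: str, role: str) -> None:
          """Add an IAM binding for a Google group on one bucket."""
          client = storage.Client()
          bucket = client.bucket(bucket_name)
          policy = bucket.get_iam_policy(requested_policy_version=3)
          policy.bindings.append({"role": role, "members": {f"group:{group_email}"}})
          bucket.set_iam_policy(policy)

      # Placeholder bucket and group names; the real ones come out of this setup.
      grant_group_role("archipelo-ml-experimentation", "data-science@archipelo.com",
                       "roles/storage.objectAdmin")   # read/write/list on the experimentation bucket
      grant_group_role("archipelo-etl-prod", "data-science@archipelo.com",
                       "roles/storage.objectViewer")  # read-only on other data buckets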

Consequences

  • Data management will be easier
  • Data will stay synchronized between the environments
  • Performance estimations on the QA environment will be possible
  • Processing/storage costs will be kept under control
  • ML experimentation and MLOps activities will be easier
  • No production data will be lost