Software Environments

This document describes the development environments we have at Archipelo, how we use them, and what purpose each one serves.

We have 4 environments that are fully independent, meaning that actions on one of them do not affect the others. Each one serves a different purpose, so it is important to understand how they are used. The release process is built around them.
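The four environments described in the sections below can be summarized as a small lookup table. This is only an illustrative sketch: the GCP project names and release intervals come from this document, while the `Environment` class and `get_environment` helper are hypothetical names, not part of our codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Environment:
    name: str
    gcp_project: Optional[str]  # None for local, which is not hosted on GCP
    release_interval: str


# Values taken from the sections of this document; the structure is illustrative.
ENVIRONMENTS = {
    "local": Environment("local", None, "no releases; the developer picks a branch"),
    "dev": Environment("dev", "archipelo-dev", "manual, on request"),
    "qa": Environment("qa", "archipelo-qa", "on each merge to main of Archipelo/top"),
    "prod": Environment("prod", "archipelo-prod", "manual, on request"),
}


def get_environment(name: str) -> Environment:
    """Look up an environment by name, ignoring case and surrounding spaces."""
    key = name.strip().lower()
    if key not in ENVIRONMENTS:
        raise ValueError(
            f"unknown environment {name!r}; expected one of {sorted(ENVIRONMENTS)}"
        )
    return ENVIRONMENTS[key]
```

For example, `get_environment("QA").gcp_project` yields `archipelo-qa`.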

What is an environment?

An environment is a place that runs all components of our apps, so that they can be used as a complete product. Depending on the need or current task, you may need to run our product in a different environment, e.g. to test a feature or reproduce a bug.

The parts that make up the whole product are:

  • Frontend - visual part of the product (Web App, Chrome Extensions or VS Code plugin) written in TypeScript.
  • Backend - servers written in Go responsible for receiving requests from the frontend through the API and responding with data. It consists of 2 separate parts:
    • web - user facing API,
    • spider - service responsible for crawling the data.
  • Data storage - We use several databases for different purposes. The main one is PostgreSQL, which stores catalogDB and identityDB in 2 separate instances. We also have ClickHouse (events and user metrics), OpenSearch (search-related data) and Redis (frontend cache). Some other data is stored in GCS Buckets (ML models, cloned GitHub repositories, temporary files for dev environment testing, raw files like package.json).
  • ML - Machine Learning code written in Python. The models are stored in GCS Buckets and use the Vertex AI service.

Cloud Services

Services provided by Google Cloud Platform (GCP) are a crucial part of our products. Cloud services give our app its full potential to deliver the best user experience with all the features. That is why, in some cases, it is impossible to test specific parts of the product without a connection to GCP services.

The most important GCP services we use:

  • Compute Engine - hosts virtual machine instances
  • Google Cloud Storage (GCS) - place for storing some of our data
  • Cloud Run - deploys our containers providing APIs
  • Cloud SQL - PostgreSQL database instances
  • BigQuery - processing Stack Overflow Q&A and keywords (interest cloud)
  • Cloud Tasks - queues of tasks for crawling data from our data sources
  • Cloud Scheduler - cron jobs for refreshing our data
  • Memorystore - managed service for Redis database instances
  • Vertex AI - train ML models and create datasets
  • Pub/Sub - messaging service used for user behavior metrics
  • Dataflow - stream and batch data processing
  • Monitoring - dashboards for GCP project monitoring
  • Logging - logs from code execution

Local

Update interval: No releases - a developer decides which snapshot of the codebase will be used by choosing a specific branch.

Self-hosted by each engineer on their own hardware.

Many tasks can be developed and tested using the cost-free local environment. Both frontend and backend can be run locally. A mixed setup, such as a local frontend with the backend on the dev environment, is also possible (this may be called a "hybrid environment"). However, it is not possible to have a full local environment with all components, because some of them need to exist in cloud services (although they are not required for every kind of testing). It is recommended to use the QA or dev environment for the backend, since locally created databases start empty, which limits their usefulness for testing. To have useful test data in the local environment, you would need to crawl it beforehand, which can be time-consuming depending on the amount needed for the tests.
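As a rough illustration of the hybrid setup, a local tool could choose its backend from a setting. The sketch below is hypothetical: the `BACKEND_ENV` variable, the `resolve_backend_url` helper and both URLs are placeholders invented for this example, since the real endpoints are not listed in this document.

```python
import os
from typing import Mapping, Optional

# Placeholder URLs for illustration only; replace with the real endpoints.
BACKEND_URLS = {
    "local": "http://localhost:8080",
    "dev": "https://dev-api.example.invalid",
}


def resolve_backend_url(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the backend base URL selected by the (hypothetical) BACKEND_ENV setting.

    Falls back to the local backend when the variable is not set.
    """
    environ = os.environ if environ is None else environ
    target = environ.get("BACKEND_ENV", "local").strip().lower()
    try:
        return BACKEND_URLS[target]
    except KeyError:
        raise ValueError(
            f"unsupported backend {target!r}; expected one of {sorted(BACKEND_URLS)}"
        ) from None
```

With `BACKEND_ENV=dev`, a locally running frontend helper would talk to the dev backend; with the variable unset, everything stays local.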

What to test on local

All development-related changes that do not require vast quantities of data can be tested here. It is recommended to run static analysis checks here to avoid using up our GitHub resources.


Dev

GCP project: archipelo-dev
Release interval: Manual, on request.

A development environment hosted on Google Cloud Platform, which incurs costs.

The advantage of this environment compared to the local one is that here we benefit from cloud services that allow us to use the full potential of the products. The data still needs to be crawled first (unless it already exists from an earlier crawl). The amount of available data should be kept limited, but sufficient for simple use cases. The dev environment belongs to one specific GCP project, but may contain several virtual machine instances (shared dev or dedicated dev).

Shared Dev

Ultimately, we want to have only 1 dev environment on the cloud shared by the whole team. It would need to be booked (more info soon) in order to be used by a specific engineer. One dev environment should cover most of the team's needs.

Currently, we do not have such an environment, but we have already decided to follow this approach, and we will set it up properly when we have the capacity to implement it.

Dedicated Dev

When several engineers want to work on a dev environment at the same time, it is possible to create additional ones, but they need to be removed when the work on them is finished. A new dev environment created from scratch contains no data, so the data needs to be copied from another environment (which is manual work).

In the future, we plan to have either a script that does that, or ready-to-copy data that can be easily loaded into a new environment.

What to test on dev

Simple tests that require a small amount of data and use data-related features of our products. It is also recommended to use this environment for a preliminary run of more complex tests, as a verification before running the full suite on a large amount of data on the QA or prod environment (to do this, split your test into smaller chunks that can be run independently).
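Splitting a test into independent chunks can be as simple as slicing its input. The following is a minimal sketch, assuming the test input is an in-memory sequence; `split_into_chunks` and `preliminary_chunk` are hypothetical helper names.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_chunks(items: Sequence[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most chunk_size items each."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


def preliminary_chunk(items: Sequence[T], chunk_size: int) -> List[T]:
    """Return only the first chunk, e.g. for a quick verification run on dev."""
    return next(split_into_chunks(items, chunk_size), [])
```

A preliminary run on dev could process `preliminary_chunk(inputs, 100)`, and the full run on QA or prod could then iterate over every chunk from `split_into_chunks(inputs, 100)`.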


QA

GCP project: archipelo-qa
Release interval: With each merge to the main branch of the Archipelo/top repository.

A volatile environment dedicated to more sophisticated testing, which is updated with every change on the main branch.

Its purpose is to provide the ability to test various features and execute tests that might affect our users if run on production. The QA environment gives the team a place to see whether all components of the product are integrated properly. The amount of data here is lower than on production, but it can contain thousands (if not millions) of records to enable meaningful simulation of how the products function in a close-to-production environment.

The difference between dev and QA is that QA has a Redis database used for frontend caching and a higher limit for crawled data (the amount of data the spider service will produce for a specific environment).

What to test on QA

Tests that require more data than is available on the dev environment. In some cases it may also be worth considering running the test directly on prod (the rules that help make this decision are defined below under the Prod section).

Before the code is pushed to QA, it is verified by different types of tests run automatically using GitHub Actions - unit tests, integration tests, E2E tests.


Prod

GCP project: archipelo-prod
Release interval: Manual, on request.

A production environment running code with data visible to our users. We need to be very careful when touching the prod environment, since changes can directly affect the users and cause issues. This environment is very similar to QA, but it has more data (everything that we have crawled so far) and better infrastructure that can handle a high workload. It is not recommended to use the production backend for any testing. We monitor prod constantly and use it to track the performance of our products.

What to test on prod

Rules for deciding whether a test should run on production:

  1. It's a long-running test (meaning "days"). It's hard to define a strict rule here, but if you plan to execute tests that may run for several days, consider running them on production, and always discuss it with the team beforehand.
  2. The test does not affect the users directly. If the data can later be used in production, it's even better to have it there.
  3. The processed data is hard to move to production later. If data produced by the test would be needed for production, and it is difficult to move it (e.g. we would need to execute the test again on a different environment), we can consider running the test directly on prod.
  4. The costs of running the test are potentially high. The cost may be related to computational workload or storage. Using the cloud is not free, and we should not risk spending too much money on tests that later need to be repeated on another environment. Good practice is to prepare a smaller chunk of the test that produces a smaller amount of data, run it as a proof of concept, and later repeat the test for the whole dataset.
  5. The cost of our time is high. We need to share our time optimally between different tasks when working on our products. When we decide to work on task A, there is a hidden cost of not spending this time on tasks B, C and D. This also relates to situations when our work can affect others and their time. Before running long-running tests, we need to take into account how this work will affect our own time and the time of other members of the team.
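The rules above can be read as a checklist. The sketch below is one possible encoding, not an agreed policy: it assumes that "does not affect the users" is a hard precondition, uses an arbitrary default threshold of two days for "long-running", and leaves rule 5 (the cost of our time) to human judgment. `PlannedRun` and `reasons_to_consider_prod` are hypothetical names, and the result is only input for the discussion with the team.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlannedRun:
    expected_days: float          # estimated duration of the test
    affects_users: bool           # would the test directly affect users?
    output_needed_on_prod: bool   # will production need the produced data?
    output_hard_to_move: bool     # is the data hard to move to prod later?
    rerun_cost_high: bool         # would repeating the test elsewhere be costly?


def reasons_to_consider_prod(
    plan: PlannedRun, long_running_days: float = 2.0
) -> List[str]:
    """List the rules that speak for running the test on prod.

    An empty list means the checklist gives no reason to use prod.
    The two-day threshold is an assumption made for this sketch.
    """
    if plan.affects_users:
        # Assumed hard precondition (rule 2): never run such tests on prod.
        return []
    reasons = []
    if plan.expected_days >= long_running_days:
        reasons.append("long-running test (rule 1)")
    if plan.output_needed_on_prod and plan.output_hard_to_move:
        reasons.append("output is needed on prod and hard to move (rule 3)")
    if plan.rerun_cost_high:
        reasons.append("repeating the test elsewhere would be costly (rule 4)")
    return reasons
```

Even when the list is non-empty, the decision still needs to be discussed with the team before anything runs on prod.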