Doccano

This document describes the current Doccano setup.

The Project

The current Doccano setup is manual and consists of:

  • A virtual machine inside the archipelo-dev project on GCP
  • The latest Doccano Docker container. See here for instructions.
  • A manual installation of Doccano
  • A manual installation of NGINX
  • A manual installation of Certbot

Docker Build and Run

First, pull Doccano's latest Docker image:

docker pull doccano/doccano

The next step is to create a Doccano container from the pulled image. Ensure that the VM has sufficient memory and compute, then create the container with the following environment variables set:

docker container create --name doccano -e "ADMIN_USERNAME=admin" -e "ADMIN_EMAIL=admin@archipelo.co" -e "ADMIN_PASSWORD=<ADMIN_PASSWORD>" -e "WORKERS=8" -e "CELERY_WORKERS=8" -v doccano-db:/data -p 8000:8000 doccano/doccano
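Note that docker container create only creates the container; it does not start it. A minimal sketch for starting it and checking its status, assuming the container name doccano from the command above (the start_doccano helper name is hypothetical):

```shell
# Name given to the container via --name in the create command above.
CONTAINER_NAME="doccano"

# Hypothetical helper: start the created container and print its status.
start_doccano() {
  docker container start "$CONTAINER_NAME" || return 1
  # List only this container, showing its name and current status.
  docker ps --filter "name=$CONTAINER_NAME" --format '{{.Names}}: {{.Status}}'
}
```

Call start_doccano on the VM once the container has been created.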

To receive the admin password, contact:

Docker Configuration

Once the Docker container is running, set its restart policy to always so that it starts automatically after the VM restarts. To do that, run the following command:

docker update --restart=always <CONTAINER_ID>

You can get the container identifier by running the following command:

docker ps

The output of the above command looks as follows:

CONTAINER ID   IMAGE             COMMAND                  CREATED       STATUS       PORTS                    NAMES
01dc2b43e63f   doccano/doccano   "/doccano/tools/run.…"   4 hours ago   Up 3 hours   0.0.0.0:8000->8000/tcp   doccano
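Because the container was created with --name doccano, the name can be used wherever Docker accepts a container identifier. A small sketch, assuming that name, that sets the restart policy and reads it back (the set_restart_always helper name is hypothetical):

```shell
# Name given to the container via --name when it was created.
CONTAINER_NAME="doccano"

# Hypothetical helper: set the restart policy by name and print it back.
set_restart_always() {
  docker update --restart=always "$CONTAINER_NAME" || return 1
  # Print the restart policy name stored in the container configuration.
  docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' "$CONTAINER_NAME"
}
```

If the update succeeded, the inspect command is expected to report the always policy.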

Setting up the Domain Name

The main domain doccano.archipelo.com is configured and handled by CloudFlare.

To set up the dev-archer.com domain name for Doccano, configure an A record that points to the external IP address of the Doccano virtual machine. You can do that in the Cloud DNS section of the archipelo-dev project in Google Cloud Platform.

Currently, the domain used for Doccano is doccano.dev-archer.com. Its A record is named doccano.dev-archer.com. (with the trailing dot), has a TTL of 3600 seconds, and points to the external IP address of the Doccano virtual machine: 35.195.54.139.
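To confirm that the A record has propagated, the resolved address can be compared with the expected one. A minimal sketch, assuming dig is installed on the machine running the check; the helper names ip_matches and check_a_record are hypothetical:

```shell
DOMAIN="doccano.dev-archer.com"
EXPECTED_IP="35.195.54.139"

# Hypothetical helper: succeed only if the given address is the expected one.
ip_matches() {
  [ "$1" = "$EXPECTED_IP" ]
}

# Hypothetical helper: resolve the A record and compare it with EXPECTED_IP.
check_a_record() {
  # +short prints only the answer; keep the last line in case several are returned.
  resolved="$(dig +short "$DOMAIN" A | tail -n 1)"
  if ip_matches "$resolved"; then
    echo "OK: $DOMAIN resolves to $resolved"
  else
    echo "MISMATCH: $DOMAIN resolves to '$resolved', expected $EXPECTED_IP" >&2
    return 1
  fi
}
```

Run check_a_record after changing the record; with a TTL of 3600 seconds, resolvers may keep serving a previous value for up to an hour.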

Setting Up NGINX

After installing NGINX on the same virtual machine that Doccano is running on, configure it to handle the traffic on port 80 and make sure that caching is turned off. To do that, remove the /etc/nginx/sites-enabled/default file and create a new file called /etc/nginx/sites-enabled/doccano with the following content:

server {
    server_name doccano.dev-archer.com;

    location / {
        # Forward requests to the Doccano container.
        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Turn off proxy caching.
        proxy_no_cache 1;
        proxy_cache_bypass 1;

        # Security headers.
        add_header X-Frame-Options DENY;
        add_header X-Content-Type-Options nosniff;
        add_header X-XSS-Protection "1; mode=block";

        # Tell clients not to cache responses.
        add_header Last-Modified $date_gmt;
        add_header Cache-Control 'no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0';
        if_modified_since off;
        expires off;
        etag off;
    }
}

# Redirect plain HTTP to HTTPS.
server {
    server_name doccano.dev-archer.com;
    listen 80;
    return 301 https://$host$request_uri;
}

The above configuration assumes that you are using doccano.dev-archer.com as the domain name.

After that, restart NGINX by running the following command:

service nginx restart
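A syntax error in the configuration can leave NGINX unable to start after a restart, so it can help to validate the configuration first. A minimal sketch using nginx -t, which checks the configuration without applying it (the safe_nginx_restart helper name is hypothetical):

```shell
# Hypothetical helper: restart NGINX only if the configuration is valid.
safe_nginx_restart() {
  # nginx -t parses the configuration and reports errors without applying it.
  if nginx -t; then
    service nginx restart
  else
    echo "NGINX configuration test failed; not restarting." >&2
    return 1
  fi
}
```

The same helper can be used in place of every plain service nginx restart in this document.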

Upload File Limits

By default, NGINX limits the size of an uploaded file (the request body) to 1 MB. This is not enough for the data that we work with during labeling. To raise the limit, edit the /etc/nginx/nginx.conf configuration file and add the following line to the http section:

client_max_body_size 100M;

After that, restart NGINX by running the following command:

service nginx restart
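To check that the new limit is part of the configuration NGINX actually loads, the full configuration can be dumped and searched. A small sketch using nginx -T, which tests the configuration and prints it including all included files (the show_upload_limit helper name is hypothetical):

```shell
# Hypothetical helper: show client_max_body_size lines from the loaded config.
show_upload_limit() {
  # nginx -T tests the configuration and dumps it to standard output.
  nginx -T 2>/dev/null | grep 'client_max_body_size'
}
```

If the helper prints nothing, the directive is not set anywhere and the 1 MB default applies.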

Setting Up SSL

Doccano can be exposed via different domains. Currently, it is set up to be accessible via:

  • doccano.archipelo.com
  • doccano.dev-archer.com

The SSL setup differs between the two.

Warning: The traffic on doccano.archipelo.com is proxied via CloudFlare, which means it is more secure.

SSL For doccano.archipelo.com

To use doccano.archipelo.com, add an A record in CloudFlare that points to the IP address of the Doccano virtual machine. This requires asking a person with admin rights to do it for you. The people with admin rights include:

  • Dom
  • Rafał
  • Stepan

SSL For doccano.dev-archer.com

In its current setup, Doccano uses Certbot to manage the certificate. We will change that once we move Doccano to the archipelo-prod project.

You need to install Certbot on the same virtual machine that Doccano and NGINX are running on. You can do that by running the following commands:

apt install certbot
apt install python3-certbot-nginx

Once the installation is successful, you can start the certificate installation process by running the following command:

certbot --nginx

Answer the questions that Certbot asks; it guides you through the setup process. The only thing to remember is to provide an appropriate e-mail address so that the whole engineering team is notified when the certificate is about to expire. We suggest using engineering@archipelo.co.
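The same answers can also be passed on the command line, which makes the step repeatable. A sketch under the assumption that the certificate is for doccano.dev-archer.com and that the suggested address is used as the contact; the issue_certificate helper name is hypothetical, and --agree-tos accepts the certificate authority's terms of service on your behalf:

```shell
DOMAIN="doccano.dev-archer.com"
CONTACT_EMAIL="engineering@archipelo.co"

# Hypothetical helper: request and install the certificate without prompts.
issue_certificate() {
  # -d selects the domain, -m sets the contact e-mail for expiry notices.
  certbot --nginx \
    -d "$DOMAIN" \
    -m "$CONTACT_EMAIL" \
    --agree-tos \
    --non-interactive
}
```

Like the interactive run, this modifies the NGINX configuration, so review the resulting server blocks afterwards.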

Once that is done, restart NGINX by running the following command:

service nginx restart

Automatic Certificate Renewal

It is possible to set up Certbot for automatic certificate renewal. To do that, edit the crontab by running the following command:

crontab -e

Then add the following entry:

0 12 * * * /usr/bin/certbot renew --quiet

Once that is done, save the file. The entry runs certbot renew every day at 12:00, and Certbot renews the certificate only when it is close to expiry.
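The renewal path can be exercised without waiting for the certificate to approach expiry. A minimal sketch using Certbot's --dry-run option, which performs a test renewal without saving any certificates, followed by a check that the cron entry is installed (the check_renewal helper name is hypothetical):

```shell
# Hypothetical helper: simulate a renewal and show the installed cron entry.
check_renewal() {
  # --dry-run performs a test renewal and does not save certificates to disk.
  certbot renew --dry-run || return 1
  # List the current user's crontab entries that mention certbot.
  crontab -l | grep 'certbot'
}
```

Run check_renewal as the same user whose crontab was edited, since crontab -l only lists that user's entries.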