# Troubleshooting

This document provides troubleshooting tips for common issues that can occur in any environment.
## PostgreSQL
Tips and tricks related to PostgreSQL.
### Identifying and Killing Blocking Queries
If you suspect that a service issue is caused by blocked queries, you can verify that by running the following query:
```sql
SELECT blocked_locks.pid          AS blocked_pid,
       blocked_activity.usename   AS blocked_user,
       blocking_locks.pid         AS blocking_pid,
       blocking_activity.usename  AS blocking_user,
       blocked_activity.query     AS blocked_statement,
       blocking_activity.query    AS current_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
    AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
    AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
    AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
    AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
    AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
    AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
```
The most interesting columns returned by the query are:

- `blocked_pid` - the process identifier (PID) of the blocked query
- `blocking_pid` - the PID of the session holding the lock that blocks the query
- `blocked_statement` - the statement that is blocked
- `current_statement` - the statement currently running in the blocking session
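
On PostgreSQL 9.6 and newer, the built-in `pg_blocking_pids()` function offers a shorter alternative check; a minimal sketch:

```sql
-- List every session that is currently waiting, together with the
-- PIDs of the sessions blocking it.
SELECT pid,
       usename,
       pg_blocking_pids(pid) AS blocked_by,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```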
#### Killing a Blocking Query
To stop the execution of a query, run the following statement:

```sql
SELECT pg_terminate_backend(QUERY_PID);
```

where `QUERY_PID` is the process identifier of the query that was identified as blocking.
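
If terminating the whole backend is heavier than needed, you can first try the built-in `pg_cancel_backend()`, which cancels only the currently running statement and keeps the session alive:

```sql
-- Gentler first attempt: cancel the running statement only.
SELECT pg_cancel_backend(QUERY_PID);

-- If the statement does not stop, terminate the whole backend.
SELECT pg_terminate_backend(QUERY_PID);
```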
Warning: if the blocking query comes from a Cloud Run service, you also need to revert the service to its previous version; otherwise GCP may attempt to deploy that version again, causing the same problem.
### Temporary File Too Small
Some expensive queries, such as creating a new table from a join of other tables, may require a temporary file larger than the default limit allows. This results in the following error in the query output:

```
ERROR: temporary file size exceeds temp_file_limit (9620200kB)
```
In such a case, you need to raise the `temp_file_limit` flag for the problematic database. The value is specified in kB and can be at most `2147483647`. You can do that via Terraform, similar to how the number of connections is provided in the `database_flags` section in `ardb.tf`.
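
As an illustration, the flag entry might look roughly like the following; this is a sketch only, and the resource and instance names are assumptions - mirror the existing `database_flags` entries in `ardb.tf`:

```hcl
# Sketch only - resource and instance names are assumptions.
resource "google_sql_database_instance" "ardb" {
  # ... existing settings ...
  settings {
    database_flags {
      name  = "temp_file_limit"
      value = "10485760" # in kB (~10 GB); the maximum allowed value is 2147483647
    }
  }
}
```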
### Checking for Inconsistent B-tree Indexes
To check for inconsistent B-tree indexes, first install the `amcheck` extension by running:

```sql
CREATE EXTENSION amcheck;
```

and then run the following query:
```sql
DO $$
DECLARE
    r RECORD;
    version varchar(100);
BEGIN
    RAISE NOTICE 'Started amcheck on database: %', current_database();

    -- Determine the major server version; the heapallindexed check is
    -- only available on PostgreSQL 11 and newer.
    SHOW server_version INTO version;
    SELECT split_part(version, '.', 1) INTO version;

    -- Iterate over all valid, ready, non-temporary B-tree indexes.
    FOR r IN
        SELECT c.oid, c.oid::regclass relname, i.indisunique
        FROM pg_index i
        JOIN pg_opclass op ON i.indclass[0] = op.oid
        JOIN pg_am am ON op.opcmethod = am.oid
        JOIN pg_class c ON i.indexrelid = c.oid
        JOIN pg_namespace n ON c.relnamespace = n.oid
        WHERE am.amname = 'btree'
          AND c.relpersistence != 't'
          AND c.relkind = 'i'
          AND i.indisready AND i.indisvalid
    LOOP
        BEGIN
            RAISE NOTICE 'Checking index %:', r.relname;
            IF version = '10' THEN
                PERFORM bt_index_check(index => r.oid);
            ELSE
                PERFORM bt_index_check(index => r.oid, heapallindexed => r.indisunique);
            END IF;
        EXCEPTION
            WHEN undefined_function THEN
                RAISE EXCEPTION 'Failed to find the amcheck extension';
            WHEN OTHERS THEN
                RAISE LOG 'Failed to check index %: %', r.relname, sqlerrm;
                RAISE WARNING 'Failed to check index %: %', r.relname, sqlerrm;
        END;
    END LOOP;

    RAISE NOTICE 'Finished amcheck on database: %', current_database();
END $$;
```
If an issue is found with an index, you will see a message like the following:

```
WARNING: Failed to check index account_privacy_list_pkey: item order invariant violated for index "account_privacy_list_pkey"
```
This means that the affected indexes need to be reindexed, which can be done by running:

```sql
REINDEX INDEX account_privacy_list_pkey;
```
Before reindexing, you may need to remove duplicate values from the table the index covers.
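
As an illustration, duplicates can be located with a grouping query; the table and column names below are assumptions based on the example index above - adjust them to the columns the index actually covers:

```sql
-- Hypothetical example: find key values that appear more than once.
SELECT account_id, count(*)
FROM account_privacy_list
GROUP BY account_id
HAVING count(*) > 1;
```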
## ETL
### Dealing with Incompatible Job Deployments
Dataflow job deployment may fail for various reasons, one of which is job incompatibility. You can recognize an incompatible job deployment when there is an alert in the #alerts Slack channel, or by looking into the logs of the Dataflow job and identifying the following log message with `ERROR` severity:

```
Workflow failed. Causes: The new job is not compatible (...)
```
In such a case, the failed job needs to be updated semi-manually. The process is as follows:
- Kill the incompatible job:

  Stop the incompatible job either via the GCP UI or using the `gcloud` command, as sketched below. Choose to cancel the job, not to drain it; draining takes longer.
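
  For example (the region is an assumption - use the region the job runs in):

  ```shell
  # List active jobs to find the JOB_ID.
  gcloud dataflow jobs list --region=europe-west1 --status=active

  # Cancel (not drain) the incompatible job.
  gcloud dataflow jobs cancel JOB_ID --region=europe-west1
  ```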
- Identify the newest pipeline hash:

  We store job templates in a GCS bucket under their commit hash. For example, to identify the latest version of the pipeline, run the following command:

  ```shell
  gcloud storage ls -l gs://etl-dataflow-qa-79eca957c4eac38e/dataflow/templates/ | sort -k 2 | tail -n 5
  ```

  The above command lists the 5 most recent jobs in the QA environment, for example:

  ```
  1781  2024-05-22T17:56:48Z  gs://etl-dataflow-qa-79eca957c4eac38e/dataflow/templates/355e56cb1169358cf7f83039cb8ebfe9d8f672a9.json
  1781  2024-05-24T06:57:50Z  gs://etl-dataflow-qa-79eca957c4eac38e/dataflow/templates/9b0a4de215c27fcc8dd159cb25aaf0788ade0818.json
  1781  2024-05-24T10:27:51Z  gs://etl-dataflow-qa-79eca957c4eac38e/dataflow/templates/60c4b9dd903251cbfe9cc38ea3ea562e1d357fc9.json
  1781  2024-05-24T11:44:23Z  gs://etl-dataflow-qa-79eca957c4eac38e/dataflow/templates/5e62e572f5f432241e9fdcb3f4466c749997567a.json
  1781  2024-05-24T15:08:03Z  gs://etl-dataflow-qa-79eca957c4eac38e/dataflow/templates/4070726efaf73b95ec40bd379d3a3e817d724af7.json
  ```

  In this case the newest one is `4070726efaf73b95ec40bd379d3a3e817d724af7`.
- Update Terraform:

  Update the Terraform variable called `commit_hash` in the environment where the job deployment failed and set it to the value identified in the previous step. Once that is done and the job is killed, apply the changes via Terraform.
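
  A minimal sketch of the change (the file layout is an assumption; follow the environment's existing Terraform structure):

  ```hcl
  # e.g. in the environment's tfvars file - the newest template hash
  # identified in the previous step:
  commit_hash = "4070726efaf73b95ec40bd379d3a3e817d724af7"
  ```

  Then run `terraform apply` for that environment.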