How To Debug Slowness & High CPU Utilization Issue in PostgreSQL Database
List of Metrics to monitor for PostgreSQL Database
In this post we will learn to Debug Slowness & High CPU Utilization Issue in PostgreSQL Database
There was a case with AWS RDS instance. The CPU utilization was 100% even after changing it to m3.xlarge from t2.medium. The problem was that some queries were stuck and were running for hours. This was keeping the CPU busy. The same query when run through psql shell gives the output in 4-5 seconds. 4-5 seconds are also too much. Although trying the same query from console gets executed successfully but sometimes it was getting stuck and kept on running for hours.
These are initial steps that can be taken to help find the root cause.
First check following system resources:
Disk Space : You must have 10% disk space available for the PostgreSQL Data Partition. Disk space may fluctuate during PostgreSQL vaccum-ing during high write loads.
CPU Usage : Badly optimized queries may hold the CPU for long time. High CPU usage may be because of compute intensive queries.
I/O Usage : If PostgreSQL runs slow, first measure the CPU percentage of IO/wait that indicates the amount of time machine is waiting for the disk read/write. You can check this with “top” command in Linux.
Now check PostgreSQL metrics:
1. Total Number of Connections
max_connections determines the maximum number of concurrent transactions to a database server and give a list of clients which are leaking the database connections. This you can get with the query below:
SELECT count(*) FROM pg_stat_activity;
2. Number of Connections by state
Four possible states of connections are :
(a) active – Connection currently executing the queries of transaction.
(b) idle – Connection not executing a transaction.
(c) idle in transaction – Connection in a long-running transaction i.i.e not executing query.
(d) idle in transaction(aborted) – Connection in a situation where transaction hasn’t been rolled back due to an error.
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
3. Connections waiting for a lock
Blocked Connections waiting for a lock indicates slow execution of transaction with an exclusive lock.
SELECT count(distinct pid) FROM pg_locks WHERE granted = false
4. Maximum Transaction Age
Transactions should be short so that it can get executed within a minute. Long-running transactions prevents PostgreSQL from vaccum-ing old data, it may shutdown the database due to transaction ID(xid) wraparound. If the output is more than one hour than this is a point of worry. This is because the query is running since long time keeping the resources busy. Change the max age parameter (in application code) of the connection to lowest possible values like 2-3 seconds as per the average response time of your database query.
SELECT max(now() -xact_start) FROM pg_stat_activity WHERE state IN ('idle in transaction','active');
5. Checkpoint Interval
Frequent Checkpoints leads down performance. PostgreSQL will display about those checkpoints in its log.
Also, you can check the frequency in the pg_stat_bgwriter table.
6. Query Execution Time
You have to measure it at the application level or by setting and analyzing the log queries periodically log_min_duration_statement=0 or by monitoring the pg_stat_statements module.
For me, the fourth point above was the issue. Some queries were running since a long time and making CPU busy. The one possible solution for this issue is to find those queries and kill them. You can read about how to kill a long running query here.