Vacuum process for PostgreSQL databases
Database vacuuming is a way to increase the table and database performance of a PostgreSQL database. Learn how to use the vacuum process to clean the database.
Introduction
VACUUM is a garbage collection mechanism in PostgreSQL. It reclaims storage occupied by dead tuples. In normal PostgreSQL operation, tuples that are modified by an update/delete are not physically removed from their table; they remain present until a VACUUM is done. Therefore it's necessary to do VACUUM periodically, especially on tables.
For more information, see https://www.postgresql.org/docs/current/sql-vacuum.html
Vacuuming basics
Vacuuming achieves the following:
Cleans out dead rows (VACUUM)
Keeps database functional (FREEZE)
Updates info about relations (ANALYZE)
Automates the vacuum for maintenance on a regular basis (autovacuum)
From a broad perspective, the vacuuming can be done in two ways:·
Plain VACUUM (without FULL) simply reclaims space and makes it available for reuse
VACUUM FULL can reclaim space and give it to the operating system.
Key vacuum operations
VACUUM
VACUUM FULL (or CLUSTER)
VACUUM FREEZE
VACUUM ANALYZE
Plain VACUUM: Frees up space for reuse.
Connect to your database, here for example "corpus" and run the below command:
CODEvacuum [tablename];
Full VACUUM: Locks the database table and puts the exclusive lock on the table but reclaims more space than a plain VACUUM.
Connect to your database, here for example "corpus" and run the below command. It runs on all the tables of the "corpus" database.
CODEvacuum full;
CODEvacuum full [tablename];
Full VACUUM and ANALYZE: Performs a Full VACUUM and gathers new statistics on query executions paths using ANALYZE:
CODEvacuum full analyze [tablename];
Verbose Full VACUUM and ANALYZE: Same as #3 but with verbose progress output:
CODEvacuum full analyze verbose [tablename];
ANALYZE gathers statistics for query planners to create the most efficient query execution paths. According to the PostgreSQL documentation, accurate statistics help planners to choose the most appropriate query plan, and thereby improve the speed of query processing.
Example
In the example below, [tablename] is optional. Without a table specified, ANALYZE is run on available tables in the current schema that the user has access to.
CODEanalyze verbose [tablename];
VACUUM FREEZE: Freezes the transaction IDs for all pages no matter whether they have been modified, so that all current rows are seen as old for all new transactions. VACUUM FREEZE marks a table's contents with a specific transaction time stamp that tells Postgres that it does not need to be vacuumed, ever. With the next update, this frozen ID disappears.
To run VACUUM FREEZE on a database, use the below command after connecting to the database:
CODEvacuum freeze verbose
CODEvacuum freeze [tablename];
Automate routine vacuum maintenance
Autovacuum is one of the background utility processes that starts automatically when you start PostgreSQL. Autovacuum in Postgres refers to both automatic VACUUM and ANALYZE and not just VACUUM:
We need VACUUM to remove dead tuples so that the space occupied by dead tuples can be reused by the table for future inserts/updates.
We need ANALYZE on the table that updates the table statistics so that the optimizer can choose optimal execution plans for an SQL statement.
It is the autovacuum in Postgres that is responsible for performing both vacuum and analyze on tables.
Prerequisites
To start autovacuum, you must have the parameter autovacuum set to ON in the postgresql.conf file. By default, it is on.
Parameters
autovacuum = on # ( ON by default )
track_counts = on # ( ON by default )
When autovacuum runs
An automatic vacuum or analyze runs on a table depending on the following mathematic equations.
The formula for calculating the effective table level autovacuum threshold is:
Autovacuum VACUUM thresold for a table = autovacuum_vacuum_scale_factor * number of tuples + autovacuum_vacuum_threshold
With the equation above, it is clear that if the actual number of dead tuples in a table exceeds this effective threshold, due to updates and deletes, that table becomes a candidate for automatic VACUUM.
The below equation says that any table with a total number of inserts/deletes/updates exceeding this threshold—since last analyze—is eligible for automatic ANALYZE.
Autovacuum ANALYZE threshold for a table = autovacuum_analyze_scale_factor * number of tuples + autovacuum_analyze_threshold
Details
Let’s understand these parameters in detail.
autovacuum_vacuum_scale_factor OR autovacuum_analyze_scale_factor: Fraction of the table records that will be added to the formula. For example, a value of 0.2 equals 20% of the table records.
autovacuum_vacuum_threshold OR autovacuum_analyze_threshold: Minimum number of obsolete records or DML needed to trigger an autovacuum.
Example
Let’s consider a table "abc" with 1000 records and the following autovacuum parameters in the postgresql.conf file by default:
1 | autovacuum_vacuum_scale_factor = 0.2 |
2 | autovacuum_vacuum_threshold = 50 |
3 | autovacuum_analyze_scale_factor = 0.1 |
4 | autovacuum_analyze_threshold = 50 |
Using the above mentioned mathematical formulas as reference:
Table "abc" is a candidate for autovacuum VACUUM when:
Total number of Obsolete records = (0.2 * 1000) + 50 = 250
Table "abc" is a candidate for autovacuum ANALYZE when:
Total number of Inserts/Deletes/Updates = (0.1 * 1000) + 50 = 150
Useful commands
check autovacuum status
By default, autovacuuming should already be turned on, but let’s double-check with below command:
SELECT name, setting FROM pg_settings WHERE name='autovacuum';
check track_counts status
SELECT name, setting FROM pg_settings WHERE name='track_counts';
check settings related to autovacuum
You can view the current and default settings of the autovacuum parameters by the pg_settings:
select name, setting, unit, category, short_desc from pg_settings where category like 'Autovacuum';
check for disk space use
The following query shows you the table that is using the most disk space in your database:
SELECT
relname AS "table_name",
pg_size_pretty(pg_table_size(C.oid)) AS "table_size"
FROM
pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
AND nspname !~ '^pg_toast' AND relkind IN ('r')
ORDER BY pg_table_size(C.oid)
DESC LIMIT 1;
check for last (auto)vacuum run
SELECT relname, last_vacuum, last_autovacuum FROM pg_stat_user_tables;
check for disk space reclaim after full vacuum
First, connect your PostgreSQL database with superuser and create the extension "pgstattuple". After this, run the select command with your <tablename>:
\c corpus postgres;
CREATE EXTENSION pgstattuple;
SELECT (dead_tuple_len + approx_free_space)/1073741824.0
AS reclaimable FROM pgstattuple_approx('<tablename>');
tune autovacuum for tables
To tune autovacuum for tables individually, you must know the number of inserts/deletes/updates on a table for an interval. You can also view the Postgres catalog view to get that information: pg_stat_user_tables
SELECT n_tup_ins as "inserts",n_tup_upd as "updates",n_tup_del as "deletes",
n_live_tup as "live_tuples", n_dead_tup as "dead_tuples"
FROM pg_stat_user_tables;
n_dead_tup > n_live_top * 0.2
check for running vaccum tasks
Use this view to get a quick overview of running vacuum tasks:
SELECT * FROM pg_stat_progress_vacuum;
SELECT
p.pid,
now() - a.xact_start AS duration,
coalesce(wait_event_type ||'.'|| wait_event, 'f') AS waiting,
CASE
WHEN a.query ~*'^autovacuum.*to prevent wraparound' THEN 'wraparound'
WHEN a.query ~*'^vacuum' THEN 'user'
ELSE 'regular'
END AS mode,
p.datname AS database,
p.relid::regclass AS table,
p.phase,
pg_size_pretty(p.heap_blks_total *
current_setting('block_size')::int) AS table_size,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
pg_size_pretty(p.heap_blks_scanned *
current_setting('block_size')::int) AS scanned,
pg_size_pretty(p.heap_blks_vacuumed *
current_setting('block_size')::int) AS vacuumed,
round(100.0 * p.heap_blks_scanned / p.heap_blks_total, 1) AS scanned_pct,
round(100.0 * p.heap_blks_vacuumed / p.heap_blks_total, 1) AS vacuumed_pct,
p.index_vacuum_count,
round(100.0 * p.num_dead_tuples / p.max_dead_tuples,1) AS dead_pct
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a using (pid)
ORDER BY now() - a.xact_start DESC;