This quick guide won’t replace the documentation. It provides an example of PostgreSQL backup, PITR configuration & restore.
Backup configuration
This is not intended to supersede any backup configuration manual, but before recovery, let’s check how the backup is running.
The /mnt/cephfs is a shared filesystem on which I place the backups and the archivelogs. Archivelogs are saved after each WAL rotation. Let’s review the config.
Full Backup
A full backup is triggered every 2 weeks via the postgres
crontab. ( crontab -e
as the postgres
user)
# m h dom mon dow command 0 4 1,15 * * /var/lib/postgresql/bin/backup_database.sh && find /mnt/cephfs/backup/postgresql/archivelogs -mtime +30 -delete
$ cat /var/lib/postgresql/bin/backup_database.sh #!/bin/bash DIR=/mnt/cephfs/backup/postgresql/$( date +%F_%Hh%M ) mkdir "$DIR" pg_basebackup -z -Z8 -Ft -D "$DIR"
WAL backup
postgres@minetest:~$ grep ^archive /etc/postgresql/14/main/postgresql.conf archive_mode = on # enables archiving; off, on, or always archive_command = 'nice xz -1 -zc %p > /mnt/cephfs/backup/postgresql/archivelogs/%f.xz && nice xz -t /mnt/cephfs/backup/postgresql/archivelogs/%f.xz'
Restore
Check that you have the restore_command
aligned with your archive_command
(location, compression, etc…), and set the recovery_target_time
. The stop point must be after the ending time of the base backup.
You can see here a simple one line command that
- checks for the compressed
WAL / archivelog
integrity (-t
) - decompresses the archive and ‘cat’ the output to stdout (`xz -dc`)
- sends the decompressed output to the
WAL
file path to restore to (> %p
)
postgres@minetest:~$ grep -E '^(restore|recovery)' /etc/postgresql/14/main/postgresql.conf restore_command = 'xz -t /mnt/cephfs/backup/postgresql/archivelogs/%f.xz && xz -dc /mnt/cephfs/backup/postgresql/archivelogs/%f.xz > %p' recovery_target_time = '2023-04-08 16:00:00'
After checking the postgresql.conf
Stop the DB
For Debian/Ubuntu:
as root
systemctl stop postgresql
Delete the DB data & restore full backup
Make a copy if you have enough space.
mv /var/lib/postgresql/14/main /var/lib/postgresql/14/main_original_backup mkdir /var/lib/postgresql/14/main tar -zxvf /mnt/cephfs/backup/postgresql/2023-04-01_04h00/base.tar.gz -C /var/lib/postgresql/14/main/ chmod 700 /var/lib/postgresql/14/main touch /var/lib/postgresql/14/main/recovery.signal
The presence of the recovery.signal
file will trigger the recovery process on the next start.
Start the DB and monitor the Point-in-time recovery
- (as root)
systemctl start postgresql
- Watch the process progress with the log :
tail -f /var/log/postgresql/postgresql-14-main.log
2023-04-09 11:00:31.389 UTC [10090] LOG: starting PostgreSQL 14.7 (Ubuntu 14.7-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit 2023-04-09 11:00:31.390 UTC [10090] LOG: listening on IPv6 address "::1", port 5432 2023-04-09 11:00:31.390 UTC [10090] LOG: listening on IPv4 address "127.0.0.1", port 5432 2023-04-09 11:00:31.452 UTC [10090] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" 2023-04-09 11:00:31.533 UTC [10091] LOG: database system was interrupted; last known up at 2023-04-01 04:06:15 UTC xz: /mnt/cephfs/backup/postgresql/archivelogs/00000002.history.xz: No such file or directory 2023-04-09 11:00:31.594 UTC [10091] LOG: starting point-in-time recovery to 2023-04-08 16:00:00+00 2023-04-09 11:00:32.422 UTC [10091] LOG: restored log file "00000001000000A400000057" from archive 2023-04-09 11:00:32.602 UTC [10091] LOG: redo starts at A4/57000028 2023-04-09 11:00:32.663 UTC [10091] LOG: consistent recovery state reached at A4/577AD5A8 2023-04-09 11:00:32.664 UTC [10090] LOG: database system is ready to accept read-only connections 2023-04-09 11:00:33.814 UTC [10091] LOG: restored log file "00000001000000A400000058" from archive 2023-04-09 11:00:36.060 UTC [10091] LOG: restored log file "00000001000000A400000059" from archive 2023-04-09 11:00:37.675 UTC [10091] LOG: restored log file "00000001000000A40000005A" from archive ... [Deleted hundreds of lines] 2023-04-09 11:43:38.906 UTC [10091] LOG: restored log file "00000001000000A7000000B9" from archive 2023-04-09 11:43:40.502 UTC [10091] LOG: restored log file "00000001000000A7000000BA" from archive 2023-04-09 11:43:41.949 UTC [10091] LOG: restored log file "00000001000000A7000000BB" from archive 2023-04-09 11:43:44.134 UTC [10091] LOG: restored log file "00000001000000A7000000BC" from archive 2023-04-09 11:43:45.965 UTC [10091] LOG: restored log file "00000001000000A7000000BD" from archive 2023-04-09 11:43:47.774 UTC [10091] LOG: restored log file "00000001000000A7000000BE" from archive 2023-04-09 11:43:49.698 UTC [10091] LOG: restored log file "00000001000000A7000000BF" from archive 2023-04-09 11:43:50.194 UTC [10091] LOG: recovery stopping before commit of transaction 6522027, time 2023-04-08 16:00:00.391957+00 2023-04-09 11:43:50.194 UTC [10091] LOG: pausing at the end of recovery 2023-04-09 11:43:50.194 UTC [10091] HINT: Execute pg_wal_replay_resume() to promote.
Open the database
postgres@minetest:~/14/main$ psql psql (14.7 (Ubuntu 14.7-0ubuntu0.22.04.1)) Type "help" for help. postgres=# SELECT pg_wal_replay_resume(); pg_wal_replay_resume ---------------------- (1 row) postgres=# SELECT pg_current_wal_lsn(); pg_current_wal_lsn -------------------- A7/BFE63C48 (1 row) postgres=# \q postgres@minetest:~/14/main$ postgres@minetest:~/14/main$ ls -l recovery.signal ls: cannot access 'recovery.signal': No such file or directory
We can see the the process completely finished, and after resuming the pg_wal_replay
, the recovery.signal
file has been removed.
Conclusion
- PostgreSQL PITR is very simple, and can be triggered easily anytime you need it
pg_basebackup
is your friend- Always test recovery after configuring PITR. This will ensure you have the correct
restore_command
, and ensure you can quickly proceed to a PITR - It’s possible to use a dedicated recovery.conf and start the DB
2 responses to “Point-in-time recovery (PITR) of PostgreSQL database”
I followed these steps, but still getting below error
2023-12-19 11:03:53.488 UTC [13179] LOG: invalid checkpoint record
2023-12-19 11:03:53.488 UTC [13179] FATAL: could not locate required checkpoint record
2023-12-19 11:03:53.488 UTC [13179] HINT: If you are restoring from a backup, touch “/data/databases/pgsql14/main/recovery.signal” and add required recovery options.
If you are not restoring from a backup, try removing the file “/data/databases/pgsql14/main/backup_label”.
Be careful: removing “/data/databases/pgsql14/main/backup_label” will result in a corrupt cluster if restoring from a backup
Hi Jeevan,
Interesting case. Can you:
– ls -lh the recovery.signal file ?
– Did you try to restore to an earlier date (Check a couple of previous archived logs to pick a date) ? You might have to pg_resetwal, since the current WALs are of no use if you restore to an earlier date. Possibly choose a date before the last full backup (if you have the previous & WALs archives.)
– Anyway, make sure you have a file backup of the existing situation so you can always undo what you’re doing (tar of the data folder)
– You mentioned a corrupt cluster. Is the DB part of a cluster with replication ? How did you get there ? Unsafe power off ?
Keep us posted,
Keep me posted.