
The Role and Use of pg_rewind in Streaming Replication

news 2025/10/15 6:15:32

Contents

Environment
Purpose
Details

Environment

Platform: Linux x86-64, Red Hat Enterprise Linux 7
Version: 4.5.8

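Purpose

This document describes what pg_rewind does in streaming replication, how it works, what its advantages are, and how to apply it in a streaming replication setup.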

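Details

Introduction to pg_rewind

What it does

pg_rewind is a tool for synchronizing a HighGo cluster with another copy of the same cluster after the two clusters' timelines have diverged.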

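Advantages

Compared with taking a new base backup or using a tool such as rsync, pg_rewind does not need to read through the unchanged blocks in the cluster. Only blocks that have changed are copied; all other files, including configuration files, are copied in full. This makes it fast when the database is large and only a small fraction of blocks differ between the two clusters.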

Prerequisites

wal_log_hints = on # also do full page writes of non-critical updates
(alternatively, enable data checksums when initializing the cluster with initdb)

full_page_writes = on # recover from partial page writes
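
The two prerequisites combine as "full_page_writes is on, and either wal_log_hints or data checksums is on". As a minimal sketch of that rule, the function below takes the three values as arguments and reports whether they satisfy it. The function name rewind_prereqs_ok is hypothetical and is not part of PostgreSQL or HighGo.

```shell
# Hypothetical helper (not shipped with PostgreSQL or HighGo).
# Arguments: wal_log_hints, data_checksums, full_page_writes ("on" or "off").
# Returns 0 if the combination satisfies the pg_rewind prerequisites.
rewind_prereqs_ok() {
    wal_log_hints=$1
    data_checksums=$2
    full_page_writes=$3

    # full_page_writes is required in every case.
    [ "$full_page_writes" = "on" ] || return 1

    # Either hint-bit logging or data checksums is sufficient.
    [ "$wal_log_hints" = "on" ] || [ "$data_checksums" = "on" ]
}
```

On a running server the three values could be read with psql -p 5866 -Atc 'SHOW wal_log_hints', and likewise for data_checksums and full_page_writes, assuming the port used elsewhere in this document.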

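How it works

pg_rewind first scans the WAL of the target cluster, starting from the last checkpoint before the point at which the source cluster's timeline history forked off from the target, and notes every data block touched by each WAL record. This yields the list of all blocks that were changed in the target after the source branched off. It then copies all those changed blocks from the source to the target, using either direct file system access (--source-pgdata) or a database connection (--source-server). Next, it copies everything except the relation files from the source to the target. Finally, it creates a backup label file so that HighGo replays all WAL forward from that checkpoint when the target is started.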

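How the point of divergence is found

pg_rewind examines the timeline histories of the source and the target to determine where they diverged, and expects to find WAL in the target's pg_wal directory reaching all the way back to that point. The point of divergence may lie on the target timeline, the source timeline, or their common ancestor. If the target kept running for a long time after the divergence, the old WAL files may no longer be present. In that case they can be copied manually from the WAL archive into pg_wal, or fetched at startup by configuring primary_conninfo or restore_command.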
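
To check whether the WAL that pg_rewind needs is still in pg_wal, it helps to know which segment file holds a given WAL location. The sketch below derives the file name from a timeline ID and an LSN. It assumes the default 16 MB WAL segment size, and the function name wal_segment_name is hypothetical.

```shell
# Hypothetical helper: prints the name of the WAL segment file that holds
# the given LSN on the given timeline.
# Assumption: the cluster uses the default 16 MB (16777216-byte) segment size.
wal_segment_name() {
    tli=$1              # timeline ID in decimal, e.g. 5
    lsn=$2              # LSN in the usual hi/lo hex form, e.g. 0/10021400
    hi=${lsn%%/*}       # hex digits before the slash
    lo=${lsn##*/}       # hex digits after the slash
    seg=$(( 0x$lo / 16777216 ))
    # File name layout: 8 hex digits each for timeline, high LSN word, segment.
    printf '%08X%08X%08X\n' "$tli" "$(( 0x$hi ))" "$seg"
}
```

For example, wal_segment_name 5 0/10000650 prints 000000050000000000000010, so a file of that name would need to exist in the target's pg_wal directory (or be restored from the archive) for a rewind that starts at that location on timeline 5.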

Using pg_rewind in streaming replication

Create the new primary. Initialize the database as follows:

initdb -D /data/highgo/data

After initialization, edit the client authentication file to allow streaming replication connections by adding the following lines:

vim pg_hba.conf
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    replication     all             0.0.0.0/0               md5

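Add the following parameters to the primary's configuration file. full_page_writes and wal_log_hints must be enabled; otherwise pg_rewind cannot be used to restore streaming replication after a failure. wal_level must be replica or higher so that WAL can be shipped to the standby. archive_mode and archive_command enable WAL archiving, which is used for recovery from the archive. When streaming replication runs without a replication slot, it is advisable to set wal_keep_segments generously, so that the primary does not remove WAL that the standby still needs, which would force the standby to be rebuilt.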

vim postgresql.auto.conf
wal_level = replica
archive_mode = on
archive_command = 'cp %p /data/hgdbbak/archive/%f'
full_page_writes = on
wal_log_hints = on
max_wal_senders = 10
wal_keep_segments = 50
hot_standby = on

Start the primary with pg_ctl, the start command shipped with the database:

pg_ctl start -D /data/highgo/data

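On the standby, use pg_basebackup to copy the primary's data directory together with the WAL under pg_wal, then start the standby. The -R option creates a standby.signal file, which marks the server as a standby, and adds the primary_conninfo parameter to postgresql.auto.conf.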

pg_basebackup -h x.x.150.161 -p 5866 -U repl -Fp -Xs -R -D /data/highgo/data
pg_ctl -D /data/highgo/data/ start
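
Before starting the standby, it can be worth confirming that -R left both pieces in place: the standby.signal file and a primary_conninfo entry in postgresql.auto.conf. The following is a small sketch of such a check; the function name standby_configured is hypothetical.

```shell
# Hypothetical helper: succeeds if the data directory contains standby.signal
# and postgresql.auto.conf has a primary_conninfo setting.
standby_configured() {
    datadir=$1

    # standby.signal marks the server as a standby.
    [ -f "$datadir/standby.signal" ] || return 1

    # primary_conninfo tells the standby where to stream WAL from.
    grep -q '^[[:space:]]*primary_conninfo[[:space:]]*=' \
        "$datadir/postgresql.auto.conf" 2>/dev/null
}
```

With the paths used in this document, the call would be standby_configured /data/highgo/data, and a non-zero exit status indicates that one of the two pieces is missing.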

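Check the replication status on the primary. Streaming replication has been established (state is streaming) and is asynchronous (sync_state is async).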

postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 482735
usesysid         | 10
usename          | repl
application_name | walreceiver
client_addr      | x.x.150.161
client_hostname  |
client_port      | 42286
backend_start    | 2023-10-27 21:33:59.122261+08
backend_xmin     |
state            | streaming
sent_lsn         | 0/10000448
write_lsn        | 0/10000448
flush_lsn        | 0/10000448
replay_lsn       | 0/10000448
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 0
sync_state       | async
reply_time       | 2023-10-27 21:52:41.704105+08
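
The LSN columns in this view can be compared to see how far the standby is behind. As an illustrative sketch, the functions below convert an LSN to a byte position and subtract two of them. The names lsn_to_bytes and lsn_lag_bytes are hypothetical; inside the database the built-in pg_wal_lsn_diff function gives the same difference.

```shell
# Hypothetical helper: converts an LSN such as 0/10000448 to an absolute
# byte position (high word * 2^32 + low word).
lsn_to_bytes() {
    lsn=$1
    hi=${lsn%%/*}
    lo=${lsn##*/}
    echo $(( (0x$hi << 32) + 0x$lo ))
}

# Hypothetical helper: prints how many bytes the second LSN is behind the
# first, for example lsn_lag_bytes "$sent_lsn" "$replay_lsn".
lsn_lag_bytes() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}
```

In the output above sent_lsn and replay_lsn are both 0/10000448, so lsn_lag_bytes 0/10000448 0/10000448 prints 0: the standby has replayed everything the primary has sent.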

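Simulate a switchover by running pg_ctl promote on the standby. Promotion moves the standby to a new timeline, so the timelines of the two servers no longer match.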

[postgres@pg03 data]$ pg_ctl promote -D /data/highgo/data
waiting for server to promote....
2023-10-29 09:38:32.722 CST [779689] LOG:  received promote request
2023-10-29 09:38:32.722 CST [779693] FATAL:  terminating walreceiver process due to administrator command
2023-10-29 09:38:32.722 CST [779689] LOG:  invalid record length at 0/10021400: wanted 24, got 0
2023-10-29 09:38:32.722 CST [779689] LOG:  redo done at 0/100213C8
2023-10-29 09:38:32.722 CST [779689] LOG:  last completed transaction was at log time 2023-10-29 09:34:36.921164+08
2023-10-29 09:38:32.723 CST [779689] LOG:  selected new timeline ID: 6
2023-10-29 09:38:32.761 CST [779689] LOG:  archive recovery complete
2023-10-29 09:38:32.765 CST [779688] LOG:  database system is ready to accept connections
done
server promoted
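
When a server is promoted, it records the switch in a timeline history file in pg_wal (here that would be 00000006.history), with one tab-separated line per ancestor timeline: parent timeline ID, switch LSN, and reason. As a sketch, the function below looks up where a given parent timeline was left. The function name timeline_switch_lsn is hypothetical.

```shell
# Hypothetical helper: prints the LSN at which the given parent timeline
# was left, according to a timeline history file.
# Each data line has the fields: parent TLI, switch LSN, reason.
timeline_switch_lsn() {
    history_file=$1
    parent_tli=$2
    awk -v tli="$parent_tli" '$1 == tli { print $2 }' "$history_file"
}
```

Running timeline_switch_lsn /data/highgo/data/pg_wal/00000006.history 5 on the promoted server should print the location at which timeline 5 ended, which is the point pg_rewind later reports as the divergence.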

The former standby is now the new primary and accepts read and write operations:

postgres=# select pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 f
(1 row)

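The old primary must be stopped first. pg_rewind requires the target to be shut down cleanly and exits with an error if the old primary is still running.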

[postgres@pg01 data]$ pg_ctl -D /data/highgo/data stop
waiting for server to shut down....
2023-10-29 09:42:30.879 CST [99487] LOG:  received fast shutdown request
2023-10-29 09:42:30.881 CST [99487] LOG:  aborting any active transactions
2023-10-29 09:42:30.881 CST [99487] LOG:  background worker "logical replication launcher" (PID 103645) exited with exit code 1
2023-10-29 09:42:30.882 CST [99489] LOG:  shutting down
2023-10-29 09:42:31.000 CST [99487] LOG:  database system is shut down
done
server stopped
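
Whether the old primary really stopped cleanly can be confirmed from the "Database cluster state" line printed by pg_controldata. The sketch below reads that output on standard input. The function names are hypothetical, and the parsing assumes English-language messages.

```shell
# Hypothetical helper: extracts the cluster state from pg_controldata output
# read on stdin. Assumption: the messages are in English.
cluster_state() {
    sed -n 's/^Database cluster state:[[:space:]]*//p'
}

# Hypothetical helper: succeeds only for the two clean-shutdown states.
target_cleanly_stopped() {
    case "$(cluster_state)" in
        "shut down"|"shut down in recovery") return 0 ;;
        *) return 1 ;;
    esac
}
```

A typical use would be pg_controldata /data/highgo/data | target_cleanly_stopped, running pg_rewind only when that pipeline succeeds.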

Run pg_rewind on the old primary to resynchronize it with the new primary, so that both servers are on the same timeline:

[postgres@pg01 data]$ pg_rewind -D /data/highgo/data --source-server='host=x.x.150.163 port=5866 user=postgres password=123456' -P
pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 0/10021400 on timeline 5
pg_rewind: rewinding from last common checkpoint at 0/10000650 on timeline 5
pg_rewind: reading source file list
pg_rewind: reading target file list
pg_rewind: reading WAL in target
pg_rewind: need to copy 132 MB (total source directory size is 151 MB)
135219/135219 kB (100%) copied
pg_rewind: creating backup label and updating control file
pg_rewind: syncing target data directory
pg_rewind: Done!

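Edit the primary_conninfo parameter in the old primary's configuration file so that its IP address points to the new primary. In PostgreSQL 12, on which this release is based, the server enters standby mode only if a standby.signal file exists in the data directory, so make sure that file is present before starting the old primary.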

vi postgresql.auto.conf
primary_conninfo = 'user=postgres passfile=''/home/postgres/.pgpass'' host=x.x.150.163 port=5866 sslmode=disable sslcompression=0 gssencmode=disable krbsrvname=postgres target_session_attrs=any'

Now start the old primary:

[postgres@pg01 data]$ pg_ctl -D /data/highgo/data start
waiting for server to start....
2023-10-29 09:45:46.563 CST [105924] LOG:  starting PostgreSQL 12.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36), 64-bit
2023-10-29 09:45:46.563 CST [105924] LOG:  listening on IPv4 address "0.0.0.0", port 5866
2023-10-29 09:45:46.563 CST [105924] LOG:  listening on IPv6 address "::", port 5866
2023-10-29 09:45:46.564 CST [105924] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5866"
2023-10-29 09:45:46.573 CST [105925] LOG:  database system was interrupted while in recovery at log time 2023-10-29 09:38:32 CST
2023-10-29 09:45:46.573 CST [105925] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2023-10-29 09:45:46.579 CST [105925] LOG:  entering standby mode
2023-10-29 09:45:46.580 CST [105925] LOG:  redo starts at 0/10000618
2023-10-29 09:45:46.582 CST [105925] LOG:  consistent recovery state reached at 0/1003EC08
2023-10-29 09:45:46.582 CST [105925] LOG:  invalid record length at 0/1003EC08: wanted 24, got 0
2023-10-29 09:45:46.582 CST [105924] LOG:  database system is ready to accept read only connections
2023-10-29 09:45:46.587 CST [105929] LOG:  started streaming WAL from primary at 0/10000000 on timeline 6
done
server started

Use the pg_is_in_recovery function to verify that the old primary is now a standby:

postgres=# select pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 t
(1 row)