本文为墨天轮数据库管理服务团队原创内容,如需转载请联系小墨(VX:modb666)并注明来源。
作者:陈昱
墨天轮数据库管理服务团队技术顾问
19 年IT软件行业从业经验,超过15年以上数据库开发和维护经历,拥有丰富的软件开发及应用、数据库架构设计开发和维护项目经历。10g 和 11g OCM, 曾为多家金融、政府机构提供Oracle的现场支持和咨询服务。
适用范围
PG >= 14
问题概述
问题原因
1) 首先恢复操作系统中被 rm 的数据文件
2) 使用 PG 恢复工具抽取数据
这也就是 PG 恢复 drop 表的复杂度比 oracle 要复杂的原因之一,其次由于 PG 中支持的数据类型远超 oracle,那么不同数据类型的内部存储格式也不尽相同,所以解析底层存储抽取数据的复杂度也比 oracle 要复杂的另一个原因。
解决方案
postgres=# create table arraytab4 as select * from arraytab;
SELECT 9
postgres=# checkpoint;
CHECKPOINT
postgres=# select oid,relname,relfilenode,reltoastrelid from pg_class where relname='arraytab4';
oid | relname | relfilenode | reltoastrelid
--------+-----------+-------------+---------------
126878 | arraytab4 | 126878 | 126881
select oid,relname,relfilenode,reltoastrelid from pg_class where relname like '%pg_toast_126878%';
oid | relname | relfilenode | reltoastrelid
--------+-----------------------+-------------+---------------
126881 | pg_toast_126878 | 126881 | 0
126882 | pg_toast_126878_index | 126882 | 0
[postgres@rac1 5]$ ls -li 126878
69762045 -rw------- 1 postgres dba 8192 Dec 10 22:55 126878 --> inode 是 69762045
[postgres@rac1 5]$ ls -l 126881
-rw------- 1 postgres dba 0 Dec 10 22:55 126881
[postgres@rac1 5]$ ls -l 126882
-rw------- 1 postgres dba 8192 Dec 10 22:55 126882
2) 模拟删除数据
postgres=# drop table arraytab4;
DROP TABLE
arraytab4 = 0x61 0x72 0x72 0x61 0x79 0x74 0x61 0x62 0x34 = 617272617974616234
SELECT * FROM heap_page_items(get_raw_page('pg_class', 0 ));
SELECT * FROM heap_page_items(get_raw_page('pg_class', 1 ));
SELECT * FROM heap_page_items(get_raw_page('pg_class', 2 ));
SELECT * FROM heap_page_items(get_raw_page('pg_class', 3 ));
SELECT * FROM heap_page_items(get_raw_page('pg_class', 4 ));
SELECT * FROM heap_page_items(get_raw_page('pg_class', 5 ));
SELECT * FROM heap_page_items(get_raw_page('pg_class', 6 ));
SELECT * FROM heap_page_items(get_raw_page('pg_class', 7 ));
SELECT * FROM heap_page_items(get_raw_page('pg_class', 8 ));
SELECT * FROM heap_page_items(get_raw_page('pg_class', 9 ));
SELECT * FROM heap_page_items(get_raw_page('pg_class', 10));
SELECT * FROM heap_page_items(get_raw_page('pg_class', 11));
SELECT * FROM heap_page_items(get_raw_page('pg_class', 12));
SELECT * FROM heap_page_items(get_raw_page('pg_class', 13));
SELECT * FROM heap_page_items(get_raw_page('pg_class', 11)) where lp = 32;
过一会儿 pg_class 中就没有删除表的信息了,不仅仅上面的语句查不到了,数据块中也找不到了,列的信息在pg_attribute表中也没有了,但是在数据块中还有
3) 恢复OS已经删除的数据文件
cat /opt/data6666/log/postgresql_log.Tue.csv
2024-12-10 23:00:11.805 CST,,,76839,,674ea544.12c27,136,,2024-12-03 14:29:24 CST,,0,LOG,00000,"checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.309 s, sync=0.008 s, total=0.320 s; sync files=3, longest=0.008 s, average=0.003 s; distance=16 kB, estimate=1760 kB; 【lsn=0/54D5F2A8】, redo lsn=0/54D5F250",,,,,,,,,"","checkpointer",,0
...
2024-12-10 23:03:07.466 CST,"postgres","postgres",21875,"[local]",67585429.5573,28,"idle",2024-12-10 22:46:01 CST,17/25,0,LOG,00000,"statement: drop table arraytab4;",,,,,,,,,"psql","client backend",,0
2024-12-10 23:03:07.479 CST,"postgres","postgres",21875,"[local]",67585429.5573,29,"DROP TABLE",2024-12-10 22:46:01 CST,17/0,0,LOG,00000,"duration: 13.284 ms",,,,,,,,,"psql","client backend",,0
....
2024-12-10 23:05:11.908 CST,,,76839,,674ea544.12c27,137,,2024-12-03 14:29:24 CST,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,"","checkpointer",,0
2024-12-10 23:05:12.461 CST,,,76839,,674ea544.12c27,138,,2024-12-03 14:29:24 CST,,0,LOG,00000,"checkpoint complete: wrote 6 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.550 s, sync=0.001 s, total=0.554 s; sync files=6, longest=0.001 s, average=0.001 s; distance=34 kB, estimate=1587 kB; 【lsn=0/54D67D88】, redo lsn=0/54D67D30",,,,,,,,,"","checkpointer",,0
只能从日志中解析
postgres=# select pg_walfile_name('0/54D5F2A8');
pg_walfile_name
--------------------------
000000050000000000000054
(1 row)
postgres=# select pg_walfile_name('0/54D67D88');
pg_walfile_name
--------------------------
000000050000000000000054
[postgres@rac1 pg_wal]$ pg_waldump 000000050000000000000054 -s 0/54D5F2A8 -e 0/54D67D88 > /tmp/wal.log
[postgres@rac1 pg_wal]$ cat /tmp/wal.log | grep "rmgr: Transaction"
rmgr: Transaction len (rec/tot): 1093/ 1093, tx: 2951, lsn: 0/54D67838, prev 0/54D67800, desc: COMMIT 2024-12-10 23:03:07.475415 CST; rels: base/5/126878 base/5/126881 base/5/126882; dropped stats: 2/5/126882 2/5/126881 2/5/126878; inval msgs: catcache 32 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 80 catcache 79 catcache 80 catcache 79 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 relcache 126882 relcache 126881 snapshot 2608 relcache 126881 snapshot 2608 snapshot 2608 snapshot 2608 relcache 126878 snapshot 2608
注意这里
rels: base/5/126878 base/5/126881 base/5/126882; 分别对应于当前的 "表的relfilenode, toast的refilenode, toast index的relfilenode"
dropped stats: 2/5/126882 2/5/126881 2/5/126878 分别对应于truncate前的 "toast index的relfilenode, toast的refilenode, 表的relfilenode"
【注意】
尽管drop 表后的一段时间后, pg_class表中没有被删除表的任何信息(底层存储块中都没有了),但是 pg_attribute 的底层存储中还是有的
1) 此时搜索 617272617974616234 在pg_class中已经找不到
2) 从pg_attribute_tabs中还能找到
126878: [[126878, u'b1', 16, 1, 1, -1, -1, 0, 't', u'c', u'p', u'', 'f', 'f', 'f', u'', u'', 'f', 't', 0, 0, '', '', '', '', ''],
[126878, u'array1', 1007, -1, 2, -1, -1, 0, 'f', u'i', u'x', u'', 'f', 'f', 'f', u'', u'', 'f', 't', 0, 0, '', '', '', '', ''],
[126878, u'array2', 1009, -1, 3, -1, -1, 0, 'f', u'i', u'x', u'', 'f', 'f', 'f', u'', u'', 'f', 't', 0, 100, '', '', '', '', '']],
所以缺少的pg_class 信息我们可以从 wal 日志中得到
如果表存在 reltoastrelid 字段, 那么我们可以得到如下被删除表的信息
rmtab_info = [126878,126878,'arraytab4',126881] oid,relfilenode,arraytab4,reltoastrelid
rmtab_toast_info = [126881,126881,'pg_toast_126878',0] toast的相关信息
need recover relfilenode:126878
need recover toast filenode:126881
# 为了方便使用命令生成需要恢复的文件
cat /tmp/wal.log | grep "rmgr: Transaction"|awk -F "rels:|dropped stats:|inval msgs:" '{print $2,$3}'|awk -F" |/|;" '{print $4,$7,$10,$16,$19,$22}'|awk '{print "rmtab_info = [" $6 "," $1 "," "\47tabname\47," $5 "]" "\n" "rmtab_toast_info = [" $5 "," $2 "," "\47pg_toast_"$6 "\47," "0]\n" "need recover relfilenode:" $1 "\n" "need recover toast filenode:" $2}'
rmtab_info = [126878,126878,'tabname',126881] --|
rmtab_toast_info = [126881,126881,'pg_toast_126878',0] --| 复制到 config.txt中
need recover relfilenode:126878 --|
need recover toast filenode:126881 --| 需要使用os恢复的文件
将上面生成的 rmtab_info 和 rmtab_toast_info 复制到 config.txt 配置文件中,并修改 tabname 为 arraytab4
rmtab_info = [126878,126878,'arraytab4',126881]
rmtab_toast_info = [126881,126881,'pg_toast_126878',0]
# 获得表名对应 inode
-- 先根据文件名获得16进制,方便下一步查询
126878 字符串转16进制 = 0x31 0x32 0x36 0x38 0x37 0x38 = 31 3632 3738 38
ls -lid /opt/data6666/base/5
69762042 drwx------ 2 postgres dba 12288 Dec 9 10:24 /opt/data6666/base/5
xfs_db -r /dev/mapper/rhel-root
xfs_db> inode 69762042
xfs_db> dblock 0
xfs_db> addr
current
byte offset 10240110592, length 4096
buffer block 20000216 (fsbno 4360123), 8 bbs
inode 69762042, dir inode 69762042, type dir2
dd if=/dev/mapper/rhel-root bs=1 skip=10240110592 count=4096 | od -v -x | grep -B1 "31.* 3632 3738 .*38"
0003100 3705 3636 3538 3006 ffff 4800 2804 fd7b
0003120 3106 3632 3738 e838 0000 0000 3904 4806
ffff 4800 2804 fd7b --> ffff 是删除标记 2804 fd7b 是INODE = 0428 7bfb = 69762043 --> 找到INODE
# 根据 INODE 获得删除的文件
[root@rac1 5]# xfs_db -r -c "inode 69762043" -c "addr" /dev/mapper/rhel-root
current
byte offset 10240129792, length 256
buffer block 20000240 (fsbno 4360126), 16 bbs
inode 69762043, dir inode -1, type inode
dd if=/dev/mapper/rhel-root bs=1 skip=10240129792 count=256 | od -v -x
4e49 ed81 0202 0000 0000 0000 0000 0000
0000 0100 0000 0000 0000 0000 0000 8d00
5667 8f77 2b2c 03f4 5667 6377 960c 8957
5667 6377 960c 8957 0000 0000 0000 0020
0000 0000 0000 0200 0000 0000 0000 0100
0000 0200 0000 0000 0000 0000 ae31 075e
ffff ffff
0000 0000 0000 0000 0000 5008 c0e6 0200 --|
0000 0000 0000 0040 0000 1c09 c0d3 0200 | extent 信息
0000 0000 0000 0044 0000 1c09 c0cd 0200 --|
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 5008 c0e6 0200 反过来 = 00 00 00 00 00 00 00 00 00 00 08 50 e6 c0 00 02 =
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00001000 01010000 11100110 11000000 00000000 00000010 =
0 000000000000000000000000000000000000000000000000000000 0000000000000000000000000000010000101000011100110110 000000000000000000010 =
0 0 4359990 2
[startoff,startblock,blockcount,extentflag]
0:[0,4359990,2,0]
0000 0000 0000 0040 0000 1c09 c0d3 0200 反过来 = 00 00 00 00 00 00 40 00 00 00 09 1c d3 c0 00 02 =
00000000 00000000 00000000 00000000 00000000 00000000 01000000 00000000 00000000 00000000 00001001 00011100 11010011 11000000 00000000 00000010 =
0 000000000000000000000000000000000000000000000000100000 0000000000000000000000000000010010001110011010011110 000000000000000000010 =
0 32 4777630 2
1:[32,4777630,2,0] # 上一个 startoff + 上一个 blockcount = 0 + 2 = 2, 但是这里是32,舍弃
1) 首先计算文件在那个AG (根据INODE 69762043 计算)
# 首先要得到INODE在整个文件系统中的偏移量, 先假设 69762043 在第1个AG中(sb 0)
--Low bits 是block里存放的第几个inode
xfs_db -r -c "sb 0" -c "p inopblog" /dev/mapper/rhel-root
inopblog = 4 --> inode num 转成2进制,从后往前取 4 位
--Middle bits 在AG的哪个block
xfs_db -r -c "sb 0" -c "p agblklog" /dev/mapper/rhel-root
agblklog = 22 --> inode num 转成2进制,从后往前取 4 位后,在往前取 22 位
--High bits 在哪个AG
--> inode num 转成2进制,除去低位的 4+22=26 位后,剩下的就是AG
--假设在 sb 0 AG中 High bit(ag_number) Middle bits(block_in_ag) Low Bits(inode_in_blk)
69762043 = 1 0000101000011110111111 1011
= 10进制 = 1 165823 11
可以看到假设错误,因为假设是在第一个AG 即 AG=0中, 但是计算出来的AG=1, 即第2个AG中,那么按照AG=1 重新计算
xfs_db -r -c "sb 1" -c "p inopblog" /dev/mapper/rhel-root
inopblog = 4 --> inode num 转成2进制,从后往前取 4 位
--Middle bits 在AG的哪个block
xfs_db -r -c "sb 1" -c "p agblklog" /dev/mapper/rhel-root
agblklog = 22 --> inode num 转成2进制,从后往前取 4 位后,在往前取 22 位
--High bits 在哪个AG
--> inode num 转成2进制,除去低位的 4+22=26 位后,剩下的就是AG
-文件在sb 1 AG中 High bit(ag_number) Middle bits(block_in_ag) Low Bits(inode_in_blk)
69762043 = 1 0000101000011110111111 1011
= 10进制 = 1 165823 11
2) 计算 dd 命令的 skip 参数值
--sb 0 相关参数
--1个Inode 大小
xfs_db -r -c "sb 1" -c "p inodesize" /dev/mapper/rhel-root
inodesize = 256
xfs_db -r -c "sb 3" -c "p agblocks" /dev/mapper/rhel-root
agblocks = 2334208
xfs_db -r -c "sb 3" -c "p blocksize" /dev/mapper/rhel-root
blocksize = 4096
xfs_db -r -c "sb 3" -c "p agcount" /dev/mapper/rhel-root
agcount = 4
#每个block 可以容纳多少个inode
inode_nums_per_blk = blocksize/inodesize = 4096/256 = 16
--skip 值的计算方法如下:
0:[0,4359990,2,0]
4359990 = 10000101000011100110110 = 1 0000101000011100110110(22 位,取 agblklog 的长度) = 1 165686
skip = (agblocks*blocksize*ag号 + 块号*blocksize)/blocksize, 如第1行 data offset 0 startblock 13178993 (3/596081) count 65520 flag 0 的 skip 计算方法如下:
skip = (agblocks*blocksize*ag号 + 块号*blocksize)/blocksize = (2334208*4096*1 + 165686*4096)/4096= 2499894
--count 值的计算方法如下:
count=count 即 bmap 中的 count 值
--seek 值的计算方法如下:
seek=offset 即 bmap 中的 offset 值
如上面 bmap 转化为下面命令:
--第1行 skip = (2334208*4096*1 + 165686*4096)/4096= 2499894
dd if=/dev/mapper/rhel-root bs=4096 skip=2499894 count=2 of=/mnt/hgfs/shared/126878
[root@rac1 ~]# ls -l /mnt/hgfs/shared/126878
-rwxrwxrwx 1 root root 8192 Dec 11 13:31 /mnt/hgfs/shared/126878
[root@rac1 ~]# md5sum /mnt/hgfs/shared/126878
98a0917c001f5946fe890faa794ba6fb /mnt/hgfs/shared/126878
4) 使用恢复工具抽取数据
cp /mnt/hgfs/shared/126878 /opt/data6666/base/5
[root@rac1 ~]# ls -ltr /opt/data6666/base/5/126878
-rwxr-xr-x 1 root root 8192 Dec 11 13:32 /opt/data6666/base/5/126878 <-----
# 使用恢复工具进行数据抽取
1) 配置参数文件
[postgres@rac1 dist]$ cat config.txt
db_path = /opt/data6666/base/5
page_size = 8192
pg_class_oid = 1259
pg_attribute_oid = 1249
rows_perfile = 10000
output_directory = /mnt/hgfs/shared/zodu_pg/zodu_pg_single/db_recovery
largedata_directory = /mnt/hgfs/shared/zodu_pg/zodu_pg_single/db_recovery/data
rmtab_info = [126878,126878,'arraytab4',126881]
rmtab_toast_info = [126881,126881,'pg_toast_126878',0]
2) 开始抽取数据
[root@rac1 dist]# ./zodupg arraytab_bak
Toast 126297 may be already removed.
('filepath', '/opt/data6666/base/5/126294')
5) 查看抽取结果
验证抽取结果
[root@rac1 dist]# cd ..
[root@rac1 zodu_pg_single]# cd db_recovery/
[root@rac1 db_recovery]# ls -ltr
total 5
drwxrwxrwx 1 root root 0 Dec 4 18:01 data
-rwxrwxrwx 1 root root 72 Dec 11 13:34 create_table.sql
-rwxrwxrwx 1 root root 106 Dec 11 13:34 impdp_table.sql
-rwxrwxrwx 1 root root 722 Dec 11 13:34 arraytab4.csv
[root@rac1 db_recovery]# cat arraytab4.csv
b1,array1,array2
t,{10001},{meeting}
t,{10001,10002},{meeting,lunch}
t,{10001,10002,10003,10004},{{meeting,lunch},{training,presentation}}
t,{{{101},{102}},{{103},{104}},{{105},{106}}},{{{aaa},{bbb}},{{ccc},{ddd}},{{eee},{fff}}}
t,{{{101,102},{103,104}},{{105,106},{107,108}},{{109,110},{111,112}}},{{{aaa,bbb},{ccc,ddd}},{{eee,fff},{ggg,hhh}},{{iii,jjj},{kkk,lll}}}
t,{10001,10002,10003,10004},{{{aaa},{bbb}},{{ccc},{ddd}}}
t,{10001,10002,10003,10004},{{{aaa,bbb,ccc},{ddd,eee,fff},{ggg,hhh,iii}},{{jjj,kkk,lll},{mmm,nnn,ooo},{ppp,qqq,rrr}},{{sss,ttt,uuu},{vvv,www,xxx},{yyy,zzz,aaa}}}
t,{10001,10002,10003,10004},{{meeting},{training}}
t,{{10001,10002,1003},{10004,10005,1006},{10007,10008,1009}},{{meeting},{training}}
[root@rac1 db_recovery]# cat create_table.sql
CREATE TABLE arraytab4(
b1 boolean,
array1 integer[],
array2 text[]
);
[root@rac1 db_recovery]# cat impdp_table.sql
COPY arraytab4 FROM '/mnt/hgfs/shared/zodu_pg/zodu_pg_single/db_recovery/arraytab4.csv' WITH csv header;
从恢复的过程中,我们可以发现整个恢复的入口在于获取到文件名的inode号,而获取inode号的条件是先找到 drop 表对应的 relfilenode,好在这个信息在wal日志中有记录,否则恢复难度就会大大增加。此外误删除数据库不能再有数据的写入,否则如果数据被覆盖,则无法再进行数据的恢复。
THE END

墨天轮从乐知乐享的数据库技术社区蓄势出发,全面升级,提供多类型数据库管理服务。墨天轮数据库管理服务旨在为用户构建信赖可托付的数据库环境,并为数据库厂商提供中立的生态支持。
服务官网:https://www.modb.pro/service

点击进入作者服务团队主页
本文分享自微信公众号 - 墨天轮(enmocs)。
如有侵权,请联系 support@oschina.cn 删除。
本文参与“OSC源创计划”,欢迎正在阅读的你也加入,一起分享。