文档章节

postgres 物理格式

MtrS
 MtrS
发布于 2017/01/16 11:41
字数 2695
阅读 75
收藏 0

Introduction to PostgreSQL physical storage November 9, 2015

I initially wrote this post for myself, I wanted to understand enough about the underlying physical storage to be able to grasp other concepts which can affect your database. This post is probably more aimed to people with similar interests, I will try to keep it accurate but the aim is to make it simpler to understand so I may overlook some details.

Should you know about how Postgres handle the physical storage? Maybe not, but knowing how it works can be useful if you need to investigate some problems.

PostgreSQL has a chapter on the subject but personally I had to dive in extra resources to get my head around it. I found difficult to navigate through multiple parts of the documentation and external resources, so this is my attempt to gather the essential in a single post. In this post, I didn't bother to rewrite everything so in some place I reused paragraphs from the official documentation.

At the time of this article, the current version of Postgres is 9.5, it's possible that the information in this post to be voided in a future release.

To illustrate the examples of this post, we will be assuming that we have a database called foo which contain a table called bar. A bit of terminology

Once you dive into the documentation about storage then you start to hit some less common terms like relations, tuple, heap, block, page... Let me try to demystify what they mean:

a tuple or an item is a synonym for a row
a relation is a synonym for a table
a filenode is an id which represent a reference to a table or an index.
a block and page are equals and they represent a 8kb segment information the file storing the table.
a heap refer to heap file. Heap files are lists of unordered records of variable size. Although sharing a similar name, heap files are different from heap data structure.
CTID represent the physical location of the row version within its table. CTID is also a special column available for every tables but not visible unless specifically mentioned. It consists of a page number and the index of an item identifier.
OID stands for Object Identifier.
database cluster, we call a database cluster the storage area on disk. A database cluster is a collection of databases that is managed by a single instance of a running database server.
VACCUM, PostgreSQL databases require periodic maintenance known as vacuuming

Don't get surprised if you hit this terms in the rest of this post. In few places, I simplified the content when it was possible. Where is my database stored

The data files used by a database cluster are stored together within the cluster's data directory, commonly referred to as PGDATA (after the name of the environment variable that can be used to define it). PGDATA can have different location base on your OS or installation.

For each database in the cluster, there is a subdirectory within PGDATA/base, named after the database's OID in pg_database. This subdirectory is the default location for the database's files.

You can run the following query if you want to list the OID for each database in your cluster.

select oid, datname from pg_database;

oid | datname -------+------------- 1 | template1 12398 | template0 12403 | postgres 17447 | foo

If you are looking for the OID of a specific database then you add a WHERE clause to the query:

select oid, datname from pg_database WHERE datname = 'foo';

The other way to find the OID of a database is using the command line oid2name. It's a utility program to examine the file structure used by PostgreSQL for more information on how to use it check the documentation

$> oid2name All databases: Oid Database Name Tablespace

17447 foo pg_default 12398 template0 pg_default 1 template1 pg_default

The OID's above may be completely different in your system, but now we know that all the files concerning the database foo are stored in PGDATA/base/17447. If you ask yourself in which database, tables as pg_database are stored then they are not in PGDATA/base but in PGDATA/global.

Now, we know where to find the database in our file system so let try to figure out where to find our table. Where tables are stored

Each table is stored in a separate file. For ordinary relations, these files are named after the table or index's filenode number, which can be found in the column relfilenode of the table pg_class. pg_class is system table which exist in the pg_catalog schema.

When a table exceeds 1 GB, it is divided into gigabyte-sized segments. The first segment's file name is the same as the filenode; subsequent segments are named filenode.1, filenode.2, etc. This arrangement avoids problems on platforms that have file size limitations.

A table's filenode often matches its OID, this is not necessarily the case. To find table filenode path, you can run the following query when you connected to your database:

select pg_relation_filepath('bar'); pg_relation_filepath

base/17447/27741

If you don't want the path but only the filenode then you can use the function pg_relation_filenode with the same argument. As for the database OID, you can find the table filenode by using oid2name

oid2name -d foo -t bar

From database "foo": Filenode Table Name

 27741         bar

The System tables

In the previous section we talk a bit about system tables, it's interesting to know that there is a system catalog schema which contains tables that you may want to query to find extra information.

When you create a database, in addition to public and user-created schemas, each database contains a pg_catalog schema, which contains the system tables and all the built-in data types, functions, and operators. pg_catalog is always effectively part of the search path, so you don't need to use the prefix when you query the system tables.

Once connected to your database via psql, you can list the list the systems tables via:

\dt pg_catalog.*

How the rows are stored

Every table stored as an array of pages of a fixed size (usually 8Kb). In a table, all the pages are logically equivalent, so a particular item (row) can be stored in any page.

The structure used to store the table is a heap file. Heap files are lists of unordered records of variable size. The heap file is structured as a collection of pages (or block), each containing a collection of items. The term item refers to a row that is stored on a page.

A page structure looks like the following:

Page structure

It contains some headers which we are not going to cover, but they provide info about checksum, start of free space, end of free space, ... Items after the headers is an array identifier composed of (offset, length) pairs pointing to the actual items.

Because an item identifier is never moved until it is freed, its index can be used on a long-term basis to reference an item, even when the item itself is moved around on the page to compact free space. A Pointer to an item is called CTID (ItemPointer), created by PostgreSQL, it consists of a page number and the index of an item identifier.

The items themselves are stored in space allocated backwards from the end of unallocated space. To summarize, inside a page the pointers to the row are stored at the starts and the tuples (rows) are stored at the end of the page.

You can access the CTID of a row by adding ctid in the selected columns:

SELECT ctid, * from bar;

What are the limitations

According to About PostgreSQL, the max number of column is between 250 and 1600 depending on column types. The column types affect it because in PostgreSQL rows may be at most 8kb (one page) wide, they cannot span pages.

It doesn't mean that value of a column is limited to 8kb. Having a big value in columns is possible depending on the type because postgreSQL has a mechanism called TOAST which can handle that. There's still limit to how many columns you can fit in, that depends on how wide the data types used are. Even pointer to TOAST attribute still requires some bytes. The Oversized-Attribute Storage Technique

In the previous section, we've said that some column can go beyond the size of a page if their types are TOAST-able. Let's do an overview of TOAST. First, TOAST stands for The Oversized-Attribute Storage Technique, probably the best acronym in the history of Computer Science.

PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. When a row is attempted to be stored that exceeds this size, TOAST basically breaks up the data of large columns into smaller "pieces" and stores them into a TOAST table. Each table you create has its own associated (unique) TOAST table, which may or may not ever end up being used, depending on the size of rows you insert. All of this is transparent to the user and enabled by default. The mechanism is accomplished by splitting up the large column entry into 2KB bytes and storing them as chunks in the TOAST tables. It then stores the length and a pointer to the TOAST entry back where the column is normally stored. Because of how the pointer system is implemented, most TOAST'able column types are limited to a max size of 1GB.

TOAST has a number of advantages compared to a more straightforward approach such as allowing row values to span pages. The big values of TOASTed attributes will only be pulled out (if selected at all) at the time the result set is sent to the client. The table itself will much smaller and more of its rows fit in the shared buffer cache than would be the case without any out-of-line storage (TOAST). It's also more likely that the sort sets get smaller which imply having sorts being done entirely in memory.

In our example, the table bar with the OID of 27741 will have the TOAST table pg_toast_27741. There is more to be said about TOAST like that it also provide compression, but I keep it to the essential. TOAST make it transparent to the user but keep in mind that TOAST-able column have a limit 1GB. You can store a lot of data in 1GB and it can be useful to denormalize data in an ARRAY type or store documents in JSONB type. Free Space map

Each table has a free space map, which stores information about free space available in the relation. The free space map is stored in a file named with the filenode number plus the suffix _fsm.

The values stored in the free space map are not exact. They're rounded to a precision of 1/256th of page size (32 bytes with default 8kb), and they're not kept fully up-to-date as tuples are inserted and updated. Some operations update the freespace map as deleting all the rows, but the VACCUM will update the freespace map for the future row being able to know which page they can fit.

The freespace map is updated by running VACUUM because an UPDATE or DELETE of a row does not immediately remove the old version of the row. This approach is necessary to gain the benefits of multiversion concurrency control (MVCC): the row version must not be deleted while it is still potentially visible to other transactions. But eventually, an outdated or deleted row version is no longer of interest to any transaction. The space it occupies must then be reclaimed for reuse by new rows, to avoid unbounded growth of disk space requirements.

The standard form of VACUUM removes dead row versions in tables and marks the space available for future reuse. However, it will not return the space to the operating system, except in the special case where one or more pages at the end of a table become entirely free and an exclusive table lock can be obtained. It doesn't mean that the free space inside a page is fragmented, VACUUM rewrites the entire block, efficiently packing the remaining rows and leaving a single contiguous block of free space in a page.

In contrast, VACUUM FULL actively compacts tables by writing a complete new version of the table file with no dead space. This minimizes the size of the table, but can take a long time. It also requires extra disk space for the new copy of the table, until the operation completes. The goal of routine VACUUM is to avoid needing VACUUM FULL. The idea is not to keep tables at their minimum size, but to maintain steady-state usage of disk space: each table occupies space equivalent to its minimum size plus how much space gets used up between vacuumings.

In Past, when VACUUM ran, it had to look at every tuple in a table, because there was no information about which pages may not have been updated since the last VACUUM. PostgreSQL introduced the visibility map, VACUUM is now be able to perform partial scans of table data, skipping pages which are marked as fully visible. Partial scans mean fewer I/O operations for VACUUM.

You can query the frees pace available by pages in your table via following query:

CREATE EXTENSION pg_freespace; SELECT * FROM pg_freespace('bar'::regclass);

Visibility map

Each table has also a Visibility Map (VM) to keep track of which pages contain only tuples that are known to be visible to all active transactions. In others words, to track which pages are known to have no dead row. The Visibility map is stored alongside the table file in a separate file, named after the filenode number of the relation, plus a _vm suffix. For example, the filenode of a relation bar is 27741, the VM is stored in a file called 12345_vm, in the same directory as the main relation file.

We've seen in the previous section, the dead rows are due to the MVCC mechanism.

The visibility map simply stores one bit per heap page. A set bit means that all tuples on the page are known to be visible to all transactions. This means that the page does not contain any tuples that need to be vacuumed. The map is conservative in the sense that we make sure that whenever a bit is set, we know the condition is true, but if a bit is not set, it might or might not be true. Visibility map bits are only set by vacuum but are cleared by any data-modifying operations on a page. What about Indexes

Indexes are also stored as files too and in the same directory as the tables. Indexes are an other story, I didn't cover them as it will make this post much longer. Even if indexes also have pages, they follow a different structures as they are b-tree's (by default). Want to work on challenging projects using Postgres?

If you are interested in working with Postgres on some interesting projects, come join me at Opendoor. We heavily use PostgreSQL for Data Processing, Web Development, Data Warehousing, ETLs, Data Science, and Geo Analysis. We are always looking for passionate Engineers and Data Scientists to join our team. Please reach out to our recruiter, Ryan or apply via our job page Conclusion

I hope this post gave you enough information to have a better understand about how PostgreSQL may store your data. It has been useful for me to dive in the subject, it makes some other concepts or talks more easy to grasp but mainly fewer behaviors have to be accepted for granted. Much more could be said about PostgreSQL store and I hope that we will see more posts like this to make PostgreSQL internal parts understandable to beginners. References:

Chapter 63. Database Physical Storage
PostgreSQL Internals Through Pictures
Physical Row Storage Formats (University of Sydney)
28.1. Determining Disk Usage
Free Space Map and Visibility Map

© 著作权归作者所有

上一篇: 每日阅读
下一篇: Linux 系统时间
MtrS
粉丝 35
博文 670
码字总数 410296
作品 0
榆林
私信 提问
PostgreSQL的逻辑备份与物理备份

制定备份和恢复计划是每一个DBA最重要的工作之一,它决定了数据的有效性和完整性。也可以搭建跨越不同数据中心的流复制集群,能有效的帮助你避免单点故障。但是只有一份有效的备份能够帮助从...

libydwei2019
07/17
0
0
PostgreSQL DBA快速入门(三) - 逻辑和物理备份

执行定期备份并有一整套恢复计划是一个DBA的重要工作之一,这也是数据库可用性和完整性的保障。我们可以在多个机房之间部署流复制集群,以保证但节点故障,但是集群无法保证认为的意外DELET...

闻术苑
2018/08/01
239
0
物理备份和在线备份工具pg_rman使用详解(九)

pg_rman工具是实现PostgreSQL数据库在线备份,物理备份和恢复的备份工具。可以实现全量备份,增量备份。 全量: full 增量: incremental 归档: archive 开始使用pg_rman工具: chown -R postgre...

鸿鹄安然
2018/10/15
0
0
PostgreSQL pg_resetwal pg_resetxlog 强制使用某些值(例如system id)

标签 PostgreSQL , pgresetxlog , pgresetwal , 修复控制文件 , pg_controldata , 修复恢复异常 背景 使用pgresetwal , pgresetxlog 修复控制文件时,如何强制指定数据库实例systemid? pgres...

德哥
04/14
0
0
PostgreSQL数据库备份和还原

备份数据库三种方法: SQL转储 文件系统级备份。 持续存档。 pgdump是一个常规的 PostgreSQL客户端应用程序. pgdump不能以特殊权限运行,必须具有对要备份的所有表的读访问权限. pgdump连接受普...

鸿鹄安然
2018/10/14
0
0

没有更多内容

加载失败,请刷新页面

加载更多

JS--function

一、声明提前(hoist) 在js程序开始执行前,引擎会查找所有var声明的变量和function声明的函数,集中到当前作用域顶部集中创建,赋值留在原地 二、三种创建函数的方式 1、声明方式创建函数-...

wytao1995
今天
4
0
微服务之间调用控制器注解类型的差异

今天在一个业务服务通过Feign调用文件服务上传文件时遇到了几个问题: 1. 提示http请求头过大的问题; 此时需要修改bootstrap.yml,加入 server: max-http-header-size: 10000000 用以放大...

不再熬夜
今天
7
0
用 4G 工作是什么体验

七月开始,因为工作原因,在公司附近租了个住处,方便工作。离公司近了,感觉就是不一样,之前每天 5:30 就要起床赶地铁,现在可以睡到自然醒,一看才 7 点,悠闲的起床洗漱,踱步到公司,都...

zzxworld
今天
6
0
sonar报错volatile

问题发生 原先代码如下: //认证授权码private static volatile String AUTHORIZATION_CODE = "init"; git push 之后,sonar认为这是个bug检测报告截图如下: 分析排查 解释说明: Markin...

开源小菜鸟2333
今天
5
0
《Java实践指南》--读后

闲读《Java实践指南》... 1.lvy 某些项目中能够看到ivy.xml。早期使用ant的项目中,常常用ivy.xml来下载项目依赖。 2.ant 作为java程序员,应该都知道ant,虽然可能用过的人不多。为什么ant...

RippleChan
今天
5
0

没有更多内容

加载失败,请刷新页面

加载更多

返回顶部
顶部