
PostgreSQL GIN Indexes Demystified

news 2025/9/16 12:18:57

Contents

What is a GIN index?
Example scenario
How a GIN index works
GIN index structure
- Metapage
- Entries
- Leaf Pages
- How entry pages and leaf pages relate
- Posting list and posting tree
- Pending list
A closer look at the GIN index structure
Summary

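What is a GIN index?

A GIN (Generalized Inverted Index) index is commonly used to index array, jsonb, and tsvector (full-text search) columns.
For arrays, it can answer containment questions: whether an array contains another array or a set of elements (the @> operator), or whether it is contained by another array (the <@ operator).
The earlier article "postgresql-一文读懂index" (on index operators) has the complete operator list.

The question I really want to answer in this article is:
"Why do we use GIN indexes for these data types and operators?"
In PostgreSQL, GIN (Generalized Inverted Index) is called "inverted" because it flips the natural, forward layout of the data: a table row stores the values it contains, whereas the index stores, for each value, the rows that contain it.
In more detail:

A conventional index such as a B-tree treats the whole column value as one key and maps that key to the location of the row. A lookup starts from the complete key.
An inverted index starts from the individual elements inside a value and records which documents or rows contain each of them. For full-text search, GIN keeps one list per term with the identifiers of every row that contains the term. This "element to rows" mapping is called inverted because it reverses the "row to elements" relationship stored in the table.

This design suits multi-valued data (arrays, JSON, full-text documents) because it can efficiently find the set of rows that contain a given value. PostgreSQL's implementation is "generalized" because it supports many data types and operator classes (array operators, full-text search operators, and so on), hence the name Generalized Inverted Index.
In short, "inverted" describes how GIN maps from values back to rows, and that is what makes it perform well for containment queries and full-text search.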

Example scenario

Create a table named articles and add a GIN index to support full-text search:

-- Create the table
CREATE TABLE articles (
    id SERIAL PRIMARY KEY,
    content TEXT
);

-- Insert some sample rows
INSERT INTO articles (content) VALUES
('The quick brown fox jumps over the lazy dog'),
('A quick jump over the brown fence'),
('The lazy dog sleeps');

-- Create a GIN index for full-text search
CREATE INDEX idx_gin_content ON articles USING GIN (to_tsvector('english', content));

Using to_tsvector (together with ts_stat), we can list the 8 entries that this GIN index will contain:

demo=# SELECT DISTINCT word
FROM ts_stat('SELECT to_tsvector(''english'', content) FROM articles');
 word
-------
 fenc
 dog
 sleep
 jump
 fox
 brown
 quick
 lazi
(8 rows)

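How a GIN index works

When we build the GIN index on the content column, PostgreSQL uses to_tsvector to break the text into lexemes (terms) and builds an inverted list for each of them.
For example, to_tsvector('english', 'The quick brown fox jumps over the lazy dog') yields the lexemes brown, dog, fox, jump, lazi, and quick. Stop words such as "the" and "over" are dropped, and the remaining words are stemmed ("jumps" becomes jump, "lazy" becomes lazi).
The GIN index records, for each lexeme, which rows contain it. The rows are shown by id here for readability; the index itself stores heap tuple pointers (ctid):

brown -> [id: 1, id: 2]
dog -> [id: 1, id: 3]
quick -> [id: 1, id: 2]
......

This "lexeme to rows" mapping is what "inverted" refers to: the table stores each row with its content, and the index stores the reverse direction.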
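To make the mapping concrete, here is a minimal Python sketch of an inverted index over the three sample rows. The per-row lexeme lists are written out by hand from the sample data and the lexeme list above; the dictionary layout and the function name are illustrative assumptions, not how PostgreSQL stores anything.

```python
from collections import defaultdict

# Lexemes per row id, written out by hand for the three sample rows
# (stop words removed, words stemmed). Illustrative layout only.
ROW_LEXEMES = {
    1: ["quick", "brown", "fox", "jump", "lazi", "dog"],
    2: ["quick", "jump", "brown", "fenc"],
    3: ["lazi", "dog", "sleep"],
}


def build_inverted_index(rows):
    """Map each lexeme to the sorted list of row ids that contain it."""
    index = defaultdict(set)
    for row_id, lexemes in rows.items():
        for lexeme in lexemes:
            index[lexeme].add(row_id)
    return {lexeme: sorted(ids) for lexeme, ids in index.items()}


index = build_inverted_index(ROW_LEXEMES)
for lexeme in sorted(index):
    print(lexeme, "->", index[lexeme])
```

Looking up `index["dog"]` returns the rows 1 and 3 directly, without scanning any row content.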

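GIN index structure

A GIN index is structured much like a B-tree index. The following sections look at the differences.

Metapage

As in a B-tree index, the first page of a GIN index is the metapage, which holds information about the index. The content differs, though: for example, a GIN metapage has no counterpart to the B-tree fast root.
We can inspect the metapage with the pageinspect extension.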

demo=# select * from gin_metapage_info(get_raw_page('idx_gin_content',0));
-[ RECORD 1 ]----+-----------
pending_head     | 4294967295
pending_tail     | 4294967295
tail_free_size   | 0
n_pending_pages  | 0
n_pending_tuples | 0
n_total_pages    | 2
n_entry_pages    | 1
n_data_pages     | 0
n_entries        | 8
version          | 2

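The fields in this output mean the following:

Pending list fields

pending_head = 4294967295
pending_tail = 4294967295
4294967295 is the special value InvalidBlockNumber (0xFFFFFFFF); it means there is currently no pending list.
tail_free_size = 0
When a pending list exists, this is the free space left on its tail page. It is 0 here because there is no pending list.
n_pending_pages = 0
The number of pending-list pages. 0 agrees with pending_head and pending_tail: nothing is currently queued for a batched merge.
n_pending_tuples = 0
No tuples are waiting to be merged into the entry tree.
In short: everything in the index has already been merged into the entry pages, and nothing is staged.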
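The check described above can be written down in a few lines. This is a small Python sketch that works on the values copied from the gin_metapage_info() output; the dictionary and the helper name are assumptions made for illustration.

```python
INVALID_BLOCK_NUMBER = 0xFFFFFFFF  # 4294967295, PostgreSQL's InvalidBlockNumber

# Values copied from the gin_metapage_info() output shown earlier.
meta = {
    "pending_head": 4294967295,
    "pending_tail": 4294967295,
    "tail_free_size": 0,
    "n_pending_pages": 0,
    "n_pending_tuples": 0,
}


def has_pending_list(meta):
    """Return True when the metapage points at a pending-list page."""
    return meta["pending_head"] != INVALID_BLOCK_NUMBER


print(has_pending_list(meta))
```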

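Page statistics

n_total_pages = 2
The index file occupies 2 pages:
page 0: metapage
page 1: entry page
n_entry_pages = 1
There is 1 entry page (page 1). Entry pages hold the keys, each leading to a posting list or a posting tree.
n_data_pages = 0
There are no data pages.
Data pages are the pages of posting trees. The posting lists here are small enough to be stored inline in the entry page, so no separate data pages are needed.

Index entries

n_entries = 8
The index holds 8 distinct keys in total (a key can be an array element, a tsvector lexeme, a jsonb key, and so on).

Version

version = 2
The GIN on-disk format version. Current PostgreSQL releases use version 2, which, compared with version 1, stores posting lists in a more compact, compressed form.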

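A GIN index has two kinds of pages: entry pages and data pages.

Data pages are the pages inside a posting tree.
Entry pages are the pages that hold the index keys.

Both kinds carry opaque data (the special area at the end of the page) containing:

flags that define the page type (leaf, data, compressed, meta)
right sibling (the link to the next page on the same level)
maxoff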
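As a rough illustration of these three fields, the Python sketch below decodes the 8-byte opaque area of a GIN page. The sample bytes are the special section of the entry page dumped with pg_filedump later in this article. The flag values are taken from PostgreSQL's ginblock.h as I understand it, and little-endian byte order is assumed; verify both against your own server version and platform.

```python
import struct

# Page flag bits, as defined in PostgreSQL's ginblock.h (assumed current).
GIN_FLAGS = {
    0x01: "DATA",
    0x02: "LEAF",
    0x04: "DELETED",
    0x08: "META",
    0x10: "LIST",
    0x20: "LIST_FULLROW",
    0x40: "INCOMPLETE_SPLIT",
    0x80: "COMPRESSED",
}


def decode_gin_opaque(raw):
    """Decode rightlink (uint32), maxoff (uint16) and flags (uint16)."""
    rightlink, maxoff, flags = struct.unpack("<IHH", raw)
    names = [name for bit, name in GIN_FLAGS.items() if flags & bit]
    return {"rightlink": rightlink, "maxoff": maxoff, "flags": names}


# Special section of a small leaf entry page (assumed little-endian host).
sample = bytes.fromhex("ffffffff 00000200")
print(decode_gin_opaque(sample))
```

A rightlink of 4294967295 (InvalidBlockNumber) means the page has no right sibling, and the only flag set is LEAF.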

Entries

In a GIN index, the keys (entries) are stored in entry pages organized as a B-tree, which makes this part very similar to a B-tree index: the first page is the metapage, and the keys follow in a tree.
Even so, there are some major differences.
First, if you index an array column with a B-tree index, the stored key is the array itself.

BTree: key → tuple
GIN: key → posting list (or posting tree) → tuple
For example, suppose the indexed column holds the array {1,6,12}.
BTree: the whole array {1,6,12} is a single key:

Key (array)          → Tuple(s)
-------------------------------
{1,6,12}             → (0,1)

GIN: the array is split into separate entries 1, 6, and 12, each with its own posting list:

Key (entry)          → Posting List (heap pointers)
-------------------------------
1                    → (0,1)
6                    → (0,1)
12                   → (0,1)

This is exactly why GIN is good at full-text search.
Our example does not use an array, but to_tsvector splits each sentence into lexemes, and each lexeme becomes its own entry.

GIN Entry Page
+---------+-------------------+
| "dog"   | -> (0,1), (0,3)   |
| "brown" | -> (0,1), (0,2)   |
| "sleep" | -> (0,3)          |
| "quick" | -> (0,1), (0,2)   |
| ......  | -> ......         |
+---------+-------------------+
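The "key → posting list" layout also suggests how a containment query could be answered: fetch the posting list of every requested element and intersect them. The Python sketch below shows that idea for arrays. The rows other than {1,6,12} at (0,1) are hypothetical, and the real GIN search code is considerably more involved than a set intersection.

```python
def gin_entries(rows):
    """Split each array into elements and map element -> sorted heap pointers."""
    entries = {}
    for ctid, array in rows.items():
        for element in set(array):
            entries.setdefault(element, []).append(ctid)
    return {key: sorted(tids) for key, tids in entries.items()}


def rows_containing_all(entries, wanted):
    """Intersect the posting lists of all wanted elements (like array @> wanted)."""
    postings = [set(entries.get(element, [])) for element in wanted]
    return sorted(set.intersection(*postings)) if postings else []


# ctid -> array value; only the first row comes from the text, the rest are made up.
rows = {(0, 1): [1, 6, 12], (0, 2): [6, 7], (0, 3): [1, 12]}
entries = gin_entries(rows)

print(entries[6])
print(rows_containing_all(entries, [1, 12]))
```

With a B-tree on the whole array, the same question would require comparing complete array values, which is why the split into per-element entries matters.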

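The second difference: in a GIN index, keys are unique.
In a B-tree, the same value can appear in many items; it is the (value, pointer) pair that makes each index item unique.
Because each key is stored only once, a GIN index works well when the same value occurs in many different rows.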

Leaf Pages

In a B-tree index, the leaf level holds as many items as there are indexed rows, so the same value can be repeated many times. If each lexeme of our sample rows were indexed that way, the leaf level would look like this:

BTree Leaf Page
"brown" → Row1
"brown" → Row2
"dog"   → Row1
"dog"   → Row3
"fenc"  → Row2
"fox"   → Row1
"jump"  → Row1
"jump"  → Row2
"lazi"  → Row1
"lazi"  → Row3
"quick" → Row1
"quick" → Row2
"sleep" → Row3

As mentioned above, entries in a GIN index are unique. The leaf level therefore holds, for each entry, a list or a tree of pointers to rows: the posting list or the posting tree.

GIN Entry Tree
"brown" → [Row1, Row2]
"dog"   → [Row1, Row3]
"fenc"  → [Row2]
"fox"   → [Row1]
"jump"  → [Row1, Row2]
"lazi"  → [Row1, Row3]
"quick" → [Row1, Row2]
"sleep" → [Row3]
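The difference in item counts is easy to quantify. The following Python sketch assumes the lexemes that to_tsvector('english', ...) produces for the three sample rows and compares a B-tree style leaf level (one item per occurrence) with the GIN layout (one entry per distinct key). It is a counting exercise only, not a model of either index's storage format.

```python
# Posting lists for the three sample rows, keyed by lexeme.
gin_tree = {
    "brown": [1, 2], "dog": [1, 3], "fenc": [2], "fox": [1],
    "jump": [1, 2], "lazi": [1, 3], "quick": [1, 2], "sleep": [3],
}

# B-tree style leaf level: one (value, row) item per occurrence.
btree_items = [(key, row) for key, rows in gin_tree.items() for row in rows]

print("B-tree style items:", len(btree_items))
print("GIN entries:", len(gin_tree))
```

The gap widens as a value appears in more rows: the key is stored once in GIN, and only its posting list grows.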

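How entry pages and leaf pages relate

The entry tree is laid out like a B-tree:

Entry pages (internal pages)

They hold key ranges for navigation:
each item is a separator key plus a pointer to a child page.
This corresponds to a B-tree internal page.

Leaf pages
They hold the actual entries (values); each entry carries either a posting list or a pointer to a posting tree.

Whether an entry carries a posting list or a pointer to a posting tree depends on how large its posting list is, and the shape of the entry tree depends on the size of the index:

1. Small index (entry page = root = leaf, posting lists inline)

Entry Tree
└── Entry Page (root & leaf)
    ├── entry = "dog"   → posting list [ctid(0,1), ctid(0,3)]
    ├── entry = "quick" → posting list [ctid(0,1), ctid(0,2)]
    ├── entry = "lazi"  → posting list [ctid(0,1), ctid(0,3)]
    └── ...

Here there is only the metapage plus one entry page; the entry page is both root and leaf, and all posting lists are inline.

2. Large index (the entry page stores child pointers; an oversized posting list becomes a posting tree)

Entry Tree
├── Entry Page (internal)
│       key range + child pointers
│
└── Leaf Page
    ├── entry = "dog" → posting list [ctid(0,1), ctid(0,3)]
    ├── entry = "quick" → posting list [ctid(0,1), ctid(0,2)]
    ├── entry = "the" → pointer to posting tree
    └── ...

In this case:

the entry page acts as an internal page (navigation only);
the leaf page stores the entries, but when an entry's posting list is too large (such as "the", assuming a text search configuration that keeps stop words), it stores a pointer to a posting tree instead.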
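The navigation step can be sketched as a two-level lookup: binary-search the internal page for the child to visit, then binary-search the leaf for the key. The page layout below (a dict of high keys and plain lists for leaves) is a hypothetical simplification for illustration, not PostgreSQL's on-disk format.

```python
from bisect import bisect_left

# Hypothetical two-level entry tree. The internal page keeps the highest key
# of each child; leaf pages keep sorted (key, posting list) pairs.
internal_page = {"high_keys": ["fox", "sleep"], "children": [0, 1]}
leaf_pages = [
    [("brown", [(0, 1), (0, 2)]), ("dog", [(0, 1), (0, 3)]), ("fox", [(0, 1)])],
    [("jump", [(0, 1), (0, 2)]), ("quick", [(0, 1), (0, 2)]), ("sleep", [(0, 3)])],
]


def lookup(key):
    """Descend from the internal page to a leaf and return the key's postings."""
    slot = bisect_left(internal_page["high_keys"], key)
    if slot == len(internal_page["children"]):
        return None  # key sorts after every key in the tree
    leaf = leaf_pages[internal_page["children"][slot]]
    keys = [k for k, _ in leaf]
    pos = bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        return leaf[pos][1]
    return None


print(lookup("dog"))
print(lookup("quick"))
print(lookup("zebra"))
```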

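Posting list and posting tree

In a leaf page, each entry contains a list of item pointers called the posting list, stored in a compressed format.

If the list grows so large that the item no longer fits in an index page, the posting list is split across separate pages organized as a B-tree. This is the posting tree. The leaf item then stores a pointer to this tree instead of the posting list.
In a posting list, the heap pointers are stored in physical order (sorted by block and offset). In a posting tree, those same pointers serve as the keys.
With the leaf level covered, here is how the levels of a GIN index fit together.
In the GIN entry tree, every entry is unique.
Take "dog":

Leaf Page (GIN)
Entry: "dog"
Posting list of item pointers:
→ (Row1, ctid=(0,1))
→ (Row3, ctid=(0,3))

The posting list here is small and fits directly in the leaf page.

When the posting list is too large → posting tree
Suppose "the" appears in 1 million articles (with a text search configuration that keeps stop words; the 'english' configuration used above would drop it). The posting list is then too large to fit in one index page.
GIN splits the posting list across multiple pages organized as a B-tree, forming a posting tree:

Entry Tree (Entry Page / Leaf Page)
Entry: "the"
→ pointer to Posting Tree

Posting tree structure:

Posting Tree Root Page (BTree internal)
├── → Data Page 1 [Row1, Row3, Row8, Row20, ...]
├── → Data Page 2 [Row101, Row102, Row110, ...]
└── → Data Page 3 [Row999, Row1000, ...]

Here:

the leaf item no longer stores a posting list; it stores a pointer to the posting tree root.
inside the posting tree, the item pointers themselves are the keys used to navigate to the right data page.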
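The switch from an inline posting list to a posting tree can be sketched as follows. The two limits are hypothetical and deliberately tiny so that the effect is visible; PostgreSQL's real thresholds are byte-based and depend on the page size. The "tree" here has just a root and one level of data pages.

```python
# Hypothetical limits, chosen small for illustration only.
MAX_INLINE_ITEMS = 4
ITEMS_PER_DATA_PAGE = 3


def store_postings(tids):
    """Keep a short list inline; split a long one into sorted data pages."""
    tids = sorted(tids)
    if len(tids) <= MAX_INLINE_ITEMS:
        return {"kind": "posting list", "items": tids}
    pages = [tids[i:i + ITEMS_PER_DATA_PAGE]
             for i in range(0, len(tids), ITEMS_PER_DATA_PAGE)]
    # The root keeps the highest TID of each page as its navigation key.
    root = [page[-1] for page in pages]
    return {"kind": "posting tree", "root": root, "pages": pages}


small = store_postings([(0, 3), (0, 1)])
large = store_postings([(0, n) for n in range(1, 9)])

print(small["kind"], small["items"])
print(large["kind"], large["root"])
```

Because the TIDs are sorted, the root keys are enough to decide which data page can contain a given row pointer.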

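Pending list

Inserting a new row into a GIN index is fairly slow, slower than inserting into an ordinary B-tree: one row can produce many keys, and because keys are unique, each of them requires updating an existing posting list or posting tree.

To speed up inserts, new entries are first stored in a pending list, a simple linear list of pages. When the pending list reaches its size limit, or when VACUUM runs, those entries are moved into the main tree with a bulk insert, which is optimized, especially when each value maps to many rows.

The pending list size limit can be set per index:

CREATE INDEX ... WITH (gin_pending_list_limit=...)
ALTER INDEX ... SET (gin_pending_list_limit=...)

or globally through the gin_pending_list_limit configuration parameter (the default is 4MB):

gin_pending_list_limit = '64MB'

The downside of the pending list is that every search in the GIN index must scan both the main tree and the pending list.
If your data rarely changes and you do not mind slow updates, you can disable the pending list by setting fastupdate to false, either when creating the index or with ALTER INDEX.

Note that disabling the pending list with ALTER INDEX does not flush an existing pending list, so you may need to run VACUUM on the table to make sure everything has been moved into the main tree.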
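The trade-off can be modeled in a few lines of Python: inserts append cheaply to a pending list, a flush merges everything in one pass, and every search has to look in both places. The class below is a toy model with a hypothetical entry-count limit; the real limit is expressed in bytes, and the real merge works on pages, not Python lists.

```python
class ToyGin:
    """Illustrative model: a dict as the main tree plus a pending list."""

    def __init__(self, pending_limit=3):
        self.main = {}       # key -> sorted list of row ids
        self.pending = []    # unsorted (key, row id) pairs, appended cheaply
        self.pending_limit = pending_limit  # hypothetical, counted in entries

    def insert(self, row_id, keys):
        self.pending.extend((key, row_id) for key in keys)
        if len(self.pending) > self.pending_limit:
            self.flush()

    def flush(self):
        """Bulk-merge all pending pairs into the main structure."""
        for key, row_id in self.pending:
            rows = self.main.setdefault(key, [])
            if row_id not in rows:
                rows.append(row_id)
                rows.sort()
        self.pending.clear()

    def search(self, key):
        """A search must consult both the main structure and the pending list."""
        found = set(self.main.get(key, []))
        found.update(row for k, row in self.pending if k == key)
        return sorted(found)


gin = ToyGin(pending_limit=3)
gin.insert(1, ["dog", "fox"])        # stays in the pending list
before = (gin.search("dog"), len(gin.pending))
gin.insert(3, ["dog", "sleep"])      # exceeds the limit and triggers a flush
after = (gin.search("dog"), len(gin.pending))
print(before, after)
```

Setting fastupdate to false corresponds to always writing straight into `main`: searches get cheaper, inserts get more expensive.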

A closer look at the GIN index structure

First, we use the pageinspect extension to look at the metapage contents:

 pending_head | pending_tail | tail_free_size | n_pending_pages | n_pending_tuples | n_total_pages | n_entry_pages | n_data_pages | n_entries | version
--------------+--------------+----------------+-----------------+------------------+---------------+---------------+--------------+-----------+---------
   4294967295 |   4294967295 |              0 |               0 |                0 |             2 |             1 |            0 |         8 |       2
(1 row)

n_total_pages shows that the index has 2 pages in total: the metapage takes 1 page, and n_entry_pages = 1 means there is 1 entry page.
pageinspect has no function that lists the items of an entry page directly (gin_leafpage_items only works on posting tree leaf pages), so we can only confirm the content of this entry page indirectly:

demo=# SELECT DISTINCT word
FROM ts_stat('SELECT to_tsvector(''english'', content) FROM articles');
 word
-------
 fenc
 dog
 sleep
 jump
 fox
 brown
 quick
 lazi
(8 rows)

8 lexemes are produced, which matches n_entries = 8.

We can also dump the internal structure of the GIN index with pg_filedump:

 pg_filedump -i -f  -R 1 /var/lib/postgresql/16/main/base/16448/24863

Here -R 1 means dump page 1 (page 0 is the metapage, page 1 is the entry page).
The output:


*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility
*
* File: /var/lib/postgresql/16/main/base/16448/24863
* Options used: -i -f -R 1
*******************************************************************

Block    1 ********************************************************
<Header> -----
 Block Offset: 0x00002000         Offsets: Lower      56 (0x0038)
 Block: Size 8192  Version    4            Upper    7952 (0x1f10)
 LSN:  logid      0 recoff 0x12b23040      Special  8184 (0x1ff8)
 Items:    8                      Free Space: 7896
 Checksum: 0x0000  Prune XID: 0x00000000  Flags: 0x0000 ()
 Length (including item array): 56

  0000: 00000000 4030b212 00000000 3800101f  ....@0......8...
  0010: f81f0420 00000000 d89f4000 b89f4000  ... ......@...@.
  0020: a09f3000 889f3000 689f4000 489f4000  ..0...0.h.@.H.@.
  0030: 289f4000 109f3000                    (.@...0.

<Data> -----
 Item   1 -- Length:   32  Offset: 8152 (0x1fd8)  Flags: NORMAL
  Block Id: 2147483664  linp Index: 2  Size: 32
  Has Nulls: 0  Has Varwidths: 1

  1fd8: 00801000 02002040 0d62726f 776e0000  ...... @.brown..
  1fe8: 00000000 01000100 01000000 00000000  ................

 Item   2 -- Length:   32  Offset: 8120 (0x1fb8)  Flags: NORMAL
  Block Id: 2147483664  linp Index: 2  Size: 32
  Has Nulls: 0  Has Varwidths: 1

  1fb8: 00801000 02002040 09646f67 00000000  ...... @.dog....
  1fc8: 00000000 01000100 02000000 00000000  ................

 Item   3 -- Length:   24  Offset: 8096 (0x1fa0)  Flags: NORMAL
  Block Id: 2147483664  linp Index: 1  Size: 24
  Has Nulls: 0  Has Varwidths: 1

  1fa0: 00801000 01001840 0b66656e 63000000  .......@.fenc...
  1fb0: 00000000 02000000                    ........

 Item   4 -- Length:   24  Offset: 8072 (0x1f88)  Flags: NORMAL
  Block Id: 2147483664  linp Index: 1  Size: 24
  Has Nulls: 0  Has Varwidths: 1

  1f88: 00801000 01001840 09666f78 00000000  .......@.fox....
  1f98: 00000000 01000000                    ........

 Item   5 -- Length:   32  Offset: 8040 (0x1f68)  Flags: NORMAL
  Block Id: 2147483664  linp Index: 2  Size: 32
  Has Nulls: 0  Has Varwidths: 1

  1f68: 00801000 02002040 0b6a756d 70000000  ...... @.jump...
  1f78: 00000000 01000100 01000000 00000000  ................

 Item   6 -- Length:   32  Offset: 8008 (0x1f48)  Flags: NORMAL
  Block Id: 2147483664  linp Index: 2  Size: 32
  Has Nulls: 0  Has Varwidths: 1

  1f48: 00801000 02002040 0b6c617a 69000000  ...... @.lazi...
  1f58: 00000000 01000100 02000000 00000000  ................

 Item   7 -- Length:   32  Offset: 7976 (0x1f28)  Flags: NORMAL
  Block Id: 2147483664  linp Index: 2  Size: 32
  Has Nulls: 0  Has Varwidths: 1

  1f28: 00801000 02002040 0d717569 636b0000  ...... @.quick..
  1f38: 00000000 01000100 01000000 00000000  ................

 Item   8 -- Length:   24  Offset: 7952 (0x1f10)  Flags: NORMAL
  Block Id: 2147483664  linp Index: 1  Size: 24
  Has Nulls: 0  Has Varwidths: 1

  1f10: 00801000 01001840 0d736c65 65700000  .......@.sleep..
  1f20: 00000000 03000000                    ........

<Special Section> -----
 GIN Index Section:
  Flags: 0x00000002 (LEAF)  Maxoff: 0
  Blocks: RightLink (-1)

  1ff8: ffffffff 00000200                    ........

*** End of Requested Range Encountered. Last Block Read: 1 ***

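In the output, Item 1 through Item 8 show that this entry page contains 8 items, which matches the query above. Every item has this structure:

  1fb8: 00801000 02002040 09646f67 00000000  ...... @.dog....
  1fc8: 00000000 01000100 02000000 00000000  ................

The first line (1fb8: 00801000 02002040 09646f67 00000000):

00 80 10 00 02 00 20 40
These first 8 bytes are the index tuple header. In a GIN leaf entry, the 6-byte item pointer field is reused for metadata, which pg_filedump prints as Block Id: 2147483664 and linp Index: 2. 2147483664 = 0x80000010 is the "compressed" flag bit plus 0x10 = 16, the offset of the posting list inside the tuple; 2 is the number of item pointers in the posting list. The last two bytes, 20 40 (0x4020), hold the tuple size (0x20 = 32) and the "has variable-width attributes" flag.
The next block contains the lexeme itself:
09 64 6f 67 00 00 00 00
09: the header byte of a short varlena (text) value, which encodes the length and a flag bit; this is the usual format for short text values in Postgres.
We do not need to pick the varlena header apart bit by bit here; what matters is the character bytes that follow.
64 6f 67 is ASCII "dog", the lexeme stored in the index.

demo=# select chr(x'64'::int),chr(x'6F'::int),chr(x'67'::int);
 chr | chr | chr
-----+-----+-----
 d   | o   | g

The trailing 00 00 00 00 is alignment padding, so that the posting list starts at an aligned position (offset 16 within the tuple).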
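For readers who want to check the first line mechanically, here is a Python sketch that decodes the 8-byte header and the short text key of the "dog" item. The mask constants reflect my reading of the PostgreSQL headers (GIN_ITUP_COMPRESSED, INDEX_SIZE_MASK, INDEX_VAR_MASK); the function name is made up, little-endian byte order is assumed, and only 1-byte varlena headers are handled.

```python
import struct

GIN_ITUP_COMPRESSED = 1 << 31   # high bit of the reused block-number field
INDEX_SIZE_MASK = 0x1FFF        # low 13 bits of t_info: tuple size
INDEX_VAR_MASK = 0x4000         # t_info flag: has variable-width attributes


def decode_leaf_tuple(raw):
    """Decode the header and short varlena key of a GIN leaf entry tuple."""
    bi_hi, bi_lo, posid, t_info = struct.unpack_from("<HHHH", raw, 0)
    block_field = (bi_hi << 16) | bi_lo
    va_header = raw[8]
    if not va_header & 0x01:
        raise ValueError("this sketch only handles 1-byte varlena headers")
    key_len = (va_header >> 1) - 1   # total length minus the header byte
    return {
        "compressed": bool(block_field & GIN_ITUP_COMPRESSED),
        "posting_offset": block_field & (GIN_ITUP_COMPRESSED - 1),
        "n_postings": posid,
        "tuple_size": t_info & INDEX_SIZE_MASK,
        "has_varwidths": bool(t_info & INDEX_VAR_MASK),
        "key": raw[9:9 + key_len].decode("ascii"),
    }


# First 16 bytes of the "dog" item, copied from the dump.
dog = bytes.fromhex("00801000 02002040 09646f67 00000000")
print(decode_leaf_tuple(dog))
```

The decoded posting offset of 16 is where the second line of the item begins.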

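The second line:
In a GIN entry leaf page, the key is immediately followed by the posting list or by the pointer to a posting tree.
Before decoding it, we need the source-level structure of an item pointer:

typedef struct ItemPointerData
{
    BlockIdData ip_blkid;   /* 4 bytes: block number */
    OffsetNumber ip_posid;  /* 2 bytes: line pointer index (slot) */
} ItemPointerData;

BlockIdData = 4 bytes, the heap block number (stored as two 16-bit halves, high half first)
OffsetNumber = 2 bytes, the line pointer index within that block
Note: PostgreSQL writes these integers in the byte order of the host machine; this dump comes from a little-endian machine.
An ItemPointerData is 6 bytes. A compressed posting list starts with the first item pointer in full, followed by a 2-byte length field (nbytes), so it begins with an 8-byte group; the remaining item pointers follow as variable-length deltas.

The 16 bytes of the second line are:

00 00 00 00   01 00 01 00   02 00 00 00   00 00 00 00

Breaking them down:
00 00 00 00: the block number of the first item pointer = 0, that is, the first block of the heap table.
01 00 01 00: two 16-bit little-endian values:

the first two bytes 01 00 = 0x0001 = 1: the offset of the first item pointer.
the last two bytes 01 00 = 0x0001 = 1: nbytes, the length in bytes of the compressed delta stream that follows.
This gives the first tid of the posting list: (block, offset) = (0, 1).
02: the single delta byte. Deltas are varbyte-encoded (7 bits per byte, with the high bit set when another byte follows), so 0x02 = 2 is the distance from the first item pointer to the second.
The second tid is therefore in the same block at offset 1 + 2 = 3: (block, offset) = (0, 3).
The remaining 00 bytes are padding up to the aligned tuple size.

Conclusion (mapping back to the heap table tids)
The two heap tuples referenced by this posting list are:
(0,1): the row with id = 1, content "The quick brown fox jumps over the lazy dog" (contains "dog")
(0,3): the row with id = 3, content "The lazy dog sleeps" (contains "dog")
In other words, "dog" occurs in rows (0,1) and (0,3), which agrees with the text in the table.

demo=# select ctid,* from articles;
 ctid  | id |                   content
-------+----+---------------------------------------------
 (0,1) |  1 | The quick brown fox jumps over the lazy dog
 (0,2) |  2 | A quick jump over the brown fence
 (0,3) |  3 | The lazy dog sleeps
(3 rows)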
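The decoding can be reproduced with a short Python sketch. It assumes the compressed posting list layout used by GIN format version 2 as I understand it from ginpostinglist.c: the first item pointer in full (6 bytes), a 2-byte nbytes field, then varbyte-encoded deltas over item pointers packed as (block << 11) | offset. Little-endian byte order is assumed and the function name is made up.

```python
import struct

OFFSET_BITS = 11  # bits reserved for the offset when packing a TID (assumed)


def decode_posting_list(raw):
    """Decode one compressed posting list segment into (block, offset) TIDs."""
    bi_hi, bi_lo, posid, nbytes = struct.unpack_from("<HHHH", raw, 0)
    value = (((bi_hi << 16) | bi_lo) << OFFSET_BITS) | posid
    mask = (1 << OFFSET_BITS) - 1
    tids = [(value >> OFFSET_BITS, value & mask)]
    pos, end = 8, 8 + nbytes
    while pos < end:
        delta, shift = 0, 0
        while True:
            byte = raw[pos]
            pos += 1
            delta |= (byte & 0x7F) << shift   # 7 payload bits per byte
            shift += 7
            if not byte & 0x80:               # high bit clear: last byte
                break
        value += delta
        tids.append((value >> OFFSET_BITS, value & mask))
    return tids


# Posting list bytes of the "dog" and "fenc" items, copied from the dump.
dog = bytes.fromhex("00000000 01000100 02000000 00000000")
fenc = bytes.fromhex("00000000 02000000")

print(decode_posting_list(dog))
print(decode_posting_list(fenc))
```

For "dog" this yields (0,1) and (0,3); for "fenc", whose nbytes field is 0, only the first item pointer (0,2) is present.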

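Summary

A GIN index consists of:

a metapage
a B-tree of key entries
leaf pages, whose entries contain either a pointer to a posting tree or a posting list of heap pointers
these pointers are kept sorted in physical (block, offset) order; in a posting tree, the tid is the key used to build the tree
rows that have not yet been merged into the main structure are held in the pending list

GIN indexes have a distinctive structure. The most important thing to understand is this:
the indexed values are split up to produce the keys.
That is why GIN is so efficient for full-text search, arrays, and jsonb.

GIN can also index scalar types such as integers through an extension (for example btree_gin). This can pay off for a column with few distinct values, because each value is stored only once instead of once per row as in a B-tree. For searches, though, I have not found it to be consistently better than a B-tree, possibly because of the extra step of reading posting lists and posting trees.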