Drawing on LWN's "Memory folios" (https://lwn.net/Articles/856016/) and the merge commit "Merge tag 'folio-5.16'" (https://github.com/torvalds/linux/commit/49f8275c7d92), the key points are as follows.
1.1 Definition of a folio
Add memory folios, a new type to represent either order-0 pages or the head page of a compound page.
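A folio can be seen as a wrapper around page, one that costs nothing. A folio may be a single page or a compound page.
(Figure credit: the article "围绕 HugeTLB 的极致优化", roughly "Extreme optimization around HugeTLB".)
The figure above sketches struct page: its 64 bytes hold flags, lru, mapping, index, private, {ref_, map_}count, memcg_data and so on. When the page is part of a compound page, these fields live in the head page, while the tail pages reuse the same space to hold compound_{head, mapcount, order, nr, dtor} and similar information.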
struct folio {
    /* private: don't document the anon union */
    union {
        struct {
    /* public: */
            unsigned long flags;
            struct list_head lru;
            struct address_space *mapping;
            pgoff_t index;
            void *private;
            atomic_t _mapcount;
            atomic_t _refcount;
#ifdef CONFIG_MEMCG
            unsigned long memcg_data;
#endif
    /* private: the union with struct page is transitional */
        };
        struct page page;
    };
};
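In the definition of struct folio, fields such as flags and lru match struct page exactly, which is why the two can share a union. As a result you can write folio->flags directly instead of folio->page.flags.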
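The layout claim can be checked in user space. What follows is a minimal sketch rather than kernel code: `mock_page` and `mock_folio` are hypothetical stand-ins with only three fields, and `set_via_folio_read_via_page` is a helper invented for this illustration. The `_Static_assert` lines show the kind of compile-time offset check that makes such a union safe.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct page (not the kernel layout). */
struct mock_page {
    unsigned long flags;
    void *mapping;
    unsigned long index;
};

/* Hypothetical stand-in for struct folio: named fields overlay the page. */
struct mock_folio {
    union {
        struct {
            unsigned long flags;
            void *mapping;
            unsigned long index;
        };
        struct mock_page page;
    };
};

/* The overlay is only valid if every field sits at the same offset. */
_Static_assert(offsetof(struct mock_folio, flags) ==
               offsetof(struct mock_page, flags), "flags offset differs");
_Static_assert(offsetof(struct mock_folio, mapping) ==
               offsetof(struct mock_page, mapping), "mapping offset differs");
_Static_assert(offsetof(struct mock_folio, index) ==
               offsetof(struct mock_page, index), "index offset differs");
_Static_assert(sizeof(struct mock_folio) == sizeof(struct mock_page),
               "the wrapper must not add any bytes");

/* Write through the folio view, read back through the embedded page. */
static unsigned long set_via_folio_read_via_page(unsigned long value)
{
    struct mock_folio folio = { .flags = 0 };

    folio.flags = value;
    return folio.page.flags;
}
```

Because both views share the same storage, a value written through `folio.flags` is the value read back through `folio.page.flags`, and the size assertion confirms that the wrapper adds no bytes.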
#define page_folio(p)           (_Generic((p),                          \
    const struct page *:        (const struct folio *)_compound_head(p), \
    struct page *:              (struct folio *)_compound_head(p)))
#define nth_page(page,n)        ((page) + (n))
#define folio_page(folio, n)    nth_page(&(folio)->page, n)
page_folio can look puzzling at first glance, but it is equivalent to the following pseudocode:
switch (typeof(p)) {
case const struct page *:
    return (const struct folio *)_compound_head(p);
case struct page *:
    return (struct folio *)_compound_head(p);
}
That is all there is to it.
_Generic is the C11 generic selection feature (section 6.5.1.1 of the standard, https://www.open-std.org/JTC1/sc22/wg14/www/docs/n1570.pdf). Its syntax is:
Generic selection
Syntax
    generic-selection:
        _Generic ( assignment-expression , generic-assoc-list )
    generic-assoc-list:
        generic-association
        generic-assoc-list , generic-association
    generic-association:
        type-name : assignment-expression
        default : assignment-expression
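To see the grammar at work outside the kernel, here is a small sketch that applies the same const-preserving trick as page_folio. The types `item` and `box` and the macros `item_box` and `is_const_box` are hypothetical names made up for this example; only `_Generic` itself comes from the standard.

```c
#include <assert.h>

struct item { int id; };              /* hypothetical, plays the role of struct page */
struct box  { struct item first; };   /* hypothetical, plays the role of struct folio */

/* Pick the cast from the static type of p, so constness survives. */
#define item_box(p) (_Generic((p),                          \
        const struct item *: (const struct box *)(p),       \
        struct item *:       (struct box *)(p)))

/* Report which association _Generic selects for an expression. */
#define is_const_box(x) (_Generic((x),    \
        const struct box *: 1,            \
        struct box *:       0))

/* A mutable pointer in gives a mutable pointer out. */
static int demo_mutable(void)
{
    struct box b = { { 7 } };
    struct item *p = &b.first;

    return is_const_box(item_box(p));
}

/* A const pointer in gives a const pointer out. */
static int demo_const(void)
{
    struct box b = { { 7 } };
    const struct item *p = &b.first;

    return is_const_box(item_box(p));
}
```

`demo_mutable()` returns 0 and `demo_const()` returns 1. The selection happens at compile time from the type of the controlling expression, which is why a single macro can serve both const and non-const callers without dropping the qualifier.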
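Converting between page and folio is just as direct. Whether the page is a head page or a tail page, converting it to a folio means getting the folio that corresponds to its head page. In the other direction, &folio->page yields the head page, and folio_page(folio, n) yields the n-th page of the folio, which is a tail page for n > 0.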
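Both directions of the conversion can be sketched in user space. The code below is an illustration under stated assumptions, not the kernel implementation: all `mock_` names are hypothetical, the pages of one compound page are assumed to sit contiguously in an array, and the tail-page encoding (head address plus one) follows the compound_head() convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cut-down page: bit 0 of compound_head marks a tail page. */
struct mock_page {
    unsigned long flags;
    uintptr_t compound_head;        /* tail pages: address of head + 1 */
};

struct mock_folio {
    struct mock_page page;          /* the head (or only) page */
};

static struct mock_page *mock_compound_head(struct mock_page *page)
{
    uintptr_t head = page->compound_head;

    if (head & 1)
        return (struct mock_page *)(head - 1);
    return page;
}

/* Any page, head or tail, maps to the folio of its head page. */
static struct mock_folio *mock_page_folio(struct mock_page *page)
{
    return (struct mock_folio *)mock_compound_head(page);
}

/* folio -> n-th page; the folio starts at the same address as its head page. */
static struct mock_page *mock_folio_page(struct mock_folio *folio, size_t n)
{
    return (struct mock_page *)folio + n;
}

/* Turn pages[0..nr) into one compound page with pages[0] as the head. */
static void mock_make_compound(struct mock_page *pages, size_t nr)
{
    pages[0].compound_head = 0;
    for (size_t i = 1; i < nr; i++)
        pages[i].compound_head = (uintptr_t)&pages[0] + 1;
}
```

After `mock_make_compound(pages, 4)`, calling `mock_page_folio(&pages[3])` and `mock_page_folio(&pages[0])` return the same folio, and `mock_folio_page(folio, 2)` points back at `pages[2]`.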
The question is: a page can already represent either a base page or a compound page, so why introduce folios at all?
1.2 What can a folio do?
The folio type allows a function to declare that it's expecting only a head page. Almost incidentally, this allows us to remove various calls to VM_BUG_ON(PageTail(page)) and compound_head().
The reason is that page carries too many meanings: it can be a base page, a compound head page, or a compound tail page.
Take mem_cgroup_move_account as an example: a single call to it can execute compound_head up to 7 times.
static inline struct page *compound_head(struct page *page)
{
    unsigned long head = READ_ONCE(page->compound_head);
    if (unlikely(head & 1))
        return (struct page *) (head - 1);
    return page;
}
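Now look at page_mapping(page) in detail. On entry, the function first calls compound_head(page) so that it can read the mapping and other fields from the head page. There is also the PageSwapCache(page) branch: that helper is handed a page, so internally it has to call compound_head(page) once more to read the page flags.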
struct address_space *page_mapping(struct page *page)
{
    struct address_space *mapping;
    page = compound_head(page);
    /* This happens if someone calls flush_dcache_page on slab page */
    if (unlikely(PageSlab(page)))
        return NULL;
    if (unlikely(PageSwapCache(page))) {
        swp_entry_t entry;
        entry.val = page_private(page);
        return swap_address_space(entry);
    }
    mapping = page->mapping;
    if ((unsigned long)mapping & PAGE_MAPPING_ANON)
        return NULL;
    return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
}
EXPORT_SYMBOL(page_mapping);
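The redundancy is easy to make visible with a counter. This is a simplified user-space model, assuming a single hypothetical flag bit and `mock_` stand-ins for the kernel helpers; unlike the real page_mapping(), the swap-cache branch here just returns NULL. It only demonstrates that a page-based flag test must repeat the head lookup that its caller has already done.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int head_lookups;            /* counts calls to mock_compound_head() */

struct mock_page {
    unsigned long flags;
    uintptr_t compound_head;        /* tail pages: address of head + 1 */
    void *mapping;
};

#define MOCK_PG_SWAPCACHE 0x1UL     /* hypothetical flag bit */

static struct mock_page *mock_compound_head(struct mock_page *page)
{
    uintptr_t head = page->compound_head;

    head_lookups++;
    if (head & 1)
        return (struct mock_page *)(head - 1);
    return page;
}

/* Flag test in the page API: it cannot assume it was given a head page. */
static int mock_PageSwapCache(struct mock_page *page)
{
    return (mock_compound_head(page)->flags & MOCK_PG_SWAPCACHE) != 0;
}

/* Same shape as page_mapping(): one explicit lookup, one hidden lookup. */
static void *mock_page_mapping(struct mock_page *page)
{
    page = mock_compound_head(page);
    if (mock_PageSwapCache(page))
        return NULL;                /* simplification of the swap-cache branch */
    return page->mapping;
}

/* Helper for experiments: how many head lookups does one call cost? */
static int lookups_for_one_call(struct mock_page *page)
{
    head_lookups = 0;
    (void)mock_page_mapping(page);
    return head_lookups;
}
```

For any page, `lookups_for_one_call` reports 2, even though the second lookup is performed on a pointer that is already known to be the head page.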
1.3 The direct value of folios
1) Far fewer redundant compound_head calls.
2) A hint to developers: when you see a folio, you know it refers to a head page.
Here's an example where our current confusion between "any page"
and "head page" at least produces confusing behaviour, if not an
outright bug, isolate_migratepages_block():
    page = pfn_to_page(low_pfn);
    if (PageCompound(page) && !cc->alloc_contig) {
        const unsigned int order = compound_order(page);
        if (likely(order < MAX_ORDER))
            low_pfn += (1UL << order) - 1;
        goto isolate_fail;
    }
compound_order() does not expect a tail page; it returns 0 unless it's
a head page. I think what we actually want to do here is:
    if (!cc->alloc_contig) {
        struct page *head = compound_head(page);
        if (PageHead(head)) {
            const unsigned int order = compound_order(head);
            low_pfn |= (1UL << order) - 1;
            goto isolate_fail;
        }
    }
Not earth-shattering; not even necessarily a bug. But it's an example
of the way the code reads is different from how the code is executed,
and that's potentially dangerous. Having a different type for tail
and not-tail pages prevents the muddy thinking that can lead to
tail pages being passed to compound_order().
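The difference between the two forms quoted above comes down to pfn arithmetic, which can be worked through with concrete numbers. The helpers `skip_add` and `skip_or` are hypothetical names for the two expressions, and the sketch assumes that a compound page of a given order starts at a pfn aligned to that order.

```c
#include <assert.h>

/* Original form: lands on the last pfn only when low_pfn is the head's pfn. */
static unsigned long skip_add(unsigned long low_pfn, unsigned int order)
{
    return low_pfn + ((1UL << order) - 1);
}

/* Proposed form: lands on the last pfn from anywhere inside the block. */
static unsigned long skip_or(unsigned long low_pfn, unsigned int order)
{
    return low_pfn | ((1UL << order) - 1);
}
```

Take an order-9 compound page covering pfns 0x200 to 0x3ff. Starting at the head, `skip_add(0x200, 9)` gives 0x3ff as intended. Starting at tail pfn 0x205, the original code sees order 0 and `skip_add(0x205, 0)` stays at 0x205, so the scan makes no extra progress. With the head's order, `skip_or(0x205, 9)` gives 0x3ff, whereas `skip_add(0x205, 9)` would overshoot to 0x404.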
1.4 folio-5.16 has been merged
This converts just parts of the core MM and the page cache.
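The conversion can be pictured as a closure that grows outward. At its core is infrastructure such as folio_test_slab(folio) and folio_test_swapcache(folio), and from there it extends upward to folio_mapping. page_mapping has many callers: mem_cgroup_move_account switches to folio_mapping without trouble, while page_evictable still calls page_mapping. That is where the closure stops growing.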
struct address_space *folio_mapping(struct folio *folio)
{
    struct address_space *mapping;
    /* This happens if someone calls flush_dcache_page on slab page */
    if (unlikely(folio_test_slab(folio)))
        return NULL;
    if (unlikely(folio_test_swapcache(folio)))
        return swap_address_space(folio_swap_entry(folio));
    mapping = folio->mapping;
    if ((unsigned long)mapping & PAGE_MAPPING_ANON)
        return NULL;
    return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
}
struct address_space *page_mapping(struct page *page)
{
    return folio_mapping(page_folio(page));
}
mem_cgroup_move_account(page, ...) {
    folio = page_folio(page);
    mapping = folio_mapping(folio);
}
page_evictable(page, ...) {
    ret = !mapping_unevictable(page_mapping(page)) && !PageMlocked(page);
}
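The wrapper pattern above can be modelled in user space to show what a converted caller gains. This is a sketch with hypothetical `mock_` names and a lookup counter added for illustration; it is not the kernel code. A folio-native function does no head lookup at all, and the page-based wrapper pays for one lookup on every call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int head_lookups;            /* counts page -> folio conversions */

struct mock_page {
    unsigned long flags;
    uintptr_t compound_head;        /* tail pages: address of head + 1 */
    void *mapping;
};

struct mock_folio {
    struct mock_page page;
};

static struct mock_folio *mock_page_folio(struct mock_page *page)
{
    uintptr_t head = page->compound_head;

    head_lookups++;
    if (head & 1)
        page = (struct mock_page *)(head - 1);
    return (struct mock_folio *)page;
}

/* Folio-native: the type already guarantees a head page, so no lookup. */
static void *mock_folio_mapping(struct mock_folio *folio)
{
    return folio->page.mapping;
}

/* Compatibility wrapper for callers that have not been converted yet. */
static void *mock_page_mapping(struct mock_page *page)
{
    return mock_folio_mapping(mock_page_folio(page));
}

/* A converted caller resolves the folio once and reuses it. */
static int converted_caller_lookups(struct mock_page *page)
{
    struct mock_folio *folio;

    head_lookups = 0;
    folio = mock_page_folio(page);
    (void)mock_folio_mapping(folio);
    (void)mock_folio_mapping(folio);
    return head_lookups;
}

/* An unconverted caller pays for a lookup on every call. */
static int unconverted_caller_lookups(struct mock_page *page)
{
    head_lookups = 0;
    (void)mock_page_mapping(page);
    (void)mock_page_mapping(page);
    return head_lookups;
}
```

In this model the converted caller performs 1 lookup for two mapping queries and the unconverted caller performs 2. Keeping the page-based function as a thin wrapper is what lets the conversion proceed caller by caller.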
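Everything merged in folio-5.16 is fs-related code. A senior colleague on our team mentioned that “the Linux-mm maintainers do not agree to replace every page with a folio; anonymous pages and slab cannot be converted in the short term.” So I went on reading the Linux-mm mailing list.
2.1 Community discussion of folios
2.1.1 Naming
2.1.2 The FS developers' view
Today the page cache holds 4K pages, and the huge pages that do exist in the page cache are read-only, for example the code hugepage feature (https://openanolis.cn/sig/Cloud-Kernel/doc/475049355931222178). For why transparent huge pages in the page cache have gone unimplemented for so long, see this LWN article (https://lwn.net/Articles/686690/). One reason is that read-write file THP is too complicated to implement on buffer_head-based filesystems, given how they handle the page cache.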
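- buffer_head
buffer_head describes the block-device offset that a piece of physical memory maps to. A buffer_head usually covers 4K as well, so one buffer_head corresponds to exactly one page. Some filesystems use a smaller block size, such as 1K or 512 bytes. In that case a single page can have up to 4 or 8 buffer_head structures describing where its memory lives on disk.
As a result, when handling multi-page reads and writes, every page has to go through get_block to obtain the relationship between the page and its disk offset, which is inefficient and complex.
- iomap
iomap was originally lifted out of XFS. It is extent-based and supports multi-page operation naturally: when handling multi-page reads and writes, a single translation is enough to obtain the relationship between all the pages and their disk offsets.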
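The cost difference between the two models can be estimated with simple arithmetic. The sketch below is a counting model only, assuming 4 KiB pages and one mapping lookup per block in the block-at-a-time case; the function names are hypothetical and nothing here calls into a real filesystem.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_PAGE_SIZE 4096u        /* assumption: 4 KiB base pages */

/* Number of block-sized buffers needed to describe one page. */
static unsigned int buffers_per_page(unsigned int block_size)
{
    assert(block_size != 0 && MOCK_PAGE_SIZE % block_size == 0);
    return MOCK_PAGE_SIZE / block_size;
}

/* Block-at-a-time model: one mapping lookup for every block. */
static size_t lookups_per_block(size_t nr_pages, unsigned int block_size)
{
    return nr_pages * buffers_per_page(block_size);
}

/* Extent model: one lookup per contiguous extent covering the range. */
static size_t lookups_per_extent(size_t nr_pages, size_t pages_per_extent)
{
    assert(pages_per_extent != 0);
    return (nr_pages + pages_per_extent - 1) / pages_per_extent;
}
```

For a 2 MiB range (512 pages), the block-at-a-time model needs 512 lookups with 4K blocks and 2048 with 1K blocks, while a single extent covering the whole range needs 1.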
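With iomap, the filesystem is isolated from the page cache. For example, both sides express sizes in bytes rather than in numbers of pages. Matthew Wilcox therefore suggests that any filesystem that uses the page cache directly should consider moving to iomap or netfs_lib.
Folios may not be the only way to isolate filesystems from the page cache, but alternatives such as scatter-gather were not accepted because the abstraction is too complex.
2.1.3 The MM developers' view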
Unlike the filesystem side, this seems like a lot of churn for very little tangible value. And leaves us with an end result that nobody appears to be terribly excited about.
But the folio abstraction is too low-level to use JUST for file cache and NOT for anon. It's too close to the page layer itself and would duplicate too much of it to be maintainable side by side.
2.1.4 Reaching consensus
2.2 The deeper value of folios
I think the problem with folio is that everybody wants to read in her/his hopes and dreams into it and gets disappointed when see their somewhat related problem doesn't get magically fixed with folio.
Folio started as a way to relief pain from dealing with compound pages. It provides an unified view on base pages and compound pages. That's it.
It is required ground work for wider adoption of compound pages in page cache. But it also will be useful for anon THP and hugetlb.
Based on adoption rate and resulting code, the new abstraction has nice downstream effects. It may be suitable for more than it was intended for initially. That's great.
But if it doesn't solve your problem... well, sorry...
The patchset makes a nice step forward and cuts back on mess I created on the way to huge-tmpfs.
I would be glad to see the patchset upstream.
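Everyone knows about "the mess around struct page", yet nobody set out to fix it. People silently put up with this long-standing nuisance, and the code is littered with fragments like the following.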
if (PageCompound(page))
    /* do A */;
else
    /* do B */;
3.1 The folio development plan
For 5.17, we intend to convert various filesystems (XFS and AFS are ready; other filesystems may make it) and also convert more of the MM and page cache to folios. For 5.18, multi-page folios should be ready.
3.2 Folios can also improve performance
The 80% win is real, but appears to be an artificial benchmark (postgres startup, which isn't a serious workload). Real workloads (eg building the kernel, running postgres in a steady state, etc) seem to benefit between 0-10%.
3.3 How should I use folios?
This article was shared from the WeChat official account OpenAnolis龙蜥 (OpenAnolis).