# Revisiting Linux I/O

2013/05/29 09:01

OK, back to the topic. Let me start with the stages my understanding went through.

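1. At first, my understanding of the Linux IO layer started with the standard IO library. From "The C Programming Language" (TCPL) I learned about <stdio.h> and the simple FILE * interface for reading and writing files.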

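4. "Advanced Programming in the UNIX Environment" (APUE) mentions mmap, another IO mechanism, which performs IO by mapping a file into memory. Another of Stevens's books, "UNIX Network Programming, Volume 2", covers it in more detail.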
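To make the idea of file-mapped IO concrete, here is a minimal sketch in C that reads a file through mmap instead of read(). The helper name count_lines_mmap is made up for illustration and does not come from either book; the calls themselves (open, fstat, mmap, munmap) are the standard POSIX ones.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count '\n' bytes in a file by mapping it instead of calling read().
 * Returns -1 on error. count_lines_mmap is a hypothetical helper name. */
long count_lines_mmap(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {              /* mmap rejects zero-length mappings */
        close(fd);
        return 0;
    }

    const char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                          /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    long lines = 0;
    for (off_t i = 0; i < st.st_size; i++)   /* each first touch of a page may fault */
        if (p[i] == '\n')
            lines++;

    munmap((void *)p, (size_t)st.st_size);
    return lines;
}
```

There is no read() call here at all: the kernel brings pages in on demand as the loop touches them, which is exactly the page-fault cost discussed later in this post.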

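5. Everything above is about using the system interfaces. Last year I skimmed "Computer Systems: A Programmer's Perspective" (CSAPP) and got some understanding of Linux's virtual memory mechanism. There is a very good summary article, "An Analysis of System Calls", so I will not repeat it here. The most direct approach is of course to dig into the kernel itself, but I am still somewhat intimidated by the kernel, so that is on hold for now.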

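6. While searching the web, however, I came across Linus's 2002 discussion of O_DIRECT. It finally showed me why the Linux IO system is designed the way it is, why asynchronous IO is still not mature today, and the answers to a whole series of related questions. Below I comment on and excerpt from that thread.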

-------------------------------------------------------------------------------------------------------------------------

Linus fires back:

The thing that has always disturbed me about O_DIRECT is that the whole
interface is just stupid, and was probably designed by a deranged monkey
on some serious mind-controlling substances [*].
[*] In other words, it's an Oracleism. 

In short: the O_DIRECT interface is plain stupid, probably designed by a bunch of deranged monkeys on drugs. [*]

[*] Put another way, it is Oracle fundamentalism.

Next, Linus proposes a new set of interfaces:

- readahead(fd, offset, size)

Obvious (except the readahead is free to ignore the size, it's just a
hint)

- mmap( MAP_UNCACHED )

This only sets up the "vma" descriptor (like all other MMAP's). It's
exactly like a regular private mapping, except instead of just
incrementing the page count on a page-in, it will look at whether the
page can just be removed from the page cache and inserted as a private
page into the mapping ("stealing" the page).

- fdatasync_area( fd, offset, len)

Obvious. It's fdatasync, except it only guarantees the specific range.

- mwrite(fd, addr, len)

This really does the "reverse" of mmap(MAP_UNCACHED) (and like a
mapping, addr/len have to be page-aligned).

This walks the page tables, and does the _smart_ thing:

- if no mapping exists, it looks at the backing store of the vma,
and gets the page directly from the backing store instead of
bothering to populate the page tables.

- if the mapped page exists, it removes it from the page table

- in either case, it moves the page it got into the page cache of the
destination file descriptor.

NOTE on zero-copy / no-page-fault behaviour:
- mwrite has to walk the page tables _anyway_ (the same as O_DIRECT),
since that's the only way to do zero-copy.
- since mwrite has to do that part, it's trivial to notice that the page
tables don't exist. In fact, it's a very natural result of the whole
algorithm.
- if user space doesn't touch the mapping itself in any way (other than
point mwrite() at it), you never build up any page tables at all, and
you never even need to touch the TLB (ie no flushes, no nothing).
- note how even "mmap( MAP_UNCACHED )" doesn't actually touch the TLB or
the page tables (unless it uses MAP_FIXED and you use it to unmap a
previous area, of course - that's all in the normal mmap code already)

See?
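Of the interfaces in the list above, readahead is the one that can be tried directly: Linux provides a readahead(2) system call with essentially this shape. The sketch below is only an illustration of using it as a hint before a read. The helper name hinted_read is made up, and MAP_UNCACHED, fdatasync_area and mwrite are not shown because they are proposals from the email, not calls assumed to exist.

```c
#define _GNU_SOURCE            /* readahead() is Linux-specific */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Ask the kernel to populate the page cache for [offset, offset + len),
 * then read that range. hinted_read is a hypothetical helper name. */
ssize_t hinted_read(int fd, void *buf, size_t len, off_t offset)
{
    assert(buf != NULL);

    /* Only a hint: a failure here (for example on a file type that does
     * not support readahead) is deliberately ignored. */
    (void)readahead(fd, offset, len);

    return pread(fd, buf, len, offset);
}
```

The read itself goes through the page cache as usual, so the result is the same with or without the hint; only the timing may differ.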

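So why has Linus consistently rejected O_DIRECT, a supposedly "efficient" way of doing synchronous IO by bypassing the page cache? He goes on to give three reasons for the page cache's design, including: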

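- a staging area that preserves the block nature of the filesystem, so that ordinary read and write do not have to worry about alignment
- a cache layer (for performance)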
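The alignment point is easiest to see from what a caller has to do once the page cache is bypassed. The following is a rough sketch of an O_DIRECT read, assuming a 4096-byte alignment requirement; the real requirement depends on the device and filesystem, and some filesystems reject O_DIRECT altogether. The names DIRECT_ALIGN, align_up and read_direct are hypothetical.

```c
#define _GNU_SOURCE            /* O_DIRECT is a Linux extension */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define DIRECT_ALIGN 4096      /* assumed alignment, not a universal constant */

/* Round n up to the next multiple of DIRECT_ALIGN. */
size_t align_up(size_t n)
{
    return (n + DIRECT_ALIGN - 1) / DIRECT_ALIGN * DIRECT_ALIGN;
}

/* Read up to align_up(want) bytes from the start of `path`, bypassing the
 * page cache. On success stores an aligned buffer in *out (caller frees)
 * and returns the byte count; returns -1 if any call fails. */
ssize_t read_direct(const char *path, size_t want, void **out)
{
    assert(path != NULL && out != NULL);

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;             /* e.g. the filesystem does not support O_DIRECT */

    void *buf = NULL;
    size_t len = align_up(want);
    if (len == 0 || posix_memalign(&buf, DIRECT_ALIGN, len) != 0) {
        close(fd);
        return -1;
    }

    ssize_t n = read(fd, buf, len);   /* buffer, offset 0 and len are all aligned */
    close(fd);
    if (n < 0) {
        free(buf);
        return -1;
    }
    *out = buf;
    return n;
}
```

With a plain buffered read() none of this bookkeeping is needed: any buffer address, offset and length will do, because the page cache absorbs the mismatch between byte-oriented calls and block-oriented storage.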

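Linus considers simply bypassing the page cache a mediocre idea; people are too fixated on the notion of "bypassing the cache and going straight to disk".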

Yes. However, it's even _nicer_ if you don't need to walk the page tables
at all.

Quite a lot of operations could be done directly on the page cache. I'm
not a huge fan of mmap() myself - the biggest advantage of mmap is when
you don't know your access patterns, and you have reasonably good
locality. In many other cases mmap is just a total loss, because the page
table walking is often more expensive than even a memcpy().

That's _especially_ true if you have to move mappings around, and you have
to invalidate TLB's.

memcpy() often gets a bad name. Yeah, memory is slow, but especially if
you copy something you just worked on, you're actually often better off
letting the CPU cache do its job, rather than walking page tables and
trying to be clever.

Just as an example: copying often means that you don't need nearly as much
locking and synchronization - which in turn avoids one whole big mess
(yes, the memcpy() will look very hot in profiles, but then doing extra
work to avoid the memcpy() will cause spread-out overhead that is a lot
worse and harder to think about).

This is why a simple read()/write() loop often _beats_ mmap approaches.
And often it's actually better to not even have big buffers (ie the old
"avoid system calls by aggregation" approach) because that just blows your
cache away.

Right now, the fastest way to copy a file is apparently by doing lots of
~8kB read/write pairs (that data may be slightly stale, but it was true at
some point). Never mind the system call overhead - just having the extra
buffer stay in the L1 cache and avoiding page faults from mmap is a bigger
win.

And I don't think mmap _can_ beat that. It's fundamental.

In contrast, direct page cache accesses really can do so. Exactly because
they don't touch any page tables at all, and because they can take
advantage of internal kernel data structure layout and move pages around
without any cost..

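In other words, memcpy has a bad reputation because memory is slow, but in practice much of the copying is served from the CPU cache. mmap, by contrast, has to walk the page tables, and every page fault traps into the kernel. So a loop of 8 KB read/write calls is often faster than mmap, as long as the 8 KB buffer stays in the L1 cache. If the smart mwrite that Linus describes were implemented, however, the page tables would never need to be populated, and the work would be done by moving pages within the page cache.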
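The read/write loop Linus has in mind can be sketched in a few lines of C. This is a minimal illustration, not a benchmark: the 8192-byte CHUNK follows the "~8kB" figure in the quote, and the function name copy_fd is made up.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#define CHUNK 8192             /* the "~8kB" size mentioned in the email */

/* Copy everything from in_fd to out_fd with a small reusable buffer.
 * Returns the number of bytes copied, or -1 on error. */
long long copy_fd(int in_fd, int out_fd)
{
    char buf[CHUNK];           /* small enough to stay hot in the CPU cache */
    long long total = 0;

    assert(in_fd >= 0 && out_fd >= 0);

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n == 0)
            return total;      /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;      /* interrupted, retry the read */
            return -1;
        }

        ssize_t off = 0;
        while (off < n) {      /* write() may be short, so loop until done */
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

The same buffer is reused on every iteration, so after the first pass it is touched only through the cache, and the loop never takes a page fault on file data.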

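Throughout the mailing list thread, Linus repeatedly attacks Oracle and the database crowd, precisely because their abuse of O_DIRECT undermines the integrity of the interface.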

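In this thread I also discovered the Linux system calls splice/vmsplice, which can copy data between two file descriptors very quickly; see "The splice Family of System Calls".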
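As a rough sketch of how splice is used: one end of every splice() call must be a pipe, so a file-to-file copy goes file, pipe, file. The helper name splice_copy and the SPLICE_CHUNK size are assumptions made for this example; splice() itself and the SPLICE_F_MOVE flag are the real Linux API.

```c
#define _GNU_SOURCE            /* splice() is Linux-specific */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define SPLICE_CHUNK 65536     /* assumed chunk size, the usual default pipe capacity */

/* Copy up to len bytes from in_fd to out_fd through a pipe, without
 * passing the data through a user-space buffer.
 * Returns the number of bytes copied, or -1 on error. */
long long splice_copy(int in_fd, int out_fd, size_t len)
{
    int p[2];
    long long total = 0;

    assert(in_fd >= 0 && out_fd >= 0);
    if (pipe(p) < 0)
        return -1;

    while (len > 0) {
        size_t want = len < SPLICE_CHUNK ? len : SPLICE_CHUNK;

        /* Stage 1: move data from the input file into the pipe. */
        ssize_t in = splice(in_fd, NULL, p[1], NULL, want, SPLICE_F_MOVE);
        if (in < 0) {
            total = -1;
            break;
        }
        if (in == 0)
            break;             /* end of input */

        /* Stage 2: drain the pipe into the output file. */
        ssize_t left = in;
        while (left > 0) {
            ssize_t out = splice(p[0], NULL, out_fd, NULL, (size_t)left, SPLICE_F_MOVE);
            if (out <= 0) {
                total = -1;
                break;
            }
            left -= out;
        }
        if (total < 0)
            break;

        total += in;
        len -= (size_t)in;
    }

    close(p[0]);
    close(p[1]);
    return total;
}
```

Compared with a read/write loop, the data here never enters a user-space buffer; whether that is actually faster for a given workload is something to measure rather than assume.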

On the page cache itself, I found another very good article: "Exploring the Linux Cache Mechanism".

The thread has a few other fun points:

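a. Linus believes that large numbers of small machines will beat small numbers of big machines, and he sees this as the reason Windows and Linux have succeeded over the past twenty years. To him, that is beauty. As things stand now, smartphones look like part of the same process.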

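b. Linus predicts that as memory grows, in-memory databases will kill off the current pile of junk (the databases of the time). In-memory databases will solve today's IO problems, and the only disk operations left will be writing logs and reading backups. Ten years have passed since he made that prediction, and things seem to be developing just as he said.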

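c. Linus says nobody should take him too seriously: he knows what he believes, but part of the beauty of Linux is that what Linus believes does not really matter.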

Btw, anybody that takes me too seriously is an idiot. I know what _I_
believe in, but part of the beauty of Linux is that what I believe doesn't
really matter all that much.

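d. Larry McVoy shows up in the thread. I later learned that he is a fairly well-known kernel developer who also built the commercial version control system BitKeeper, which Linus adopted for managing the Linux kernel source. The two later parted ways (see "The BitKeeper Relationship Comes to an End"), and Linus went on to write his own version control system, git.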

-------------------------------------------------------------------------------------------------------------------------

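7. I ran some experiments comparing the performance of fwrite/write/mmap. The results are quite interesting, but I have not written them up yet, so stay tuned for the next post.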

