Rethinking Linux I/O

2013/05/29 09:01

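Well, the main reason I got interested in Linux I/O is that I wanted zlog to write log files a bit faster. zlog is a user-space library, but making it faster requires some understanding of the underlying Linux mechanisms.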

OK, back to the topic. Let me go through the stages my understanding went through.

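1. At first, my knowledge of the Linux I/O layer started with the standard I/O library. From "The C Programming Language" (TCPL) I learned about <stdio.h> and the simple FILE * interface for reading and writing files.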

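2. After reading "Advanced Programming in the UNIX Environment" (APUE), I learned that stdio.h implements read/write buffering in order to reduce system call overhead, and that the system calls doing the real work underneath are read and write.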

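3. After a close reading of APUE, I learned that read and write do not touch the device directly either. They copy data from user-space memory into a kernel buffer (the page cache), or the other way around; this is a second level of buffering. The kernel needs to merge the reads and writes of multiple processes and put them on a write queue. From here on I understood that stdio.h and read/write are both synchronous I/O. There is also asynchronous I/O, but as of this writing there is no mature asynchronous I/O library on Linux. On that topic there is an article, "Linux kernel AIO这个奇葩" (roughly "Linux kernel AIO, that oddball").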

4. APUE mentions mmap, another I/O mechanism, which performs I/O through file mapping. Another book by Stevens, "UNIX Network Programming, Volume 2", covers it in more detail.

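5. Everything above is about using the system interfaces. Last year I skimmed "Computer Systems: A Programmer's Perspective" (CSAPP) and got some understanding of Linux's virtual memory mechanism. There is a very good summary article, "系统调用分析" ("System Call Analysis"), so I won't repeat it here. The most direct approach is of course to dig into the kernel itself, but I still have a certain fear of the kernel, so that is shelved for now.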

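6. While searching the web, though, I came across a 2002 discussion by Linus about O_DIRECT. It finally showed me why the Linux I/O system is designed the way it is, why asynchronous I/O still has not matured, and the answers to a whole series of other questions. Next I am going to rant about and excerpt that thread.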
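
Before getting to that thread, the three layers from stages 2 to 4 can be put side by side. This is a minimal sketch, not zlog's code: the function names, the file paths and the message are invented for illustration, and error handling is kept short.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Stage 2: stdio. fwrite() copies into a user-space buffer; the write()
 * system call happens only when the buffer fills, or on fflush/fclose. */
int put_stdio(const char *path, const char *msg, int times)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    size_t len = strlen(msg);
    for (int i = 0; i < times; i++) {
        if (fwrite(msg, 1, len, fp) != len) {
            fclose(fp);
            return -1;
        }
    }
    return fclose(fp) == 0 ? 0 : -1;        /* fclose flushes the buffer */
}

/* Stage 3: write(). Each call is a system call that copies the bytes
 * into the page cache; fsync() asks the kernel to push them to disk. */
int put_write(const char *path, const char *msg, int times)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    size_t len = strlen(msg);
    for (int i = 0; i < times; i++) {
        /* A short write is treated as an error to keep the sketch small. */
        if (write(fd, msg, len) != (ssize_t)len) {
            close(fd);
            return -1;
        }
    }
    int rc = fsync(fd);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}

/* Stage 4: mmap(). The file's pages are mapped into the process, and
 * plain memory stores replace the read/write calls. */
int put_mmap(const char *path, const char *msg, int times)
{
    size_t len = strlen(msg);
    size_t total = len * (size_t)times;
    if (total == 0)
        return -1;                           /* mmap rejects a zero length */
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)total) != 0) {  /* size the file first */
        close(fd);
        return -1;
    }
    char *map = mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                               /* the mapping outlives the fd */
    if (map == MAP_FAILED)
        return -1;
    for (int i = 0; i < times; i++)
        memcpy(map + (size_t)i * len, msg, len);
    int rc = msync(map, total, MS_SYNC);     /* write dirty pages back */
    if (munmap(map, total) != 0)
        rc = -1;
    return rc;
}

/* Size of a file in bytes, or -1 on error (helper for checking results). */
long file_size(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    off_t end = lseek(fd, 0, SEEK_END);
    close(fd);
    return (long)end;
}
```

All three functions produce the same file contents. Running them under strace should show the difference in how they get there: put_stdio typically issues far fewer write calls than put_write for the same data, and put_mmap issues none for the payload itself.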

-------------------------------------------------------------------------------------------------------------------------

    Background: someone claimed that opening a file with O_DIRECT makes read and write extremely fast, which provoked Linus.

    Linus fired back:

The thing that has always disturbed me about O_DIRECT is that the whole
interface is just stupid, and was probably designed by a deranged monkey
on some serious mind-controlling substances [*].
[*] In other words, it's an Oracleism. 

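    In plain words: the O_DIRECT interface is just stupid, probably designed by a deranged monkey on mind-altering drugs. [*]

    [*] Put differently, it is an "Oracleism", that is, something done the Oracle way.

    Linus then proposed a new set of interfaces: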

- readahead(fd, offset, size)

   Obvious (except the readahead is free to ignore the size, it's just a
   hint)

 - mmap( MAP_UNCACHED )

   This only sets up the "vma" descriptor (like all other MMAP's). It's
   exactly like a regular private mapping, except instead of just
   incrementing the page count on a page-in, it will look at whether the
   page can just be removed from the page cache and inserted as a private
   page into the mapping ("stealing" the page).

 - fdatasync_area( fd, offset, len)

   Obvious. It's fdatasync, except it only guarantees the specific range.

 - mwrite(fd, addr, len)

   This is really does the "reverse" of mmap(MAP_UNCACHED) (and like a
   mapping, addr/len have to be page-aligned).

   This walks the page tables, and does the _smart_ thing:

    - if no mapping exists, it looks at the backing store of the vma,
      and gets the page directly from the backing store instead of
      bothering to populate the page tables.

    - if the mapped page exists, it removes it from the page table

    - in either case, it moves the page it got into the page cache of the
      destination file descriptor.

NOTE on zero-copy / no-page-fault behaviour:
 - mwrite has to walk the page tables _anyway_ (the same as O_DIRECT),
   since that's the only way to do zero-copy.
 - since mwrite has to do that part, it's trivial to notice that the page
   tables don't exist. In fact, it's a very natural result of the whole
   algorithm.
 - if user space doesn't touch the mapping itself in any way (other than
   point mwrite() at it), you never build up any page tables at all, and
   you never even need to touch the TLB (ie no flushes, no nothing).
 - note how even "mmap( MAP_UNCACHED )" doesn't actually touch the TLB or
   the page tables (unless it uses MAP_FIXED and you use it to unmap a
   previous area, of course - that's all in the normal mmap code already)

See?

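    The advantage of this set of interfaces is that between the page cache layer and user-space memory, data is moved rather than copied. By mapping memory into the user process's address space and modifying the page tables, it achieves zero copy. In practice, Linux today has essentially implemented readahead and mmap, while the proposed mwrite and fdatasync_area have not been implemented.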
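
    Two of the four proposed calls have rough counterparts on present-day Linux that can be tried out. readahead(2) matches the first item. For the third, sync_file_range(2) is my own stand-in, not the fdatasync_area from the email: it writes back a byte range but, according to its man page, does not flush file metadata, so its guarantees are weaker. A minimal sketch, with made-up function names and a 4 KiB block chosen only for illustration:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* readahead() and sync_file_range() are Linux-specific */
#endif
#include <fcntl.h>
#include <unistd.h>

/* Hint that bytes [offset, offset + len) of fd will be read soon.
 * It is only a hint, so a failure is ignored on purpose. */
void hint_readahead(int fd, off_t offset, size_t len)
{
    (void)readahead(fd, offset, len);
}

/* Write back only the given byte range and wait for it to finish.
 * File metadata is not flushed, so this is weaker than fdatasync(). */
int flush_range(int fd, off_t offset, off_t len)
{
    return sync_file_range(fd, offset, len,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}

/* Hypothetical driver: write one 4 KiB block, flush just that range,
 * then ask for it to be read ahead. Returns 0 on success, -1 on error. */
int range_demo(const char *path)
{
    char block[4096] = {0};
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, block, sizeof block) != (ssize_t)sizeof block) {
        close(fd);
        return -1;
    }
    int rc = flush_range(fd, 0, (off_t)sizeof block);
    hint_readahead(fd, 0, sizeof block);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```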

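    So why has Linus consistently rejected O_DIRECT, this "efficient" way of doing synchronous I/O by bypassing the page cache? He goes on to give three reasons behind the design of the page cache:

- A staging area that preserves the block nature of the filesystem, so that ordinary read and write do not need to worry about alignment
- A synchronization entity that keeps read and write from interfering with each other and lets mmap work concurrently
- A cache layer (for performance)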
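
    The first point is easy to observe from user space. Through the page cache, an ordinary pwrite/pread pair may use any offset, any length and any buffer address. With O_DIRECT, by contrast, the open(2) man page warns that alignment restrictions may apply to the buffer address, the length and the file offset. A small sketch of the unaligned case (the function name and the offset 4095 are arbitrary choices of mine):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write 3 bytes at an odd offset that straddles a 4096-byte boundary,
 * then read them back. Returns 0 if the bytes round-trip, -1 otherwise. */
int unaligned_roundtrip(const char *path)
{
    char out[3];
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Neither the offset nor the length is a multiple of any block size;
     * the page cache absorbs the partial-block update for us. */
    if (pwrite(fd, "abc", 3, 4095) != 3 || pread(fd, out, 3, 4095) != 3) {
        close(fd);
        return -1;
    }
    close(fd);
    return memcmp(out, "abc", 3) == 0 ? 0 : -1;
}
```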

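    Linus argues that simply bypassing the page cache is the unimaginative answer; people are too fixated on the idea of "skipping the cache and going straight to disk".

    Comparing the performance of read/write and mmap in more depth, Linus says: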

Yes. However, it's even _nicer_ if you don't need to walk the page tables
at all.

Quite a lot of operations could be done directly on the page cache. I'm
not a huge fan of mmap() myself - the biggest advantage of mmap is when
you don't know your access patterns, and you have reasonably good
locality. In many other cases mmap is just a total loss, because the page
table walking is often more expensive than even a memcpy().

That's _especially_ true if you have to move mappings around, and you have
to invalidate TLB's.

memcpy() often gets a bad name. Yeah, memory is slow, but especially if
you copy something you just worked on, you're actually often better off
letting the CPU cache do its job, rather than walking page tables and
trying to be clever.

Just as an example: copying often means that you don't need nearly as much
locking and synchronization - which in turn avoids one whole big mess
(yes, the memcpy() will look very hot in profiles, but then doing extra
work to avoid the memcpy() will cause spread-out overhead that is a lot
worse and harder to think about).

This is why a simple read()/write() loop often _beats_ mmap approaches.
And often it's actually better to not even have big buffers (ie the old
"avoid system calls by aggregation" approach) because that just blows your
cache away.

Right now, the fastest way to copy a file is apparently by doing lots of
~8kB read/write pairs (that data may be slightly stale, but it was true at
some point). Never mind the system call overhead - just having the extra
buffer stay in the L1 cache and avoiding page faults from mmap is a bigger
win.

And I don't think mmap _can_ beat that. It's fundamental.

In contrast, direct page cache accesses really can do so. Exactly because
they don't touch any page tables at all, and because they can take
advantage of internal kernel data structure layout and move pages around
without any cost..

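    In other words, memcpy has a bad reputation because memory is slow, but much of memcpy's work is actually done inside the CPU cache. By comparison, mmap has to walk the page tables, and every page fault means trapping into the kernel. So reading and writing in 8KB chunks is often faster than mmap, as long as those 8KB stay in the L1 cache. But if the smart mwrite Linus describes were implemented, and user space never touched the mapping itself, no page tables would ever need to be built, and the work would be done within the page cache alone.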
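
    The "lots of ~8kB read/write pairs" loop that Linus mentions can be sketched as follows. The function name and the chunk-size constant are my own choices for illustration; whether 8 KiB is still the sweet spot depends on the machine, and Linus himself notes that the figure may be stale.

```c
#include <fcntl.h>
#include <unistd.h>

enum { CHUNK = 8 * 1024 };   /* small enough to stay hot in the CPU cache */

/* Copy src to dst with a plain read()/write() loop.
 * Returns the number of bytes copied, or -1 on error. */
long copy_file_rw(const char *src, const char *dst)
{
    char buf[CHUNK];
    long total = 0;

    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    for (;;) {
        ssize_t n = read(in, buf, sizeof buf);
        if (n == 0)
            break;                       /* end of file */
        if (n < 0)
            goto fail;
        /* write() may accept fewer bytes than asked, so loop until done. */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w < 0)
                goto fail;
            off += w;
        }
        total += n;
    }
    close(in);
    return close(out) == 0 ? total : -1;

fail:
    close(in);
    close(out);
    return -1;
}
```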

    On the mailing list, Linus repeatedly attacks Oracle and the database crowd, precisely because their abuse of O_DIRECT damages the integrity of the interface.

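    Through this thread I also discovered the Linux system calls splice/vmsplice, which can copy data between two file descriptors very efficiently. For details see the article "splice系列系统调用" ("The splice family of system calls").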
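
    A rough sketch of a file-to-file copy with splice(2). The man page requires one side of each splice call to be a pipe, so the data goes file, pipe, file. The function name and the 64 KiB request size are assumptions of mine, and this is only an illustration, not a benchmark-ready implementation.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* splice() is Linux-specific */
#endif
#include <fcntl.h>
#include <unistd.h>

enum { PIPE_CHUNK = 64 * 1024 };   /* how much to request per splice call */

/* Copy src to dst through a pipe using splice().
 * Returns the number of bytes copied, or -1 on error. */
long copy_file_splice(const char *src, const char *dst)
{
    int pipefd[2];
    long total = -1;
    long copied = 0;

    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }
    if (pipe(pipefd) != 0) {
        close(in);
        close(out);
        return -1;
    }

    for (;;) {
        /* file -> pipe: no user-space buffer is involved. */
        ssize_t n = splice(in, NULL, pipefd[1], NULL, PIPE_CHUNK, 0);
        if (n < 0)
            goto done;
        if (n == 0) {                    /* end of file */
            total = copied;
            goto done;
        }
        /* pipe -> file: drain everything that was just queued. */
        while (n > 0) {
            ssize_t w = splice(pipefd[0], NULL, out, NULL, (size_t)n, 0);
            if (w <= 0)
                goto done;
            n -= w;
            copied += w;
        }
    }

done:
    close(pipefd[0]);
    close(pipefd[1]);
    close(in);
    close(out);
    return total;
}
```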

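    On the page cache, I found another very good article: "Linux Cache 机制探究" ("Exploring the Linux Cache Mechanism").

    The thread has some other fun points as well:

    a. Linus believes that large numbers of small machines will beat small numbers of big machines, and that this is why Windows and Linux have succeeded over the past twenty years. That is where the beauty lies. As things look now, smartphones are part of the same process.

    b. Linus predicts that as memory grows, in-memory databases will kill off the current pile of garbage (the databases of the time). In-memory databases will solve today's I/O problems, and the only disk operations left will be writing logs and reading backups. Ten years have passed since he made that prediction, and things seem to be developing just as he said.

    c. Linus tells everyone not to take him too seriously: he knows what he believes in, but part of the beauty of Linux is that what Linus believes does not really matter.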

Btw, anybody that takes me too seriously is an idiot. I know what _I_
believe in, but part of the beauty of Linux is that what I believe doesn't
really matter all that much.

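    d. Larry McVoy shows up on the mailing list. I later found out that he is a fairly well-known kernel hacker, and that he also built the commercial version control software BitKeeper, which Linus used to manage the Linux kernel source. The two later parted ways; see "BitKeeper姻缘了断" (roughly "The BitKeeper affair comes to an end"). Linus then developed his own version control system, git.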

-------------------------------------------------------------------------------------------------------------------------

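7. I ran some experiments comparing the performance of fwrite/write/mmap. The conclusions are quite interesting, but I have not finished writing them up yet, so stay tuned for the next post.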

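Comments:

Learned something.
2013/10/29 21:03

Quoting 孙新建: "Computer Systems: A Programmer's Perspective is a real pain to read, far too hard to follow."
It covers so much... I'm reading it now... my background clearly isn't up to it.
2013/06/02 19:20

Looks like the blogger is another programmer who likes digging into the low-level stuff. File systems were my main focus at university; I was very curious about how they work, and my graduation project was about hardening the security of Linux file systems. After graduating I moved to web work (I didn't want to do low-level work, too tiring) and let it lapse, which is unfair to my Linux kernel books; these days I read them in my spare time. I am also responsible for my company's distributed file system storage, and my old fondness for file systems helps a lot there. You should take a look at flashcache, the one Facebook open-sourced, and do another analysis.
2013/05/31 12:38

Quoting 难易, who was quoting 2007robot: "Do you have time to analyze how sockets work on Linux? I'm reading that part of APUE (2nd edition)." / "Sockets are an even bigger pit. You would at least have to read those thick TCP/IP volumes and then analyze the kernel's TCP stack implementation to get it right."
The implementation details don't need to be chased down. The point is to analyze how it works through examples, which would help in understanding socket applications.
2013/05/30 10:20

难易 (blog author), quoting 2007robot: "Do you have time to analyze how sockets work on Linux? I'm reading that part of APUE (2nd edition)."
Sockets are an even bigger pit. You would at least have to read those thick TCP/IP volumes and then analyze the kernel's TCP stack implementation to get it right.
2013/05/30 09:21

Computer Systems: A Programmer's Perspective is a real pain to read, far too hard to follow.
2013/05/30 08:22

I'm about to split a terabyte-scale file and need a fast way to do it. Bookmarking this for now; no time to read it today.
2013/05/29 22:45

Do you have time to analyze how sockets work on Linux? I'm reading that part of APUE (2nd edition).
2013/05/29 21:03

In earlier tests on HP-UX, Oracle's performance improved a great deal after asynchronous I/O was turned on.
2013/05/29 13:44

Looking forward to the performance comparison post~
2013/05/29 09:56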