[转载]What Your Computer Does While You Wait

发布于 06/19 03:33
字数 1340
阅读 4
收藏 0

What Your Computer Does While You Wait


Dec 1st, 2008

This post takes a look at the speed - latency and throughput - of various subsystems in a modern commodity PC, an Intel Core 2 Duo at 3.0GHz. I hope to give a feel for the relative speed of each component and a cheatsheet for back-of-the-envelope performance calculations. I've tried to show real-world throughputs (the sources are posted as a comment) rather than theoretical maximums. Time units are nanoseconds (ns, 10-9 seconds), milliseconds (ms, 10-3seconds), and seconds (s). Throughput units are in megabytes and gigabytes per second. Let's start with CPU and memory, the north of the northbridge:

北桥 ![Latency and throughput in an Intel Core 2 Duo computer, North Side]

The first thing that jumps out is how absurdly fast our processors are. Most simple instructions on the Core 2 take one clock cycle to execute, hence a third of a nanosecond at 3.0Ghz. For reference, light only travels ~4 inches (10 cm) in the time taken by a clock cycle. It's worth keeping this in mind when you're thinking of optimization - instructions are comically cheap to execute nowadays.

As the CPU works away, it must read from and write to system memory, which it accesses via the L1 and L2 caches. The caches use static RAM, a much faster (and expensive) type of memory than the DRAM memory used as the main system memory. The caches are part of the processor itself and for the pricier memory we get very low latency. One way in which instruction-level optimization is still very relevant is code size. Due to caching, there can be massive performance differences between code that fits wholly into the L1/L2 caches and code that needs to be marshalled into and out of the caches as it executes.

Normally when the CPU needs to touch the contents of a memory region they must either be in the L1/L2 caches already or be brought in from the main system memory. Here we see our first major hit, a massive ~250 cycles of latency that often leads to a stall, when the CPU has no work to do while it waits. To put this into perspective, reading from L1 cache is like grabbing a piece of paper from your desk (3 seconds), L2 cache is picking up a book from a nearby shelf (14 seconds), and main system memory is taking a 4-minute walk down the hall to buy a Twix bar.

The exact latency of main memory is variable and depends on the application and many other factors. For example, it depends on the CAS latency and specifications of the actual RAM stick that is in the computer. It also depends on how successful the processor is at prefetching - guessing which parts of memory will be needed based on the code that is executing and having them brought into the caches ahead of time.

Looking at L1/L2 cache performance versus main memory performance, it is clear how much there is to gain from larger L2 caches and from applications designed to use it well. For a discussion of all things memory, see Ulrich Drepper's What Every Programmer Should Know About Memory (pdf), a fine paper on the subject.

People refer to the bottleneck between CPU and memory as the von Neumann bottleneck. Now, the front side bus bandwidth, ~10GB/s, actually looks decent. At that rate, you could read all of 8GB of system memory in less than one second or read 100 bytes in 10ns. Sadly this throughput is a theoretical maximum (unlike most others in the diagram) and cannot be achieved due to delays in the main RAM circuitry. Many discrete wait periods are required when accessing memory. The electrical protocol for access calls for delays after a memory row is selected, after a column is selected, before data can be read reliably, and so on. The use of capacitors calls for periodic refreshes of the data stored in memory lest some bits get corrupted, which adds further overhead. Certain consecutive memory accesses may happen more quickly but there are still delays, and more so for random access. Latency is always present.

南桥 Down in the southbridge we have a number of other buses (e.g., PCIe, USB) and peripherals connected:

![Latency and throughput in an Intel Core 2 Duo computer, South Side 延迟和吞吐Intel Core 2 南桥] 延迟和吞吐Intel Core 2  南桥 (https://oscimg.oschina.net/oscnet/b2355247e46f6f359675e1ea0371f491e83.jpg)

Sadly the southbridge hosts some truly sluggish performers, for even main memory is blazing fast compared to hard drives. Keeping with the office analogy, waiting for a hard drive seek is like leaving the building to roam the earth for one year and three months. This is why so many workloads are dominated by disk I/O and why database performance can drive off a cliff once the in-memory buffers are exhausted. It is also why plentiful RAM (for buffering) and fast hard drives are so important for overall system performance.

While the "sustained" disk throughput is real in the sense that it is actually achieved by the disk in real-world situations, it does not tell the whole story. The bane of disk performance are seeks, which involve moving the read/write heads across the platter to the right track and then waiting for the platter to spin around to the right position so that the desired sector can be read. Disk RPMs refer to the speed of rotation of the platters: the faster the RPMs, the less time you wait on average for the rotation to give you the desired sector, hence higher RPMs mean faster disks. A cool place to read about the impact of seeks is the paper where a couple of Stanford grad students describe the Anatomy of a Large-Scale Hypertextual Web Search Engine (pdf).

When the disk is reading one large continuous file it achieves greater sustained read speeds due to the lack of seeks. Filesystem defragmentation aims to keep files in continuous chunks on the disk to minimize seeks and boost throughput. When it comes to how fast a computer feels, sustained throughput is less important than seek times and the number of random I/O operations (reads/writes) that a disk can do per time unit. Solid state disks can make for a great option here.

Hard drive caches also help performance. Their tiny size - a 16MB cache in a 750GB drive covers only 0.002% of the disk - suggest they're useless, but in reality their contribution is allowing a disk to queue up writes and then perform them in one bunch, thereby allowing the disk to plan the order of the writes in a way that - surprise - minimizes seeks. Reads can also be grouped in this way for performance, and both the OS and the drive firmware engage in these optimizations.

Finally, the diagram has various real-world throughputs for networking and other buses. Firewire is shown for reference but is not available natively in the Intel X48 chipset. It's fun to think of the Internet as a computer bus. The latency to a fast website (say, google.com) is about 45ms, comparable to hard drive seek latency. In fact, while hard drives are 5 orders of magnitude removed from main memory, they're in the same magnitude as the Internet. Residential bandwidth still lags behind that of sustained hard drive reads, but the 'network is the computer' in a pretty literal sense now. What happens when the Internet is faster than a hard drive?

I hope this diagram is useful. It's fascinating for me to look at all these numbers together and see how far we've come. Sources are posted as a comment. I posted a full diagram showing both north and south bridges here if you're interested.


粉丝 35
博文 650
码字总数 404459
作品 0
私信 提问
[AlwaysOn Availability Groups]AlwaysOn等待类型

AlwaysOn等待类型 当排查AlwaysOn延迟,等待统计信息可以在DMV中查看累计的AlwaysOn等待类型。 查看AlwaysOn等待类型 SELECT * FROM sys.dm_os_wait_stats WHERE wait_type LIKE '%hadr%' O......


查看和修改mysql变量 文章作者:Enjoy 转载请注明原文链接。 mysql运行变量有全局变量和会话变量之分,全局用global,当前会话(连接)变量用session。 查看变量命令: show variables,show ...

Condition Variables(条件变量)用法指南

int pthreadcondtimedwait(pthreadcondt *restrict cond, pthreadmutext restrict mutex,const struct timespec restrict abstime); int pthreadcondwait(pthreadcondt *restrict cond, pthr......


转载自: http://liuyieyer.iteye.com/blog/2214722?utmsource=tuicool&utmmedium=referral 由于网站使用nginx做的反向代理负载均衡。在没有默认的系统TCP参数情况下回导致大量的TIME_WAIT出...

Docker系列教程26-Docker Compose控制服务启动顺序

原文:,转载请说明出处。 在生产中,往往有严格控制服务启动顺序的需求。然而Docker Compose自身并不具备该能力。要想实现启动顺序的控制,Docker Compose建议我们使用: wait-for-it dock...






点击上方关注我们,让小care关爱你! 程序员群体一直都是低调多金的代表,而近段时间以来,程序员在网络上除了高薪之外,总是会和屌丝、苦逼、格子衫、没情趣...联系在一起。黑程序员的段子也...


Fcitx 总共要安装的包如下 fcitxfcitx-binfcitx-config-commonfcitx-config-gtk | fcitx-config-gtk2fcitx-datafcitx-frontend-allfcitx-frontend-gtk2fcitx-frontend-gtk3......


前言: 最近整理一些以前的学习笔记(有部分缺失,会有些乱,日后再补)。 过去都是存储在本地,此次传到网络留待备用。 计算机网络的功能: 1.数据通信; 2.资源共享; 3.增加数据可靠性; 4....

spring boot升级到spring cloud

1、先升级spring boot 版本到2.1.3 <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-dependencies</artifactId> <version>2.1.3.RELEAS......


DevOps源于Development和Operations的组合 常见的定义 DevOps是一种重视“软件开发人员(Dev)”和“IT运维技术人员(Ops)”之间沟通合作的文化、运动或惯例。透过自动化“软件交付”和“架构变...