Linux memory barriers && C++11

貌似高手
Published 2017/04/05 18:23
Overview

Compared with a traditional uniprocessor, an SMP architecture brings a considerable performance gain, but also an unavoidable problem: how the processors interact. One possible solution is to give each CPU memory that only it can access, with processors communicating by message passing. The problem with that design is the enormous programming burden it places on programmers (especially systems programmers), who must partition the data and manage its transfer. The widely adopted alternative is for all processors to share a single memory address space. Under this design each processor may still have its own local memory, but all memory is accessible to all processors; only the access latency differs. This article discusses this shared-memory architecture.

A shared memory space allows every processor to access any memory location, which raises problems that do not exist on a uniprocessor. Consider the following situation (the figure and example come from the kernel document Documentation/memory-barriers.txt):

 

In a system modeled as in the figure, suppose the following memory accesses take place:

 

 

Because processors reorder execution for efficiency (out-of-order execution), and because of caching, the final values of x and y as seen by memory can be any of the following combinations:
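The figure with the accesses and their outcomes did not survive extraction; the corresponding example in Documentation/memory-barriers.txt runs roughly as follows (initially A == 1 and B == 2):

```
CPU 1           CPU 2
===============	===============
A = 3;          x = B;
B = 4;          y = A;
```

Because either CPU may perform its two accesses in either order, memory can end up observing any combination of x == 2 or x == 4 with y == 1 or y == 3.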

 

Programmers working at the operating-system level therefore need a model that coordinates the processors' correct use of shared memory. This model is called the memory consistency model, or simply the memory model.

 

One more point worth making explicit: a computer is a machine built of layers of abstraction. At the bottom is the bare machine; above it, a virtual machine that understands assembly; above that, higher-level virtual machines that understand higher-level languages (C/C++, Java, and so on). At the level of a high-level language, the language and its runtime define that layer's memory model; the C++ and Java specifications, for example, each define their own. This article, however, focuses on the memory model defined at the processor-architecture level.

An intuitive memory model is the sequential consistency model. Briefly, sequential consistency guarantees two things:

  1. Each processor (or thread) executes instructions in the order they appear in the final compiled binary.
  2. For each processor (or thread), all other processors (or threads) observe its execution in exactly the order it actually executed.

From these two points it follows that the entire program executes in one well-ordered interleaving. The model is very intuitive: a program running under it executes in precisely the order that a programmer who has never heard of memory models would expect.

This model is inefficient, however, because it prevents the processor from using out-of-order execution to maximize parallelism, so no commercial processor architecture actually uses it. What they use instead are weaker memory consistency models.

For example, the familiar x86, and its compatible successor x86_64, use a model called processor consistency. In brief:

  1. A given processor may reorder its own write operations as it pleases, but all other processors see those writes in exactly the order it actually performed them. This point may be hard to grasp, but it matters; it is discussed in detail later.
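As an illustration not taken from the original article: the reordering x86 does permit is of a store against a later load from a different location (each store can linger in its CPU's store buffer), which is why a naive Dekker-style handshake needs a full barrier:

```
/* initially X == 0 and Y == 0 */
CPU 1           CPU 2
X = 1;          Y = 1;
r1 = Y;         r2 = X;

/* on x86, r1 == 0 and r2 == 0 can both be observed in the same run,
   because each store may still sit in its CPU's store buffer when the
   other CPU's load executes */
```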

There are still looser, weak memory models, which give the processor great freedom to reorder instructions. The programmer (especially the systems programmer) must therefore take explicit measures to constrain it and guarantee the intended result. That is the topic of this article. The canonical processor of this kind is the DEC Alpha, and Linux's generic memory model is built on top of it; more on this later.

Before continuing, three concepts need to be made precise.

  1. Program order: the order of the memory-access instructions in the final compiled binary for a given processor; compiler optimizations may have reordered the instructions of the source program.
  2. Execution order: the order in which the memory-access instructions actually execute on a given processor. It may differ from program order because of out-of-order execution.
  3. Perceived order: the order in which a given processor observes the other processors' memory-access instructions actually executing. It may differ from execution order because of processor caches and optimizations in inter-processor memory synchronization.

The descriptions of the memory models above in fact already used equivalent formulations in terms of these orders.

What is a memory barrier

The previous section sketched how, on a multiprocessor, the measures taken to raise parallelism and fully exploit the processor can lead to execution results that differ from what the programmer expects. This section describes the situation in more detail, that is, why the sequential consistency model is so hard to maintain.

In general, at the operating-system level that systems programmers care about, the two main sources of reordering of a program's instruction execution are processor optimizations and compiler optimizations.

Processor optimizations

A typical shared-memory multiprocessor is structured as follows:

 

As the figure shows, "shared memory" really means that each processor has its own local memory, but all non-local memory is accessible too, only at a visibly different speed. In addition, each processor has its own local cache hierarchy, and all the processors communicate over an interconnect network.

Clearly, memory operations in such an architecture suffer large latencies. To mitigate them, processors apply optimizations that end up breaking program order.

  • Scenario one: imagine a processor issues a read of a memory location that happens to live in remote memory and also misses the local cache. The processor would have to stall while waiting for the value, which is a huge waste of efficiency. In fact, modern processors have out-of-order execution engines: instructions are not executed directly but placed in a wait queue until their operands are in place, while the processor gives priority to other instructions in the meantime. In other words, for efficiency's sake the processor reorders instructions, which violates program order.

  • Scenario two: imagine a hot global variable; after the program has run for a while, the local caches of many processors will likely hold a copy of it. Now imagine processor A modifies the global variable. The modification publishes a message over the network telling all other processors to update their cached copy. Because the paths differ, the message does not reach every processor at the same time, so it is possible that some processor still observes the old value and takes actions based on it. That is, execution order and perceived order differ, which can lead to surprising program behavior.

Compiler optimizations

Compiler optimizations such as register allocation, loop-invariant code motion, and common subexpression elimination can all reorder, or even eliminate, memory accesses. So compiler optimization, too, contributes to instruction reordering.

One further case deserves separate mention. Some devices map their control interface into the process's address space; accessing such a device is called memory-mapped I/O (MMIO), and the device's address register, data register, and so on can be read and written as conveniently as memory. A common access pattern is to write the port to the address register AR first, then access the data through the data register DR, in this code order:

  *AR = 1;
  x = *DR;

The compiler may swap these two accesses, and the result is, of course, incorrect program behavior.

In sum, a memory barrier is an intervention explicitly inserted where processors and devices interact, to suppress processor or compiler optimizations and preserve the intended order of memory accesses.

Kinds of memory barriers

The Linux kernel implements the following kinds of barriers:

Write barriers

  • Definition: all store instructions before a write barrier take effect earlier than all store instructions after it.

  • Note 1: this ordering is relative to the recipient of the operations, namely memory. Inserting a write barrier on one processor does not guarantee that other processors see that same order; in other words, perceived order is unrelated to execution order.

  • Note 2: a write barrier does not guarantee that all writes before it have completed by the time the barrier instruction finishes. In other words, a write barrier serializes the order in which the writes are issued, not the completion of their effects.

Read barriers

  • Definition: all load instructions before a read barrier take effect earlier than all load instructions after it. A read barrier also subsumes the data-dependency barrier described later.

  • Note 1: this ordering is relative to the recipient of the operations, namely memory. Inserting a read barrier on one processor does not guarantee that other processors actually execute in that order; in other words, perceived order is unrelated to execution order.

  • Note 2: a read barrier does not guarantee that all reads before it have completed by the time the barrier instruction finishes. In other words, a read barrier serializes the order in which the reads are issued, not the completion of their effects.

A write barrier / read barrier example

Note: the reason the two barriers are presented together in one example is that a write barrier must be used in tandem with a read barrier.

For example:
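The example's figure did not survive; the matching example in Documentation/memory-barriers.txt is, roughly (a and b both initially 0, with a write barrier on CPU 1 only):

```
CPU 1                   CPU 2
===============         ===============
a = 1;
<write barrier>
b = 2;                  x = b;
                        y = a;
```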

 

Suppose CPU 2 observes that x has the value 2. Is it then guaranteed that the y it observes is 1?

No! This is exactly what Note 1 above emphasized. The reason may be that CPU 2's cache holds a stale value of a: as scenario two in the section "What is a memory barrier" explained, because CPU 1's write message propagates with some delay, CPU 2 may not yet have received the message that a has changed.

The correct approach is to insert a read barrier on CPU 2. Only a paired read/write barrier guarantees correct program behavior.

 

General barriers

 

  • Definition: all loads and stores before a general barrier take effect earlier than all loads and stores after it.

  • Note 1: this ordering is relative to the recipient of the operations, namely memory. Inserting a general barrier on one processor does not guarantee that other processors see that same order; in other words, perceived order is unrelated to execution order.

  • Note 2: a general barrier does not guarantee that all loads and stores before it have completed by the time the barrier instruction finishes. In other words, a general barrier serializes the order in which the loads and stores are issued, not the completion of their effects.

  • Note 3: the general barrier is the strictest barrier, which also means it is the least efficient. It can be substituted anywhere a write barrier or a read barrier appears.

Data dependency barriers:
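The body of this section did not survive; the classic data-dependency example from Documentation/memory-barriers.txt is, roughly (initially A == 1, B == 2, and P == &A):

```
CPU 1                   CPU 2
===============         ===============
B = 4;
<write barrier>
P = &B;                 Q = P;
                        <data dependency barrier>
                        D = *Q;
```

Even though the load of *Q depends on the load of Q, DEC Alpha can see Q == &B yet still return the stale D == 2; the data-dependency barrier forbids that. On almost every other architecture it compiles to nothing, and the full read barrier above subsumes it.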

 

 

 

http://preshing.com/20130823/the-synchronizes-with-relation/     happens-before

http://blog.jobbole.com/52164/ double-checked locking

http://blog.csdn.net/roland_sun/article/details/47670099 the ARM instructions strex/ldrex built on the MESI protocol

http://blog.csdn.net/holandstone/article/details/8596871 an analysis of the atomicity of Java volatile

The Synchronizes-With Relation

 

In an earlier post, I explained how atomic operations let you manipulate shared variables concurrently without any torn reads or torn writes. Quite often, though, a thread only modifies a shared variable when there are no concurrent readers or writers. In such cases, atomic operations are unnecessary. We just need a way to safely propagate modifications from one thread to another once they’re complete. That’s where the synchronizes-with relation comes in.

“Synchronizes-with” is a term invented by language designers to describe ways in which the memory effects of source-level operations – even non-atomic operations – are guaranteed to become visible to other threads. This is a desirable guarantee when writing lock-free code, since you can use it to avoid unwelcome surprises caused by memory reordering.

“Synchronizes-with” is a fairly modern computer science term. You’ll find it in the specifications of C++11, Java 5+ and LLVM, all of which were published within the last 10 years. Each specification defines this term, then uses it to make formal guarantees to the programmer. One thing they have in common is that whenever there’s a synchronizes-with relationship between two operations, typically on different threads, there’s a happens-before relationship between those operations as well.

Before digging deeper, I’ll let you in on a small insight: In every synchronizes-with relationship, you should be able to identify two key ingredients, which I like to call the guard variable and the payload. The payload is the set of data being propagated between threads, while the guard variable protects access to the payload. I’ll point out these ingredients as we go.

Now let’s look at a familiar example using C++11 atomics.

A Write-Release Can Synchronize-With a Read-Acquire

Suppose we have a Message structure which is produced by one thread and consumed by another. It has the following fields:

struct Message
{
    clock_t     tick;
    const char* str;
    void*       param;
};

We’ll pass an instance of Message between threads by placing it in a shared global variable. This shared variable acts as the payload.

Message g_payload;

Now, there’s no portable way to fill in g_payload using a single atomic operation. So we won’t try. Instead, we’ll define a separate atomic variable, g_guard, to indicate whether g_payload is ready. As you might guess, g_guard acts as our guard variable. The guard variable must be manipulated using atomic operations, since two threads will operate on it concurrently, and one of those threads performs a write.

std::atomic<int> g_guard(0);

To pass g_payload safely between threads, we’ll use acquire and release semantics, a subject I’ve written about previously using an example very similar to this one. If you’ve already read that post, you’ll recognize the final line of the following function as a write-release operation on g_guard.

void SendTestMessage(void* param)
{
    // Copy to shared memory using non-atomic stores.
    g_payload.tick  = clock();
    g_payload.str   = "TestMessage";
    g_payload.param = param;
    
    // Perform an atomic write-release to indicate that the message is ready.
    g_guard.store(1, std::memory_order_release);
}

While the first thread calls SendTestMessage, the second thread calls TryReceiveMessage intermittently, retrying until it sees a return value of true. You’ll recognize the first line of this function as a read-acquire operation on g_guard.

bool TryReceiveMessage(Message& result)
{
    // Perform an atomic read-acquire to check whether the message is ready.
    int ready = g_guard.load(std::memory_order_acquire);
    
    if (ready != 0)
    {
        // Yes. Copy from shared memory using non-atomic loads.
        result.tick  = g_payload.tick;
        result.str   = g_payload.str;
        result.param = g_payload.param;
        
        return true;
    }
    
    // No.
    return false;
}

If you’ve been following this blog for a while, you already know that this example works reliably (though it’s only capable of passing a single message). I’ve already explained how acquire and release semantics introduce memory barriers, and given a detailed example of acquire and release semantics in a working C++11 application.

The C++11 standard, on the other hand, doesn’t explain anything. That’s because a standard is meant to serve as a contract or an agreement, not as a tutorial. It simply makes the promise that this example will work, without going into any further detail. The promise is made in §29.3.2 of working draft N3337:

An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M and takes its value from any side effect in the release sequence headed by A.

It’s worth breaking this down. In our example:

  • Atomic operation A is the write-release performed in SendTestMessage.
  • Atomic object M is the guard variable, g_guard.
  • Atomic operation B is the read-acquire performed in TryReceiveMessage.

As for the condition that the read-acquire must “take its value from any side effect” – let’s just say it’s sufficient for the read-acquire to read the value written by the write-release. If that happens, the synchronizes-with relationship is complete, and we’ve achieved the coveted happens-before relationship between threads. Some people like to call this a synchronizes-with or happens-before “edge”.

Most importantly, the standard guarantees (in §1.10.11-12) that whenever there’s a synchronizes-with edge, the happens-before relationship extends to neighboring operations, too. This includes all operations before the edge in Thread 1, and all operations after the edge in Thread 2. In the example above, it ensures that all the modifications to g_payload are visible by the time the other thread reads them.

Compiler vendors, if they wish to claim C++11 compliance, must adhere to this guarantee. At first, it might seem mysterious how they do it. But in fact, compilers fulfill this promise using the same old tricks which programmers technically had to use long before C++11 came along. For example, in this post, we saw how an ARMv7 compiler implements these operations using a pair of dmb instructions. A PowerPC compiler could implement them using lwsync, while an x86 compiler could simply use a compiler barrier, thanks to x86’s relatively strong hardware memory model.

Of course, acquire and release semantics are not unique to C++11. For example, in Java version 5 onward, every store to a volatile variable is a write-release, while every load from a volatile variable is a read-acquire. Therefore, any volatile variable in Java can act as a guard variable, and can be used to propagate a payload of any size between threads. Jeremy Manson explains this in his blog post on volatile variables in Java. He even uses a diagram very similar to the one shown above, calling it the “two cones” diagram.

It’s a Runtime Relationship

In the previous example, we saw how the last line of SendTestMessage synchronized-with the first line of TryReceiveMessage. But don’t fall into the trap of thinking that synchronizes-with is a relationship between statements in your source code. It isn’t! It’s a relationship between operations which occur at runtime, based on those statements.

This distinction is important, and should really be obvious when you think about it. A single source code statement can execute any number of times in a running process. And if TryReceiveMessage is called too early – before Thread 1’s store to g_guard is visible – there will be no synchronizes-with relationship whatsoever.

It all depends on whether the read-acquire sees the value written by the write-release, or not. That’s what the C++11 standard means when it says that atomic operation B must “take its value” from atomic operation A.

Other Ways to Achieve Synchronizes-With

Just as synchronizes-with is not the only way to achieve a happens-before relationship, a pair of write-release/read-acquire operations is not the only way to achieve synchronizes-with; nor are C++11 atomics the only way to achieve acquire and release semantics. I’ve organized a few other ways into the following chart. Keep in mind that this chart is by no means exhaustive.

The example in this post generates lock-free code (on virtually all modern compilers and processors), but C++11 and Java expose blocking operations which introduce synchronize-with edges as well. For instance, unlocking a mutex always synchronizes-with a subsequent lock of that mutex. The language specifications are pretty clear about that one, and as programmers, we naturally expect it. You can consider the mutex itself to be the guard, and the protected variables as the payload. IBM even published an article on Java’s updated memory model in 2004 which contains a “two cones” diagram showing a pair of lock/unlock operations synchronizing-with each other.

As I’ve shown previously, acquire and release semantics can also be implemented using standalone, explicit fence instructions. In other words, it’s possible for a release fence to synchronize-with an acquire fence, provided that the right conditions are met. In fact, explicit fence instructions are the only available option in Mintomic, my own portable API for lock-free programming. I think that acquire and release fences are woefully misunderstood on the web right now, so I’ll probably write a dedicated post about them next.

The bottom line is that the synchronizes-with relationship only exists where the language and API specifications say it exists. It’s their job to define the conditions of their own guarantees at the source code level. Therefore, when using low-level ordering constraints in C++11 atomics, you can’t just slap std::memory_order_acquire and release on some operations and hope things magically work out. You need to identify which atomic variable is the guard, what’s the payload, and in which codepaths a synchronizes-with relationship is ensured.

Interestingly, the Go programming language is a bit of convention breaker. Go’s memory model is well specified, but the specification does not bother using the term “synchronizes-with” anywhere. It simply sticks with the term “happens-before”, which is just as good, since obviously, happens-before can fill the role anywhere that synchronizes-with would. Perhaps Go’s authors chose a reduced vocabulary because “synchronizes-with” is normally used to describe operations on different threads, and Go doesn’t expose the concept of threads.


Reposted from: http://larmbr.me/2014/02/14/the-memory-barriers-in-linux-kernel%281%29/
