An investigation of librbd cache and KVM cache

2016/09/02 16:05

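This note analyzes data reliability when block caching is enabled for the kvm + librbd combination. It is well known that enabling the disk cache improves read/write performance to some extent, but users worry about whether cached data will be lost if the system suddenly crashes or loses power. Quite a few articles online already discuss this [3][5], but to be on the safe side it is worth verifying and writing down.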




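Caching is configured in two places: (1) the QEMU configuration and (2) the Ceph cluster configuration. With traditional caching, abnormal events such as a system crash or power loss really do cause data loss. However, now that ext3/ext4, LVM, device-mapper and others support barrier writes, enabling the cache is no longer a frightening operation, provided the backend storage supports barriers as well. The difference between barrier writes and traditional sync writes is that traditional sync performs a sync for every I/O, whereas with barriers a series of I/Os can be written into the cache before a sync is issued. For example, a file system first writes its journal and syncs, then writes the data and syncs again. This is less efficient than plain writeback, but it guarantees the atomicity of the operation. The Linux SCSI mid-layer translates the upper layer's barrier operations into the SYNCHRONIZE CACHE SCSI command, so the feature only works when both the upper layers and the underlying storage support it. As KVM has evolved, QEMU has also gained barrier support [7]: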

    Now things have changed: newer KVM releases enable a “barrier-passing” feature that assure a 100% permanent data storage for guest-side synchronized writes, regardless of the caching mode in use. This means that we can potentially use the high-performance “writeback” setting without fear of data loss (see the green arrow above). However your guest operating system had to use barriers in the first place: this means that for most EXT3-based Linux distributions (as Debian) you had to manually enable barriers or use a filesystem with write barriers turned on by default (most notably EXT4).

If you virtualize an old operating system without barriers support, you had to use the write-through cache setting or at most the no-cache mode. In the latter case you don't have 100% guarantee that synchronized writes will be stored to disk; however, if your guest OS didn't support barriers, it is intrinsics unsafe on standard hardware also. So the no-cache mode seems a good bet for these barrier-less operating system, specially considering the high performance impact of write-through cache mode.
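The journal-then-data ordering described earlier can be sketched with plain POSIX calls. This is only a minimal illustration, assuming two ordinary files stand in for the journal and the data area; `journaled_update` and `write_all` are hypothetical helper names, and `fsync()` plays the role of the flush point that a barrier-aware file system would issue.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper: commit one update with two flush points.
 * The journal record must be durable before the data is written,
 * so fsync() is the ordering point between the two groups of writes.
 */
static int journaled_update(const char *journal_path, const char *data_path,
                            const char *record, const char *data)
{
    int ret = -1;
    int jfd, dfd;

    assert(record != NULL && data != NULL);
    jfd = open(journal_path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    dfd = open(data_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (jfd < 0 || dfd < 0) {
        goto out;
    }
    /* Step 1: journal writes may sit in the cache until this flush. */
    if (write_all(jfd, record, strlen(record)) < 0 || fsync(jfd) < 0) {
        goto out;
    }
    /* Step 2: only now write the data, then flush again. */
    if (write_all(dfd, data, strlen(data)) < 0 || fsync(dfd) < 0) {
        goto out;
    }
    ret = 0;
out:
    if (jfd >= 0) {
        close(jfd);
    }
    if (dfd >= 0) {
        close(dfd);
    }
    return ret;
}
```

Between the two `fsync()` calls the writes are free to stay in the cache, which is where the performance gain over a sync per I/O comes from. Inside a guest, each such flush is what QEMU has to pass down to the backend for the guarantee to hold.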

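Below is the I/O path through which the QEMU 2.6 code calls into librbd. It shows that QEMU's cache mode decides whether the Ceph rbd cache is used: when the mode sets BDRV_O_NOCACHE (cache=none or cache=directsync), qemu_rbd_open sets rbd_cache to false, and for the other modes it sets rbd_cache to true. This agrees with what the related articles online describe: the rbd cache setting in the Ceph configuration is overridden by the QEMU setting.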


static int drive_init_func(void *opaque, QemuOpts *opts, Error **errp)
{
    /* ... calls drive_new() ... */
}

DriveInfo *drive_new(QemuOpts *all_opts, BlockInterfaceType block_default_type)
{
    /* ... */

    value = qemu_opt_get(all_opts, "cache");
    if (value) {
        int flags = 0;
        bool writethrough;

        if (bdrv_parse_cache_mode(value, &flags, &writethrough) != 0) {
            error_report("invalid cache option");
            return NULL;
        }

        /* Specific options take precedence */
        if (!qemu_opt_get(all_opts, BDRV_OPT_CACHE_WB)) {
            qemu_opt_set_bool(all_opts, BDRV_OPT_CACHE_WB,
                              !writethrough, &error_abort);
        }
        if (!qemu_opt_get(all_opts, BDRV_OPT_CACHE_DIRECT)) {
            qemu_opt_set_bool(all_opts, BDRV_OPT_CACHE_DIRECT,
                              !!(flags & BDRV_O_NOCACHE), &error_abort);
        }
        if (!qemu_opt_get(all_opts, BDRV_OPT_CACHE_NO_FLUSH)) {
            qemu_opt_set_bool(all_opts, BDRV_OPT_CACHE_NO_FLUSH,
                              !!(flags & BDRV_O_NO_FLUSH), &error_abort);
        }
        qemu_opt_unset(all_opts, "cache");
    }

    /* ... */
}

/**
 * Set open flags for a given cache mode
 *
 * Return 0 on success, -1 if the cache mode was invalid.
 */
int bdrv_parse_cache_mode(const char *mode, int *flags, bool *writethrough)
{
    *flags &= ~BDRV_O_CACHE_MASK;

    if (!strcmp(mode, "off") || !strcmp(mode, "none")) {
        *writethrough = false;
        *flags |= BDRV_O_NOCACHE;
    } else if (!strcmp(mode, "directsync")) {
        *writethrough = true;
        *flags |= BDRV_O_NOCACHE;
    } else if (!strcmp(mode, "writeback")) {
        *writethrough = false;
    } else if (!strcmp(mode, "unsafe")) {
        *writethrough = false;
        *flags |= BDRV_O_NO_FLUSH;
    } else if (!strcmp(mode, "writethrough")) {
        *writethrough = true;
    } else {
        return -1;
    }

    return 0;
}

static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
                         Error **errp)
{
    /* ... */

    /*
     * Fallback to more conservative semantics if setting cache
     * options fails. Ignore errors from setting rbd_cache because the
     * only possible error is that the option does not exist, and
     * librbd defaults to no caching. If write through caching cannot
     * be set up, fall back to no caching.
     */
    if (flags & BDRV_O_NOCACHE) {
        rados_conf_set(s->cluster, "rbd_cache", "false");
    } else {
        rados_conf_set(s->cluster, "rbd_cache", "true");
    }

    /* ... */
}
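
To make the combined effect of the excerpts above easier to see, here is a small standalone sketch that maps each cache mode string to the value qemu_rbd_open would end up passing for rbd_cache. It is not QEMU code: the `SKETCH_*` constants and the `sketch_*` functions are made-up names for illustration, and the flag values are arbitrary rather than QEMU's real ones. Only the mode table is copied from bdrv_parse_cache_mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative flag values; these are NOT QEMU's real constants. */
#define SKETCH_O_NOCACHE    0x1
#define SKETCH_O_NO_FLUSH   0x2
#define SKETCH_O_CACHE_MASK (SKETCH_O_NOCACHE | SKETCH_O_NO_FLUSH)

/* Same mode table as the bdrv_parse_cache_mode() excerpt. */
static int sketch_parse_cache_mode(const char *mode, int *flags,
                                   bool *writethrough)
{
    assert(mode != NULL && flags != NULL && writethrough != NULL);
    *flags &= ~SKETCH_O_CACHE_MASK;

    if (!strcmp(mode, "off") || !strcmp(mode, "none")) {
        *writethrough = false;
        *flags |= SKETCH_O_NOCACHE;
    } else if (!strcmp(mode, "directsync")) {
        *writethrough = true;
        *flags |= SKETCH_O_NOCACHE;
    } else if (!strcmp(mode, "writeback")) {
        *writethrough = false;
    } else if (!strcmp(mode, "unsafe")) {
        *writethrough = false;
        *flags |= SKETCH_O_NO_FLUSH;
    } else if (!strcmp(mode, "writethrough")) {
        *writethrough = true;
    } else {
        return -1;
    }
    return 0;
}

/*
 * Returns the string that would be set for "rbd_cache" under the
 * if/else in the qemu_rbd_open() excerpt, or NULL for an invalid mode.
 */
static const char *sketch_rbd_cache_value(const char *mode)
{
    int flags = 0;
    bool writethrough;

    if (sketch_parse_cache_mode(mode, &flags, &writethrough) != 0) {
        return NULL;
    }
    return (flags & SKETCH_O_NOCACHE) ? "false" : "true";
}
```

Under this reading, only none (or off) and directsync turn the rbd cache off, while writeback, writethrough and unsafe leave it on.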



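The Ceph website [1] and [6] also discuss the configuration. Ceph supports rbd_cache_writethrough_until_flush: with this option the cache starts out in writethrough mode, and librbd switches to writeback only after the client issues a flush. This guards against the risk of data loss when a client that does not support barriers uses the cache directly. Enabling the Ceph cache mainly requires the following options: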

       rbd cache = true


       rbd_cache_writethrough_until_flush = true
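
The behaviour of rbd_cache_writethrough_until_flush can be pictured as a tiny state machine. The sketch below is a toy model written for this note, not librbd's implementation; `struct sketch_cache` and the `sketch_*` functions are hypothetical names, and "dirty bytes" simply stands for data that has been acknowledged to the client while living only in the cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a client-side cache; not librbd's data structure. */
struct sketch_cache {
    bool writethrough_until_flush; /* mirrors the config option */
    bool flush_seen;               /* has the client ever flushed? */
    size_t dirty_bytes;            /* acknowledged but only in cache */
};

/* Writeback is active once the option is off or a flush was seen. */
static bool sketch_is_writeback(const struct sketch_cache *c)
{
    assert(c != NULL);
    return !c->writethrough_until_flush || c->flush_seen;
}

static void sketch_write(struct sketch_cache *c, size_t len)
{
    assert(c != NULL);
    if (sketch_is_writeback(c)) {
        /* Acknowledged before reaching the cluster: stays dirty. */
        c->dirty_bytes += len;
    }
    /* In writethrough mode the write reaches the backend before
     * the acknowledgement, so nothing is left dirty. */
}

static void sketch_flush(struct sketch_cache *c)
{
    assert(c != NULL);
    c->dirty_bytes = 0;   /* everything dirty is written out */
    c->flush_seen = true; /* the client has shown it sends flushes */
}
```

A guest that never sends a flush keeps the model in writethrough forever, so it never has acknowledged data that exists only in the cache. A guest that does flush gets writeback performance from its first flush onward.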

The Ceph documentation [10] also describes the relationship between virtual machine images and the cache:

         Important If you set rbd_cache=true, you must set cache=writeback or risk data loss. Without cache=writeback, QEMU will not send flush requests to librbd. If QEMU exits uncleanly in this configuration, filesystems on top of rbd can be corrupted.

         Important The raw data format is really the only sensible format option to use with RBD. Technically, you could use other QEMU-supported formats (such as qcow2 or vmdk), but doing so would add additional overhead, and would also render the volume unsafe for virtual machine live migration when caching (see below) is enabled.













