ceph-disk Source Code Analysis (repost)


Original: http://www.hl10502.com/2017/06/23/ceph-disk-1/#more

ceph-disk is a tool, written in Python, for deploying OSD data and journal partitions or directories. It ships in the ceph-base package, so installing the ceph-base RPM installs it by default; for example, ceph-base-10.2.7-0.el7.x86_64.rpm in the Jewel release ceph-10.2.7.

ceph-disk command line

The ceph-disk command has the following format:

ceph-disk [-h] [-v] [--log-stdout] [--prepend-to-path PATH]
          [--statedir PATH] [--sysconfdir PATH] [--setuser USER]
          [--setgroup GROUP]
          {prepare,activate,activate-lockbox,activate-block,activate-journal,activate-all,list,suppress-activate,unsuppress-activate,deactivate,destroy,zap,trigger}
          ...

  • prepare: prepare a directory or disk for creating an OSD
  • activate: activate an OSD
  • activate-lockbox: activate a lockbox
  • activate-block: activate an OSD via its block device
  • activate-journal: activate an OSD via its journal device
  • activate-all: activate all tagged OSD partitions
  • list: list disks, partitions, and OSDs
  • suppress-activate: suppress activation of a device
  • unsuppress-activate: stop suppressing activation of a device
  • deactivate: deactivate an OSD
  • destroy: destroy an OSD
  • zap: zap (erase) a device's partitions
  • trigger: activate any device (called by udev)

How ceph-disk works

Creating an OSD with ceph-disk mounts the data and journal partitions automatically. OSD creation consists mainly of prepare and activate.

Assume /dev/sdb is the data disk the OSD will use and the journal partition will be created on /dev/sdc, an SSD. The commands to create and activate the OSD are:

  • ceph-disk prepare /dev/sdb /dev/sdc
  • ceph-disk activate /dev/sdb1

sgdisk command reference: https://linux.die.net/man/8/sgdisk
udevadm command reference: https://linux.die.net/man/8/udevadm

The prepare process

  • Use sgdisk to destroy the GPT and MBR on the data disk /dev/sdb, wiping all partitions
  • Read osd_journal_size (default 5120 MB, can be set explicitly) in preparation for the journal partition
  • One SSD is commonly shared by several OSDs, each carving out its own journal partition. Existing partitions on /dev/sdc are left alone: sgdisk simply adds a new partition to serve as the journal. If no UUID is supplied for the new partition, a journal_uuid is generated automatically; the journal partition gets typecode 45b0969e-9b03-4f30-b4c6-b4b80ceff106. The journal in the OSD data directory is a symlink pointing to a stable location, which in turn resolves to the real journal partition, so drifting device names cause no harm (see the sketch after this list)
  • Use sgdisk to create the data partition; --largest-new takes the largest space available on the disk, i.e. all remaining space becomes the data partition /dev/sdb1
  • Format the data partition /dev/sdb1 as xfs
  • Mount /dev/sdb1 on a temporary directory
  • Write four files, ceph_fsid, fsid, magic and journal_uuid, into the temporary directory with the corresponding contents
  • Create the journal link: ln -s makes the journal file in the temporary directory point at the journal partition just created on sdc
  • Unmount and remove the temporary directory
  • Change the OSD partition's typecode to 4fbd7e29-9d25-41b8-afd0-062c0ceff05d
  • udevadm trigger forces the kernel to emit device events
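
A minimal illustration of that symlink indirection (paths hypothetical; assumes an OSD already prepared by ceph-disk):

import os

# Hypothetical OSD data directory. The journal file inside it is a symlink
# to a stable /dev/disk/by-partuuid/<uuid> path rather than to a raw name
# like /dev/sdc2, so the OSD survives device names drifting across reboots.
osd_dir = '/var/lib/ceph/osd/ceph-0'
journal = os.path.join(osd_dir, 'journal')

print(os.readlink(journal))       # e.g. /dev/disk/by-partuuid/f693b826-...
print(os.path.realpath(journal))  # the device it currently resolves to, e.g. /dev/sdc2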

The activate process

Two rules files live under /lib/udev/rules.d/: 60-ceph-by-parttypeuuid.rules and 95-ceph-osd.rules.

There is actually no need to invoke the activate command explicitly. The udevadm trigger at the end of prepare forces the kernel to emit device events; the udev event runs the ceph-disk trigger command, which inspects the partition's typecode. For a Ceph OSD data partition it automatically runs ceph-disk activate /dev/sdb1; a partition whose typecode marks it as a journal is activated through ceph-disk activate-journal instead.

[root@ceph ~]# cat /lib/udev/rules.d/95-ceph-osd.rules
# OSD_UUID
ACTION=="add", SUBSYSTEM=="block", \
  ENV{DEVTYPE}=="partition", \
  ENV{ID_PART_ENTRY_TYPE}=="4fbd7e29-9d25-41b8-afd0-062c0ceff05d", \
  OWNER:="ceph", GROUP:="ceph", MODE:="660", \
  RUN+="/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name"
ACTION=="change", SUBSYSTEM=="block", \
  ENV{ID_PART_ENTRY_TYPE}=="4fbd7e29-9d25-41b8-afd0-062c0ceff05d", \
  OWNER="ceph", GROUP="ceph", MODE="660"

Once the rules fire, activation performs these steps:

  • Detect the filesystem type (xfs) and look up osd_mount_options_xfs / osd_fs_mount_options_xfs
  • Mount /dev/sdb1 on a temporary directory
  • Move the mount to the final OSD directory, then unmount and remove the temporary directory
  • Start the OSD daemon

Source layout

ceph-disk is just two files:

  • __init__.py: empty package initializer
  • main.py: all of the ceph-disk command logic; over 5,000 lines of code

Class diagrams

Full class diagram of main.py: ceph-disk.png

The Prepare class

Prepare implements the OSD preparation operation. Its two subclasses, PrepareBluestore and PrepareFilestore, correspond to BlueStore and FileStore; Jewel 10.2.7 defaults to FileStore.
(class diagram: ceph-disk-class-1.png)

The PrepareData class

PrepareData prepares the OSD's data: the on-disk data partition and journal partition. Two subclasses: PrepareFilestoreData and PrepareBluestoreData.
(class diagram: ceph-disk-class-2.png)

The PrepareSpace class

PrepareSpace works out the size of a disk partition. Two subclasses: PrepareJournal and PrepareBluestoreBlock.
(class diagram: ceph-disk-class-3.png)

The DevicePartition class

DevicePartition models a device partition, including its encryption mode. Four subclasses: DevicePartitionCrypt, DevicePartitionCryptLuks and DevicePartitionCryptPlain cover the dmcrypt variants, and DevicePartitionMultipath handles multipath devices.
(class diagram: ceph-disk-class-4.png)

OSD management

Creating an OSD breaks down into the two operations prepare and activate.

The main.py entry point

warned_about = {}

if __name__ == '__main__':
    main(sys.argv[1:])

The main function

def main(argv):
    # parse the command line
    args = parse_args(argv)
    # set the logging level
    setup_logging(args.verbose, args.log_stdout)
    if args.prepend_to_path != '':
        path = os.environ.get('PATH', os.defpath)
        os.environ['PATH'] = args.prepend_to_path + ":" + path
    # set the directory holding ceph-disk.prepare.lock and
    # ceph-disk.activate.lock: /var/lib/ceph/tmp
    setup_statedir(args.statedir)
    # set the configuration directory: /etc/ceph/
    setup_sysconfdir(args.sysconfdir)
    global CEPH_PREF_USER
    CEPH_PREF_USER = args.setuser
    global CEPH_PREF_GROUP
    CEPH_PREF_GROUP = args.setgroup
    # run the subcommand handler
    if args.verbose:
        args.func(args)
    else:
        main_catch(args.func, args)


The parse_args function parses the subcommands

def parse_args(argv):
    parser = argparse.ArgumentParser(
        'ceph-disk',
    )
...
...
...
    # parse the prepare subcommand
    Prepare.set_subparser(subparsers)
    # parse the activate subcommand
    make_activate_parser(subparsers)
    make_activate_lockbox_parser(subparsers)
    make_activate_block_parser(subparsers)
    make_activate_journal_parser(subparsers)
    make_activate_all_parser(subparsers)
    make_list_parser(subparsers)
    make_suppress_parser(subparsers)
    make_deactivate_parser(subparsers)
    make_destroy_parser(subparsers)
    make_zap_parser(subparsers)
    make_trigger_parser(subparsers)
    args = parser.parse_args(argv)
    return args

The main_catch function

def main_catch(func, args):
    try:
        func(args)
    except Error as e:
        raise SystemExit(
            '{prog}: {msg}'.format(
                prog=args.prog,
                msg=e,
            )
        )
    except CephDiskException as error:
        exc_name = error.__class__.__name__
        raise SystemExit(
            '{prog} {exc_name}: {msg}'.format(
                prog=args.prog,
                exc_name=exc_name,
                msg=error,
            )
        )


prepare

The ceph-disk prepare command line format is:

ceph-disk prepare [-h] [--cluster NAME] [--cluster-uuid UUID]
                         [--osd-uuid UUID] [--dmcrypt]
                         [--dmcrypt-key-dir KEYDIR] [--prepare-key PATH]
                         [--fs-type FS_TYPE] [--zap-disk] [--data-dir]
                         [--data-dev] [--lockbox LOCKBOX]
                         [--lockbox-uuid UUID] [--journal-uuid UUID]
                         [--journal-file] [--journal-dev] [--bluestore]
                         [--block-uuid UUID] [--block-file] [--block-dev]
                         DATA [JOURNAL] [BLOCK]


Prepare.set_subparser registers the prepare subcommand; the default handler is Prepare.main

@staticmethod
def set_subparser(subparsers):
    parents = [
        Prepare.parser(),
        PrepareData.parser(),
        Lockbox.parser(),
    ]
    parents.extend(PrepareFilestore.parent_parsers())
    parents.extend(PrepareBluestore.parent_parsers())
    parser = subparsers.add_parser(
        'prepare',
        parents=parents,
        help='Prepare a directory or disk for a Ceph OSD',
    )
    parser.set_defaults(
        func=Prepare.main,
    )
    return parser


Prepare.main calls the factory function

@staticmethod
def main(args):
    Prepare.factory(args).prepare()

factory returns PrepareFilestore by default

@staticmethod
def factory(args):
    if args.bluestore:
        return PrepareBluestore(args)
    else:
        return PrepareFilestore(args)

PrepareFilestore initialization

  • Initializes PrepareFilestoreData, which inherits from PrepareData
  • Initializes PrepareJournal

    def __init__(self, args):
        if args.dmcrypt:
            self.lockbox = Lockbox(args)
        self.data = PrepareFilestoreData(args)
        self.journal = PrepareJournal(args)

PrepareData initialization: fetch the fsid and generate a new osd_uuid

  • Runs /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid to obtain the fsid
  • Generates a new osd_uuid

    def __init__(self, args):
        self.args = args
        self.partition = None
        self.set_type()
        if self.args.cluster_uuid is None:
            self.args.cluster_uuid = get_fsid(cluster=self.args.cluster)
        if self.args.osd_uuid is None:
            self.args.osd_uuid = str(uuid.uuid4())
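get_fsid ultimately shells out to ceph-osd, exactly like the manual command shown later in this post; a rough stand-alone equivalent (assuming /etc/ceph/ceph.conf exists):

import subprocess

def get_fsid(cluster='ceph'):
    # ask ceph-osd for the fsid configured for this cluster
    out = subprocess.check_output(
        ['ceph-osd', '--cluster=' + cluster, '--show-config-value=fsid'])
    return out.decode().strip()
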

PrepareJournal initialization; PrepareJournal inherits from PrepareSpace

  • Calls check_journal_reqs, which runs the check-allows-journal, check-wants-journal and check-needs-journal probes and validates the result
  • Calls the parent class initializer

    def __init__(self, args):
        self.name = 'journal'
        (self.allows_journal,
         self.wants_journal,
         self.needs_journal) = check_journal_reqs(args)
        if args.journal and not self.allows_journal:
            raise Error('journal specified but not allowed by osd backend')
        super(PrepareJournal, self).__init__(args)

PrepareSpace initialization

  • Calls get_space_size to obtain osd_journal_size
  • Generates a UUID for the space if none was supplied (see the toy example after this block)

    def __init__(self, args):
        self.args = args
        self.set_type()
        self.space_size = self.get_space_size()
        if getattr(self.args, self.name + '_uuid') is None:
            setattr(self.args, self.name + '_uuid', str(uuid.uuid4()))
        self.space_symlink = None
        self.space_dmcrypt = None
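The getattr/setattr pair works because self.name selects the argument dynamically: for PrepareJournal (name = 'journal') it reads and, if empty, fills args.journal_uuid. A toy reproduction of the pattern:

import argparse
import uuid

args = argparse.Namespace(journal_uuid=None)
name = 'journal'
# same pattern as PrepareSpace.__init__
if getattr(args, name + '_uuid') is None:
    setattr(args, name + '_uuid', str(uuid.uuid4()))
print(args.journal_uuid)  # a freshly generated UUID
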

The PrepareJournal subclass's get_space_size function

  • Runs the /usr/bin/ceph-osd --cluster=ceph --show-config-value=osd_journal_size command

    def get_space_size(self):
        return int(get_conf_with_default(
            cluster=self.args.cluster,
            variable='osd_journal_size',
        ))

Since PrepareFilestore and PrepareBluestore inherit from Prepare, the prepare function is defined on the Prepare class

def prepare(self):
    with prepare_lock:
        self.prepare_locked()
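
prepare_lock serializes concurrent ceph-disk runs through a lock file under /var/lib/ceph/tmp. A minimal sketch of such a lock, assuming fcntl-style advisory locking in the spirit of ceph-disk's filelock class:

import fcntl

class FileLock(object):
    # minimal advisory file lock usable as a context manager
    def __init__(self, path):
        self.path = path
        self.fd = None

    def __enter__(self):
        self.fd = open(self.path, 'w')
        fcntl.lockf(self.fd, fcntl.LOCK_EX)
        return self

    def __exit__(self, *exc):
        fcntl.lockf(self.fd, fcntl.LOCK_UN)
        self.fd.close()

prepare_lock = FileLock('/var/lib/ceph/tmp/ceph-disk.prepare.lock')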

PrepareFilestore's prepare_locked calls PrepareFilestoreData's prepare function, passing PrepareJournal as the argument

def prepare_locked(self):
    if self.data.args.dmcrypt:
        self.lockbox.prepare()
    self.data.prepare(self.journal)

The prepare function of PrepareData (PrepareFilestoreData's parent): for a block device it calls prepare_device

def prepare(self, *to_prepare_list):
    if self.type == self.DEVICE:
        self.prepare_device(*to_prepare_list)
    elif self.type == self.FILE:
        self.prepare_file(*to_prepare_list)
    else:
        raise Error('unexpected type ', self.type)

PrepareFilestoreData's prepare_device function

  • Calls the parent PrepareData's prepare_device
  • Calls set_data_partition
  • Calls populate_data_path_device

    def prepare_device(self, *to_prepare_list):
        # the parent PrepareData's prepare_device
        super(PrepareFilestoreData, self).prepare_device(*to_prepare_list)
        for to_prepare in to_prepare_list:
            # PrepareJournal's prepare, whose prepare_device creates the journal partition
            to_prepare.prepare()
        # create the data partition
        self.set_data_partition()
        # populate the OSD data directory
        self.populate_data_path_device(*to_prepare_list)

PrepareData's prepare_device function

  • Calls sanity_checks to verify the device is not already in use
  • Calls set_variables to set up variables
  • Calls zap to wipe the partitions and make the change take effect

    def prepare_device(self, *to_prepare_list):
        # validate the device
        self.sanity_checks()
        # set up variables
        self.set_variables()
        if self.args.zap_disk is not None:
            # wipe the partitions and re-read the partition table
            zap(self.args.data)

The zap function wipes the partitions and makes the change take effect. [dev] is the device, e.g. /dev/sdb

  • /usr/sbin/sgdisk --zap-all -- [dev]
  • /usr/sbin/sgdisk --clear --mbrtogpt -- [dev]
  • /usr/bin/udevadm settle --timeout=600
  • /usr/bin/flock -s [dev] /usr/sbin/partprobe [dev]
  • /usr/bin/udevadm settle --timeout=600

    def zap(dev):
        """
        Destroy the partition table and content of a given disk.
        """
        dev = os.path.realpath(dev)
        dmode = os.stat(dev).st_mode
        if not stat.S_ISBLK(dmode) or is_partition(dev):
            raise Error('not full block device; cannot zap', dev)
        try:
            LOG.debug('Zapping partition table on %s', dev)
            # try to wipe out any GPT partition table backups.  sgdisk
            # isn't too thorough: the backup GPT lives in the last logical
            # blocks of the disk, so zero 33 blocks' worth of bytes there.
            lba_size = 4096
            size = 33 * lba_size
            with open(dev, 'wb') as dev_file:
                dev_file.seek(-size, os.SEEK_END)
                dev_file.write(size * b'\0')
            # wipe the partitions
            command_check_call(
                [
                    'sgdisk',
                    '--zap-all',
                    '--',
                    dev,
                ],
            )
            command_check_call(
                [
                    'sgdisk',
                    '--clear',
                    '--mbrtogpt',
                    '--',
                    dev,
                ],
            )
            # re-read the partition table so the kernel sees the change
            update_partition(dev, 'zapped')

PrepareJournal's prepare function

def prepare(self):
    if self.type == self.DEVICE:
        self.prepare_device()
    elif self.type == self.FILE:
        self.prepare_file()
    elif self.type == self.NONE:
        pass
    else:
        raise Error('unexpected type ', self.type)

prepare_device calls the Device class's create_partition to create the journal partition

...
...
    device = Device.factory(getattr(self.args, self.name), self.args)
    # create the journal partition
    num = device.create_partition(
        uuid=getattr(self.args, self.name + '_uuid'),
        name=self.name,
        size=self.space_size,
        num=num)
...
...

The create_partition function creates the journal partition

  • Calls ptype_tobe_for_name to get the journal typecode: 45b0969e-9b03-4f30-b4c6-b4b80ceff106
  • Creates the journal partition
    • /usr/sbin/sgdisk --new=2:0:+5120M --change-name=2:ceph journal --partition-guid=2:f693b826-e070-4b42-af3e-07d011994583 --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sdb
  • Makes the new partition take effect
    • /usr/bin/udevadm settle --timeout=600
    • /usr/bin/flock -s /dev/sdb /usr/sbin/partprobe /dev/sdb
    • /usr/bin/udevadm settle --timeout=600

    def create_partition(self, uuid, name, size=0, num=0):
        ptype = self.ptype_tobe_for_name(name)
        if num == 0:
            num = get_free_partition_index(dev=self.path)
        if size > 0:
            new = '--new={num}:0:+{size}M'.format(num=num, size=size)
            if size > self.get_dev_size():
                LOG.error('refusing to create %s on %s' % (name, self.path))
                LOG.error('%s size (%sM) is bigger than device (%sM)'
                          % (name, size, self.get_dev_size()))
                raise Error('%s device size (%sM) is not big enough for %s'
                            % (self.path, self.get_dev_size(), name))
        else:
            new = '--largest-new={num}'.format(num=num)
        LOG.debug('Creating %s partition num %d size %d on %s',
                  name, num, size, self.path)
        command_check_call(
            [
                'sgdisk',
                new,
                '--change-name={num}:ceph {name}'.format(num=num, name=name),
                '--partition-guid={num}:{uuid}'.format(num=num, uuid=uuid),
                '--typecode={num}:{uuid}'.format(num=num, uuid=ptype),
                '--mbrtogpt',
                '--',
                self.path,
            ]
        )
        # re-read the partition table so the kernel sees the new partition
        update_partition(self.path, 'created')
        return num

set_data_partition calls create_data_partition to create the data partition

def set_data_partition(self):
    if is_partition(self.args.data):
        LOG.debug('OSD data device %s is a partition',
                  self.args.data)
        self.partition = DevicePartition.factory(
            path=None, dev=self.args.data, args=self.args)
        ptype = self.partition.get_ptype()
        ready = Ptype.get_ready_by_name('osd')
        if ptype not in ready:
            LOG.warning('incorrect partition UUID: %s, expected %s'
                        % (ptype, str(ready)))
    else:
        LOG.debug('Creating osd partition on %s',
                  self.args.data)
        self.partition = self.create_data_partition()

create_data_partition calls the Device class's create_partition to create the data partition and make it take effect

  • /usr/sbin/sgdisk --largest-new=1 --change-name=1:ceph data --partition-guid=1:1b9521d7-ee24-4043-96a7-1a3140bbff27 --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be --mbrtogpt -- /dev/sdb
  • /usr/bin/udevadm settle --timeout=600
  • /usr/bin/flock -s /dev/sdb /usr/sbin/partprobe /dev/sdb
  • /usr/bin/udevadm settle --timeout=600

    def create_data_partition(self):
        device = Device.factory(self.args.data, self.args)
        partition_number = 1
        device.create_partition(uuid=self.args.osd_uuid,
                                name='data',
                                num=partition_number,
                                size=self.get_space_size())
        return device.get_partition(partition_number)

populate_data_path_device creates the OSD

  • Formats the data partition as xfs
  • Creates a temporary directory and mounts the partition on it
  • Writes the ceph_fsid, fsid, magic and journal_uuid marker files into the temporary directory (a sketch of this write follows the code below)
  • Runs restorecon to restore SELinux file contexts
  • Unmounts and removes the temporary directory
  • Changes the OSD partition's typecode to 4fbd7e29-9d25-41b8-afd0-062c0ceff05d, which means ready
  • Makes the partition change take effect
  • Forces the kernel to emit a device event

    def populate_data_path_device(self, *to_prepare_list):
        partition = self.partition
        if isinstance(partition, DevicePartitionCrypt):
            partition.map()
        try:
            args = [
                'mkfs',
                '-t',
                self.args.fs_type,
            ]
            if self.mkfs_args is not None:
                args.extend(self.mkfs_args.split())
                if self.args.fs_type == 'xfs':
                    args.extend(['-f'])  # always force
            else:
                args.extend(MKFS_ARGS.get(self.args.fs_type, []))
            args.extend([
                '--',
                partition.get_dev(),
            ])
            try:
                LOG.debug('Creating %s fs on %s',
                          self.args.fs_type, partition.get_dev())
                # format the data partition as xfs
                command_check_call(args)
            except subprocess.CalledProcessError as e:
                raise Error(e)
            # mount on a temporary directory
            path = mount(dev=partition.get_dev(),
                         fstype=self.args.fs_type,
                         options=self.mount_options)
            try:
                # write the OSD's ceph_fsid, fsid, magic and journal_uuid marker files
                self.populate_data_path(path, *to_prepare_list)
            finally:
                # run restorecon to restore SELinux file contexts
                path_set_context(path)
                # unmount and remove the temporary directory
                unmount(path)
        finally:
            if isinstance(partition, DevicePartitionCrypt):
                partition.unmap()
        if not is_partition(self.args.data):
            try:
                # change the OSD partition's typecode to 4fbd7e29-9d25-41b8-afd0-062c0ceff05d (ready)
                command_check_call(
                    [
                        'sgdisk',
                        '--typecode=%d:%s' % (partition.get_partition_number(),
                                              partition.ptype_for_name('osd')),
                        '--',
                        self.args.data,
                    ],
                )
            except subprocess.CalledProcessError as e:
                raise Error(e)
            # re-read the partition table so the change takes effect
            update_partition(self.args.data, 'prepared')
            # force the kernel to emit a device event
            command_check_call(['udevadm', 'trigger',
                                '--action=add',
                                '--sysname-match',
                                os.path.basename(partition.rawdev)])
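populate_data_path (called above but not shown) writes the four marker files. A condensed sketch of its write_one_line helper: each value goes into a .<pid>.tmp file that is fsynced and then renamed, so the marker appears atomically; this is the same vi/mv pattern visible in the manual walkthrough below:

import os

def write_one_line(parent, name, text):
    # write via a temp file and rename so the marker file appears atomically
    tmp = os.path.join(parent, '{name}.{pid}.tmp'.format(name=name, pid=os.getpid()))
    with open(tmp, 'w') as f:
        f.write(text + '\n')
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, os.path.join(parent, name))

# the four markers written into the mounted data partition:
# write_one_line(path, 'ceph_fsid', cluster_uuid)
# write_one_line(path, 'fsid', osd_uuid)
# write_one_line(path, 'magic', 'ceph osd volume v026')
# write_one_line(path, 'journal_uuid', journal_uuid)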

activate

The ceph-disk activate command line format is:

ceph-disk activate [-h] [--mount] [--activate-key PATH]
                          [--mark-init INITSYSTEM] [--no-start-daemon]
                          [--dmcrypt] [--dmcrypt-key-dir KEYDIR]
                          [--reactivate]
                          PATH

The activate subcommand is parsed by make_activate_parser; the default handler is main_activate.

  • Calls mount_activate to mount the OSD (a sketch of the osd-id lookup performed inside follows this block)
  • Gets the mount point and checks the journal file
  • Starts the OSD daemon

    def main_activate(args):
        cluster = None
        osd_id = None
        LOG.info('path = ' + str(args.path))
        if not os.path.exists(args.path):
            raise Error('%s does not exist' % args.path)
        if is_suppressed(args.path):
            LOG.info('suppressed activate request on %s', args.path)
            return
        # lock file: /var/lib/ceph/tmp/ceph-disk.activate.lock
        with activate_lock:
            mode = os.stat(args.path).st_mode
            if stat.S_ISBLK(mode):
                if (is_partition(args.path) and
                        (get_partition_type(args.path) ==
                         PTYPE['mpath']['osd']['ready']) and
                        not is_mpath(args.path)):
                    raise Error('%s is not a multipath block device' %
                                args.path)
                # mount the data partition
                (cluster, osd_id) = mount_activate(
                    dev=args.path,
                    activate_key_template=args.activate_key_template,
                    init=args.mark_init,
                    dmcrypt=args.dmcrypt,
                    dmcrypt_key_dir=args.dmcrypt_key_dir,
                    reactivate=args.reactivate,
                )
                # get the mount point
                osd_data = get_mount_point(cluster, osd_id)
            elif stat.S_ISDIR(mode):
                (cluster, osd_id) = activate_dir(
                    path=args.path,
                    activate_key_template=args.activate_key_template,
                    init=args.mark_init,
                )
                osd_data = args.path
            else:
                raise Error('%s is not a directory or block device' % args.path)
            # exit with 0 if the journal device is not up, yet
            # journal device will do the activation
            # check the journal file
            osd_journal = '{path}/journal'.format(path=osd_data)
            if os.path.islink(osd_journal) and not os.access(osd_journal, os.F_OK):
                LOG.info("activate: Journal not present, not starting, yet")
                return
            if (not args.no_start_daemon and args.mark_init == 'none'):
                command_check_call(
                    [
                        'ceph-osd',
                        '--cluster={cluster}'.format(cluster=cluster),
                        '--id={osd_id}'.format(osd_id=osd_id),
                        '--osd-data={path}'.format(path=osd_data),
                        '--osd-journal={journal}'.format(journal=osd_journal),
                    ],
                )
            if (not args.no_start_daemon and
                    args.mark_init not in (None, 'none')):
                # start the OSD daemon
                start_daemon(
                    cluster=cluster,
                    osd_id=osd_id,
                )
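Within mount_activate (next section), activate() determines the osd id: it reads the fsid marker written by prepare and asks the monitors for the id registered for that uuid, allocating one for a fresh OSD. A condensed sketch of that lookup (in Jewel, ceph osd create <uuid> is idempotent and returns the existing id when the uuid is already known):

import os
import subprocess

def get_osd_id(path, cluster='ceph'):
    # read the osd uuid written by prepare, then ask the monitors
    # for (or allocate) the matching osd id
    with open(os.path.join(path, 'fsid')) as f:
        osd_uuid = f.read().strip()
    out = subprocess.check_output(
        ['ceph', '--cluster', cluster, 'osd', 'create', osd_uuid])
    return out.decode().strip()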

The mount_activate function

def mount_activate(
    dev,
    activate_key_template,
    init,
    dmcrypt,
    dmcrypt_key_dir,
    reactivate=False,
):
    if dmcrypt:
        # get the partition UUID
        part_uuid = get_partition_uuid(dev)
        dev = dmcrypt_map(dev, dmcrypt_key_dir)
    try:
        # detect the filesystem type (xfs)
        fstype = detect_fstype(dev=dev)
    except (subprocess.CalledProcessError,
            TruncatedLineError,
            TooManyLinesError) as e:
        raise FilesystemTypeError(
            'device {dev}'.format(dev=dev),
            e,
        )
    # TODO always using mount options from cluster=ceph for
    # now; see http://tracker.newdream.net/issues/3253
    # look up osd_mount_options_xfs
    mount_options = get_conf(
        cluster='ceph',
        variable='osd_mount_options_{fstype}'.format(
            fstype=fstype,
        ),
    )
    if mount_options is None:
        # look up osd_fs_mount_options_xfs
        mount_options = get_conf(
            cluster='ceph',
            variable='osd_fs_mount_options_{fstype}'.format(
                fstype=fstype,
            ),
        )
    # remove whitespaces from mount_options
    if mount_options is not None:
        mount_options = "".join(mount_options.split())
    # mount on a temporary directory
    path = mount(dev=dev, fstype=fstype, options=mount_options)
    # check if the disk is deactive, change the journal owner, group
    # mode for correct user and group.
    if os.path.exists(os.path.join(path, 'deactive')):
        # logging to syslog will help us easy to know udev triggered failure
        if not reactivate:
            unmount(path)
            # we need to unmap again because dmcrypt map will create again
            # on bootup stage (due to deactivate)
            if '/dev/mapper/' in dev:
                part_uuid = dev.replace('/dev/mapper/', '')
                dmcrypt_unmap(part_uuid)
            LOG.info('OSD deactivated! reactivate with: --reactivate')
            raise Error('OSD deactivated! reactivate with: --reactivate')
        # flag to activate a deactive osd.
        deactive = True
    else:
        deactive = False
    osd_id = None
    cluster = None
    try:
        # activate the OSD (allocate or look up the osd id, etc.)
        (osd_id, cluster) = activate(path, activate_key_template, init)
        # Now active successfully
        # If we got reactivate and deactive, remove the deactive file
        if deactive and reactivate:
            os.remove(os.path.join(path, 'deactive'))
            LOG.info('Remove `deactive` file.')
        # check if the disk is already active, or if something else is already
        # mounted there
        active = False
        other = False
        src_dev = os.stat(path).st_dev
        # check whether it is already active (mounted at the correct location)
        try:
            dst_dev = os.stat((STATEDIR + '/osd/{cluster}-{osd_id}').format(
                cluster=cluster,
                osd_id=osd_id)).st_dev
            if src_dev == dst_dev:
                active = True
            else:
                parent_dev = os.stat(STATEDIR + '/osd').st_dev
                if dst_dev != parent_dev:
                    other = True
                elif os.listdir(get_mount_point(cluster, osd_id)):
                    LOG.info(get_mount_point(cluster, osd_id) +
                             " is not empty, won't override")
                    other = True
        except OSError:
            pass
        if active:
            LOG.info('%s osd.%s already mounted in position; unmounting ours.'
                     % (cluster, osd_id))
            # unmount the temporary directory (it is removed in the finally block)
            unmount(path)
        elif other:
            raise Error('another %s osd.%s already mounted in position '
                        '(old/different cluster instance?); unmounting ours.'
                        % (cluster, osd_id))
        else:
            move_mount(
                dev=dev,
                path=path,
                cluster=cluster,
                osd_id=osd_id,
                fstype=fstype,
                mount_options=mount_options,
            )
        return cluster, osd_id
    except:
        LOG.error('Failed to activate')
        unmount(path)
        raise
    finally:
        # remove our temp dir
        if os.path.exists(path):
            os.rmdir(path)
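
move_mount (the else branch above) is what makes the mount permanent. A condensed sketch of the idea: mount the device a second time at /var/lib/ceph/osd/<cluster>-<id>, then lazily unmount the temporary path; the real code avoids mount --move because it fails when the parent mount is shared, the default on RHEL-family systems (mount_options assumed to be a non-empty string here):

import subprocess

def move_mount(dev, tmp_path, cluster, osd_id, fstype, mount_options):
    osd_data = '/var/lib/ceph/osd/{cluster}-{osd_id}'.format(
        cluster=cluster, osd_id=osd_id)
    # mount the device again at the final OSD directory
    subprocess.check_call(
        ['mount', '-t', fstype, '-o', mount_options, '--', dev, osd_data])
    # lazily unmount the temporary directory
    subprocess.check_call(['umount', '-l', '--', tmp_path])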


Managing an OSD by hand

Preparing the OSD

Taking /usr/sbin/ceph-disk -v prepare --zap-disk --cluster ceph --fs-type xfs -- /dev/sdb as the example, the ceph-disk prepare command runs through the following steps.

Check the journal settings

[root@ceph-231 ~]# /usr/bin/ceph-osd --check-allows-journal -i 0 --cluster ceph --setuser ceph --setgroup ceph
yes
[root@ceph-231 ~]# /usr/bin/ceph-osd --check-wants-journal -i 0 --cluster ceph --setuser ceph --setgroup ceph
yes
[root@ceph-231 ~]# /usr/bin/ceph-osd --check-needs-journal -i 0 --cluster ceph --setuser ceph --setgroup ceph
no

Check the mounted devices; /dev/sdb is not mounted, so it can be used to create an OSD

[root@ceph-231 ~]# cat /proc/mounts
...
...
/dev/sda1 / ext3 rw,relatime,errors=continue,user_xattr,acl,barrier=1,data=ordered 0 0
...
...
/dev/sda5 /var/log ext3 rw,relatime,errors=continue,user_xattr,acl,barrier=1,data=ordered 0 0
...
...

Zap the partitions

[root@ceph-231 ~]# /usr/sbin/sgdisk --zap-all -- /dev/sdb
[root@ceph-231 ~]# /usr/sbin/sgdisk --clear --mbrtogpt -- /dev/sdb
[root@ceph-231 ~]# /usr/bin/udevadm settle --timeout=600
[root@ceph-231 ~]# /usr/bin/flock -s /dev/sdb /usr/sbin/partprobe /dev/sdb
[root@ceph-231 ~]# /usr/bin/udevadm settle --timeout=600

Get osd_journal_size; the default is 5120 MB

[root@ceph-231 ~]# /usr/bin/ceph-osd --cluster=ceph --show-config-value=osd_journal_size
5120

Generate a journal_uuid

[root@ceph-231 ~]# uuidgen
f693b826-e070-4b42-af3e-07d011994583

Create the journal partition, replacing {num} with a concrete number (see the sketch after these commands)

  • If the data disk and the journal are on the same disk, {num} is 2
  • If the journal lives on a different disk, inspect that disk's partitions; {num} is the existing partition count + 1
    • Run parted --machine -- /dev/sdb print to list the partitions

[root@ceph-231 ~]# /usr/sbin/sgdisk --new={num}:0:+5120M --change-name={num}:"ceph journal" --partition-guid={num}:f693b826-e070-4b42-af3e-07d011994583 --typecode={num}:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sdb
[root@ceph-231 ~]# /usr/bin/udevadm settle --timeout=600
[root@ceph-231 ~]# /usr/bin/flock -s /dev/sdb /usr/sbin/partprobe /dev/sdb
[root@ceph-231 ~]# /usr/bin/udevadm settle --timeout=600
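
How {num} can be computed from the parted output (ceph-disk's get_free_partition_index parses parted --machine output in a similar spirit; this is a simplified sketch, not the exact implementation):

import subprocess

def next_partition_number(dev):
    # `parted --machine -- <dev> print` emits two header lines, then one
    # line per partition whose first colon-separated field is its number
    out = subprocess.check_output(
        ['parted', '--machine', '--', dev, 'print']).decode()
    lines = [l for l in out.splitlines()[2:] if l.strip()]
    nums = [int(l.split(':')[0]) for l in lines]
    return max(nums) + 1 if nums else 1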

Generate the data partition UUID

[root@ceph-231 ~]# uuidgen
1b9521d7-ee24-4043-96a7-1a3140bbff27


Create the data partition

[root@ceph-231 ~]# /usr/sbin/sgdisk --largest-new=1 --change-name=1:"ceph data" --partition-guid=1:1b9521d7-ee24-4043-96a7-1a3140bbff27 --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be --mbrtogpt -- /dev/sdb
[root@ceph-231 ~]# /usr/bin/udevadm settle --timeout=600
[root@ceph-231 ~]# /usr/bin/flock -s /dev/sdb /usr/sbin/partprobe /dev/sdb
[root@ceph-231 ~]# /usr/bin/udevadm settle --timeout=600

Format the data partition as xfs

[root@ceph-231 ~]# /usr/sbin/mkfs -t xfs -f -i size=2048 -- /dev/sdb1

Look up the mkfs and mount options

[root@ceph-231 ~]# /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
[root@ceph-231 ~]# /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs
[root@ceph-231 ~]# /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
[root@ceph-231 ~]# /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs

osd_mkfs_options_xfs, osd_fs_mkfs_options_xfs, osd_mount_options_xfs and osd_fs_mount_options_xfs are all empty, so xfs gets the default mount options noatime,inode64. Mount the temporary directory:

[root@ceph-231 ~]# mkdir /var/lib/ceph/tmp/mnt.uCrLyH
[root@ceph-231 ~]# /usr/bin/mount -t xfs -o noatime,inode64 -- /dev/sdb1 /var/lib/ceph/tmp/mnt.uCrLyH
[root@ceph-231 ~]# /usr/sbin/restorecon /var/lib/ceph/tmp/mnt.uCrLyH

Get the cluster fsid

[root@ceph-231 ~]# /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
ad3bdf51-ae79-44c3-b634-0c9f4995bbf5

Write the cluster fsid into the ceph_fsid temp file

[root@ceph-231 ~]# vi /var/lib/ceph/tmp/mnt.uCrLyH/ceph_fsid.1308.tmp
[root@ceph-231 ~]# /usr/sbin/restorecon -R /var/lib/ceph/tmp/mnt.uCrLyH/ceph_fsid.1308.tmp
[root@ceph-231 ~]# /usr/bin/chown -R ceph:ceph /var/lib/ceph/tmp/mnt.uCrLyH/ceph_fsid.1308.tmp
[root@ceph-231 ~]# mv /var/lib/ceph/tmp/mnt.uCrLyH/ceph_fsid.1308.tmp /var/lib/ceph/tmp/mnt.uCrLyH/ceph_fsid

Generate the osd_uuid

[root@ceph-231 ~]# uuidgen
410fa9bc-cdbf-469e-a08a-c246048d5e9b


Write the osd_uuid into the fsid temp file

[root@ceph-231 ~]# vi /var/lib/ceph/tmp/mnt.uCrLyH/fsid.1308.tmp
[root@ceph-231 ~]# /usr/sbin/restorecon -R /var/lib/ceph/tmp/mnt.uCrLyH/fsid.1308.tmp
[root@ceph-231 ~]# /usr/bin/chown -R ceph:ceph /var/lib/ceph/tmp/mnt.uCrLyH/fsid.1308.tmp
[root@ceph-231 ~]# mv /var/lib/ceph/tmp/mnt.uCrLyH/fsid.1308.tmp /var/lib/ceph/tmp/mnt.uCrLyH/fsid

Write the magic temp file; its content is ceph osd volume v026

[root@ceph-231 ~]# vi /var/lib/ceph/tmp/mnt.uCrLyH/magic.1308.tmp
[root@ceph-231 ~]# /usr/sbin/restorecon -R /var/lib/ceph/tmp/mnt.uCrLyH/magic.1308.tmp
[root@ceph-231 ~]# /usr/bin/chown -R ceph:ceph /var/lib/ceph/tmp/mnt.uCrLyH/magic.1308.tmp
[root@ceph-231 ~]# mv /var/lib/ceph/tmp/mnt.uCrLyH/magic.1308.tmp /var/lib/ceph/tmp/mnt.uCrLyH/magic

Look up the UUID of the journal partition sdb2

[root@ceph-231 ~]# ll /dev/disk/by-partuuid/ | grep sdb2
lrwxrwxrwx 1 root root 10 Jun 27 19:21 f693b826-e070-4b42-af3e-07d011994583 -> ../../sdb2

Write the journal_uuid into its temp file

[root@ceph-231 ~]# vi /var/lib/ceph/tmp/mnt.uCrLyH/journal_uuid.1308.tmp
[root@ceph-231 ~]# /usr/sbin/restorecon -R /var/lib/ceph/tmp/mnt.uCrLyH/journal_uuid.1308.tmp
[root@ceph-231 ~]# /usr/bin/chown -R ceph:ceph /var/lib/ceph/tmp/mnt.uCrLyH/journal_uuid.1308.tmp
[root@ceph-231 ~]# mv /var/lib/ceph/tmp/mnt.uCrLyH/journal_uuid.1308.tmp /var/lib/ceph/tmp/mnt.uCrLyH/journal_uuid

Create the journal symlink

[root@ceph-231 ~]# ln -s /dev/disk/by-partuuid/f693b826-e070-4b42-af3e-07d011994583 /var/lib/ceph/tmp/mnt.uCrLyH/journal

Run restorecon to restore SELinux file contexts

[root@ceph-231 ~]# /usr/sbin/restorecon -R /var/lib/ceph/tmp/mnt.uCrLyH
[root@ceph-231 ~]# /usr/bin/chown -R ceph:ceph /var/lib/ceph/tmp/mnt.uCrLyH

Unmount and remove the temporary directory

[root@ceph-231 ~]# /bin/umount -- /var/lib/ceph/tmp/mnt.uCrLyH
[root@ceph-231 ~]# rm -rf /var/lib/ceph/tmp/mnt.uCrLyH

Change the OSD partition's typecode to 4fbd7e29-9d25-41b8-afd0-062c0ceff05d, which means ready

[root@ceph-231 ~]# /usr/sbin/sgdisk --typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d -- /dev/sdb
[root@ceph-231 ~]# /usr/bin/udevadm settle --timeout=600
[root@ceph-231 ~]# /usr/bin/flock -s /dev/sdb /usr/sbin/partprobe /dev/sdb
[root@ceph-231 ~]# /usr/bin/udevadm settle --timeout=600

Force the kernel to emit a device event

[root@ceph-231 ~]# /usr/bin/udevadm trigger --action=add --sysname-match sdb1


Activating the OSD

Taking /usr/sbin/ceph-disk -v activate --mark-init systemd --mount /dev/sdb1 as the example, the ceph-disk activate command runs through the following steps.

Get the filesystem type (xfs)

[root@ceph-231 ~]# /sbin/blkid -p -s TYPE -o value -- /dev/sdb1
xfs

Look up osd_mount_options_xfs

[root@ceph-231 ~]# /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs

Look up osd_fs_mount_options_xfs

[root@ceph-231 ~]# /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs

osd_mount_options_xfs and osd_fs_mount_options_xfs are empty, so xfs gets the default mount options noatime,inode64; mount the temporary directory /var/lib/ceph/tmp/mnt.GoeBOu

[root@ceph-231 ~]# mkdir /var/lib/ceph/tmp/mnt.GoeBOu
[root@ceph-231 ~]# /usr/bin/mount -t xfs -o noatime,inode64 -- /dev/sdb1 /var/lib/ceph/tmp/mnt.GoeBOu
[root@ceph-231 ~]# /usr/sbin/restorecon /var/lib/ceph/tmp/mnt.GoeBOu

Unmount and remove the temporary directory

[root@ceph-231 ~]# /bin/umount -- /var/lib/ceph/tmp/mnt.GoeBOu
[root@ceph-231 ~]# rm -rf /var/lib/ceph/tmp/mnt.GoeBOu

Start the OSD daemon (0 is the osd id): disable clears any permanent enablement, enable --runtime enables the unit for the current boot only, and then the unit is started; on the next boot udev re-triggers activation

[root@ceph-231 ~]# /usr/bin/systemctl disable ceph-osd@0
[root@ceph-231 ~]# /usr/bin/systemctl enable --runtime ceph-osd@0
[root@ceph-231 ~]# /usr/bin/systemctl start ceph-osd@0
