A Migration Plan for a Hadoop Cluster and Hive Data

Original post
2017/04/07 16:51

My company recently needed to migrate a Hadoop cluster and its Hive data. Completing the full migration while keeping the existing business running smoothly is not a trivial process. I have summarized the workflow below as a reference for other Hadoop cluster and Hive data warehouse administrators.

1. First, make sure the new Hadoop cluster and Hive are installed and configured correctly. To avoid unnecessary trouble, the new cluster uses the same versions as the old one: Hadoop 2.6.3 and Hive 1.2.1. A cross-version migration would require much more attention to detail.

2. Copy the HDFS data to the new cluster. The distcp command makes it easy to migrate and synchronize data between two clusters. Its options are listed below for reference:

[hadoop@master1 hive]$ hadoop distcp 
usage: distcp OPTIONS [source_path...] <target_path>
              OPTIONS
 -append                Reuse existing data in target files and append new
                        data to them if possible
 -async                 Should distcp execution be blocking
 -atomic                Commit all changes or none
 -bandwidth <arg>       Specify bandwidth per map in MB
 -delete                Delete from target, files missing in source
 -f <arg>               List of files that need to be copied
 -filelimit <arg>       (Deprecated!) Limit number of files copied to <= n
 -i                     Ignore failures during copy
 -log <arg>             Folder on DFS where distcp execution logs are
                        saved
 -m <arg>               Max number of concurrent maps to use for copy
 -mapredSslConf <arg>   Configuration for ssl config file, to use with
                        hftps://
 -overwrite             Choose to overwrite target files unconditionally,
                        even if they exist.
 -p <arg>               preserve status (rbugpcaxt)(replication,
                        block-size, user, group, permission,
                        checksum-type, ACL, XATTR, timestamps). If -p is
                        specified with no <arg>, then preserves
                        replication, block size, user, group, permission,
                        checksum type and timestamps. raw.* xattrs are
                        preserved when both the source and destination
                        paths are in the /.reserved/raw hierarchy (HDFS
                        only). raw.* xattrpreservation is independent of
                        the -p flag. Refer to the DistCp documentation for
                        more details.
 -sizelimit <arg>       (Deprecated!) Limit number of files copied to <= n
                        bytes
 -skipcrccheck          Whether to skip CRC checks between source and
                        target paths.
 -strategy <arg>        Copy strategy to use. Default is dividing work
                        based on file sizes
 -tmp <arg>             Intermediate work path to be used for atomic
                        commit
 -update                Update target, copying only missingfiles or
                        directories

The actual command looks like the following, where nn1 and nn2 are the NameNode hosts of the source and target clusters. (For a Hadoop 1.x to 2.x migration, plain hdfs:// access may fail, and you would have to transfer the data over hftp instead; search for the details of that approach.)

hadoop distcp hdfs://nn1/user/hive/ hdfs://nn2:8020/user


Because distcp itself runs as a MapReduce job, it consumes a significant amount of cluster resources. If the data volume is large, it is best to schedule the migration for a period when few other MR jobs are running.
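Putting the options above together, a fuller incremental, throttled run might look like this sketch. The hostnames, port, paths, and the -m and -bandwidth values are all assumptions to tune for your own clusters:

```shell
# Sketch of an incremental, throttled distcp run (hostnames/ports assumed):
#   -update        copy only files missing or changed on the target
#   -p             preserve replication, block size, user, group, permissions
#   -m 20          cap the number of concurrent map tasks
#   -bandwidth 10  limit each map to ~10 MB/s so live jobs are not starved
hadoop distcp -update -p -m 20 -bandwidth 10 \
  -log /tmp/distcp_logs \
  hdfs://nn1:8020/user/hive/warehouse \
  hdfs://nn2:8020/user/hive/warehouse
```

Running with -update makes the command safe to repeat: a final quick pass just before the cutover picks up only the files that changed since the first bulk copy.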


3. As you know, Hive data itself lives in HDFS, while the metadata (metastore) is usually kept in MySQL. If the new cluster also stores its metadata in MySQL, you can simply use mysqldump to copy the metastore database over. Of course, the entire warehouse must also be distcp'ed to the corresponding directory on the target cluster; tables are stored under /somepath/hive/warehouse/dbname.db/tablename.
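A minimal sketch of the metastore copy, assuming the metastore database is named hive and the MySQL hosts are old-mysql and new-mysql (all hypothetical names):

```shell
# Dump the metastore database from the old MySQL server (names are assumptions)
mysqldump -h old-mysql -u hive -p --databases hive > hive_metastore.sql

# Load it into the MySQL server used by the new cluster
mysql -h new-mysql -u hive -p < hive_metastore.sql

# If the new NameNode has a different address, the HDFS locations stored in
# the metastore (DBS.DB_LOCATION_URI and SDS.LOCATION) must be rewritten too:
mysql -h new-mysql -u hive -p hive -e "
  UPDATE DBS SET DB_LOCATION_URI = REPLACE(DB_LOCATION_URI, 'hdfs://nn1', 'hdfs://nn2');
  UPDATE SDS SET LOCATION = REPLACE(LOCATION, 'hdfs://nn1', 'hdfs://nn2');"
```

The last step is only needed when the NameNode URI changes; if the new cluster reuses the same hostname and port, the dumped locations remain valid as-is.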

In my experience, once both the data and the metadata have been synchronized, you can query all the data directly from the new Hive CLI; even partitioned tables work without any extra steps.
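A few quick sanity checks on the new cluster help confirm the migration; the database, table, and host names below are placeholders:

```shell
# Compare directory/file/byte counts for the warehouse on both clusters
hdfs dfs -count hdfs://nn1/user/hive/warehouse
hdfs dfs -count hdfs://nn2:8020/user/hive/warehouse

# Spot-check a table, including its partitions, from the new Hive CLI
hive -e "USE dbname; SHOW PARTITIONS tablename; SELECT COUNT(*) FROM tablename;"
```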


Of course, if the metadata storage has changed, or the metadata is corrupted, your only option may be to migrate the Hive data with CREATE TABLE statements plus dynamic partitioning. For a concrete approach, see

https://odinliu.com/2016/02/02/%E6%9C%80%E8%BF%91%E6%90%9EHadoop%E9%9B%86%E7%BE%A4%E8%BF%81%E7%A7%BB%E8%B8%A9%E7%9A%84%E5%9D%91%E6%9D%82%E8%AE%B0/
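As a rough sketch of that fallback, assuming the copied files are readable through a source table (for example an external table pointing at the distcp'ed directory), the data can be reloaded with dynamic partitioning. Here src_table, dst_table, and the partition column dt are all hypothetical:

```shell
hive -e "
  SET hive.exec.dynamic.partition=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;
  -- Recreate dst_table from its original DDL first, then reload it;
  -- the partition column must come last in the SELECT column list.
  INSERT OVERWRITE TABLE dst_table PARTITION (dt)
  SELECT * FROM src_table;"
```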
