hive-hbase集成 column mapping

原创
2014/10/10 15:33
阅读数 547

##Column Mapping##

<pre> There are two SERDEPROPERTIES that control the mapping of HBase columns to Hive: hbase.columns.mapping hbase.table.default.storage.type: Can have a value of either string (the default) or binary, this option is only available as of Hive 0.9 and the string behavior is the only one available in earlier versions The column mapping support currently available is somewhat cumbersome and restrictive: for each Hive column, the table creator must specify a corresponding entry in the comma-delimited hbase.columns.mapping string (so for a Hive table with n columns, the string should have n entries); whitespace should not be used in between entries since these will be interperted as part of the column name, which is almost certainly not what you want a mapping entry must be either :key or of the form column-family-name:[column-name][#(binary|string) (the type specification that delimited by # was added in Hive 0.9.0, earlier versions interpreted everything as strings) If no type specification is given the value from hbase.table.default.storage.type will be used Any prefixes of the valid values are valid too (i.e. #b instead of #binary) If you specify a column as binary the bytes in the corresponding HBase cells are expected to be of the form that HBase's Bytes class yields. there must be exactly one :key mapping (we don't support compound keys yet) (note that before HIVE-1228 in Hive 0.6, :key was not supported, and the first Hive column implicitly mapped to the key; as of Hive 0.6, it is now strongly recommended that you always specify the key explictly; we will drop support for implicit key mapping in the future) if no column-name is given, then the Hive column will map to all columns in the corresponding HBase column family, and the Hive MAP datatype must be used to allow access to these (possibly sparse) columns there is currently no way to access the HBase timestamp attribute, and queries always access data with the latest timestamp. Since HBase does not associate datatype information with columns, the serde converts everything to string representation before storing it in HBase; there is currently no way to plug in a custom serde per column it is not necessary to reference every HBase column family, but those that are not mapped will be inaccessible via the Hive table; it's possible to map multiple Hive tables to the same HBase table The next few sections provide detailed examples of the kinds of column mappings currently possible. </pre>

###Multiple Columns and Families### 下面的例子中,有三个Hive列和两个HBase列簇,Hive中的两列(value2和value2)对应一个列簇(a,Hbase列名为b和c),其他的Hive列对应一个单独的列(e)并且拥有它自己的列簇(d)。

<pre> CREATE TABLE hbase_table_1(key int, value1 string, value2 int, value3 int) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.columns.mapping" = ":key,a:b,a:c,d:e" ); INSERT OVERWRITE TABLE hbase_table_1 SELECT foo, bar, foo+1, foo+2 FROM pokes WHERE foo=98 OR foo=100; </pre>

下面时在HBase中看到的:

<pre> hbase(main):014:0> describe "hbase_table_1" DESCRIPTION ENABLED {NAME => 'hbase_table_1', FAMILIES => [{NAME => 'a', COMPRESSION => 'N true ONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_M EMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'd', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN _MEMORY => 'false', BLOCKCACHE => 'true'}]} 1 row(s) in 0.0170 seconds hbase(main):015:0> scan "hbase_table_1" ROW COLUMN+CELL 100 column=a:b, timestamp=1267740457648, value=val_100 100 column=a:c, timestamp=1267740457648, value=101 100 column=d:e, timestamp=1267740457648, value=102 98 column=a:b, timestamp=1267740457648, value=val_98 98 column=a:c, timestamp=1267740457648, value=99 98 column=d:e, timestamp=1267740457648, value=100 2 row(s) in 0.0240 seconds </pre>

并且在Hive中查询:

<pre> hive> select * from hbase_table_1; Total MapReduce jobs = 1 Launching Job 1 out of 1 ... OK 100 val_100 101 102 98 val_98 99 100 Time taken: 4.054 seconds </pre>

###Hive MAP to HBase Column Family###

下面时一个Hive MAP 数据类型怎样用于访问整个列簇。每一行可以有一个不同的列的集合,列名对应map keys并且列值对应map values。

<pre> CREATE TABLE hbase_table_1(value map<string,int>, row_key int) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.columns.mapping" = "cf:,:key" ); INSERT OVERWRITE TABLE hbase_table_1 SELECT map(bar, foo), foo FROM pokes WHERE foo=98 OR foo=100; </pre>

(这个例子也证明了使用Hive的一个列而不是第一列作为HBase的row key。)

下面时在HBase中看到的。(两个不同的列名 在不同的行)

<pre> hbase(main):012:0> scan "hbase_table_1" ROW COLUMN+CELL 100 column=cf:val_100, timestamp=1267739509194, value=100 98 column=cf:val_98, timestamp=1267739509194, value=98 2 row(s) in 0.0080 seconds </pre>

在Hive中查询:

<pre> hive> select * from hbase_table_1; Total MapReduce jobs = 1 Launching Job 1 out of 1 ... OK {"val_100":100} 100 {"val_98":98} 98 Time taken: 3.808 seconds </pre>

注意,MAP的key数据类型必须是string,应为它用于命名HBase列,所以下面的表定义会失败:

<pre> CREATE TABLE hbase_table_1(key int, value map<int,int>) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.columns.mapping" = ":key,cf:" ); </pre>

注意:在apache-hive-0.13.1-bin中,没有报错。

###Illegal:Hive Primitive to HBase Column Family###

表定义,例如下面的是非法的因为一个Hive列映射到一整个列簇必须拥有MAP类型:

<pre> CREATE TABLE hbase_table_1(key int, value string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.columns.mapping" = ":key,cf:" ); FAILED: Error in metadata: java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.hbase.HBaseSerDe: hbase column family 'cf:' should be mapped to map<string,?> but is mapped to string) </pre>

###Example with binary columns###

应用默认值 hbase.table.default.storage.type:

<pre> CREATE TABLE hbase_table_1 (key int, value string, foobar double) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.columns.mapping" = ":key#b,cf:val,cf:foo#b" ); </pre>

指定hbase.table.default.storage.type:

<pre> CREATE TABLE hbase_table_1 (key int, value string, foobar double) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.columns.mapping" = ":key,cf:val#s,cf:foo", "hbase.table.default.storage.type" = "binary" ); </pre>

展开阅读全文
打赏
0
0 收藏
分享
加载中
更多评论
打赏
0 评论
0 收藏
0
分享
返回顶部
顶部