hive 常用小技巧

原创
2014/05/12 19:30
阅读数 5.1K

1. parse_url如果查不到会返回 NULL

parse_url用来解析url中的数据。常用的就是解析HOST和QUERY。

string

parse_url(string urlString, string partToExtract [, string keyToExtract])

Returns the specified part from the URL. Valid values for partToExtract include HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO. e.g. parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') returns 'facebook.com'. Also a value of a particular key in QUERY can be extracted by providing the key as the third argument, e.g. parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') returns 'v1'.

select parse_url('http://www.meilishuo.com/guang/hot','QUERY','page') from class_method_map where parse_url('http://www.meilishuo.com/guang/hot','QUERY','page')<=0 limit 10;   

Total MapReduce CPU Time Spent: 2 seconds 440 msec
OK
Time taken: 38.136 seconds

select parse_url('http://www.meilishuo.com/guang/hot','QUERY','page') from class_method_map where parse_url('http://www.meilishuo.com/guang/hot','QUERY','page') is NULL limit 10;
Total MapReduce CPU Time Spent: 2 seconds 660 msec
OK
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
Time taken: 40.133 seconds

select parse_url('http://www.meilishuo.com/guang/hot','QUERY','page') from class_method_map where parse_url('http://www.meilishuo.com/guang/hot','QUERY','page') is not NULL limit 10;
Total MapReduce CPU Time Spent: 2 seconds 530 msec
OK
Time taken: 34.121 seconds

2. 默认分区会导致查询出错,删掉时要确认表中(show partitions visitlogs)是否还有。尤其是使用外部表时要注意删除数据。

ALTER TABLE visitlogs DROP PARTITION (dt='2012-12-25', vhour='__HIVE_DEFAULT_PARTITION__');

hive>
    > select sessidmodex(sessid,10), count(*), count(distinct sessid), count(distinct visitip) from visitlogs where ((dt='2012-11-10' and vhour>=13) or (dt='2012-11-11' and vhour<13) )
    > AND ((class_name='goods' AND method_name='goods_poster' and uri like '%page=0%') OR (class_name='goods' AND method_name='hot'  and parse_url(concat('http://www.meilishuo.com',uri),'QUERY','page') is NULL) ) AND not is_spam(dt,sessid,'SESSID') group by sessidmodex(sessid, 10);
FAILED: Hive Internal Error: java.lang.NullPointerException(null)
java.lang.NullPointerException
     at org.apache.hadoop.hive.ql.optimizer.pcr.PcrExprProcFactory.opAnd(PcrExprProcFactory.java:128)
     at org.apache.hadoop.hive.ql.optimizer.pcr.PcrExprProcFactory$GenericFuncExprProcessor.process(PcrExprProcFactory.java:267)
     at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
     at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88)
     at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125)
     at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102)
     at org.apache.hadoop.hive.ql.optimizer.pcr.PcrExprProcFactory.walkExprTree(PcrExprProcFactory.java:450)
     at org.apache.hadoop.hive.ql.optimizer.pcr.PcrOpProcFactory$FilterPCR.process(PcrOpProcFactory.java:149)
     at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
     at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88)
     at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125)
     at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102)
     at org.apache.hadoop.hive.ql.optimizer.pcr.PartitionConditionRemover.transform(PartitionConditionRemover.java:78)
     at org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:87)
     at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7306)
     at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
     at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:430)
     at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:337)
     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:889)
     at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
     at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
     at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
     at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     at java.lang.reflect.Method.invoke(Method.java:616)
     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

3. Array和Map的常用方法:

A[n] A is an Array and n is an int Returns the nth element in the array A. The first element has index 0 e.g. if A is an array comprising of ['foo', 'bar'] then A[0] returns 'foo' and A[1] returns 'bar' 
M[key] M is a Map<K, V> and key has type K Returns the value corresponding to the key in the map e.g. if M is a map comprising of {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'} then M['all'] returns 'foobar'  

array<K> map_keys(Map<K.V>) Returns an unordered array containing the keys of the input map 
array<V> map_values(Map<K.V>) Returns an unordered array containing the values of the input map 

boolean array_contains(Array<T>, value) 
Returns TRUE if the array contains value

简单demo:查询qzone渠道来过的所有用户的记录
select * from user_session_stat where dt='2012-10-15' and array_contains(map_keys(market_from),'tx_qzone') limit 10;

visitips格式{"172.0.0.1":100,"172.0.0.1":20,"172.0.0.2":5} #IP:访问量, 按访问量排序

map_keys(visitips)[0]:获取访问量最多的IP

4. explode 打散数组和字典
select explode(map_keys(market_from)) as cc from user_session_stat where dt='2012-11-28' and size(map_keys(market_from))>1 limit 2;

select tmp.cc,count(*) from (select explode(map_keys(market_from)) as cc from user_session_stat where dt='2012-11-28' and size(map_keys(market_from))>1 limit 10) tmp group by cc;

详细:https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView

第一区间:9.8~9.18
第二区间:10.2~10.12
第三区间:10.21~11.01
第四区间:11.07~11.17
select tmp2.cc,count(*) from (
select explode(map_keys(uss.market_from)) as cc from (
select sessid from user_session_stat where dt='2012-09-18' and is_spam=0 and sub_channel='norefer') tmp join user_session_stat uss on tmp.sessid=uss.sessid where uss.dt>='2012-09-08' and uss.dt<='2012-09-17' and uss.is_spam=0 ) tmp2 group by tmp2.cc;

PS:
select explode(map_keys(uss.market_from)) as cc,uss.sessid from (
select sessid from user_session_stat where dt='2012-11-01' and is_spam=0 and sub_channel='norefer') tmp join user_session_stat uss on tmp.sessid=uss.sessid where uss.dt='2012-10-30' and uss.dt<='2012-10-31' and uss.is_spam=0
不支持:Error in semantic analysis: 1:51 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'sessid'

5. HIVE 的 LATERAL VIEW
===============================

http://stackoverflow.com/questions/11373543/explode-the-array-of-struct-in-hive

This is the below Hive Table

CREATE EXTERNAL TABLE IF NOT EXISTS SampleTable
(
USER_ID BIGINT,
NEW_ITEM ARRAY<STRUCT<PRODUCT_ID: BIGINT,TIMESTAMPS:STRING>>
)
And this is the data in the above table-
1015826235    
 
[{"product_id":220003038067,"timestamps":"1340321132000"},{"product_id":300003861266,"timestamps":"1340271857000"}]
Is there any way I can get the below output from the HiveQL after exploding the array?
**USER_ID**  |  **PRODUCT_ID**  |   **TIMESTAMPS**
 ------------+------------------+----------------
1015826235      220003038067       1340321132000
1015826235      300003861266       1340271857000
SELECT
   user_id,
   prod_and_ts.product_id as product_id,
   prod_and_ts.timestamps as timestamps
FROM 
   SampleTable 
   LATERAL VIEW explode(new_item) exploded_table as prod_and_ts;======================


select * from user_session_stat LATERAL VIEW  explode(map_keys(market_from))  channel_cc as prod_and_ts  where dt='2012-12-23'  limit 10;

000048693e1e5f7cd12113011144502a     2012-11-30          ["0"]     ["","tq_shoppingbag1105d_121223","out_link"]     {"123.171.149.104":21,"182.35.12.232":9}     /paipaiguang/%E7%BE%BD%E7%BB%92%E6%9C%8D?PTAG=20036.9.213          Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER     8     organic     norefer     0     SHANDONG     JINAN     3079     {"norefer":1,"tx_weibo":1}     2012-12-23     norefer
000048693e1e5f7cd12113011144502a     2012-11-30          ["0"]     ["","tq_shoppingbag1105d_121223","out_link"]     {"123.171.149.104":21,"182.35.12.232":9}     /paipaiguang/%E7%BE%BD%E7%BB%92%E6%9C%8D?PTAG=20036.9.213          Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER     8     organic     norefer     0     SHANDONG     JINAN     3079     {"norefer":1,"tx_weibo":1}     2012-12-23     tx_weibo

select sessid,sub_channel,market_from,prod_and_ts from user_session_stat LATERAL VIEW  explode(map_keys(market_from))  channel_cc as prod_and_ts  where dt='2012-12-23';

00002b681cb0e0ad81212232313359e3     norefer     {"norefer":1}     norefer
0000480e5b010d80c121223160448bdb     norefer     {"norefer":1}     norefer
000048693e1e5f7cd12113011144502a     norefer     {"norefer":1,"tx_weibo":1}     norefer
000048693e1e5f7cd12113011144502a     norefer     {"norefer":1,"tx_weibo":1}     tx_weibo
00007a56dbbe2cb1e1212230031058e6     norefer     {"norefer":1}     norefer
00007be99d947ee54121223135026115     norefer     {"norefer":1}     norefer

6. 替换Mysql-->show tables like '%%’
    show tables '*cpc*';

7. regexp_extract(uri, '/share/([0-9]+)', 1) 正则截取



官方UDF:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF


8、hive本身强大的正则:

实例:表A,id= 110,    字段json

{"cnt":1,"value":[{"time":1390543558,"index":1,"level":190,"name":"A97M/AccessiV.B virus"},{"time":0,"index":0,"level":0,"name":"               "},{"time":0,"index":0,"level":0,"name":"               "}]} 

查询, value第一个值里的 name :

select id, get_json_object(A.json, '$.value[0].name') from A id=110;

get_json_object

A limited version of JSONPath is supported:

  • $ : Root object

  • . : Child operator

  • [] : Subscript operator for array

  • * : Wildcard for []

Syntax not supported that's worth noticing:

  • : Zero length string as key

  • .. : Recursive descent

  • @ : Current object/element

  • () : Script expression

  • ?() : Filter (script) expression.

  • [,] : Union operator

  • [start:end.step] : array slice operator

Example: src_json table is a single column (json), single row table:

+----+
                                json
+----+
{ "store" :
   { "fruit" :\[{ "weight" : 8 , "type" : "apple" },{ "weight" : 9 , "type" : "pear" }],
    "bicycle" :{ "price" : 19.95 , "color" : "red" }
   },
  "email" : "amy@only_for_json_udf_test.net" ,
  "owner" : "amy"
}
+----+

The fields of the json object can be extracted using these queries:

hive> SELECT get_json_object(src_json.json,  '$.owner' ) FROM src_json;
amy
 
hive> SELECT get_json_object(src_json.json,  '$.store.fruit\[0]' ) FROM src_json;
{ "weight" : 8 , "type" : "apple" }
 
hive> SELECT get_json_object(src_json.json,  '$.non_exist_key' ) FROM src_json;
NULL



展开阅读全文
打赏
0
2 收藏
分享
加载中
更多评论
打赏
0 评论
2 收藏
0
分享
返回顶部
顶部