文档章节

hive 常用小技巧

王二铁
 王二铁
发布于 2014/05/12 19:30
字数 1402
阅读 2427
收藏 2

1. parse_url如果查不到会返回 NULL

parse_url用来解析url中的数据。常用的就是解析HOST和QUERY。

string

parse_url(string urlString, string partToExtract [, string keyToExtract])

Returns the specified part from the URL. Valid values for partToExtract include HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO. e.g. parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') returns 'facebook.com'. Also a value of a particular key in QUERY can be extracted by providing the key as the third argument, e.g. parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') returns 'v1'.

select parse_url('http://www.meilishuo.com/guang/hot','QUERY','page') from class_method_map where parse_url('http://www.meilishuo.com/guang/hot','QUERY','page')<=0 limit 10;   

Total MapReduce CPU Time Spent: 2 seconds 440 msec
OK
Time taken: 38.136 seconds

select parse_url('http://www.meilishuo.com/guang/hot','QUERY','page') from class_method_map where parse_url('http://www.meilishuo.com/guang/hot','QUERY','page') is NULL limit 10;
Total MapReduce CPU Time Spent: 2 seconds 660 msec
OK
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
Time taken: 40.133 seconds

select parse_url('http://www.meilishuo.com/guang/hot','QUERY','page') from class_method_map where parse_url('http://www.meilishuo.com/guang/hot','QUERY','page') is not NULL limit 10;
Total MapReduce CPU Time Spent: 2 seconds 530 msec
OK
Time taken: 34.121 seconds

2. 默认分区会导致查询出错,删掉时要确认表中(show partitions visitlogs)是否还有。尤其是使用外部表时要注意删除数据。

ALTER TABLE visitlogs DROP PARTITION (dt='2012-12-25', vhour='__HIVE_DEFAULT_PARTITION__');

hive>
    > select sessidmodex(sessid,10), count(*), count(distinct sessid), count(distinct visitip) from visitlogs where ((dt='2012-11-10' and vhour>=13) or (dt='2012-11-11' and vhour<13) )
    > AND ((class_name='goods' AND method_name='goods_poster' and uri like '%page=0%') OR (class_name='goods' AND method_name='hot'  and parse_url(concat('http://www.meilishuo.com',uri),'QUERY','page') is NULL) ) AND not is_spam(dt,sessid,'SESSID') group by sessidmodex(sessid, 10);
FAILED: Hive Internal Error: java.lang.NullPointerException(null)
java.lang.NullPointerException
     at org.apache.hadoop.hive.ql.optimizer.pcr.PcrExprProcFactory.opAnd(PcrExprProcFactory.java:128)
     at org.apache.hadoop.hive.ql.optimizer.pcr.PcrExprProcFactory$GenericFuncExprProcessor.process(PcrExprProcFactory.java:267)
     at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
     at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88)
     at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125)
     at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102)
     at org.apache.hadoop.hive.ql.optimizer.pcr.PcrExprProcFactory.walkExprTree(PcrExprProcFactory.java:450)
     at org.apache.hadoop.hive.ql.optimizer.pcr.PcrOpProcFactory$FilterPCR.process(PcrOpProcFactory.java:149)
     at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
     at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88)
     at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125)
     at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102)
     at org.apache.hadoop.hive.ql.optimizer.pcr.PartitionConditionRemover.transform(PartitionConditionRemover.java:78)
     at org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:87)
     at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7306)
     at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
     at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:430)
     at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:337)
     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:889)
     at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
     at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
     at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
     at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     at java.lang.reflect.Method.invoke(Method.java:616)
     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

3. Array和Map的常用方法:

A[n] A is an Array and n is an int Returns the nth element in the array A. The first element has index 0 e.g. if A is an array comprising of ['foo', 'bar'] then A[0] returns 'foo' and A[1] returns 'bar' 
M[key] M is a Map<K, V> and key has type K Returns the value corresponding to the key in the map e.g. if M is a map comprising of {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'} then M['all'] returns 'foobar'  

array<K> map_keys(Map<K.V>) Returns an unordered array containing the keys of the input map 
array<V> map_values(Map<K.V>) Returns an unordered array containing the values of the input map 

boolean array_contains(Array<T>, value) 
Returns TRUE if the array contains value

简单demo:查询qzone渠道来过的所有用户的记录
select * from user_session_stat where dt='2012-10-15' and array_contains(map_keys(market_from),'tx_qzone') limit 10;

visitips格式{"172.0.0.1":100,"172.0.0.1":20,"172.0.0.2":5} #IP:访问量, 按访问量排序

map_keys(visitips)[0]:获取访问量最多的IP

4. explode 打散数组和字典
select explode(map_keys(market_from)) as cc from user_session_stat where dt='2012-11-28' and size(map_keys(market_from))>1 limit 2;

select tmp.cc,count(*) from (select explode(map_keys(market_from)) as cc from user_session_stat where dt='2012-11-28' and size(map_keys(market_from))>1 limit 10) tmp group by cc;

详细:https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView

第一区间:9.8~9.18
第二区间:10.2~10.12
第三区间:10.21~11.01
第四区间:11.07~11.17
select tmp2.cc,count(*) from (
select explode(map_keys(uss.market_from)) as cc from (
select sessid from user_session_stat where dt='2012-09-18' and is_spam=0 and sub_channel='norefer') tmp join user_session_stat uss on tmp.sessid=uss.sessid where uss.dt>='2012-09-08' and uss.dt<='2012-09-17' and uss.is_spam=0 ) tmp2 group by tmp2.cc;

PS:
select explode(map_keys(uss.market_from)) as cc,uss.sessid from (
select sessid from user_session_stat where dt='2012-11-01' and is_spam=0 and sub_channel='norefer') tmp join user_session_stat uss on tmp.sessid=uss.sessid where uss.dt='2012-10-30' and uss.dt<='2012-10-31' and uss.is_spam=0
不支持:Error in semantic analysis: 1:51 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'sessid'

5. HIVE 的 LATERAL VIEW
===============================

http://stackoverflow.com/questions/11373543/explode-the-array-of-struct-in-hive

This is the below Hive Table

CREATE EXTERNAL TABLE IF NOT EXISTS SampleTable
(
USER_ID BIGINT,
NEW_ITEM ARRAY<STRUCT<PRODUCT_ID: BIGINT,TIMESTAMPS:STRING>>
)
And this is the data in the above table-
1015826235    
 
[{"product_id":220003038067,"timestamps":"1340321132000"},{"product_id":300003861266,"timestamps":"1340271857000"}]
Is there any way I can get the below output from the HiveQL after exploding the array?
**USER_ID**  |  **PRODUCT_ID**  |   **TIMESTAMPS**
 ------------+------------------+----------------
1015826235      220003038067       1340321132000
1015826235      300003861266       1340271857000
SELECT
   user_id,
   prod_and_ts.product_id as product_id,
   prod_and_ts.timestamps as timestamps
FROM 
   SampleTable 
   LATERAL VIEW explode(new_item) exploded_table as prod_and_ts;======================


select * from user_session_stat LATERAL VIEW  explode(map_keys(market_from))  channel_cc as prod_and_ts  where dt='2012-12-23'  limit 10;

000048693e1e5f7cd12113011144502a     2012-11-30          ["0"]     ["","tq_shoppingbag1105d_121223","out_link"]     {"123.171.149.104":21,"182.35.12.232":9}     /paipaiguang/%E7%BE%BD%E7%BB%92%E6%9C%8D?PTAG=20036.9.213          Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER     8     organic     norefer     0     SHANDONG     JINAN     3079     {"norefer":1,"tx_weibo":1}     2012-12-23     norefer
000048693e1e5f7cd12113011144502a     2012-11-30          ["0"]     ["","tq_shoppingbag1105d_121223","out_link"]     {"123.171.149.104":21,"182.35.12.232":9}     /paipaiguang/%E7%BE%BD%E7%BB%92%E6%9C%8D?PTAG=20036.9.213          Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER     8     organic     norefer     0     SHANDONG     JINAN     3079     {"norefer":1,"tx_weibo":1}     2012-12-23     tx_weibo

select sessid,sub_channel,market_from,prod_and_ts from user_session_stat LATERAL VIEW  explode(map_keys(market_from))  channel_cc as prod_and_ts  where dt='2012-12-23';

00002b681cb0e0ad81212232313359e3     norefer     {"norefer":1}     norefer
0000480e5b010d80c121223160448bdb     norefer     {"norefer":1}     norefer
000048693e1e5f7cd12113011144502a     norefer     {"norefer":1,"tx_weibo":1}     norefer
000048693e1e5f7cd12113011144502a     norefer     {"norefer":1,"tx_weibo":1}     tx_weibo
00007a56dbbe2cb1e1212230031058e6     norefer     {"norefer":1}     norefer
00007be99d947ee54121223135026115     norefer     {"norefer":1}     norefer

6. 替换Mysql-->show tables like '%%’
    show tables '*cpc*';

7. regexp_extract(uri, '/share/([0-9]+)', 1) 正则截取



官方UDF:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF


8、hive本身强大的正则:

实例:表A,id= 110,    字段json

{"cnt":1,"value":[{"time":1390543558,"index":1,"level":190,"name":"A97M/AccessiV.B virus"},{"time":0,"index":0,"level":0,"name":"               "},{"time":0,"index":0,"level":0,"name":"               "}]} 

查询, value第一个值里的 name :

select id, get_json_object(A.json, '$.value[0].name') from A id=110;

get_json_object

A limited version of JSONPath is supported:

  • $ : Root object

  • . : Child operator

  • [] : Subscript operator for array

  • * : Wildcard for []

Syntax not supported that's worth noticing:

  • : Zero length string as key

  • .. : Recursive descent

  • @ : Current object/element

  • () : Script expression

  • ?() : Filter (script) expression.

  • [,] : Union operator

  • [start:end.step] : array slice operator

Example: src_json table is a single column (json), single row table:

+----+
                                json
+----+
{ "store" :
   { "fruit" :\[{ "weight" : 8 , "type" : "apple" },{ "weight" : 9 , "type" : "pear" }],
    "bicycle" :{ "price" : 19.95 , "color" : "red" }
   },
  "email" : "amy@only_for_json_udf_test.net" ,
  "owner" : "amy"
}
+----+

The fields of the json object can be extracted using these queries:

hive> SELECT get_json_object(src_json.json,  '$.owner' ) FROM src_json;
amy
 
hive> SELECT get_json_object(src_json.json,  '$.store.fruit\[0]' ) FROM src_json;
{ "weight" : 8 , "type" : "apple" }
 
hive> SELECT get_json_object(src_json.json,  '$.non_exist_key' ) FROM src_json;
NULL



© 著作权归作者所有

王二铁

王二铁

粉丝 89
博文 62
码字总数 41123
作品 0
北京
后端工程师
私信 提问
大数据经典学习路线(及供参考)之 一

1.Linux基础和分布式集群技术 学完此阶段可掌握的核心能力: 熟练使用Linux,熟练安装Linux上的软件,了解熟悉负载均衡、高可靠等集群相关概念,搭建互联网高并发、高可靠的服务架构; 学完此...

柯西带你学编程
2018/05/22
0
0
Hive客户端工具

Hive客户端工具 {流水理鱼|wwek}2017-12-271 阅读 hivehadoopv 前沿 Hive有多种Hive客户端工具 Hive命令行工具只适合执行已经编写好的HSQL语句,或者执行较为简单 […] 点赞 hivehadoopv 作者...

{流水理鱼|wwek}
2017/12/27
0
0
大数据入门与实战-Hive 常见SQL、技巧与问题

1 常见SQL 取数据库前100条数据 -随机取100条数据 2 常见方法 宏的使用 宏可以看做是一个简短的函数,或者是对一个表达式取别名,同时可以将这个表达式中的一些值做成变量调用时传入,比较适...

致Great
05/28
0
0
Quicksql能加速hive的查询速度吗?

在CentOS 7.X系统上部署的hive集群,如果不用什么特殊技巧的话,看着hive查询进度条简直是能急死人。

kppom
01/10
264
0
Hive 常见问题与技巧【Updating】

1Q: 是否有像类似于phpmyadmin一样的hive查询客户端,能以界面的方式查询hive语句和导出数据 A: 有的,客户端的话可以使用squirrel来连接hive,squirrel是一个通用的数据库查询客户端,还有有...

大数据之路
2013/09/26
1K
0

没有更多内容

加载失败,请刷新页面

加载更多

最简单的获取相机拍照的图片

  import android.content.Intent;import android.graphics.Bitmap;import android.os.Bundle;import android.os.Environment;import android.provider.MediaStore;import andr......

MrLins
57分钟前
4
0
说好不哭!数据可视化深度干货,前端开发下一个涨薪点在这里~

随着互联网在各行各业的影响不断深入,数据规模越来越大,各企业也越来越重视数据的价值。作为一家专业的数据智能公司,个推从消息推送服务起家,经过多年的持续耕耘,积累沉淀了海量数据,在...

个推
59分钟前
8
0
第三方支付-返回与回调注意事项

不管是支付宝,微信,还是其它第三方支付,第四方支付,支付机构服务商只要涉及到钱的交易都要进行如下校验,全部成功了才视为成功订单 1.http请求是否成功 2.校验商户号 3.校验订单号及状态...

Shingfi
今天
4
0
简述Java内存分配和回收策略以及Minor GC 和 Major GC(Full GC)

内存分配: 1. 栈区:栈可分为Java虚拟机和本地方法栈 2. 堆区:堆被所有线程共享,在虚拟机启动时创建,是唯一的目的是存放对象实例,是gc的主要区域。通常可分为两个区块年轻代和年老代。更...

DustinChan
今天
6
0
Excel插入批注:可在批注插入文字、形状、图片

1.批注一直显示:审阅选项卡-------->勾选显示批注选项: 2.插入批注快捷键:Shift+F2 组合键 3.在批注中插入图片:鼠标右键点击批注框的小圆点【重点不可以在批注文本框内点击】----->调出批...

东方墨天
今天
6
1

没有更多内容

加载失败,请刷新页面

加载更多

返回顶部
顶部