文档章节

Hive 在多维统计分析中的应用 & 技巧总结

大数据之路
 大数据之路
发布于 2013/04/11 02:30
字数 1610
阅读 1.1W
收藏 6

3 月,跳不动了?>>>

多维统计一般分两种,我们看看 Hive 中如何解决:

1、同属性的多维组合统计

(1)问题:
有如下数据,字段内容分别为:url, catePath0, catePath1, catePath2, unitparams


https://cwiki.apache.org/confluence 0 1 8 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://my.oschina.net/leejun2005/blog/83058 0 1 23 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto 0 1 25 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
https://cwiki.apache.org/confluence 0 5 18 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://my.oschina.net/leejun2005/blog/83058 0 5 118 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto 0 3 98 {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto 0 3 8 {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://my.oschina.net/leejun2005/blog/83058 0 5 81 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto 0 9 8 {"store":{"fruit":[{"weight":9,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}

(2)需求:
计算 catePath0, catePath1, catePath2 这三种维度组合下,各个 url 对应的 pv、uv,如:


0 1 23 1 1
0 1 25 1 1
0 1 8 1 1
0 1 ALL 3 3
0 3 8 1 1
0 3 98 1 1
0 3 ALL 2 1
0 5 118 1 1
0 5 18 1 1
0 5 81 1 1
0 5 ALL 3 2
0 ALL ALL 8 3
ALL ALL ALL 8 3

(3)解决思路:
hive 中同属性多维统计问题通常用 union all 组合出各种维度然后 group by 进行求解:

create EXTERNAL table IF NOT EXISTS t_log (
	url string, c0 string, c1 string, c2 string, unitparams string
)  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' location '/tmp/decli/1';

select * from (
		select host, c0, c1, c2 from t_log t0 
		LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host 
		where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
	union all
		select host, c0, c1, 'ALL' c2 from t_log t0 
		LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host 
		where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
	union all
		select host, c0, 'ALL' c1, 'ALL' c2 from t_log t0 
		LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host 
		where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
	union all
		select host, 'ALL' c0, 'ALL' c1, 'ALL' c2 from t_log t0 
		LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host 
		where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
) test;

select c0, c1, c2, count(host) PV, count(distinct(host)) UV from (
		select host, c0, c1, c2 from t_log t0 
		LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host 
		where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
	union all
		select host, c0, c1, 'ALL' c2 from t_log t0 
		LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host 
		where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
	union all
		select host, c0, 'ALL' c1, 'ALL' c2 from t_log t0 
		LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host 
		where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
	union all
		select host, 'ALL' c0, 'ALL' c1, 'ALL' c2 from t_log t0 
		LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host 
		where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
) test group by c0, c1, c2;

2、不同属性的多维组合统计

这种场景下我们一般选择 Multi Table/File Inserts,下面选自《programming hive》P124

Making Multiple Passes over the Same Data
Hive has a special syntax for producing multiple aggregations from a single pass
through a source of data, rather than rescanning it for each aggregation. This change
can save considerable processing time for large input data sets. We discussed the details
previously in Chapter 5.
For example, each of the following two queries creates a table from the same source
table, history:
hive> INSERT OVERWRITE TABLE sales
    > SELECT * FROM history WHERE action='purchased';
hive> INSERT OVERWRITE TABLE credits
    > SELECT * FROM history WHERE action='returned';
This syntax is correct, but inefficient. The following rewrite achieves the same thing,
but using a single pass through the source history table:
hive> FROM history
    > INSERT OVERWRITE sales   SELECT * WHERE action='purchased'
    > INSERT OVERWRITE credits SELECT * WHERE action='returned';

FROM pv_users
    INSERT OVERWRITE TABLE pv_gender_sum
        SELECT pv_users.gender, count_distinct(pv_users.userid)
        GROUP BY pv_users.gender

    INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
        SELECT pv_users.age, count_distinct(pv_users.userid)
        GROUP BY pv_users.age;
https://cwiki.apache.org/confluence/display/Hive/Tutorial


注意事项以及一些小技巧:

1、hive union all 的用法:不支持 top level,以及各个select字段名称、属性必须严格一致

2、结果的顺序问题,可以自己加字符控制排序

3、多重insert和union all一样也只扫描一次,但因为要insert到多个分区,所以做了很多其他的事情,导致消耗的时间非常长,其会产生多个job,union all 本身只有一个job

关于 insert overwrite 产生多 job 并行执行的问题:

set hive.exec.parallel=true;   //打开任务并行执行
set hive.exec.parallel.thread.number=16; //同一个sql允许最大并行度,默认为8。
http://superlxw1234.iteye.com/blog/1703713

4、当前HIVE 不支持 not in 中包含查询子句的语法,形如如下的HQ语句是不被支持的: 
查询在key字段在a表中,但不在b表中的数据
select a.key from a where key not in(select key from b)  该语句在hive中不支持
可以通过left outer join进行查询,(假设B表中包含另外的一个字段 key1 
select a.key from a left outer join b on a.key=b.key where b.key1 is null

5、left out join 不能连续3个以上使用,必须2个一组,2个一组包装起来使用。

select p.ssi,p.pv,p.uv,p.nuv,p.visits,'2012-06-19 17:00:00' from (
	select * from (
		select * from (select ssi,count(1) pv,sum(visits) visits from FactClickAnalysis  
		where logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00' group by ssi ) p1
		left outer join 
		(
		select ssi,count(1) uv from (select ssi,cookieid from FactClickAnalysis 
		where logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00' group by ssi,cookieid ) t1 group by ssi 
		) p2 on p1.ssi=p2.ssi
	) p3
	left outer join
	(
		select ssi, count(1) nuv from FactClickAnalysis 
		where logTime = insertTime and logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00' group by ssi 
	) p4 on p3.ssi=p4.ssi
) p

6、hive本地执行mr

http://superlxw1234.iteye.com/blog/1703546

7、hive动态分区创建过多遇到的一个错误

http://superlxw1234.iteye.com/blog/1677938

8、hive中巧用正则表达式的贪婪匹配

http://superlxw1234.iteye.com/blog/1751216

9、hive匹配全中文字段

用java中匹配中文的正则即可:

name rlike '^[\\u4e00-\\u9fa5]+$'

判断一个字段是否全数字:

select mobile from woa_login_log_his where pt = '2012-01-10' and mobile rlike '^\\d+$' limit 50;  

10、hive中使用sql window函数 LAG/LEAD/FIRST/LAST

http://superlxw1234.iteye.com/blog/1600323

http://www.shaoqun.com/a/18839.aspx

11、hive优化之------控制hive任务中的map数和reduce数

http://superlxw1234.iteye.com/blog/1582880

12、hive中转义$等特殊字符

http://superlxw1234.iteye.com/blog/1568739

13、日期处理:

查看N天前的日期:

select from_unixtime(unix_timestamp('20111102','yyyyMMdd') - N*86400,'yyyyMMdd') from t_lxw_test1 limit 1;  

获取两个日期之间的天数/秒数/分钟数等等:

select ( unix_timestamp('2011-11-02','yyyy-MM-dd')-unix_timestamp('2011-11-01','yyyy-MM-dd') ) / 86400  from t_lxw_test limit 1;  

14、删除 Hive 临时文件 hive.exec.scratchdir

http://hi.baidu.com/youziguo/item/1dd7e6315dcc0f28b2c0c576


REF:

http://superlxw1234.iteye.com/blog/1536440
http://liubingwwww.blog.163.com/blog/static/3048510720125201749323/
http://blog.csdn.net/azhao_dn/article/details/6921429

http://superlxw1234.iteye.com/category/228899

© 著作权归作者所有

大数据之路
粉丝 1631
博文 514
码字总数 330865
作品 0
武汉
架构师
私信 提问
加载中

评论(0)

Hive 在多维统计分析中的应用 & 技巧总结

本文原地址:https://my.oschina.net/leejun2005/blog/121945 多维统计一般分两种,我们看看 Hive 中如何解决: 1、同属性的多维组合统计 (1)问题: 有如下数据,字段内容分别为:url, ca...

SimplePoint
2017/04/18
0
0
大数据下的数据分析平台架构

时间:2011-08-15 14:59 作者:谢超 随着互联网、移动互联网和物联网的发展,谁也无法否认,我们已经切实地迎来了一个海量数据的时代,数据调查公司IDC预计2011年的数据总量将达到1.8万亿GB,...

长征2号
2017/04/07
0
0
hadoop 统计分析oracle数据库的数据

最近要做个应用功能的探讨,把oracle的大量数据分布式统计分析。。查了几天资料,有些说使用sqoop导出oracle数据为hdfs文件,但是文件名却不能指定,到时候mapreduce统计分析的时候怎么读取指...

黑帽子
2014/05/16
1.9K
2
大数据教程(11.8)Hive1.2.2简介&初体验

上一篇文章分析了Hive1.2.2的安装,本节博主将分享Hive的体验&Hive服务端和客户端的使用方法。 一、Hive与hadoop直接的关系 Hive利用HDFS存储数据,利用MapReduce查询数据。 二、Hive与传统数...

em_aaron
2019/01/22
44
0
初识Hive:3张图了解Hive的数据类型、架构图!

Hive是一个构建在Hadoop上的数据仓库框架,可以将结构化的数据文件,映射为一张数据库表,并提供完整的sql查询功能。可以将sql语句,转换为MapReduce任务进行运行。 Hive选择Hadoop来存储和处...

zlw东南风
2017/10/18
0
0

没有更多内容

加载失败,请刷新页面

加载更多

什么是反射,为什么有用? - What is reflection and why is it useful?

问题: What is reflection, and why is it useful? 什么是反射,为什么有用? I'm particularly interested in Java, but I assume the principles are the same in any language. 我对Jav......

技术盛宴
51分钟前
19
0
SSM框架整合

mybatis逆向工程 mybatis-generator生成pojo、mapper接口及映射文件 mapper放到e3-manager-dao层中 导入sql到数据库中; 导入逆向工程工具,配置xml文件 运行main方法 重复运行main不会覆盖!...

七宝1
今天
30
0
OSChina 周日乱弹 —— 和网友的第一次开房经历

Osc乱弹歌单(2020)请戳(这里) 【今日歌曲】 @薛定谔的兄弟 :分享洛神有语创建的歌单「我喜欢的音乐」: 《Ljósið》- Ólafur Arnalds 手机党少年们想听歌,请使劲儿戳(这里) @xiaos...

小小编辑
今天
54
0
程序员职业生涯指引

程序员应该尽早规划自己的职业生涯 为什么写 众所周知 IT 这一行到了一定的年龄、大部分人都或多或少有危机感,特别是今年全国乃至全球发生的疫情、导致整体经济受到很大的影响、这次的疫情影...

科比可比克
今天
11
0
JVM调优实战分析

一、查看服务器项目JVM参数以及参数分析 1、jps 命令 : 列出系统中所有的 Java 应用程序以及PID 如下图所示,26647就是我部署在服务器的一个小项目的 PID 2、jmap命令:查看堆的使用情况 如...

IT-Mamba
今天
46
0

没有更多内容

加载失败,请刷新页面

加载更多

返回顶部
顶部