测试hive

原创
2017/02/15 19:29
阅读数 630

一、准备数据

 

造数据脚本 : gendata.sh

#!/bin/bash

file=$1

s=$2

touch $file

for((i=0;i<10000000;i++))

do

str=','$s;

name=${i}${str}${i}

#echo $name

echo  $name>> $file

done

 

echo 'show testdata'

head $file

 

造数据:

先造十个小文件,每个1000w记录:

bash gendata.sh  name.txt name ; bash gendata.sh  zhuzhi.txt zhuzhi ; bash gendata.sh minzu.txt minzu ; bash gendata.sh  jg.txt jg ;bash gendata.sh gj.txt gj ; bash gendata.sh dz.txt dz ; bash gendata.sh abcd.txt abcd ; bash gendata.sh efgh.txt efgh ; bash gendata.sh  xyz.txt xyz ;bash gendata.sh  opq.txt opq

 

total 1.8G

-rw-r--r-- 1 root root 189M Feb  9 10:35 abcd.txt

-rw-r--r-- 1 root root 170M Feb  9 10:32 dz.txt

-rw-r--r-- 1 root root 189M Feb  9 10:38 efgh.txt

-rw-r--r-- 1 root root 170M Feb  9 10:28 gj.txt

-rw-r--r-- 1 root root 170M Feb  9 10:25 jg.txt

-rw-r--r-- 1 root root 199M Feb  9 10:22 minzu.txt

-rw-r--r-- 1 root root 189M Feb  9 10:08 name.txt

-rw-r--r-- 1 root root 180M Feb  9 10:49 opq.txt

-rw-r--r-- 1 root root 180M Feb  9 10:41 xyz.txt

-rw-r--r-- 1 root root 208M Feb  9 10:19 zhuzhi.txt

 

大文件,1亿记录

 bash gendata.sh  name1000.txt name

 

-rw-r--r--  1 root root 2.1G Feb  9 10:50 name1000.txt

 

二、测试10个小文件,每个文件1000万记录,180MB大小,总1亿记录,1.8G ,不做任何优化的数据分析

 

hive中建表:

 

create table hyl_test_par(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;

 

手动建立分区文件夹:

 

hadoop  fs -mkdir -p /apps/hive/warehouse/hyl_test_par/sys_sj=20170209/sys_type=2003

 

上传数据、修改权限:

 

 hadoop  fs -put *.txt /apps/hive/warehouse/hyl_test_par/sys_sj=20170209/sys_type=2003/

 hadoop  fs -chown -R hive /apps/hive/warehouse/hyl_test_par/

 

修复分区信息:

 

0: jdbc:hive2://cluster09.hzhz.co:10000> show partitions hyl_test_par;

+------------+--+

| partition  |

+------------+--+

+------------+--+

No rows selected (0.125 seconds)

0: jdbc:hive2://cluster09.hzhz.co:10000> msck repair table hyl_test_par;

No rows affected (0.551 seconds)

0: jdbc:hive2://cluster09.hzhz.co:10000> show partitions hyl_test_par;

+--------------------------------+--+

|           partition            |

+--------------------------------+--+

| sys_sj=20170209/sys_type=2003  |

+--------------------------------+--+

1 row selected (0.123 seconds)

 

 

测试:

 

select count(*) from hyl_test_par where name <> ' ' ;

 

第一次运行

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (189.116 seconds)

 

第二次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (118.107 seconds)

 

 

第三次运行:

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (117.551 seconds)

 

第四次运行:

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (117.44 seconds)

 

第五次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (113.291 seconds)

 

======================================莫名的分割线=======================================================

下午重新跑,性能块了10倍以上!!!

第一次运行

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.274 seconds)

 

 

第二次运行

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.525 seconds)

 

 

第三次运行

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.11 seconds)

 

第四次:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.722 seconds)

 

 

三、测试10个小文件,每个文件1000万记录,180MB大小,总1亿记录,1.8G,经过analysis分区信息的表

 

同理,创建测试analysis的表,并导入数据:

 

create table hyl_test_par_ana(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;

 

hadoop  fs -mkdir -p /apps/hive/warehouse/hyl_test_par_ana/sys_sj=20170209/sys_type=2003

hadoop  fs -put *.txt /apps/hive/warehouse/hyl_test_par_ana/sys_sj=20170209/sys_type=2003/

hadoop  fs -chown -R hive /apps/hive/warehouse/hyl_test_par_ana/

 

show partitions hyl_test_par_ana;

msck repair table hyl_test_par_ana;

show partitions hyl_test_par_ana;

 

分区信息分析:

0: jdbc:hive2://cluster09.hzhz.co:10000> analyze table hyl_test_par_ana partition(sys_sj=20170209,sys_type=2003) compute statistics ;

INFO  : Session is already open

INFO  : Dag name: analyze table hyl_test_par_ana ...statistics(Stage-0)

INFO  :

 

INFO  : Status: Running (Executing on YARN cluster with App id application_1486351392526_0021)

.

.

.

INFO  : Partition default.hyl_test_par_ana{sys_sj=20170209, sys_type=2003} stats: [numFiles=10, numRows=100000000, totalSize=1927777800, rawDataSize=1827777800]

No rows affected (126.81 seconds)

 

测试:

select count(*) from hyl_test_par_ana;

0: jdbc:hive2://cluster09.hzhz.co:10000> select count(*) from hyl_test_par_ana;

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (0.086 seconds)

 

换sql:

select count(*) from hyl_test_par_ana where name <> ' ';

第一次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (118.239 seconds)

 

第二次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (121.687 seconds)

 

第三次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (121.319 seconds)

 

======================================莫名的分割线=======================================================

下午重新跑,性能块了10倍以上!!!

第一次运行

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (10.923 seconds)

 

第二次运行

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.058 seconds)

 

第三次运行

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.45 seconds)

 

第四次:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.218 seconds)

 

 

三、测试1个大文件,2.0G大小,总1亿记录,没有任何优化的表

创建使用一个大文件的同结构表,并上传数据:

 

 

create table hyl_test_par_big(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;

 

hadoop  fs -mkdir -p /apps/hive/warehouse/hyl_test_par_big/sys_sj=20170209/sys_type=2003

hadoop  fs -put name1000.txt /apps/hive/warehouse/hyl_test_par_big/sys_sj=20170209/sys_type=2003/

hadoop  fs -chown -R hive /apps/hive/warehouse/hyl_test_par_big/

 

[hdfs@cluster13 tmp]$ hadoop  fs -du -h /apps/hive/warehouse/hyl_test_par_big/

2.0 G  /apps/hive/warehouse/hyl_test_par_big/sys_sj=20170209

 

show partitions hyl_test_par_big;

msck repair table hyl_test_par_big;

show partitions hyl_test_par_big;

 

测试:

select count(*) from hyl_test_par_big where name <> ' ';

第一次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.356 seconds)

第二次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.3 seconds)

 

第三次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (5.861 seconds)

第四次:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (5.675 seconds)

 

第五次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.814 seconds)

 

======================================莫名的分割线=======================================================

下午测试:

第一次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.933 seconds)

 

第二次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (4.435 seconds)

 

第三次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (5.868 seconds)

 

第四次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.403 seconds)

 

 

第五次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.814 seconds)

 

四、测试10个小文件,180MB大小,总1亿记录,是经过insert into方式建立的表

 

通过insert into 从第一张表中导入数据(这个过程中hive会自动analysis信息)到新表,测试:

 

create table hyl_test_par_auto as select * from hyl_test_par distribute by rand(123);

 

0: jdbc:hive2://cluster09.hzhz.co:10000> create table hyl_test_par_auto as  select * from hyl_test_par distribute by rand(123);

INFO  : Session is already open

INFO  : Dag name: create table hyl_test_par_auto a...rand(123)(Stage-1)

INFO  : Tez session was closed. Reopening...

INFO  : Session re-established.

INFO  :

 

INFO  : Status: Running (Executing on YARN cluster with App id application_1486351392526_0022)

 

INFO  : Map 1: -/-    Reducer 2: 0/10   

INFO  : Map 1: 0/119    Reducer 2: 0/10   

.

.

.

INFO  : Moving data to directory hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto from hdfs://myBigdata/apps/hive/warehouse/.hive-staging_hive_2017-02-09_13-05-59_036_2906983779886430780-1/-ext-10001

INFO  : Table default.hyl_test_par_auto stats: [numFiles=10, numRows=100000000, totalSize=3327777800, rawDataSize=3227777800]

 

[hdfs@cluster13 tmp]$ hadoop  fs -ls -h  hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003

Found 10 items

-rwxrwxrwx   2 hive hdfs    183.9 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000000_0

-rwxrwxrwx   2 hive hdfs    183.7 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000001_0

-rwxrwxrwx   2 hive hdfs    183.6 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000002_0

-rwxrwxrwx   2 hive hdfs    184.6 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000003_0

-rwxrwxrwx   2 hive hdfs    183.7 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000004_0

-rwxrwxrwx   2 hive hdfs    183.3 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000005_0

-rwxrwxrwx   2 hive hdfs    184.2 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000006_0

-rwxrwxrwx   2 hive hdfs    184.0 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000007_0

-rwxrwxrwx   2 hive hdfs    184.2 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000008_0

-rwxrwxrwx   2 hive hdfs    183.4 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000009_0

 

 

测试:

select count(*) from hyl_test_par_auto where name <> ' ';

第一次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (14.653 seconds)

 

第二次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (13.989 seconds)

 

第三次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (9.236 seconds)

 

 

drop table hyl_test_auto;

 

create table hyl_test_par_auto(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;

 

insert into table hyl_test_par_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);

0: jdbc:hive2://cluster09.hzhz.co:10000> insert into table hyl_test_par_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);

INFO  : Session is already open

INFO  : Dag name: insert into table hyl_test_par_a...rand(123)(Stage-1)

INFO  :

 

INFO  : Status: Running (Executing on YARN cluster with App id application_1486351392526_0022)

 

INFO  : Map 1: 0/119    Reducer 2: 0/10  

.

.

INFO  : Loading data to table default.hyl_test_par_auto partition (sys_sj=20170209, sys_type=2003) from hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/.hive-staging_hive_2017-02-09_13-26-47_915_3059621581501435386-1/-ext-10000

INFO  : Partition default.hyl_test_par_auto{sys_sj=20170209, sys_type=2003} stats: [numFiles=10, numRows=100000000, totalSize=1927777800, rawDataSize=1827777800]

No rows affected (135.325 seconds)

 

 

测试:

select count(*) from hyl_test_par_auto where name <> ' ';

第一次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.303 seconds)

 

 

第二次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.56 seconds)

 

 

第三次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.446 seconds)

 

第四次:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.643 seconds)

 

0: jdbc:hive2://cluster09.hzhz.co:10000> select count(*) from hyl_test_par_auto ;

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (0.098 seconds)

 

 

======================================莫名的分割线=======================================================

下午测试:

第一次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.282 seconds)

 

第二次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.414 seconds)

 

第三次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.452 seconds)

 

第四次运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.24 seconds)

 

 

五、统计:

 

耗时\条件 10*180MB*NONE 10*180MB*analyze 1*2GB*none 10*180MB*insert into
第一次运行 11.274 10.923 11.933 11.282
第二次运行 11.525 6.058 4.435 3.414
第三次运行 11.11 6.45 5.868 6.452
第四次运行 11.722 6.218 3.403 6.24
平均时间 11.40775 7.41225 6.40975 6.847
热数据平均时间(去掉第一次)

             11.45233333

                                    6.242

                 4.568666667

                        5.368666667

 

 

六、继续寻找优化项:

 

单个文件,通过insert into方式插入数据的表:

 

set mapred.reduce.tasks=1;

create table hyl_test_par_big_auto(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;

 

insert into table hyl_test_par_big_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);

0: jdbc:hive2://cluster09.hzhz.co:10000> insert into table hyl_test_par_big_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);

INFO  : Tez session hasn't been created yet. Opening session

INFO  : Dag name: insert into table hyl_test_par_b...rand(123)(Stage-1)

INFO  :

 

INFO  : Status: Running (Executing on YARN cluster with App id application_1486351392526_0023)

 

INFO  : Map 1: -/-    Reducer 2: 0/1

INFO  : Map 1: 0/119    Reducer 2: 0/1

.

.

.

INFO  : Loading data to table default.hyl_test_par_big_auto partition (sys_sj=20170209, sys_type=2003) from hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_big_auto/sys_sj=20170209/sys_type=2003/.hive-staging_hive_2017-02-09_14-30-05_480_3967529948260649900-1/-ext-10000

INFO  : Partition default.hyl_test_par_big_auto{sys_sj=20170209, sys_type=2003} stats: [numFiles=1, numRows=100000000, totalSize=1927777800, rawDataSize=1827777800]

No rows affected (104.637 seconds)

 

测试:

select count(*) from hyl_test_par_big_auto where name <> ' ';

运行:

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.744 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (4.188 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (4.041 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (5.198 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.788 seconds)

 

 

耗时\条件 10*180MB*NONE 10*180MB*analyze 1*2GB*none 10*180MB*insert into 1*2GB*none*insert into
第一次运行 11.274 10.923 11.933 11.282 11.744
第二次运行 11.525 6.058 4.435 3.414 4.188
第三次运行 11.11 6.45 5.868 6.452 4.041
第四次运行 11.722 6.218 3.403 6.24 5.198
平均时间 11.40775 7.41225 6.40975 6.847 6.29275
热数据平均时间(去掉第一次)

11.45233333

 6.242

4.56866666

5.368666667

  4.475666667

 

优化,列信息统计分析

 

analyze table hyl_test_par_ana compute statistics for columns;

 

select count(*) from hyl_test_par_ana where name <> ' ';

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.519 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

 

1 row selected (7.688 seconds)

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (6.456 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (5.651 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.413 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.62 seconds)

 

 

 

 

矢量查询(Vectorized query) 每次处理数据时会将1024行数据组成一个batch进行处理,而不是一行一行进行处理:

 

set hive.vectorized.execution.enabled = true;

set hive.vectorized.execution.reduce.enabled = true;

 

select count(*) from hyl_test_par where name <> ' ';

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.54 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (3.859 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (4.134 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.844 seconds)

 

+------------+--+

|    _c0     |

+------------+--+

| 100000000  |

+------------+--+

1 row selected (11.089 seconds)

展开阅读全文
打赏
0
0 收藏
分享
加载中
更多评论
打赏
0 评论
0 收藏
0
分享
返回顶部
顶部