# Hive SQL Tuning Tips Every Big Data Practitioner Should Know | JD Cloud Tech Team

2023/12/25 10:57

1. Introduction

2. Some Self-Reflection First


3. Query Optimization

3.1 Make Operations as Atomic as Possible

Avoid packing complex logic into one giant SQL statement. Split it into steps that each do one thing, carrying results between steps through intermediate (temporary) tables; each step is then easier to optimize, rerun, and debug.
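The body of this section was lost in extraction; the usual reading of the advice is to break one deeply nested statement into simple stages. A minimal sketch, with hypothetical table and column names:

```sql
-- Hypothetical example: materialize each stage instead of nesting everything.
-- Step 1: filter the raw data first.
CREATE TEMPORARY TABLE tmp_orders AS
SELECT uid, amount FROM orders WHERE ds = '2020-08-10';

-- Step 2: aggregate the filtered rows.
CREATE TEMPORARY TABLE tmp_order_sum AS
SELECT uid, SUM(amount) AS total_amount FROM tmp_orders GROUP BY uid;

-- Step 3: the final join works on small, pre-digested inputs.
SELECT u.uid, u.name, s.total_amount
FROM users u JOIN tmp_order_sum s ON u.uid = s.uid;
```

Each step can be inspected and rerun on its own, and the optimizer sees three simple plans instead of one tangled one.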

3.2 Use Appropriate Data Types

--Before: comparing an INT column to a string literal forces an implicit cast on every row.
SELECT * FROM table WHERE age = '30';

--After: compare with a literal of the matching type.
SELECT * FROM table WHERE age = 30;

3.3 Avoid Full Table Scans

--An open-ended partition predicate like this reads every historical partition up to {TX_DATE}:
SELECT * FROM table WHERE dt <= '{TX_DATE}'

--Before: the filter on the secondary table sits in the WHERE clause, so the tables are fully joined first and the partition filter is applied afterward.

select a.* from test1 a left join test2 b on a.uid = b.uid where a.ds='2020-08-10' and b.ds='2020-08-10' 

--After: moving the secondary table's partition filter into the ON clause prunes it before the join.
select a.* from test1 a left join test2 b on (b.uid is not null and a.uid = b.uid and b.ds='2020-08-10') where a.ds='2020-08-10'

--This launches a job that scans the whole table just to find the latest partition:
SELECT max(dt) FROM table

Instead, read the partition list directly with `SHOW PARTITIONS table` or `hdfs dfs -ls` on the table's partition directory.

3.4 Use Partitions

--Before: filtering on ordinary columns cannot prune any partitions.
SELECT * FROM table WHERE date = '2021-01-01' AND region = 'A';

--After: filtering on partition columns lets Hive read only the matching partitions.
SELECT * FROM table WHERE partition_date = '2021-01-01' AND partition_region = 'A';

SELECT * FROM table WHERE src_mark = '23' AND dt BETWEEN '2020-05-16' AND '{TX_DATE}' AND scr_code IS NOT NULL

3.5 Use Indexes (Hive 2.x and earlier only; index support was removed in Hive 3.0)

--Before: without an index, this predicate requires a full scan.
SELECT * FROM table WHERE region = 'A' AND status = 'ACTIVE';

--After: with a compound index on the filtered columns, the same query can avoid the full scan.
CREATE INDEX idx_region_status ON TABLE table (region, status) AS 'COMPACT' WITH DEFERRED REBUILD;
SELECT * FROM table WHERE region = 'A' AND status = 'ACTIVE';

3.6 Query Rewriting

--Before: IN with a subquery.
SELECT * FROM table1 WHERE id IN (SELECT id FROM table2 WHERE region = 'A');

--After: rewritten as a JOIN against the filtered subquery, which Hive typically executes more efficiently.
SELECT * FROM table1 t1 JOIN (SELECT id FROM table2 WHERE region = 'A') t2 ON t1.id = t2.id;

3.7 Predicate Pushdown

--Before: the filter on a.age is applied after the join.
select a.*, b.* from a join b on a.name = b.name where a.age > 30

--After: pushing the predicate into a subquery filters a before the join, shrinking the join input.
SELECT a.*, b.* FROM (SELECT * FROM a WHERE age > 30) a JOIN b ON a.name = b.name

3.8 Avoid COUNT DISTINCT

COUNT DISTINCT is carried out by a single Reduce task. When that one reducer has to process too much data, the whole job struggles to finish. Replace COUNT DISTINCT with a GROUP BY followed by a COUNT: it costs one extra job, but on large data sets the trade is clearly worth it.

--Before: one reducer deduplicates every uid.
select count(distinct uid) from test where ds='2020-08-10' and uid is not null

--After: GROUP BY spreads deduplication across reducers; the outer COUNT runs on the already-deduplicated set.
select count(a.uid) from (select uid from test where uid is not null and ds = '2020-08-10' group by uid) a

3.9 Use WITH AS (CTEs)

--Before:
select a.* from test1 a left join test2 b on a.uid = b.uid where a.ds='2020-08-10' and b.ds='2020-08-10'

--After: the CTE filters test2 once up front, and the filtered result can be reused by later clauses.
with b as (
  select uid from test2 where ds = '2020-08-10' and uid is not null
)
select a.* from test1 a left join b on a.uid = b.uid where a.ds='2020-08-10' and a.uid is not null

3.10 Joining a Large Table with a Small Table

--Enable automatic conversion to map join (default true):
set hive.auto.convert.join = true;

--Size threshold below which a table counts as small (default 25MB):
set hive.mapjoin.smalltable.filesize=25000000;

3.11 Joining Two Large Tables

3.11.1 Filtering Out Null Keys
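No example survived extraction for this subsection. The standard pattern, sketched here against the article's test1/test2 tables, is to drop rows whose join key is NULL before the join whenever they are not needed in the result; otherwise they all hash to the same reducer:

```sql
-- Filter NULL join keys inside a subquery so they never reach the shuffle:
SELECT a.*
FROM (SELECT * FROM test1 WHERE uid IS NOT NULL AND ds = '2020-08-10') a
JOIN test2 b ON a.uid = b.uid;
```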

3.11.2 Converting Null Keys
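No example survived extraction here either. When the NULL-key rows must be kept (for instance in a LEFT JOIN result), the usual remedy is to replace each NULL key with a random value that cannot match anything, so those rows spread evenly across reducers. A sketch using the article's test tables (the `hive_` prefix is arbitrary):

```sql
-- Replace NULL keys with random non-matching values to spread them out:
SELECT a.*
FROM test1 a
LEFT JOIN test2 b
  ON CASE WHEN a.uid IS NULL
          THEN concat('hive_', rand())
          ELSE a.uid END = b.uid
WHERE a.ds = '2020-08-10';
```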

3.12 Avoid Cartesian Products

--Before: the comma-style implicit join has no join condition, so it silently produces a Cartesian product.
SELECT * FROM A, B;

--After: if a full cross product is genuinely intended, state it explicitly.
SELECT * FROM A CROSS JOIN B;

4. Data Loading and Transformation

4.1 Use Compressed Storage Formats

--Before: loading raw files into a plain text table.
LOAD DATA INPATH '/path/to/data' INTO TABLE table;

--After: LOAD DATA cannot convert formats, so define a table stored in a compressed columnar format and insert into it.
CREATE TABLE table_orc STORED AS ORC AS SELECT * FROM table;

4.2 Transform and Filter Data

--Before: a leading-wildcard LIKE cannot be narrowed and must evaluate every row.
SELECT * FROM table WHERE name LIKE '%John%';

--After: an exact match is far cheaper; use it whenever the value is known precisely.
SELECT * FROM table WHERE name = 'John';

4.3 Multiple INSERTs from a Single Table Scan

--Before: two separate statements scan my_table twice.
INSERT INTO temp_table_20201115 SELECT * FROM my_table WHERE dt ='2020-11-15';
INSERT INTO temp_table_20201116 SELECT * FROM my_table WHERE dt ='2020-11-16';

--After: Hive's multi-insert syntax writes both outputs from a single scan.
FROM my_table
INSERT INTO temp_table_20201115 SELECT * WHERE dt ='2020-11-15'
INSERT INTO temp_table_20201116 SELECT * WHERE dt ='2020-11-16'

5. Performance Evaluation and Optimization

5.1 Use the EXPLAIN Command

--Inspect the execution plan to see how a query will actually run:
EXPLAIN SELECT * FROM table WHERE age = 30;

--Compare plans to confirm that an added partition filter really prunes partitions:
EXPLAIN SELECT * FROM table WHERE age = 30 AND partition = 'partition1';

5.2 Tune Parallelism and Resource Allocation

--Allow independent stages of a query to run in parallel:
SET hive.exec.parallel=true;

--Cap the number of reducers so a single query cannot exhaust cluster resources:
SET hive.exec.reducers.max=10;

6. Data Skew

6.1 Skew Caused by Null Values

6.2 Skew Caused by Mismatched Data Types
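This subsection's body was lost in extraction. The classic case is joining a STRING key to an INT key: Hive hashes the keys after an implicit cast, and the string values that fail the cast all collapse onto one reducer. Casting both sides to a common type explicitly avoids this; a sketch with hypothetical tables:

```sql
-- Cast the INT key to STRING so both sides hash on the same type
-- (users.user_id is INT, logs.user_id is STRING; names hypothetical):
SELECT l.*
FROM logs l
LEFT JOIN users u ON CAST(u.user_id AS STRING) = l.user_id;
```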

6.3 Skew Caused by Large Unsplittable Files
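Only the heading survived here. The typical culprit is a large GZIP-compressed text file: GZIP is not splittable, so however big the file grows, a single map task must read all of it. The fix belongs upstream, when the data is produced: use a splittable container format. A hypothetical sketch:

```sql
-- Rewrite unsplittable .gz text data into ORC (block-compressed and
-- splittable), so future reads can parallelize (names hypothetical):
CREATE TABLE logs_orc STORED AS ORC
  TBLPROPERTIES ('orc.compress' = 'SNAPPY')
AS SELECT * FROM logs_gz;
```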

6.4 Skew Caused by Data Expansion
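Only the heading survived. Data expansion typically comes from multi-dimensional aggregation (GROUPING SETS / CUBE / ROLLUP), where every input row is replicated once per grouping set before the shuffle. Hive can split such queries into an extra job once the number of grouping sets passes a configurable threshold:

```sql
-- If a query uses more grouping sets than this threshold, Hive inserts
-- an additional job to spread the row expansion (default 30; lower it
-- when the expansion itself causes skew):
set hive.new.job.grouping.set.cardinality=10;
```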

6.5 Skew Caused by Table Joins
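Only the heading survived. When neither table is small enough for a map join and a few join keys are hot, Hive's skew-join optimization can process the hot keys in a dedicated follow-up job:

```sql
-- Handle heavily repeated join keys in a separate job:
set hive.optimize.skewjoin=true;
-- A key appearing in more rows than this is treated as skewed (default 100000):
set hive.skewjoin.key=100000;
```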

6.6 Skew When the Data Volume Genuinely Cannot Be Reduced
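Only the heading survived. When the skewed volume truly cannot be filtered or pre-aggregated away, the remaining lever is to give each reducer less work so the job gets more reducers:

```sql
-- Lower the bytes handled per reducer so Hive schedules more reducers
-- (default 256000000 bytes in recent Hive versions):
set hive.exec.reducers.bytes.per.reducer=134217728;
```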

7. Merging Small Files

7.1 Hive Engine Small-File Merge Parameters

--Merge small files produced by map-only jobs (default true):
set hive.merge.mapfiles = true;

--Merge small files produced at the reduce stage (default false):
set hive.merge.mapredfiles = true;

--Target size of merged files (default 256000000 bytes):
set hive.merge.size.per.task = 256000000;

--When the average output file size is below this threshold, launch an extra map-reduce job to merge files (default 16000000 bytes):
set hive.merge.smallfiles.avgsize = 256000000;

7.2 Spark Engine Small-File Merge Parameter (so prefer switching jobs from MR to Spark)

--Whether to merge small files (default true):
conf spark.sql.hive.mergeFiles=true;

8. Conclusion


--Before: IN with a subquery.
SELECT * FROM user_data WHERE user_id IN (SELECT user_id FROM order_data WHERE order_date >= '2022-01-01')

--After: the same filter rewritten as a JOIN against the filtered subquery.
SELECT u.* FROM user_data u JOIN (SELECT user_id FROM order_data WHERE order_date >= '2022-01-01') o ON u.user_id = o.user_id
