HiveSQL优化

下面列举一些常用的SQL优化方案

sql引起的数据倾斜

数据倾斜会导致某个Reduce运行过慢影响到整体的运行时长。通常在join和group by时，会出现这样的问题

join引起的数据倾斜,下面操作会将一个job变为两个job执行HQL

#如果是join过程出现倾斜，应该设置为true
set hive.optimize.skewjoin=true;
#这个是join的键对应的记录条数超过这个值则会进行优化
set hive.skewjoin.key=100000;

group by key引起的数据倾斜

# 如果group by过程出现倾斜应该设置为true
set hive.groupby.skewindata=true;
#这个是group的键对应的记录条数超过这个值则会进行优化
set hive.groupby.mapaggr.checkinterval=100000;

mapjoin(map端执行join）

针对应用场景合理使用MapJoin也很重要。Map Join可以解决数据倾斜问题，因为没有Reduce Task了;只运行Map Task，相比多运行Reduce Task来说省时间。

启动方式一：(自动)

1
2
3

set.hive.auto.convert.join = true;
#默认值是25mb小表小于25mb自动启动mapjoin 
hive.mapjoin.smalltable.filesize=25000000

启动方式二：(手动）

1	`select /+mapjoin(A)/ f.a,f.b from A t join B f on (f.a=t.a)`

bucketjoin

合理利用桶分区很重要，因为它可以避免全表检索，在大数据场景中全表检索意味着什么应该可以想象…

在满足下面两个情况时使用：

1.两个表以相同方式划分桶
2.两个表的桶个数是倍数关系

create table order(cid int,price float) clustered by(cid)   into 32 buckets;
create table customer(id int,first string) clustered by(id)   into 32/64 buckets;

select price from order t join customer s on t.cid=s.id;

where条件优化

尽可能早的筛掉更多的数据。

优化前（关系数据库不用考虑会自动优化）：

1	`select m.cid,u.id from order m join customer u on m.cid =u.id where m.dt='2019-08-18';`

优化后(where条件在map端执行而不是在reduce端执行）：

1	`select m.cid,u.id from （select * from order where dt='2019-08-18'） m join customer u on m.cid =u.id;`

count distinct优化

只有一个reduce，先去重再count负担比较大;解决方案是：启动两个job，一个job负责子查询(可以有多个reduce)，另一个job负责count(1)

优化前：

1	`select count(distinct id) from tablename;`

优化后：

select count(1) from (select distinct id from tablename) tmp;

select count(1) from (select id from tablename group by id) tmp;

set mapred.reduce.tasks=3;

合理使用union all

优化前：

1	`select a,sum(b),count(distinct c),count(distinct d) from test group by a;`

优化后：

select a,sum(b) as b,count(c) as c,count(d) as d from
（
select a, 0 as b,c,null as d from test group by a,c
union all
select a,0 as b, null as c,d from test group by a,d
union all
select a, b,null as c ,null as d from test) tmp group by a;

大数据

HiveSQL优化

http://example.com/2019/08/17/2019–08-17-HiveSQL优化/

Author

Hoey

Posted on

August 17, 2019

Licensed under

神经网络：基本分类[转] Previous

HDMI连接Raspberry Next