数据处理平台从oracle迁移到hadoop平台后,原有的数据处理过程需要改写为sparkSql。由于sparkSql执行环境和数据的不确定,经常有资源竞争导致处理过程意外停止,数据倾斜导致任务执行失败。 为减少出错概率,需要对sparkSql进行规范与优化。
通常数据倾斜有以下几种现象:
1.检查group by 字段是否全为一个值或空, 从group中移除只用一个值得字段
2. sql使用大表反复关联小表如:
) aleft join (select m.aaa102,m.aaa103 from database.aa11 m where m.aaa100 = ‘AAC004‘) con a.AAC004 = c.aaa102left join (select m.aaa102,m.aaa103 from database.aa11 m where m.aaa100 = ‘AAC011‘) don a.aac011 = d.aaa102left join (select m.aaa102,m.aaa103 from database.aa11 m where m.aaa100 = ‘YAB019‘) eon a.yab019 = e.aaa102left join (select m.aaa102,m.aaa103 from database.aa11 m where m.aaa100 = ‘AAB019‘) fon a.aab019 = f.aaa102left join (select m.aaa102,m.aaa103 from database.aa11 m where m.aaa100 = ‘YAB003‘) gon a.aab301 = g.aaa102left join (select m.aaa102,m.aaa103 from database.aa11 m where m.aaa100 = ‘AAB301‘) hon a.yac145 = h.aaa102
默认小于150M的表会广播(broacast),如上处理后将不能被广播, 修改为:
) aleft join database.aa11 con a.AAC004 = c.aaa102 and c.aaa100 = ‘AAC004‘left join database.aa11 don a.aac011 = d.aaa102 and d.aaa100 = ‘AAC011‘left join database.aa11 eon a.yab019 = e.aaa102 and e.aaa100 = ‘YAB019‘left join database.aa11 fon a.aab019 = f.aaa102 and f.aaa100 = ‘AAB019‘left join database.aa11 gon a.aab301 = g.aaa102 and g.aaa100 = ‘YAB003‘left join database.aa11 hon a.yac145 = h.aaa102 and h.aaa100 = ‘AAB301‘