首页标签分类
hive sql技巧
2025-08-14 · 更新 2026-03-03约 2 分钟 · 292 字
大数据杂文记
000

目录

hive sql技巧
1.单行多字段转多行
2.单行多属性值转多行
3.多行多列转一行多列
4.多行单列转单行多列
5.单行多列转多行一列
6. 谓词下推

hive sql技巧

1.单行多字段转多行

sql
自动换行:关
放大阅读
展开代码
select * from test; +----+------+------+------+ | id | col1 | col2 | col3 | +----+------+------+------+ | 1 | "a" | "b" | "c" | | 2 | "x" | "y" | "z" | +----+------+------+------+ <!-- more --> select id, case mark when 1 then col1 when 2 then col2 else col3 end col from order_detail od lateral view explode(array(1,2,3)) t as mark; +----+-----+ | id | col | +----+-----+ | 1 | a | | 1 | b | | 1 | c | | 2 | x | | 2 | y | | 2 | z | +----+-----+

2.单行多属性值转多行

sql
自动换行:关
放大阅读
展开代码
+----+---------+ | id | values | +----+---------+ | 1 | [a,b,c] | | 2 | [x,y] | +----+---------+ SELECT id, value FROM mytable LATERAL VIEW explode(values) t1 AS value; +----+-------+ | id | value | +----+-------+ | 1 | "a" | | 1 | "b" | | 1 | "c" | | 2 | "x" | | 2 | "y" | +----+-------+

3.多行多列转一行多列

sql
自动换行:关
放大阅读
展开代码
+----+------+------+------+ | id | col1 | col2 | col3 | +----+------+------+------+ | 1 | "a" | "b" | "c" | | 1 | "d" | "e" | "f" | | 2 | "x" | "y" | "z" | | 2 | "m" | "n" | "p" | +----+------+------+------+ SELECT id, concat_ws(',', collect_set(col1)) AS col1s, concat_ws(',', collect_set(col2)) AS col2s, concat_ws(',', collect_set(col3)) AS col3s FROM mytable GROUP BY id; +----+-------+-------+-------+ | id | col1s | col2s | col3s | +----+-------+-------+-------+ | 1 | "a,d" | "b,e" | "c,f" | | 2 | "x,m" | "y,n" | "z,p" | +----+-------+-------+-------+

4.多行单列转单行多列

sql
自动换行:关
放大阅读
展开代码
+----+---------+ | id | name | +----+---------+ | 1 | Alice | | 1 | Bob | | 3 | Charlie | +----+---------+ | 3 | mark | +----+---------+ SELECT id, MAX(CASE WHEN rn = 1 THEN name END) AS name1, MAX(CASE WHEN rn = 2 THEN name END) AS name2 FROM ( SELECT id, name, ROW_NUMBER() OVER (PARTITION BY id ORDER BY name) AS rn FROM your_table ) t GROUP BY id; +----+---------+---------+ | id | name1 | name2 | +----+---------+---------+ | 1 | Alice | Bob | | 3 | Charlie | mark | +----+---------+---------+

5.单行多列转多行一列

sql
自动换行:关
放大阅读
展开代码
+----+---------+------------+---------+--------+ | id | name | skill1 | skill2 | skill3 | +----+---------+------------+---------+--------+ | 1 | Alice | Java | Python | C++ | | 2 | Bob | JavaScript | Python | NULL | | 3 | Charlie | Python | C# | NULL | +----+---------+------------+---------+--------+ select id,name, case value when 1 then skill1 when 1 then skill2 else skill3 end skill from employee e lateral view explode(array(1,2,3)) t1 as value group by e.id, e.name; +----+---------+--------+ | id | name | skill | +----+---------+--------+ | 1 | Alice | Java | | 1 | Alice | Python | | 1 | Alice | C++ | | 2 | Bob | Java | | 2 | Bob | Python | | 3 | Charlie | Python | | 3 | Charlie | C# | | 3 | Charlie | Java | +----+---------+--------+

6. 谓词下推

sql
自动换行:关
放大阅读
展开代码
-- 使用WITH子句模拟t1和t2表数据 WITH t1 AS ( SELECT 1 AS id, 'A' AS value, '2022-01-23' AS pt UNION ALL SELECT 2 AS id, 'B' AS value, '2022-01-23' AS pt UNION ALL SELECT 3 AS id, 'C' AS value, '2022-01-23' AS pt ), t2 AS ( SELECT 1 AS id, 'active' AS status, '2022-01-23' AS pt UNION ALL SELECT 2 AS id, 'inactive' AS status, '2022-01-24' AS pt -- 注意:pt日期不同 UNION ALL SELECT 4 AS id, 'pending' AS status, '2022-01-23' AS pt ) -- ================================================================= -- 写法1:过滤条件放在 ON 子句中 -- ================================================================= SELECT '写法1' AS source_query, t1.*, t2.status FROM t1 LEFT JOIN t2 ON t1.id = t2.id AND t2.pt = '2022-01-23' WHERE t1.pt = '2022-01-23' UNION ALL -- ================================================================= -- 写法2:在子查询中提前过滤 -- ================================================================= SELECT '写法2' AS source_query, t1.*, t3.status FROM t1 LEFT JOIN ( SELECT * FROM t2 WHERE t2.pt = '2022-01-23' ) t3 ON t1.id = t3.id WHERE t1.pt = '2022-01-23' UNION ALL -- ================================================================= -- 写法3:错误写法 -- ================================================================= select '写法3' AS source_query, t1.*, t2.status from t1 left join t2 on t1.id = t2.id where t1.pt = '2022-01-23' and t2.pt = '2022-01-23'; -- << 条件放在了WHERE中

在现代的SQL引擎(如 Spark SQL、Hive、Presto 等)中,这两种写法的执行计划和性能基本上也是相同的

这得益于查询优化器的一项关键技术——谓词下推(Predicate Pushdown)

  • 对于写法1,虽然过滤条件 t2.pt = '2022-01-23' 表面上写在 ON 子句里,但查询优化器足够智能,它会识别出这是一个可以提前在读取 t2 表时就执行的过滤条件。
  • 因此,优化器会将这个条件“下推”到数据扫描阶段,也就是在Join操作发生之前,就只读取 t2 表中 pt = '2022-01-23' 的分区数据。
  • 这个优化后的执行计划,就和写法2的执行计划完全一样了。

所以,无论你采用哪种写法,Spark SQL 最终都会先分别过滤 t1t2 的分区,然后再对这两个经过过滤的小数据集执行Join,从而保证了高效的性能。

image-20250814163051101

本文作者:hedeoer

本文链接:

版权声明:本博客所有文章除特别声明外,均采用 BY-NC-SA 许可协议。转载请注明出处!