```sql
select * from test;
```
```
+----+------+------+------+
| id | col1 | col2 | col3 |
+----+------+------+------+
| 1  | "a"  | "b"  | "c"  |
| 2  | "x"  | "y"  | "z"  |
+----+------+------+------+
```

<!-- more -->

Columns to rows: explode a constant array to fan each row out three times, then pick the matching column with a `case` expression.

```sql
select id,
       case mark when 1 then col1
                 when 2 then col2
                 else col3
       end as col
from test
lateral view explode(array(1, 2, 3)) t as mark;
```
```
+----+-----+
| id | col |
+----+-----+
| 1  | a   |
| 1  | b   |
| 1  | c   |
| 2  | x   |
| 2  | y   |
| 2  | z   |
+----+-----+
```
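The same column-to-row trick can be tried outside Hive. A minimal sketch using Python's built-in sqlite3, where a cross join against a constant list plays the role of `lateral view explode(array(1,2,3))` (table and column names mirror the example above):

```python
import sqlite3

# Emulate Hive's explode(array(1,2,3)) with a cross join against a
# three-row constant subquery, then route columns through CASE.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER, col1 TEXT, col2 TEXT, col3 TEXT)")
conn.executemany("INSERT INTO test VALUES (?, ?, ?, ?)",
                 [(1, "a", "b", "c"), (2, "x", "y", "z")])

rows = conn.execute("""
    SELECT id,
           CASE mark WHEN 1 THEN col1
                     WHEN 2 THEN col2
                     ELSE col3 END AS col
    FROM test
    CROSS JOIN (SELECT 1 AS mark UNION ALL SELECT 2 UNION ALL SELECT 3)
    ORDER BY id, mark
""").fetchall()
print(rows)  # [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'x'), (2, 'y'), (2, 'z')]
```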
```
+----+---------+
| id | values  |
+----+---------+
| 1  | [a,b,c] |
| 2  | [x,y]   |
+----+---------+
```

```sql
SELECT id, value
FROM mytable
LATERAL VIEW explode(values) t1 AS value;
```
```
+----+-------+
| id | value |
+----+-------+
| 1  | "a"   |
| 1  | "b"   |
| 1  | "c"   |
| 2  | "x"   |
| 2  | "y"   |
+----+-------+
```
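When the array lives in a column rather than a literal, SQLite's `json_each` table-valued function (available when the JSON extension is compiled in, as it is in typical Python builds) can stand in for `explode(values)`. A sketch, with arrays stored as JSON text:

```python
import sqlite3

# json_each expands one JSON array into one row per element,
# mirroring Hive's explode() on an array column.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE mytable (id INTEGER, "values" TEXT)')
conn.executemany("INSERT INTO mytable VALUES (?, ?)",
                 [(1, '["a","b","c"]'), (2, '["x","y"]')])

rows = conn.execute("""
    SELECT m.id, j.value
    FROM mytable m, json_each(m."values") j
    ORDER BY m.id, j.key
""").fetchall()
print(rows)  # [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'x'), (2, 'y')]
```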
```
+----+------+------+------+
| id | col1 | col2 | col3 |
+----+------+------+------+
| 1  | "a"  | "b"  | "c"  |
| 1  | "d"  | "e"  | "f"  |
| 2  | "x"  | "y"  | "z"  |
| 2  | "m"  | "n"  | "p"  |
+----+------+------+------+
```

```sql
SELECT id,
       concat_ws(',', collect_set(col1)) AS col1s,
       concat_ws(',', collect_set(col2)) AS col2s,
       concat_ws(',', collect_set(col3)) AS col3s
FROM mytable
GROUP BY id;
```
```
+----+-------+-------+-------+
| id | col1s | col2s | col3s |
+----+-------+-------+-------+
| 1  | "a,d" | "b,e" | "c,f" |
| 2  | "x,m" | "y,n" | "z,p" |
+----+-------+-------+-------+
```
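Hive's `concat_ws(',', collect_set(col))` maps roughly onto SQLite's `GROUP_CONCAT(DISTINCT col)` (in both engines the element order inside the concatenated string is not guaranteed). A runnable sketch on the same data:

```python
import sqlite3

# GROUP_CONCAT(DISTINCT ...) deduplicates and joins with ',' by default,
# the same rows-to-columns aggregation as collect_set + concat_ws.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, col1 TEXT, col2 TEXT, col3 TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)",
                 [(1, "a", "b", "c"), (1, "d", "e", "f"),
                  (2, "x", "y", "z"), (2, "m", "n", "p")])

rows = conn.execute("""
    SELECT id,
           GROUP_CONCAT(DISTINCT col1) AS col1s,
           GROUP_CONCAT(DISTINCT col2) AS col2s,
           GROUP_CONCAT(DISTINCT col3) AS col3s
    FROM mytable
    GROUP BY id
    ORDER BY id
""").fetchall()
print(rows)
```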
```
+----+---------+
| id | name    |
+----+---------+
| 1  | Alice   |
| 1  | Bob     |
| 3  | Charlie |
| 3  | mark    |
+----+---------+
```

```sql
SELECT id,
       MAX(CASE WHEN rn = 1 THEN name END) AS name1,
       MAX(CASE WHEN rn = 2 THEN name END) AS name2
FROM (
  SELECT id, name,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY name) AS rn
  FROM your_table
) t
GROUP BY id;
```
```
+----+---------+-------+
| id | name1   | name2 |
+----+---------+-------+
| 1  | Alice   | Bob   |
| 3  | Charlie | mark  |
+----+---------+-------+
```
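The `ROW_NUMBER()` pivot above is standard window-function SQL and runs unchanged on SQLite 3.25+ (so any recent Python). A self-contained check:

```python
import sqlite3

# Number the rows within each id, then pivot rn=1 / rn=2 into columns
# via conditional aggregation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO your_table VALUES (?, ?)",
                 [(1, "Alice"), (1, "Bob"), (3, "Charlie"), (3, "mark")])

rows = conn.execute("""
    SELECT id,
           MAX(CASE WHEN rn = 1 THEN name END) AS name1,
           MAX(CASE WHEN rn = 2 THEN name END) AS name2
    FROM (SELECT id, name,
                 ROW_NUMBER() OVER (PARTITION BY id ORDER BY name) AS rn
          FROM your_table) t
    GROUP BY id
    ORDER BY id
""").fetchall()
print(rows)  # [(1, 'Alice', 'Bob'), (3, 'Charlie', 'mark')]
```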
```
+----+---------+------------+--------+--------+
| id | name    | skill1     | skill2 | skill3 |
+----+---------+------------+--------+--------+
| 1  | Alice   | Java       | Python | C++    |
| 2  | Bob     | JavaScript | Python | NULL   |
| 3  | Charlie | Python     | C#     | NULL   |
+----+---------+------------+--------+--------+
```

The same explode-plus-`case` trick works here; a subquery filters out the slots that are NULL.

```sql
select id, name, skill
from (
  select id, name,
         case mark when 1 then skill1
                   when 2 then skill2
                   else skill3
         end as skill
  from employee
  lateral view explode(array(1, 2, 3)) t1 as mark
) s
where skill is not null;
```
```
+----+---------+------------+
| id | name    | skill      |
+----+---------+------------+
| 1  | Alice   | Java       |
| 1  | Alice   | Python     |
| 1  | Alice   | C++        |
| 2  | Bob     | JavaScript |
| 2  | Bob     | Python     |
| 3  | Charlie | Python     |
| 3  | Charlie | C#         |
+----+---------+------------+
```
```sql
-- Use WITH clauses to mock the t1 and t2 tables
WITH t1 AS (
  SELECT 1 AS id, 'A' AS value, '2022-01-23' AS pt UNION ALL
  SELECT 2 AS id, 'B' AS value, '2022-01-23' AS pt UNION ALL
  SELECT 3 AS id, 'C' AS value, '2022-01-23' AS pt
),
t2 AS (
  SELECT 1 AS id, 'active'   AS status, '2022-01-23' AS pt UNION ALL
  SELECT 2 AS id, 'inactive' AS status, '2022-01-24' AS pt UNION ALL  -- note: different pt
  SELECT 4 AS id, 'pending'  AS status, '2022-01-23' AS pt
)
-- =================================================================
-- Variant 1: filter condition placed in the ON clause
-- =================================================================
SELECT 'variant 1' AS source_query, t1.*, t2.status
FROM t1
LEFT JOIN t2
  ON t1.id = t2.id AND t2.pt = '2022-01-23'
WHERE t1.pt = '2022-01-23'

UNION ALL

-- =================================================================
-- Variant 2: filter early, in a subquery
-- =================================================================
SELECT 'variant 2' AS source_query, t1.*, t3.status
FROM t1
LEFT JOIN (
  SELECT * FROM t2 WHERE t2.pt = '2022-01-23'
) t3
  ON t1.id = t3.id
WHERE t1.pt = '2022-01-23'

UNION ALL

-- =================================================================
-- Variant 3: incorrect version
-- =================================================================
SELECT 'variant 3' AS source_query, t1.*, t2.status
FROM t1
LEFT JOIN t2
  ON t1.id = t2.id
WHERE t1.pt = '2022-01-23'
  AND t2.pt = '2022-01-23';  -- << filter on t2 placed in WHERE
```
In modern SQL engines (Spark SQL, Hive, Presto, and the like), variants 1 and 2 produce essentially the same execution plan and performance.

This is thanks to a key optimizer technique: predicate pushdown.

Although `t2.pt = '2022-01-23'` is written in the ON clause, the optimizer is smart enough to recognize it as a filter that can be applied while reading `t2`, so only the `pt = '2022-01-23'` partition of `t2` is scanned. Whichever of the two variants you write, Spark SQL first filters the partitions of `t1` and `t2` separately and only then joins the two reduced datasets, keeping the query efficient. Variant 3, by contrast, changes the semantics: filtering `t2.pt` in the WHERE clause discards the rows where the left join produced a NULL `t2.pt`, silently turning the LEFT JOIN into an INNER JOIN.
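The semantic difference between the ON-clause and WHERE-clause placements is easy to see on any engine. A sketch using Python's sqlite3 with the same mock data (here `t2`'s id=2 row sits in the `pt='2022-01-24'` partition):

```python
import sqlite3

# Variant 1: t2 filter in ON keeps unmatched left rows as NULL.
# Variant 3: t2 filter in WHERE drops those rows (inner-join behavior).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER, value TEXT, pt TEXT)")
conn.execute("CREATE TABLE t2 (id INTEGER, status TEXT, pt TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?, ?, ?)",
                 [(1, "A", "2022-01-23"), (2, "B", "2022-01-23"),
                  (3, "C", "2022-01-23")])
conn.executemany("INSERT INTO t2 VALUES (?, ?, ?)",
                 [(1, "active", "2022-01-23"), (2, "inactive", "2022-01-24"),
                  (4, "pending", "2022-01-23")])

on_clause = conn.execute("""
    SELECT t1.id, t2.status FROM t1
    LEFT JOIN t2 ON t1.id = t2.id AND t2.pt = '2022-01-23'
    WHERE t1.pt = '2022-01-23' ORDER BY t1.id
""").fetchall()

where_clause = conn.execute("""
    SELECT t1.id, t2.status FROM t1
    LEFT JOIN t2 ON t1.id = t2.id
    WHERE t1.pt = '2022-01-23' AND t2.pt = '2022-01-23' ORDER BY t1.id
""").fetchall()

print(on_clause)     # [(1, 'active'), (2, None), (3, None)]
print(where_clause)  # [(1, 'active')] -- the left join degraded to an inner join
```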
Author: hedeoer
Copyright notice: Unless otherwise stated, all posts on this blog are released under the CC BY-NC-SA license. Please credit the source when reposting!