三种DuckDB电子表格插件的union all查询性能对比
我选取了最稳定、兼容性最好的三种:官方excel对应函数read_xlsx()、官方spatial对应函数st_read()、rusty_sheet对应函数read_sheet。
1.建立两个包含前50万和后54万的xlsx文件,用于比较。利用官方excel的copy()to进行。
D copy (from v1 order by l_orderkey,l_partkey,l_suppkey limit 500000)to '50_1.xlsx'WITH (HEADER 1);
Run Time (s): real 22.406 user 24.578125 sys 0.609375
D select count(*) from read_sheet('50_1.xlsx');
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 500000 │
└──────────────┘
Run Time (s): real 6.725 user 6.359375 sys 0.343750
D copy (from v1 order by l_orderkey,l_partkey,l_suppkey offset 500000)to '50_2.xlsx';
Run Time (s): real 22.259 user 24.625000 sys 0.484375
D select count(*) from read_sheet('50_2.xlsx',header=0);
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 548575 │
└──────────────┘
Run Time (s): real 7.590 user 7.296875 sys 0.250000
2.分别比较不同读取函数的单文件、双文件union all、更改列数、识别错误的st_read()更改数据类型的执行时间。
D select sum(l_orderkey) from(from read_sheet('50_1.xlsx') union all from read_sheet('50_2.xlsx',header=0));
┌──────────────────┐
│ sum(l_orderkey) │
│ int128 │
├──────────────────┤
│ 549457354624 │
│ (549.46 billion) │
└──────────────────┘
Run Time (s): real 13.159 user 13.875000 sys 0.531250
D select sum(l_orderkey) from read_sheet('exli2.xlsx');
┌──────────────────┐
│ sum(l_orderkey) │
│ int128 │
├──────────────────┤
│ 549457354624 │
│ (549.46 billion) │
└──────────────────┘
Run Time (s): real 15.224 user 14.453125 sys 0.609375D select sum(l_orderkey) from(from st_read('50_1.xlsx') union all from st_read('50_2.xlsx'));
┌──────────────────┐
│ sum(l_orderkey) │
│ int128 │
├──────────────────┤
│ 549457354624 │
│ (549.46 billion) │
└──────────────────┘
Run Time (s): real 13.869 user 17.875000 sys 0.343750
D select sum(l_orderkey) from st_read('exli2.xlsx' );
┌──────────────────┐
│ sum(l_orderkey) │
│ int128 │
├──────────────────┤
│ 549457354624 │
│ (549.46 billion) │
└──────────────────┘
Run Time (s): real 19.012 user 18.750000 sys 0.250000D select sum(l_orderkey) from read_xlsx('exli2.xlsx' );
┌──────────────────┐
│ sum(l_orderkey) │
│ double │
├──────────────────┤
│ 549457354624.0 │
│ (549.46 billion) │
└──────────────────┘
Run Time (s): real 7.982 user 7.593750 sys 0.375000
D select sum(l_orderkey) from(from read_xlsx('50_1.xlsx') union all from read_xlsx('50_2.xlsx'));
┌──────────────────┐
│ sum(l_orderkey) │
│ double │
├──────────────────┤
│ 549457354624.0 │
│ (549.46 billion) │
└──────────────────┘
Run Time (s): real 7.190 user 6.656250 sys 0.515625
D select sum(l_orderkey::int64) from(from read_xlsx('50_1.xlsx') union all from read_xlsx('50_2.xlsx'));
┌─────────────────────────────────┐
│ sum(CAST(l_orderkey AS BIGINT)) │
│ int128 │
├─────────────────────────────────┤
│ 549457354624 │
│ (549.46 billion) │
└─────────────────────────────────┘
Run Time (s): real 4.234 user 6.906250 sys 0.421875
D select sum(l_orderkey::int64) from read_xlsx('exli2.xlsx' );
┌─────────────────────────────────┐
│ sum(CAST(l_orderkey AS BIGINT)) │
│ int128 │
├─────────────────────────────────┤
│ 549457354624 │
│ (549.46 billion) │
└─────────────────────────────────┘
Run Time (s): real 7.900 user 7.515625 sys 0.375000D select sum(l_orderkey::int64),sum(l_partkey::int64),sum(l_suppkey::int64) from(from read_xlsx('50_1.xlsx') union all from read_xlsx('50_2.xlsx'));
┌─────────────────────────────────┬────────────────────────────────┬────────────────────────────────┐
│ sum(CAST(l_orderkey AS BIGINT)) │ sum(CAST(l_partkey AS BIGINT)) │ sum(CAST(l_suppkey AS BIGINT)) │
│ int128 │ int128 │ int128 │
├─────────────────────────────────┼────────────────────────────────┼────────────────────────────────┤
│ 549457354624 │ 104867343654 │ 5246691219 │
│ (549.46 billion) │ (104.87 billion) │ (5.25 billion) │
└─────────────────────────────────┴────────────────────────────────┴────────────────────────────────┘
Run Time (s): real 4.183 user 6.906250 sys 0.406250
D select sum(l_orderkey),sum(l_partkey),sum(l_suppkey) from(from read_xlsx('50_1.xlsx') union all from read_xlsx('50_2.xlsx'));
┌──────────────────┬──────────────────┬────────────────┐
│ sum(l_orderkey) │ sum(l_partkey) │ sum(l_suppkey) │
│ double │ double │ double │
├──────────────────┼──────────────────┼────────────────┤
│ 549457354624.0 │ 104867343654.0 │ 5246691219.0 │
│ (549.46 billion) │ (104.87 billion) │ (5.25 billion) │
└──────────────────┴──────────────────┴────────────────┘
Run Time (s): real 7.217 user 6.750000 sys 0.421875
D select sum(l_orderkey),sum(l_partkey),sum(l_suppkey) from(from st_read('50_1.xlsx') union all from st_read('50_2.xlsx'));
┌──────────────────┬──────────────────┬────────────────┐
│ sum(l_orderkey) │ sum(l_partkey) │ sum(l_suppkey) │
│ int128 │ int128 │ int128 │
├──────────────────┼──────────────────┼────────────────┤
│ 549457354624 │ 104867343654 │ 5246691219 │
│ (549.46 billion) │ (104.87 billion) │ (5.25 billion) │
└──────────────────┴──────────────────┴────────────────┘
Run Time (s): real 13.719 user 17.859375 sys 0.234375
D select sum(l_orderkey),sum(l_partkey),sum(l_suppkey) from(from read_sheet('50_1.xlsx') union all from read_sheet('50_2.xlsx'));
┌──────────────────┬──────────────────┬────────────────┐
│ sum(l_orderkey) │ sum(l_partkey) │ sum(l_suppkey) │
│ int128 │ int128 │ int128 │
├──────────────────┼──────────────────┼────────────────┤
│ 549456854941 │ 104867204040 │ 5246689091 │
│ (549.46 billion) │ (104.87 billion) │ (5.25 billion) │
└──────────────────┴──────────────────┴────────────────┘
Run Time (s): real 13.495 user 14.000000 sys 0.812500
想从执行计划看不同插件并行度差异的原因,完全没有这方面的信息
D explain analyze select sum(l_orderkey::int64),sum(l_partkey::int64),sum(l_suppkey::int64) from(from read_xlsx('50_1.xlsx') union all from read_xlsx('50_2.xlsx'));
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
explain analyze select sum(l_orderkey::int64),sum(l_partkey::int64),sum(l_suppkey::int64) from(from read_xlsx('50_1.xlsx') union all from read_xlsx('50_2.xlsx'));
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││ Total Time: 4.21s ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│ QUERY │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ EXPLAIN_ANALYZE │
│ ──────────────────── │
│ 0 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ UNGROUPED_AGGREGATE │
│ ──────────────────── │
│ Aggregates: │
│ sum(#0) │
│ sum(#1) │
│ sum(#2) │
│ │
│ 1 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ──────────────────── │
│ CAST(l_orderkey AS BIGINT)│
│ CAST(l_partkey AS BIGINT) │
│ CAST(l_suppkey AS BIGINT) │
│ │
│ 1048575 Rows │
│ (0.01s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ UNION │
│ ──────────────────── │
│ 0 Rows ├──────────────┐
│ (0.00s) │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ PROJECTION ││ PROJECTION │
│ ──────────────────── ││ ──────────────────── │
│ l_orderkey ││ A1 │
│ l_partkey ││ B1 │
│ l_suppkey ││ C1 │
│ ││ │
│ 500000 Rows ││ 548575 Rows │
│ (0.00s) ││ (0.00s) │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ TABLE_SCAN ││ TABLE_SCAN │
│ ──────────────────── ││ ──────────────────── │
│ Function: READ_XLSX ││ Function: READ_XLSX │
│ ││ │
│ 500000 Rows ││ 548575 Rows │
│ (3.15s) ││ (3.49s) │
└───────────────────────────┘└───────────────────────────┘
Run Time (s): real 4.214 user 7.515625 sys 0.500000
D explain analyze select sum(l_orderkey),sum(l_partkey),sum(l_suppkey) from(from read_sheet('50_1.xlsx') union all fromread_sheet('50_2.xlsx'));
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
explain analyze select sum(l_orderkey),sum(l_partkey),sum(l_suppkey) from(from read_sheet('50_1.xlsx') union all from read_sheet('50_2.xlsx'));
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││ Total Time: 13.04s ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│ QUERY │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ EXPLAIN_ANALYZE │
│ ──────────────────── │
│ 0 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ UNGROUPED_AGGREGATE │
│ ──────────────────── │
│ Aggregates: │
│ sum(#0) │
│ sum(#1) │
│ sum(#2) │
│ │
│ 1 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ──────────────────── │
│ l_orderkey │
│ l_partkey │
│ l_suppkey │
│ │
│ 1048574 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ UNION │
│ ──────────────────── │
│ 0 Rows ├──────────────┐
│ (0.00s) │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ PROJECTION ││ PROJECTION │
│ ──────────────────── ││ ──────────────────── │
│ l_orderkey ││ 499683 │
│ l_partkey ││ 139614 │
│ l_suppkey ││ 2128 │
│ ││ │
│ 500000 Rows ││ 548574 Rows │
│ (0.00s) ││ (0.00s) │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ TABLE_SCAN ││ TABLE_SCAN │
│ ──────────────────── ││ ──────────────────── │
│ Function: READ_SHEET ││ Function: READ_SHEET │
│ ││ │
│ 500000 Rows ││ 548574 Rows │
│ (1.17s) ││ (1.26s) │
└───────────────────────────┘└───────────────────────────┘
Run Time (s): real 13.212 user 13.968750 sys 0.500000