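Testing the new rusty_sheet extension for DuckDB
Once again I was the first to get hold of the new rusty_sheet extension, which now uses an XML parser written by Mr. Zhang Zepeng himself, and I started testing it right away.
1. Three test datasets
1.1 The lineitem table (1.04 million rows, 16 columns) generated by the DuckDB tpch extension, exported by the DuckDB excel extension as an xlsx file without shared strings.
1.2 The NYC data (1 million rows, 41 columns) from the New York City traffic authority, in two versions: one saved by WPS, which uses shared strings, and one exported by the DuckDB excel extension, which does not.
1.3 The Demo05 data of random small integers (1.04 million rows, 8 columns) generated by the open-source OpenXLSX software, in the WPS version, which uses shared strings.
The file sizes are as follows: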
-rw-r--r-- 1 root root 35603350 Aug 25 10:17 Demo05.xlsx
-rw-rw-r-- 1 1000 1000 249566356 Sep 1 08:23 dknyc.xlsx
-rw-r--r-- 1 root root 91671805 Feb 6 2025 exli2.xlsx
-rw-rw-r-- 1 1000 1000 179289736 Sep 1 08:24 wpnyc.xlsx
2. Testing the reworked analyze_sheet function
D from analyze_sheet('exli2.xlsx');
┌─────────────────┬─────────────┐
│ column_name │ column_type │
│ varchar │ varchar │
├─────────────────┼─────────────┤
│ l_orderkey │ bigint │
│ l_partkey │ bigint │
│ l_suppkey │ bigint │
│ l_linenumber │ bigint │
│ l_quantity │ bigint │
│ l_extendedprice │ double │
│ l_discount │ double │
│ l_tax │ double │
│ l_returnflag │ varchar │
│ l_linestatus │ varchar │
│ l_shipdate │ date │
│ l_commitdate │ date │
│ l_receiptdate │ date │
│ l_shipinstruct │ varchar │
│ l_shipmode │ varchar │
│ l_comment │ varchar │
├─────────────────┴─────────────┤
│ 16 rows 2 columns │
└───────────────────────────────┘
Run Time (s): real 0.166 user 0.004000 sys 0.000000
D from analyze_sheet('wpnyc.xlsx');
┌────────────────────────────────┬─────────────┐
│ column_name │ column_type │
│ varchar │ varchar │
├────────────────────────────────┼─────────────┤
│ Unique Key │ bigint │
│ Created Date │ varchar │
│ Closed Date │ varchar │
│ Agency │ varchar │
│ Agency Name │ varchar │
│ Complaint Type │ varchar │
│ Descriptor │ varchar │
│ Location Type │ varchar │
│ Incident Zip │ bigint │
│ Incident Address │ varchar │
│ Street Name │ varchar │
│ Cross Street 1 │ varchar │
│ Cross Street 2 │ varchar │
│ Intersection Street 1 │ varchar │
│ Intersection Street 2 │ varchar │
│ Address Type │ varchar │
│ City │ varchar │
│ Landmark │ varchar │
│ Facility Type │ varchar │
│ Status │ varchar │
│ Due Date │ varchar │
│ Resolution Description │ varchar │
│ Resolution Action Updated Date │ varchar │
│ Community Board │ varchar │
│ BBL │ bigint │
│ Borough │ varchar │
│ X Coordinate (State Plane) │ bigint │
│ Y Coordinate (State Plane) │ bigint │
│ Open Data Channel Type │ varchar │
│ Park Facility Name │ varchar │
│ Park Borough │ varchar │
│ Vehicle Type │ varchar │
│ Taxi Company Borough │ varchar │
│ Taxi Pick Up Location │ varchar │
│ Bridge Highway Name │ varchar │
│ Bridge Highway Direction │ varchar │
│ Road Ramp │ varchar │
│ Bridge Highway Segment │ varchar │
│ Latitude │ double │
│ Longitude │ double │
│ Location │ varchar │
├────────────────────────────────┴─────────────┤
│ 41 rows 2 columns │
└──────────────────────────────────────────────┘
Run Time (s): real 0.039 user 0.004000 sys 0.000000
D from analyze_sheet('Demo05.xlsx');
┌─────────────┬─────────────┐
│ column_name │ column_type │
│ varchar │ varchar │
├─────────────┼─────────────┤
│ 64 │ bigint │
│ 41 │ bigint │
│ 88 │ bigint │
│ 80 │ bigint │
│ 21 │ bigint │
│ 46 │ bigint │
│ 93 │ bigint │
│ 13 │ bigint │
└─────────────┴─────────────┘
Run Time (s): real 0.040 user 0.004000 sys 0.000000
D from analyze_sheet('Demo05.xlsx',header=0);
┌─────────────┬─────────────┐
│ column_name │ column_type │
│ varchar │ varchar │
├─────────────┼─────────────┤
│ A │ bigint │
│ B │ bigint │
│ C │ bigint │
│ D │ bigint │
│ E │ bigint │
│ F │ bigint │
│ G │ bigint │
│ H │ bigint │
└─────────────┴─────────────┘
Run Time (s): real 0.076 user 0.004000 sys 0.000000
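All of these return within milliseconds. The first version had to scan the whole sheet even for analysis, so the gain is large: for the one table compared here, the new version takes 0.076 s against 10.228 s in the first-version output below, roughly 135 times faster.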
D from analyze_sheet('Demo05.xlsx',header=0);
┌─────────────┬─────────────┐
│ column_name │ column_type │
│ varchar │ varchar │
├─────────────┼─────────────┤
│ column1 │ bigint │
│ column2 │ bigint │
│ column3 │ bigint │
│ column4 │ bigint │
│ column5 │ bigint │
│ column6 │ bigint │
│ column7 │ bigint │
│ column8 │ bigint │
└─────────────┴─────────────┘
Run Time (s): real 10.228 user 8.976000 sys 1.244000
For data without a header row, the default column names have changed from column1, column2, ... to A, B, ..., matching Excel's own column letters.
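The A, B, ... naming can be illustrated with the usual Excel column-letter scheme. The sketch below is an assumption about the naming convention only, not rusty_sheet's actual code; the function name `excel_column_name` is hypothetical, and the output above only confirms the names A through H.

```python
def excel_column_name(index: int) -> str:
    """Return the Excel-style letter name for a 1-based column index."""
    if index < 1:
        raise ValueError("column index must be >= 1")
    letters = []
    while index > 0:
        # Excel letters form a bijective base-26 system, hence the "- 1".
        index, remainder = divmod(index - 1, 26)
        letters.append(chr(ord("A") + remainder))
    return "".join(reversed(letters))


# The eight columns of Demo05.xlsx
print([excel_column_name(i) for i in range(1, 9)])
```

Under this scheme the 27th column would be AA, so a 41-column sheet such as the NYC file would end at AO if it had no header row.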
3. Testing full-sheet scan performance
The following five statements were run in the new rusty_sheet extension, in the first version of rusty_sheet, and in the excel extension (where the function name has to be changed).
create or replace table t as from read_sheet('Demo05.xlsx',header=0);
-- count the cells and sum their values
with t2 as (UNPIVOT (select sum(columns(*)), count(columns(*)) AS "cnt_\0" from read_sheet('Demo05.xlsx',header=0)) t ON COLUMNS(*) INTO NAME c VALUE v)
select case when instr(c,'cnt')>0 then 'cnt' else 'sum' end ty, sum(v) from t2 group by ty;
create or replace table t as from read_sheet('wpnyc.xlsx',header=1);
create or replace table t as from read_sheet('dknyc.xlsx',header=1);
create or replace table t as from read_sheet('exli2.xlsx',header=1);
3.1 New rusty_sheet
D load '3/rusty_sheet.duckdb_extension';
D .timer on
D .read test_full.sql
Run Time (s): real 9.326 user 16.712000 sys 1.268000
┌─────────┬───────────┐
│ ty │ sum(v) │
│ varchar │ int128 │
├─────────┼───────────┤
│ cnt │ 8388608 │
│ sum │ 415367454 │
└─────────┴───────────┘
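Run Time (s): real 8.876 user 16.008000 sys 0.872000
Run Time (s): real 44.992 user 79.248000 sys 5.720000
Run Time (s): real 44.075 user 72.136000 sys 4.396000
Run Time (s): real 20.760 user 23.016000 sys 0.896000
So xlsx files with and without shared strings take essentially the same time to read (44.992 s and 44.075 s for the two NYC files), even though the file sizes differ considerably. With the row count roughly fixed, read time grows approximately linearly with the number of columns.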
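The linear-in-columns observation can be checked by dividing each load time by the column count. This is only a back-of-the-envelope sketch using the create-table timings printed above; the row counts are not identical (about 1.04 million versus 1 million), so the figures are approximate, and the helper name `seconds_per_column` is made up for illustration.

```python
# (file, columns, real seconds of the "create or replace table" statement),
# copied from the new-version output above
TIMINGS = [
    ("Demo05.xlsx", 8, 9.326),
    ("wpnyc.xlsx", 41, 44.992),
    ("dknyc.xlsx", 41, 44.075),
    ("exli2.xlsx", 16, 20.760),
]


def seconds_per_column(columns: int, real_seconds: float) -> float:
    """Wall-clock seconds spent per column for one full-sheet load."""
    return real_seconds / columns


per_column = {name: seconds_per_column(cols, secs) for name, cols, secs in TIMINGS}
for name, value in per_column.items():
    print(f"{name}: {value:.2f} s per column")
```

All four files land between roughly 1.1 and 1.3 seconds per column, which is consistent with the near-linear scaling described above.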
3.2 First version of rusty_sheet
D load rusty_sheet;
D .timer on
D .read test_full.sql
Run Time (s): real 11.176 user 12.828000 sys 0.120000
┌─────────┬───────────┐
│ ty │ sum(v) │
│ varchar │ int128 │
├─────────┼───────────┤
│ sum │ 415367454 │
│ cnt │ 8388608 │
└─────────┴───────────┘
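Run Time (s): real 19.684 user 21.136000 sys 0.192000
Run Time (s): real 62.728 user 55.912000 sys 6.396000
Run Time (s): real 56.815 user 69.960000 sys 3.996000
Run Time (s): real 33.759 user 28.164000 sys 1.996000
The first version takes roughly 1.2 to 2.2 times as long as the new version on these statements. Comparing real and user times also shows that the new version makes better use of DuckDB's parallelism.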
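The comparison can be recomputed directly from the printed timings. The sketch below is plain arithmetic on the numbers in sections 3.1 and 3.2; it treats a user/real ratio above 1 as a rough sign that more than one core was busy on average, which ignores sys time and is only an approximation. The names `RUNS` and `summarize` are illustrative.

```python
# (statement, new real, new user, first real, first user), in seconds,
# copied from the outputs in sections 3.1 and 3.2
RUNS = [
    ("create table from Demo05.xlsx", 9.326, 16.712, 11.176, 12.828),
    ("cell count and sum on Demo05.xlsx", 8.876, 16.008, 19.684, 21.136),
    ("create table from wpnyc.xlsx", 44.992, 79.248, 62.728, 55.912),
    ("create table from dknyc.xlsx", 44.075, 72.136, 56.815, 69.960),
    ("create table from exli2.xlsx", 20.760, 23.016, 33.759, 28.164),
]


def summarize(runs):
    """Return the first/new slowdown and the user/real ratios per statement."""
    rows = []
    for name, new_real, new_user, old_real, old_user in runs:
        rows.append({
            "statement": name,
            "slowdown": old_real / new_real,
            "new_user_per_real": new_user / new_real,
            "old_user_per_real": old_user / old_real,
        })
    return rows


results = summarize(RUNS)
for row in results:
    print(f"{row['statement']}: first/new = {row['slowdown']:.2f}, "
          f"user/real new = {row['new_user_per_real']:.2f}, "
          f"first = {row['old_user_per_real']:.2f}")
```

In every one of the five statements the new version shows the higher user/real ratio, which is what the parallelism remark above refers to.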
3.3 DuckDB excel extension
D load excel;
D .timer on
D .read test_fullx.sql
Run Time (s): real 8.396 user 16.352000 sys 0.068000
┌─────────┬─────────────┐
│ ty │ sum(v) │
│ varchar │ double │
├─────────┼─────────────┤
│ cnt │ 8388608.0 │
│ sum │ 415367454.0 │
└─────────┴─────────────┘
Run Time (s): real 8.384 user 15.872000 sys 0.092000
Run Time (s): real 3.529 user 6.788000 sys 0.108000
Invalid Input Error:
read_xlsx: Failed to parse cell 'N4': Could not convert string 'WILLOUGHBY AVENUE' to DOUBLE
Run Time (s): real 0.139 user 0.224000 sys 0.004000
Invalid Input Error:
read_xlsx: Failed to parse cell 'N4': Could not convert string 'WILLOUGHBY AVENUE' to DOUBLE
Run Time (s): real 23.053 user 45.484000 sys 0.200000
Because the excel extension inferred the wrong data type (it tried to read the text in cell N4 as DOUBLE), two of the queries did not complete.
4. Comparing the performance of range and limit
The following statements were used:
-- limit: take rows 1-10000
create or replace table t as from read_sheet('wpnyc.xlsx',header=1) limit 10000;
-- limit: take rows 900001-910000 (strictly, offset 900001 starts at row 900002)
create or replace table t as from read_sheet('wpnyc.xlsx',header=1) limit 10000 offset 900001;
-- range: take rows 1-10000
create or replace table t as from read_sheet('wpnyc.xlsx',header=1,range='1:10001');
-- range: take rows 900001-910000; the query before union all only fetches the column names from the header row and can be omitted if the columns are specified with column instead
create or replace table t as from read_sheet('wpnyc.xlsx',header=1,range='1:10') where 1=0 union all from read_sheet('wpnyc.xlsx',header=0,range='900002:910001');
-- range: take rows 900001-910000, without a header row
create or replace table t as from read_sheet('wpnyc.xlsx',header=0,range='900002:910001');
4.1 New rusty_sheet
D .read test_range.sql
Run Time (s): real 28.064 user 54.080000 sys 1.236000
Run Time (s): real 33.218 user 65.196000 sys 0.804000
Run Time (s): real 2.009 user 4.076000 sys 0.068000
Run Time (s): real 62.855 user 84.120000 sys 0.848000
Run Time (s): real 67.021 user 86.316000 sys 2.736000
So for rows near the start of the sheet, the range form is faster than limit ... offset (2.009 s against 28.064 s), while for rows near the end it is slower (62.855 s and 67.021 s against 33.218 s).
4.2 First version of rusty_sheet
Run Time (s): real 36.832 user 36.416000 sys 0.568000
Run Time (s): real 48.820 user 58.700000 sys 0.440000
Run Time (s): real 38.643 user 38.236000 sys 0.536000
Run Time (s): real 80.287 user 75.772000 sys 4.036000
Run Time (s): real 45.546 user 40.828000 sys 4.640000
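In the first four queries the new rusty_sheet is faster than the first version. In the fifth query the new version appears to be slower (67.021 s against 45.546 s), which needs investigating.
On this feature, Mr. Zhang's explanation is: range is really just limit and offset; because DuckDB's limit and offset are not pushed down into the table function, I had to provide this "add-on" instead.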
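Since range stands in for limit and offset, it helps to see how a span of data rows maps to each form. The sketch below assumes the 'start:end' sheet-row syntax used in the statements above and a single header row at the top of the sheet; the helper names `rows_to_range` and `rows_to_limit_offset` are hypothetical and are not part of rusty_sheet.

```python
def rows_to_range(first_row: int, last_row: int, header_rows: int = 1,
                  include_header: bool = False) -> str:
    """Build a 'start:end' sheet-row range for data rows first_row..last_row.

    Data rows are 1-based and sit below `header_rows` header rows, so they are
    shifted down by that amount. With include_header=True the range starts at
    sheet row 1, which only makes sense when first_row is 1.
    """
    if first_row < 1 or last_row < first_row:
        raise ValueError("need 1 <= first_row <= last_row")
    if include_header and first_row != 1:
        raise ValueError("the header can only be included when first_row is 1")
    start = 1 if include_header else first_row + header_rows
    end = last_row + header_rows
    return f"{start}:{end}"


def rows_to_limit_offset(first_row: int, last_row: int) -> tuple:
    """Return (limit, offset) selecting data rows first_row..last_row."""
    if first_row < 1 or last_row < first_row:
        raise ValueError("need 1 <= first_row <= last_row")
    return last_row - first_row + 1, first_row - 1


print(rows_to_range(1, 10000, include_header=True))
print(rows_to_range(900001, 910000))
print(rows_to_limit_offset(900001, 910000))
```

The first two calls reproduce the range strings used in the test statements, '1:10001' and '900002:910001'; the third gives the limit and offset that select the same data rows.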