Testing TPC-H on xlsx files with version 0.2 of the DuckDB rusty_sheet extension

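The test procedure: generate the tables with the tpch extension, export them to eight xlsx files, drop the tables, create views over the files with the excel extension's read_xlsx function, and run the TPC-H queries.
Then replace those views using the read_sheet function from rusty_sheet 0.2 and run the TPC-H queries again.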
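The per-table statements in this workflow are repetitive, so they can also be generated. The following is a minimal Python sketch that only builds the SQL text used in this article; the helper names (export_statements, drop_statements, view_statements, build_script, swap_script) are made up for illustration and are not part of DuckDB or either extension.

```python
# The eight TPC-H tables exported in this article.
TABLES = ["lineitem", "customer", "nation", "orders",
          "part", "partsupp", "region", "supplier"]


def export_statements(tables=TABLES):
    """COPY each table to '<table>.xlsx' with a sheet named after the table."""
    return [
        f"copy {t} to '{t}.xlsx' WITH (FORMAT xlsx, HEADER true, SHEET '{t}');"
        for t in tables
    ]


def drop_statements(tables=TABLES):
    """Drop the in-database tables so later queries must read the xlsx files."""
    return [f"drop table {t};" for t in tables]


def view_statements(reader, tables=TABLES):
    """Create (or replace) one view per table over its xlsx file.

    reader is the table function name, e.g. 'read_xlsx' or 'read_sheet'.
    """
    return [
        f"create or replace view {t} as from {reader}('{t}.xlsx',header=1);"
        for t in tables
    ]


def build_script():
    """First phase: generate data, export, drop, and create read_xlsx views."""
    lines = ["load tpch;", "load excel;", "call dbgen(sf=0.1);"]
    lines += export_statements() + drop_statements() + view_statements("read_xlsx")
    return "\n".join(lines)


def swap_script():
    """Second phase: point the same view names at rusty_sheet's read_sheet."""
    return "\n".join(["load rusty_sheet;"] + view_statements("read_sheet"))


if __name__ == "__main__":
    print(build_script())
    print(swap_script())
```

The printed text can be pasted into the DuckDB CLI or saved to a .sql file; the sketch does not execute anything itself.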

load tpch;
load excel;
call dbgen(sf=0.1);
copy lineitem to 'lineitem.xlsx'  WITH (FORMAT xlsx, HEADER true,SHEET 'lineitem');
copy customer to 'customer.xlsx'  WITH (FORMAT xlsx, HEADER true,SHEET 'customer');
copy nation to 'nation.xlsx'  WITH (FORMAT xlsx, HEADER true,SHEET 'nation');
copy orders to 'orders.xlsx'  WITH (FORMAT xlsx, HEADER true,SHEET 'orders');
copy part to 'part.xlsx'  WITH (FORMAT xlsx, HEADER true,SHEET 'part');
copy partsupp to 'partsupp.xlsx'  WITH (FORMAT xlsx, HEADER true,SHEET 'partsupp');
copy region to 'region.xlsx'  WITH (FORMAT xlsx, HEADER true,SHEET 'region');
copy supplier to 'supplier.xlsx'  WITH (FORMAT xlsx, HEADER true,SHEET 'supplier');

drop table customer;
drop table lineitem;
drop table nation;
drop table orders;
drop table part;
drop table partsupp;
drop table region;
drop table supplier;

create view lineitem as from read_xlsx('lineitem.xlsx',header=1);
create view customer as from read_xlsx('customer.xlsx',header=1);
create view nation as from read_xlsx('nation.xlsx',header=1);
create view orders as from read_xlsx('orders.xlsx',header=1);
create view part as from read_xlsx('part.xlsx',header=1);
create view partsupp as from read_xlsx('partsupp.xlsx',header=1);
create view region as from read_xlsx('region.xlsx',header=1);
create view supplier as from read_xlsx('supplier.xlsx',header=1);

pragma tpch(1);
pragma tpch(2);
pragma tpch(3);
pragma tpch(4);
pragma tpch(6);
pragma tpch(7);
pragma tpch(8);
pragma tpch(9);
pragma tpch(10);
pragma tpch(11);
pragma tpch(12);
pragma tpch(13);
pragma tpch(14);
pragma tpch(15);
pragma tpch(16);
pragma tpch(17);
pragma tpch(18);
pragma tpch(19);
pragma tpch(20);
pragma tpch(21);
pragma tpch(22);

load rusty_sheet;
create or replace view lineitem as from read_sheet('lineitem.xlsx',header=1);
create or replace view customer as from read_sheet('customer.xlsx',header=1);
create or replace view nation as from read_sheet('nation.xlsx',header=1);
create or replace view orders as from read_sheet('orders.xlsx',header=1);
create or replace view part as from read_sheet('part.xlsx',header=1);
create or replace view partsupp as from read_sheet('partsupp.xlsx',header=1);
create or replace view region as from read_sheet('region.xlsx',header=1);
create or replace view supplier as from read_sheet('supplier.xlsx',header=1);

create or replace view lineitem as from read_sheet('tpchsf01.xlsx',sheet='lineitem_*',header=1);
create or replace view customer as from read_sheet('tpchsf01.xlsx',sheet='customer_*',header=1);
create or replace view nation as from read_sheet('tpchsf01.xlsx',sheet='nation_*',header=1);
create or replace view orders as from read_sheet('tpchsf01.xlsx',sheet='orders_*',header=1);
create or replace view part as from read_sheet('tpchsf01.xlsx',sheet='part_*',header=1);
create or replace view partsupp as from read_sheet('tpchsf01.xlsx',sheet='partsupp_*',header=1);
create or replace view region as from read_sheet('tpchsf01.xlsx',sheet='region_*',header=1);
create or replace view supplier as from read_sheet('tpchsf01.xlsx',sheet='supplier_*',header=1);

select query FROM tpch_queries() where query_nr=4;
--query 4
SELECT o_orderpriority, count(*) AS order_count
FROM orders
WHERE o_orderdate >= CAST('1993-07-01' AS date)
  AND o_orderdate < CAST('1993-10-01' AS date)
  AND EXISTS (
    SELECT *
    FROM lineitem
    WHERE l_orderkey = o_orderkey
      AND l_commitdate < l_receiptdate)
GROUP BY o_orderpriority
ORDER BY o_orderpriority;

select query FROM tpch_queries() where query_nr=7;
-- query 7
SELECT supp_nation, cust_nation, l_year, sum(volume) AS revenue
FROM (
    SELECT
        n1.n_name AS supp_nation,
        n2.n_name AS cust_nation,
        extract(year FROM l_shipdate) AS l_year,
        l_extendedprice * (1 - l_discount) AS volume
    FROM supplier, lineitem, orders, customer, nation n1, nation n2
    WHERE s_suppkey = l_suppkey
      AND o_orderkey = l_orderkey
      AND c_custkey = o_custkey
      AND s_nationkey = n1.n_nationkey
      AND c_nationkey = n2.n_nationkey
      AND ((n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY')
        OR (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE'))
      AND l_shipdate BETWEEN CAST('1995-01-01' AS date) AND CAST('1996-12-31' AS date)
) AS shipping
GROUP BY supp_nation, cust_nation, l_year
ORDER BY supp_nation, cust_nation, l_year;

The results follow. rusty_sheet 0.1 was tested on DuckDB 1.3.2 with multiple files; rusty_sheet 0.2 was tested on DuckDB 1.4.1 with both multiple files and a single file.

-- v1.3.2(excel)
D .timer on
D create view lineitem as from read_xlsx('lineitem.xlsx',header=1);
Run Time (s): real 0.359 user 0.308000 sys 0.052000
D create view customer as from read_xlsx('customer.xlsx',header=1);
Run Time (s): real 0.042 user 0.004000 sys 0.004000
D create view nation as from read_xlsx('nation.xlsx',header=1);
Run Time (s): real 0.001 user 0.000000 sys 0.000000
D create view orders as from read_xlsx('orders.xlsx',header=1);
Run Time (s): real 0.097 user 0.052000 sys 0.016000
D create view part as from read_xlsx('part.xlsx',header=1);
Run Time (s): real 0.030 user 0.012000 sys 0.000000
D create view partsupp as from read_xlsx('partsupp.xlsx',header=1);
Run Time (s): real 0.034 user 0.032000 sys 0.000000
D create view region as from read_xlsx('region.xlsx',header=1);
Run Time (s): real 0.001 user 0.000000 sys 0.000000
D create view supplier as from read_xlsx('supplier.xlsx',header=1);
Run Time (s): real 0.002 user 0.000000 sys 0.004000
D PRAGMA tpch(4);
100% ▕████████████████████████████████████████████████████████████▏ 
┌─────────────────┬─────────────┐
│ o_orderpriority │ order_count │
│     varchar     │    int64    │
├─────────────────┼─────────────┤
│ 1-URGENT        │         999 │
│ 2-HIGH          │         997 │
│ 3-MEDIUM        │        1031 │
│ 4-NOT SPECIFIED │         989 │
│ 5-LOW           │        1077 │
└─────────────────┴─────────────┘
Run Time (s): real 14.354 user 14.444000 sys 0.116000
D PRAGMA tpch(7);
100% ▕████████████████████████████████████████████████████████████▏ 
┌─────────────┬─────────────┬────────┬───────────────────┐
│ supp_nation │ cust_nation │ l_year │      revenue      │
│   varchar   │   varchar   │ int64  │      double       │
├─────────────┼─────────────┼────────┼───────────────────┤
│ FRANCE      │ GERMANY     │   1995 │ 4637235.150099999 │
│ FRANCE      │ GERMANY     │   1996 │ 5224779.573599999 │
│ GERMANY     │ FRANCE      │   1995 │      6232818.7037 │
│ GERMANY     │ FRANCE      │   1996 │ 5557312.112100002 │
└─────────────┴─────────────┴────────┴───────────────────┘
Run Time (s): real 14.325 user 16.508000 sys 0.136000
-- v1.3.2(rusty_sheet 0.1)
D load rusty_sheet;
Run Time (s): real 0.173 user 0.068000 sys 0.020000
D create or replace view lineitem as from read_sheet('lineitem.xlsx',header=1);
Run Time (s): real 14.574 user 11.516000 sys 2.852000
D create or replace view customer as from read_sheet('customer.xlsx',header=1);
Run Time (s): real 0.176 user 0.144000 sys 0.012000
D create or replace view nation as from read_sheet('nation.xlsx',header=1);
Run Time (s): real 0.015 user 0.000000 sys 0.000000
D create or replace view orders as from read_sheet('orders.xlsx',header=1);
Run Time (s): real 1.669 user 1.592000 sys 0.044000
D create or replace view part as from read_sheet('part.xlsx',header=1);
Run Time (s): real 0.234 user 0.208000 sys 0.000000
D create or replace view partsupp as from read_sheet('partsupp.xlsx',header=1);
Run Time (s): real 0.488 user 0.472000 sys 0.012000
D create or replace view region as from read_sheet('region.xlsx',header=1);
Run Time (s): real 0.010 user 0.004000 sys 0.000000
D create or replace view supplier as from read_sheet('supplier.xlsx',header=1);
Run Time (s): real 0.539 user 0.008000 sys 0.000000
D PRAGMA tpch(4);
┌─────────────────┬─────────────┐
│ o_orderpriority │ order_count │
│     varchar     │    int64    │
├─────────────────┼─────────────┤
│ 1-URGENT        │         999 │
│ 2-HIGH          │         997 │
│ 3-MEDIUM        │        1031 │
│ 4-NOT SPECIFIED │         989 │
│ 5-LOW           │        1077 │
└─────────────────┴─────────────┘
Run Time (s): real 20.560 user 22.140000 sys 1.992000
-- v1.4.1(rusty_sheet 0.2)
D .timer on
D create or replace view lineitem as from read_sheet('lineitem.xlsx',header=1);
Run Time (s): real 9.541 user 9.288000 sys 0.164000
D create or replace view customer as from read_sheet('customer.xlsx',header=1);
Run Time (s): real 0.189 user 0.136000 sys 0.016000
D create or replace view nation as from read_sheet('nation.xlsx',header=1);
Run Time (s): real 0.001 user 0.000000 sys 0.000000
D create or replace view orders as from read_sheet('orders.xlsx',header=1);
Run Time (s): real 1.399 user 1.392000 sys 0.008000
D create or replace view part as from read_sheet('part.xlsx',header=1);
Run Time (s): real 0.197 user 0.196000 sys 0.000000
D create or replace view partsupp as from read_sheet('partsupp.xlsx',header=1);
Run Time (s): real 0.487 user 0.440000 sys 0.008000
D create or replace view region as from read_sheet('region.xlsx',header=1);
Run Time (s): real 0.001 user 0.004000 sys 0.000000
D create or replace view supplier as from read_sheet('supplier.xlsx',header=1);
Run Time (s): real 0.009 user 0.008000 sys 0.000000
--query 4
┌─────────────────┬─────────────┐
│ o_orderpriority │ order_count │
│     varchar     │    int64    │
├─────────────────┼─────────────┤
│ 1-URGENT        │         999 │
│ 2-HIGH          │         997 │
│ 3-MEDIUM        │        1031 │
│ 4-NOT SPECIFIED │         989 │
│ 5-LOW           │        1077 │
└─────────────────┴─────────────┘
Run Time (s): real 11.473 user 11.604000 sys 0.224000
--query 7
┌─────────────┬─────────────┬────────┬───────────────────┐
│ supp_nation │ cust_nation │ l_year │      revenue      │
│   varchar   │   varchar   │ int64  │      double       │
├─────────────┼─────────────┼────────┼───────────────────┤
│ FRANCE      │ GERMANY     │   1995 │ 4637235.150099999 │
│ FRANCE      │ GERMANY     │   1996 │ 5224779.573599999 │
│ GERMANY     │ FRANCE      │   1995 │      6232818.7037 │
│ GERMANY     │ FRANCE      │   1996 │ 5557312.112100002 │
└─────────────┴─────────────┴────────┴───────────────────┘
Run Time (s): real 11.777 user 12.032000 sys 0.176000
-- python: merge the multiple files into one file
time python3 ../mergexlsx8.py -i "{'*.xlsx':['*']}" -o tpchsf01.xlsx
Input files: ['orders.xlsx', 'supplier.xlsx', 'part.xlsx', 'lineitem.xlsx', 'region.xlsx', 'partsupp.xlsx', 'nation.xlsx', 'customer.xlsx']
Sheet config: {'orders': ['*'], 'supplier': ['*'], 'part': ['*'], 'lineitem': ['*'], 'region': ['*'], 'partsupp': ['*'], 'nation': ['*'], 'customer': ['*']}
Merge complete: tpchsf01.xlsx

real	0m18.217s
user	0m16.308s
sys	0m1.460s
-- v1.4.1(rusty_sheet 0.2 one xlsx file 8 sheets)
D .timer on
D create or replace view lineitem as from read_sheet('tpchsf01.xlsx',sheet='lineitem_*',header=1);
Run Time (s): real 9.409 user 9.160000 sys 0.244000
D create or replace view customer as from read_sheet('tpchsf01.xlsx',sheet='customer_*',header=1);
Run Time (s): real 0.184 user 0.128000 sys 0.056000
D create or replace view nation as from read_sheet('tpchsf01.xlsx',sheet='nation_*',header=1);
Run Time (s): real 0.032 user 0.012000 sys 0.020000
D create or replace view orders as from read_sheet('tpchsf01.xlsx',sheet='orders_*',header=1);
Run Time (s): real 1.461 user 1.424000 sys 0.040000
D create or replace view part as from read_sheet('tpchsf01.xlsx',sheet='part_*',header=1);
Run Time (s): real 0.223 user 0.200000 sys 0.020000
D create or replace view partsupp as from read_sheet('tpchsf01.xlsx',sheet='partsupp_*',header=1);
Run Time (s): real 0.452 user 0.416000 sys 0.036000
D create or replace view region as from read_sheet('tpchsf01.xlsx',sheet='region_*',header=1);
Run Time (s): real 0.043 user 0.008000 sys 0.036000
D create or replace view supplier as from read_sheet('tpchsf01.xlsx',sheet='supplier_*',header=1);
Run Time (s): real 0.052 user 0.012000 sys 0.040000
-- query 4
┌─────────────────┬─────────────┐
│ o_orderpriority │ order_count │
│     varchar     │    int64    │
├─────────────────┼─────────────┤
│ 1-URGENT        │         999 │
│ 2-HIGH          │         997 │
│ 3-MEDIUM        │        1031 │
│ 4-NOT SPECIFIED │         989 │
│ 5-LOW           │        1077 │
└─────────────────┴─────────────┘
Run Time (s): real 11.381 user 11.580000 sys 0.240000
-- query 7
┌─────────────┬─────────────┬────────┬───────────────────┐
│ supp_nation │ cust_nation │ l_year │      revenue      │
│   varchar   │   varchar   │ int64  │      double       │
├─────────────┼─────────────┼────────┼───────────────────┤
│ FRANCE      │ GERMANY     │   1995 │ 4637235.150099999 │
│ FRANCE      │ GERMANY     │   1996 │ 5224779.573599999 │
│ GERMANY     │ FRANCE      │   1995 │      6232818.7037 │
│ GERMANY     │ FRANCE      │   1996 │ 5557312.112100002 │
└─────────────┴─────────────┴────────┴───────────────────┘
Run Time (s): real 11.734 user 11.888000 sys 0.288000
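
To compare the runs side by side, the timer lines can be parsed and normalized against the excel extension. This is a small Python sketch, assuming the DuckDB CLI ".timer on" line format shown in the logs above; the function names and configuration labels are made up for illustration, and the four input lines are the query 4 timings copied from this article.

```python
import re

# Matches the DuckDB CLI ".timer on" line format seen in the logs above.
RUN_TIME = re.compile(
    r"Run Time \(s\): real (?P<real>[\d.]+) user (?P<user>[\d.]+) sys (?P<sys>[\d.]+)"
)


def parse_run_times(log):
    """Return one {'real', 'user', 'sys'} dict per timer line found in the log."""
    return [
        {key: float(value) for key, value in match.groupdict().items()}
        for match in RUN_TIME.finditer(log)
    ]


# Query 4 timer lines copied from the logs above, keyed by configuration.
Q4_LOGS = {
    "excel (v1.3.2)": "Run Time (s): real 14.354 user 14.444000 sys 0.116000",
    "rusty_sheet 0.1 (v1.3.2)": "Run Time (s): real 20.560 user 22.140000 sys 1.992000",
    "rusty_sheet 0.2 (v1.4.1, 8 files)": "Run Time (s): real 11.473 user 11.604000 sys 0.224000",
    "rusty_sheet 0.2 (v1.4.1, 1 file)": "Run Time (s): real 11.381 user 11.580000 sys 0.240000",
}


def q4_real_seconds():
    """Wall-clock ("real") seconds of query 4 for each configuration."""
    return {name: parse_run_times(line)[0]["real"] for name, line in Q4_LOGS.items()}


if __name__ == "__main__":
    times = q4_real_seconds()
    baseline = times["excel (v1.3.2)"]
    for name, seconds in times.items():
        # Values below 1.00 mean faster than the excel extension.
        print(f"{name:36s} {seconds:7.3f}s  {seconds / baseline:5.2f}x of excel")
```

These are single runs on one machine, so small differences (for example single file versus eight files) should not be over-interpreted.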

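The excel extension is far faster at create view: it evidently recognizes that this statement does not require a full scan of the sheet, whereas rusty_sheet does not.
rusty_sheet 0.2 reads the data fastest. There is no noticeable difference between a single file and multiple files. All queries take about the same time, most of which is spent reading the files. In real work with repeated queries, it is still better to save each table into the database first.
As an additional test, measure the performance of select count(*), create table as, and summarize.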

.timer on
create or replace table lineitem as from read_sheet('lineitem.xlsx',header=1);
select count(*) from read_sheet('lineitem.xlsx',header=1);
summarize (from read_sheet('lineitem.xlsx',header=1));

The results are as follows.

-- v1.3.2(excel)
duckdb132 -cmd "load tpch;load excel"
D create table lineitem as from read_xlsx('lineitem.xlsx',header=1);
Run Time (s): real 12.872 user 24.644000 sys 0.508000
D select count(*) from read_xlsx('lineitem.xlsx',header=1);
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│    600572    │
└──────────────┘
Run Time (s): real 12.033 user 23.588000 sys 0.072000
D summarize (from read_xlsx('lineitem.xlsx',header=1));
Run Time (s): real 14.342 user 27.720000 sys 0.152000
-- v1.3.2(rusty_sheet 0.1)
D load rusty_sheet;
Run Time (s): real 0.036 user 0.040000 sys 0.008000
D create or replace table lineitem as from read_sheet('lineitem.xlsx',header=1);
Run Time (s): real 18.536 user 18.988000 sys 3.208000
D select count(*) from read_sheet('lineitem.xlsx',header=1);
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│    600572    │
└──────────────┘
Run Time (s): real 14.019 user 15.720000 sys 0.724000
D summarize (from read_sheet('lineitem.xlsx',header=1));
Run Time (s): real 29.309 user 32.816000 sys 1.780000
-- v1.4.1(rusty_sheet 0.2)
duckdb141 -unsigned -cmd "load '/par/14/3/rusty_sheet.duckdb_extension'"
D .timer on
D create or replace table lineitem as from read_sheet('lineitem.xlsx',header=1);
Run Time (s): real 10.849 user 11.892000 sys 0.224000
D select count(*) from read_sheet('lineitem.xlsx',header=1);
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│    600572    │
└──────────────┘
Run Time (s): real 9.716 user 9.544000 sys 0.180000
D summarize (from read_sheet('lineitem.xlsx',header=1));
Run Time (s): real 21.961 user 24.584000 sys 0.376000

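rusty_sheet 0.2 is the fastest for both count and CTAS, but the excel extension is faster at summarize. Judging by rusty_sheet's elapsed time, it seems to read the xlsx file twice; this needs further investigation.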
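
As a rough check of the "read twice" suspicion, the summarize time can be divided by the count(*) time for the same reader. The Python sketch below uses only the "real" seconds reported above; the function name and the reading of the ratio are assumptions for illustration, since summarize also computes aggregates and a ratio near 2 is a hint rather than proof of a second pass.

```python
# "real" seconds copied from the count(*) and summarize runs above.
TIMINGS = {
    "excel (v1.3.2)": {"count": 12.033, "summarize": 14.342},
    "rusty_sheet 0.1 (v1.3.2)": {"count": 14.019, "summarize": 29.309},
    "rusty_sheet 0.2 (v1.4.1)": {"count": 9.716, "summarize": 21.961},
}


def summarize_to_count_ratio(timings):
    """Ratio of summarize time to count(*) time, rounded to two decimals.

    Treating count(*) as roughly the cost of one pass over the file is an
    assumption; a ratio near 1 is consistent with one pass, near 2 with two.
    """
    return {
        name: round(t["summarize"] / t["count"], 2)
        for name, t in timings.items()
    }


if __name__ == "__main__":
    for name, ratio in summarize_to_count_ratio(TIMINGS).items():
        print(f"{name:28s} summarize / count = {ratio:.2f}")
```

With these numbers the excel extension comes out at about 1.19, while both rusty_sheet versions land slightly above 2 (2.09 and 2.26).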
