当前位置: 首页 > news >正文

calamine读取xlsx文件的方法比较

calamine是纯rust编写的电子表格文件读取库,支持Excel(xls, xlsx, xlsm, xlsb, xla, xlam)、opendocument电子表格(ods)。
它自带的示例程序excel_to_csv.rs采用workbook.worksheet_range来读取,很多AI和网上的介绍文章都说,处理大文件workbook.worksheet_range_ref更快,我实际测试了一下。

测试步骤
0.准备测试xlsx文件
在duckdb中生成tpch数据,用excel插件写入xlsx文件。

duckdb
load tpch;
load excel;
call dbgen(sf=1);
copy (from lineitem limit 1048575)to 'exli2.xlsx' WITH (FORMAT xlsx, HEADER true) ;

1.拉取calamine源码

git clone --depth=1 https://gitclone.com/github.com/tafia/calamine
Cloning into 'calamine'...
remote: 对象计数中: 133, 完成.
remote: 压缩对象中: 100% (109/109), 完成.
remote: Total 133 (delta 21), reused 133 (delta 21)
Receiving objects: 100% (133/133), 5.08 MiB | 48.00 KiB/s, done.
Resolving deltas: 100% (21/21), done.

2.编译执行

export PATH=/par/mold240/bin:$PATHcd /par/calamine
cargo build --release --example excel_to_csvtime /par/calamine/target/release/examples/excel_to_csv /par/exli2.xlsx Sheet1real	0m21.959s
user	0m19.332s
sys	0m2.180s

3.修改excel_to_csv.rs为excel_to_csv_ref.rs
基本上就是把函数worksheet_range改为worksheet_range_ref,加入use calamine::ReaderRef;和use calamine::DataRef;数据类型Data修改为DataRef<'>,再根据编译器提示加入DataRef::<'>::SharedString的处理。

// SPDX-License-Identifier: MIT
//
// Copyright 2016-2025, Johann Tuffe.//! An example for using the `calamine` crate to convert an Excel file to CSV.
//!
//! Converts XLSX, XLSM, XLSB, and XLS files. The filename and sheet name must
//! be specified as command line arguments. The output CSV will be written to a
//! file with the same name as the input file, but with a `.csv` extension.use std::env;
use std::fs::File;
use std::io::{BufWriter, Write};
use std::path::PathBuf;use calamine::{open_workbook_auto, Data, Range, Reader};
use calamine::ReaderRef;
use calamine::DataRef;
// usage: cargo run --example excel_to_csv file.xls[xmb] sheet_name
//
// Where:
// - `file.xls[xmb]` is the Excel file to convert. Required.
// - `sheet_name` is the name of the sheet to convert. Required.
//
// The output will be written to a file with the same name as the input file,
// including the path, but with a `.csv` extension.
//
fn main() {let excel_file = env::args().nth(1).expect("Please provide an excel file to convert");let sheet_name = env::args().nth(2).expect("Expecting a sheet name as second argument");let excel_path = PathBuf::from(excel_file);match excel_path.extension().and_then(|s| s.to_str()) {Some("xlsx") | Some("xlsm") | Some("xlsb") | Some("xls") => (),_ => panic!("Expecting an excel file"),}let csv_path = excel_path.with_extension("csv");let mut csv_file = BufWriter::new(File::create(csv_path).unwrap());let mut workbook = open_workbook_auto(&excel_path).unwrap();let range = workbook.worksheet_range_ref(&sheet_name).unwrap();write_to_csv(&mut csv_file, &range).unwrap();
}// Write the Excel data as strings to a CSV file. Uses a semicolon (`;`) as the
// field separator.
//
// Note, this is a simplified version of CSV and doesn't handle quoting of
// separators or other special cases. See the `csv.rs` crate for a more robust
// solution.
fn write_to_csv<W: Write>(output_file: &mut W, range: &Range<DataRef<'_>>) -> std::io::Result<()> {let max_column = range.get_size().1 - 1;for rows in range.rows() {for (col_number, cell_data) in rows.iter().enumerate() {match *cell_data {DataRef::<'_>::Empty => Ok(()),DataRef::<'_>::Int(ref i) => write!(output_file, "{i}"),DataRef::<'_>::Bool(ref b) => write!(output_file, "{b}"),DataRef::<'_>::Error(ref e) => write!(output_file, "{e:?}"),DataRef::<'_>::Float(ref f) => write!(output_file, "{f}"),DataRef::<'_>::DateTime(ref d) => write!(output_file, "{}", d.as_f64()),DataRef::<'_>::String(ref s) | DataRef::<'_>::DateTimeIso(ref s) | DataRef::<'_>::DurationIso(ref s) => {write!(output_file, "{s}")} , DataRef::<'_>::SharedString(s) => write!(output_file, "{s}")}?;// Write the field separator except for the last column.if col_number != max_column {write!(output_file, ";")?;}}write!(output_file, "\r\n")?;}Ok(())
}

4.编译执行

cargo build --release --example excel_to_csv_refmv /par/exli2.csv /par/exli2.csv.old
time /par/calamine/target/release/examples/excel_to_csv_ref /par/exli2.xlsx Sheet1real	0m22.062s
user	0m19.412s
sys	0m2.592s
md5sum /par/exli2.csv*
05bcc52a09f1ee2bbb7c9b08172fb7ae  /par/exli2.csv
05bcc52a09f1ee2bbb7c9b08172fb7ae  /par/exli2.csv.old

结论:worksheet_range和worksheet_range_ref读取速度和读取结果完全一样。
polars和pandas的read_excel函数都支持使用calamine引擎,polars目前还是默认使用calamine引擎,测试结果如下

import polars as plpl.read_excel(source="exli2.xlsx",sheet_name="Sheet1",) 
shape: (1_048_575, 16)
┌────────────┬───────────┬───────────┬──────────────┬───┬───────────────┬───────────────────┬────────────┬─────────────────────────────────┐
│ l_orderkey ┆ l_partkey ┆ l_suppkey ┆ l_linenumber ┆ … ┆ l_receiptdate ┆ l_shipinstruct    ┆ l_shipmode ┆ l_comment                       │
│ ---        ┆ ---       ┆ ---       ┆ ---          ┆   ┆ ---           ┆ ---               ┆ ---        ┆ ---                             │
│ i64        ┆ i64       ┆ i64       ┆ i64          ┆   ┆ date          ┆ str               ┆ str        ┆ str                             │
╞════════════╪═══════════╪═══════════╪══════════════╪═══╪═══════════════╪═══════════════════╪════════════╪═════════════════════════════════╡
│ 115519077061            ┆ … ┆ 1996-03-22    ┆ DELIVER IN PERSON ┆ TRUCK      ┆ to beans x-ray carefull         │
│ 16731073112            ┆ … ┆ 1996-04-20    ┆ TAKE BACK RETURN  ┆ MAIL       ┆  according to the final foxes.… │
│ 16370037013            ┆ … ┆ 1996-01-31    ┆ TAKE BACK RETURN  ┆ REG AIR    ┆ ourts cajole above the furiou   │
│ 1213246334            ┆ … ┆ 1996-05-16    ┆ NONE              ┆ AIR        ┆ s cajole busily above t         │
│ 12402715345            ┆ … ┆ 1996-04-01    ┆ NONE              ┆ FOB        ┆  the regular, regular pa        │
│ …          ┆ …         ┆ …         ┆ …            ┆ … ┆ …             ┆ …                 ┆ …          ┆ …                               │
│ 10484826322482375            ┆ … ┆ 1992-05-17    ┆ NONE              ┆ MAIL       ┆  final accounts haggle about t… │
│ 10484838136013611            ┆ … ┆ 1994-08-30    ┆ NONE              ┆ AIR        ┆ ely around the regula           │
│ 10484831005250552            ┆ … ┆ 1994-07-08    ┆ DELIVER IN PERSON ┆ MAIL       ┆ of the bli                      │
│ 104848312950995103            ┆ … ┆ 1994-05-16    ┆ TAKE BACK RETURN  ┆ RAIL       ┆ ests impress. sly, even i       │
│ 10484835007025764            ┆ … ┆ 1994-08-29    ┆ DELIVER IN PERSON ┆ REG AIR    ┆ regularly expre                 │
└────────────┴───────────┴───────────┴──────────────┴───┴───────────────┴───────────────────┴────────────┴─────────────────────────────────┘
import time
t=time.time();x=pl.read_excel(source="exli2.xlsx",sheet_name="Sheet1",);print(time.time()-t)
23.807374477386475
t=time.time();x.write_csv("plexli2.csv", separator=",");print(time.time()-t)
0.6844663619995117
t=time.time();pl.read_excel(source="exli2.xlsx",sheet_name="Sheet1",).write_csv("plexli.csv", separator=",");print(time.time()-t)
22.80883002281189import pandas as pd
import time
t=time.time();pd.read_excel("exli2.xlsx",engine="calamine").to_csv("pdexl2.csv", index=False, header=True);print(time.time()-t)
84.7169542312622md5sum p*exl2.csv
814e86b96b4e15ae43bf005ca0f5c8fe  pdexl2.csv
md5sum plexli2.csv
814e86b96b4e15ae43bf005ca0f5c8fe  plexli2.csv

虽然通过python中转,polars与rust中直接转的时间差距微小,而pandas则慢了很多。

http://www.dtcms.com/a/326047.html

相关文章:

  • Spring Boot 2.0 升级至 3.5 JDK 1.8 升级至 17 全面指南
  • 计算机视觉CS231n学习(7)
  • 【Altium designer】解决报错“Access violation at address...“
  • 【代码随想录day 17】 力扣 617.合并二叉树
  • python魔法方法__str__()介绍
  • 【Lua】题目小练9
  • 从零构建自定义Spring Boot Starter:打造你的专属开箱即用组件
  • 爬虫与数据分析入门:从中国大学排名爬取到数据可视化全流程
  • Go语言构建高性能AI分析网关:原理与实战
  • 设计模式笔记_结构型_组合模式
  • React(四):事件总线、setState的细节、PureComponent、ref
  • Jenkins 搭建鸿蒙打包
  • 【k8s】k8s中的几个概念性问题
  • day48 力扣739. 每日温度 力扣496.下一个更大元素 I 力扣503.下一个更大元素II
  • 轻量级解决方案:如何高效处理Word转PDF?
  • k8s的calico是什么作用,举例一下
  • 【2025最新版】PDF24 Creator,PDF编辑,合并分割,格式转换全能工具箱,本地离线版本,完全免费!
  • 【C语言强化训练16天】--从基础到进阶的蜕变之旅:Day1
  • 【12-激活函数】
  • 【PRML】分类
  • 普通大学本科生如何入门强化学习?
  • 算法73. 矩阵置零
  • MySQL权限管理和MySQL备份
  • 银行客户经营:用企业微信精准破解触达断层、效率瓶颈、数据孤岛三个痛点
  • GPT-5 全面解析与最佳实践指南
  • 容器 K8S Docker Kata 学习(一)
  • idea三个环境变量设置
  • Spring Boot Excel数据导入数据库实现详解
  • Spring-Security-5.7.11升级6.5.2
  • Socket(套接字)网络编程