
[Tool] Summarizing invoices with Python (source code included)

news 2025/7/28 7:26:05

#Motivation#

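I recently had to sort out a batch of invoices for the finance department. There were quite a few, and tallying them by hand in an Excel sheet would have been slow and prone to errors and omissions. Since I will run into this chore again, I wrote a small batch-processing tool to handle it once and for all, and I'm sharing it here for anyone who needs it.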

#Main text#

Scope

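The invoices here are ordinary electronic invoices (普通发票) in PDF format; special VAT invoices (专票) can be handled by adapting the same approach. An invoice contains many fields; the ones needed here are the invoice date, the issuing company, the issuing company's taxpayer identification number, the service item, and the amount. If you need other fields, adjust the code accordingly. Note also that the service item extraction only covers ordinary invoices with a single service item.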
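Because the script only handles invoices with a single service item, it can help to flag files that contain more than one before trusting the extracted item. This is a minimal sketch, assuming each item line in the extracted text starts with the `*category*` prefix that the script's `\*(.*)\*` regex relies on. `count_item_lines` and the sample text are hypothetical and not part of the original tool.

```python
import re

# Assumption: each item line in the extracted text begins with "*category*".
ITEM_LINE = re.compile(r'^[ \t]*\*[^*\n]+\*', re.MULTILINE)


def count_item_lines(invoice_text):
    """Count lines that look like invoice item lines (start with *category*)."""
    return len(ITEM_LINE.findall(invoice_text))


# Hypothetical extracted text, for illustration only
sample = "开票日期:2025年07月01日\n*餐饮服务*餐费 100.00\n*住宿服务*住宿费 200.00\n"
if count_item_lines(sample) > 1:
    print("multi-item invoice, check manually")
```

Invoices flagged this way would need manual handling or an extended regex, since the script records only one service item per file.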

What it does

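It extracts the fields of interest from every ordinary PDF invoice under a given path and saves them to a newly created Excel file at a location you specify. At the end it prints the total number of invoices and the total amount, so you can cross-check the result.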
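The final count-and-total check can be sketched without any PDF handling. In the sketch below, amounts are summed with `Decimal` so that cents do not drift, and values that fail to parse are reported rather than silently dropped. `summarize_amounts` is a hypothetical helper; the script itself uses pandas for this step.

```python
from decimal import Decimal, InvalidOperation


def summarize_amounts(raw_amounts):
    """Return (count, total, bad_indexes) for a list of amount strings."""
    total = Decimal("0")
    bad_indexes = []
    for i, raw in enumerate(raw_amounts):
        try:
            total += Decimal(raw)
        except (InvalidOperation, TypeError, ValueError):
            # None or non-numeric text: remember the row instead of failing
            bad_indexes.append(i)
    return len(raw_amounts), total, bad_indexes


# Made-up amounts, including two that cannot be parsed
count, total, bad = summarize_amounts(["100.00", "58.50", None, "abc"])
print("%d invoices, total %.2f, unparsed rows: %s" % (count, total, bad))
```

Listing the unparsed rows makes it easier to see which invoices need a manual look when the printed total does not match expectations.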

Existing PDF libraries

Python already has PDF libraries that can be used directly. Here we use pdfplumber:

PDF libraries
Library	Notes
PyMuPDF	General-purpose text reading
pdfplumber	Tables and structured data
PyPDF2	Not recommended

Full code

The code can be copied and used as is, and extended as needed.

import os
import re
import pdfplumber
import pandas as pd


# Return the first match (the whole matched text), cleaned up
def re_search(bt, text):
    result = re.search(bt, text)
    if result is not None:
        return re_replace(result[0])
    return None


# Find all matches and return the second one, cleaned up
# (the script treats the second "名称" / "纳税人识别号" match as the issuing company's)
def re_findall(bt, text):
    result = bt.findall(text)
    # findall returns a list, never None, so check its length instead
    if len(result) > 1:
        return re_replace(result[1])
    return None


# Strip irrelevant characters
def re_replace(text):
    return text.replace(' ', '').replace('　', '').replace('）', '').replace(')', '').replace('：', ':')


# Collect every PDF invoice file under the path
def get_files_name(dir_path):
    files_name = []
    for root, sub_dir, file_names in os.walk(dir_path):
        for name in file_names:
            if name.endswith('.pdf'):
                filepath = os.path.join(root, name)
                files_name.append(filepath)
    return files_name


# Read the PDF invoices under invoice_pdf_path and save the fields to excel_file
def read_invoice(excel_file, invoice_pdf_path):
    # Fields to extract
    fields_you_need = {
        "开票日期": [],
        "开票公司": [],
        "纳税人识别号": [],
        "服务项目": [],
        "票额": [],
    }
    filesname = get_files_name(invoice_pdf_path)
    for filename in filesname:
        # print(f"正在读取：{filename}")
        with pdfplumber.open(filename) as pdf:
            page0 = pdf.pages[0]
            invoice_text = page0.extract_text() or ""

        # Append exactly one value per file (None when a field is not found),
        # so every column has the same length as filesname
        invoice_date = re_search(re.compile(r'开票日期(.*)'), invoice_text)
        fields_you_need["开票日期"].append(invoice_date.replace("开票日期:", "") if invoice_date else None)

        # The capture group already excludes the "名称:" label
        seller_name = re_findall(re.compile(r'名\s*称\s*[:：]\s*([\u4e00-\u9fa5]+)'), invoice_text)
        fields_you_need["开票公司"].append(seller_name)

        seller_number = re_findall(re.compile(r'纳税人识别号\s*[:：]\s*([a-zA-Z0-9]+)'), invoice_text)
        fields_you_need["纳税人识别号"].append(seller_number)

        description = re_search(re.compile(r'\*(.*)\*'), invoice_text)  # category
        fields_you_need["服务项目"].append(description)

        total = re_search(re.compile(r'小写.*[0-9.]+'), invoice_text)
        fields_you_need["票额"].append(total.replace("小写¥", "") if total else None)

    df = pd.DataFrame(fields_you_need, index=range(len(filesname)))
    df.index += 1
    df['票额'] = pd.to_numeric(df['票额'], errors='coerce')  # store amounts as numbers in Excel
    with pd.ExcelWriter(excel_file) as writer:
        df.to_excel(writer)
    all_total = df['票额'].sum()
    # Print the number of invoices and the total amount
    print("一共%d张发票，总额为%.2f元。" % (len(filesname), all_total))


invoice_pdf_path = r"./INVOICE_PDF"
excel_file = r"./INVOICE_PDF/InvoiceTotal.xlsx"
read_invoice(excel_file, invoice_pdf_path)

#Summary#

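The same approach works for other PDFs: first read the text and print it, look for regularities, then build regular expressions to pull out the key information, and finally organize what you need and save it to a new file.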
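The workflow described above (print the text, spot the pattern, write a regex) can be sketched as follows. The sample text is made up for illustration, and real pdfplumber output may be laid out differently. `extract_fields` and the `PATTERNS` table are hypothetical names, not part of the original script.

```python
import re

# Hypothetical text, standing in for what extract_text() might return
sample_text = (
    "电子发票（普通发票）\n"
    "开票日期：2025年07月01日\n"
    "名称：某某科技有限公司 名称：某某餐饮有限公司\n"
    "*餐饮服务*餐费 94.34 6% 5.66\n"
    "价税合计（大写） 壹佰圆整 （小写）¥100.00\n"
)

# One pattern per field; group 1 captures only the value
PATTERNS = {
    "date": re.compile(r'开票日期\s*[:：]\s*(\S+)'),
    "item": re.compile(r'(\*[^*\n]+\*\S*)'),
    "total": re.compile(r'小写\S*?([0-9]+(?:\.[0-9]+)?)'),
}


def extract_fields(text):
    """Return {field: captured value or None} using the first match of each pattern."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


print(extract_fields(sample_text))
```

Keeping the patterns in one table means a new field only needs a new entry once its regularity has been spotted in the printed text.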

If you found this useful, feel free to like and bookmark it before you go!
