wzjs 2025/9/7 7:20:40

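Use Python to extract the tables in a PDF into Excel/CSV, with support for both text-based PDFs and scanned/image PDFs (which need OCR). The program covers the following:

1. Detecting the PDF type automatically (text or scan): text extraction is tried first, with OCR as the fallback
2. Extracting table data and saving it as Excel/CSV
3. Handling multi-page PDFs
4. Command-line use and a graphical interface (optional; only the command-line version is shown here)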

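1. Install the dependencies

Before running, install the required libraries (openpyxl is what pandas uses to write .xlsx files). tabula-py also needs a Java runtime, pdf2image needs Poppler, and pytesseract needs the Tesseract binary installed on the system.

pip install tabula-py pandas openpyxl pytesseract pdf2image opencv-python pillow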

2. Complete code

Import the required modules

import os
import pandas as pd
import tabula
from pdf2image import convert_from_path
import pytesseract
import cv2
import tempfile
import argparse

Define the function

def pdf_to_excel(pdf_path, output_path, use_ocr=False):
    """
    Convert the tables in a PDF into an Excel/CSV file.

    :param pdf_path: path to the PDF file
    :param output_path: path of the output Excel/CSV file
    :param use_ocr: force OCR (for scanned documents)
    """
    try:
        # Check the output format
        file_ext = os.path.splitext(output_path)[1].lower()
        if file_ext not in ['.xlsx', '.csv']:
            raise ValueError("Output file must be .xlsx or .csv")

        # Try to extract text tables directly (non-scanned PDFs)
        if not use_ocr:
            try:
                print("Trying to extract text tables...")
                dfs = tabula.read_pdf(pdf_path, pages='all', multiple_tables=True)
                if not dfs:
                    raise RuntimeError("No tables detected; the PDF may be a scanned image.")
                # Merge the tables from all pages
                combined_df = pd.concat(dfs, ignore_index=True)
                if file_ext == '.xlsx':
                    combined_df.to_excel(output_path, index=False)
                else:
                    combined_df.to_csv(output_path, index=False)
                print(f"Conversion succeeded! Result saved to: {output_path}")
                return
            except Exception as e:
                print(f"Text extraction failed (possibly a scan), trying OCR: {e}")
                use_ocr = True

        # OCR for scanned/image PDFs
        if use_ocr:
            print("Running OCR on the scanned document...")
            with tempfile.TemporaryDirectory() as temp_dir:
                # Convert the PDF pages to images
                images = convert_from_path(pdf_path, output_folder=temp_dir)
                all_text = []
                for i, img in enumerate(images):
                    img_path = os.path.join(temp_dir, f"page_{i+1}.jpg")
                    img.save(img_path, 'JPEG')
                    # Enhance the image with OpenCV (optional)
                    img_cv = cv2.imread(img_path)
                    gray = cv2.cvtColor(img_cv, cv2.COLOR_BGR2GRAY)
                    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
                    # Run OCR on the binarized page
                    text = pytesseract.image_to_string(thresh, config='--psm 6')
                    all_text.append(text)

                # Save the recognized text as a table
                text_combined = "\n".join(all_text)
                lines = [line.split() for line in text_combined.split('\n') if line.strip()]
                df = pd.DataFrame(lines)
                if file_ext == '.xlsx':
                    df.to_excel(output_path, index=False, header=False)
                else:
                    df.to_csv(output_path, index=False, header=False)
                print(f"OCR conversion complete! Result saved to: {output_path}")

    except Exception as e:
        print(f"Conversion failed: {e}")

if __name__ == "__main__":
    # Parse the command-line arguments
    parser = argparse.ArgumentParser(description="PDF table extraction tool")
    parser.add_argument("pdf_path", help="path to the input PDF file")
    parser.add_argument("output_path", help="path to the output Excel/CSV file")
    parser.add_argument("--ocr", action="store_true", help="force OCR (for scanned documents)")
    args = parser.parse_args()
    # Run the conversion
    pdf_to_excel(args.pdf_path, args.output_path, args.ocr)

Run from the command line

# Detect the PDF type automatically (default)
python pdf_to_excel.py input.pdf output.xlsx
# Force OCR (for scanned documents)
python pdf_to_excel.py scanned.pdf output.csv --ocr
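
The script checks the output extension inside pdf_to_excel, after the arguments have already been parsed. As one possible variation, the same check can be moved into argparse so that a bad path is rejected with a usage message before any work starts. This is only a sketch: the names output_file, ALLOWED_EXTS and build_parser are illustrative and do not appear in the original script.

```python
import argparse
import os

# Extensions the converter knows how to write (same pair the script checks).
ALLOWED_EXTS = {".xlsx", ".csv"}


def output_file(path):
    """argparse type-checker: accept only .xlsx or .csv output paths."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTS:
        raise argparse.ArgumentTypeError(
            f"output file must end in .xlsx or .csv, got {path!r}"
        )
    return path


def build_parser():
    """Build the same three-argument parser, with the extension check attached."""
    parser = argparse.ArgumentParser(description="PDF table extraction tool")
    parser.add_argument("pdf_path", help="path to the input PDF file")
    parser.add_argument("output_path", type=output_file,
                        help="path to the output Excel/CSV file")
    parser.add_argument("--ocr", action="store_true",
                        help="force OCR (for scanned documents)")
    return parser


# Parse an explicit argument list instead of sys.argv, for demonstration.
args = build_parser().parse_args(["input.pdf", "output.CSV", "--ocr"])
print(args.output_path, args.ocr)
```

With this in place, an unsupported extension makes argparse exit with an error message instead of reaching the conversion code.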

Call the function directly

pdf_to_excel("input.pdf", "output.xlsx", use_ocr=False)

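Key points:
Text-based PDFs: tabula-py extracts the table structure directly.
Scanned/image PDFs:
pdf2image converts the PDF pages to images.
OpenCV preprocesses each image (grayscale conversion and Otsu binarization).
pytesseract (Tesseract OCR) recognizes the text, which is then split into table rows.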

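Scan quality: OCR accuracy depends on how sharp the images are, so use a high-resolution PDF where possible. The rendering resolution can also be raised through the dpi argument of convert_from_path (default 200).

Complex tables: if a table has merged cells, the output may need manual adjustment.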
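
The OCR branch builds its rows by splitting every recognized line on whitespace, so rows can end up with different lengths and a cell that contains a space is split in two. The sketch below isolates that step and adds a padding pass so every row has the same number of cells. The padding, the function name ocr_text_to_rows and the fill parameter are assumptions for illustration, not part of the script.

```python
def ocr_text_to_rows(text, fill=""):
    """Split OCR text into rows of cells and pad short rows to equal width."""
    # Same splitting rule as the script: skip blank lines, split on whitespace.
    rows = [line.split() for line in text.split("\n") if line.strip()]
    # Width of the widest row; 0 when there is no text at all.
    width = max((len(row) for row in rows), default=0)
    return [row + [fill] * (width - len(row)) for row in rows]


# Hypothetical OCR output: the last row is missing its final cell.
sample = "Name Qty Price\nApple 3 1.50\n\nPear 2\n"
for row in ocr_text_to_rows(sample):
    print(row)
```

Padding only evens out the row lengths; it cannot tell which column a missing cell belonged to, which is why merged or empty cells may still need fixing by hand.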

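Chinese support: make sure the Chinese language pack (chi_sim) is installed for Tesseract, and pass lang='chi_sim' to pytesseract.image_to_string, since the script as written uses Tesseract's default language.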
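
One way to confirm that the language pack is present before running OCR is to ask the Tesseract command-line tool for its installed languages. The sketch below assumes the tesseract executable is on PATH; the helper names parse_langs and has_language are illustrative and are not part of the script.

```python
import subprocess


def parse_langs(list_langs_output):
    """Turn the text printed by `tesseract --list-langs` into a set of codes."""
    lines = (ln.strip() for ln in list_langs_output.splitlines())
    # Language codes contain no spaces; the header line does, so it is dropped.
    return {ln for ln in lines if ln and " " not in ln}


def has_language(code="chi_sim"):
    """Return True if the local Tesseract install lists the given language."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Read both streams, in case a build prints the list to stderr.
    return code in parse_langs(result.stdout + "\n" + result.stderr)
```

If has_language("chi_sim") returns False, install the language pack first; otherwise the lang='chi_sim' argument can be passed to pytesseract.image_to_string.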

If you need to go further (for example, custom table-parsing logic), you can build on this script.
