File parsing: doc, docx, pdf
1. Parsing .doc
On Ubuntu/Debian, install the required system tools first:
apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr \
flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
pip install textract
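Parsing:
import textract

doc_file = 'path.doc'  # placeholder path to the .doc file
# textract.process returns bytes, so decode the result to get a str
text = textract.process(doc_file, input_encoding='utf-8')
text_str = text.decode('utf-8')
print(text_str)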
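Because textract.process returns bytes, the decode step is where an unexpected byte sequence would raise an error. Below is a minimal sketch of a more forgiving decode helper. It assumes that replacing undecodable bytes and collapsing runs of blank lines is acceptable for your use case. The name bytes_to_text is hypothetical and is not part of textract.

```python
import re


def bytes_to_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode parser output and tidy whitespace (hypothetical helper)."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    text = raw.decode(encoding, errors="replace")
    # Normalize Windows and old-Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse three or more consecutive newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

It could be used as bytes_to_text(textract.process(doc_file)) in place of the plain decode. Whether silent replacement is preferable to an exception depends on how much you trust the input files.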
2. Parsing .docx
pip install python-docx
from docx import Document

def read_docx(docx_file):
    # Collect the text of each body paragraph, one per line
    doc = Document(docx_file)
    text = []
    for paragraph in doc.paragraphs:
        text.append(paragraph.text)
    return '\n'.join(text)

print(read_docx('path.docx'))
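A .docx file is a ZIP archive whose main text lives in word/document.xml, so the same kind of extraction can be sketched with only the standard library. This is an illustrative alternative and not how python-docx is implemented. The name read_docx_stdlib is hypothetical, and the sketch ignores headers, footers, footnotes, and all formatting.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_stdlib(docx_file):
    """Return paragraph text from a .docx using only the standard library."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    # iter() walks the whole tree, so paragraphs inside tables are included too.
    for para in root.iter(W_NS + "p"):
        # Each <w:t> element holds one run of text; join the runs per paragraph.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)
```

Unlike doc.paragraphs in python-docx, which returns only body-level paragraphs, this walk also picks up text inside table cells. That may or may not be what you want, so treat the output order and content as something to check against your own documents.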