
Reading through the Dify source code: an uploaded document is first handled by the API layer and then passed to the file-processing service layer. For knowledge-base management, an uploaded PDF enters the indexing pipeline through IndexingRunner (indexing_runner.py); this step usually runs asynchronously as a Celery task (document_indexing_task.py). ExtractProcessor acts as the central dispatcher for document processing and selects a concrete Extractor based on the document's format. The PdfExtractor class is dedicated to PDF files and uses pypdfium2, an efficient PDF parsing library, to read PDF content page by page.

PDF document parsing workflow

1. The user uploads a PDF file through the Upload API.
2. The Upload API asks FileService to validate and save the file; FileService returns the file metadata.
3. The Upload API triggers the indexing process in IndexingRunner.
4. IndexingRunner asks ExtractProcessor to extract text from the document.
5. ExtractProcessor hands the PDF file to PdfExtractor, which uses pypdfium2 to extract the text.
6. The extracted text is returned and passed to CleanProcessor for cleaning.
7. The processed text is stored in the knowledge base.
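The flow between these components can be sketched as a chain of plain functions (all names and return shapes here are illustrative simplifications, not Dify's actual signatures):

```python
def validate_and_save(raw: bytes, filename: str) -> dict:
    # FileService step: persist the upload and return its metadata.
    return {"key": filename, "size": len(raw)}


def extract_text(pages: list[str]) -> str:
    # ExtractProcessor / PdfExtractor step: join per-page text.
    return "\n".join(pages)


def clean_text(text: str) -> str:
    # CleanProcessor step: strip whitespace and drop empty lines.
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


def run_indexing(raw: bytes, filename: str, pages: list[str]) -> dict:
    # IndexingRunner step: validate -> extract -> clean -> "store".
    meta = validate_and_save(raw, filename)
    cleaned = clean_text(extract_text(pages))
    return {"metadata": meta, "content": cleaned}
```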

Dify's file-parsing capability is a layered architecture built from a few core components:

Core architecture

1. Base abstract class

Dify defines an abstract base class, BaseExtractor, which provides a unified interface for all file extractors:

```python
"""Abstract interface for document loader implementations."""
from abc import ABC, abstractmethod


class BaseExtractor(ABC):
    """Interface for extract files."""

    @abstractmethod
    def extract(self):
        raise NotImplementedError
```

2. Central dispatcher

The ExtractProcessor class is the core coordinator: it selects the appropriate extractor for each file based on its type.
Key methods

  • load_from_upload_file
    Input: an UploadFile object.
    Purpose: extracts content from the uploaded file. It can return a list of Document objects or just the plain text.
  • load_from_url
    Input: the URL of a file or web page.
    Purpose: fetches the remote content through ssrf_proxy, infers the file type automatically, saves it to a local temporary file, and then runs extraction.
  • extract
    Input: an ExtractSetting (extraction settings) and an optional file_path.
    Purpose: based on the data-source type (local file, Notion, website) and the file type, selects the appropriate extractor, runs it, and returns a list of Document objects.
```python
SUPPORT_URL_CONTENT_TYPES = ["application/pdf", "text/plain", "application/json"]
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124"
    " Safari/537.36"
)


class ExtractProcessor:
    @classmethod
    def load_from_upload_file(
        cls, upload_file: UploadFile, return_text: bool = False, is_automatic: bool = False
    ) -> Union[list[Document], str]:
        extract_setting = ExtractSetting(
            datasource_type="upload_file", upload_file=upload_file, document_model="text_model"
        )
        if return_text:
            delimiter = "\n"
            return delimiter.join([document.page_content for document in cls.extract(extract_setting, is_automatic)])
        else:
            return cls.extract(extract_setting, is_automatic)

    @classmethod
    def load_from_url(cls, url: str, return_text: bool = False) -> Union[list[Document], str]:
        response = ssrf_proxy.get(url, headers={"User-Agent": USER_AGENT})
        with tempfile.TemporaryDirectory() as temp_dir:
            suffix = Path(url).suffix
            if not suffix and suffix != ".":
                # get content-type
                if response.headers.get("Content-Type"):
                    suffix = "." + response.headers.get("Content-Type").split("/")[-1]
                else:
                    content_disposition = response.headers.get("Content-Disposition")
                    filename_match = re.search(r'filename="([^"]+)"', content_disposition)
                    if filename_match:
                        filename = unquote(filename_match.group(1))
                        match = re.search(r"\.(\w+)$", filename)
                        if match:
                            suffix = "." + match.group(1)
                        else:
                            suffix = ""
            # FIXME mypy: Cannot determine type of 'tempfile._get_candidate_names' better not use it here
            file_path = f"{temp_dir}/{next(tempfile._get_candidate_names())}{suffix}"  # type: ignore
            Path(file_path).write_bytes(response.content)
            extract_setting = ExtractSetting(datasource_type="upload_file", document_model="text_model")
            if return_text:
                delimiter = "\n"
                return delimiter.join(
                    [
                        document.page_content
                        for document in cls.extract(extract_setting=extract_setting, file_path=file_path)
                    ]
                )
            else:
                return cls.extract(extract_setting=extract_setting, file_path=file_path)

    @classmethod
    def extract(
        cls, extract_setting: ExtractSetting, is_automatic: bool = False, file_path: Optional[str] = None
    ) -> list[Document]:
        if extract_setting.datasource_type == DatasourceType.FILE.value:
            with tempfile.TemporaryDirectory() as temp_dir:
                if not file_path:
                    assert extract_setting.upload_file is not None, "upload_file is required"
                    upload_file: UploadFile = extract_setting.upload_file
                    suffix = Path(upload_file.key).suffix
                    # FIXME mypy: Cannot determine type of 'tempfile._get_candidate_names' better not use it here
                    file_path = f"{temp_dir}/{next(tempfile._get_candidate_names())}{suffix}"  # type: ignore
                    storage.download(upload_file.key, file_path)
                input_file = Path(file_path)
                file_extension = input_file.suffix.lower()
                etl_type = dify_config.ETL_TYPE
                extractor: Optional[BaseExtractor] = None
                if etl_type == "Unstructured":
                    unstructured_api_url = dify_config.UNSTRUCTURED_API_URL or ""
                    unstructured_api_key = dify_config.UNSTRUCTURED_API_KEY or ""
                    # Select the concrete document extractor class
                    if file_extension in {".xlsx", ".xls"}:
                        extractor = ExcelExtractor(file_path)
                    elif file_extension == ".pdf":
                        extractor = PdfExtractor(file_path)
                    elif file_extension in {".md", ".markdown", ".mdx"}:
                        extractor = (
                            UnstructuredMarkdownExtractor(file_path, unstructured_api_url, unstructured_api_key)
                            if is_automatic
                            else MarkdownExtractor(file_path, autodetect_encoding=True)
                        )
                    elif file_extension == ".epub":
                        extractor = UnstructuredEpubExtractor(file_path, unstructured_api_url, unstructured_api_key)
                    else:
                        # txt
                        extractor = TextExtractor(file_path, autodetect_encoding=True)
                else:
                    if file_extension in {".xlsx", ".xls"}:
                        extractor = ExcelExtractor(file_path)
                    elif file_extension == ".pdf":
                        extractor = PdfExtractor(file_path)
                    elif file_extension in {".md", ".markdown", ".mdx"}:
                        extractor = MarkdownExtractor(file_path, autodetect_encoding=True)
                    elif file_extension in {".htm", ".html"}:
                        extractor = HtmlExtractor(file_path)
                    elif file_extension == ".docx":
                        extractor = WordExtractor(file_path, upload_file.tenant_id, upload_file.created_by)
                    elif file_extension == ".csv":
                        extractor = CSVExtractor(file_path, autodetect_encoding=True)
                    elif file_extension == ".epub":
                        extractor = UnstructuredEpubExtractor(file_path)
                    else:
                        # txt
                        extractor = TextExtractor(file_path, autodetect_encoding=True)
                return extractor.extract()
        elif extract_setting.datasource_type == DatasourceType.NOTION.value:
            assert extract_setting.notion_info is not None, "notion_info is required"
            extractor = NotionExtractor(
                notion_workspace_id=extract_setting.notion_info.notion_workspace_id,
                notion_obj_id=extract_setting.notion_info.notion_obj_id,
                notion_page_type=extract_setting.notion_info.notion_page_type,
                document_model=extract_setting.notion_info.document,
                tenant_id=extract_setting.notion_info.tenant_id,
            )
            return extractor.extract()
        elif extract_setting.datasource_type == DatasourceType.WEBSITE.value:
            assert extract_setting.website_info is not None, "website_info is required"
            if extract_setting.website_info.provider == "firecrawl":
                extractor = FirecrawlWebExtractor(
                    url=extract_setting.website_info.url,
                    job_id=extract_setting.website_info.job_id,
                    tenant_id=extract_setting.website_info.tenant_id,
                    mode=extract_setting.website_info.mode,
                    only_main_content=extract_setting.website_info.only_main_content,
                )
                return extractor.extract()
            elif extract_setting.website_info.provider == "watercrawl":
                extractor = WaterCrawlWebExtractor(
                    url=extract_setting.website_info.url,
                    job_id=extract_setting.website_info.job_id,
                    tenant_id=extract_setting.website_info.tenant_id,
                    mode=extract_setting.website_info.mode,
                    only_main_content=extract_setting.website_info.only_main_content,
                )
                return extractor.extract()
            elif extract_setting.website_info.provider == "jinareader":
                extractor = JinaReaderWebExtractor(
                    url=extract_setting.website_info.url,
                    job_id=extract_setting.website_info.job_id,
                    tenant_id=extract_setting.website_info.tenant_id,
                    mode=extract_setting.website_info.mode,
                    only_main_content=extract_setting.website_info.only_main_content,
                )
                return extractor.extract()
            else:
                raise ValueError(f"Unsupported website provider: {extract_setting.website_info.provider}")
        else:
            raise ValueError(f"Unsupported datasource type: {extract_setting.datasource_type}")
```
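The long if/elif chain in `extract` is essentially a suffix-to-extractor lookup. As a hedged refactoring sketch (not Dify's actual code; extractor names here are strings for illustration), the same selection logic can be written as a dispatch table with a plain-text fallback:

```python
from pathlib import Path

# Maps a lowercased file suffix to the extractor that handles it,
# mirroring the non-Unstructured branch of ExtractProcessor.extract.
EXTRACTOR_BY_SUFFIX = {
    ".pdf": "PdfExtractor",
    ".xlsx": "ExcelExtractor",
    ".xls": "ExcelExtractor",
    ".md": "MarkdownExtractor",
    ".markdown": "MarkdownExtractor",
    ".mdx": "MarkdownExtractor",
    ".htm": "HtmlExtractor",
    ".html": "HtmlExtractor",
    ".docx": "WordExtractor",
    ".csv": "CSVExtractor",
    ".epub": "UnstructuredEpubExtractor",
}


def pick_extractor(file_path: str) -> str:
    suffix = Path(file_path).suffix.lower()
    # Unknown suffixes fall back to plain text, like the `else: # txt` branch.
    return EXTRACTOR_BY_SUFFIX.get(suffix, "TextExtractor")
```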

ExtractProcessor is deeply integrated into Dify's RAG and workflow systems, providing a solid foundation for knowledge-base construction and document processing.
