【TIDE DIARY 1】Dify Trial-and-Error Notes; Conda

I. Conda Installation and Configuration

[Figure: Conda installation and configuration screenshot]
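The screenshot above walked through the Conda setup. As a stand-in, here is a minimal sketch of the shell commands for preparing the environment this workflow needs; the environment name cpg2pvg is a placeholder of ours, and the Python version and packages follow the requirements listed at the end of this post.

# Create and activate an isolated environment ("cpg2pvg" is a placeholder name)
conda create -n cpg2pvg python=3.10 -y   # any Python 3.8+ works
conda activate cpg2pvg

# Install the packages used by the workflow nodes below
pip install PyPDF2 jieba langchain requests

# Sanity check
python -c "import PyPDF2, jieba, langchain, requests; print('environment ok')"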

II. A Complete Guide to Implementing CPG2PVG-AI SLOW Mode with a Dify Workflow

This article details how to use a Dify workflow to convert clinical guidelines into public-facing guidelines, with a complete solution for the long-text processing requirements of SLOW mode.

Project Background

CPG2PVG-AI is a key sub-project of the TIDE AI + Medicine project group at The Chinese University of Hong Kong, Shenzhen. It aims to transform professional Clinical Practice Guidelines (CPG) into Public Version Guidelines (PVG) that ordinary readers can easily understand. SLOW mode, the project's core processing mode, is designed for long and complex clinical guidelines, using multi-stage processing to ensure information completeness and accuracy.

Overall Dify Workflow Architecture

[Figure: overall Dify workflow architecture diagram]

Based on the project requirements, we designed the following Dify workflow pipeline:

PDF input → text segmentation → compilation preprocessing → knowledge filtering → term conversion → LLM processing → content integration → final output

Detailed Node Configuration and Code Implementation

1. Input Node Configuration

# Dify workflow input node configuration
Node ID: input_node
Node type: File Input
Parameters:
  input_variable: input_file (string)
  file_types: ["pdf"]
  max_file_size: 10MB
  output_variable: format_file (string)
  description: "Accepts the clinical guideline PDF file as input"

Variable mapping:

  • Input: input_file (string) - the PDF file uploaded by the user
  • Output: format_file (string) - path to the formatted file

2. Text Segmentation Node

import re

import PyPDF2
from langchain.text_splitter import RecursiveCharacterTextSplitter


def process_pdf_segmentation(input_file):
    """Core function for splitting the PDF into chunks."""
    # Read the PDF content page by page
    pdf_reader = PyPDF2.PdfReader(input_file)
    text_content = ""
    for page in pdf_reader.pages:
        text_content += page.extract_text() + "\n"

    # Collapse runs of spaces/tabs but keep newlines so the splitter's
    # "\n\n" / "\n" separators stay meaningful
    text_content = re.sub(r'[ \t]+', ' ', text_content)
    text_content = re.sub(r'\n{3,}', '\n\n', text_content)

    # Use a recursive character splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
        separators=["\n\n", "\n", ". ", "。", "!", "?", ";"]
    )

    # Split the text
    chunks = text_splitter.split_text(text_content)

    # Attach metadata to each chunk
    chunk_data = []
    for i, chunk in enumerate(chunks):
        chunk_data.append({
            "chunk_id": f"chunk_{i}",
            "content": chunk,
            "length": len(chunk),
            "has_medical_terms": check_medical_terms(chunk)
        })

    return {
        "text_chunks": chunk_data,
        "total_chunks": len(chunk_data),
        "original_length": len(text_content)
    }


def check_medical_terms(text):
    """Check whether the text contains medical terminology."""
    medical_keywords = ["治疗", "诊断", "药物", "手术", "症状", "检查", "指南", "临床"]
    return any(keyword in text for keyword in medical_keywords)


# Main entry point for this node
def execute_segmentation(format_file):
    return process_pdf_segmentation(format_file)

Variable mapping:

  • Input: format_file (string) - file path from the input node
  • Output:
    • text_chunks (array[object]) - list of text chunks, each containing:
      • chunk_id (string)
      • content (string)
      • length (number)
      • has_medical_terms (boolean)
    • total_chunks (number) - total number of chunks
    • original_length (number) - length of the original text
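For local testing outside Dify, this node's entry function can be exercised directly. A minimal sketch, assuming a local PDF named sample_guideline.pdf (a hypothetical file of ours):

# Local smoke test for the segmentation node
result = execute_segmentation("sample_guideline.pdf")
print(result["total_chunks"], result["original_length"])
for chunk in result["text_chunks"][:3]:
    # Inspect the first few chunks and their metadata
    print(chunk["chunk_id"], chunk["length"], chunk["has_medical_terms"])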

3. Compilation Preprocessing Node

import re

import jieba
import jieba.posseg as pseg


def compile_preprocessing(text_chunks_data):
    """Compile and preprocess the text chunks."""
    compiled_chunks = []
    for chunk in text_chunks_data["text_chunks"]:
        processed_chunk = standardize_text(chunk["content"])
        medical_entities = extract_medical_entities(processed_chunk)
        compiled_chunks.append({
            "chunk_id": chunk["chunk_id"],
            "original_content": chunk["content"],
            "processed_content": processed_chunk,
            "medical_entities": medical_entities,
            "readability_score": calculate_readability(processed_chunk)
        })
    return {
        "compiled_chunks": compiled_chunks,
        "total_medical_entities": sum(len(chunk["medical_entities"]) for chunk in compiled_chunks)
    }


def standardize_text(text):
    """Normalize the text."""
    # Unify punctuation: normalize curly quotes to straight quotes
    text = (text.replace('\u201c', '"').replace('\u201d', '"')
                .replace('\u2018', "'").replace('\u2019', "'"))
    # Strip special characters but keep medical and math symbols
    text = re.sub(r'[^\w\s\u4e00-\u9fff%°±×÷≈≠≤≥→←↑↓↔↕↨∈∉⊂⊃∪∩∞∝√∛∜∫∬∭∮∇∂∆∏∑∟∠∥⊥⊙⌒πτφωαβγδεζηθικλμνξοπρστυφχψΓΔΘΛΞΠΣΦΨΩ]', '', text)
    return text.strip()


def extract_medical_entities(text):
    """Extract medical entities."""
    medical_terms = []
    words = pseg.cut(text)
    medical_pos_tags = ['n', 'vn', 'an']
    for word, pos in words:
        if pos in medical_pos_tags and len(word) > 1:
            if any(med_indicator in word for med_indicator in ['病', '症', '炎', '瘤', '药', '疗', '诊', '检']):
                medical_terms.append({
                    "term": word,
                    "position": text.find(word),  # position of the first occurrence only
                    "type": "medical_entity"
                })
    return medical_terms


def calculate_readability(text):
    """Compute a readability score for the text."""
    words = list(jieba.cut(text))
    sentences = re.split(r'[。!?;]', text)
    sentences = [s for s in sentences if s.strip()]
    if not sentences:
        return 0
    avg_sentence_length = len(words) / len(sentences)
    avg_word_length = sum(len(word) for word in words) / len(words)
    readability = 100 - (avg_sentence_length * 0.5 + avg_word_length * 10)
    return max(0, min(100, readability))


def execute_compilation(text_chunks):
    return compile_preprocessing(text_chunks)

Variable mapping:

  • Input: text_chunks (object) - the full output object from the segmentation node
  • Output:
    • compiled_chunks (array[object]) - list of compiled chunks, each containing:
      • chunk_id (string)
      • original_content (string)
      • processed_content (string)
      • medical_entities (array[object]) - list of medical entities
      • readability_score (number)
    • total_medical_entities (number) - total number of medical entities

4. Knowledge Filtering Node

import jieba

# Vocabulary of terms with no medical relevance
IRRELEVANT_TERMS = {
    "普通", "一般", "常见", "其他", "等等", "例如", "包括", "相关", "各种",
    "不同", "一些", "多个", "少数", "多数", "大部分", "小部分", "通常",
    "往往", "总是", "从不", "很少", "经常", "有时", "可能", "大概",
    "估计", "约", "大约", "左右", "上下", "前后", "之内", "之间"
}


def filter_irrelevant_content(compiled_chunks_data):
    """Filter out irrelevant vocabulary and content."""
    filtered_chunks = []
    for chunk in compiled_chunks_data["compiled_chunks"]:
        filtered_content = apply_content_filtering(chunk["processed_content"])
        relevance_score = calculate_relevance_score(filtered_content)
        if relevance_score > 0.3:  # relevance threshold
            filtered_chunks.append({
                "chunk_id": chunk["chunk_id"],
                "original_content": chunk["original_content"],
                "filtered_content": filtered_content,
                "relevance_score": relevance_score,
                "removed_terms": get_removed_terms(chunk["processed_content"], filtered_content)
            })
    return {
        "filtered_chunks": filtered_chunks,
        "filtered_count": len(filtered_chunks),
        "average_relevance": (
            sum(chunk["relevance_score"] for chunk in filtered_chunks) / len(filtered_chunks)
            if filtered_chunks else 0
        )
    }


def apply_content_filtering(text):
    """Apply the content filter."""
    words = jieba.cut(text)
    filtered_words = [
        word for word in words
        if word not in IRRELEVANT_TERMS and len(word) > 1 and not word.isdigit()
    ]
    return ' '.join(filtered_words)


def calculate_relevance_score(text):
    """Compute a content relevance score."""
    medical_keywords = ["治疗", "诊断", "药物", "剂量", "手术", "症状", "检查", "指标", "指南", "推荐"]
    word_count = len(list(jieba.cut(text)))
    if word_count == 0:
        return 0
    medical_word_count = sum(1 for word in jieba.cut(text) if word in medical_keywords)
    return medical_word_count / word_count


def get_removed_terms(original, filtered):
    """Return the terms removed by the filter."""
    original_terms = set(jieba.cut(original))
    filtered_terms = set(jieba.cut(filtered))
    return list(original_terms - filtered_terms)


def execute_knowledge_filtering(compiled_chunks):
    return filter_irrelevant_content(compiled_chunks)

Variable mapping:

  • Input: compiled_chunks (object) - the full output object from the compilation node
  • Output:
    • filtered_chunks (array[object]) - list of filtered chunks, each containing:
      • chunk_id (string)
      • original_content (string)
      • filtered_content (string)
      • relevance_score (number)
      • removed_terms (array[string]) - list of removed terms
    • filtered_count (number) - number of chunks kept after filtering
    • average_relevance (number) - average relevance score

5. Medical Term Conversion Node

import re

import jieba

# Mapping from professional medical terms to plain-language terms
MEDICAL_TERM_MAPPING = {
    "心肌梗死": "心脏病发作",
    "高血压": "血压高",
    "糖尿病": "血糖病",
    "恶性肿瘤": "癌症",
    "良性肿瘤": "非癌症肿瘤",
    "冠状动脉": "心脏血管",
    "心电图": "心脏检查",
    "CT扫描": "CT检查",
    "MRI": "磁共振检查",
    "化疗": "化学药物治疗",
    "放疗": "放射治疗",
    "靶向治疗": "精准药物治疗",
    "免疫治疗": "免疫系统治疗",
    "预后": "恢复情况",
    "发病率": "得病比例",
    "死亡率": "死亡比例",
    "并发症": "其他健康问题",
    "禁忌症": "不适合使用的情况",
    "适应证": "适合使用的情况"
}


def convert_medical_terms(filtered_chunks_data):
    """Convert professional medical terms into terms the public can understand."""
    converted_chunks = []
    for chunk in filtered_chunks_data["filtered_chunks"]:
        converted_content = apply_term_conversion(chunk["filtered_content"])
        conversion_stats = analyze_conversion_stats(chunk["filtered_content"], converted_content)
        converted_chunks.append({
            "chunk_id": chunk["chunk_id"],
            "original_medical_content": chunk["filtered_content"],
            "converted_public_content": converted_content,
            "conversion_stats": conversion_stats,
            "readability_improvement": calculate_readability_improvement(
                chunk["filtered_content"], converted_content)
        })
    return {
        "converted_chunks": converted_chunks,
        "total_conversions": sum(chunk["conversion_stats"]["converted_count"] for chunk in converted_chunks),
        "average_readability_improvement": (
            sum(chunk["readability_improvement"] for chunk in converted_chunks) / len(converted_chunks)
            if converted_chunks else 0
        )
    }


def apply_term_conversion(text):
    """Apply the term conversion."""
    converted_text = text
    for medical_term, public_term in MEDICAL_TERM_MAPPING.items():
        converted_text = converted_text.replace(medical_term, public_term)
    return converted_text


def analyze_conversion_stats(original, converted):
    """Collect statistics about the conversion."""
    converted_terms = []
    for medical_term, public_term in MEDICAL_TERM_MAPPING.items():
        if medical_term in original and public_term in converted:
            converted_terms.append({
                "medical_term": medical_term,
                "public_term": public_term
            })
    return {
        "converted_count": len(converted_terms),
        "converted_terms": converted_terms
    }


def calculate_readability_improvement(original, converted):
    """Measure the readability improvement."""
    original_score = calculate_readability_simple(original)
    converted_score = calculate_readability_simple(converted)
    return max(0, converted_score - original_score)


def calculate_readability_simple(text):
    """Simplified readability score."""
    sentences = re.split(r'[。!?;]', text)
    words = list(jieba.cut(text))
    if not sentences or not words:
        return 0
    avg_sentence_length = len(words) / len(sentences)
    if avg_sentence_length <= 15:
        return 80
    elif avg_sentence_length <= 25:
        return 60
    else:
        return 40


def execute_term_conversion(filtered_chunks):
    return convert_medical_terms(filtered_chunks)

Variable mapping:

  • Input: filtered_chunks (object) - the full output object from the knowledge filtering node
  • Output:
    • converted_chunks (array[object]) - list of converted chunks, each containing:
      • chunk_id (string)
      • original_medical_content (string)
      • converted_public_content (string)
      • conversion_stats (object) - conversion statistics
        • converted_count (number)
        • converted_terms (array[object])
      • readability_improvement (number)
    • total_conversions (number) - total number of converted terms
    • average_readability_improvement (number) - average readability improvement
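One caveat about the plain str.replace chain in apply_term_conversion: if MEDICAL_TERM_MAPPING ever gains overlapping entries (one term a substring of another), dictionary iteration order decides which replacement wins. A small defensive variant, an assumption of ours rather than part of the original workflow, replaces the longest terms first:

# Defensive variant (not part of the original workflow): replace longer
# terms first so overlapping entries cannot partially rewrite each other.
def apply_term_conversion_longest_first(text):
    for medical_term in sorted(MEDICAL_TERM_MAPPING, key=len, reverse=True):
        text = text.replace(medical_term, MEDICAL_TERM_MAPPING[medical_term])
    return text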

6. LLM Processing Node

from datetime import datetime

import requests


def llm_language_conversion(converted_chunks_data):
    """Use an LLM for the final language-style conversion and polishing."""
    llm_processed_chunks = []
    for chunk in converted_chunks_data["converted_chunks"]:
        prompt = build_public_guide_prompt(chunk["converted_public_content"])
        llm_response = call_llm_api(prompt)
        llm_processed_chunks.append({
            "chunk_id": chunk["chunk_id"],
            "input_content": chunk["converted_public_content"],
            "llm_output": llm_response,
            "processing_metadata": {
                "model_used": "deepseek-V3.1",
                "timestamp": get_current_timestamp(),
                "prompt_version": "v1.0"
            }
        })
    return {
        "llm_processed_chunks": llm_processed_chunks,
        "processing_summary": {
            "total_chunks_processed": len(llm_processed_chunks),
            "success_rate": 1.0,  # hard-coded optimistic value; see the retry sketch in the performance section
            "average_output_length": (
                sum(len(chunk["llm_output"]) for chunk in llm_processed_chunks) / len(llm_processed_chunks)
                if llm_processed_chunks else 0
            )
        }
    }


def build_public_guide_prompt(content):
    """Build the prompt for generating the public guide.

    The prompt stays in Chinese because the pipeline targets
    Chinese-language guidelines.
    """
    prompt_template = """你是一个专业的医学知识传播专家。请将以下临床医学内容转换为普通公众易于理解的健康指南。

原始内容:
{content}

请按照以下要求进行转换:
1. 使用通俗易懂的语言,避免专业术语
2. 保持医学信息的准确性
3. 结构清晰,分点说明
4. 添加实用的建议和注意事项
5. 使用友好的语气,给予读者信心

请输出转换后的公众健康指南:"""
    return prompt_template.format(content=content)


def call_llm_api(prompt):
    """Call the LLM API (example implementation)."""
    api_url = "https://api.deepseek.com/v1/chat/completions"
    headers = {
        "Authorization": "Bearer YOUR_DEEPSEEK_API_KEY",
        "Content-Type": "application/json"
    }
    payload = {
        "model": "deepseek-chat",
        "messages": [
            {"role": "system", "content": "你是一个专业的医学知识传播助手。"},
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.7,
        "max_tokens": 2000
    }
    try:
        response = requests.post(api_url, headers=headers, json=payload, timeout=60)
        # Check the status code before parsing the body
        if response.status_code == 200:
            return response.json()["choices"][0]["message"]["content"]
        return f"API call failed: {response.status_code}"
    except Exception as e:
        return f"LLM processing error: {str(e)}"


def get_current_timestamp():
    """Return the current timestamp in ISO format."""
    return datetime.now().isoformat()


def execute_llm_processing(converted_chunks):
    return llm_language_conversion(converted_chunks)

Variable mapping:

  • Input: converted_chunks (object) - the full output object from the term conversion node
  • Output:
    • llm_processed_chunks (array[object]) - list of LLM-processed chunks, each containing:
      • chunk_id (string)
      • input_content (string)
      • llm_output (string)
      • processing_metadata (object)
        • model_used (string)
        • timestamp (string)
        • prompt_version (string)
    • processing_summary (object) - processing summary
      • total_chunks_processed (number)
      • success_rate (number)
      • average_output_length (number)

7. Content Integration Node

import re
from datetime import datetime

import jieba


# Helpers reused from earlier nodes; Dify code nodes do not share scope,
# so they are redefined here.
def get_current_timestamp():
    """Return the current timestamp in ISO format."""
    return datetime.now().isoformat()


def calculate_readability_simple(text):
    """Simplified readability score (same as in the term conversion node)."""
    sentences = re.split(r'[。!?;]', text)
    words = list(jieba.cut(text))
    if not sentences or not words:
        return 0
    avg_sentence_length = len(words) / len(sentences)
    if avg_sentence_length <= 15:
        return 80
    elif avg_sentence_length <= 25:
        return 60
    return 40


def integrate_final_output(llm_processed_data):
    """Integrate all processing results into the final public guide."""
    integrated_content = integrate_llm_outputs(llm_processed_data["llm_processed_chunks"])
    quality_report = generate_quality_report(llm_processed_data)
    final_output = {
        "public_guide": {
            "title": generate_guide_title(integrated_content),
            "content": integrated_content,
            "sections": extract_guide_sections(integrated_content),
            "summary": generate_executive_summary(integrated_content)
        },
        "processing_metadata": {
            "processing_time": get_current_timestamp(),
            "total_chunks_processed": llm_processed_data["processing_summary"]["total_chunks_processed"],
            "success_rate": llm_processed_data["processing_summary"]["success_rate"]
        },
        "quality_assurance": quality_report
    }
    return final_output


def integrate_llm_outputs(llm_chunks):
    """Merge the LLM outputs into one document."""
    integrated_text = ""
    for chunk in llm_chunks:
        integrated_text += chunk["llm_output"] + "\n\n"
    integrated_text = re.sub(r'\n\s*\n', '\n\n', integrated_text)
    return integrated_text.strip()


def generate_quality_report(processed_data):
    """Generate the quality-assurance report."""
    chunks = processed_data["llm_processed_chunks"]
    total_length = sum(len(chunk["llm_output"]) for chunk in chunks)
    avg_length = total_length / len(chunks) if chunks else 0
    return {
        "completeness_score": calculate_completeness_score(chunks),
        "readability_score": calculate_final_readability(chunks),
        "coherence_score": calculate_coherence_score(chunks),
        "total_output_length": total_length,
        "average_section_length": avg_length,
        "recommendations": generate_quality_recommendations(chunks)
    }


def calculate_completeness_score(chunks):
    """Completeness score based on coverage of essential keywords."""
    total_content = " ".join(chunk["llm_output"] for chunk in chunks)
    essential_keywords = ["建议", "注意", "治疗", "预防", "症状"]
    coverage = sum(1 for keyword in essential_keywords if keyword in total_content)
    return min(100, (coverage / len(essential_keywords)) * 100)


def calculate_final_readability(chunks):
    """Readability score of the combined output."""
    all_content = " ".join(chunk["llm_output"] for chunk in chunks)
    return calculate_readability_simple(all_content)


def calculate_coherence_score(chunks):
    """Coherence score based on transition words."""
    transitions = ["首先", "其次", "另外", "此外", "同时", "因此", "所以", "总之"]
    total_transitions = 0
    for chunk in chunks:
        total_transitions += sum(1 for transition in transitions if transition in chunk["llm_output"])
    return min(100, total_transitions * 10)


def generate_quality_recommendations(chunks):
    """Generate quality-improvement recommendations (messages stay in Chinese
    because they are shown to Chinese-speaking readers)."""
    recommendations = []
    total_content = " ".join(chunk["llm_output"] for chunk in chunks)
    if len(total_content) < 500:
        recommendations.append("内容较短,建议补充更多详细信息")
    if calculate_final_readability(chunks) < 60:
        recommendations.append("可读性有待提高,建议使用更简单的语言")
    if calculate_coherence_score(chunks) < 70:
        recommendations.append("内容连贯性可以进一步优化")
    return recommendations if recommendations else ["内容质量良好,符合公众指南标准"]


def generate_guide_title(content):
    """Generate a title for the guide."""
    first_paragraph = content.split('\n\n')[0] if '\n\n' in content else content
    words = jieba.cut(first_paragraph[:50])
    key_terms = [word for word in words if len(word) > 1]
    return f"公众健康指南:{''.join(key_terms[:3])}..."


def extract_guide_sections(content):
    """Split the guide into sections."""
    sections = content.split('\n\n')
    return [{"section_id": i, "content": section} for i, section in enumerate(sections) if section.strip()]


def generate_executive_summary(content):
    """Generate an executive summary."""
    sentences = re.split(r'[。!?]', content)
    key_sentences = [s for s in sentences if any(keyword in s for keyword in ["建议", "重要", "注意", "避免"])]
    return "。".join(key_sentences[:3]) + "。"


def execute_integration(llm_processed):
    return integrate_final_output(llm_processed)

Variable mapping:

  • Input: llm_processed (object) - the full output object from the LLM processing node
  • Output: final_output (object) - the final output object, containing:
    • public_guide (object) - the public guide
      • title (string)
      • content (string)
      • sections (array[object]) - list of sections
      • summary (string)
    • processing_metadata (object) - processing metadata
      • processing_time (string)
      • total_chunks_processed (number)
      • success_rate (number)
    • quality_assurance (object) - quality-assurance report
      • completeness_score (number)
      • readability_score (number)
      • coherence_score (number)
      • total_output_length (number)
      • average_section_length (number)
      • recommendations (array[string])

8. End Node Configuration

Node ID: end_node
Node type: Output
Parameters:
  output_variables:
    - final_output (object)
  output_format: json
  description: "Outputs the final public guide and processing report"

Complete Variable Mapping Chain

input_file (string)
  ↓
format_file (string)
  ↓
text_chunks (object)
  ↓
compiled_chunks (object)
  ↓
filtered_chunks (object)
  ↓
converted_chunks (object)
  ↓
llm_processed_chunks (object)
  ↓
final_output (object)
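Outside Dify, the whole chain can be smoke-tested by piping each node's entry function into the next. A minimal sketch, assuming each node's code is saved as a local module (the module names and the input file are hypothetical; inside Dify each step runs as a separate code node instead):

# End-to-end local smoke test of the SLOW-mode chain
from segmentation_node import execute_segmentation
from compilation_node import execute_compilation
from filtering_node import execute_knowledge_filtering
from conversion_node import execute_term_conversion
from llm_node import execute_llm_processing
from integration_node import execute_integration

text_chunks = execute_segmentation("sample_guideline.pdf")  # hypothetical input
compiled_chunks = execute_compilation(text_chunks)
filtered_chunks = execute_knowledge_filtering(compiled_chunks)
converted_chunks = execute_term_conversion(filtered_chunks)
llm_processed = execute_llm_processing(converted_chunks)
final_output = execute_integration(llm_processed)
print(final_output["public_guide"]["title"])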

Deployment and Usage Notes

Environment Requirements

  • Python 3.8+
  • The Dify platform
  • Required Python packages: PyPDF2, jieba, langchain, requests

Configuration Steps

  1. Create the workflow in Dify
  2. Configure each node with the code above (see the wrapper sketch after this list)
  3. Set up the variable mappings
  4. Configure the LLM API key
  5. Test the workflow end to end
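Note that Dify's Python code node invokes a function named main and expects a dict return value; assuming that convention, each node's entry point from this post can be wrapped as shown here for the segmentation node:

# Wrapper for a Dify code node (shown for the segmentation node; the other
# nodes follow the same pattern with their own execute_* function)
def main(format_file: str) -> dict:
    return execute_segmentation(format_file)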

Performance Optimization Suggestions

  • For very long documents, process in batches
  • Set appropriate timeouts
  • Add error handling and a retry mechanism (see the sketch below)
  • Monitor API call frequency and cost
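To make the timeout and retry suggestions concrete, here is a minimal sketch of a retry wrapper around the same DeepSeek endpoint used in the LLM node. The backoff schedule, the retryable status codes, and the DEEPSEEK_API_KEY environment variable are assumptions of ours, not part of the original workflow:

import os
import time

import requests


def call_llm_api_with_retry(prompt, max_retries=3, timeout=60):
    """Retry wrapper for the DeepSeek chat endpoint (a sketch, not the
    project's official implementation)."""
    api_url = "https://api.deepseek.com/v1/chat/completions"
    headers = {
        # Read the key from an environment variable (an assumption of ours)
        "Authorization": f"Bearer {os.environ['DEEPSEEK_API_KEY']}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": "deepseek-chat",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 2000,
    }
    for attempt in range(max_retries):
        try:
            response = requests.post(api_url, headers=headers, json=payload, timeout=timeout)
            if response.status_code == 200:
                return response.json()["choices"][0]["message"]["content"]
            if response.status_code not in (429, 500, 502, 503):
                # Client errors are not retryable; fail fast
                raise RuntimeError(f"API call failed: {response.status_code}")
        except requests.RequestException:
            pass  # network error or timeout: fall through to backoff
        time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    raise RuntimeError("LLM call failed after all retries")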