
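Parsing and Extracting Complex Text from Web Pages with Crawl4AI + Ollama

Crawl4AI is an open-source, LLM-friendly web crawler and scraping tool that works smoothly with Ollama.

With Crawl4AI, parsing and extracting text does not require hand-written parsing code; a plain-text description of what to extract is enough.

Crawl4AI itself needs no API key and integrates easily with Docker and cloud environments.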

https://github.com/unclecode/crawl4ai

1 Installing Crawl4AI

pip install -U crawl4ai

pip install "crawl4ai[ai]"
python -m playwright install --with-deps chromium

crawl4ai-setup

2 Running an example

Set url to the page you want to crawl.

import asyncio
from crawl4ai import *


async def main():
    # Open a crawler session and fetch a single page
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://export.shobserver.com/baijiahao/html/960747.html",
        )
        # Print the page content converted to Markdown
        print(result.markdown)


if __name__ == "__main__":
    asyncio.run(main())

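3 LLM-assisted extraction

Crawl4AI can use an LLM to extract information.

This assumes Ollama is already installed locally. For the installation steps, see the CSDN blog post:

Running DeepSeek R1 with Ollama on a Mac M1

Inference on a Mac M1 is slow, so a small 4B model is used here.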

ollama pull qwen3:4b
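
Before running the crawler, it can help to confirm that the model was actually pulled. The sketch below is not from the original post: it assumes Ollama is serving on its default address http://localhost:11434 and uses the `/api/tags` endpoint, which lists locally available models. The helper names `model_available` and `fetch_tags` are made up for illustration.

```python
import json
import urllib.request


def model_available(tags_payload, model_name):
    """Return True if model_name appears in an Ollama /api/tags payload."""
    models = tags_payload.get("models", [])
    return any(m.get("name") == model_name for m in models)


def fetch_tags(base_url="http://localhost:11434"):
    """Fetch the list of local models (assumes the default Ollama address)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        payload = fetch_tags()
    except OSError as exc:
        # Connection refused, timeout, and similar network errors land here
        print(f"Ollama not reachable: {exc}")
    else:
        print("qwen3:4b available:", model_available(payload, "qwen3:4b"))
```

If the script reports that the model is missing, rerun the `ollama pull` command above.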

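The Crawl4AI code below is adapted from material found online.

The task is to extract the API prices of each OpenAI model from the official page https://openai.com/api/pricing.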

import asyncio

from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig, BrowserConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy


# Schema describing one extracted record
class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")


async def extract_structured_data_using_llm(
    provider: str, api_token: str = None, extra_headers: dict[str, str] = None
):
    print(f"\n--- Extracting Structured Data with {provider} ---")

    # if api_token is None and provider != "ollama":
    #     print(f"API token is required for {provider}. Skipping this example.")
    #     return

    # headless=False opens a visible browser window
    browser_config = BrowserConfig(headless=False)

    # Generation parameters passed through to the LLM
    extra_args = {"temperature": 0, "top_p": 0.9, "max_tokens": 2000}
    if extra_headers:
        extra_args["extra_headers"] = extra_headers

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=1,
        page_timeout=80000,
        extraction_strategy=LLMExtractionStrategy(
            llm_config=LLMConfig(provider=provider, api_token=api_token),
            schema=OpenAIModelFee.model_json_schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
            Do not miss any models in the entire content.""",
            extra_args=extra_args,
        ),
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://openai.com/api/pricing/", config=crawler_config)
        print(result.extracted_content)


if __name__ == "__main__":
    # A local Ollama model needs no API token, so None is passed explicitly
    asyncio.run(
        extract_structured_data_using_llm(provider="ollama/qwen3:4b", api_token=None)
    )

Extraction result

[
    {
        "model_name": "gpt-4o",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "gpt-4o-mini",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "gpt-4.1",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "gpt-4.1-mini",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "gpt-4.1-nano",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "o3",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "o3-deep-research",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "o3-pro-2025-06-10",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "o4-mini",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "o4-mini-deep-research",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "o1",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "o1-pro",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "codex-mini-latest",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "computer-use-preview",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "gpt-4.5-preview",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "gpt-image-1",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "gpt-4o-2024-05-13",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "gpt-4o-mini",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "gpt-4.5-preview",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "gpt-image-1",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "codex-mini-latest",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    },
    {
        "model_name": "computer-use-preview",
        "input_fee": "$0.000263",
        "output_fee": "$1.25"
    }
]
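
The output above lists some model names more than once and gives every model the same fee pair, so it is worth checking the result before relying on it. The following is a minimal post-processing sketch, not part of the original post: it assumes the records have the `model_name`, `input_fee` and `output_fee` keys shown above, and the helper names `dedupe_by_model` and `looks_degenerate` are hypothetical.

```python
import json


def dedupe_by_model(records):
    """Keep the first record seen for each model_name, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        name = rec.get("model_name")
        if name in seen:
            continue
        seen.add(name)
        unique.append(rec)
    return unique


def looks_degenerate(records):
    """Return True when several records all share one (input_fee, output_fee) pair."""
    pairs = {(r.get("input_fee"), r.get("output_fee")) for r in records}
    return len(records) > 1 and len(pairs) == 1


# A shortened copy of the extraction result shown above
sample = json.loads("""[
    {"model_name": "gpt-4o", "input_fee": "$0.000263", "output_fee": "$1.25"},
    {"model_name": "gpt-4o-mini", "input_fee": "$0.000263", "output_fee": "$1.25"},
    {"model_name": "o3", "input_fee": "$0.000263", "output_fee": "$1.25"},
    {"model_name": "gpt-4o-mini", "input_fee": "$0.000263", "output_fee": "$1.25"}
]""")

unique = dedupe_by_model(sample)
print([r["model_name"] for r in unique])  # ['gpt-4o', 'gpt-4o-mini', 'o3']
print(looks_degenerate(unique))           # True
```

In practice the same helpers could be applied to `json.loads(result.extracted_content)`. A `True` from `looks_degenerate` is only a heuristic hint that the extracted prices should be compared against the source page by hand.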

References

---

Browser, Crawler & LLM Configuration (Quick Overview)

https://docs.crawl4ai.com/core/browser-crawler-config/

crawl4ai

https://github.com/unclecode/crawl4ai

Configuration guide for using Ollama as the LLM backend in Crawl4AI (GitCode blog)

How to build an AI crawler with Crawl4AI and DeepSeek

https://www.bright.cn/blog/web-data/crawl4ai-and-deepseek-web-scraping

Getting started with Crawl4AI

https://geekdaxue.co/read/Crawl4AI/quickstart

Using crawl4ai + an LLM to fetch the web content you want with natural language

https://zhuanlan.zhihu.com/p/1914099510856119201
