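Parsing and Extracting Complex Text from Web Pages with Crawl4AI + Ollama
Crawl4AI is an open-source, LLM-friendly web crawler and scraper that works smoothly with Ollama.
With Crawl4AI, parsing and extracting text does not require hand-written parsing rules; a natural-language description of what to extract is enough.
Crawl4AI itself requires no API key and integrates easily with Docker and cloud environments.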
https://github.com/unclecode/crawl4ai
1 Installing Crawl4AI
pip install -U crawl4ai
pip install "crawl4ai[ai]"
python -m playwright install --with-deps chromium
crawl4ai-setup
2 Running an example
Set url to the page you want to crawl.
import asyncio
from crawl4ai import *


async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://export.shobserver.com/baijiahao/html/960747.html",
        )
        print(result.markdown)


if __name__ == "__main__":
    asyncio.run(main())
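A crawl can fail (timeout, blocked page), and printing result.markdown alone does not show why. The following is a minimal sketch of a guard around the example above, assuming the result object exposes success, markdown, and error_message as Crawl4AI's CrawlResult does. The names describe_result and crawl are hypothetical helpers, and crawl4ai is imported lazily so that the helper can be tried offline with stand-in objects.

```python
import asyncio
from types import SimpleNamespace


def describe_result(result, max_chars: int = 500) -> str:
    """Return a Markdown preview on success, or a short error summary on failure."""
    if not getattr(result, "success", False):
        return f"crawl failed: {getattr(result, 'error_message', 'unknown error')}"
    text = str(result.markdown or "")
    return text[:max_chars]


async def crawl(url: str) -> str:
    # Imported here so describe_result can be used without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
    return describe_result(result)


# Offline check with stand-in objects (these are not real crawl results).
ok = SimpleNamespace(success=True, markdown="# Title\nbody", error_message=None)
bad = SimpleNamespace(success=False, markdown=None, error_message="timeout")
print(describe_result(ok))
print(describe_result(bad))
```

To run it against a real page, call asyncio.run(crawl(url)) with the URL used above.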
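3 LLM-assisted extraction
Crawl4AI can use an LLM to extract information.
This assumes Ollama is already installed locally; for the setup, see
"Running DeepSeek R1 with Ollama on a Mac M1" (CSDN blog)
Inference on a Mac M1 is slow, so a small 4B model is used.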
ollama pull qwen3:4b
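Before running the extraction it can help to confirm that the Ollama server is up and the model was actually pulled. The sketch below queries Ollama's GET /api/tags endpoint, which lists local models. It assumes the server listens on the default address http://localhost:11434; the function names has_model and ollama_has_model are hypothetical.

```python
import json
import urllib.request


def has_model(tags_payload: dict, wanted: str) -> bool:
    """Check whether `wanted` appears in the payload returned by GET /api/tags."""
    names = {m.get("name") for m in tags_payload.get("models", [])}
    return wanted in names


def ollama_has_model(wanted: str, base_url: str = "http://localhost:11434") -> bool:
    # Assumes the default Ollama address; returns False if the server
    # is unreachable or the response is not valid JSON.
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            return has_model(json.load(resp), wanted)
    except (OSError, ValueError):
        return False


# Offline check against a hand-written payload shaped like the API response.
sample_payload = {"models": [{"name": "qwen3:4b"}]}
print(has_model(sample_payload, "qwen3:4b"))
```

With the server running, ollama_has_model("qwen3:4b") performs the real check.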
The Crawl4AI code below is adapted from online sources.
The task is to extract the API prices of each OpenAI model from the official page https://openai.com/api/pricing.
import asyncio
from typing import Optional

from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig, BrowserConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy


# Schema describing one extracted record; its JSON schema is passed to the LLM.
class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")


async def extract_structured_data_using_llm(
    provider: str,
    api_token: Optional[str] = None,
    extra_headers: Optional[dict[str, str]] = None,
):
    print(f"\n--- Extracting Structured Data with {provider} ---")

    # if api_token is None and provider != "ollama":
    #     print(f"API token is required for {provider}. Skipping this example.")
    #     return

    # headless=False opens a visible browser window.
    browser_config = BrowserConfig(headless=False)

    # Generation parameters forwarded to the LLM.
    extra_args = {"temperature": 0, "top_p": 0.9, "max_tokens": 2000}
    if extra_headers:
        extra_args["extra_headers"] = extra_headers

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=1,
        page_timeout=80000,
        extraction_strategy=LLMExtractionStrategy(
            llm_config=LLMConfig(provider=provider, api_token=api_token),
            schema=OpenAIModelFee.model_json_schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content.""",
            extra_args=extra_args,
        ),
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://openai.com/api/pricing/", config=crawler_config)
        print(result.extracted_content)


if __name__ == "__main__":
    # A local Ollama model needs no API token. The original call passed
    # os.getenv(""), which evaluates to None.
    asyncio.run(extract_structured_data_using_llm(provider="ollama/qwen3:4b", api_token=None))
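Extraction result. Every entry carries the same fees and several model names appear twice, so the output of the 4B model should not be read as accurate pricing.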
[
{
"model_name": "gpt-4o",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "gpt-4o-mini",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "gpt-4.1",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "gpt-4.1-mini",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "gpt-4.1-nano",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "o3",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "o3-deep-research",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "o3-pro-2025-06-10",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "o4-mini",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "o4-mini-deep-research",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "o1",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "o1-pro",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "codex-mini-latest",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "computer-use-preview",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "gpt-4.5-preview",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "gpt-image-1",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "gpt-4o-2024-05-13",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "gpt-4o-mini",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "gpt-4.5-preview",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "gpt-image-1",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "codex-mini-latest",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "computer-use-preview",
"input_fee": "$0.000263",
"output_fee": "$1.25"
}
]
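Output from a small model is worth a quick sanity check before it is used. The sketch below parses the JSON string printed as result.extracted_content, counts duplicate model names, and flags the case where every row has the same fee pair. The function name check_extraction and the report keys are hypothetical, and the sample rows are copied from the result above.

```python
import json
from collections import Counter


def check_extraction(extracted_content: str) -> dict:
    """Summarize duplicates and suspiciously uniform fees in the extracted rows."""
    rows = json.loads(extracted_content)
    names = [r["model_name"] for r in rows]
    duplicates = sorted(n for n, c in Counter(names).items() if c > 1)
    fee_pairs = {(r["input_fee"], r["output_fee"]) for r in rows}
    return {
        "rows": len(rows),
        "unique_models": len(set(names)),
        "duplicate_models": duplicates,
        # Identical fees on every row suggest the model copied one value around.
        "all_fees_identical": len(rows) > 1 and len(fee_pairs) == 1,
    }


# Three rows taken from the result above, serialized the way the crawler prints them.
sample = json.dumps([
    {"model_name": "gpt-4o", "input_fee": "$0.000263", "output_fee": "$1.25"},
    {"model_name": "gpt-4o-mini", "input_fee": "$0.000263", "output_fee": "$1.25"},
    {"model_name": "gpt-4o-mini", "input_fee": "$0.000263", "output_fee": "$1.25"},
])
report = check_extraction(sample)
print(report)
```

Applied to the full result above, the same check covers 22 rows with 17 distinct model names. A larger model or a narrower instruction may be needed to get usable prices.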
References
---
Browser, Crawler & LLM Configuration (Quick Overview)
https://docs.crawl4ai.com/core/browser-crawler-config/
Crawl4AI
https://github.com/unclecode/crawl4ai
Configuration guide for using Ollama as the LLM backend in Crawl4AI (GitCode blog)
How to build an AI crawler with Crawl4AI and DeepSeek
https://www.bright.cn/blog/web-data/crawl4ai-and-deepseek-web-scraping
Getting started with Crawl4AI
https://geekdaxue.co/read/Crawl4AI/quickstart
Using crawl4ai + LLM to fetch the web content you want with natural language
https://zhuanlan.zhihu.com/p/1914099510856119201