
Parsing and Extracting Complex Text from Web Pages with Crawl4AI + Ollama

news 2025/8/12 10:55:24

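Crawl4AI is an open-source, LLM-friendly web crawler and scraping tool that integrates with Ollama.

With Crawl4AI, parsing and extracting text does not require hand-written parsing rules; a natural-language description of what to extract is enough.

Crawl4AI itself requires no API key and is easy to integrate with Docker and cloud environments.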

https://github.com/unclecode/crawl4ai

1 Installing Crawl4AI

pip install -U crawl4ai

pip install "crawl4ai[ai]"
python -m playwright install --with-deps chromium

crawl4ai-setup
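To confirm that the packages landed in the active Python environment before running anything, a small check like the following can help. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of Crawl4AI.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # crawl4ai and playwright are the two distributions used by the commands above.
    for name in ("crawl4ai", "playwright"):
        version = installed_version(name)
        print(f"{name}: {version if version else 'not installed'}")
```

If either line prints "not installed", the pip commands most likely ran against a different interpreter or virtual environment.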

2 Running a basic example

Set url to the page you want to crawl.

import asyncio
from crawl4ai import AsyncWebCrawler


async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://export.shobserver.com/baijiahao/html/960747.html",
        )
        print(result.markdown)


if __name__ == "__main__":
    asyncio.run(main())
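Printing the Markdown is enough for a first look, but in practice the result is usually checked and saved. The sketch below assumes the crawl result exposes `success` and `error_message` alongside `markdown`; the helper names `filename_for` and `crawl_to_file` and the `output` directory are hypothetical choices made for illustration.

```python
import asyncio
import re
from pathlib import Path
from urllib.parse import urlparse


def filename_for(url: str) -> str:
    """Build a filesystem-safe .md filename from a URL (hypothetical helper)."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}"
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", raw).strip("_")
    return f"{safe or 'page'}.md"


async def crawl_to_file(url: str, out_dir: str = "output") -> Path:
    # Imported here so filename_for can be used without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)

    # Assumption: the result object reports failure via success/error_message.
    if not result.success:
        raise RuntimeError(f"Crawl failed for {url}: {result.error_message}")

    target = Path(out_dir) / filename_for(url)
    target.parent.mkdir(parents=True, exist_ok=True)
    # str() covers versions where markdown is a string-like object.
    target.write_text(str(result.markdown), encoding="utf-8")
    return target


if __name__ == "__main__":
    saved = asyncio.run(
        crawl_to_file("https://export.shobserver.com/baijiahao/html/960747.html")
    )
    print(f"Saved to {saved}")
```

Raising on failure keeps an empty or partial page from being silently written to disk.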

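3 LLM-assisted extraction

Crawl4AI can use an LLM to extract information.

This assumes Ollama is already installed locally; for the setup process, see

Running DeepSeek R1 with Ollama on a Mac M1 (CSDN blog)

Inference on a Mac M1 is slow, so a small 4B model is used.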

ollama pull qwen3:4b
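Before handing the model to Crawl4AI, it can be worth confirming that the Ollama server is running and that the pull succeeded. The sketch below assumes Ollama is listening on its default address, `http://localhost:11434`, and queries its `/api/tags` endpoint, which lists locally available models; the helper names `has_model` and `fetch_tags` are hypothetical.

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default local address


def has_model(tags_payload: dict, model: str) -> bool:
    """Check whether `model` appears in a response from Ollama's /api/tags."""
    names = {entry.get("name") for entry in tags_payload.get("models", [])}
    return model in names


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Ask the local Ollama server which models it has pulled."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.load(response)


if __name__ == "__main__":
    try:
        payload = fetch_tags()
    except OSError as exc:  # URLError is a subclass of OSError
        print(f"Ollama does not seem to be running: {exc}")
    else:
        print("qwen3:4b available:", has_model(payload, "qwen3:4b"))
```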

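The Crawl4AI code below is adapted from material found online.

The task is to extract the API price of each OpenAI model from the official page https://openai.com/api/pricing.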

import os
import asyncio

from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig, BrowserConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy


# Schema describing one extracted record; its JSON schema is passed to the LLM.
class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")


async def extract_structured_data_using_llm(
    provider: str, api_token: str = None, extra_headers: dict[str, str] = None
):
    print(f"\n--- Extracting Structured Data with {provider} ---")

    # if api_token is None and provider != "ollama":
    #     print(f"API token is required for {provider}. Skipping this example.")
    #     return

    # headless=False opens a visible browser window while crawling.
    browser_config = BrowserConfig(headless=False)

    # Generation parameters forwarded to the LLM.
    extra_args = {"temperature": 0, "top_p": 0.9, "max_tokens": 2000}
    if extra_headers:
        extra_args["extra_headers"] = extra_headers

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=1,
        page_timeout=80000,
        extraction_strategy=LLMExtractionStrategy(
            llm_config=LLMConfig(provider=provider, api_token=api_token),
            schema=OpenAIModelFee.model_json_schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
            Do not miss any models in the entire content.""",
            extra_args=extra_args,
        ),
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://openai.com/api/pricing/", config=crawler_config)
        print(result.extracted_content)


if __name__ == "__main__":
    # A local Ollama model needs no API token; os.getenv("") evaluates to None.
    asyncio.run(
        extract_structured_data_using_llm(
            provider="ollama/qwen3:4b", api_token=os.getenv("")
        )
    )

Extraction result

[
{
"model_name": "gpt-4o",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "gpt-4o-mini",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "gpt-4.1",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "gpt-4.1-mini",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "gpt-4.1-nano",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "o3",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "o3-deep-research",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "o3-pro-2025-06-10",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "o4-mini",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "o4-mini-deep-research",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "o1",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "o1-pro",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "codex-mini-latest",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "computer-use-preview",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "gpt-4.5-preview",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "gpt-image-1",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "gpt-4o-2024-05-13",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "gpt-4o-mini",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "gpt-4.5-preview",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "gpt-image-1",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "codex-mini-latest",
"input_fee": "$0.000263",
"output_fee": "$1.25"
},
{
"model_name": "computer-use-preview",
"input_fee": "$0.000263",
"output_fee": "$1.25"
}
]
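In the output above, all 22 entries carry the same input and output fee, and five model names appear twice. That suggests the 4B model did not read the individual prices reliably, so the result is worth checking before it is used. The following is a minimal sketch of such a check on the JSON string that the extraction prints; the helper names `summarize` and `deduplicate` are hypothetical, and the sample rows are copied from the output above.

```python
import json
from collections import Counter


def summarize(extracted_content: str) -> dict:
    """Report duplicate model names and how many distinct fee pairs appear."""
    rows = json.loads(extracted_content)
    name_counts = Counter(row["model_name"] for row in rows)
    fee_pairs = {(row["input_fee"], row["output_fee"]) for row in rows}
    return {
        "rows": len(rows),
        "duplicates": sorted(name for name, n in name_counts.items() if n > 1),
        "distinct_fee_pairs": len(fee_pairs),
    }


def deduplicate(rows: list) -> list:
    """Keep the first row seen for each model name, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row["model_name"] not in seen:
            seen.add(row["model_name"])
            unique.append(row)
    return unique


if __name__ == "__main__":
    sample = json.dumps([
        {"model_name": "gpt-4o", "input_fee": "$0.000263", "output_fee": "$1.25"},
        {"model_name": "gpt-4o-mini", "input_fee": "$0.000263", "output_fee": "$1.25"},
        {"model_name": "gpt-4o-mini", "input_fee": "$0.000263", "output_fee": "$1.25"},
    ])
    report = summarize(sample)
    print(report)
    # A single fee pair shared by many models is a sign the extraction went wrong.
    if report["rows"] > 1 and report["distinct_fee_pairs"] == 1:
        print("Warning: every row has identical fees; verify against the page.")
    print(deduplicate(json.loads(sample)))
```

Deduplication only removes repeated names; it does not repair the fee values, which still need to be compared against the pricing page or re-extracted with a larger model.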