
Using RecursiveUrlLoader in LangChain

RecursiveUrlLoader: a first example

from langchain_community.document_loaders import RecursiveUrlLoader

loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
)
# Load synchronously
docs = loader.load()

# Inspect the metadata of the first document
print(docs[0].metadata)
d:\soft\anaconda\envs\chat_chain\Lib\site-packages\langchain_community\document_loaders\recursive_url_loader.py:43: XMLParsedAsHTMLWarning: It looks like you're using an HTML parser to parse an XML document.

Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.

If you want or need to use an HTML parser on this document, you can make this warning go away by filtering it. To do that, run this code before calling the BeautifulSoup constructor:

    from bs4 import XMLParsedAsHTMLWarning
    import warnings

    warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)

  soup = BeautifulSoup(raw_html, "html.parser")


{'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html', 'title': '3.9.21 Documentation', 'language': None}
print(docs[0].page_content)
<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta charset="utf-8" /><title>3.9.21 Documentation</title><meta name="viewport" content="width=device-width, initial-scale=1.0">
    
    <link rel="stylesheet" href="_static/pydoctheme.css" type="text/css" />
    <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
    
    <script id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
    <script src="_static/jquery.js"></script>
    <script src="_static/underscore.js"></script>
    <script src="_static/doctools.js"></script>
    <script src="_static/language_data.js"></script>
    
    <script src="_static/sidebar.js"></script>
    
    <link rel="search" type="application/opensearchdescription+xml"
          title="Search within Python 3.9.21 documentation"
          href="_static/opensearch.xml"/>
    <link rel="author" title="About these documents" href="about.html" />
    <link rel="index" title="Index" href="genindex.html" />
    <link rel="search" title="Search" href="search.html" />
    <link rel="copyright" title="Copyright" href="copyright.html" />
    <link rel="canonical" href="https://docs.python.org/3/index.html" />
    
      
    

    
    <style>
      @media only screen {
        table.full-width-table {
            width: 100%;
        }
      }
    </style>
<link rel="shortcut icon" type="image/png" href="_static/py.svg" />
            <script type="text/javascript" src="_static/copybutton.js"></script>
            <script type="text/javascript" src="_static/menu.js"></script> 

  </head>
<body>
<div class="mobile-nav">
    <input type="checkbox" id="menuToggler" class="toggler__input" aria-controls="navigation"
           aria-pressed="false" aria-expanded="false" role="button" aria-label="Menu" />
    <label for="menuToggler" class="toggler__label">
        <span></span>
    </label>
    <nav class="nav-content" role="navigation">
         <a href="https://www.python.org/" class="nav-logo">
             <img src="_static/py.svg" alt="Logo"/>
         </a>
        <div class="version_switcher_placeholder"></div>
        <form role="search" class="search" action="search.html" method="get">
            <svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" class="search-icon">
                <path fill-rule="nonzero"
                        d="M15.5 14h-.79l-.28-.27a6.5 6.5 0 001.48-5.34c-.47-2.78-2.79-5-5.59-5.34a6.505 6.505 0 00-7.27 7.27c.34 2.8 2.56 5.12 5.34 5.59a6.5 6.5 0 005.34-1.48l.27.28v.79l4.25 4.25c.41.41 1.08.41 1.49 0 .41-.41.41-1.08 0-1.49L15.5 14zm-6 0C7.01 14 5 11.99 5 9.5S7.01 5 9.5 5 14 7.01 14 9.5 11.99 14 9.5 14z" fill="#444"></path>
            </svg>
            <input type="text" name="q" aria-label="Quick search"/>
            <input type="submit" value="Go"/>
        </form>
    </nav>
    <div class="menu-wrapper">
        <nav class="menu" role="navigation" aria-label="main navigation">
            <div class="language_switcher_placeholder"></div>


<h3>Download</h3>
<p><a href="download.html">Download these documents</a></p>


<h3>Docs by version</h3>
<ul>
  
  <li><a href="https://docs.python.org/3.14/">Python 3.14 (in development)</a></li>
  
  <li><a href="https://docs.python.org/3.13/">Python 3.13 (stable)</a></li>
  
  <li><a href="https://docs.python.org/3.12/">Python 3.12 (stable)</a></li>
  
  <li><a href="https://docs.python.org/3.11/">Python 3.11 (security-fixes)</a></li>
  
  <li><a href="https://docs.python.org/3.10/">Python 3.10 (security-fixes)</a></li>
  
  <li><a href="https://docs.python.org/3.9/">Python 3.9 (security-fixes)</a></li>
  
  <li><a href="https://docs.python.org/3.8/">Python 3.8 (EOL)</a></li>
  
  <li><a href="https://docs.python.org/3.7/">Python 3.7 (EOL)</a></li>
  
  <li><a href="https://docs.python.org/3.6/">Python 3.6 (EOL)</a></li>
  
  <li><a href="https://docs.python.org/3.5/">Python 3.5 (EOL)</a></li>
  
  <li><a href="https://docs.python.org/3.4/">Python 3.4 (EOL)</a></li>
  
  <li><a href="https://docs.python.org/3.3/">Python 3.3 (EOL)</a></li>
  
  <li><a href="https://docs.python.org/3.2/">Python 3.2 (EOL)</a></li>
  
  <li><a href="https://docs.python.org/3.1/">Python 3.1 (EOL)</a></li>
  
  <li><a href="https://docs.python.org/3.0/">Python 3.0 (EOL)</a></li>
  
  <li><a href="https://docs.python.org/2.7/">Python 2.7 (EOL)</a></li>
  
  <li><a href="https://docs.python.org/2.6/">Python 2.6 (EOL)</a></li>
  
  <li><a href="https://www.python.org/doc/versions/">All versions</a></li>
</ul>


<h3>Other resources</h3>
<ul>
  
  <li><a href="https://peps.python.org">PEP Index</a></li>
  <li><a href="https://wiki.python.org/moin/BeginnersGuide">Beginner's Guide</a></li>
  <li><a href="https://wiki.python.org/moin/PythonBooks">Book List</a></li>
  <li><a href="https://www.python.org/doc/av/">Audio/Visual Talks</a></li>
  <li><a href="https://devguide.python.org/">Python Developer’s Guide</a></li>
</ul>
        </nav>
    </div>
</div>

  
    <div class="related" role="navigation" aria-label="related navigation">
      <h3>Navigation</h3>
      <ul>
        <li class="right" style="margin-right: 10px">
          <a href="genindex.html" title="General Index"
             accesskey="I">index</a></li>
        <li class="right" >
          <a href="py-modindex.html" title="Python Module Index"
             >modules</a> |</li>

          <li><img src="_static/py.svg" alt="python logo" style="vertical-align: middle; margin-top: -1px"/></li>
          <li><a href="https://www.python.org/">Python</a> &#187;</li>
          <li class="switchers">
            <div class="language_switcher_placeholder"></div>
            <div class="version_switcher_placeholder"></div>
          </li>
          <li>
              
          </li>
    <li id="cpython-language-and-version">
      <a href="#">3.9.21 Documentation</a> &#187;
    </li>

                <li class="right">
                    

    <div class="inline-search" role="search">
        <form class="inline-search" action="search.html" method="get">
          <input placeholder="Quick search" aria-label="Quick search" type="text" name="q" />
          <input type="submit" value="Go" />
          <input type="hidden" name="check_keywords" value="yes" />
          <input type="hidden" name="area" value="default" />
        </form>
    </div>
                     |
                </li>
            
      </ul>
    </div>    

    <div class="document">
      <div class="documentwrapper">
        <div class="bodywrapper">
          <div class="body" role="main">
            
  <h1>Python 3.9.21 documentation</h1>
  <p>
  Welcome! This is the official documentation for Python 3.9.21.
  </p>
  <p><strong>Parts of the documentation:</strong></p>
  <table class="contentstable" align="center"><tr>
    <td width="50%">
      <p class="biglink"><a class="biglink" href="whatsnew/3.9.html">What's new in Python 3.9?</a><br/>
        <span class="linkdescr"> or <a href="whatsnew/index.html">all "What's new" documents</a> since 2.0</span></p>
      <p class="biglink"><a class="biglink" href="tutorial/index.html">Tutorial</a><br/>
         <span class="linkdescr">start here</span></p>
      <p class="biglink"><a class="biglink" href="library/index.html">Library Reference</a><br/>
         <span class="linkdescr">keep this under your pillow</span></p>
      <p class="biglink"><a class="biglink" href="reference/index.html">Language Reference</a><br/>
         <span class="linkdescr">describes syntax and language elements</span></p>
      <p class="biglink"><a class="biglink" href="using/index.html">Python Setup and Usage</a><br/>
         <span class="linkdescr">how to use Python on different platforms</span></p>
      <p class="biglink"><a class="biglink" href="howto/index.html">Python HOWTOs</a><br/>
         <span class="linkdescr">in-depth documents on specific topics</span></p>
    </td><td width="50%">
      <p class="biglink"><a class="biglink" href="installing/index.html">Installing Python Modules</a><br/>
         <span class="linkdescr">installing from the Python Package Index &amp; other sources</span></p>
      <p class="biglink"><a class="biglink" href="distributing/index.html">Distributing Python Modules</a><br/>
         <span class="linkdescr">publishing modules for installation by others</span></p>
      <p class="biglink"><a class="biglink" href="extending/index.html">Extending and Embedding</a><br/>
         <span class="linkdescr">tutorial for C/C++ programmers</span></p>
      <p class="biglink"><a class="biglink" href="c-api/index.html">Python/C API</a><br/>
         <span class="linkdescr">reference for C/C++ programmers</span></p>
      <p class="biglink"><a class="biglink" href="faq/index.html">FAQs</a><br/>
         <span class="linkdescr">frequently asked questions (with answers!)</span></p>
    </td></tr>
  </table>

  <p><strong>Indices and tables:</strong></p>
  <table class="contentstable" align="center"><tr>
    <td width="50%">
      <p class="biglink"><a class="biglink" href="py-modindex.html">Global Module Index</a><br/>
         <span class="linkdescr">quick access to all modules</span></p>
      <p class="biglink"><a class="biglink" href="genindex.html">General Index</a><br/>
         <span class="linkdescr">all functions, classes, terms</span></p>
      <p class="biglink"><a class="biglink" href="glossary.html">Glossary</a><br/>
         <span class="linkdescr">the most important terms explained</span></p>
    </td><td width="50%">
      <p class="biglink"><a class="biglink" href="search.html">Search page</a><br/>
         <span class="linkdescr">search this documentation</span></p>
      <p class="biglink"><a class="biglink" href="contents.html">Complete Table of Contents</a><br/>
         <span class="linkdescr">lists all sections and subsections</span></p>
    </td></tr>
  </table>

  <p><strong>Meta information:</strong></p>
  <table class="contentstable" align="center"><tr>
    <td width="50%">
      <p class="biglink"><a class="biglink" href="bugs.html">Reporting bugs</a></p>
      <p class="biglink"><a class="biglink" href="https://devguide.python.org/docquality/#helping-with-documentation">Contributing to Docs</a></p>
      <p class="biglink"><a class="biglink" href="about.html">About the documentation</a></p>
    </td><td width="50%">
      <p class="biglink"><a class="biglink" href="license.html">History and License of Python</a></p>
      <p class="biglink"><a class="biglink" href="copyright.html">Copyright</a></p>
    </td></tr>
  </table>

          </div>
        </div>
      </div>
      <div class="sphinxsidebar" role="navigation" aria-label="main navigation">
        <div class="sphinxsidebarwrapper">


<h3>Download</h3>
<p><a href="download.html">Download these documents</a></p>


<h3>Docs by version</h3>
<ul>
  
  <li><a href="https://docs.python.org/3.14/">Python 3.14 (in development)</a></li>
  
  <li><a href="https://docs.python.org/3.13/">Python 3.13 (stable)</a></li>
  
  <li><a href="https://docs.python.org/3.12/">Python 3.12 (stable)</a></li>
  
  <li><a href="https://docs.python.org/3.11/">Python 3.11 (security-fixes)</a></li>
  
  <li><a href="https://docs.python.org/3.10/">Python 3.10 (security-fixes)</a></li>
  
  <li><a href="https://docs.python.org/3.9/">Python 3.9 (security-fixes)</a></li>
  
  <li><a href="https://docs.python.org/3.8/">Python 3.8 (EOL)</a></li>
  
  <li><a href="https://docs.python.org/3.7/">Python 3.7 (EOL)</a></li>
  
  <li><a href="https://docs.python.org/3.6/">Python 3.6 (EOL)</a></li>
  
  <li><a href="https://docs.python.org/3.5/">Python 3.5 (EOL)</a></li>
  
  <li><a href="https://docs.python.org/3.4/">Python 3.4 (EOL)</a></li>
  
  <li><a href="https://docs.python.org/3.3/">Python 3.3 (EOL)</a></li>
  
  <li><a href="https://docs.python.org/3.2/">Python 3.2 (EOL)</a></li>
  
  <li><a href="https://docs.python.org/3.1/">Python 3.1 (EOL)</a></li>
  
  <li><a href="https://docs.python.org/3.0/">Python 3.0 (EOL)</a></li>
  
  <li><a href="https://docs.python.org/2.7/">Python 2.7 (EOL)</a></li>
  
  <li><a href="https://docs.python.org/2.6/">Python 2.6 (EOL)</a></li>
  
  <li><a href="https://www.python.org/doc/versions/">All versions</a></li>
</ul>


<h3>Other resources</h3>
<ul>
  
  <li><a href="https://peps.python.org">PEP Index</a></li>
  <li><a href="https://wiki.python.org/moin/BeginnersGuide">Beginner's Guide</a></li>
  <li><a href="https://wiki.python.org/moin/PythonBooks">Book List</a></li>
  <li><a href="https://www.python.org/doc/av/">Audio/Visual Talks</a></li>
  <li><a href="https://devguide.python.org/">Python Developer’s Guide</a></li>
</ul>
        </div>
      </div>
      <div class="clearer"></div>
    </div>  
    <div class="related" role="navigation" aria-label="related navigation">
      <h3>Navigation</h3>
      <ul>
        <li class="right" style="margin-right: 10px">
          <a href="genindex.html" title="General Index"
             >index</a></li>
        <li class="right" >
          <a href="py-modindex.html" title="Python Module Index"
             >modules</a> |</li>

          <li><img src="_static/py.svg" alt="python logo" style="vertical-align: middle; margin-top: -1px"/></li>
          <li><a href="https://www.python.org/">Python</a> &#187;</li>
          <li class="switchers">
            <div class="language_switcher_placeholder"></div>
            <div class="version_switcher_placeholder"></div>
          </li>
          <li>
              
          </li>
    <li id="cpython-language-and-version">
      <a href="#">3.9.21 Documentation</a> &#187;
    </li>

                <li class="right">
                    

    <div class="inline-search" role="search">
        <form class="inline-search" action="search.html" method="get">
          <input placeholder="Quick search" aria-label="Quick search" type="text" name="q" />
          <input type="submit" value="Go" />
          <input type="hidden" name="check_keywords" value="yes" />
          <input type="hidden" name="area" value="default" />
        </form>
    </div>
                     |
                </li>
            
      </ul>
    </div>  
    <div class="footer">
    &copy; <a href="copyright.html">Copyright</a> 2001-2024, Python Software Foundation.
    <br />
    This page is licensed under the Python Software Foundation License Version 2.
    <br />
    Examples, recipes, and other code in the documentation are additionally licensed under the Zero Clause BSD License.
    <br />
    See <a href="/license.html">History and License</a> for more information.<br />
    <br />

    The Python Software Foundation is a non-profit corporation.
<a href="https://www.python.org/psf/donations/">Please donate.</a>
<br />
    <br />

    Last updated on Dec 08, 2024.
    <a href="/bugs.html">Found a bug</a>?
    <br />

    Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 2.4.4.
    </div>

    <script type="text/javascript" src="_static/switchers.js"></script>
  </body>
</html>
print(len(docs))
24
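
The run above loads 24 documents but inspects only the first one. As a minimal sketch, the helper below lists the source, title, and content length of every document. `summarize` and `StubDocument` are hypothetical names introduced here for illustration; the stub only mimics the `page_content` and `metadata` attributes used above, so the snippet runs without network access. In a real session, `summarize(docs)` would presumably work on the list returned by `loader.load()`.

```python
from dataclasses import dataclass, field


@dataclass
class StubDocument:
    """Hypothetical stand-in exposing the two attributes used above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize(docs):
    """Return (source, title, content length) for each document."""
    return [
        (doc.metadata.get("source"), doc.metadata.get("title"), len(doc.page_content))
        for doc in docs
    ]


# Illustrative data only; the second URL is made up.
sample = [
    StubDocument(
        "<html>...</html>",
        {"source": "https://docs.python.org/3.9/", "title": "3.9.21 Documentation"},
    ),
    StubDocument("", {"source": "https://example.invalid/page"}),
]

for source, title, size in summarize(sample):
    print(source, title, size)
```

Using `dict.get` keeps the helper from raising if a page has no `title` entry in its metadata.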

Using a custom extractor with RecursiveUrlLoader

import re
from bs4 import BeautifulSoup

def bs4_extractor(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    return re.sub(r"\n\n+", "\n\n", soup.text).strip()

loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    extractor=bs4_extractor,
)
# Load synchronously
docs = loader.load()

print(len(docs))
# Inspect the metadata of the first document
print(docs[0].metadata)
C:\Windows\Temp\ipykernel_12952\1217732938.py:5: XMLParsedAsHTMLWarning: It looks like you're using an HTML parser to parse an XML document.

Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.

If you want or need to use an HTML parser on this document, you can make this warning go away by filtering it. To do that, run this code before calling the BeautifulSoup constructor:

    from bs4 import XMLParsedAsHTMLWarning
    import warnings

    warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)

  soup = BeautifulSoup(html, "html.parser")


24
{'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html', 'title': '3.9.21 Documentation', 'language': None}
print(docs[0].page_content)
3.9.21 Documentation

Download
Download these documents
Docs by version

Python 3.14 (in development)
Python 3.13 (stable)
Python 3.12 (stable)
Python 3.11 (security-fixes)
Python 3.10 (security-fixes)
Python 3.9 (security-fixes)
Python 3.8 (EOL)
Python 3.7 (EOL)
Python 3.6 (EOL)
Python 3.5 (EOL)
Python 3.4 (EOL)
Python 3.3 (EOL)
Python 3.2 (EOL)
Python 3.1 (EOL)
Python 3.0 (EOL)
Python 2.7 (EOL)
Python 2.6 (EOL)
All versions

Other resources

PEP Index
Beginner's Guide
Book List
Audio/Visual Talks
Python Developer’s Guide

Navigation

index

modules |

Python »

3.9.21 Documentation »
    

                     |
                

Python 3.9.21 documentation

  Welcome! This is the official documentation for Python 3.9.21.
  
Parts of the documentation:

What's new in Python 3.9?
 or all "What's new" documents since 2.0
Tutorial
start here
Library Reference
keep this under your pillow
Language Reference
describes syntax and language elements
Python Setup and Usage
how to use Python on different platforms
Python HOWTOs
in-depth documents on specific topics

Installing Python Modules
installing from the Python Package Index & other sources
Distributing Python Modules
publishing modules for installation by others
Extending and Embedding
tutorial for C/C++ programmers
Python/C API
reference for C/C++ programmers
FAQs
frequently asked questions (with answers!)

Indices and tables:

Global Module Index
quick access to all modules
General Index
all functions, classes, terms
Glossary
the most important terms explained

Search page
search this documentation
Complete Table of Contents
lists all sections and subsections

Meta information:

Reporting bugs
Contributing to Docs
About the documentation

History and License of Python
Copyright

Download
Download these documents
Docs by version

Python 3.14 (in development)
Python 3.13 (stable)
Python 3.12 (stable)
Python 3.11 (security-fixes)
Python 3.10 (security-fixes)
Python 3.9 (security-fixes)
Python 3.8 (EOL)
Python 3.7 (EOL)
Python 3.6 (EOL)
Python 3.5 (EOL)
Python 3.4 (EOL)
Python 3.3 (EOL)
Python 3.2 (EOL)
Python 3.1 (EOL)
Python 3.0 (EOL)
Python 2.7 (EOL)
Python 2.6 (EOL)
All versions

Other resources

PEP Index
Beginner's Guide
Book List
Audio/Visual Talks
Python Developer’s Guide

Navigation

index

modules |

Python »

3.9.21 Documentation »
    

                     |
                

    © Copyright 2001-2024, Python Software Foundation.
    
    This page is licensed under the Python Software Foundation License Version 2.
    
    Examples, recipes, and other code in the documentation are additionally licensed under the Zero Clause BSD License.
    
    See History and License for more information.

    The Python Software Foundation is a non-profit corporation.
Please donate.

    Last updated on Dec 08, 2024.
    Found a bug?
    

    Created using Sphinx 2.4.4.
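
The extractor passed to the loader above is a plain function that takes an HTML string and returns text. As a sketch under that assumption, the same idea can be written with only the standard library, which avoids the BeautifulSoup warning shown earlier. `stdlib_extractor` and `_TextCollector` are hypothetical names introduced here; unlike `bs4_extractor`, this version also drops the contents of `<script>` and `<style>` elements.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    _SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def stdlib_extractor(html: str) -> str:
    """Strip tags and collapse runs of blank lines, like bs4_extractor."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    return re.sub(r"\n\n+", "\n\n", text).strip()


# Small inline page used only for illustration.
page = (
    "<html><head><title>T</title><script>var x = 1;</script></head>"
    "<body><p>Hello</p>\n\n\n\n<p>World</p></body></html>"
)
print(stdlib_extractor(page))
```

It would be passed the same way as before, with `extractor=stdlib_extractor` in the `RecursiveUrlLoader` call.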
