RecursiveUrlLoader: a first example
from langchain_community.document_loaders import RecursiveUrlLoader

loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
)
docs = loader.load()
print(docs[0].metadata)
d:\soft\anaconda\envs\chat_chain\Lib\site-packages\langchain_community\document_loaders\recursive_url_loader.py:43: XMLParsedAsHTMLWarning: It looks like you're using an HTML parser to parse an XML document.
Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
If you want or need to use an HTML parser on this document, you can make this warning go away by filtering it. To do that, run this code before calling the BeautifulSoup constructor:
from bs4 import XMLParsedAsHTMLWarning
import warnings
warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)
soup = BeautifulSoup(raw_html, "html.parser")
{'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html', 'title': '3.9.21 Documentation', 'language': None}
print(docs[0].page_content)
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" /><title>3.9.21 Documentation</title><meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="stylesheet" href="_static/pydoctheme.css" type="text/css" />
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<script id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
<script src="_static/jquery.js"></script>
<script src="_static/underscore.js"></script>
<script src="_static/doctools.js"></script>
<script src="_static/language_data.js"></script>
<script src="_static/sidebar.js"></script>
<link rel="search" type="application/opensearchdescription+xml"
title="Search within Python 3.9.21 documentation"
href="_static/opensearch.xml"/>
<link rel="author" title="About these documents" href="about.html" />
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="copyright" title="Copyright" href="copyright.html" />
<link rel="canonical" href="https://docs.python.org/3/index.html" />
<style>
@media only screen {
table.full-width-table {
width: 100%;
}
}
</style>
<link rel="shortcut icon" type="image/png" href="_static/py.svg" />
<script type="text/javascript" src="_static/copybutton.js"></script>
<script type="text/javascript" src="_static/menu.js"></script>
</head>
<body>
<div class="mobile-nav">
<input type="checkbox" id="menuToggler" class="toggler__input" aria-controls="navigation"
aria-pressed="false" aria-expanded="false" role="button" aria-label="Menu" />
<label for="menuToggler" class="toggler__label">
<span></span>
</label>
<nav class="nav-content" role="navigation">
<a href="https://www.python.org/" class="nav-logo">
<img src="_static/py.svg" alt="Logo"/>
</a>
<div class="version_switcher_placeholder"></div>
<form role="search" class="search" action="search.html" method="get">
<svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" class="search-icon">
<path fill-rule="nonzero"
d="M15.5 14h-.79l-.28-.27a6.5 6.5 0 001.48-5.34c-.47-2.78-2.79-5-5.59-5.34a6.505 6.505 0 00-7.27 7.27c.34 2.8 2.56 5.12 5.34 5.59a6.5 6.5 0 005.34-1.48l.27.28v.79l4.25 4.25c.41.41 1.08.41 1.49 0 .41-.41.41-1.08 0-1.49L15.5 14zm-6 0C7.01 14 5 11.99 5 9.5S7.01 5 9.5 5 14 7.01 14 9.5 11.99 14 9.5 14z" fill="#444"></path>
</svg>
<input type="text" name="q" aria-label="Quick search"/>
<input type="submit" value="Go"/>
</form>
</nav>
<div class="menu-wrapper">
<nav class="menu" role="navigation" aria-label="main navigation">
<div class="language_switcher_placeholder"></div>
<h3>Download</h3>
<p><a href="download.html">Download these documents</a></p>
<h3>Docs by version</h3>
<ul>
<li><a href="https://docs.python.org/3.14/">Python 3.14 (in development)</a></li>
<li><a href="https://docs.python.org/3.13/">Python 3.13 (stable)</a></li>
<li><a href="https://docs.python.org/3.12/">Python 3.12 (stable)</a></li>
<li><a href="https://docs.python.org/3.11/">Python 3.11 (security-fixes)</a></li>
<li><a href="https://docs.python.org/3.10/">Python 3.10 (security-fixes)</a></li>
<li><a href="https://docs.python.org/3.9/">Python 3.9 (security-fixes)</a></li>
<li><a href="https://docs.python.org/3.8/">Python 3.8 (EOL)</a></li>
<li><a href="https://docs.python.org/3.7/">Python 3.7 (EOL)</a></li>
<li><a href="https://docs.python.org/3.6/">Python 3.6 (EOL)</a></li>
<li><a href="https://docs.python.org/3.5/">Python 3.5 (EOL)</a></li>
<li><a href="https://docs.python.org/3.4/">Python 3.4 (EOL)</a></li>
<li><a href="https://docs.python.org/3.3/">Python 3.3 (EOL)</a></li>
<li><a href="https://docs.python.org/3.2/">Python 3.2 (EOL)</a></li>
<li><a href="https://docs.python.org/3.1/">Python 3.1 (EOL)</a></li>
<li><a href="https://docs.python.org/3.0/">Python 3.0 (EOL)</a></li>
<li><a href="https://docs.python.org/2.7/">Python 2.7 (EOL)</a></li>
<li><a href="https://docs.python.org/2.6/">Python 2.6 (EOL)</a></li>
<li><a href="https://www.python.org/doc/versions/">All versions</a></li>
</ul>
<h3>Other resources</h3>
<ul>
<li><a href="https://peps.python.org">PEP Index</a></li>
<li><a href="https://wiki.python.org/moin/BeginnersGuide">Beginner's Guide</a></li>
<li><a href="https://wiki.python.org/moin/PythonBooks">Book List</a></li>
<li><a href="https://www.python.org/doc/av/">Audio/Visual Talks</a></li>
<li><a href="https://devguide.python.org/">Python Developer’s Guide</a></li>
</ul>
</nav>
</div>
</div>
<div class="related" role="navigation" aria-label="related navigation">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="genindex.html" title="General Index"
accesskey="I">index</a></li>
<li class="right" >
<a href="py-modindex.html" title="Python Module Index"
>modules</a> |</li>
<li><img src="_static/py.svg" alt="python logo" style="vertical-align: middle; margin-top: -1px"/></li>
<li><a href="https://www.python.org/">Python</a> »</li>
<li class="switchers">
<div class="language_switcher_placeholder"></div>
<div class="version_switcher_placeholder"></div>
</li>
<li>
</li>
<li id="cpython-language-and-version">
<a href="#">3.9.21 Documentation</a> »
</li>
<li class="right">
<div class="inline-search" role="search">
<form class="inline-search" action="search.html" method="get">
<input placeholder="Quick search" aria-label="Quick search" type="text" name="q" />
<input type="submit" value="Go" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
|
</li>
</ul>
</div>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body" role="main">
<h1>Python 3.9.21 documentation</h1>
<p>
Welcome! This is the official documentation for Python 3.9.21.
</p>
<p><strong>Parts of the documentation:</strong></p>
<table class="contentstable" align="center"><tr>
<td width="50%">
<p class="biglink"><a class="biglink" href="whatsnew/3.9.html">What's new in Python 3.9?</a><br/>
<span class="linkdescr"> or <a href="whatsnew/index.html">all "What's new" documents</a> since 2.0</span></p>
<p class="biglink"><a class="biglink" href="tutorial/index.html">Tutorial</a><br/>
<span class="linkdescr">start here</span></p>
<p class="biglink"><a class="biglink" href="library/index.html">Library Reference</a><br/>
<span class="linkdescr">keep this under your pillow</span></p>
<p class="biglink"><a class="biglink" href="reference/index.html">Language Reference</a><br/>
<span class="linkdescr">describes syntax and language elements</span></p>
<p class="biglink"><a class="biglink" href="using/index.html">Python Setup and Usage</a><br/>
<span class="linkdescr">how to use Python on different platforms</span></p>
<p class="biglink"><a class="biglink" href="howto/index.html">Python HOWTOs</a><br/>
<span class="linkdescr">in-depth documents on specific topics</span></p>
</td><td width="50%">
<p class="biglink"><a class="biglink" href="installing/index.html">Installing Python Modules</a><br/>
<span class="linkdescr">installing from the Python Package Index & other sources</span></p>
<p class="biglink"><a class="biglink" href="distributing/index.html">Distributing Python Modules</a><br/>
<span class="linkdescr">publishing modules for installation by others</span></p>
<p class="biglink"><a class="biglink" href="extending/index.html">Extending and Embedding</a><br/>
<span class="linkdescr">tutorial for C/C++ programmers</span></p>
<p class="biglink"><a class="biglink" href="c-api/index.html">Python/C API</a><br/>
<span class="linkdescr">reference for C/C++ programmers</span></p>
<p class="biglink"><a class="biglink" href="faq/index.html">FAQs</a><br/>
<span class="linkdescr">frequently asked questions (with answers!)</span></p>
</td></tr>
</table>
<p><strong>Indices and tables:</strong></p>
<table class="contentstable" align="center"><tr>
<td width="50%">
<p class="biglink"><a class="biglink" href="py-modindex.html">Global Module Index</a><br/>
<span class="linkdescr">quick access to all modules</span></p>
<p class="biglink"><a class="biglink" href="genindex.html">General Index</a><br/>
<span class="linkdescr">all functions, classes, terms</span></p>
<p class="biglink"><a class="biglink" href="glossary.html">Glossary</a><br/>
<span class="linkdescr">the most important terms explained</span></p>
</td><td width="50%">
<p class="biglink"><a class="biglink" href="search.html">Search page</a><br/>
<span class="linkdescr">search this documentation</span></p>
<p class="biglink"><a class="biglink" href="contents.html">Complete Table of Contents</a><br/>
<span class="linkdescr">lists all sections and subsections</span></p>
</td></tr>
</table>
<p><strong>Meta information:</strong></p>
<table class="contentstable" align="center"><tr>
<td width="50%">
<p class="biglink"><a class="biglink" href="bugs.html">Reporting bugs</a></p>
<p class="biglink"><a class="biglink" href="https://devguide.python.org/docquality/#helping-with-documentation">Contributing to Docs</a></p>
<p class="biglink"><a class="biglink" href="about.html">About the documentation</a></p>
</td><td width="50%">
<p class="biglink"><a class="biglink" href="license.html">History and License of Python</a></p>
<p class="biglink"><a class="biglink" href="copyright.html">Copyright</a></p>
</td></tr>
</table>
</div>
</div>
</div>
<div class="sphinxsidebar" role="navigation" aria-label="main navigation">
<div class="sphinxsidebarwrapper">
<h3>Download</h3>
<p><a href="download.html">Download these documents</a></p>
<h3>Docs by version</h3>
<ul>
<li><a href="https://docs.python.org/3.14/">Python 3.14 (in development)</a></li>
<li><a href="https://docs.python.org/3.13/">Python 3.13 (stable)</a></li>
<li><a href="https://docs.python.org/3.12/">Python 3.12 (stable)</a></li>
<li><a href="https://docs.python.org/3.11/">Python 3.11 (security-fixes)</a></li>
<li><a href="https://docs.python.org/3.10/">Python 3.10 (security-fixes)</a></li>
<li><a href="https://docs.python.org/3.9/">Python 3.9 (security-fixes)</a></li>
<li><a href="https://docs.python.org/3.8/">Python 3.8 (EOL)</a></li>
<li><a href="https://docs.python.org/3.7/">Python 3.7 (EOL)</a></li>
<li><a href="https://docs.python.org/3.6/">Python 3.6 (EOL)</a></li>
<li><a href="https://docs.python.org/3.5/">Python 3.5 (EOL)</a></li>
<li><a href="https://docs.python.org/3.4/">Python 3.4 (EOL)</a></li>
<li><a href="https://docs.python.org/3.3/">Python 3.3 (EOL)</a></li>
<li><a href="https://docs.python.org/3.2/">Python 3.2 (EOL)</a></li>
<li><a href="https://docs.python.org/3.1/">Python 3.1 (EOL)</a></li>
<li><a href="https://docs.python.org/3.0/">Python 3.0 (EOL)</a></li>
<li><a href="https://docs.python.org/2.7/">Python 2.7 (EOL)</a></li>
<li><a href="https://docs.python.org/2.6/">Python 2.6 (EOL)</a></li>
<li><a href="https://www.python.org/doc/versions/">All versions</a></li>
</ul>
<h3>Other resources</h3>
<ul>
<li><a href="https://peps.python.org">PEP Index</a></li>
<li><a href="https://wiki.python.org/moin/BeginnersGuide">Beginner's Guide</a></li>
<li><a href="https://wiki.python.org/moin/PythonBooks">Book List</a></li>
<li><a href="https://www.python.org/doc/av/">Audio/Visual Talks</a></li>
<li><a href="https://devguide.python.org/">Python Developer’s Guide</a></li>
</ul>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="related" role="navigation" aria-label="related navigation">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="genindex.html" title="General Index"
>index</a></li>
<li class="right" >
<a href="py-modindex.html" title="Python Module Index"
>modules</a> |</li>
<li><img src="_static/py.svg" alt="python logo" style="vertical-align: middle; margin-top: -1px"/></li>
<li><a href="https://www.python.org/">Python</a> »</li>
<li class="switchers">
<div class="language_switcher_placeholder"></div>
<div class="version_switcher_placeholder"></div>
</li>
<li>
</li>
<li id="cpython-language-and-version">
<a href="#">3.9.21 Documentation</a> »
</li>
<li class="right">
<div class="inline-search" role="search">
<form class="inline-search" action="search.html" method="get">
<input placeholder="Quick search" aria-label="Quick search" type="text" name="q" />
<input type="submit" value="Go" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
|
</li>
</ul>
</div>
<div class="footer">
© <a href="copyright.html">Copyright</a> 2001-2024, Python Software Foundation.
<br />
This page is licensed under the Python Software Foundation License Version 2.
<br />
Examples, recipes, and other code in the documentation are additionally licensed under the Zero Clause BSD License.
<br />
See <a href="/license.html">History and License</a> for more information.<br />
<br />
The Python Software Foundation is a non-profit corporation.
<a href="https://www.python.org/psf/donations/">Please donate.</a>
<br />
<br />
Last updated on Dec 08, 2024.
<a href="/bugs.html">Found a bug</a>?
<br />
Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 2.4.4.
</div>
<script type="text/javascript" src="_static/switchers.js"></script>
</body>
</html>
print(len(docs))
24
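A single root URL yields 24 documents because the loader follows links found on the root page rather than loading only that page; RecursiveUrlLoader exposes a max_depth parameter to bound how far it goes. The sketch below is a library-independent toy model of depth-limited link traversal, not RecursiveUrlLoader's actual implementation. The in-memory SITE graph, the crawl function, and the convention that the root counts as level 1 are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical in-memory "site": page path -> links found on that page.
SITE = {
    "/": ["/tutorial/", "/library/", "https://example.org/"],
    "/tutorial/": ["/tutorial/intro", "/"],
    "/library/": ["/library/os"],
    "/tutorial/intro": [],
    "/library/os": [],
}


def crawl(root, max_depth):
    """Return pages reachable from root, descending at most max_depth levels.

    The root itself counts as level 1 (an assumption of this sketch).
    """
    visited = []
    seen = {root}
    queue = deque([(root, 1)])
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in SITE.get(url, []):
            # Skip off-site links and pages that were already queued.
            if link.startswith("/") and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


pages_depth_1 = crawl("/", 1)
pages_depth_2 = crawl("/", 2)
pages_depth_3 = crawl("/", 3)
print(len(pages_depth_1), len(pages_depth_2), len(pages_depth_3))
```

Raising the depth limit grows the result quickly, which is why a bounded depth is a sensible safeguard when crawling a large documentation site.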
Custom extractor in RecursiveUrlLoader
import re

from bs4 import BeautifulSoup

def bs4_extractor(html: str) -> str:
    # Parse the raw HTML and keep only its text, collapsing runs of blank lines.
    soup = BeautifulSoup(html, "html.parser")
    return re.sub(r"\n\n+", "\n\n", soup.text).strip()

loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    extractor=bs4_extractor,
)
docs = loader.load()
print(len(docs))
print(docs[0].metadata)
C:\Windows\Temp\ipykernel_12952\1217732938.py:5: XMLParsedAsHTMLWarning: It looks like you're using an HTML parser to parse an XML document.
Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
If you want or need to use an HTML parser on this document, you can make this warning go away by filtering it. To do that, run this code before calling the BeautifulSoup constructor:
from bs4 import XMLParsedAsHTMLWarning
import warnings
warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)
soup = BeautifulSoup(html, "html.parser")
24
{'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html', 'title': '3.9.21 Documentation', 'language': None}
print(docs[0].page_content)
3.9.21 Documentation
Download
Download these documents
Docs by version
Python 3.14 (in development)
Python 3.13 (stable)
Python 3.12 (stable)
Python 3.11 (security-fixes)
Python 3.10 (security-fixes)
Python 3.9 (security-fixes)
Python 3.8 (EOL)
Python 3.7 (EOL)
Python 3.6 (EOL)
Python 3.5 (EOL)
Python 3.4 (EOL)
Python 3.3 (EOL)
Python 3.2 (EOL)
Python 3.1 (EOL)
Python 3.0 (EOL)
Python 2.7 (EOL)
Python 2.6 (EOL)
All versions
Other resources
PEP Index
Beginner's Guide
Book List
Audio/Visual Talks
Python Developer’s Guide
Navigation
index
modules |
Python »
3.9.21 Documentation »
|
Python 3.9.21 documentation
Welcome! This is the official documentation for Python 3.9.21.
Parts of the documentation:
What's new in Python 3.9?
or all "What's new" documents since 2.0
Tutorial
start here
Library Reference
keep this under your pillow
Language Reference
describes syntax and language elements
Python Setup and Usage
how to use Python on different platforms
Python HOWTOs
in-depth documents on specific topics
Installing Python Modules
installing from the Python Package Index & other sources
Distributing Python Modules
publishing modules for installation by others
Extending and Embedding
tutorial for C/C++ programmers
Python/C API
reference for C/C++ programmers
FAQs
frequently asked questions (with answers!)
Indices and tables:
Global Module Index
quick access to all modules
General Index
all functions, classes, terms
Glossary
the most important terms explained
Search page
search this documentation
Complete Table of Contents
lists all sections and subsections
Meta information:
Reporting bugs
Contributing to Docs
About the documentation
History and License of Python
Copyright
Download
Download these documents
Docs by version
Python 3.14 (in development)
Python 3.13 (stable)
Python 3.12 (stable)
Python 3.11 (security-fixes)
Python 3.10 (security-fixes)
Python 3.9 (security-fixes)
Python 3.8 (EOL)
Python 3.7 (EOL)
Python 3.6 (EOL)
Python 3.5 (EOL)
Python 3.4 (EOL)
Python 3.3 (EOL)
Python 3.2 (EOL)
Python 3.1 (EOL)
Python 3.0 (EOL)
Python 2.7 (EOL)
Python 2.6 (EOL)
All versions
Other resources
PEP Index
Beginner's Guide
Book List
Audio/Visual Talks
Python Developer’s Guide
Navigation
index
modules |
Python »
3.9.21 Documentation »
|
© Copyright 2001-2024, Python Software Foundation.
This page is licensed under the Python Software Foundation License Version 2.
Examples, recipes, and other code in the documentation are additionally licensed under the Zero Clause BSD License.
See History and License for more information.
The Python Software Foundation is a non-profit corporation.
Please donate.
Last updated on Dec 08, 2024.
Found a bug?
Created using Sphinx 2.4.4.
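The extractor is just a callable that takes the raw HTML string and returns a string, so it does not have to rely on BeautifulSoup. As a rough alternative, the sketch below does the same text extraction using only the standard library's html.parser, which also avoids the XMLParsedAsHTMLWarning shown above because BeautifulSoup is never invoked. The names stdlib_extractor and _TextCollector are hypothetical, and the sample HTML is made up for the demonstration.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def stdlib_extractor(html: str) -> str:
    """Return the page text with runs of blank lines collapsed to one."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    return re.sub(r"\n\n+", "\n\n", text).strip()


# Made-up sample page used only to exercise the extractor.
sample = (
    "<html><head><title>Demo</title>\n"
    "<style>p {color: red}</style></head>\n"
    "<body><p>Hello</p>\n\n\n<p>World</p></body></html>"
)
result = stdlib_extractor(sample)
print(result)
```

It would be passed to the loader the same way as bs4_extractor, that is, extractor=stdlib_extractor. The output may differ slightly from the BeautifulSoup version on malformed markup, since the two parsers recover from errors differently.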