

wzjs 2025/8/21 18:00:56


Installation and Setup

Installation method
Install directly with pip (recommended for most scenarios):
```
pip install lxml
```
Verify the installation: if the import below runs without errors, lxml is installed correctly:
```
from lxml import etree, html
```
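
Beyond a bare import, it can help to confirm which versions were actually loaded. The following is a minimal sketch that prints the version tuples exposed by `lxml.etree`; the values shown will depend on the local installation.

```python
from lxml import etree

# Version of the lxml package itself, as a tuple of integers
print("lxml:", etree.LXML_VERSION)

# Version of the underlying libxml2 C library that lxml is running against
print("libxml2:", etree.LIBXML_VERSION)
```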

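1. Basic Usage: Parsing HTML

lxml provides two common ways to parse HTML:

html.fromstring() parses an HTML string and returns an element.
html.parse() parses an HTML file (a path, URL, or file-like object) and returns an ElementTree.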

Example 1: Parsing an HTML string

from lxml import html

# Assume we have an HTML string
html_content = """
<html><body><h1>Title of the page</h1><p class="content">This is a paragraph.</p><a href="https://example.com">Click Here</a></body>
</html>
"""

# Parse the HTML string
tree = html.fromstring(html_content)

# Extract data
title = tree.xpath('//h1/text()')  # Extract the heading with XPath
print(title)  # Output: ['Title of the page']
content = tree.xpath('//p[@class="content"]/text()')  # Select the paragraph by class name
print(content)  # Output: ['This is a paragraph.']
link = tree.xpath('//a/@href')  # Get the URL of the hyperlink
print(link)  # Output: ['https://example.com']
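
One detail worth illustrating: `text()` selects only the text nodes directly under an element, so a paragraph containing inline tags comes back in pieces. The sketch below uses a made-up paragraph (not from the example above) to compare `text()` with the `text_content()` method of lxml HTML elements.

```python
from lxml import html

# Hypothetical snippet: a paragraph with an inline <b> element
paragraph = html.fromstring('<p class="content">This is <b>bold</b> text.</p>')

# text() returns only the text nodes that are direct children of <p>
direct_text = paragraph.xpath('text()')

# text_content() joins the text of the element and all of its descendants
full_text = paragraph.text_content()

print(direct_text)  # ['This is ', ' text.']
print(full_text)    # This is bold text.
```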

Example 2: Parsing an HTML file

from lxml import html

# Parse an HTML file (example.html must exist in the working directory)
tree = html.parse('example.html')

# Extract data
title = tree.xpath('//h1/text()')
print(title)
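
Because the file example needs `example.html` on disk, here is a self-contained sketch that first writes a small file to a temporary directory and then parses it. The file contents and variable names are assumptions chosen for illustration. It also shows that `html.parse()` returns an ElementTree, whose root element is available through `getroot()`.

```python
import os
import tempfile

from lxml import html

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small HTML file to parse (hypothetical content)
    path = os.path.join(tmp_dir, "example.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<html><body><h1>Title of the page</h1></body></html>")

    # html.parse() returns an ElementTree wrapping the whole document
    tree = html.parse(path)

    # getroot() gives the root <html> element
    root = tree.getroot()
    root_tag = root.tag
    parsed_title = root.xpath("//h1/text()")

print(root_tag)      # html
print(parsed_title)  # ['Title of the page']
```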

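2. XPath Queries

lxml's main strength is its XPath support, which lets you locate and extract data precisely from HTML or XML documents.

XPath syntax:

//tagname: selects all elements with the given tag name, anywhere in the document.
//tagname[@attribute='value']: selects elements whose attribute has the given value.
tagname/text(): selects the text nodes that are direct children of the element.
@attribute: selects the value of the attribute.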

Example: Querying HTML elements with XPath

from lxml import html

html_content = """
<html><body><h1>Title of the page</h1><p class="content">This is a paragraph.</p><p class="content">Another paragraph.</p><a href="https://example.com">Click Here</a></body>
</html>
"""
tree = html.fromstring(html_content)

# Get the text of all <p> elements
paragraphs = tree.xpath('//p/text()')
print(paragraphs)  # Output: ['This is a paragraph.', 'Another paragraph.']

# Get the text of all <p> elements whose class is "content"
content_paragraphs = tree.xpath('//p[@class="content"]/text()')
print(content_paragraphs)  # Output: ['This is a paragraph.', 'Another paragraph.']

# Get all hyperlinks
links = tree.xpath('//a/@href')
print(links)  # Output: ['https://example.com']
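
`xpath()` always returns a list, which is empty when nothing matches, so code that expects a single value has to handle the empty case. The sketch below uses a hypothetical list of links and runs a relative query (note the leading `.`) from each `<li>`, falling back to `None` when an item has no link.

```python
from lxml import html

# Hypothetical markup: the last item has no link
doc = html.fromstring("""
<ul>
  <li><a href="/first">First</a></li>
  <li><a href="/second">Second</a></li>
  <li>No link here</li>
</ul>
""")

rows = []
for item in doc.xpath('//li'):
    # "./a/@href" is evaluated relative to the current <li>
    hrefs = item.xpath('./a/@href')
    # Take the first match, or None if the list is empty
    rows.append(hrefs[0] if hrefs else None)

print(rows)  # ['/first', '/second', None]
```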

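3. Using CSS Selectors

lxml also supports CSS selectors for finding and extracting data. The cssselect() method depends on the separate cssselect package, which can be installed with pip install cssselect.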

Example: Extracting elements with CSS selectors

from lxml import html

html_content = """
<html><body><h1>Title of the page</h1><p class="content">This is a paragraph.</p><p class="content">Another paragraph.</p><a href="https://example.com">Click Here</a></body>
</html>
"""
tree = html.fromstring(html_content)

# Extract the heading with a CSS selector
title = tree.cssselect('h1')[0].text
print(title)  # Output: 'Title of the page'

# Get the <p> elements whose class is "content"
content_paragraphs = tree.cssselect('p.content')
for p in content_paragraphs:
    print(p.text)
# Output:
# This is a paragraph.
# Another paragraph.
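
A CSS class selector and an XPath attribute comparison are not quite equivalent: `p.content` matches any element whose class list contains `content`, whereas `@class="content"` requires the whole attribute to equal that string. The sketch below shows the difference on a made-up snippet. Since `cssselect()` raises `ImportError` when the cssselect package is missing, the CSS part is wrapped in a try block and the XPath token test serves as a fallback.

```python
from lxml import html

# Hypothetical snippet: the second paragraph carries two classes
doc = html.fromstring(
    '<div><p class="content">A</p><p class="content note">B</p></div>'
)

# Exact string comparison: only matches class="content"
exact = doc.xpath('//p[@class="content"]/text()')

# Token test: matches any class list that contains "content"
token_xpath = (
    '//p[contains(concat(" ", normalize-space(@class), " "), " content ")]/text()'
)
by_token = doc.xpath(token_xpath)

try:
    # Requires the separate cssselect package
    by_css = [p.text for p in doc.cssselect('p.content')]
except ImportError:
    by_css = None  # cssselect is not installed

print(exact)     # ['A']
print(by_token)  # ['A', 'B']
print(by_css)
```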

4. Working with XML Documents

lxml can also parse and process XML documents, in much the same way as HTML documents.

Example: Parsing and querying XML

from lxml import etree

xml_content = """
<bookstore><book><title lang="en">Python Programming</title><author>John Doe</author><price>29.99</price></book><book><title lang="es">Aprendiendo Python</title><author>Jane Smith</author><price>24.99</price></book>
</bookstore>
"""

# Parse the XML
tree = etree.fromstring(xml_content)

# Get the text of all <title> elements
titles = tree.xpath('//title/text()')
print(titles)  # Output: ['Python Programming', 'Aprendiendo Python']

# Get the names of all authors
authors = tree.xpath('//author/text()')
print(authors)  # Output: ['John Doe', 'Jane Smith']

# Get the price of the first book
price = tree.xpath('//book[1]/price/text()')
print(price)  # Output: ['29.99']
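
XPath text results are strings, so numeric fields such as the price need an explicit conversion. The sketch below is one possible way to turn each `<book>` into a dictionary using `find()`, `findtext()` and `get()`; it uses a trimmed copy of the bookstore data, and the dictionary layout is an assumption made for illustration.

```python
from lxml import etree

xml_content = """
<bookstore>
  <book><title lang="en">Python Programming</title><price>29.99</price></book>
  <book><title lang="es">Aprendiendo Python</title><price>24.99</price></book>
</bookstore>
"""
root = etree.fromstring(xml_content)

books = []
for book in root.findall('book'):
    title = book.find('title')
    books.append({
        'title': title.text,
        'lang': title.get('lang'),                # attribute value
        'price': float(book.findtext('price')),   # convert text to a number
    })

# Filter on an attribute directly in XPath
english_titles = root.xpath('//title[@lang="en"]/text()')

print(books)
print(english_titles)  # ['Python Programming']
```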

5. Saving and Serializing XML/HTML

lxml can write a parsed tree back to a file or convert it to a string.

Example: Saving a tree to an XML file

from lxml import etree

xml_content = """
<bookstore><book><title lang="en">Python Programming</title><author>John Doe</author><price>29.99</price></book>
</bookstore>
"""
root = etree.fromstring(xml_content)

# fromstring() returns an Element, which has no write() method,
# so wrap it in an ElementTree before saving
tree = etree.ElementTree(root)
tree.write('output.xml', pretty_print=True, xml_declaration=True, encoding='UTF-8')
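
Note that `write()` belongs to ElementTree objects, while `etree.fromstring()` returns a plain Element. The following sketch wraps the element in `etree.ElementTree`, writes it to a temporary directory (an assumption made here so the example leaves no files behind), and parses the file again to confirm the round trip.

```python
import os
import tempfile

from lxml import etree

root = etree.fromstring(
    "<bookstore><book><title>Python Programming</title></book></bookstore>"
)

# Wrap the Element in an ElementTree, which provides write()
doc = etree.ElementTree(root)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output.xml")
    doc.write(path, pretty_print=True, xml_declaration=True, encoding="UTF-8")

    # Read the raw bytes back to inspect the XML declaration
    with open(path, "rb") as f:
        saved = f.read()

    # Parse the saved file again and read a value from it
    reloaded_title = etree.parse(path).getroot().findtext("book/title")

print(saved.decode("utf-8"))
print(reloaded_title)  # Python Programming
```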

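Example: Converting a tree to a string

from lxml import etree, html

html_content = """
<html><body><h1>Title of the page</h1></body>
</html>
"""
tree = html.fromstring(html_content)
# encoding='unicode' returns a str instead of bytes
html_str = etree.tostring(tree, pretty_print=True, encoding='unicode')
print(html_str)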
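
`etree.tostring()` serializes as XML by default, which affects how empty HTML elements are written. As a small sketch on a made-up fragment, compare it with `html.tostring()`, which uses HTML serialization rules:

```python
from lxml import etree, html

# Hypothetical fragment containing a void element (<br>)
fragment = html.fromstring("<p>line one<br>line two</p>")

# XML serialization writes the empty element as a self-closing tag
as_xml = etree.tostring(fragment, encoding="unicode")

# HTML serialization keeps the HTML form of the void element
as_html = html.tostring(fragment, encoding="unicode")

print(as_xml)   # <p>line one<br/>line two</p>
print(as_html)  # <p>line one<br>line two</p>
```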

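6. Handling Namespaces in HTML and XML

If an XML or HTML document uses namespaces, lxml can handle them: pass a mapping from prefix to namespace URI to xpath() so that the query can match the namespaced elements.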

Example: XML with a namespace

from lxml import etree

xml_content = """
<ns:bookstore xmlns:ns="http://example.com"><ns:book><ns:title>Python Programming</ns:title><ns:author>John Doe</ns:author><ns:price>29.99</ns:price></ns:book>
</ns:bookstore>
"""
tree = etree.fromstring(xml_content)

# Look up elements using the namespace mapping
namespaces = {'ns': 'http://example.com'}
title = tree.xpath('//ns:title/text()', namespaces=namespaces)
print(title)  # Output: ['Python Programming']
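
A common stumbling block is a default namespace, declared with `xmlns="..."` and no prefix. The elements are still namespaced, so an unprefixed XPath finds nothing. The sketch below uses a variant of the bookstore document (an assumption for illustration) and shows two ways to match: a prefix of your own choosing in `xpath()`, or the `{uri}tag` form accepted by `find()` and `findtext()`.

```python
from lxml import etree

# Hypothetical variant: the namespace is declared as the default, without a prefix
xml_content = (
    '<bookstore xmlns="http://example.com">'
    '<book><title>Python Programming</title></book>'
    '</bookstore>'
)
root = etree.fromstring(xml_content)

# An unprefixed name only matches elements in no namespace
no_prefix = root.xpath('//title/text()')

# Any prefix works as long as it maps to the same URI
ns = {'b': 'http://example.com'}
with_prefix = root.xpath('//b:title/text()', namespaces=ns)

# find()/findtext() also accept the {uri}tag notation
clark = root.findtext('.//{http://example.com}title')

print(no_prefix)    # []
print(with_prefix)  # ['Python Programming']
print(clark)        # Python Programming
```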