Getting to Know Web Scraping: XPath Extraction

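The scraping workflow:
  1. Get the HTML page: the client (our program) sends an HTTP request to the server
  2. Extract data according to rules: XPath, bs4, re (regular expressions)
  3. Store the data: Excel, txt, CSV, SQL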
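
As a rough offline sketch of steps 2 and 3, the snippet below uses only the standard library: `re` for extraction and `csv` for storage. The HTML fragment, the `/item/...` links and the column names are made up for illustration, and the hard-coded string stands in for the page that step 1 would download.

```python
import csv
import io
import re

# Step 1 stand-in: a hard-coded HTML fragment instead of a real HTTP response.
html = (
    '<h3><a href="/item/1">番茄 - 百度百科</a></h3>'
    '<h3><a href="/item/2">番茄的做法</a></h3>'
)

# Step 2: extract (href, title) pairs with a regular expression.
rows = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)

# Step 3: store the rows as CSV. An in-memory buffer is used here; a real
# script would open a file with newline="" and an explicit encoding.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["href", "title"])
writer.writerows(rows)

print(buffer.getvalue())
```

Regular expressions are fine for a fragment this regular; for real pages a tree-based tool such as XPath is usually more robust.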
I. Getting the HTML page

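Python has two commonly used libraries for this: urllib and requests

urllib: ships with Python, so it can send HTTP requests without any extra installation

requests: not part of the standard library, so it has to be installed separately

Installing requests

In Anaconda Prompt, run pip install requests

Using requests.get directly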

import requests
url = "https://www.baidu.com"
response = requests.get(url)
print(response)

>>> <Response [200]>

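Using response.text

The HTML is returned, but some of it is garbled. response.text decodes the body using an encoding that requests guesses; Baidu's page is UTF-8, but the guess here is a different encoding, so the characters come out wrong.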

import requests
url = "https://www.baidu.com"
response = requests.get(url).text
print(response)

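Using response.content

This returns the raw bytes of the response. We take those bytes and decode them ourselves.

Note on parsing pages: a page is written with one particular encoding, and it must be decoded with that same encoding.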

import requests
url = "https://www.baidu.com"
response = requests.get(url)
content = response.content.decode('utf8')
print(content)
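
To see why the encoding has to match, here is a small offline sketch using only built-in string methods. The choice of ISO-8859-1 as the "wrong" encoding is an assumption made for illustration; the bytes stand in for what a UTF-8 page would send.

```python
# The bytes a UTF-8 page would send for the word "番茄".
raw = "番茄".encode("utf-8")
print(raw)  # b'\xe7\x95\xaa\xe8\x8c\x84'

# Decoding with the same encoding the page was written in recovers the text.
good = raw.decode("utf-8")
print(good)  # 番茄

# Decoding with a different encoding does not fail here, but yields garbage.
garbled = raw.decode("iso-8859-1")
print(repr(garbled))

# The garbage is reversible only if you know both encodings involved.
recovered = garbled.encode("iso-8859-1").decode("utf-8")
print(recovered == good)  # True
```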

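Further notes:

response.encoding

response.text decodes the body with the encoding that requests guessed; response.encoding shows which encoding that guess was.

response.status_code

The HTTP status code that the server returned for the request, as reported by requests.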
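
As an illustration of how a status code might be interpreted, the sketch below uses the standard library's `http.HTTPStatus`. The helper `describe_status` is hypothetical (it is not part of requests); you would pass it a value such as `response.status_code`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that the standard library does not know about.
        phrase = "Unknown"

    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"


print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (client error)
```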

import requests
url = "https://www.baidu.com/s?wd=番茄"
response = requests.get(url)
content = response.content.decode('utf8')
print(content)

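This can go wrong, because a URL travels over the network as bytes, and non-ASCII characters such as 番茄 have to be percent-encoded first. It is safer to pass the query parameters as a dict and let requests build the URL for us.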

import requests
url = "https://www.baidu.com/s"
keyword={"wd" : "番茄"}
response = requests.get(url,params = keyword)
content = response.content.decode('utf8')
print(content)
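
To make the percent-encoding visible, the sketch below builds the same query string by hand with the standard library's `urllib.parse.urlencode`. It is only an approximation of what passing `params` does; the variable names `base_url` and `full_url` are chosen for this example.

```python
from urllib.parse import urlencode

base_url = "https://www.baidu.com/s"
keyword = {"wd": "番茄"}

# urlencode encodes each value as UTF-8 bytes, then percent-encodes them.
query = urlencode(keyword)
full_url = f"{base_url}?{query}"

print(full_url)  # https://www.baidu.com/s?wd=%E7%95%AA%E8%8C%84
```

Each `%XX` group is one byte of the UTF-8 encoding of 番茄, which is why two characters turn into six escaped bytes.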

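Very little data comes back, because the server has recognized the request as coming from a crawler. We need to disguise our request as one sent by a browser by setting the request headers (Request Headers).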

import requests

url = "https://www.baidu.com/s"
keyword = {"wd": "番茄"}
headers = {
    # Identify ourselves as a desktop browser
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 Edg/138.0.0.0",
    # Paste the Cookie from your own browser session here, on a single line
    # (a header value must not contain line breaks)
    "Cookie": "<Cookie copied from your browser's developer tools>",
}
response = requests.get(url, params=keyword, headers=headers)
content = response.content.decode('utf8')
print(content)
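
If you want to check which headers a request would carry before actually sending it, one option is to build the request object without sending it. The sketch below does this with the standard library's `urllib.request.Request` rather than requests, and the shortened User-Agent string is a placeholder.

```python
from urllib.request import Request

url = "https://www.baidu.com/s?wd=%E7%95%AA%E8%8C%84"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Building a Request does not touch the network; nothing is sent
# until the object is passed to urllib.request.urlopen().
request = Request(url, headers=headers)

print(request.get_method())    # GET
print(request.header_items())  # the headers that would be sent

# urllib stores header names capitalized as "User-agent";
# HTTP header names are case-insensitive, so servers treat them the same.
print(request.get_header("User-agent"))
```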

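At this point the first step is complete.

II. Extracting data with XPath

At the end of the previous step all we have is a string full of HTML tags, so we first need to turn that string into an HTML tree (XPath is designed specifically for extracting data from the tree structure of XML/HTML documents).

lxml: parses an HTML string into a tree (a DOM tree), which makes two ways of locating nodes available: XPath, which locates nodes by path, and CSS selectors, which locate nodes by class name, ID, and so on.

etree.HTML

Parses an HTML string into an XML/HTML tree that you can work with, making it easy to extract data with XPath or CSS selectors.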

import requests
from lxml import etree

url = "https://www.baidu.com/s"
keyword = {"wd": "番茄"}
headers = {
    # Identify ourselves as a desktop browser
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 Edg/138.0.0.0",
    # Paste the Cookie from your own browser session here, on a single line
    "Cookie": "<Cookie copied from your browser's developer tools>",
}
response = requests.get(url, params=keyword, headers=headers)
content = response.content.decode('utf8')
tree = etree.HTML(content)
# xpath() returns a list of matches; [0] takes the first one
title = tree.xpath("//h3/a[contains(text(),'百度百科')]/text()")[0]
print(title)
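
The same XPath expression can be tried without any network access. The sketch below runs it against a small hand-written HTML string; the markup and the example.com links are invented for illustration and are not Baidu's real page structure.

```python
from lxml import etree

# A tiny stand-in for a search results page.
html = """
<html><body>
  <h3><a href="https://example.com/baike">番茄 - 百度百科</a></h3>
  <h3><a href="https://example.com/recipe">番茄炒蛋的做法</a></h3>
</body></html>
"""

tree = etree.HTML(html)

# text() selects the text inside each matching <a>.
all_titles = tree.xpath("//h3/a/text()")

# contains() keeps only the links whose text includes the given substring.
baike_titles = tree.xpath("//h3/a[contains(text(),'百度百科')]/text()")

# @href selects an attribute value instead of the text.
links = tree.xpath("//h3/a/@href")

# xpath() returns a list, so guard against an empty result before indexing.
first = baike_titles[0] if baike_titles else None
print(all_titles, baike_titles, links, first)
```

Checking for an empty list before taking `[0]` avoids an IndexError when the page layout changes or the server returns a stripped-down page.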

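etree.tostring

Serializes a node tree into bytes

etree.tostring(html,encoding='utf8').decode('utf8')

encoding='utf8': states explicitly which encoding the resulting bytes use

.decode('utf8'): turns the bytes into a human-readable string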

import requests
from lxml import etree

url = "https://www.baidu.com/s"
keyword = {"wd": "番茄"}
headers = {
    # Identify ourselves as a desktop browser
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 Edg/138.0.0.0",
    # Paste the Cookie from your own browser session here, on a single line
    "Cookie": "<Cookie copied from your browser's developer tools>",
}
response = requests.get(url, params=keyword, headers=headers)
content = response.content.decode('utf8')
tree = etree.HTML(content)
# Without /text() the result is an Element object, not a string
title = tree.xpath("//h3/a[contains(text(),'百度百科')]")[0]
print(title)

import requests
from lxml import etree

url = "https://www.baidu.com/s"
keyword = {"wd": "番茄"}
headers = {
    # Identify ourselves as a desktop browser
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 Edg/138.0.0.0",
    # Paste the Cookie from your own browser session here, on a single line
    "Cookie": "<Cookie copied from your browser's developer tools>",
}
response = requests.get(url, params=keyword, headers=headers)
content = response.content.decode('utf8')
tree = etree.HTML(content)
title = tree.xpath("//h3/a[contains(text(),'百度百科')]")[0]
# Serialize the Element back into markup so it can be read as text
result = etree.tostring(title, encoding='utf8').decode('utf8')
print(result)
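
The difference between printing an element and printing its serialized form is easier to see on a small offline example. The HTML string and the example.com link below are made up for illustration.

```python
from lxml import etree

tree = etree.HTML(
    '<h3><a href="https://example.com/baike">番茄 - 百度百科</a></h3>'
)
link = tree.xpath("//h3/a")[0]

# Printing the element shows an Element object, not its markup.
print(link)

# With no encoding argument, non-ASCII characters are written
# as numeric character references such as &#30058;.
default = etree.tostring(link)
print(default)

# With an explicit encoding we get bytes in that encoding,
# which decode() turns back into readable text.
as_bytes = etree.tostring(link, encoding="utf8")
as_text = as_bytes.decode("utf8")
print(as_text)

# encoding="unicode" returns a str directly, so no decode() is needed.
direct = etree.tostring(link, encoding="unicode")
print(direct)
```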

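Further notes:

etree.parse

Parses XML/HTML from a file or file object and returns an ElementTree object (an ElementTree represents the whole document tree and supports document-level XPath queries).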
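
A minimal sketch of etree.parse follows. It reads from an in-memory `io.BytesIO` object that stands in for a saved HTML file, and it passes an `etree.HTMLParser` with an explicit encoding; both choices are assumptions made so the example runs without any file on disk.

```python
import io

from lxml import etree

# Stand-in for a saved page; a real script could pass a file path instead.
html_bytes = (
    "<html><body><h3><a>番茄 - 百度百科</a></h3></body></html>"
).encode("utf-8")
html_file = io.BytesIO(html_bytes)

# Use the HTML parser (the default parser expects well-formed XML)
# and say which encoding the bytes are in.
parser = etree.HTMLParser(encoding="utf-8")
document = etree.parse(html_file, parser)

# parse() returns an ElementTree; getroot() gives the top-level element.
root = document.getroot()
print(root.tag)  # html

# XPath queries can be run on the whole document.
titles = document.xpath("//h3/a/text()")
print(titles)
```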
