Crawler Case 6: Scraping 趣笔阁 with Coroutines
Table of Contents
- Preface
- 1. URL Analysis and Code
Preface
This post records Crawler Case 6: scraping the novel site 趣笔阁 with coroutines (asyncio + aiohttp).
1. URL Analysis and Code
To scrape an entire novel, we first extract all chapter URLs from the table-of-contents page, then visit each URL in turn and download its content.
The page being scraped: https://www.biqugecd.net/20_20612/
Inspecting the page shows that all the URLs we need are right in the HTML source, so this step needs no coroutines: a plain request plus an XPath query is enough. The first 12 links are dropped, since we only want the URLs of the actual chapter text.
import requests
from lxml import etree
from urllib.parse import urljoin

def get_all_urls():
    url = "https://www.biqugecd.net/20_20612/"
    session = requests.Session()
    session.headers = {
        # fill in your own request headers here, e.g. "User-Agent": "..."
    }
    resp = session.get(url)
    resp.encoding = 'gbk'  # the site is GBK-encoded
    page = etree.HTML(resp.text)
    # chapter titles and hrefs; skip the first 12 links, which are not chapter text
    names = page.xpath(".//div[@class='listmain']/dl/dd/a/text()")[12:]
    urls = page.xpath(".//div[@class='listmain']/dl/dd/a/@href")[12:]
    result = []
    for name, url_ in zip(names, urls):
        # turn each relative href into an absolute URL
        result.append({"name": name, "url": urljoin(url, url_)})
    return result
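As a quick sanity check (not part of the original post), you can print the first few extracted entries to confirm that each one is a dict with a chapter name and an absolute URL:

chapters = get_all_urls()
print(len(chapters))  # total number of chapter links found
for chapter in chapters[:3]:
    print(chapter["name"], chapter["url"])  # each entry is {"name": ..., "url": ...}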
Next, open a chapter's content page. The chapter text is also present in the page source, so it can likewise be extracted with XPath, but the data contains a lot of whitespace and some useless lines that need cleaning: I drop the first text node and the last four, join the remaining list with join, and strip all whitespace with a regular expression (a small standalone demo of this cleaning step follows the download code below). The downloading itself is done with coroutines.
import asyncio
import re
import aiohttp
import aiofiles

detail_headers = {
    # fill in your own request headers here, e.g. "User-Agent": "..."
}

async def download_one(chapter):
    url = chapter['url']
    name = chapter['name']
    async with aiohttp.ClientSession(headers=detail_headers) as session:
        async with session.get(url) as resp:
            html = await resp.text(encoding='gbk')
            print(html)  # debug output: dump the raw page
            # parse the HTML
            tree = etree.HTML(html)
            # drop the first text node and the last four, join the rest, strip all whitespace
            content = re.sub(r"\s", "", "".join(tree.xpath(".//div[@id='content']//text()")[1:-4]))
            async with aiofiles.open(name + ".txt", mode="w", encoding="utf-8") as f:
                await f.write(content)
            print(name, "saved")
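To make the cleaning step concrete, here is a tiny standalone demo with made-up text nodes (the real ones come from the div[@id='content'] XPath above); it shows how the [1:-4] slice, join, and re.sub work together:

import re

# made-up stand-in for tree.xpath(".//div[@id='content']//text()")
nodes = ["广告行", "\r\n    第一段正文。", "\r\n    第二段正文。",
         "推荐阅读", "无用1", "无用2", "无用3"]
# drop the first node and the last four, join, then strip every whitespace character
content = re.sub(r"\s", "", "".join(nodes[1:-4]))
print(content)  # 第一段正文。第二段正文。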
async def download_all_chapters(all_chapter_urls):
    tasks = []
    for dic in all_chapter_urls:
        t = asyncio.create_task(download_one(dic))
        tasks.append(t)
    await asyncio.wait(tasks)
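One thing to note (not mentioned in the original): asyncio.wait only returns (done, pending) sets and does not re-raise exceptions from the tasks. asyncio.gather is a common alternative that propagates exceptions and keeps results in input order; a variant with a made-up name:

async def download_all_chapters_gather(all_chapter_urls):
    # exceptions raised inside download_one() will surface here
    await asyncio.gather(*(download_one(dic) for dic in all_chapter_urls))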
The startup code is as follows:
def main():
    # 1. Get all chapter URLs. I limit it to 5 here; I'm afraid the site would go down if I ran them all!!!
    all_chapter_urls = get_all_urls()[:5]
    print(len(all_chapter_urls))
    # 2. Download all chapters with async coroutines
    loop = asyncio.get_event_loop()
    loop.run_until_complete(download_all_chapters(all_chapter_urls))

if __name__ == '__main__':
    main()
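On Python 3.7+ the same thing can be started with asyncio.run, which creates and closes the event loop for you; on recent versions, calling asyncio.get_event_loop() with no running loop may also emit a DeprecationWarning. A minimal variant of main():

def main():
    all_chapter_urls = get_all_urls()[:5]
    print(len(all_chapter_urls))
    asyncio.run(download_all_chapters(all_chapter_urls))  # replaces get_event_loop/run_until_complete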
I only scraped the first 5 URLs to check that the code actually works; I did not run the full crawl, because the site's server does not seem very sturdy... yes, I chickened out, haha.
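If you do want to crawl the whole book without hammering the server, one option (not in the original code; the names below are made up for illustration) is to cap concurrency with an asyncio.Semaphore, so only a few requests are in flight at any time. A minimal sketch, assuming the download_one above:

async def download_one_limited(chapter, sem):
    # wait for a free slot before hitting the site
    async with sem:
        await download_one(chapter)

async def download_all_chapters_limited(all_chapter_urls, limit=5):
    sem = asyncio.Semaphore(limit)  # at most `limit` chapters downloading at once
    tasks = [asyncio.create_task(download_one_limited(dic, sem)) for dic in all_chapter_urls]
    await asyncio.wait(tasks)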