
Web Crawler Basics: Fundamentals and GET Requests

news 2025/8/22 12:54:55

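Stages of learning web crawling:
1. Basic crawling: fetching data, logging in, paging through results, and crawling according to account permissions
2. Advanced crawling: JS reverse engineering, AST-based analysis, and bypassing anti-crawler measures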

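Basic principles of a web crawler:
Overview:
A web crawler is an automated program that fetches web pages, then extracts and saves information from them.
1. Fetch the page
2. Extract information: CSS selectors, XPath
3. Save the data (the big-data era)
4. Automate the process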
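
The first three steps above can be sketched with the standard library alone. This is a minimal illustration, not a complete crawler: the helper names (`fetch`, `extract_title`, `save`, `TitleParser`) are hypothetical, the page is assumed to be UTF-8, and `html.parser` stands in for the CSS selectors and XPath mentioned above, which normally require third-party libraries.

```python
from html.parser import HTMLParser
import urllib.request


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Step 1: fetch the page and decode it (UTF-8 is an assumption)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def extract_title(html):
    """Step 2: extract one piece of information, here the page title."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def save(text, path):
    """Step 3: save the extracted data to a file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Offline demonstration of step 2 on a hard-coded page (no network needed).
sample_html = "<html><head><title> Example page </title></head><body></body></html>"
title = extract_title(sample_html)
print(title)
```

Step 4, automation, would amount to calling `fetch`, `extract_title`, and `save` in a loop over a list of URLs.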

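Python module to learn for crawling: urllib (Python's built-in HTTP request library)
It consists of 4 modules:
request module: the most basic HTTP request module
error module: the exception handling module
parse module: a utility module that provides URL handling methods
robotparser module: parses robots.txt files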
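
As a rough sketch of how the error and robotparser modules might be used, the example below parses a robots.txt file and wraps a request in exception handling. The robots.txt content, the example.com URLs, and the `safe_fetch` helper are all made up for illustration; a real crawler would load the site's actual robots.txt instead.

```python
import urllib.error
import urllib.request
import urllib.robotparser

# Hypothetical robots.txt content, parsed offline for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = urllib.robotparser.RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

# Ask whether any user agent ("*") may fetch each URL.
allowed_public = robots.can_fetch("*", "http://example.com/index.html")
allowed_private = robots.can_fetch("*", "http://example.com/private/data.html")
print(allowed_public, allowed_private)


def safe_fetch(url, timeout=10):
    """Return the response body as bytes, or None on an HTTP or URL error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error:", e.code, e.reason)
    except urllib.error.URLError as e:
        print("Could not reach the server:", e.reason)
    return None
```

Catching `HTTPError` before `URLError` lets the code report the status code when the server did respond, and fall back to the more general error otherwise.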

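The most basic request:
Send it with urlopen
GET request:
Parameters are appended directly to the URL
There may be more than one parameter
Multiple parameters: urllib.parse.urlencode(params)
Start from a base URL and a dictionary params that holds the query parameters
Use urllib.parse.urlencode() to encode the query parameters into a query string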
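
To see what urlencode() produces, the short sketch below builds a query string and then decodes it again without sending any request. The variable names are illustrative; `urlsplit` and `parse_qs` are not used in the original and appear here only to show the round trip.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "http://www.baidu.com/s"
params = {"wd": "孙悟空", "pn": "80"}

# urlencode() percent-encodes non-ASCII values as UTF-8 and joins pairs with "&".
query_str = urlencode(params)
final_url = base_url + "?" + query_str
print(final_url)
# http://www.baidu.com/s?wd=%E5%AD%99%E6%82%9F%E7%A9%BA&pn=80

# parse_qs() reverses the encoding; each key maps to a list of values.
decoded = parse_qs(urlsplit(final_url).query)
print(decoded)
# {'wd': ['孙悟空'], 'pn': ['80']}
```

Because urlencode() already handles the Chinese characters, no separate call to urllib.parse.quote() is needed for the query values.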

#!/usr/bin/env python3
import urllib.request


def load_urlbaidu():
    url = 'http://www.baidu.com'
    # Send a GET request and receive the response object
    response = urllib.request.urlopen(url)
    # print(response.code)
    # read() returns bytes, so decode them into a string
    data = response.read()
    str_data = data.decode('utf-8')
    print(str_data)
    # Save the page to a local file
    with open('baidu.html', 'w', encoding='utf-8') as f:
        f.write(str_data)


load_urlbaidu()

#!/usr/bin/env python3
import urllib.request
import urllib.parse


def load_urlbaidu():
    url = 'http://www.baidu.com/s?'
    params = {"wd": "孙悟空", "pn": "80"}
    # Encode the dictionary into a query string and append it to the URL
    query_str = urllib.parse.urlencode(params)
    final_url = url + query_str
    # print(final_url)
    # Chinese characters must be percent-encoded; urlencode() above already
    # does this. The quote() alternative below would also need `import string`.
    # encode_url = urllib.parse.quote(final_url, safe=string.printable)
    # print(encode_url)
    response = urllib.request.urlopen(final_url)
    # print(response.code)
    data = response.read()
    str_data = data.decode('utf-8')
    print(str_data)
    with open('baidu_wukong.html', 'w', encoding='utf-8') as f:
        f.write(str_data)


load_urlbaidu()
