
Travel Site Scraping in Practice: A Complete Guide to Tracking Ctrip Hotel Price Trends

news 2025/11/15 12:06:06

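1. Why Scrape Ctrip Hotel Prices?

2. Implementation: From Beginner to Advanced

Basic: static page scraping (for beginners)

Advanced: dynamic data scraping (recommended in practice)

Ultimate: distributed crawler architecture (enterprise-grade)

3. Data Visualization and Analysis

4. Anti-Scraping Countermeasures

5. Common Questions (Q&A)

6. Summary and Outlook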


1. Why Scrape Ctrip Hotel Prices?

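During peak travel season, hotel prices swing like a roller coaster. During the 2025 National Day holiday, for example, the base room rate at one five-star hotel in Sanya jumped from its usual 800 yuan to 2,500 yuan, while guesthouses around West Lake in Hangzhou rose by more than 300% over the same period. Swings like these make it hard for consumers to decide when to book, and they mean travel professionals need a real-time view of the market.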

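Ctrip is the largest online travel platform in China, and its hotel data has three core strengths:

Timeliness: price data is refreshed every 15 minutes
Completeness: covers 98% of hotels nationwide
Connectedness: combines user ratings, remaining inventory, promotions, and other dimensions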

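By scraping this data, we can:

Build a price fluctuation alert system
Analyze pricing strategies across cities and districts
Forecast future price movements
Compare competitors' pricing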

2. Implementation: From Beginner to Advanced

Basic: static page scraping (for beginners)

Tools:

Python 3.8+
requests (sends HTTP requests)
BeautifulSoup (parses HTML)
pandas (data processing)

Code example:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Browser-like User-Agent so the request is not rejected outright
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

url = "https://hotels.ctrip.com/hotel/beijing1"
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses
soup = BeautifulSoup(response.text, 'html.parser')

hotels = []
for item in soup.select('.hotel_item'):
    name_el = item.select_one('.hotel_name')
    price_el = item.select_one('.price')
    score_el = item.select_one('.comment_score')
    if name_el is None or price_el is None:
        continue  # skip cards that lack a name or a price
    hotels.append({
        'name': name_el.text.strip(),
        'price': price_el.text.strip(),
        'score': score_el.text.strip() if score_el else 'N/A',
    })

df = pd.DataFrame(hotels)
df.to_csv('beijing_hotels.csv', index=False)

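Limitations:

Only retrieves data present in the initial page load
Cannot retrieve dynamically loaded price information
Easily blocked by anti-scraping mechanisms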

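Advanced: dynamic data scraping (recommended in practice)

Technical upgrades:

Analyze XHR requests: use the Network panel of the browser developer tools (F12) to find the real data endpoint. For example, Ctrip's hotel list endpoint: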
```
https://hotels.ctrip.com/hotel/api/hotellist/gethotellist?cityId=1&checkIn=2025-11-15&checkOut=2025-11-16
```
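Parameter construction:
- cityId: city code (Beijing = 1, Shanghai = 2)
- checkIn/checkOut: check-in and check-out dates
- pageIndex: pagination parameter
- other: sort order, price range, and other filters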
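
As a rough sketch of how these parameters might be assembled, the snippet below builds the request URL with the standard library. The helper name `build_hotel_query` and the one-night default stay are assumptions made for illustration; the endpoint and parameter names are the ones listed above.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

BASE_URL = "https://hotels.ctrip.com/hotel/api/hotellist/gethotellist"


def build_hotel_query(city_id, check_in, nights=1, page_index=1):
    """Return the request URL for one page of hotel results.

    Hypothetical helper: parameter names follow the list above.
    """
    check_out = check_in + timedelta(days=nights)
    params = {
        "cityId": city_id,
        "checkIn": check_in.isoformat(),    # YYYY-MM-DD
        "checkOut": check_out.isoformat(),
        "pageIndex": page_index,
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_hotel_query(1, date(2025, 11, 15))
print(url)
```

Deriving the check-out date from the check-in date keeps the two values consistent when looping over many dates.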

Full implementation:

import requests
import pandas as pd
import random

# Proxy pool (placeholder free proxies; paid residential proxies are recommended in practice).
# The 'https' key is required because the target URL uses HTTPS.
proxies = [
    {'http': 'http://123.123.123.123:8080', 'https': 'http://123.123.123.123:8080'},
    {'http': 'http://124.124.124.124:8081', 'https': 'http://124.124.124.124:8081'},
]

# User-Agent pool
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...',
]


def get_hotel_data(city_id, check_in, check_out):
    base_url = "https://hotels.ctrip.com/hotel/api/hotellist/gethotellist"
    params = {
        'cityId': city_id,
        'checkIn': check_in,
        'checkOut': check_out,
        'pageIndex': 1,
        'pageSize': 20,
        'order': 'price_l',  # sort by price, ascending
    }
    headers = {
        'User-Agent': random.choice(user_agents),
        'Referer': 'https://hotels.ctrip.com/',
    }
    try:
        proxy = random.choice(proxies)
        response = requests.get(base_url, params=params, headers=headers,
                                proxies=proxy, timeout=10)
        if response.status_code == 200:
            return process_data(response.json())
        print(f"Request failed, status code: {response.status_code}")
        return None
    except (requests.RequestException, ValueError) as e:
        # Network errors, timeouts, or a body that is not valid JSON
        print(f"Request error: {e}")
        return None


def process_data(json_data):
    # Keep only the fields needed for price analysis
    hotels = []
    for hotel in json_data.get('hotelList', []):
        hotels.append({
            'name': hotel.get('hotelName'),
            'price': hotel.get('minPrice'),
            'score': hotel.get('commentScore'),
            'address': hotel.get('address'),
            'room_count': hotel.get('roomCount'),
        })
    return hotels


# Example: fetch Beijing hotel data for 2025-11-15 to 2025-11-16
check_in = '2025-11-15'
check_out = '2025-11-16'
beijing_hotels = get_hotel_data(1, check_in, check_out)

if beijing_hotels:
    df = pd.DataFrame(beijing_hotels)
    df.to_csv('beijing_hotels_dynamic.csv', index=False)
    print("Scrape succeeded. Data saved to CSV.")
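
The example above requests only the first page. One possible way to walk through `pageIndex` values is sketched below. `fetch_all_pages` is a hypothetical helper, and it assumes the fetch function has been adapted to take a page index and to return an empty list once results run out; neither assumption is confirmed by the endpoint itself.

```python
import random
import time


def fetch_all_pages(fetch_page, max_pages=50, delay_range=(3.0, 5.0), sleep=time.sleep):
    """Collect results page by page until an empty page or max_pages.

    fetch_page is any callable that takes a 1-based page index and returns
    a list of hotel dicts (or None on failure).
    """
    results = []
    for page_index in range(1, max_pages + 1):
        page = fetch_page(page_index)
        if not page:
            break  # empty page or failed request: stop paging
        results.extend(page)
        sleep(random.uniform(*delay_range))  # pause between requests
    return results


# Usage with a stand-in fetcher that serves two pages and then runs dry
fake_pages = {1: [{"name": "A"}, {"name": "B"}], 2: [{"name": "C"}]}
hotels = fetch_all_pages(lambda i: fake_pages.get(i, []), sleep=lambda s: None)
print([h["name"] for h in hotels])
```

Passing the fetcher and the sleep function in as arguments keeps the paging logic testable without touching the network.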

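Key optimizations:

Randomly rotate User-Agents and proxy IPs
Set a reasonable interval between requests (3-5 seconds suggested)
Handle exceptions
Clean the data (missing values, abnormal prices)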
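
The data cleaning step could look something like the following sketch. The column names match the dictionaries built earlier, while `clean_hotels` and its price bounds are illustrative assumptions rather than rules taken from Ctrip.

```python
import pandas as pd


def clean_hotels(df, min_price=50, max_price=50000):
    """Coerce price and score to numbers, then drop rows whose price
    is missing or outside a plausible range (bounds are assumptions)."""
    out = df.copy()
    out["price"] = pd.to_numeric(out["price"], errors="coerce")
    out["score"] = pd.to_numeric(out["score"], errors="coerce")
    out = out.dropna(subset=["price"])
    out = out[out["price"].between(min_price, max_price)]
    return out.reset_index(drop=True)


# Made-up sample: one missing price, one text price, one absurd price
raw = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "price": [580, None, "1200", 9999999],
    "score": [4.6, 4.2, "N/A", 4.8],
})
cleaned = clean_hotels(raw)
print(cleaned)
```

Rows with a missing score are kept (as NaN) so that price analysis is not affected by hotels that simply have no reviews yet.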

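Ultimate: distributed crawler architecture (enterprise-grade)

For scenarios that require hotel data from across the whole country, a distributed architecture is recommended:

Scrapy-Redis: distributed queue management
Kafka: message queue acting as a buffer
Spark Streaming: real-time processing of the data stream
MongoDB: stores the structured data

Architecture diagram: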

[Crawler node 1] → [Kafka queue] → [Spark processing] → [MongoDB storage]
[Crawler node 2] →                 ↑
[Crawler node 3] →                 ↓
[Monitoring and alerting] ← [Data visualization]
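
To make the data flow concrete, here is a single-process sketch of the same producer/consumer shape. It is only an analogy: `queue.Queue` stands in for the message queue, a plain list stands in for the database, and the crawler emits fake records. None of it uses Kafka, Spark, or MongoDB APIs.

```python
import queue
import threading

SENTINEL = None  # marks that one crawler has finished


def crawler(node_id, cities, out_queue):
    """Stand-in crawler node: emits one fake record per city."""
    for city in cities:
        out_queue.put({"node": node_id, "city": city, "price": 500 + 100 * node_id})
    out_queue.put(SENTINEL)


def processor(in_queue, n_crawlers, store):
    """Stand-in processing stage: consumes until every crawler is done."""
    finished = 0
    while finished < n_crawlers:
        record = in_queue.get()
        if record is SENTINEL:
            finished += 1
            continue
        record["price_band"] = "high" if record["price"] >= 700 else "normal"
        store.append(record)  # a real system would write to the database here


buffer = queue.Queue()  # plays the role of the message queue
store = []              # plays the role of the database
assignments = {1: ["beijing", "tianjin"], 2: ["shanghai"], 3: ["hangzhou", "sanya"]}

workers = [threading.Thread(target=crawler, args=(n, c, buffer))
           for n, c in assignments.items()]
consumer = threading.Thread(target=processor, args=(buffer, len(workers), store))
for t in workers + [consumer]:
    t.start()
for t in workers + [consumer]:
    t.join()

print(len(store))
```

The point of the buffer is decoupling: crawlers can keep fetching while processing lags, and either side can be scaled independently.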

3. Data Visualization and Analysis

The scraped data can be analyzed as follows:

Price trend chart:

import matplotlib.pyplot as plt
import pandas as pd

# Assume we scraped data for 7 consecutive days
dates = ['2025-11-08', '2025-11-09', '2025-11-10', '2025-11-11', '2025-11-12', '2025-11-13', '2025-11-14']
prices = [580, 620, 750, 820, 950, 1200, 1500]  # sample data

plt.figure(figsize=(10, 5))
plt.plot(dates, prices, marker='o', linestyle='-', color='b')
plt.title('7-Day Price Trend for a Beijing Hotel')
plt.xlabel('Date')
plt.ylabel('Price (yuan)')
plt.grid(True)
plt.show()
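
Beyond plotting, the same series can feed a simple alert rule. The sketch below reuses the sample prices and flags any day whose price rose more than a threshold over the previous day. The 20% threshold is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

dates = ['2025-11-08', '2025-11-09', '2025-11-10', '2025-11-11',
         '2025-11-12', '2025-11-13', '2025-11-14']
prices = [580, 620, 750, 820, 950, 1200, 1500]  # same sample data as the chart

series = pd.Series(prices, index=pd.to_datetime(dates), name='price')
pct_change = series.pct_change() * 100  # day-over-day change in percent

ALERT_THRESHOLD = 20.0  # illustrative cut-off, not a recommended value
alerts = pct_change[pct_change > ALERT_THRESHOLD]
print(alerts.round(1))
```

With this sample data, the rule flags 2025-11-10, 2025-11-13, and 2025-11-14.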

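Price distribution chart:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Generate simulated data
np.random.seed(42)
data = np.random.randint(300, 2000, size=(100,))
bins = [0, 500, 1000, 1500, 2000]
# right=False makes each bin include its lower edge and exclude its upper edge
labels = ['300-499', '500-999', '1000-1499', '1500-1999']
grouped = pd.cut(data, bins=bins, labels=labels, right=False)

plt.figure(figsize=(8, 4))
sns.countplot(x=grouped, order=labels)
plt.title('Hotel Price Range Distribution')
plt.xlabel('Price range (yuan)')
plt.ylabel('Number of hotels')
plt.show()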

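Price versus rating scatter plot:

import matplotlib.pyplot as plt
import numpy as np

# Assume we have two columns of data: price and score (simulated here)
prices = np.random.normal(800, 300, 100)
scores = np.random.normal(4.5, 0.5, 100).clip(3, 5)  # clamp ratings to the 3-5 range

plt.figure(figsize=(8, 6))
plt.scatter(prices, scores, alpha=0.6)
plt.title('Hotel Price vs. Rating')
plt.xlabel('Price (yuan)')
plt.ylabel('User rating')
plt.grid(True)
plt.show()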
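
A scatter plot shows the shape of the relationship; a correlation coefficient summarizes it in one number. The sketch below computes Pearson's r with NumPy on a tiny made-up sample, which is not scraped data and only demonstrates the calculation.

```python
import numpy as np

# Made-up prices (yuan) and ratings for five hotels
sample_prices = np.array([300, 500, 800, 1200, 2000])
sample_scores = np.array([3.8, 4.1, 4.4, 4.6, 4.9])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r
r = np.corrcoef(sample_prices, sample_scores)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A value near 1 means pricier hotels tend to score higher in the sample. Simulated data drawn independently, as in the scatter plot code, would give a value near 0.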

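4. Anti-Scraping Countermeasures

Ctrip's anti-scraping mechanisms mainly include:

IP rate limiting: an IP that sends more than 10 requests per minute gets banned
User-Agent detection: non-browser User-Agents are rejected
Cookie validation: requests must carry valid session information
Behavioral verification: a sudden burst of high-frequency requests triggers a CAPTCHA

Countermeasures: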

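Anti-scraping mechanism	Countermeasure	Recommended tools
IP limiting	Use a proxy IP pool	站大爷IP代理, 亿牛云
UA detection	Randomly rotate the User-Agent	Custom UA list
Cookie validation	Send a valid cookie	Copy from the browser
Behavioral verification	Simulate human interaction	Selenium or Playwright
Request frequency	Random delay of 3-10 seconds	time.sleep(random.uniform(3,10))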
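
Random delays space out individual requests, but they do not cap the total per minute. One way to stay under a per-minute ceiling is a sliding-window limiter, sketched below. `SlidingWindowLimiter` is a hypothetical class, and its default of 8 requests per 60 seconds is an assumption chosen to sit below the 10-per-minute figure quoted above.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Block until a request can be sent without exceeding
    max_requests within the last `window` seconds."""

    def __init__(self, max_requests=8, window=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock
        self.sleep = sleep
        self.stamps = deque()  # send times of recent requests

    def _expire(self, now):
        # Forget requests that have left the window
        while self.stamps and now - self.stamps[0] >= self.window:
            self.stamps.popleft()

    def wait(self):
        now = self.clock()
        self._expire(now)
        if len(self.stamps) >= self.max_requests:
            # Sleep until the oldest request leaves the window
            self.sleep(self.window - (now - self.stamps[0]))
            now = self.clock()
            self._expire(now)
        self.stamps.append(now)


class FakeClock:
    """Fake time source so the demonstration finishes instantly."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


fake = FakeClock()
limiter = SlidingWindowLimiter(max_requests=2, window=60.0,
                               clock=fake.now, sleep=fake.sleep)
for _ in range(3):
    limiter.wait()  # the third call has to wait out the window
print(fake.t)
```

In real use, construct the limiter with its defaults and call `limiter.wait()` immediately before each request.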