
Lesson 10: Web Scraping in Practice, from Data Collection to Visual Analysis

news 2025/8/24 8:52:30

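A Python web scraper is an automated data collection tool: it imitates browser behavior to visit web pages and extract the information you need. Implementing one usually involves four steps: sending a request, receiving the page content, parsing the HTML, and storing the data. This article works through a practical example that collects data from several websites, cleans and deduplicates it, visualizes it, and finally assembles the pieces into a small house price analysis system.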
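As a rough sketch of the parse-and-store steps listed above, the snippet below runs on a hard-coded HTML string instead of a live request, so it needs no network access. To stay dependency-free it uses the standard library's html.parser rather than BeautifulSoup, which the rest of this article uses. The markup, the class names, and the helper names parse_listings and to_csv are all made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup standing in for a fetched page; in a real scraper this
# string would come from something like requests.get(url).text.
SAMPLE_HTML = """
<ul>
  <li class="house-item"><span class="title">Flat A</span><span class="price">350</span></li>
  <li class="house-item"><span class="title">Flat B</span><span class="price">420</span></li>
</ul>
"""


class ListingParser(HTMLParser):
    """Collects the text of <span class="title"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished records, one dict per <li>
        self._field = None   # name of the field currently being read
        self._current = {}   # record under construction

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("title", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            self.rows.append(self._current)
            self._current = {}


def parse_listings(html):
    """Parse step: turn raw HTML into a list of records."""
    parser = ListingParser()
    parser.feed(html)
    return parser.rows


def to_csv(rows):
    """Store step: serialize the records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = parse_listings(SAMPLE_HTML)
csv_text = to_csv(rows)
print(csv_text)
```

Keeping parsing and storage in separate functions means either one can be swapped out (for example, BeautifulSoup for parsing or a database for storage) without touching the other.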

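1. Collecting Data from Multiple Websites

When collecting data from several websites, each target site needs its own scraper, because page structures differ from site to site. The example below collects house price data from two different websites.

Example code: collecting house price data from two websites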

import requests
from bs4 import BeautifulSoup
import pandas as pd
 
def fetch_house_prices_from_site1(url):
    """Fetch listings from site 1; return a DataFrame, or None if the request fails."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    # The timeout keeps a stalled server from blocking the script indefinitely
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
        # Assumed page structure for site 1
        items = soup.select('.house-item')
        data = []
        for item in items:
            title = item.select('.title')[0].text.strip()
            price = item.select('.price')[0].text.strip()
            # Column names: 标题 = title, 价格 = price
            data.append({'标题': title, '价格': price})
        return pd.DataFrame(data)
    else:
        print(f'Error fetching data from {url}')
        return None
 
def fetch_house_prices_from_site2(url):
    """Fetch listings from site 2; return a DataFrame, or None if the request fails."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    # The timeout keeps a stalled server from blocking the script indefinitely
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
        # Assumed page structure for site 2
        items = soup.select('.property-item')
        data = []
        for item in items:
            title = item.select('.title')[0].text.strip()
            price = item.select('.price')[0].text.strip()
            # Column names: 标题 = title, 价格 = price
            data.append({'标题': title, '价格': price})
        return pd.DataFrame(data)
    else:
        print(f'Error fetching data from {url}')
        return None
 
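# Example calls
df1 = fetch_house_prices_from_site1('https://example.com/houses')
df2 = fetch_house_prices_from_site2('https://another-example.com/properties')

# Merge the results, skipping any site whose request failed (returned None)
frames = [df for df in (df1, df2) if df is not None]
if frames:
    combined_df = pd.concat(frames, ignore_index=True)
else:
    # Both requests failed: fall back to an empty frame with the expected columns
    combined_df = pd.DataFrame(columns=['标题', '价格'])
print(combined_df.head())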
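The two functions above differ only in the CSS selector for a listing, so one possible refactoring is a single parser driven by a per-site selector table. The sketch below is an illustration, not part of the original example: SITE_SELECTORS and parse_listings are hypothetical names, the sample markup is invented, and it uses the built-in "html.parser" backend so that it runs without lxml. It also uses select_one, so a listing with a missing field is skipped instead of raising IndexError.

```python
from bs4 import BeautifulSoup

# Hypothetical per-site selector table; the class names mirror the assumed
# page structures used in the two functions above.
SITE_SELECTORS = {
    "site1": {"item": ".house-item", "title": ".title", "price": ".price"},
    "site2": {"item": ".property-item", "title": ".title", "price": ".price"},
}


def parse_listings(html, selectors):
    """Extract title/price records from one page using the given selectors."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for item in soup.select(selectors["item"]):
        title = item.select_one(selectors["title"])
        price = item.select_one(selectors["price"])
        if title is None or price is None:
            continue  # skip listings that lack either field
        records.append({
            "标题": title.get_text(strip=True),  # title
            "价格": price.get_text(strip=True),  # price
        })
    return records


# Invented sample page: the second listing has no price and is skipped.
sample = """
<div class="property-item"><h3 class="title">Flat A</h3><span class="price">350</span></div>
<div class="property-item"><h3 class="title">Flat B</h3></div>
"""
records = parse_listings(sample, SITE_SELECTORS["site2"])
print(records)
```

With this layout, supporting a third site means adding one entry to the table rather than copying a whole function.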

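2. Data Cleaning and Deduplication Strategies

Scraped data usually needs cleaning and deduplication, because page structures change, fields go missing, and the same listing can appear more than once. Some common strategies follow.

Data cleaning

Remove empty values: use Pandas' dropna() method to drop rows with missing values.
Handle missing values: depending on the business logic, fill them in (with the mean, median, mode, and so on) or delete the affected rows.
Convert data types: convert numbers stored as strings to numeric types so they can be analyzed.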
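The three cleaning steps above can be tried on a small hand-made table. This is a minimal sketch with invented rows; it assumes prices arrive as strings that may carry a unit suffix such as "410万", which is a guess about the scraped format rather than something the target sites guarantee.

```python
import pandas as pd

# Invented sample rows: one missing price, one missing title, one price with a unit suffix.
raw = pd.DataFrame({
    "标题": ["Flat A", "Flat B", None, "Flat D"],   # title
    "价格": ["350", None, "500", "410万"],           # price
})

# 1. Remove empty values: drop rows that have no title at all.
df = raw.dropna(subset=["标题"]).copy()

# 2. Convert data types: keep the first numeric run of each price string,
#    so "410万" becomes 410.0 and a missing price stays NaN.
df["价格"] = pd.to_numeric(
    df["价格"].str.extract(r"(\d+(?:\.\d+)?)", expand=False),
    errors="coerce",
)

# 3. Handle missing values: fill the remaining gaps with the median price.
df["价格"] = df["价格"].fillna(df["价格"].median())

print(df)
```

Converting to numbers before filling matters: a mean or median cannot be computed on a column of strings.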

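Data deduplication

Deduplicate on a single column: use Pandas' drop_duplicates() method to keep one row per value of a chosen column.
Deduplicate on a combination of columns: treat rows as duplicates only when several columns all match, which avoids discarding distinct records that happen to share one value.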
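The difference between the two strategies shows up on a small invented table in which three rows share a title but only two of them also share a price:

```python
import pandas as pd

# Invented sample: "Flat A" appears three times, twice at the same price.
df = pd.DataFrame({
    "标题": ["Flat A", "Flat A", "Flat A", "Flat B"],   # title
    "价格": [350.0, 350.0, 360.0, 420.0],               # price
})

# Single column: one row per title, keeping the first occurrence.
by_title = df.drop_duplicates(subset=["标题"])

# Column combination: a row is a duplicate only if title AND price both match.
by_title_and_price = df.drop_duplicates(subset=["标题", "价格"], keep="first")

print(len(by_title), len(by_title_and_price))
```

Single-column deduplication is the more aggressive choice: here it discards the 360.0 listing, which may be a genuinely different property with the same title.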

Example code: data cleaning and deduplication

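# combined_df is the merged DataFrame from the previous step

# Remove empty values: drop rows with no title; copy() avoids
# SettingWithCopyWarning on the assignments below
cleaned_df = combined_df.dropna(subset=['标题']).copy()

# Type conversion: extract the number from strings such as "350万" (assumed
# format); anything that cannot be parsed becomes NaN
cleaned_df['价格'] = pd.to_numeric(
    cleaned_df['价格'].astype(str).str.extract(r'(\d+(?:\.\d+)?)', expand=False),
    errors='coerce')

# Handle missing values (example: fill with the mean); this must come after
# the conversion, since a mean cannot be computed on strings
cleaned_df['价格'] = cleaned_df['价格'].fillna(cleaned_df['价格'].mean())

# Deduplicate by title
deduplicated_df = cleaned_df.drop_duplicates(subset=['标题'])

print(deduplicated_df.head())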

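3. Generating Charts with Matplotlib

Matplotlib is a widely used Python visualization library that can produce many chart types, including bar charts, line charts, and scatter plots. The example below uses it to draw a histogram of the house price distribution.

Example code: house price distribution histogram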

import matplotlib.pyplot as plt
 
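# Draw a histogram of the house price distribution.
# Labels are in English because Matplotlib's default font cannot render
# Chinese characters without extra font configuration.
plt.figure(figsize=(10, 6))
plt.hist(deduplicated_df['价格'], bins=30, color='skyblue', edgecolor='black')
plt.xlabel('Price (10,000 CNY)')
plt.ylabel('Count')
plt.title('House Price Distribution')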
plt.grid(True)
plt.show()
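plt.show() needs a display, which a server or a scheduled scraping job usually lacks. One common alternative, sketched below, is to select the non-interactive Agg backend and write the figure to a file. The price values and the output file name are invented for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: renders to files, needs no display
import matplotlib.pyplot as plt

prices = [320, 350, 355, 410, 420, 480, 510, 515, 600]  # made-up sample values

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(prices, bins=5, color="skyblue", edgecolor="black")
ax.set_xlabel("Price (10,000 CNY)")
ax.set_ylabel("Count")
ax.set_title("House Price Distribution")

# Write the chart to the system temp directory instead of opening a window.
out_path = os.path.join(tempfile.gettempdir(), "price_hist.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)  # release the figure's memory once it is saved

print(out_path)
```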

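4. Complete Project: House Price Analysis System

The following example puts the pieces together into a complete house price analysis system covering data collection, cleaning, deduplication, and visualization.

Project structure

House Price Analysis System/
├── data_collection.py     # data collection module
├── data_cleaning.py       # data cleaning module
├── data_visualization.py  # data visualization module
├── main.py                # main program
└── requirements.txt       # project dependencies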

main.py

import pandas as pd  # needed for pd.concat below

from data_collection import fetch_house_prices_from_site1, fetch_house_prices_from_site2
from data_cleaning import clean_and_deduplicate_data
from data_visualization import plot_house_price_histogram

def main():
    # Collect data
    df1 = fetch_house_prices_from_site1('https://example.com/houses')
    df2 = fetch_house_prices_from_site2('https://another-example.com/properties')

    # Merge data, skipping any site whose request failed (returned None)
    frames = [df for df in (df1, df2) if df is not None]
    if not frames:
        print('No data collected')
        return
    combined_df = pd.concat(frames, ignore_index=True)

    # Clean and deduplicate
    cleaned_df = clean_and_deduplicate_data(combined_df)

    # Visualize
    plot_house_price_histogram(cleaned_df)

if __name__ == '__main__':
    main()

data_collection.py

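import requests
from bs4 import BeautifulSoup
import pandas as pd

def fetch_house_prices_from_site1(url):
    # ... (body identical to the example code in Section 1)
    ...  # placeholder so the file parses; copy the body from Section 1

def fetch_house_prices_from_site2(url):
    # ... (body identical to the example code in Section 1)
    ...  # placeholder so the file parses; copy the body from Section 1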

data_cleaning.py

import pandas as pd
 
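def clean_and_deduplicate_data(df):
    # Remove empty values: drop rows with no title; copy() avoids
    # SettingWithCopyWarning on the assignments below
    cleaned_df = df.dropna(subset=['标题']).copy()

    # Type conversion: extract the number from strings such as "350万"
    # (assumed format); anything that cannot be parsed becomes NaN
    cleaned_df['价格'] = pd.to_numeric(
        cleaned_df['价格'].astype(str).str.extract(r'(\d+(?:\.\d+)?)', expand=False),
        errors='coerce')

    # Handle missing values (example: fill with the mean), after conversion
    cleaned_df['价格'] = cleaned_df['价格'].fillna(cleaned_df['价格'].mean())

    # Deduplicate by title
    deduplicated_df = cleaned_df.drop_duplicates(subset=['标题'])

    return deduplicated_df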

data_visualization.py

import matplotlib.pyplot as plt
 
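def plot_house_price_histogram(df):
    # Draw a histogram of the house price distribution.
    # Labels are in English because Matplotlib's default font cannot render
    # Chinese characters without extra font configuration.
    plt.figure(figsize=(10, 6))
    plt.hist(df['价格'], bins=30, color='skyblue', edgecolor='black')
    plt.xlabel('Price (10,000 CNY)')
    plt.ylabel('Count')
    plt.title('House Price Distribution')
    plt.grid(True)
    plt.show()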

requirements.txt

requests==2.25.1
beautifulsoup4==4.9.3
pandas==1.3.3
matplotlib==3.4.3
# Required by BeautifulSoup(..., 'lxml') in data_collection.py; version left unpinned
lxml

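Conclusion

The examples in this article show how Python's scraping and visualization tools fit together. From collection through cleaning to charting, Python offers a complete toolchain: requests and BeautifulSoup fetch and parse the pages, Pandas cleans and deduplicates the records, and Matplotlib turns the result into a chart that makes the underlying distribution easier to read. The same pipeline structure applies to data sources other than house prices.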

Follow me for more Python content! 🫵
