
Building the Dalian Instruction Dataset: Data Collection and Preprocessing, Part 02

news 2025/10/20 15:45:20

1. Qunar crawler

Programming language: Python
Crawler framework: Selenium (browser automation)
Parsing library: BeautifulSoup (HTML parsing)

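2. Crawling strategy

Target site: Qunar (https://travel.qunar.com/travelbook/list.htm?order=hot_heat)
Target data: travel guides for Dalian
Workflow:
1. Open the Qunar travel-guide page and run a search.
2. Extract the travel-guide links from the search result pages.
3. Visit each travel-guide page in turn, extract its content, and save it to a local file in JSON format.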

3. Collecting the travel-guide links

3.1 Import libraries

import re
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

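3.2 Configure the file path and the ChromeDriver path. Point the ChromeDriver path at the chromedriver.exe that matches your Chrome installation, and point the file path at the location where the links should be stored.

# File path and ChromeDriver path
file_path = 'D:\\Pycharm\\space\\Qunaer\\去哪儿大连链接.txt'  # set to your actual location
chrome_driver_path = "D:\\chromedriver\\chromedriver.exe"  # set to your actual location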

3.3 Initialize the Selenium WebDriver

s = Service(chrome_driver_path)
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(service=s, options=options)

3.4 Collect the Qunar guide links for Dalian and store them in a txt file.

# Open the Qunar travel-guide listing page
driver.get("https://travel.qunar.com/travelbook/list.htm?order=hot_heat")

# Wait for the search box, then search for "大连" (Dalian)
wait = WebDriverWait(driver, 10)
search_xpath = '/html/body/div[3]/div/div[2]/div[1]/div[1]/input[1]'
search_input = wait.until(EC.visibility_of_element_located((By.XPATH, search_xpath)))
search_input.send_keys('大连')
time.sleep(1)
driver.find_element(By.CLASS_NAME, 'sub_btn').click()
time.sleep(3)  # give the results page time to load

# Wait for the pagination bar first, then parse, so the parsed HTML is the results page
wait.until(EC.visibility_of_element_located((By.CLASS_NAME, 'b_paging')))
soup = BeautifulSoup(driver.page_source, 'html.parser')
paging_div = soup.find('div', class_='b_paging')

# Scan the pagination links; the largest page number is the page count
max_page = 0
href = None
if paging_div is not None:
    for a_tag in paging_div.find_all('a'):
        if a_tag.get('data-beacon') == 'click_result_page':
            href = a_tag['href']
            page_number = int(href.split('/')[-1].split('.')[0])
            max_page = max(max_page, page_number)

if href:
    print(f"Maximum page number: {max_page} {href}")
    # Strip the trailing "<n>.htm" to get the URL prefix shared by all result pages
    base_href = re.sub(r'\d+\.htm$', '', href)

    for i in range(max_page):
        link = "https:" + base_href + str(i + 1) + ".htm"
        time.sleep(1)
        driver.get(link)
        # Wait for the list items before taking the page source
        wait.until(EC.visibility_of_element_located((By.CLASS_NAME, 'list_item')))
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        list_items = soup.find_all('li', {'class': 'list_item'})

        # Append each guide's data-url to the link file, one per line
        with open(file_path, 'a', encoding='utf-8') as file:
            for item in list_items:
                data_url = item.get('data-url')
                if data_url:
                    file.write(data_url + '\n')
                else:
                    print("No data-url found for this <li> element.")
else:
    print("The b_paging div is missing or doesn't contain any page links.")

driver.quit()  # close the browser once all links are saved
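The page-count logic above depends only on the trailing `<n>.htm` of each pagination link. As a standalone illustration, here is a minimal sketch of that URL handling without a browser. The helper names `split_page_href` and `build_page_links` are introduced here for illustration, and the sample hrefs use a made-up `example.com` path rather than Qunar's real URL layout.

```python
import re

# Matches the page number in a link ending with "<n>.htm"
_PAGE_RE = re.compile(r'(\d+)\.htm$')


def split_page_href(href):
    """Hypothetical helper: split a pagination href into (prefix, page number)."""
    match = _PAGE_RE.search(href)
    if match is None:
        raise ValueError(f"not a pagination link: {href!r}")
    return href[:match.start()], int(match.group(1))


def build_page_links(hrefs):
    """Return every result-page URL implied by a set of pagination hrefs."""
    prefix, max_page = None, 0
    for href in hrefs:
        base, page = split_page_href(href)
        prefix = base
        max_page = max(max_page, page)
    if prefix is None:
        return []  # no pagination links at all
    # Protocol-relative hrefs ("//host/...") need the scheme added
    return [f"https:{prefix}{n}.htm" for n in range(1, max_page + 1)]


# Made-up hrefs, for illustration only
sample = ["//example.com/search/gonglue/1/hot_heat/2.htm",
          "//example.com/search/gonglue/1/hot_heat/5.htm"]
links = build_page_links(sample)
print(len(links), links[0], links[-1])
```

Keeping this step as a pure function makes it possible to check the URL arithmetic before starting a long crawl.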

3.5 Run result:

4. Fetching and processing the travel-guide content

4.1 Import libraries

import json
import time
from datetime import datetime
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

4.2 Definitions of one dictionary and three functions

The code comes first, followed by an analysis of what each definition does.

chinese_months = {
    1: '一月',
    2: '二月',
    3: '三月',
    4: '四月',
    5: '五月',
    6: '六月',
    7: '七月',
    8: '八月',
    9: '九月',
    10: '十月',
    11: '十一月',
    12: '十二月'
}


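def remove_after_comment(text):
    """Return text truncated at the first "【备注】" (remarks) marker, if present."""
    index = text.find("【备注】")
    if index != -1:
        return text[:index]
    else:
        return text

def get_season(month):
    """Map a month number (1-12) to its Chinese season name."""
    if 3 <= month <= 5:
        return '春天'
    elif 6 <= month <= 8:
        return '夏天'
    elif 9 <= month <= 11:
        return '秋天'
    else:
        return '冬天'

def change(text):
    """Convert one metadata string into tags and append them to the module-level result_list."""
    if '/' in text:
        # Date such as 2025/02/25: record the month name and the season
        try:
            date_obj = datetime.strptime(text, '%Y/%m/%d')
            month_name = chinese_months[date_obj.month]
            season = get_season(date_obj.month)
            result_list.append(month_name)
            result_list.append(season)
        except ValueError:
            pass  # contains '/' but is not a YYYY/MM/DD date: ignored
    elif text.isdigit():
        # Heuristic: numbers below 99 are treated as days, the rest as a cost in yuan
        if int(text) < 99:
            result_list.append(f'{text}天')
        else:
            result_list.append(f'{text}元')
    else:
        # Any other text is kept once (deduplicated)
        if text not in result_list:
            result_list.append(text)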

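The dictionary and the three functions are explained in detail below.

1. The `chinese_months` dictionary
Function: maps Gregorian month numbers (1-12) to Chinese month names.
Purpose: provides a standardized month translation that matches Chinese usage.

2. The `remove_after_comment` function
Function: removes "【备注】" (the remarks marker) and everything after it from the text.
Purpose: strips the remarks from the data and keeps only the main text.

3. The `get_season` function
Function: returns the season for a given month.
Purpose: maps a month to a Chinese season name, following the meteorological division commonly used in China (March to May is spring, June to August is summer, and so on). Note that China covers a vast area and actual seasons vary by region; this is only a generic rule.

4. The `change` function
Function: a multi-branch text converter that handles dates, numbers, and other text. It appends its results to the module-level `result_list` rather than returning them.
Purpose:
(1) Dates: recognizes dates of the form `YYYY/MM/DD` and extracts the Chinese month and season (using `chinese_months` and `get_season`). For example, the input `"2025/02/25"` appends `"二月"` and `"冬天"`. Text that contains `/` but is not a valid date is silently dropped.
(2) Numbers:
- Numbers below 99 get the suffix "天" (days), for example `30` → `"30天"`.
- Numbers of 99 or more get the suffix "元" (yuan), for example `100` → `"100元"`.
(3) Other text: appended unchanged, but only if it is not already in the list (deduplication).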
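Because `change` writes into a shared list, it is awkward to try out in isolation. The sketch below is a side-effect-free variant that returns the tags for a single field and applies the same rules as described above. The name `field_to_tags` and the month list `MONTH_NAMES` are introduced here for illustration and are not part of the original script; deduplication of free text is left to the caller.

```python
from datetime import datetime

MONTH_NAMES = ['一月', '二月', '三月', '四月', '五月', '六月',
               '七月', '八月', '九月', '十月', '十一月', '十二月']


def get_season(month):
    """Map a month number (1-12) to its Chinese season name."""
    if 3 <= month <= 5:
        return '春天'
    if 6 <= month <= 8:
        return '夏天'
    if 9 <= month <= 11:
        return '秋天'
    return '冬天'


def field_to_tags(text):
    """Hypothetical pure variant of change(): return the tags for one field."""
    if '/' in text:
        try:
            month = datetime.strptime(text, '%Y/%m/%d').month
        except ValueError:
            return []  # not a valid date: dropped, as in change()
        return [MONTH_NAMES[month - 1], get_season(month)]
    if text.isdigit():
        # Below 99 is read as days, otherwise as a cost in yuan
        return [f'{text}天' if int(text) < 99 else f'{text}元']
    return [text]


for sample in ['2025/02/25', '30', '100', '海滨', '2025/13/01']:
    print(sample, '->', field_to_tags(sample))
```

Note the boundary case: `99` is tagged as a cost, because the threshold test is strictly "less than 99".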

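4.3 Configure the file path and the ChromeDriver path. Point the ChromeDriver path at the chromedriver.exe that matches your Chrome installation, and point the file path at the location where the links were stored.

# File path and ChromeDriver path
file_path = 'D:\\Pycharm\\space\\Qunaer\\去哪儿大连链接.txt'  # link file written in section 3
chrome_driver_path = "D:\\chromedriver\\chromedriver.exe"  # ChromeDriver path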

4.4 Initialize the Selenium WebDriver

s = Service(chrome_driver_path)
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(service=s, options=options)

4.5 Open the "去哪儿大连链接.txt" file generated in 3.4 (result shown in 3.5), read the links from it, and crawl the content of each Dalian travel guide.

with open(file_path, 'r', encoding='utf-8') as link_file:
    for line in link_file:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        result_list = []  # tags for this guide; filled in by change()
        last_part = line.rsplit('/', 1)[-1]  # last path segment of the saved link

        print(last_part)
        time.sleep(1)
        driver.get("https://travel.qunar.com/travelbook/note/" + last_part)
        driver.maximize_window()
        try:
            wait = WebDriverWait(driver, 10)
            wait.until(EC.visibility_of_element_located((By.CLASS_NAME, 'foreword_list')))
        except TimeoutException:
            continue  # no foreword list appeared: skip this guide

        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # The foreword list holds the metadata fields that change() turns into tags
        ul = soup.find('ul', {'class': 'foreword_list'})
        if ul is None:
            continue
        for data_element in ul.find_all('span', {'class': 'data'}):
            for text in data_element.stripped_strings:
                change(text)

        # Build the instruction string as '#tag1#tag2...'
        inputxt = ''.join('#' + str(r) for r in result_list)

        text_element = soup.find('div', class_='e_main')
        if text_element is None:
            continue  # no main body found: skip this guide
        text_content = text_element.get_text(strip=True, separator='\n')

        data = {
            'instruction': inputxt,
            'summary': '',
            'output': remove_after_comment(text_content)
        }

        # Append one JSON object per guide; the file becomes a sequence of
        # objects, not a single JSON array
        with open('D:\\Pycharm\\space\\Qunaer\\去哪儿大连.json', 'a', encoding='utf-8') as out_file:
            json.dump(data, out_file, ensure_ascii=False, indent=4)
            out_file.write('\n')

time.sleep(3)
driver.quit()
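Appending one indented `json.dump` per guide produces a file of concatenated JSON objects, which `json.load` cannot read in a single call. One possible way to read such a file back is sketched below using the standard library's `json.JSONDecoder.raw_decode`. The helper name `iter_json_objects` is introduced here for illustration, and the sample records are made up.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # Skip the whitespace and newlines between objects
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return
        # raw_decode returns the parsed value and the index where it ended
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Made-up records in the same shape as the crawler's output
records = [
    {'instruction': '#二月#冬天#3天', 'summary': '', 'output': '示例正文'},
    {'instruction': '#七月#夏天', 'summary': '', 'output': '另一段正文'},
]
# Simulate the file content: one indented object after another
text = ''.join(json.dumps(r, ensure_ascii=False, indent=4) + '\n' for r in records)

loaded = list(iter_json_objects(text))
print(len(loaded), loaded[0]['instruction'])
```

To apply this to the real output, read the whole file as UTF-8 text and pass that string to `iter_json_objects`.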

4.6 Run result