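Python Web Scraping in Practice: Scraping Douban Movie Reviews and Fetching Weather Information

1. Scraping Douban Movie Review Data

Project Goal

Use Python to scrape the review data for Puss in Boots 2 (《穿靴子的猫2》) from every review page on Douban Movie, and store it in JSON format.

Implementation Steps

Use the Selenium library to click through to the movie's full review list.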

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

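# Create the browser driver
driver = webdriver.Chrome()

# Open the Douban movie page
driver.get('https://movie.douban.com/subject/25868125/')

# Wait for the "all reviews" link to become clickable, then click it.
# The XPath assumes the link carries the class 'btn btn-default'; the outer
# string uses double quotes so the inner single quotes do not end it early.
wait = WebDriverWait(driver, 10)
all_comments_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//a[@class='btn btn-default']")))
all_comments_button.click()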
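
WebDriverWait works by polling: it calls the given condition repeatedly until the condition returns a truthy value or the timeout elapses. The sketch below illustrates that idea without a browser. The names wait_until and button_ready are hypothetical stand-ins, not Selenium APIs, and the built-in TimeoutError is used here in place of Selenium's own TimeoutException.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll_interval)


# Usage: a stand-in condition that only succeeds on the third poll
attempts = {'count': 0}


def button_ready():
    attempts['count'] += 1
    return 'button' if attempts['count'] >= 3 else None


found = wait_until(button_ready, timeout=2.0, poll_interval=0.01)
```

Polling with a deadline, rather than sleeping for a fixed time, lets the script continue as soon as the element is ready while still failing cleanly if it never appears.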

Starting from 'https://movie.douban.com/subject/25868125/comments?start=0&limit=20&status=P&sort=new_score', scrape the reviewer name, review time, and review text for every review on the first page.

import requests
from bs4 import BeautifulSoup

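# Fetch the first page of reviews. Douban tends to reject the default
# requests User-Agent, so a browser-like one is sent (the value is an assumption).
url = 'https://movie.douban.com/subject/25868125/comments?start=0&limit=20&status=P&sort=new_score'
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
soup = BeautifulSoup(response.text, 'html.parser')

# Parse the reviews: one 'comment-item' div per review
comments = soup.find_all('div', class_='comment-item')
comment_data = []
for comment in comments:
    author = comment.find('span', class_='comment-info').find('a').text
    # get_text(strip=True) removes the whitespace around the timestamp
    time = comment.find('span', class_='comment-time').get_text(strip=True)
    content = comment.find('span', class_='short').text
    comment_data.append({'author': author, 'time': time, 'content': content})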

Continue by scraping the reviewer name, review time, and review text for every review on pages 2 and 3.

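# Loop over pages 2 and 3. Page 1 starts at start=0, so page n starts at
# (n - 1) * 20; using page * 20 would fetch pages 3 and 4 instead.
for page in range(2, 4):
    url = f'https://movie.douban.com/subject/25868125/comments?start={(page - 1) * 20}&limit=20&status=P&sort=new_score'
    # A browser-like User-Agent is sent here too (the value is an assumption)
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    comments = soup.find_all('div', class_='comment-item')
    for comment in comments:
        author = comment.find('span', class_='comment-info').find('a').text
        time = comment.find('span', class_='comment-time').get_text(strip=True)
        content = comment.find('span', class_='short').text
        comment_data.append({'author': author, 'time': time, 'content': content})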
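
Because the first page uses start=0 with limit=20, page n corresponds to start=(n - 1) * 20. A small helper can make that mapping explicit and keep the query string in one place. This is a minimal sketch; build_page_url, BASE_URL, and PAGE_SIZE are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/subject/25868125/comments'
PAGE_SIZE = 20  # matches the limit=20 parameter in the starting URL


def build_page_url(page, page_size=PAGE_SIZE):
    """Return the review-list URL for a 1-based page number."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    params = {
        'start': (page - 1) * page_size,  # offset of the first review on the page
        'limit': page_size,
        'status': 'P',
        'sort': 'new_score',
    }
    return f'{BASE_URL}?{urlencode(params)}'


# URLs for pages 1 to 3
page_urls = [build_page_url(page) for page in range(1, 4)]
```

Building every URL through one function avoids off-by-one mistakes when the page range changes.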

Save the scraped data to a file in JSON format.

import json

# Write the review data to a JSON file
with open('comments.json', 'w', encoding='utf-8') as f:
    json.dump(comment_data, f, ensure_ascii=False, indent=4)
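
The ensure_ascii=False argument matters for Chinese review text: with the default setting, json escapes every non-ASCII character as a \uXXXX sequence, which is still valid JSON but hard to read. The sketch below compares the two settings and checks that the file round-trips. The sample record is made up for illustration and is not real scraped data.

```python
import json
import os
import tempfile

# A made-up record in the same shape as the scraped data
sample = [{'author': '小明', 'time': '2023-01-01 12:00:00', 'content': '很好看'}]

escaped = json.dumps(sample)                       # default: ensure_ascii=True
readable = json.dumps(sample, ensure_ascii=False)  # keeps Chinese characters as-is

# Write to a temporary file and read it back to confirm nothing is lost
path = os.path.join(tempfile.mkdtemp(), 'comments.json')
with open(path, 'w', encoding='utf-8') as f:
    json.dump(sample, f, ensure_ascii=False, indent=4)
with open(path, encoding='utf-8') as f:
    loaded = json.load(f)
```

Opening the file with encoding='utf-8' for both writing and reading keeps the result independent of the platform's default encoding.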

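2. Scraping Weather Information

Project Goal

Design a simple crawler that scrapes weather information from 'http://www.weather.com.cn/' and stores it in an Excel file.

Requirements

Scrape the city name, weather conditions, high temperature, low temperature, wind direction, and wind force.
Store the data in an Excel file named after the current date, with each city's information on its own sheet.
The program must determine the current date automatically and write a run log.
The program must handle exceptions, including network interruptions.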
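
One way to meet the last two requirements together is to retry a failed request a few times and log every failure. The sketch below assumes that approach; it is not part of the original design. fetch_with_retry and flaky_fetch are hypothetical names, and the fetch step is passed in as a callable so the example runs without network access. Catching OSError also covers requests' RequestException, which derives from IOError.

```python
import logging
import time

logger = logging.getLogger('weather_crawler')


def fetch_with_retry(fetch, retries=3, delay=1.0, exceptions=(OSError,)):
    """Call fetch() up to `retries` times, logging each failure.

    Returns fetch()'s result, or re-raises the last error if every attempt fails.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except exceptions as exc:
            logger.warning('attempt %d/%d failed: %s', attempt, retries, exc)
            if attempt == retries:
                raise
            time.sleep(delay)  # pause before the next attempt


# Usage: a stand-in fetch that fails twice, then succeeds
calls = {'count': 0}


def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('network interrupted')
    return '<html>ok</html>'


page = fetch_with_retry(flaky_fetch, retries=3, delay=0.01)
```

In the crawler, the callable could wrap the real request, for example lambda: requests.get(url, timeout=10).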

Code Implementation

import requests
from bs4 import BeautifulSoup
import datetime
import openpyxl
import logging

# Configure logging to a file
logging.basicConfig(filename='weather_crawler.log', level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Get the current date
today = datetime.date.today().strftime('%Y-%m-%d')

# Create the Excel workbook
workbook = openpyxl.Workbook()

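# Forecast pages are addressed by station code rather than by city name.
# The codes below are assumptions; verify them against the site before use.
city_codes = {'北京': '101010100', '上海': '101020100', '广州': '101280101', '深圳': '101280601'}

# Fetch the weather information for each city in turn
for city, code in city_codes.items():
    try:
        # Request the page; the timeout keeps a dead connection from hanging
        url = f'http://www.weather.com.cn/weather/{code}.shtml'
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # raise on an HTTP error status
        response.encoding = response.apparent_encoding  # decode Chinese text correctly

        # Parse the page; each <li> in the list is treated as one forecast day
        soup = BeautifulSoup(response.text, 'html.parser')
        weather_info = soup.find('ul', class_='t clearfix').find_all('li')
        today_item = weather_info[0]

        # Extract today's fields (assumed layout: p.wea, p.tem, p.win in each <li>)
        city_name = city
        weather = today_item.find('p', class_='wea').get_text(strip=True)
        tem = today_item.find('p', class_='tem')
        high_tag = tem.find('span')  # the high temperature may be absent
        high_temp = high_tag.get_text(strip=True) if high_tag else ''
        low_temp = tem.find('i').get_text(strip=True)
        win = today_item.find('p', class_='win')
        wind_direction = '/'.join(span.get('title', '') for span in win.find_all('span'))
        wind_force = win.find('i').get_text(strip=True)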

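        # Create a worksheet for this city
        worksheet = workbook.create_sheet(title=city_name)

        # Write the header row and the data row
        worksheet['A1'] = 'City'
        worksheet['B1'] = 'Weather'
        worksheet['C1'] = 'High temperature'
        worksheet['D1'] = 'Low temperature'
        worksheet['E1'] = 'Wind direction'
        worksheet['F1'] = 'Wind force'
        worksheet['A2'] = city_name
        worksheet['B2'] = weather
        worksheet['C2'] = high_temp
        worksheet['D2'] = low_temp
        worksheet['E2'] = wind_direction
        worksheet['F2'] = wind_force

        logging.info(f'Fetched weather information for {city}.')
    except Exception as e:
        logging.error(f'Failed to fetch weather information for {city}: {e}')

# Workbook() starts with one empty default sheet; drop it once at least one
# city sheet exists (a workbook must keep at least one sheet to be saved)
if len(workbook.sheetnames) > 1:
    workbook.remove(workbook.active)

# Save the Excel file
workbook.save(f'{today}_weather.xlsx')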
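
The values written to the workbook above are plain text. If the temperatures are to be sorted or charted in Excel, they need to be converted to numbers first. The sketch below assumes the scraped text looks like '25℃' or '-3℃', which is an assumption about the page content; parse_temperature is a hypothetical helper, not part of the original program.

```python
import re

# An optional minus sign followed by digits, e.g. the '25' in '25℃'
_TEMPERATURE = re.compile(r'-?\d+')


def parse_temperature(text):
    """Return the first integer found in `text`, or None if there is none."""
    match = _TEMPERATURE.search(text or '')
    return int(match.group()) if match else None


high = parse_temperature('25℃')
low = parse_temperature('-3℃')
missing = parse_temperature('')
```

Returning None for a missing value leaves the cell empty instead of writing a misleading zero.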

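Summary

Through two hands-on projects, this assignment walks through the workflow of scraping web data with Python. It covers the Selenium, BeautifulSoup, requests, json, and openpyxl libraries, as well as exception handling and logging. After working through it, you should have the basic skills for web scraping and be able to adapt a crawler to your own needs.

Notes

This assignment has a design part worth 20 points and an implementation part worth 80 points. Submit the code together with screenshots of the results.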