Python Web Scraping and Data Analysis in Practice: Analyzing Douban Reviews of Puss in Boots 2 (《穿靴子的猫2》)

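This project uses Python, together with the Selenium library and data-analysis techniques, to scrape the review data for Puss in Boots 2 from Douban Movies across all review pages and then analyze it in depth.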

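I. Scrape the review data with a web crawler

Target site: the Douban Movies review page for Puss in Boots 2. URL: https://movie.douban.com/subject/25868125/

Steps:

Use Selenium to click through to the film's full list of reviews.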

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

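# Start the browser driver
driver = webdriver.Chrome()
# Open the target page
driver.get('https://movie.douban.com/subject/25868125/')
# Wait for the "all reviews" link to become clickable
wait = WebDriverWait(driver, 10)
# The class value is left blank in the original; inspect the page and fill it in
all_comments_button = wait.until(EC.element_to_be_clickable((By.XPATH, '//a[@class=""]')))
# Click the "all reviews" link
all_comments_button.click()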

Starting from 'https://movie.douban.com/subject/25868125/comments?start=0&limit=20&status=P&sort=new_score', scrape the reviewer name, review time, and review text for each review on the first page.

from bs4 import BeautifulSoup
import requests

# Fetch the first page of reviews
url = 'https://movie.douban.com/subject/25868125/comments?start=0&limit=20&status=P&sort=new_score'
# A browser-like User-Agent is sent because Douban may reject the library default
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')
# Extract the reviewer name, review time, and review text from each item
comments = soup.find_all('div', class_='comment-item')
comment_data = []
for comment in comments:
    author = comment.find('span', class_='comment-info').find('a').text.strip()
    comment_time = comment.find('span', class_='comment-time').text.strip()
    content = comment.find('span', class_='short').text.strip()
    comment_data.append({'author': author, 'time': comment_time, 'content': content})

Continue by scraping the reviewer name, review time, and review text for every review on pages 2 and 3.

# Loop over pages 2 and 3 (start=20 and start=40)
for i in range(1, 3):
    url = f'https://movie.douban.com/subject/25868125/comments?start={i * 20}&limit=20&status=P&sort=new_score'
    # A browser-like User-Agent is sent because Douban may reject the library default
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    comments = soup.find_all('div', class_='comment-item')
    for comment in comments:
        author = comment.find('span', class_='comment-info').find('a').text.strip()
        comment_time = comment.find('span', class_='comment-time').text.strip()
        content = comment.find('span', class_='short').text.strip()
        comment_data.append({'author': author, 'time': comment_time, 'content': content})
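
The page URLs differ only in the start offset, which advances by 20 per page. As a minimal sketch of that pagination rule, the helper below builds the URL for any 1-based page number. The names page_url and iter_page_urls and the delay_seconds parameter are hypothetical helpers introduced here for illustration; they are not part of the original code.

```python
import time

BASE_URL = 'https://movie.douban.com/subject/25868125/comments'
PAGE_SIZE = 20  # matches limit=20 in the URLs used above


def page_url(page):
    """Return the comments URL for a 1-based page number."""
    start = (page - 1) * PAGE_SIZE
    return f'{BASE_URL}?start={start}&limit={PAGE_SIZE}&status=P&sort=new_score'


def iter_page_urls(first_page, last_page, delay_seconds=0.0):
    """Yield URLs for first_page..last_page inclusive, pausing between pages."""
    for page in range(first_page, last_page + 1):
        if page > first_page and delay_seconds > 0:
            # Optional polite pause between requests; the value is an assumption
            time.sleep(delay_seconds)
        yield page_url(page)


# Pages 2 and 3 correspond to start=20 and start=40
urls = list(iter_page_urls(2, 3))
print(urls)
```

Deriving the offset from the page number in one place makes it harder to fetch one page too many or too few when the page range changes.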

Save the scraped data to a file in JSON format.

import json

# Save the data to a JSON file (ensure_ascii=False keeps Chinese text readable)
with open('comments_data.json', 'w', encoding='utf-8') as f:
    json.dump(comment_data, f, ensure_ascii=False, indent=4)
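
Before starting the analysis it can help to confirm that the file reads back unchanged. The sketch below writes a single made-up record to a temporary directory, so it does not touch the real comments_data.json, and then loads it again. The sample record is invented for illustration only.

```python
import json
import os
import tempfile

# One made-up record with the same keys the scraper produces
sample = [{'author': 'reviewer_a', 'time': '2023-01-01 12:00:00', 'content': '很好看'}]

# Write to a temporary location rather than the real output file
path = os.path.join(tempfile.mkdtemp(), 'comments_data.json')
with open(path, 'w', encoding='utf-8') as f:
    json.dump(sample, f, ensure_ascii=False, indent=4)

# Load the file back as Python objects
with open(path, encoding='utf-8') as f:
    loaded = json.load(f)

# Read the raw text to check that Chinese characters were not escaped
with open(path, encoding='utf-8') as f:
    raw_text = f.read()

print(loaded == sample, '很好看' in raw_text)
```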

II. Analyze the scraped data

1. Find the reviews with the highest and the lowest ratings.
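
Note that the scraping code in Part I collects only the author, time, and content, so a rating has to be scraped as well before this task can be done. The sketch below assumes each record carries a hypothetical 'rating' field (an integer number of stars, or None when the reviewer gave no rating); the records themselves are made up.

```python
# Hypothetical records: the 'rating' key is an assumption and is not
# produced by the scraping code in Part I.
reviews = [
    {'author': 'reviewer_a', 'rating': 5, 'content': 'sample text 1'},
    {'author': 'reviewer_b', 'rating': 2, 'content': 'sample text 2'},
    {'author': 'reviewer_c', 'rating': None, 'content': 'sample text 3'},
    {'author': 'reviewer_d', 'rating': 5, 'content': 'sample text 4'},
]

# Ignore reviews that carry no rating
rated = [r for r in reviews if r.get('rating') is not None]

highest_score = max(r['rating'] for r in rated)
lowest_score = min(r['rating'] for r in rated)

# Several reviews can share the top or bottom score, so keep all of them
highest = [r for r in rated if r['rating'] == highest_score]
lowest = [r for r in rated if r['rating'] == lowest_score]

print(highest_score, [r['author'] for r in highest])
print(lowest_score, [r['author'] for r in lowest])
```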

2. Find the keywords that appear most often in the reviews (you may define the keyword list yourself).
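
One simple way to approach this, sketched here with a self-defined keyword list and plain substring counting rather than word segmentation, is shown below. The keyword list, the sample texts, and the function name count_keywords are all made up for illustration.

```python
from collections import Counter

# Self-defined keywords: plot, visuals, voice acting
KEYWORDS = ['剧情', '画面', '配音']


def count_keywords(texts, keywords):
    """Count how many times each keyword occurs across all texts."""
    counts = Counter()
    for text in texts:
        for keyword in keywords:
            counts[keyword] += text.count(keyword)
    return counts


# Made-up review texts
texts = ['画面很棒，剧情也不错', '画面精致', '配音一般，画面好']
keyword_counts = count_keywords(texts, KEYWORDS)
top_keyword = keyword_counts.most_common(1)
print(top_keyword)
```

Substring counting is enough for a fixed keyword list; segmenting the text with jieba first would be the alternative when the keywords are not known in advance.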

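3. Find the sentiment words that appear most often in the reviews (you may define the sentiment word list yourself), and analyze the overall sentiment.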
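
A minimal lexicon-based sketch of this task follows. It counts occurrences of self-defined positive and negative words and compares the totals. The two word lists, the sample texts, and the function names are assumptions made for illustration; this is not a trained sentiment model.

```python
from collections import Counter

# Self-defined sentiment lexicons (assumptions for illustration)
POSITIVE = ['好看', '喜欢', '感动']  # good to watch, like, moved
NEGATIVE = ['无聊', '失望']          # boring, disappointed


def sentiment_counts(texts):
    """Return (positive, negative) Counters of lexicon-word occurrences."""
    positive, negative = Counter(), Counter()
    for text in texts:
        for word in POSITIVE:
            positive[word] += text.count(word)
        for word in NEGATIVE:
            negative[word] += text.count(word)
    return positive, negative


def tendency(positive, negative):
    """Compare the total positive and negative hits."""
    pos_total = sum(positive.values())
    neg_total = sum(negative.values())
    if pos_total > neg_total:
        return 'positive'
    if neg_total > pos_total:
        return 'negative'
    return 'neutral'


# Made-up review texts
texts = ['很好看，很喜欢', '有点无聊', '好看又感动']
positive, negative = sentiment_counts(texts)
overall = tendency(positive, negative)
print(positive.most_common(1), negative.most_common(1), overall)
```

Plain substring matching ignores negation: a phrase such as 不好看 ("not good to watch") would still count as a hit for 好看, so the result should be read as a rough indication only.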

4. Using the review times, plot how the number of reviews changes over time.
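
The sketch below groups reviews by calendar day and optionally draws the curve with Matplotlib. It assumes the scraped time strings look like '2023-01-02 09:00:00'; check the real data and adjust the format string if they differ. The function names and the sample records are hypothetical.

```python
from collections import Counter
from datetime import datetime


def daily_counts(records, fmt='%Y-%m-%d %H:%M:%S'):
    """Return a date-sorted list of (date, number_of_reviews) pairs."""
    counts = Counter()
    for record in records:
        # strip() removes whitespace that often surrounds scraped text
        day = datetime.strptime(record['time'].strip(), fmt).date()
        counts[day] += 1
    return sorted(counts.items())


def plot_daily_counts(series, path='reviews_per_day.png'):
    """Draw the daily counts as a line chart and save it to a file."""
    import matplotlib.pyplot as plt  # imported here so counting works without it

    days = [day for day, _ in series]
    counts = [count for _, count in series]
    plt.figure()
    plt.plot(days, counts, marker='o')
    plt.xlabel('Date')
    plt.ylabel('Number of reviews')
    plt.gcf().autofmt_xdate()  # tilt the date labels so they do not overlap
    plt.savefig(path)
    plt.close()


# Made-up records in the assumed time format
records = [
    {'time': '2023-01-02 09:00:00'},
    {'time': ' 2023-01-01 20:30:00\n'},
    {'time': '2023-01-02 23:10:00'},
]
series = daily_counts(records)
print(series)
```

Calling plot_daily_counts(series) would then write the chart to reviews_per_day.png.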

5. Using the reviewer names, count the number of reviews per reviewer and plot a ranking chart of reviewers by review count.
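
Counting reviews per reviewer can be sketched with collections.Counter, as below. The function name reviewer_ranking, the top_n parameter, and the sample records are made up for illustration.

```python
from collections import Counter


def reviewer_ranking(records, top_n=10):
    """Return the top_n (author, review_count) pairs, most active first."""
    counts = Counter(record['author'].strip() for record in records)
    return counts.most_common(top_n)


# Made-up records
records = [
    {'author': 'reviewer_a'},
    {'author': 'reviewer_b'},
    {'author': 'reviewer_a'},
    {'author': 'reviewer_c'},
    {'author': 'reviewer_a'},
    {'author': 'reviewer_b'},
]
ranking = reviewer_ranking(records)
print(ranking)
```

The resulting pairs can be passed to a Matplotlib bar chart (for example plt.barh) to draw the ranking.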

6. Using the reviewer names, compute the average rating per reviewer and plot a ranking chart of reviewers by average rating.
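
As with task 1, this needs a rating for each review, which the scraping code in Part I does not collect. The sketch below assumes a hypothetical 'rating' field (a number of stars, or None when missing) and averages it per reviewer. The function name and the sample records are made up.

```python
from collections import defaultdict
from statistics import mean


def average_rating_by_author(records):
    """Return (author, average_rating) pairs, highest average first."""
    ratings = defaultdict(list)
    for record in records:
        # Skip reviews without a rating; 'rating' is an assumed field
        if record.get('rating') is not None:
            ratings[record['author']].append(record['rating'])
    averages = {author: mean(values) for author, values in ratings.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Made-up records
records = [
    {'author': 'reviewer_a', 'rating': 5},
    {'author': 'reviewer_a', 'rating': 4},
    {'author': 'reviewer_b', 'rating': 2},
    {'author': 'reviewer_c', 'rating': None},
    {'author': 'reviewer_b', 'rating': 3},
]
average_ranking = average_rating_by_author(records)
print(average_ranking)
```

Reviewers with no rated reviews are left out of the ranking instead of being given an average of zero.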

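III. Submission requirements

Submit your Python code together with screenshots of the results.
The code should include comments explaining how each step is implemented.
The screenshots should cover the scraped data, the analysis results, and the related charts.
Package the code and screenshots into a compressed archive, upload it to cloud storage, and submit the share link in the assignment area.
The deadline is 23:59 this Friday; late submissions will not be accepted.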

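Note: The code above is for reference only; adjust and complete it to fit your actual needs.

Recommended tools:

Pandas: data processing and analysis library
Matplotlib: charting library
jieba: Chinese word-segmentation library
Sentiment analysis: a sentiment-analysis library of your choice

Present and explain the final analysis results according to the data you actually collect.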