Python Web Scraping in Practice: Collecting Douban Reviews of Puss in Boots: The Last Wish (《穿靴子的猫2》)

This project uses Python with the Selenium and BeautifulSoup libraries to scrape every page of Douban reviews for Puss in Boots: The Last Wish and save them to a CSV file.

Project goal:

Use a web crawler to collect every page of reviews for Puss in Boots: The Last Wish on Douban Movies. Target URL: https://movie.douban.com/subject/25868125/
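
Covering every page comes down to requesting the comment listing at successive offsets. The sketch below makes that arithmetic explicit. It assumes the `start`, `limit`, `sort`, and `status` query parameters that the code later in this article uses; the helper name `comment_page_url` is hypothetical.

```python
from urllib.parse import urlencode

# Comment listing for the movie, as used by the scraper in this article
BASE_URL = 'https://movie.douban.com/subject/25868125/comments'


def comment_page_url(page_index, page_size=20):
    """Build the URL of the zero-based page_index-th comment page."""
    params = {
        'start': page_index * page_size,  # offset of the first comment on the page
        'limit': page_size,               # number of comments per page
        'sort': 'new_score',
        'status': 'P',
    }
    return f'{BASE_URL}?{urlencode(params)}'


# URLs for the first three pages
first_pages = [comment_page_url(i) for i in range(3)]
```

Keeping the page size in one place avoids a mismatch between the `start` multiplier and the `limit` value if either is changed later.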

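Project steps:

Use Selenium to open the page listing all of the movie's reviews and get the HTML of the current page
Parse the HTML with BeautifulSoup and extract each review's title, content, rating, time, and author
Save the extracted review data to a CSV file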
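
The storage step can be tried on its own before any scraping is wired in. The following is a minimal sketch, not the article's implementation: the helper names `save_reviews` and `load_reviews`, the English column names, and the sample row are all made up for illustration.

```python
import csv
import os
import tempfile

# Hypothetical column names mirroring the fields listed in the steps above
COLUMNS = ['title', 'content', 'rating', 'time', 'author']


def save_reviews(rows, path):
    """Write a list of review dicts to a CSV file."""
    # utf-8-sig prepends a byte order mark so that spreadsheet tools such as
    # Excel detect the encoding of Chinese text; newline='' lets the csv
    # module manage line endings itself.
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


def load_reviews(path):
    """Read the CSV back into a list of dicts (the BOM is stripped on read)."""
    with open(path, newline='', encoding='utf-8-sig') as f:
        return list(csv.DictReader(f))


# Made-up sample row, written to a temporary directory for demonstration
sample_rows = [
    {'title': 'Sample title', 'content': 'Funny, and better than expected',
     'rating': '40', 'time': '2023-01-01 12:00:00', 'author': 'sample_user'},
]
demo_path = os.path.join(tempfile.mkdtemp(), 'reviews.csv')
save_reviews(sample_rows, demo_path)
```

Reading the file back with `load_reviews(demo_path)` is a quick way to confirm that commas inside the review text are quoted correctly.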

Code:

from selenium import webdriver
from bs4 import BeautifulSoup
import time
import csv

# Configure the browser
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--lang=zh-CN')

driver = webdriver.Chrome(options=options)
driver.get('https://movie.douban.com/subject/25868125/comments?start=0&limit=20&sort=new_score&status=P')

# Get the total number of pages (assumes the second-to-last pager link holds the last page number)
soup = BeautifulSoup(driver.page_source, 'html.parser')
pagination = soup.find('div', class_='center')
total_pages = int(pagination.find_all('a')[-2].text)

# Scrape the data
with open('reviews.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Content', 'Rating', 'Time', 'Author'])

    for i in range(total_pages):
        driver.get(f'https://movie.douban.com/subject/25868125/comments?start={i*20}&limit=20&sort=new_score&status=P')
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        comments = soup.find_all('div', class_='comment-item')

        for comment in comments:
            # The positional indexes below depend on the layout of the comment-info block
            info = comment.find('span', class_='comment-info')
            title = info.find_all('a')[0].text.strip()
            content = comment.find('span', class_='short').text.strip()
            rating = info.find_all('span')[1].get('class')[0].replace('allstar', '')
            # Named comment_time so that it does not shadow the time module used below
            comment_time = info.find_all('span')[3].text.strip()
            author = info.find_all('a')[1].text.strip()

            writer.writerow([title, content, rating, comment_time, author])

        # Pause between pages to avoid overloading the server
        time.sleep(2)

driver.quit()
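
The rating column written above holds whatever is left after removing `allstar` from a class name, for example `40`. The sketch below turns that into a star count. It assumes the rating is encoded as a class of the form `allstarNN` with NN equal to ten times the number of stars, which is inferred from the string replacement in the code above rather than from any documented format; the helper name `stars_from_classes` is hypothetical.

```python
import re


def stars_from_classes(class_names):
    """Return the star count (1 to 5) found in a list of CSS class names.

    Returns None when no class of the assumed form 'allstarN0' is present,
    for example when a reviewer left no rating.
    """
    for name in class_names or []:
        match = re.fullmatch(r'allstar([1-5])0', name)
        if match:
            return int(match.group(1))
    return None


# BeautifulSoup returns the class attribute as a list of strings,
# so the helper can be fed tag.get('class') directly.
rated = stars_from_classes(['allstar40', 'rating'])
unrated = stars_from_classes(['comment-time'])
```

Returning `None` for an unrated comment keeps a timestamp or other class name from ending up in the rating column by accident.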

Screenshot of the scraped results: (image not available)

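Summary:

This project walks through the full workflow of scraping Douban movie reviews with Python, from driving the browser and parsing pages to storing the data, using several libraries along the way. It can serve as a reference example for learning web scraping and can be adapted to other sites and data.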