Scraping Douban Reviews of Puss in Boots: The Last Wish (《穿靴子的猫2》) with Python

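This article shows how to use Python and the Selenium library to scrape review data for the film Puss in Boots: The Last Wish (《穿靴子的猫2》) from Douban Movies, page by page, and store it as JSON. The example below covers the first three pages of reviews.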

Scraping steps

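Use Selenium to click through to the film's full list of reviews.
Starting from 'https://movie.douban.com/subject/25868125/comments?start=0&limit=20&status=P&sort=new_score', scrape the reviewer name, review time, and review text for every review on the first page.
Continue with pages 2 and 3, scraping the same three fields for every review.
Write the scraped data to a file in JSON format.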
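The pagination in the steps above comes down to changing the start parameter in the URL by 20 for each page. A minimal sketch of that, using only the standard library; the helper name comment_page_url and the zero-based page index are assumptions made for illustration, while the base URL and query parameters are taken from the address above:

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/subject/25868125/comments'


def comment_page_url(page, page_size=20):
    """Build the review-list URL for a zero-based page index (hypothetical helper)."""
    query = urlencode({
        'start': page * page_size,  # offset of the first review on this page
        'limit': page_size,
        'status': 'P',
        'sort': 'new_score',
    })
    return f'{BASE_URL}?{query}'


# URLs for pages 1 to 3
page_urls = [comment_page_url(page) for page in range(3)]
```

Page index 0 reproduces the starting address given in the steps, and indices 1 and 2 give start=20 and start=40.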

Code example

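Douban Movies has fairly strict anti-scraping measures, so in practice you may need countermeasures such as simulating human behavior or routing requests through proxy IPs. For convenience, here is a simple code example, for reference only.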
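One simple way to make a scraper look less mechanical is to vary the pause between page loads instead of sleeping for a fixed interval. The sketch below is an illustration only: the helper name next_delay and the default values of 2.0 and 1.5 seconds are assumptions, not values known to satisfy Douban's checks.

```python
import random


def next_delay(base=2.0, jitter=1.5, rng=random):
    """Return a pause in seconds between base and base + jitter (hypothetical helper)."""
    return base + rng.uniform(0.0, jitter)


# Example: choose pauses for three page loads
delays = [next_delay() for _ in range(3)]
```

Each value can be passed to time.sleep() before loading the next page.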

import json
import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

# Configure Chrome to run headless (in the background)
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--window-size=1920,1080')
options.add_argument('--disable-gpu')

comments = []

# Start Chrome and open the film's Douban page
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://movie.douban.com/subject/25868125/')

    # Click through to the full review list once. The 'more-btn' class name
    # comes from the original example and may not match the live page.
    try:
        button = driver.find_element(By.CLASS_NAME, 'more-btn')
        if button.is_displayed():
            button.click()
            time.sleep(2)
    except NoSuchElementException:
        pass  # the paginated URLs below are loaded directly, so this step can be skipped

    # Collect the reviews
    for start in range(0, 60, 20):  # first 3 pages, 20 reviews per page
        url = f'https://movie.douban.com/subject/25868125/comments?start={start}&limit=20&status=P&sort=new_score'
        driver.get(url)
        time.sleep(2)  # pause between pages
        for item in driver.find_elements(By.CSS_SELECTOR, '.comment-item'):
            comments.append({
                'user': item.find_element(By.CSS_SELECTOR, '.comment-info a').text,
                'time': item.find_element(By.CLASS_NAME, 'comment-time').get_attribute('title'),
                'content': item.find_element(By.CLASS_NAME, 'short').text,
            })
finally:
    # Close the browser even if scraping fails part-way
    driver.quit()

# Save the reviews to a file
with open('comments.json', 'w', encoding='utf-8') as f:
    json.dump(comments, f, ensure_ascii=False)

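This code uses Selenium to drive a real browser: it opens the film's page and clicks through to the full review list, then loads pages 1 to 3 in turn. For each review it extracts the user name, the time, and the review text into a dictionary, and finally writes the collected dictionaries to a JSON file. Note the ensure_ascii=False argument, which keeps Chinese characters from being written as \uXXXX escape sequences.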
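The effect of ensure_ascii can be seen without running the scraper. The following sketch serializes a single made-up record (the user name, time, and text are placeholder values, not real scraped data) both ways and checks that the two outputs decode to the same data:

```python
import json

# Placeholder record in the same shape the scraper produces
sample = [{'user': 'example_user', 'time': '2023-01-01 12:00:00', 'content': '很好看'}]

escaped = json.dumps(sample)                       # default: non-ASCII written as \uXXXX
readable = json.dumps(sample, ensure_ascii=False)  # Chinese characters kept as-is

# Both strings are valid JSON and decode to identical data
same_data = json.loads(escaped) == json.loads(readable)
```

Either form loads back correctly; ensure_ascii=False only makes the file readable when opened in a text editor, which is why the file is also opened with encoding='utf-8'.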