Scraping Douban Movie Review Data for Puss in Boots 2 (《穿靴子的猫2》) with Selenium

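This tutorial uses the Selenium library to scrape the reviews of Puss in Boots 2 (《穿靴子的猫2》) on Douban Movie page by page, collecting each reviewer's name, the comment time, and the comment text. It walks through how Selenium is used and provides a code example.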

Steps

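Use Selenium to click through from the movie page to the full list of reviews.
Starting from 'https://movie.douban.com/subject/25868125/comments?start=0&limit=20&status=P&sort=new_score', scrape the reviewer name, comment time, and comment text on the first page.
Continue with pages 2-3 and scrape the reviewer name, comment time, and comment text of every comment there. (The sample code further down loops over 5 pages.)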
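
The start URL in step 2 carries `start=0&limit=20`, which suggests that pages could also be addressed directly by offset instead of by clicking. The following is a minimal sketch under that assumption: it treats `start` as the zero-based index of the first comment on a page and advances it by `limit` per page. The helper name `comment_page_url` is hypothetical, and whether Douban serves arbitrary offsets is not confirmed here.

```python
from urllib.parse import urlencode

# Base address of the comments listing, taken from the start URL in step 2
BASE_URL = "https://movie.douban.com/subject/25868125/comments"


def comment_page_url(page, limit=20):
    """Return the URL of a 1-based comment page, assuming offset-based paging."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    query = urlencode({
        "start": (page - 1) * limit,  # assumed: offset of the first comment
        "limit": limit,
        "status": "P",
        "sort": "new_score",
    })
    return f"{BASE_URL}?{query}"


# Print the addresses of pages 1 to 3
for page in range(1, 4):
    print(comment_page_url(page))
```

Passing each of these URLs to `driver.get` would be one alternative to clicking the next-page link.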

Code example

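This exercise relies on Selenium, so the library and a matching browser driver must be installed first. The code below uses Chrome and its driver; to use a different browser, adjust the code accordingly.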
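
Selenium itself can be installed with `pip install selenium`. For readers who want to switch browsers, here is a rough sketch of a small factory that returns either a Chrome or a Firefox driver. The function name `make_driver` and its parameters are hypothetical; only Chrome and Firefox are covered, and the matching browser and driver are assumed to be available on the machine.

```python
def make_driver(browser="chrome", headless=True):
    """Create a Selenium driver for the named browser (hypothetical helper)."""
    browser = browser.lower()
    if browser not in ("chrome", "firefox"):
        raise ValueError(f"unsupported browser: {browser!r}")

    # Imported here so the argument check above runs even without Selenium installed
    from selenium import webdriver

    if browser == "chrome":
        options = webdriver.ChromeOptions()
        if headless:
            options.add_argument("--headless")  # no visible browser window
        return webdriver.Chrome(options=options)

    options = webdriver.FirefoxOptions()
    if headless:
        options.add_argument("-headless")  # Firefox's headless flag
    return webdriver.Firefox(options=options)
```

With this helper, `driver = make_driver("firefox")` would take the place of the Chrome-specific setup lines in the script.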

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep

# URL of the movie detail page
url = 'https://movie.douban.com/subject/25868125/'

# Start the browser
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run without showing a browser window
driver = webdriver.Chrome(options=options)
driver.maximize_window()  # maximize the window

# Open the movie detail page
driver.get(url)

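# Click the "all reviews" link
# (find_element_by_xpath was removed in Selenium 4; use find_element(By.XPATH, ...).
# The class names in the XPath must match the page's actual markup.)
btn_all_comments = driver.find_element(By.XPATH, "//div[@class='reviews mod movie-content']//a[contains(@href, '/comments')]")
btn_all_comments.click()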

# Wait until the comments page has loaded
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//div[@class='review-list  ']/div[@class='review-item']"))
)

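# Scrape the comments page by page
comments = []
num_pages = 5  # number of pages to scrape
for i in range(num_pages):
    # Extract the reviewer name, comment time, and comment text of each item
    items = driver.find_elements(By.XPATH, "//div[@class='review-list  ']/div[@class='review-item']")
    for item in items:
        # find_element must return an element, so the XPath cannot end in
        # text() or @title; read the text or attribute from the element instead
        name = item.find_element(By.XPATH, ".//a[@class='name']").text
        comment_time = item.find_element(By.XPATH, ".//span[@class='comment-time ']").get_attribute('title')
        content = item.find_element(By.XPATH, ".//div[@class='short-content']").text
        comments.append({'name': name, 'time': comment_time, 'content': content})
    if i == num_pages - 1:
        break  # last requested page: nothing more to click
    # Click the "next page" button
    btn_next = driver.find_element(By.XPATH, "//div[@class='review-list  ']/div[@class='reviews-pagination']/a[@class='next']")
    btn_next.click()
    # Wait until the next page has loaded
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//div[@class='review-list  ']/div[@class='review-item']"))
    )
    sleep(1)  # wait one more second in case the page has not fully loaded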

# Print all comments
for comment in comments:
    print(comment['name'], comment['time'], comment['content'])

# Close the browser
driver.quit()
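
The script above only prints what it collected. If the results should be kept, one possible approach is to write the `comments` list to a CSV file with the standard library. This is a sketch, not part of the original exercise: the function name `save_comments` and the default file name are hypothetical, and the column names simply mirror the dictionary keys used in the script.

```python
import csv


def save_comments(comments, path="comments.csv"):
    """Write a list of {'name', 'time', 'content'} dicts to a CSV file."""
    fieldnames = ["name", "time", "content"]
    # utf-8-sig adds a BOM so spreadsheet tools tend to detect the encoding
    # of Chinese text correctly; newline="" is what the csv module expects
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(comments)
    return len(comments)  # number of rows written
```

Calling `save_comments(comments)` after the scraping loop would store one row per comment.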

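Note:

The sample code above scrapes only 5 pages of comments; adjust the number of loop iterations to fetch more or fewer.
To avoid triggering Douban's anti-scraping measures, add suitable delays between page loads and avoid hitting the site too frequently.
This tutorial is for learning and discussion only; do not use it for any illegal activity.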
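
The delay recommended above can be randomized so that page loads are not evenly spaced. A minimal sketch follows; the helper name `polite_sleep` is hypothetical, and the 2 to 5 second default range is an arbitrary assumption, not a threshold known to satisfy Douban.

```python
import random
import time


def polite_sleep(min_seconds=2.0, max_seconds=5.0, sleep_fn=time.sleep):
    """Pause for a random duration in [min_seconds, max_seconds] and return it."""
    if not 0 <= min_seconds <= max_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    delay = random.uniform(min_seconds, max_seconds)
    sleep_fn(delay)  # sleep_fn can be swapped out, for example in tests
    return delay
```

In the scraping loop, `polite_sleep()` could be used in place of the fixed `sleep(1)`.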