Scraping Douban Movie Reviews with Python: Code Walkthrough and Chromedriver Setup

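This article walks through scraping Douban movie reviews with Python and Selenium. It includes a complete code example and explains how to set up Chromedriver, so you can collect Douban movie comment data yourself.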

Imports

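import time
import json
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys  # not used below; only needed for keyboard simulation


# Configure Chrome to run in the background (headless)
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--window-size=1920,1080')  # Chrome expects width,height
options.add_argument('--disable-gpu')

# Start Chrome and open the Douban movie page.
# Selenium 4 takes the driver path through a Service object; the raw string
# keeps the backslash in the Windows path from being read as an escape.
driver = webdriver.Chrome(service=Service(r'D:\chromedriver.exe'), options=options)
driver.get('https://movie.douban.com/subject/25868125/')

# Click the "All reviews" button while it is visible.
# The element is re-queried on every pass and the loop is capped (10 is an
# arbitrary safety limit), so a missing or permanently visible button cannot
# raise an exception or loop forever.
for _ in range(10):
    buttons = driver.find_elements(By.CLASS_NAME, 'more-btn')
    if not buttons or not buttons[0].is_displayed():
        break
    buttons[0].click()
    time.sleep(2)

# Collect the review data
comments = []
for i in range(0, 60, 20):  # first 3 pages (start = 0, 20, 40)
    url = f'https://movie.douban.com/subject/25868125/comments?start={i}&limit=20&status=P&sort=new_score'
    driver.get(url)
    items = driver.find_elements(By.CSS_SELECTOR, '.comment-item')
    for item in items:
        comment = {}
        comment['user'] = item.find_element(By.CLASS_NAME, 'comment-info').find_element(By.TAG_NAME, 'a').text
        comment['time'] = item.find_element(By.CLASS_NAME, 'comment-time').get_attribute('title')
        comment['content'] = item.find_element(By.CLASS_NAME, 'short').text
        comments.append(comment)

# Save the review data to a file
with open('comments.json', 'w', encoding='utf-8') as f:
    json.dump(comments, f, ensure_ascii=False)

# Close the browser
driver.quit()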

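Chromedriver Setup

Download Chromedriver: download the Chromedriver release that matches your installed Chrome version from https://chromedriver.chromium.org/downloads (for Chrome 115 and later, that page points to the Chrome for Testing dashboard for matching drivers).
Configure the path: place the downloaded chromedriver.exe in a directory listed in your system's PATH environment variable, or specify its location in code: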

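from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Relative path (resolved against the current working directory)
driver = webdriver.Chrome(service=Service('./chromedriver.exe'))

# Absolute path (raw string, so the backslash is not read as an escape)
driver = webdriver.Chrome(service=Service(r'D:\chromedriver.exe'))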
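
As a rough illustration of the two placement options above, the following sketch checks whether a driver binary can be found before the browser is launched. The helper name find_chromedriver and its fallback order are assumptions made for this example and are not part of Selenium; it relies only on os.path.isfile and shutil.which from the standard library.

```python
import os
import shutil


def find_chromedriver(explicit_path=None):
    """Return a usable chromedriver path, or None if none is found.

    Hypothetical helper: tries the explicit path first, then the PATH.
    """
    if explicit_path and os.path.isfile(explicit_path):
        return os.path.abspath(explicit_path)
    # shutil.which searches the directories listed in the PATH variable
    return shutil.which('chromedriver')


if __name__ == '__main__':
    path = find_chromedriver(r'D:\chromedriver.exe')
    print(path or 'chromedriver not found; check PATH or the explicit path')
```

If the function returns None, the driver is neither at the given location nor on the PATH, which is worth ruling out before debugging anything else.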

Code Explanation

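Import the required libraries: time adds pauses between actions, json saves the scraped data in JSON format, selenium drives the browser, and Keys simulates keyboard input (the script never actually calls Keys, so that import is optional).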
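Set the Chrome options:
- headless: runs the browser in the background without showing a window.
- window-size: sets the browser window size, which determines the page layout the scraper sees.
- disable-gpu: disables GPU hardware acceleration. It was long recommended as a workaround for headless Chrome on Windows; it is not a meaningful speed-up for scraping.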
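
The three options can be gathered in one place, as in the sketch below. The function name chrome_arguments and its parameters are invented for illustration; the switches themselves are standard Chrome command-line flags. Note that Chrome expects the window size as width and height separated by a comma.

```python
def chrome_arguments(headless=True, width=1920, height=1080):
    """Build the list of Chrome command-line switches used in this article."""
    args = [f'--window-size={width},{height}', '--disable-gpu']
    if headless:
        args.insert(0, '--headless')
    return args


if __name__ == '__main__':
    # With Selenium installed, the list would be applied like this:
    #     options = webdriver.ChromeOptions()
    #     for arg in chrome_arguments():
    #         options.add_argument(arg)
    print(chrome_arguments())
```

Passing headless=False is a convenient way to watch the browser while debugging selectors.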
Launch Chrome and open the Douban movie page: webdriver.Chrome() starts a Chrome session, and get() opens the given Douban movie page.
Click the "All reviews" button:
- Locate the button by its class name, more-btn.
- Use is_displayed() to check whether the button is visible.
- Use click() to click the button and load the review data.
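
A loop of this kind can spin forever if the button never disappears, so it helps to cap the number of clicks. A minimal sketch of a bounded variant follows. click_while_visible and its parameters are hypothetical; find_buttons stands for any callable that returns the currently matching elements, for example lambda: driver.find_elements(By.CLASS_NAME, 'more-btn') in Selenium 4.

```python
import time


def click_while_visible(find_buttons, max_clicks=10, pause=2.0):
    """Click the first matching button until it disappears or the cap is hit.

    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        buttons = find_buttons()  # re-query each round to avoid stale elements
        if not buttons or not buttons[0].is_displayed():
            break
        buttons[0].click()
        clicks += 1
        time.sleep(pause)  # give the page time to load more content
    return clicks
```

Because the function only needs objects with is_displayed() and click(), it can be exercised with simple stand-ins before pointing it at a real page.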
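Collect the review data:
- Loop over the comment pages, selecting every .comment-item element on each page with a CSS selector.
- For each item, read the user name, the timestamp, and the comment text from its child elements, located by class name.
- Store the values in a dictionary and append it to the list.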
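
The paging relies on the start query parameter advancing in steps of 20. The sketch below shows that URL arithmetic on its own, using the URL pattern from the script; the function name comment_page_urls is made up for this example.

```python
def comment_page_urls(subject_id, pages=3, page_size=20):
    """Return the comment-page URLs for the first `pages` pages of a movie."""
    base = f'https://movie.douban.com/subject/{subject_id}/comments'
    return [
        f'{base}?start={start}&limit={page_size}&status=P&sort=new_score'
        for start in range(0, pages * page_size, page_size)
    ]


if __name__ == '__main__':
    for url in comment_page_urls(25868125):
        print(url)
```

Scraping more pages then comes down to raising pages rather than editing the range() bounds by hand.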
Save the review data to a file: json.dump() writes the list to a JSON file.
Close the browser: quit() closes Chrome and ends the driver session.
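
For the save step, ensure_ascii=False matters because the review text is Chinese: without it, json.dump writes \uXXXX escapes instead of readable characters. A brief sketch with made-up sample data follows; save_comments and load_comments are illustrative names, and indent=2 is an optional addition for readability.

```python
import json
import os
import tempfile


def save_comments(comments, path):
    """Write the list of comment dicts to `path` as UTF-8 JSON."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(comments, f, ensure_ascii=False, indent=2)


def load_comments(path):
    """Read the comments back from a file written by save_comments."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


if __name__ == '__main__':
    # Made-up sample record, not real scraped data
    sample = [{'user': 'demo', 'time': '2020-01-01 12:00:00', 'content': '很好看'}]
    with tempfile.TemporaryDirectory() as folder:
        target = os.path.join(folder, 'comments_sample.json')
        save_comments(sample, target)
        print(load_comments(target) == sample)
```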

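Notes

Make sure your Chromedriver version matches your Chrome version.
Do not scrape too frequently, to avoid putting excessive load on the site.
Adapt the code to your needs, for example by scraping more pages or changing the output format.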
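
One simple way to keep the request rate low is to pause for a randomized interval between page loads. The sketch below is only one reading of what "not too frequently" might mean in practice; polite_delay and the default range of 2 to 5 seconds are illustrative choices, not values recommended by Douban.

```python
import random
import time


def polite_delay(min_seconds=2.0, max_seconds=5.0):
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


if __name__ == '__main__':
    for page in range(3):
        # driver.get(url) would go here in the real scraper
        waited = polite_delay(0.1, 0.3)
        print(f'page {page}: waited {waited:.2f}s')
```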

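Summary

This article covered the full workflow for scraping Douban movie reviews with Python and Selenium: the code example, the Chromedriver setup, and a few points to watch out for. Hopefully it helps you collect Douban movie comment data.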