Scraping Douban Reviews of the Movie "Puss in Boots 2" with Python

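This example uses Python and the Selenium library to scrape every page of review data for the Douban movie "Puss in Boots 2", including each reviewer's name, the review time, and the review text, and stores the data in JSON format.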

Code implementation

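from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import json

# Path to the ChromeDriver executable
driver_path = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"
# Create the Chrome browser object (Selenium 4 takes the driver path through Service)
driver = webdriver.Chrome(service=Service(driver_path))

# Open the Douban movie page
url = "https://movie.douban.com/subject/25868125/"
driver.get(url)
time.sleep(1)

# Click the "all reviews" link
button = driver.find_element(By.XPATH, '//*[@id="comments-section"]/div[1]/h2/span/a')
button.click()
time.sleep(1)

# Read the total number of pages from the paginator
page_element = driver.find_element(By.XPATH, '//*[@id="paginator"]/a[last()-1]')
total_page = int(page_element.text)

# Scrape the review data
result = []
for page in range(total_page):
    # Build the URL of the review page (20 reviews per page)
    start = page * 20
    url = f"https://movie.douban.com/subject/25868125/comments?start={start}&limit=20&status=P&sort=new_score"
    # Open the review page
    driver.get(url)
    time.sleep(1)
    # Collect the review items; contains() tolerates extra whitespace in the class attribute
    comments = driver.find_elements(By.XPATH, '//*[@id="comments"]/div[contains(@class, "comment-item")]')
    for c in comments:
        name = c.find_element(By.XPATH, './/span[@class="comment-info"]/a').text
        # Selenium locators must select an element, not an attribute such as @title,
        # so locate the time element, then read its title attribute (or its text as a fallback)
        time_el = c.find_element(By.XPATH, './/span[@class="comment-info"]/span[contains(@class, "comment-time")]')
        time_str = time_el.get_attribute("title") or time_el.text.strip()
        content = c.find_element(By.XPATH, './/p/span').text
        item = {"name": name, "time": time_str, "content": content}
        result.append(item)
    print(f"Scraped page {page+1}")

# Save the data; ensure_ascii=False keeps Chinese text readable in the file
with open("comments.json", "w", encoding="utf-8") as f:
    json.dump(result, f, ensure_ascii=False)

# Close the browser
driver.quit()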
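
Once the script has finished, the saved file can be read back to check what was collected. The following is a minimal sketch, not part of the original script: the helper names load_comments and summarize are hypothetical, and the per-day count assumes each "time" value begins with a YYYY-MM-DD date, which may not hold for every record.

```python
import json
from collections import Counter
from pathlib import Path


def load_comments(path):
    """Read back a list of {"name", "time", "content"} records from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize(comments):
    """Return the record count and the number of reviews per calendar day.

    Assumption: each non-empty "time" value starts with a YYYY-MM-DD date.
    """
    per_day = Counter(c["time"][:10] for c in comments if c.get("time"))
    return {"total": len(comments), "per_day": dict(per_day)}


if __name__ == "__main__":
    path = Path("comments.json")
    if path.exists():
        print(summarize(load_comments(path)))
```

Records with an empty "time" value still count toward the total but are left out of the per-day breakdown.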

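Explanation

The code first sets the Chrome driver path and creates a Chrome browser object.
It opens the Douban movie page and clicks the "all reviews" link to reach the review page.
It reads the total number of review pages and loops over them, scraping the reviews on each page.
It stores the scraped data in a JSON file.
Finally, it closes the browser.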
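
The paging step above can be isolated into a small helper, which makes it easy to check the URLs without opening a browser. This is a sketch only: the function name comment_page_urls and the constants BASE_URL and PAGE_SIZE are hypothetical, while the query parameters are the ones the script already uses.

```python
from urllib.parse import urlencode

# Same comments endpoint and page size as in the script
BASE_URL = "https://movie.douban.com/subject/25868125/comments"
PAGE_SIZE = 20


def comment_page_urls(total_pages, page_size=PAGE_SIZE):
    """Yield one review-page URL per page, advancing `start` by `page_size`."""
    for page in range(total_pages):
        query = urlencode({
            "start": page * page_size,
            "limit": page_size,
            "status": "P",
            "sort": "new_score",
        })
        yield f"{BASE_URL}?{query}"


if __name__ == "__main__":
    for url in comment_page_urls(3):
        print(url)
```

In the main loop, each yielded URL would be passed to driver.get in place of the hand-built f-string.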

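Notes

Install the Selenium library first: pip install selenium
Download the chromedriver build that matches your Chrome version, then either add it to the system PATH or pass its location explicitly, as the code does with driver_path.
Douban may apply anti-scraping measures, so the code may need adjusting to the actual situation, for example with longer waits between requests.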
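
One common way to adjust for anti-scraping measures is to slow down and retry with growing waits. The sketch below illustrates that general technique and is not something the original script does: backoff_delays and fetch_with_retry are hypothetical helpers, and the default timings are arbitrary.

```python
import random
import time


def backoff_delays(retries, base=1.0, factor=2.0, jitter=0.5, rng=random):
    """Return `retries` wait times that grow geometrically, each with random jitter."""
    return [base * factor ** i + rng.uniform(0, jitter) for i in range(retries)]


def fetch_with_retry(fetch, url, retries=3, sleep=time.sleep, **delay_opts):
    """Call fetch(url); if it raises, wait and retry up to `retries` more times."""
    last_error = None
    # First attempt runs immediately; later attempts wait for a growing delay
    for delay in [0.0] + backoff_delays(retries, **delay_opts):
        if delay:
            sleep(delay)
        try:
            return fetch(url)
        except Exception as exc:  # narrow this to the errors you actually expect
            last_error = exc
    raise last_error
```

A call such as fetch_with_retry(driver.get, url) would fit the main loop, with one caveat: a blocked page often loads without raising an exception, so in practice the fetch function would need to inspect the page and raise an error itself when the expected review elements are missing.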