Python Web Scraping in Practice: Scraping Douban Reviews of Puss in Boots: The Last Wish (《穿靴子的猫2》)

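This article uses Python and the Selenium library to scrape Douban review data for the film Puss in Boots: The Last Wish (the first 10 pages of comments, in the code below) and save it as a JSON file.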

1. Preparation

Install the Selenium library and the Chrome browser driver

!pip install selenium

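Download the Chrome driver that matches your installed Chrome version: http://chromedriver.storage.googleapis.com/index.html
With Selenium 4.6 or later, the bundled Selenium Manager can usually locate or download a matching driver automatically, so the manual download is often unnecessary.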

Import the libraries and configure the Chrome driver

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import json

# Path to the Chrome driver executable
driver_path = 'chromedriver.exe'

# Create a headless Chrome browser and an explicit wait with a 10-second timeout
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# Selenium 4 takes the driver path through a Service object, not as a positional argument
driver = webdriver.Chrome(service=Service(driver_path), options=options)
wait = WebDriverWait(driver, 10)

2. Implementation

Open the movie's full review list

# Open the movie detail page
url = 'https://movie.douban.com/subject/25868125/'
driver.get(url)

# Wait until the "all reviews" link is clickable, then click it
all_comments = wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="comments-section"]/div[1]/h2/span[2]/a')))
all_comments.click()

# Give the new page time to load
time.sleep(3)
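
A fixed time.sleep(3) always waits the full three seconds, whether the page is ready sooner or needs longer. An explicit wait instead polls a condition until it holds or a timeout expires. The sketch below illustrates that idea with the standard library only; wait_until and page_ready are hypothetical names made up for this example and are not part of Selenium.

```python
import time


def wait_until(predicate, timeout=10.0, poll_interval=0.5):
    """Poll `predicate` until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper for illustration only; it is not a Selenium API.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll_interval)


# Simulated page that only becomes "ready" on the third check
checks = {'count': 0}


def page_ready():
    checks['count'] += 1
    return checks['count'] >= 3


wait_until(page_ready, timeout=5, poll_interval=0.01)
```

In the Selenium script, the existing wait object plays this role: a call such as wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@class="comment-item"]'))) could stand in for the fixed sleep, assuming the comment items use that class on the page being loaded.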

Scrape the comments and save them as JSON

# Scrape the comments: 10 pages, 20 comments per page
comments = []
for i in range(10):
    start = i * 20
    url = f'https://movie.douban.com/subject/25868125/comments?start={start}&limit=20&status=P&sort=new_score'
    driver.get(url)
    time.sleep(3)
    # find_elements_by_xpath was removed in Selenium 4; use find_elements(By.XPATH, ...)
    comment_items = driver.find_elements(By.XPATH, '//*[@class="comment-item"]')
    for item in comment_items:
        name = item.find_element(By.XPATH, './/span[@class="comment-info"]/a').text
        date = item.find_element(By.XPATH, './/span[@class="comment-info"]/span[3]').text
        content = item.find_element(By.XPATH, './/p').text
        comments.append({'name': name, 'date': date, 'content': content})

# Save as JSON; ensure_ascii=False keeps Chinese text readable in the file
with open('comments.json', 'w', encoding='utf-8') as f:
    json.dump(comments, f, ensure_ascii=False)
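
The loop above pages through the comments by advancing the start parameter in steps of 20. As a small sketch of that pagination on its own, the query string can be built with urllib.parse.urlencode so the page count and page size become parameters. The names BASE_URL and comment_page_urls are assumptions introduced here for illustration; the URL pattern and query parameters are the ones used in the loop.

```python
from urllib.parse import urlencode

# URL pattern taken from the scraping loop; the constant name is hypothetical
BASE_URL = 'https://movie.douban.com/subject/{subject_id}/comments'


def comment_page_urls(subject_id, pages, page_size=20):
    """Yield one comments-page URL per page, advancing `start` by `page_size`."""
    for page in range(pages):
        query = urlencode({
            'start': page * page_size,
            'limit': page_size,
            'status': 'P',
            'sort': 'new_score',
        })
        yield f'{BASE_URL.format(subject_id=subject_id)}?{query}'


urls = list(comment_page_urls(25868125, pages=3))
for u in urls:
    print(u)
```

With a helper like this, the scraping loop could iterate over comment_page_urls(25868125, pages=10) and pass each URL to driver.get, which keeps the offset arithmetic in one place.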

Complete code

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import json

# Path to the Chrome driver executable
driver_path = 'chromedriver.exe'

# Create a headless Chrome browser and an explicit wait with a 10-second timeout
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# Selenium 4 takes the driver path through a Service object, not as a positional argument
driver = webdriver.Chrome(service=Service(driver_path), options=options)
wait = WebDriverWait(driver, 10)

# Open the movie detail page
url = 'https://movie.douban.com/subject/25868125/'
driver.get(url)

# Wait until the "all reviews" link is clickable, then click it
all_comments = wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="comments-section"]/div[1]/h2/span[2]/a')))
all_comments.click()

# Give the new page time to load
time.sleep(3)

# Scrape the comments: 10 pages, 20 comments per page
comments = []
for i in range(10):
    start = i * 20
    url = f'https://movie.douban.com/subject/25868125/comments?start={start}&limit=20&status=P&sort=new_score'
    driver.get(url)
    time.sleep(3)
    # find_elements_by_xpath was removed in Selenium 4; use find_elements(By.XPATH, ...)
    comment_items = driver.find_elements(By.XPATH, '//*[@class="comment-item"]')
    for item in comment_items:
        name = item.find_element(By.XPATH, './/span[@class="comment-info"]/a').text
        date = item.find_element(By.XPATH, './/span[@class="comment-info"]/span[3]').text
        content = item.find_element(By.XPATH, './/p').text
        comments.append({'name': name, 'date': date, 'content': content})

# Save as JSON; ensure_ascii=False keeps Chinese text readable in the file
with open('comments.json', 'w', encoding='utf-8') as f:
    json.dump(comments, f, ensure_ascii=False)

# Close the browser
driver.quit()
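
After a run, it can be useful to read the JSON file back and confirm that the records and the Chinese text survived the round trip. The sketch below does this with made-up sample records (not real scraped data) that use the same name, date, and content keys as the script, and writes to a temporary directory so it does not touch a real comments.json.

```python
import json
import os
import tempfile

# Made-up sample records with the same keys the scraper writes
sample = [
    {'name': '用户A', 'date': '2023-01-01', 'content': '很好看'},
    {'name': 'user_b', 'date': '2023-01-02', 'content': 'Great film'},
]

path = os.path.join(tempfile.mkdtemp(), 'comments.json')

# Write the records the same way the script does (indent is optional, for readability)
with open(path, 'w', encoding='utf-8') as f:
    json.dump(sample, f, ensure_ascii=False, indent=2)

# Read the file back as text, then parse it
with open(path, encoding='utf-8') as f:
    raw = f.read()
loaded = json.loads(raw)

print(len(loaded), 'comments loaded')
print('Chinese text stored unescaped:', '很好看' in raw)
```

Because ensure_ascii=False is set, the Chinese characters are written as-is rather than as \uXXXX escapes, which is why the file must be opened with encoding='utf-8' on both the write and the read.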

3. Notes

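The code runs Chrome in headless mode, which saves resources and avoids opening a browser window.
The code pauses with time.sleep(3) after each page load to give the page time to render. A fixed pause does not guarantee that loading has finished, so lengthen it or wait on a condition with wait.until if items come back missing.
Because of how Douban loads comments, the code only scrapes the first 10 pages, which is at most 200 comments at 20 per page. To scrape more pages, change the loop count in the code.
Because of Douban's anti-scraping measures, some requests may fail, and the code may need adjusting to the actual situation.
This article is for reference only; in real use, the code needs to be modified and extended to fit the specific case.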
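
One common way to cope with intermittent failures is to retry a page load a few times with a growing, slightly randomized delay. The sketch below shows that pattern with the standard library only. The name fetch_with_retry, its parameters, and the simulated flaky_fetch are assumptions made for this example; it also assumes a failed load surfaces as an exception, so a page that loads but shows a login or verification prompt would need its own check.

```python
import random
import time


def fetch_with_retry(fetch, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call `fetch()` until it succeeds, waiting longer after each failure.

    `fetch` is any zero-argument callable, e.g. `lambda: driver.get(url)`.
    Hypothetical helper for illustration only.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff plus up to 1 second of random jitter
            sleep(base_delay * 2 ** attempt + random.uniform(0, 1))


# Simulated fetch that fails twice before succeeding
attempts = []
delays = []


def flaky_fetch():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError('simulated failure')
    return 'page loaded'


# Record the delays instead of really sleeping, so the demo runs instantly
result = fetch_with_retry(flaky_fetch, sleep=delays.append)
print(result, 'after', len(attempts), 'attempts')
```

In the scraping loop, driver.get(url) could be wrapped as fetch_with_retry(lambda: driver.get(url)). The delays here are illustrative values, not limits published by Douban.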

Hopefully this article helps you understand how to scrape Douban movie review data with Python and the Selenium library.