Python Web Scraping in Practice: Scraping Douban Reviews of Puss in Boots: The Last Wish (《穿靴子的猫2》)

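Douban Movie has fairly strong anti-scraping measures, so plain requests calls rarely return usable data. This walkthrough therefore uses Selenium to drive a real browser and scrape the pages.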

The code is as follows:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import json

# Path to the chromedriver executable (Selenium 4 expects it wrapped in a Service)
browser = webdriver.Chrome(service=Service('/usr/local/bin/chromedriver'))

# Open the movie page
url = 'https://movie.douban.com/subject/25868125/'
browser.get(url)

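# Click the "全部影评" (all reviews) link
# find_element_by_xpath was removed in Selenium 4; the inner quotes must also differ from the outer ones
button = browser.find_element(By.XPATH, '//*[@id="reviews"]/div[1]/div[2]/a')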
button.click()

# Wait up to 10 seconds for the comments section to appear
wait = WebDriverWait(browser, 10)
wait.until(EC.presence_of_element_located((By.ID, 'comments-section')))

# Read the rendered page source
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')

# Collect the comments on the first page
comments = []
for item in soup.find_all('div', class_='comment-item'):
    name = item.find('a', class_='').get_text().strip()  # commenter name
    time = item.find('span', class_='comment-time').get_text().strip()  # comment time
    content = item.find('span', class_='short').get_text().strip()  # comment text
    comments.append({'name': name, 'time': time, 'content': content})

# Collect the comments on pages 2 and 3 (start=20 and start=40)
for i in range(1, 3):
    url = 'https://movie.douban.com/subject/25868125/comments?start={}&limit=20&status=P&sort=new_score'.format(i*20)
    browser.get(url)
    wait.until(EC.presence_of_element_located((By.ID, 'comments-section')))
    html = browser.page_source
    soup = BeautifulSoup(html, 'html.parser')
    for item in soup.find_all('div', class_='comment-item'):
        name = item.find('a', class_='').get_text().strip()  # commenter name
        time = item.find('span', class_='comment-time').get_text().strip()  # comment time
        content = item.find('span', class_='short').get_text().strip()  # comment text
        comments.append({'name': name, 'time': time, 'content': content})

# Save the results as a JSON file
with open('comments.json', 'w', encoding='utf-8') as f:
    json.dump(comments, f, ensure_ascii=False)

# Close the browser
browser.quit()
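
A short note on the storage step: the script passes ensure_ascii=False to json.dump so that Chinese text is written as readable characters rather than \uXXXX escapes. The sketch below illustrates the difference using a made-up sample record (not real scraped data) and a temporary file; the names sample, path, and escaped are only for illustration.

```python
import json
import os
import tempfile

# Made-up record in the same shape the scraper produces (illustrative only).
sample = [{'name': '示例用户', 'time': '2023-01-01 12:00:00', 'content': '很好看'}]

# Write to a temporary directory so the example does not touch real output.
path = os.path.join(tempfile.mkdtemp(), 'comments.json')
with open(path, 'w', encoding='utf-8') as f:
    json.dump(sample, f, ensure_ascii=False, indent=2)

# Read the file back: the raw text contains the Chinese characters directly.
with open(path, encoding='utf-8') as f:
    raw = f.read()
loaded = json.loads(raw)

# With the default ensure_ascii=True, non-ASCII characters are escaped instead.
escaped = json.dumps(sample)

print(raw)
print(escaped)
```

Both forms load back to the same data; the unescaped form is simply easier to inspect by eye.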

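When the program runs, it opens Chrome, automatically clicks the "全部影评" (all reviews) link, scrapes the comments on the first page and on pages 2 and 3, and finally saves the data to a JSON file.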
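
The pagination relies on the start query parameter advancing by 20 per page. As a minimal sketch of that mapping, the hypothetical helper comment_page_url below (not part of the original script) builds the same URLs from a 1-based page number, assuming the query parameters used in the script stay valid.

```python
from urllib.parse import urlencode

# URL template taken from the script above; the helper itself is hypothetical.
BASE = 'https://movie.douban.com/subject/{}/comments'


def comment_page_url(subject_id, page, page_size=20):
    """Return the comments URL for a 1-based page number."""
    query = urlencode({
        'start': (page - 1) * page_size,  # page 1 -> 0, page 2 -> 20, ...
        'limit': page_size,
        'status': 'P',
        'sort': 'new_score',
    })
    return BASE.format(subject_id) + '?' + query


for page in (2, 3):
    print(comment_page_url(25868125, page))
```

Building the query with urlencode keeps the offset arithmetic in one place, which makes it easier to extend the range beyond three pages.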

A screenshot of the results is shown below: