Tutorial: Scraping Douban Movie Reviews of Puss in Boots 2 (《穿靴子的猫2》) with a Web Crawler

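This tutorial walks you through using Python with the Selenium, BeautifulSoup, and requests libraries to scrape review data from the Douban Movie page for Puss in Boots 2 (《穿靴子的猫2》) and store it in JSON format.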

Step 1: Use Selenium to open the movie's full review page

First, install the Selenium library with the following command:

pip install selenium

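Next, download the ChromeDriver build that matches your Chrome browser version from: http://chromedriver.chromium.org/downloads

After downloading, unzip the archive to get an executable file, then add the directory that contains it to your system PATH environment variable. With Selenium 4.6 or later, the bundled Selenium Manager can usually obtain a matching driver automatically, so this manual step may be unnecessary.

You can now use Selenium to simulate the click that leads to the full review page. The code is as follows: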

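from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Open the browser
driver = webdriver.Chrome()

# Open the movie page
url = 'https://movie.douban.com/subject/25868125/'
driver.get(url)

# Click through to the full review page (first element with class 'more')
btn = driver.find_element(By.CLASS_NAME, 'more')
btn.click()

# Wait for the page to finish loading
time.sleep(2)

# Close the browser
driver.quit()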
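
A fixed two-second sleep can be too short on a slow connection and wasteful on a fast one. Selenium provides WebDriverWait for this purpose; the sketch below only illustrates the underlying polling idea in plain Python. The helper name wait_until and its parameters are hypothetical and are not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or raises TimeoutError if the deadline is reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(interval)
```

With a live driver this could be called as wait_until(lambda: 'comments' in driver.current_url). That condition is only an assumption about where the click leads and may need adjusting.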

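Step 2: Scrape the commenter names, comment times, and comment text from the first page

Next, use the requests and BeautifulSoup libraries to fetch and parse the page (install them with pip install requests beautifulsoup4 if needed). The code is as follows: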

import requests
from bs4 import BeautifulSoup

# Request the first page of comments
url = 'https://movie.douban.com/subject/25868125/comments?start=0&limit=20&sort=new_score&status=P'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
response.raise_for_status()  # stop early if the request was rejected

# Parse the page
soup = BeautifulSoup(response.text, 'html.parser')
comments = soup.find_all('div', class_='comment-item')

# Extract the data
data = []
for comment in comments:
    name = comment.find('span', class_='comment-info').find('a').get_text().strip()
    # Named comment_time so it does not shadow the time module
    comment_time = comment.find('span', class_='comment-time').get_text().strip()
    content = comment.find('span', class_='short').get_text().strip()
    data.append({'name': name, 'time': comment_time, 'content': content})

# Print the data
print(data)
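
BeautifulSoup's find returns None when nothing matches, so a chained call such as find(...).get_text() raises AttributeError as soon as one comment lacks an expected element. The following is a minimal sketch of a more tolerant extraction. The helper names text_or_none and extract_comment are hypothetical, and the class names are the ones this tutorial already assumes.

```python
def text_or_none(node):
    """Return the stripped text of a node, or None when the node is missing."""
    return node.get_text().strip() if node is not None else None


def extract_comment(item):
    """Build one record from a comment element, tolerating missing children."""
    info = item.find('span', class_='comment-info')
    author = info.find('a') if info is not None else None
    return {
        'name': text_or_none(author),
        'time': text_or_none(item.find('span', class_='comment-time')),
        'content': text_or_none(item.find('span', class_='short')),
    }
```

Each element returned by soup.find_all('div', class_='comment-item') can be passed to extract_comment. Records with None fields can then be filtered out or logged instead of aborting the whole run.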

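Step 3: Scrape the commenter names, comment times, and comment text from the second and third pages

Each page holds 20 comments, so changing the start parameter in the URL (20 for page 2, 40 for page 3) is enough to fetch the second and third pages. The code is as follows: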

# Fetch pages 2 and 3 and append them to the data list built in step 2
for i in range(2, 4):
    start = (i - 1) * 20
    url = f'https://movie.douban.com/subject/25868125/comments?start={start}&limit=20&sort=new_score&status=P'
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    comments = soup.find_all('div', class_='comment-item')
    for comment in comments:
        name = comment.find('span', class_='comment-info').find('a').get_text().strip()
        comment_time = comment.find('span', class_='comment-time').get_text().strip()
        content = comment.find('span', class_='short').get_text().strip()
        data.append({'name': name, 'time': comment_time, 'content': content})

# Print the data
print(data)
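
The page-to-offset arithmetic can be isolated in a small helper, which makes it easier to check before sending any requests. This is a sketch only: the function name build_comment_url and the constants are hypothetical, while the query parameters are copied from the URL used in this tutorial.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/subject/25868125/comments'
PAGE_SIZE = 20  # comments per page, matching limit=20 in the tutorial URL


def build_comment_url(page, page_size=PAGE_SIZE):
    """Return the comments URL for a 1-based page number."""
    if page < 1:
        raise ValueError('page must be >= 1')
    params = {
        'start': (page - 1) * page_size,  # offset of the first comment on the page
        'limit': page_size,
        'sort': 'new_score',
        'status': 'P',
    }
    return f'{BASE_URL}?{urlencode(params)}'


for page in range(1, 4):
    print(build_comment_url(page))
```

Page 1 maps to start=0, page 2 to start=20, and page 3 to start=40.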

Step 4: Save the scraped data to a file in JSON format

Finally, write the collected data to a JSON file. The code is as follows:

import json

# Save the data (ensure_ascii=False keeps Chinese text readable in the file)
with open('comments.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)

# Read the data back to verify it
with open('comments.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
    print(data)
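
To see what ensure_ascii=False changes, the sketch below serializes the same record both ways. The sample record is made up for illustration and is not real scraped data; indent=2 is an optional extra that makes the output easier to read.

```python
import json

# Hypothetical sample record standing in for scraped data
records = [{'name': 'user_a', 'time': '2023-01-01', 'content': '很好看'}]

escaped = json.dumps(records)  # default: non-ASCII characters become \uXXXX escapes
readable = json.dumps(records, ensure_ascii=False, indent=2)  # keeps the original characters

print(escaped)
print(readable)

# Both forms decode back to the same Python objects
print(json.loads(escaped) == json.loads(readable))
```

Either form loads back identically, so the choice only affects how the file looks when opened in an editor.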

The complete code is as follows:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import requests
from bs4 import BeautifulSoup
import json

# Open the browser
driver = webdriver.Chrome()

# Open the movie page
url = 'https://movie.douban.com/subject/25868125/'
driver.get(url)

# Click through to the full review page (first element with class 'more')
btn = driver.find_element(By.CLASS_NAME, 'more')
btn.click()

# Wait for the page to finish loading
time.sleep(2)

# Scrape the first three pages of comments
headers = {'User-Agent': 'Mozilla/5.0'}
data = []
for i in range(1, 4):
    start = (i - 1) * 20
    url = f'https://movie.douban.com/subject/25868125/comments?start={start}&limit=20&sort=new_score&status=P'
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    comments = soup.find_all('div', class_='comment-item')
    for comment in comments:
        name = comment.find('span', class_='comment-info').find('a').get_text().strip()
        # Named comment_time so it does not shadow the time module
        comment_time = comment.find('span', class_='comment-time').get_text().strip()
        content = comment.find('span', class_='short').get_text().strip()
        data.append({'name': name, 'time': comment_time, 'content': content})

# Save the data
with open('comments.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)

# Read the data back to verify it
with open('comments.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
    print(data)

# Close the browser
driver.quit()

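Notes:

The code above scrapes only the first three pages of comments; change the page range in the loop to fetch more.
This tutorial is for learning and discussion only; do not use it for any illegal activity.
Before scraping a website, make sure you comply with its robots.txt rules.
The code may need adjusting if the page structure of the Douban Movie site changes.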
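
The robots.txt check mentioned above can be automated with Python's standard urllib.robotparser module. The sketch below parses a made-up rule set offline. The sample rules, the helper name build_parser, and the user agent string my-crawler are all hypothetical, and the rules do not reflect Douban's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Douban's real robots.txt
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(parser.can_fetch('my-crawler', 'https://example.com/private/page'))
print(parser.can_fetch('my-crawler', 'https://example.com/public/page'))
```

Against a live site, calling set_url with the site's robots.txt address and then read would download the real rules before can_fetch is used.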