Scraping Douban Movie Reviews with Python: Example Code and Steps

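This tutorial walks through scraping Douban movie review data with Python, with code examples and step-by-step explanations. Using the film 你好，李焕英 (Hi, Mom) as the example, it shows how to collect each reviewer's name, the review time, and the review text.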

Step 1: Install the required libraries

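First, install the third-party libraries the script depends on: BeautifulSoup (the beautifulsoup4 package) and requests. The json module used later ships with Python's standard library and needs no installation. Install the two packages with pip; the leading ! in the commands below is notebook syntax (for example Jupyter), so drop it when running them in a regular shell.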

!pip install beautifulsoup4
!pip install requests

Step 2: Scrape the first page of reviews

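Next, use requests to send an HTTP request for the page and BeautifulSoup to parse the returned HTML. The reviewer name, review time, and review text all sit inside div tags whose class attribute is comment-item.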

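import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/subject/25868125/comments?start=0&limit=20&status=P&sort=new_score'
# A browser-like User-Agent; without one the site may reject the request.
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
soup = BeautifulSoup(response.text, 'html.parser')

comments = []

# Each review lives in its own <div class="comment-item">.
for comment_div in soup.find_all('div', class_='comment-item'):
    comment = {}
    # Assumes the username link sits inside <span class="comment-info">;
    # find('a', class_='') may also match the avatar link, whose text is empty.
    comment['user'] = comment_div.select_one('span.comment-info a').text.strip()
    comment['time'] = comment_div.find('span', class_='comment-time').text.strip()
    comment['content'] = comment_div.find('span', class_='short').text.strip()
    comments.append(comment)

print(comments)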

Running the code prints output like the following:

[{'user': '人·妖', 'time': '2021-11-12 23:48:40', 'content': '喜欢这种轻松愉悦的电影，喜欢这种优美温馨的情节，尤其是那段早上起床跑步的场景，太美了，太有爱了。'}, {'user': 'CII', 'time': '2021-11-11 22:15:21', 'content': '看完真的太开心了，虽然不是很好笑但是真的好看，还有很多感动的地方。而且不会很虐啊，整个电影都是暖暖的，推荐给大家'}, {'user': '小猪', 'time': '2021-11-12 21:10:55', 'content': '看完之后感觉好温馨，最喜欢的是两个人在早晨一起跑步的镜头，好美好美。'}...]
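The time field comes back as a plain string. If later analysis needs to sort or filter by date, one option is to parse it with datetime.strptime. The sketch below assumes every timestamp follows the 'YYYY-MM-DD HH:MM:SS' pattern seen in the sample output; the sample records and the helper name with_parsed_time are illustrative and not part of the original script.

```python
from datetime import datetime

# Assumed format, inferred from the sample output above.
TIME_FORMAT = '%Y-%m-%d %H:%M:%S'


def with_parsed_time(records):
    """Return copies of the records with 'time' converted to datetime."""
    parsed = []
    for record in records:
        item = dict(record)  # copy so the caller's data stays untouched
        item['time'] = datetime.strptime(record['time'], TIME_FORMAT)
        parsed.append(item)
    return parsed


# Illustrative records shaped like the scraper's output.
sample = [
    {'user': 'user_a', 'time': '2021-11-12 23:48:40', 'content': 'first'},
    {'user': 'user_b', 'time': '2021-11-11 22:15:21', 'content': 'second'},
]

# Sort from oldest to newest using the parsed timestamps.
oldest_first = sorted(with_parsed_time(sample), key=lambda r: r['time'])
print([r['user'] for r in oldest_first])
```

If the site ever returns a timestamp in another format, strptime raises ValueError, which makes a format change easy to notice.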

Step 3: Scrape multiple pages of reviews

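A for loop can fetch several pages. Each page's URL differs only in its start parameter, which is the offset of the first review on that page, so stepping start by the page size (20 here) moves to the next page.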

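import time
import requests
from bs4 import BeautifulSoup

url_template = 'https://movie.douban.com/subject/25868125/comments?start={start}&limit=20&status=P&sort=new_score'
# A browser-like User-Agent; without one the site may reject the request.
headers = {'User-Agent': 'Mozilla/5.0'}

comments = []

# start is the offset of the first review on each page: 0, then 20.
for start in range(0, 40, 20):
    url = url_template.format(start=start)
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # stop early on an HTTP error status
    soup = BeautifulSoup(response.text, 'html.parser')
    for comment_div in soup.find_all('div', class_='comment-item'):
        comment = {}
        # Assumes the username link sits inside <span class="comment-info">.
        comment['user'] = comment_div.select_one('span.comment-info a').text.strip()
        comment['time'] = comment_div.find('span', class_='comment-time').text.strip()
        comment['content'] = comment_div.find('span', class_='short').text.strip()
        comments.append(comment)
    time.sleep(1)  # pause 1 second between pages to reduce the risk of an IP ban

print(comments)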

Running the code prints the reviews collected from both pages (start=0 and start=20).
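As an alternative to formatting the URL by hand, the query string can be assembled with urllib.parse.urlencode, which keeps the paging arithmetic in one place. This is only a sketch: the helper name build_page_urls and its parameters are made up for illustration, and the parameter values simply mirror the URL used in this tutorial.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/subject/25868125/comments'


def build_page_urls(pages, page_size=20):
    """Return one comments URL per page, stepping start by page_size."""
    urls = []
    for page in range(pages):
        query = urlencode({
            'start': page * page_size,  # offset of the first review on the page
            'limit': page_size,
            'status': 'P',
            'sort': 'new_score',
        })
        urls.append(f'{BASE_URL}?{query}')
    return urls


for page_url in build_page_urls(3):
    print(page_url)
```

Each returned URL can be passed to requests.get in place of url_template.format(start=start).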

Step 4: Store the data

Finally, use the json module to save the scraped data to a file in JSON format.

import json

with open('comments.json', 'w', encoding='utf-8') as f:
    json.dump(comments, f, ensure_ascii=False, indent=4)

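This code writes the data to comments.json. Because of indent=4 the output is pretty-printed, with each key of each record on its own line, and ensure_ascii=False keeps Chinese characters readable instead of escaping them as \uXXXX sequences.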
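To check that the file can be read back intact, a round trip along these lines may help. The sample record and the temporary directory are stand-ins chosen for illustration; in practice the path would be the comments.json written above.

```python
import json
import os
import tempfile

# Illustrative record shaped like the scraper's output.
sample = [{'user': 'user_a', 'time': '2021-11-12 23:48:40', 'content': '很温馨'}]

# Write to a throwaway directory so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), 'comments.json')

with open(path, 'w', encoding='utf-8') as f:
    json.dump(sample, f, ensure_ascii=False, indent=4)

# Load the file back as Python objects.
with open(path, encoding='utf-8') as f:
    loaded = json.load(f)

# Read the raw text to confirm the Chinese characters were not escaped.
with open(path, encoding='utf-8') as f:
    raw_text = f.read()

print(loaded == sample)
print('很温馨' in raw_text)
```

Passing encoding='utf-8' on both the write and the read avoids depending on the platform's default encoding.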

Summary

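This tutorial demonstrated the basic approach to scraping Douban movie review data with Python, along with code examples. You can adapt the code to collect other kinds of data. When scraping any website, check its robots.txt file and comply with the applicable laws and regulations.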
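Checking robots.txt can be done programmatically with the standard library's urllib.robotparser. The sketch below parses a made-up rule set from a string so it runs offline; the rules and the example.com URLs are hypothetical and do not reflect any real site's policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; not the real file of any site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

# Ask whether a generic crawler may fetch each path.
allowed = parser.can_fetch('*', 'https://example.com/subject/1/comments')
blocked = parser.can_fetch('*', 'https://example.com/private/data')

# Crawl-delay, if the file declares one, suggests how long to wait between requests.
delay = parser.crawl_delay('*')
print(allowed, blocked, delay)
```

Against a live site, calling set_url() with the address of its robots.txt followed by read() loads the real rules in place of the sample string.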