Python Web Scraping in Practice: Collecting Douban Movie Reviews for Puss in Boots 2 (《穿靴子的猫2》)

This tutorial uses Python with Selenium, requests, and Beautiful Soup to scrape the reviews of Puss in Boots 2 (《穿靴子的猫2》) from Douban Movie, page by page.

Target URL:

https://movie.douban.com/subject/25868125/

Implementation:

# Import the required libraries
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
import json

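# (1) Navigating the dynamic page

# 1. Import Selenium
from selenium import webdriver
from selenium.webdriver.common.by import By

# 2. Create the browser object
browser = webdriver.Chrome()

# 3. Open the movie page
url = 'https://movie.douban.com/subject/25868125/'
browser.get(url)

# 4. Locate the link to the full review list and click it
# (find_element_by_xpath was removed in Selenium 4; use find_element with By.XPATH)
button = browser.find_element(By.XPATH, '//a[@class="more"]')
button.click()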

# (2) Analyzing the page and extracting data

# 1. Import the libraries
import requests
from bs4 import BeautifulSoup

# 2. Set the request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

# 3. Send the request and fetch the page
# (requests downloads the page on its own, independently of the Selenium browser session)
response = requests.get(url, headers=headers)

# 4. Parse the page structure
soup = BeautifulSoup(response.text, 'html.parser')

# 5. Locate the reviewer names
names = soup.select('div.comment > h3 > span.comment-info > a')
for name in names:
    print(name.text)

# 6. Locate the review times
times = soup.select('div.comment > h3 > span.comment-info > span.comment-time')
for time in times:
    print(time.text)

# 7. Locate the review text
comments = soup.select('div.comment > p > span.short')
for comment in comments:
    print(comment.text)

# (3) Scraping multiple pages

# 1. Initial page URL
url = 'https://movie.douban.com/subject/25868125/comments?status=P'

# 2. Pagination pattern: each page holds 20 reviews, so page i starts at offset i * 20
#    (start=0, 20, 40, ...), with limit=20 fixed

# 3. Loop over the first 10 pages
for i in range(10):
    url = 'https://movie.douban.com/subject/25868125/comments?start=' + str(i * 20) + '&limit=20&sort=new_score&status=P'
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    names = soup.select('div.comment > h3 > span.comment-info > a')
    times = soup.select('div.comment > h3 > span.comment-info > span.comment-time')
    comments = soup.select('div.comment > p > span.short')
    for name, time, comment in zip(names, times, comments):
        print(name.text, time.text, comment.text)

# (4) Saving the data

# 1. Write the data to a file
import json

# Collect every page into one list so the file holds a single JSON array
data = []
for i in range(10):
    url = 'https://movie.douban.com/subject/25868125/comments?start=' + str(i * 20) + '&limit=20&sort=new_score&status=P'
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    names = soup.select('div.comment > h3 > span.comment-info > a')
    times = soup.select('div.comment > h3 > span.comment-info > span.comment-time')
    comments = soup.select('div.comment > p > span.short')
    for name, time, comment in zip(names, times, comments):
        data.append({'name': name.text.strip(), 'time': time.text.strip(), 'comment': comment.text.strip()})

# Write once, after the loop; appending one array per page would not produce valid JSON
with open('comments.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

# 2. JSON format of the output file

```json
[
    {
        "name": "username",
        "time": "2021-01-01",
        "comment": "review text"
    },
    {
        "name": "username",
        "time": "2021-01-02",
        "comment": "review text"
    },
    ...
]
```

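As a quick check of the file format, the sketch below writes a couple of records as one JSON array and reads them back. The sample records and the helper names `save_records` and `load_records` are made up for illustration; only the three keys come from the format shown here.

```python
import json
import os
import tempfile

# Hypothetical sample records with the same three keys as the output format.
records = [
    {"name": "user_a", "time": "2021-01-01", "comment": "影评内容"},
    {"name": "user_b", "time": "2021-01-02", "comment": "review text"},
]


def save_records(records, path):
    """Write all records as one JSON array, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=4)


def load_records(path):
    """Read the array back and check that every record has the expected keys."""
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)
    expected = {"name", "time", "comment"}
    for record in loaded:
        missing = expected - record.keys()
        if missing:
            raise ValueError(f"record is missing keys: {sorted(missing)}")
    return loaded


# Use a temporary directory so the demo does not overwrite a real comments.json.
path = os.path.join(tempfile.mkdtemp(), "comments.json")
save_records(records, path)
restored = load_records(path)
print(len(restored), restored[0]["comment"])
```

Because the whole file is a single array, `json.load` can read it in one call; a file with one array per line would need to be parsed line by line instead.
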
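Code walkthrough:

Import the required libraries: selenium, Beautiful Soup, requests, and json.
Use Selenium to drive a browser, open the Douban movie page, and click the link to the full review list.
Use Beautiful Soup to parse the page and locate each reviewer's name, the review time, and the review text.
Use requests to fetch the review pages one after another.
Use json to save the collected data to a JSON file.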
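
The page URLs in the loops are built by string concatenation. As a minimal sketch, the same URLs can be produced with `urllib.parse.urlencode`, which keeps the query parameters in one place. The names `BASE_URL`, `PAGE_SIZE`, and `build_page_url` are assumptions introduced here; the parameters and their values are the ones used in the tutorial's loop.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/subject/25868125/comments"
PAGE_SIZE = 20  # reviews per page, matching limit=20 in the tutorial's loop


def build_page_url(page, sort="new_score", status="P"):
    """Return the URL of a zero-based page of reviews."""
    query = urlencode({
        "start": page * PAGE_SIZE,  # offset of the first review on this page
        "limit": PAGE_SIZE,
        "sort": sort,
        "status": status,
    })
    return f"{BASE_URL}?{query}"


# The same ten pages the tutorial iterates over.
urls = [build_page_url(page) for page in range(10)]
print(urls[0])
print(urls[-1])
```

Each element of `urls` can be passed to `requests.get(url, headers=headers)` in place of the concatenated string.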

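Notes:

Make sure the required libraries are installed: selenium, beautifulsoup4 (Beautiful Soup), and requests. json is part of the Python standard library and needs no installation.
Before running the code, make sure Selenium can find the Chrome driver (chromedriver), for example by putting it on your PATH. Recent Selenium releases (4.6 and later) can usually fetch a matching driver automatically.
Douban may restrict crawler access. If you are blocked, try slowing down your requests, changing the User-Agent, or using a proxy server.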
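
One common way to cope with temporary blocking is to retry with increasing pauses. The sketch below is one possible approach, not something the site documents: `fetch_with_retry` is a hypothetical helper, the fetch function is passed in so the demo runs without network access, and the status codes in the demo are invented.

```python
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a response whose status_code is 200.

    fetch is any callable that takes a URL and returns an object with a
    status_code attribute. Between attempts the pause doubles each time.
    """
    last_status = None
    for attempt in range(retries):
        response = fetch(url)
        last_status = response.status_code
        if last_status == 200:
            return response
        # Pause before the next attempt, but not after the final one.
        if attempt < retries - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"giving up on {url}: last status {last_status}")


class FakeResponse:
    """Stand-in for a real response object, used only in this demo."""

    def __init__(self, status_code):
        self.status_code = status_code


# Simulate two rejected requests followed by a success.
statuses = iter([403, 403, 200])
waits = []  # records the requested pauses instead of actually sleeping
result = fetch_with_retry(
    lambda u: FakeResponse(next(statuses)),
    "https://example.invalid/page",
    sleep=waits.append,
)
print(result.status_code, waits)
```

With the real library, the fetch argument could be `lambda u: requests.get(u, headers=headers, timeout=10)`, leaving `sleep` at its default so the pauses actually happen.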

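Summary:

This tutorial walked through a complete example of scraping Douban movie reviews, with commented code for each step: opening the page, extracting the fields, paging through the results, and saving them as JSON. We hope it helps you get more comfortable with web scraping in Python.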