Douban has anti-scraping measures in place, so the crawler needs to rotate proxy IPs and randomize the User-Agent header. Example code:

import requests
from bs4 import BeautifulSoup
import random
import time

# Pool of User-Agent strings to pick from at random
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
]

# Proxy pool (placeholder addresses; substitute proxies that actually work for you)
proxy_pool = [
    'http://121.232.148.178:9000',
    'http://39.137.69.7:8080',
    'http://39.137.69.6:80',
    'http://39.137.69.6:8080',
    'http://39.137.69.10:8080',
    'http://39.137.69.8:8080',
]

# Crawl 1000 comments (50 pages, 20 comments per page)
for page in range(1, 51):
    # Pick a random proxy and User-Agent for each request
    proxy_url = random.choice(proxy_pool)
    # Route both HTTP and HTTPS through the proxy, since the target URL is HTTPS
    proxies = {'http': proxy_url, 'https': proxy_url}
    headers = {'User-Agent': random.choice(user_agents)}

    # Build the request URL (p is the 1-based page number)
    url = f'https://book.douban.com/subject/10554308/comments/hot?p={page}'

    # Send the request; skip this page on any request error
    try:
        response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f'Request Error: {e}')
        continue

    # Parse the HTML and collect the comment blocks
    soup = BeautifulSoup(response.content, 'html.parser')
    comments = soup.find_all(class_='comment-item')

    # Append the text of each comment to the output file
    with open('comments.txt', 'a', encoding='utf-8') as f:
        for comment in comments:
            if comment.p is not None:
                f.write(comment.p.text.strip() + '\n')

    # Sleep for a random interval to reduce the risk of an IP ban
    time.sleep(random.randint(1, 5))

The code above appends every comment to comments.txt in the current directory. In practice, use legitimate proxy IPs and User-Agent strings, and keep the request rate reasonable so your IP does not get banned.
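Free proxies go stale quickly, so it often pays to filter the pool before starting the crawl. Below is a minimal sketch of such a check; filter_live_proxies is a hypothetical helper, and the test URL and 5-second timeout are arbitrary choices, not part of the original code:

import requests

def filter_live_proxies(proxy_pool, test_url='https://www.douban.com', timeout=5):
    """Return only the proxies that can complete a test request."""
    live = []
    for proxy_url in proxy_pool:
        proxies = {'http': proxy_url, 'https': proxy_url}
        try:
            # Any completed response proves the proxy is reachable
            requests.get(test_url, proxies=proxies, timeout=timeout)
            live.append(proxy_url)
        except requests.RequestException:
            pass  # Dead or unreachable proxy: drop it from the pool
    return live

# Example: shrink the pool to working proxies before entering the crawl loop
# proxy_pool = filter_live_proxies(proxy_pool)

Running the check once up front trades a handful of extra requests for far fewer mid-crawl failures.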

