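Write Python code to scrape 1,000 Douban reviews of the novel Ordinary World (平凡的世界).
Because Douban has anti-scraping measures, the scraper needs to send requests through proxy IPs with a randomly chosen User-Agent header. Sample code follows: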
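import random
import time

import requests
from bs4 import BeautifulSoup

# User-Agent strings to rotate through (example values; extend as needed)
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0',
]

# Proxy addresses (sample values; replace with working proxies you are allowed to use)
proxy_addresses = [
    'http://121.232.148.178:9000',
    'http://39.137.69.7:8080',
    'http://39.137.69.6:80',
    'http://39.137.69.6:8080',
    'http://39.137.69.10:8080',
    'http://39.137.69.8:8080',
]

# Fetch 50 pages of about 20 comments each, roughly 1000 comments
# (assumes the p parameter is a 1-based page number, not an offset)
for page in range(1, 51):
    # Pick a random proxy and User-Agent for this request
    address = random.choice(proxy_addresses)
    # The URL is HTTPS, so the proxy must also be registered under the 'https' key
    proxy = {'http': address, 'https': address}
    header = {'User-Agent': random.choice(user_agents)}

    # Build the request URL
    url = f'https://book.douban.com/subject/10554308/comments/hot?p={page}'

    # Send the request and get the response
    try:
        response = requests.get(url, headers=header, proxies=proxy, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f'Request Error: {e}')
        continue

    # Parse the HTML
    soup = BeautifulSoup(response.content, 'html.parser')
    comments = soup.find_all(class_='comment-item')

    # Append the comments to the output file
    with open('comments.txt', 'a', encoding='utf-8') as f:
        for comment in comments:
            # Skip items without a <p> element instead of raising AttributeError
            if comment.p is not None:
                f.write(comment.p.text.strip() + '\n')

    # Sleep for a random interval to reduce the risk of an IP ban
    time.sleep(random.randint(1, 5))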
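The code above appends each comment to comments.txt in the current directory. In practice, make sure the proxy IPs and User-Agent headers you use are legitimate, and keep the request rate reasonable so that your IP does not get blocked.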
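One way to keep the request rate reasonable is to retry failed requests with exponential backoff and random jitter rather than moving straight on to the next request. The following is a minimal sketch, not part of the original answer: the helper names `backoff_delay` and `call_with_retries`, and the defaults for `base`, `cap`, and `max_attempts`, are assumptions chosen for illustration.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Return a randomized delay in seconds for a 0-based retry attempt."""
    # The ceiling doubles with each attempt but never exceeds `cap`;
    # the actual delay is drawn uniformly from [0, ceiling] ("full jitter").
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)


def call_with_retries(func, max_attempts=4, base=1.0, cap=60.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts calls have failed."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            # Re-raise the last error once the attempts are used up
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap))
```

In the scraper, the request could then be wrapped as `call_with_retries(lambda: requests.get(url, headers=header, proxies=proxy, timeout=10))`. The `sleep` parameter is injectable only so the helper can be exercised without actually waiting.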
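Since the goal is exactly 1,000 reviews, it may help to drop blank and duplicate comments and stop as soon as the target is reached. The sketch below is an illustration under assumptions: `take_unique` is a hypothetical helper, and `pages` stands for any iterable that yields one list of comment strings per scraped page.

```python
def take_unique(pages, limit=1000):
    """Yield up to `limit` distinct, non-empty comment strings."""
    seen = set()
    for page in pages:
        for text in page:
            text = text.strip()
            # Skip blank comments and ones already emitted
            if not text or text in seen:
                continue
            seen.add(text)
            yield text
            # Stop as soon as the target count is reached
            if len(seen) >= limit:
                return
```

Because it is a generator, it can be fed pages lazily, so scraping stops once enough distinct comments have been collected.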