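Write crawler code for the web page https://you.ctrip.com/sight/dalian4.html that scrapes all the comments for every attraction listed on that page
Below is crawler code for https://you.ctrip.com/sight/dalian4.html that scrapes the comments for each attraction listed on the page: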
import requests
from bs4 import BeautifulSoup

# Build the request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}

# Send the request
url = 'https://you.ctrip.com/sight/dalian4.html'
response = requests.get(url, headers=headers)

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Get the id and name of every attraction
sight_list = soup.select('.list_mod2 .list_mod2_box .rdetailbox h2 a')
sight_ids = [sight['href'].split('/')[-1].split('.')[0] for sight in sight_list]
sight_names = [sight.text for sight in sight_list]

# Loop over each attraction and fetch its comments
for i, sight_id in enumerate(sight_ids):
    sight_name = sight_names[i]
    # Build the request
    url = f'https://you.ctrip.com/sight/{sight_id}.html#ctm_ref=hod_sr_lst_dl_n_4_2'
    response = requests.get(url, headers=headers)
    # Parse the HTML
    soup = BeautifulSoup(response.text, 'html.parser')
    # Get the comments
    comments = soup.select('.comment_ctrip .comment_single')
    for comment in comments:
        # select_one returns a single tag (or None), so we print its text
        # instead of the list of tags that select() would return
        text = comment.select_one('.text_comment')
        if text is not None:
            print(f'{sight_name}: {text.get_text(strip=True)}')
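The id extraction above splits the href by hand and then rebuilds the detail URL from the id alone. A minimal sketch of a more defensive variant follows, using only the standard library. The helper names `sight_id_from_href` and `detail_url_from_href` and the sample href are hypothetical; the real href format on the page has not been verified here.

```python
import posixpath
from urllib.parse import urljoin, urlparse

BASE_URL = 'https://you.ctrip.com/sight/dalian4.html'


def sight_id_from_href(href):
    """Return the last path segment without its extension, or None if empty."""
    path = urlparse(href).path            # ignores any ?query or #fragment
    name = posixpath.basename(path)       # e.g. '12345.html'
    stem, _ext = posixpath.splitext(name)
    return stem or None


def detail_url_from_href(href, base=BASE_URL):
    """Resolve a relative or absolute href against the listing page URL."""
    return urljoin(base, href)


# Hypothetical href, for illustration only.
sample_href = '/sight/dalian4/12345.html'
print(sight_id_from_href(sample_href))
print(detail_url_from_href(sample_href))
```

Resolving the href with `urljoin` keeps whatever path the link actually carries, so the detail URL does not depend on guessing how the site lays out its paths.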
Explanation:
- First, we send a request to fetch the HTML of the target page.
    url = 'https://you.ctrip.com/sight/dalian4.html'
    response = requests.get(url, headers=headers)
- Then, we parse the HTML and extract the id and name of every attraction.
    sight_list = soup.select('.list_mod2 .list_mod2_box .rdetailbox h2 a')
    sight_ids = [sight['href'].split('/')[-1].split('.')[0] for sight in sight_list]
    sight_names = [sight.text for sight in sight_list]
- Next, we loop over the attractions, build a request for each one, and fetch its comments.
    for i, sight_id in enumerate(sight_ids):
        sight_name = sight_names[i]
        url = f'https://you.ctrip.com/sight/{sight_id}.html#ctm_ref=hod_sr_lst_dl_n_4_2'
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        comments = soup.select('.comment_ctrip .comment_single')
- Finally, we print each attraction's name together with the text of each comment. Because the code makes a single request per attraction, it prints only the comments contained in that one response.
    for comment in comments:
        text = comment.select_one('.text_comment')
        if text is not None:
            print(f'{sight_name}: {text.get_text(strip=True)}')
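The loop described above sends one request per attraction with no pause and no error handling. The following is a rough sketch of how the fetching step could be separated out, with a delay between requests and tolerance for individual failures. The function `fetch_pages` and its parameters are hypothetical names introduced for illustration; `fetch` is any callable that takes a URL and returns the page text.

```python
import time


def fetch_pages(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in order, pausing between requests.

    Failures are recorded as None so one bad page does not stop the loop.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)      # pause before every request except the first
        try:
            pages[url] = fetch(url)
        except Exception as exc:      # deliberately broad: this is only a sketch
            print(f'Failed to fetch {url}: {exc}')
            pages[url] = None
    return pages
```

With the code above, `fetch` could be `lambda u: requests.get(u, headers=headers, timeout=10).text`. Passing the fetcher in as an argument also lets the loop be tried out with a stub function before it touches the real site.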
The code uses the requests and BeautifulSoup libraries, which must be installed first. They can be installed with the following commands:
pip install requests
pip install beautifulsoup4
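After installing, a quick way to confirm that both libraries can be imported is sketched below. Note that the `beautifulsoup4` package is imported under the module name `bs4`. The helper name `missing_modules` is hypothetical.

```python
import importlib.util


def missing_modules(names=('requests', 'bs4')):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print('Missing modules:', ', '.join(missing))
else:
    print('requests and bs4 are both importable.')
```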
Original source: https://www.cveoy.top/t/topic/bCcN. Copyright belongs to the author. Please do not repost or scrape.