Below is scraper code for the page https://you.ctrip.com/sight/dalian4.html; it crawls the comments of every sight listed on that page:

import requests
from bs4 import BeautifulSoup

# Browser-like request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

# Fetch the sight-list page
url = 'https://you.ctrip.com/sight/dalian4.html'
response = requests.get(url, headers=headers)

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Collect each sight's id and name
sight_list = soup.select('.list_mod2 .list_mod2_box .rdetailbox h2 a')
sight_ids = [sight['href'].split('/')[-1].split('.')[0] for sight in sight_list]
sight_names = [sight.text for sight in sight_list]

# Visit each sight's page and pull its comments
for i, sight_id in enumerate(sight_ids):
    sight_name = sight_names[i]

    # Build and fetch the sight-page URL
    url = f'https://you.ctrip.com/sight/{sight_id}.html#ctm_ref=hod_sr_lst_dl_n_4_2'
    response = requests.get(url, headers=headers)

    # Parse the sight page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract comments; select_one returns the first match or None,
    # and get_text() yields the text instead of a list of Tag objects
    comments = soup.select('.comment_ctrip .comment_single')
    for comment in comments:
        text = comment.select_one('.text_comment')
        if text:
            print(f'{sight_name}: {text.get_text(strip=True)}')

Explanation:

  1. First, we send a request to fetch the target page's HTML.

    url = 'https://you.ctrip.com/sight/dalian4.html'
    response = requests.get(url, headers=headers)
    
  2. Next, we parse the HTML and extract each sight's id and name.

    sight_list = soup.select('.list_mod2 .list_mod2_box .rdetailbox h2 a')
    sight_ids = [sight['href'].split('/')[-1].split('.')[0] for sight in sight_list]
    sight_names = [sight.text for sight in sight_list]
    
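To see what the id extraction above produces, here is a standalone check on a hypothetical href (the /sight/<city>/<id>.html shape follows Ctrip's usual URL pattern, but this particular value is made up):

```python
# Hypothetical href in Ctrip's usual /sight/<city>/<id>.html shape
href = '/sight/dalian/10558.html'

# Take the last path segment, then strip the .html extension
sight_id = href.split('/')[-1].split('.')[0]
print(sight_id)  # → 10558
```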
  3. Then we loop over the sights, build each sight-page request, and extract the comments.

    for i, sight_id in enumerate(sight_ids):
        sight_name = sight_names[i]
        url = f'https://you.ctrip.com/sight/{sight_id}.html#ctm_ref=hod_sr_lst_dl_n_4_2'
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        comments = soup.select('.comment_ctrip .comment_single')
        for comment in comments:
            text = comment.select_one('.text_comment')
            if text:
                print(f'{sight_name}: {text.get_text(strip=True)}')
    
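Indexing with enumerate works, but since sight_ids and sight_names were built from the same element list, pairing them with zip is the more idiomatic pattern (the sample values below are made up):

```python
# Made-up sample data standing in for the scraped lists
sight_ids = ['10558', '10559']
sight_names = ['星海广场', '老虎滩']

# zip walks both lists in lockstep, no index bookkeeping needed
for sight_id, sight_name in zip(sight_ids, sight_names):
    print(sight_id, sight_name)
```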
  4. Finally, we print each sight's name together with its comment text.
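Printing to stdout is fine for a quick look, but for later analysis a file is usually more useful. A minimal sketch using the standard csv module (the filename and column names are my own choices, and the rows stand in for data collected in the loop):

```python
import csv

# Stand-in rows; in the real loop you would append (sight_name, text) tuples
rows = [('星海广场', '风景很美'), ('老虎滩', '值得一去')]

# utf-8-sig adds a BOM so Excel opens the Chinese text correctly
with open('comments.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['sight', 'comment'])  # header row
    writer.writerows(rows)
```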

The code uses the requests and BeautifulSoup libraries, which need to be installed first:

pip install requests
pip install beautifulsoup4

Note that this approach only captures the comments present in each page's initial HTML. Ctrip paginates comments and loads later pages dynamically, so collecting truly all comments would require the site's comment interface or a browser-automation tool such as Selenium.

Original source: https://www.cveoy.top/t/topic/bCcN. Copyright belongs to the author; please do not repost or scrape.
