Crawling People's Daily Articles with Python: Avoiding IP Bans, Keyword Filtering, and Statistics

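Below is a Python example for crawling People's Daily articles. It filters articles by keyword, saves the matching ones to a given directory, reports the number of articles crawled and the correctness ratio, and includes some techniques for avoiding IP bans: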

import datetime
import random
import time

import requests
from bs4 import BeautifulSoup

# Crawl People's Daily articles between start_date and end_date (inclusive)
def crawl_articles(start_date, end_date, keyword, save_path):
    article_count = 0
    correct_count = 0

    cur_date = start_date
    while cur_date <= end_date:
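        # Pick one random proxy per day to reduce the risk of an IP ban
        proxy = 'http://' + random.choice(proxy_list)
        # HTTP proxies are addressed with an http:// URL even for HTTPS traffic
        proxies = {'http': proxy, 'https': proxy}
        # Note: '%Y-%m-%d' is assumed here to match the date format used for input
        url = 'http://search.people.com.cn/cnpeople/search.do?keyword=' + keyword + '&siteName=news&pageCode=0101&dateTime=' + cur_date.strftime('%Y-%m-%d')
        response = requests.get(url, proxies=proxies, headers=headers, timeout=10)  # random proxy plus custom headers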
        soup = BeautifulSoup(response.content, 'html.parser')
        articles = soup.find_all('a', target='_blank', class_='black')
        
        for article in articles:
            article_url = article['href']
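            # Fetch the article through the same proxy, with the custom headers
            response = requests.get(article_url, proxies=proxies, headers=headers, timeout=10)
            soup = BeautifulSoup(response.content, 'html.parser')
            # Fall back to empty text if the expected content container is missing
            content_div = soup.find('div', class_='box_con')
            article_content = content_div.text if content_div else ''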
            
            # Check whether the article text contains the keyword
            if keyword in article_content:
                correct_count += 1
                # Save the article to the target directory
                with open(save_path + '/' + str(article_count) + '.txt', 'w', encoding='utf-8') as file:
                    file.write(article_content)
            article_count += 1
        
        # Pause between days to avoid sending requests too frequently
        time.sleep(random.randint(1, 3))  # wait a random 1 to 3 seconds
        cur_date += datetime.timedelta(days=1)
    
    # Avoid a division-by-zero error when nothing was crawled
    if article_count == 0:
        correctness_ratio = 0
    else:
        correctness_ratio = correct_count / article_count
    return article_count, correctness_ratio

if __name__ == '__main__':
    start_date_str = input('Enter the start date (format: YYYY-MM-DD): ')
    end_date_str = input('Enter the end date (format: YYYY-MM-DD): ')
    keyword = input('Enter the keyword: ')
    save_path = input('Enter the save path: ')
    
    start_date = datetime.datetime.strptime(start_date_str, '%Y-%m-%d')
    end_date = datetime.datetime.strptime(end_date_str, '%Y-%m-%d')
    
    # Proxy pool used for random proxy selection
    proxy_list = ['123.123.123.123:8080', '456.456.456.456:8080']  # placeholders; replace with real proxy addresses
    random.shuffle(proxy_list)

    # Custom request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    article_count, correctness_ratio = crawl_articles(start_date, end_date, keyword, save_path)
    
    print('Number of articles crawled:', article_count)
    print('Correctness ratio of crawled articles:', correctness_ratio)

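Explanation:

Preventing IP bans:
- The code picks a random proxy IP and hands it to requests through the proxies argument, as in requests.get(url, proxies=proxies).
- You need to fill proxy_list with real proxy addresses; a reliable proxy service is recommended.
- The code also waits a random interval between requests, using time.sleep(random.randint(1, 3)), to avoid sending requests too frequently.
Keyword filtering:
- The code filters articles by keyword: if the article text contains the keyword, the article is saved to the given path.
Statistics:
- The code reports the number of articles crawled and the correctness ratio, where the correctness ratio is the number of articles containing the keyword divided by the total number of articles crawled.
Avoiding division by zero:
- Before computing the correctness ratio, the code checks whether any articles were crawled; if none were, the ratio is set to 0.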
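
The proxy selection and random delay described above can be pulled out into two small helpers. This is only a minimal sketch: the names build_proxies, polite_sleep, and PROXY_LIST are hypothetical and not part of the script above, and the proxy addresses are placeholders.

```python
import random
import time

# Hypothetical placeholder proxies; replace with real "host:port" entries.
PROXY_LIST = ['10.0.0.1:8080', '10.0.0.2:8080']


def build_proxies(proxy_list, rng=random):
    """Pick one proxy at random and use it for both HTTP and HTTPS traffic."""
    proxy = rng.choice(proxy_list)
    # Plain HTTP proxies are normally addressed with an http:// URL,
    # even when the request being forwarded is HTTPS.
    proxy_url = 'http://' + proxy
    return {'http': proxy_url, 'https': proxy_url}


def polite_sleep(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep, rng=random):
    """Sleep for a random duration within the bounds and return that duration."""
    delay = rng.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

The returned dictionary can be passed directly as the proxies argument of requests.get. Accepting the sleep function as a parameter is a design choice that makes the delay easy to replace in tests.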

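Usage:

After starting the script, enter the start date, end date, keyword, and save path.
The script crawls People's Daily articles based on that input and saves the articles containing the keyword to the given path.
It then prints the number of articles crawled and the correctness ratio.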
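
The two numbers printed at the end can also be computed by a small pure function, which is easier to check without any network access. The sketch below is illustrative only: keyword_stats and the sample texts are hypothetical and not part of the script above.

```python
def keyword_stats(article_texts, keyword):
    """Return (total, matched, ratio) for a list of article texts."""
    total = len(article_texts)
    matched = sum(1 for text in article_texts if keyword in text)
    # Guard against division by zero when nothing was crawled
    ratio = matched / total if total else 0.0
    return total, matched, ratio


# Hypothetical sample texts standing in for downloaded articles
texts = ['人民日报 经济 报道', '体育 新闻', '经济 观察']
total, matched, ratio = keyword_stats(texts, '经济')
print(f'Number of articles crawled: {total}')
print(f'Correctness ratio: {ratio:.1%}')  # prints "Correctness ratio: 66.7%"
```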

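Notes:

Replace the proxy IPs in proxy_list with real proxy addresses.
Follow the target site's crawling rules and do not crawl excessively, so that you do not put undue load on the site.
This code is for learning and discussion only; do not use it for any illegal purpose.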
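
One way to follow a site's crawling rules is to consult its robots.txt before fetching a page. The sketch below uses the standard library's urllib.robotparser on a made-up robots.txt; SAMPLE_ROBOTS, make_parser, and the example.com URLs are assumptions for illustration and do not reflect the actual rules of any site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for illustration
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_parser(robots_text):
    """Build a RobotFileParser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = make_parser(SAMPLE_ROBOTS)
allowed_news = parser.can_fetch('*', 'http://example.com/news/1.html')
allowed_private = parser.can_fetch('*', 'http://example.com/private/1.html')
delay = parser.crawl_delay('*')
print(allowed_news, allowed_private, delay)
```

For a live site, the same parser can load the file itself by calling set_url with the robots.txt address and then read, after which can_fetch can be checked before each request.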