Scraping People's Daily Articles with Python: Code Example, Error Handling, and Optimization
This article provides a Python code example showing how to scrape People's Daily (人民日报) articles and handle common JSON parsing errors. The code accepts a custom date range and keyword, saves the articles it finds to a specified directory, and reports the number of articles scraped together with a match ratio (the share of scraped articles whose content contains the keyword).

Keywords: python, web scraping, People's Daily, article scraping, JSON parsing, error handling, code example

    import json
    import os

    import requests


    def crawl_people_daily(start_date, end_date, keyword):
        # Endpoint and field names are taken from the original example as-is
        url = "http://search.people.com.cn/api-search/elasticSearch/searchByPage"
        headers = {
            "Content-Type": "application/json",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36"
        }

        articles = []

        for page in range(1, 10):  # pages 1 through 9
            payload = {
                "keyword": keyword,
                "startTime": start_date,
                "endTime": end_date,
                "siteName": "人民日报",
                "pageNum": page,
                "pageSize": 20
            }

            try:
                response = requests.post(url, headers=headers, data=json.dumps(payload), timeout=10)
                response.raise_for_status()
            except requests.exceptions.RequestException as e:
                print("Request failed:", e)
                break

            try:
                # On invalid JSON, response.json() raises a ValueError subclass
                # (json.JSONDecodeError, or requests.exceptions.JSONDecodeError
                # in newer versions of requests)
                data = response.json()
            except ValueError as e:
                print("Failed to parse response body:", e)
                break

            # Stop when the server reports no results (or omits the count)
            if data.get("count", 0) == 0:
                break

            for item in data.get("result", []):
                # Fall back to an empty string if a field is missing or null
                title = item.get("title") or ""
                content = item.get("content") or ""
                articles.append((title, content))

        return articles, len(articles)


    if __name__ == "__main__":
        start_date = input("Enter the start date (format: yyyy-mm-dd): ")
        end_date = input("Enter the end date (format: yyyy-mm-dd): ")
        keyword = input("Enter the keyword: ")

        articles, count = crawl_people_daily(start_date, end_date, keyword)

        # Save each article to its own file, creating the directory if needed
        path = "articles/"
        os.makedirs(path, exist_ok=True)
        for i, (title, content) in enumerate(articles, start=1):
            with open(os.path.join(path, f"article_{i}.txt"), "w", encoding="utf-8") as file:
                file.write(f"Title: {title}\n\nContent: {content}\n")

        # Report the number of articles scraped and the share that contain the keyword
        correct_count = sum(keyword in content for _, content in articles)
        accuracy = correct_count / count if count > 0 else 0
        print(f"Articles scraped: {count}")
        print(f"Match ratio: {accuracy}")
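
The JSON parsing and match-ratio steps can be separated from the network call so that they are easy to check offline. The following is a minimal sketch, not the original author's code: the helper names `parse_search_response` and `keyword_match_ratio` are hypothetical, and the `result`, `title`, and `content` fields are assumed to follow the response layout used in the example above.

```python
import json


def parse_search_response(text):
    """Return (title, content) pairs from a raw response body, or [] if unusable."""
    try:
        data = json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return []
    if not isinstance(data, dict):
        return []
    # "result", "title" and "content" are assumed field names (see lead-in)
    items = data.get("result") or []
    if not isinstance(items, list):
        return []
    articles = []
    for item in items:
        if not isinstance(item, dict):
            continue
        # Missing or null fields become empty strings
        title = item.get("title") or ""
        content = item.get("content") or ""
        articles.append((title, content))
    return articles


def keyword_match_ratio(articles, keyword):
    """Fraction of articles whose content contains the keyword (0.0 if none)."""
    if not articles:
        return 0.0
    hits = sum(1 for _, content in articles if keyword in content)
    return hits / len(articles)


sample = '{"count": 2, "result": [{"title": "A", "content": "python news"}, {"title": "B", "content": null}]}'
articles = parse_search_response(sample)
print(articles)                                          # [('A', 'python news'), ('B', '')]
print(keyword_match_ratio(articles, "python"))           # 0.5
print(parse_search_response("<html>error page</html>"))  # []
```

Returning an empty list on malformed input lets the caller treat "bad response" and "no results" the same way; a caller that needs to tell them apart could raise an exception instead.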
Original article: https://www.cveoy.top/t/topic/qptD. Copyright belongs to the author. Do not reproduce or scrape.