Function Interface Design of a Python Web Scraper
This article walks through the interface of every function in the Python scraper below: its name, purpose, inputs, and outputs.

```python
import datetime
import json
import logging
import os
import random
import re
import time
from urllib.parse import urljoin

import requests

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s: %(message)s')

BASE_URL = 'https://jwc.sues.edu.cn/jxxw'
TOTAL_PAGE = 12
RESULTS_DIR = 'results'
Num = 0  # running sequence number of parsed detail pages

# Create the output directory if it does not exist yet.
os.makedirs(RESULTS_DIR, exist_ok=True)


def scrape_page(url):
    """Return the page text, or None if the request fails."""
    logging.info('scraping %s...', url)
    try:
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            return response.text
        logging.error('get invalid status code %s while scraping %s',
                      response.status_code, url)
    except requests.RequestException:
        logging.error('error occurred while scraping %s', url, exc_info=True)
    return None


def scrape_index(page):
    """Download index page number `page`."""
    index_url = f'{BASE_URL}/list{page}.htm'
    return scrape_page(index_url)


def parse_index(html):
    """Yield the absolute URL of every detail page linked from an index page."""
    pattern = re.compile(r"href='(.*?)'\starget=.*?\stitle=.*?")
    for item in re.findall(pattern, html):
        detail_url = urljoin(BASE_URL, item)
        logging.info('get detail url %s', detail_url)
        yield detail_url


def scrape_detail(url):
    """Download one detail page."""
    return scrape_page(url)


def parse_detail(html, url):
    """Extract the title and update time from a detail page."""
    global Num
    Num += 1

    title_match = re.search(r"<h1\sclass='arti_title'>(.*?)</h1>", html, re.S)
    date_match = re.search(r"<span\sclass='arti_update'>(.*?)</span>", html, re.S)
    return {
        '序号': Num,  # sequence number
        '链接': url,  # link
        '标题': title_match.group(1).strip() if title_match else None,  # title
        '时间': date_match.group(1) if date_match else None,  # time
    }


def save_data(data):
    """Write one record to its own JSON file and append it to the merged file."""
    # Fall back to the sequence number when no title was found.
    name = data.get('标题') or f"untitled_{data.get('序号')}"
    data_path = f'{RESULTS_DIR}/{name}.json'
    with open(data_path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

    # One JSON object per line, appended on every call.
    merge_file_path = f'{RESULTS_DIR}/merged_data.json'
    with open(merge_file_path, 'a', encoding='utf-8') as merge_file:
        json.dump(data, merge_file, ensure_ascii=False)
        merge_file.write('\n')


def main():
    for page in range(1, TOTAL_PAGE + 1):
        index_html = scrape_index(page)
        if index_html is None:  # skip index pages that failed to download
            continue
        for detail_url in parse_index(index_html):
            detail_html = scrape_detail(detail_url)
            time.sleep(random.randint(1, 5))  # pause between requests
            if detail_html is None:  # skip detail pages that failed to download
                continue
            data = parse_detail(detail_html, detail_url)
            logging.info('get detail data %s', data)
            logging.info('saving data to json file')
            save_data(data)
            logging.info('data saved successfully')


if __name__ == '__main__':
    start_time = datetime.datetime.now()
    main()
    end_time = datetime.datetime.now()
    print('Elapsed time: ', end_time - start_time)
```

### Function Interface Design

1. scrape_page(url)

 * Purpose: download the page at the given URL
 * Input: url (string)
 * Output: response.text (string), or None if the request fails or the status code is not 200

2. scrape_index(page)

 * Purpose: download the index page with the given page number
 * Input: page (integer)
 * Output: index_html (string), or None on failure

3. parse_index(html)

 * Purpose: parse an index page and extract the detail-page URLs
 * Input: html (string)
 * Output: a generator that yields each detail_url (string)

4. scrape_detail(url)

 * Purpose: download the detail page at the given URL
 * Input: url (string)
 * Output: response.text (string), or None on failure

5. parse_detail(html, url)

 * Purpose: parse a detail page and extract its sequence number, link, title, and update time
 * Input: html (string), url (string)
 * Output: data (dictionary); the title and time are None when the pattern does not match

6. save_data(data)

 * Purpose: save one record to its own JSON file and append it to merged_data.json
 * Input: data (dictionary)
 * Output: none

7. main()

 * Purpose: the entry point; drives the whole scraping workflow
 * Input: none
 * Output: none

### Code Optimization

1. Add comments and docstrings: they make the code easier to read and maintain.
2. Use the logging module: it records what happens during scraping, including failed requests, which helps with debugging and tracing problems.
3. Use descriptive variable names: index_url, detail_url, and merge_file_path say what they hold.
4. Use f-strings to build URLs and file paths: they are concise and readable.
5. Use os.makedirs(RESULTS_DIR, exist_ok=True): the results directory is created only when it is missing.
6. Use json.dump with ensure_ascii=False: Chinese text is written as-is instead of as escape sequences. Because merged_data.json holds one JSON object per line, read it back line by line with json.loads rather than with a single json.load call.

Hopefully this walkthrough helps you understand and apply the scraper code.
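One gap in save_data is worth sketching. It builds the file path directly from data.get('标题'), so a title that contains `/` or other characters that are not allowed in file names would produce an unusable path. Below is a minimal sketch of a sanitizing helper. The name safe_filename, its parameters, and the 80-character limit are illustrative assumptions and are not part of the original script.

```python
import re

# Hypothetical helper, not part of the original script.
# Matches runs of characters that are unsafe in file names on common
# file systems: path separators, Windows-reserved characters, whitespace.
_UNSAFE_CHARS = re.compile(r'[\\/:*?"<>|\s]+')


def safe_filename(title, fallback, max_length=80):
    """Return a file-system-friendly name derived from `title`.

    `fallback` is used when `title` is missing or nothing usable is left
    after cleaning. `max_length` is an assumed limit, chosen arbitrarily.
    """
    if not title:
        return fallback
    # Replace each unsafe run with one underscore, then trim the edges.
    cleaned = _UNSAFE_CHARS.sub('_', title).strip('_.')
    return cleaned[:max_length] or fallback
```

Inside save_data, the name could then be computed as `safe_filename(data.get('标题'), f"item_{data.get('序号')}")` before the path is built. Two different titles can still map to the same cleaned name, so appending the sequence number to every file name is a reasonable extra safeguard.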