First, install the required libraries: requests, BeautifulSoup4, pandas, and openpyxl (which pandas uses to write .xlsx files).

```shell
pip install requests
pip install beautifulsoup4
pip install pandas
pip install openpyxl
```

The following code then scrapes the Douban Movie Top 250 list and saves the results to an Excel file.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# One shared header so both request helpers look like the same browser.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
}


def get_soup(url):
    """Download a page and return it as a parsed BeautifulSoup tree."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')


def get_movie_info(url):
    soup = get_soup(url)

    movie_list = []
    for item in soup.find_all('div', class_='item'):
        rank = item.find('em').text
        poster = item.find('img')['src']
        link = item.find('a')['href']  # detail page, used later for the comments
        title = item.find('span', class_='title').text
        rating = item.find('span', class_='rating_num').text
        # Turn non-breaking spaces into plain spaces before splitting.
        text = item.find('div', class_='bd').p.text.replace('\xa0', ' ')
        info = [line.strip() for line in text.strip().split('\n')]
        # Cut the first line at the cast label; the cast may be missing.
        director, _, actors = info[0].partition('主演:')
        director = director.replace('导演:', '', 1).strip()
        actors = actors.replace('...', '').strip(' /')
        # Second line is "date / region / genres"; splitting from the right
        # keeps the date part whole if it contains a slash itself.
        release_date, release_place, genres = (
            part.strip() for part in info[1].rsplit('/', 2)
        )
        # The last <span> in the star block is the rating count, not a comment.
        votes = item.find('div', class_='star').find_all('span')[-1].text

        movie_list.append({
            'Rank': rank,
            'Poster': poster,
            'Link': link,
            'Title': title,
            'Rating': rating,
            'Director': director,
            'Cast': actors,
            'Release date': release_date,
            'Release region': release_place,
            'Genres': genres,
            'Votes': votes,
            'Hot comments': ''  # filled in below for the first 10 films
        })

    return movie_list


def get_hot_comments(url):
    soup = get_soup(url)

    comments_list = []
    comments_items = soup.find_all('div', class_='comment-item')
    for i, item in enumerate(comments_items[:5]):
        comments = item.find('span', class_='short').text.strip()
        comments_list.append(f'{i+1}.{comments}')

    return '\n'.join(comments_list)


def main():
    base_url = 'https://movie.douban.com/top250'
    movie_list = []

    # 10 pages of 25 films each.
    for page in range(10):
        url = f'{base_url}?start={page*25}&filter='
        movie_list.extend(get_movie_info(url))

    # Fetch comments from each film's detail page, not from its poster URL.
    for movie in movie_list[:10]:
        movie['Hot comments'] = get_hot_comments(movie['Link'])

    df = pd.DataFrame(movie_list)
    df.to_excel('douban_top250.xlsx', index=False)


if __name__ == '__main__':
    main()
```

After running the code above, an Excel file named "douban_top250.xlsx" is created in the current directory. It contains the basic information for all 250 films on the Douban Movie Top 250 list, plus the first 5 hot short comments for each of the top 10 films on the first page.
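One detail of the field parsing is worth isolating: `str.strip` takes a set of characters, not a prefix, so calling it with a label such as `'主演: '` can also remove characters that belong to the name. Below is a minimal sketch of the difference. The sample string is made up for illustration and the helper `remove_prefix` is hypothetical; neither comes from the original code or from the site.

```python
def remove_prefix(text: str, prefix: str) -> str:
    """Drop `prefix` from the start of `text` only if it is actually there."""
    return text[len(prefix):] if text.startswith(prefix) else text


# Hypothetical sample, invented to show the pitfall (not real page data).
sample = "主演: 某导演"

# strip() treats its argument as a set of characters and trims both ends,
# so the trailing "演" of the name is removed as well.
stripped = sample.strip("主演: ")

# Removing the literal label once leaves the name untouched.
prefixed = remove_prefix(sample, "主演: ")

print(repr(stripped))
print(repr(prefixed))
```

On Python 3.9 and later, the built-in `str.removeprefix` does the same job as the helper.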

