I need a complete crawler script with detailed comments. It should start from 'https://movie.douban.com/subject/25868125/comments?start=0&limit=20&status=P&sort=new_score', scrape the reviewer names, review times, and review texts on the first page, then continue scraping all reviewer names, review times, and review texts on pages 2-3, and store the scraped data in a file as JSON. Requirements: (1) dynamic web page scraping.
Note: this code targets Python 3.
(1) Dynamic web page scraping
1. Import the Selenium library
from selenium import webdriver
from selenium.webdriver.common.by import By
2. Create a browser object
browser = webdriver.Chrome()
3. Open the URL
url = 'https://movie.douban.com/subject/25868125/comments?start=0&limit=20&status=P&sort=new_score'
browser.get(url)
4. Locate the element and click it (use Selenium to click through to the full list of reviews)
btn_more = browser.find_element(By.CSS_SELECTOR, '.lnk-tc')  # find_element_by_css_selector was removed in Selenium 4
btn_more.click()
(2) Page analysis and data extraction
1. Import the required libraries
import time
import json
from bs4 import BeautifulSoup
2. Define request headers (note: Selenium sends its own headers, so this dict is only needed if you switch to a request-based library such as requests)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
3. Grab the rendered page source from the browser
html = browser.page_source
4. Parse the page structure
soup = BeautifulSoup(html, 'html.parser')
5. Locate the reviewer names
names = []
name_tags = soup.select('.comment-item .comment-info a')
for name_tag in name_tags:
    names.append(name_tag.text.strip())
6. Locate the review times
times = []
# the second <span> inside .comment-info is assumed to be the timestamp; the class selector '.comment-time', if present, may be more robust
time_tags = soup.select('.comment-item .comment-info span:nth-of-type(2)')
for time_tag in time_tags:
    times.append(time_tag.text.strip())
7. Locate the review texts
comments = []
comment_tags = soup.select('.comment-item .comment-content span')
for comment_tag in comment_tags:
    comments.append(comment_tag.text.strip())
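The three separate `select` calls above assume that the names, times, and texts lists stay aligned by index. A safer pattern is to iterate per comment item, so a missing field in one comment cannot shift the columns. A minimal sketch, using a hypothetical HTML sample modeled on Douban's comment markup (structure assumed, not verified against the live site):

```python
from bs4 import BeautifulSoup

# Hypothetical sample mimicking Douban's comment structure
sample_html = """
<div class="comment-item">
  <span class="comment-info">
    <a href="#">alice</a>
    <span class="rating"></span>
    <span class="comment-time">2023-01-01</span>
  </span>
  <p class="comment-content"><span class="short">Great movie!</span></p>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
records = []
for item in soup.select('.comment-item'):
    # Pull all three fields from the same item so they cannot get misaligned
    name = item.select_one('.comment-info a')
    when = item.select_one('.comment-time')
    text = item.select_one('.comment-content span')
    records.append({
        'name': name.text.strip() if name else '',
        'time': when.text.strip() if when else '',
        'comment': text.text.strip() if text else '',
    })
```

The `if ... else ''` guards keep a record well-formed even when a field is absent from one comment.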
(3) Scraping multiple pages
1. Loop over the pages
for i in range(2, 4):
2. Build each page's URL from its start offset (page 2 starts at 20, page 3 at 40)
    url = 'https://movie.douban.com/subject/25868125/comments?start={}&limit=20&status=P&sort=new_score'.format((i - 1) * 20)
3. Load each page, wait for it to render, parse it, and append to the lists from section (2) — re-initializing names/times/comments inside the loop would discard the page-1 data, so the existing lists are reused
    browser.get(url)
    time.sleep(2)  # give the page time to load
    html = browser.page_source
    soup = BeautifulSoup(html, 'html.parser')
    name_tags = soup.select('.comment-item .comment-info a')
    for name_tag in name_tags:
        names.append(name_tag.text.strip())
    time_tags = soup.select('.comment-item .comment-info span:nth-of-type(2)')
    for time_tag in time_tags:
        times.append(time_tag.text.strip())
    comment_tags = soup.select('.comment-item .comment-content span')
    for comment_tag in comment_tags:
        comments.append(comment_tag.text.strip())
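The start offsets above follow a simple pattern: page n begins at (n - 1) * 20. A small helper (hypothetical, not part of the original script) makes the pattern explicit and easy to test:

```python
BASE = ('https://movie.douban.com/subject/25868125/comments'
        '?start={}&limit=20&status=P&sort=new_score')

def page_url(page, per_page=20):
    """Return the comments URL for a 1-indexed page number."""
    return BASE.format((page - 1) * per_page)

# Pages 1-3 start at offsets 0, 20, 40
urls = [page_url(p) for p in range(1, 4)]
```

With such a helper, scraping pages 1-3 becomes a single loop over `page_url(p)` instead of a separate first-page block plus a second loop.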
(4) Save the data
1. Write the data to a file
data = []
for i in range(len(names)):
    item = {}
    item['name'] = names[i]
    item['time'] = times[i]
    item['comment'] = comments[i]
    data.append(item)
with open('comments.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)
2. Each JSON record has the following format
{
    "name": "reviewer name",
    "time": "review time",
    "comment": "review text"
}
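The write-out above can be sanity-checked without touching the network: build the records, serialize them with the same `json` options, and load them back. A minimal sketch with made-up records:

```python
import json

# Made-up records matching the format defined above
names = ['观众A', '观众B']
times = ['2023-01-01', '2023-01-02']
comments = ['不错', '一般']

# zip() pairs the three lists element by element, replacing the index loop
data = [{'name': n, 'time': t, 'comment': c}
        for n, t, c in zip(names, times, comments)]

# ensure_ascii=False keeps Chinese characters readable in the output
text = json.dumps(data, ensure_ascii=False, indent=4)
restored = json.loads(text)
```

Because `ensure_ascii=False` is set, the Chinese text appears verbatim in the file rather than as `\uXXXX` escapes, and the round-trip through `json.loads` returns a structure equal to the original.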
The complete code:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import json
from bs4 import BeautifulSoup

# Dynamic web page scraping
browser = webdriver.Chrome()
url = 'https://movie.douban.com/subject/25868125/comments?start=0&limit=20&status=P&sort=new_score'
browser.get(url)
# Click through to the full list of reviews (find_element_by_css_selector was removed in Selenium 4)
btn_more = browser.find_element(By.CSS_SELECTOR, '.lnk-tc')
btn_more.click()

# Page analysis and data extraction
# (these headers are not used by Selenium; they are only needed with a request-based library)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
names = []
name_tags = soup.select('.comment-item .comment-info a')
for name_tag in name_tags:
    names.append(name_tag.text.strip())
times = []
time_tags = soup.select('.comment-item .comment-info span:nth-of-type(2)')
for time_tag in time_tags:
    times.append(time_tag.text.strip())
comments = []
comment_tags = soup.select('.comment-item .comment-content span')
for comment_tag in comment_tags:
    comments.append(comment_tag.text.strip())

# Scrape pages 2-3 and append to the same lists
for i in range(2, 4):
    url = 'https://movie.douban.com/subject/25868125/comments?start={}&limit=20&status=P&sort=new_score'.format((i - 1) * 20)
    browser.get(url)
    time.sleep(2)  # give the page time to load
    html = browser.page_source
    soup = BeautifulSoup(html, 'html.parser')
    name_tags = soup.select('.comment-item .comment-info a')
    for name_tag in name_tags:
        names.append(name_tag.text.strip())
    time_tags = soup.select('.comment-item .comment-info span:nth-of-type(2)')
    for time_tag in time_tags:
        times.append(time_tag.text.strip())
    comment_tags = soup.select('.comment-item .comment-content span')
    for comment_tag in comment_tags:
        comments.append(comment_tag.text.strip())

# Save the data as JSON
data = []
for i in range(len(names)):
    item = {}
    item['name'] = names[i]
    item['time'] = times[i]
    item['comment'] = comments[i]
    data.append(item)
with open('comments.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)
browser.quit()  # close the browser when done
Original source: http://www.cveoy.top/t/topic/g7k6. Copyright belongs to the author. Please do not repost or scrape!