Note: this code targets Python 3.

(1) Scraping a dynamic web page

1. Import the Selenium library

from selenium import webdriver

2. Create a browser object

browser = webdriver.Chrome()

3. Visit the target URL

url = 'https://movie.douban.com/subject/25868125/comments?start=0&limit=20&status=P&sort=new_score'
browser.get(url)

4. Locate the element and click it (use Selenium to click through to the film's full set of reviews)

from selenium.webdriver.common.by import By

# find_element_by_css_selector was removed in Selenium 4;
# use the By-based locator API instead
btn_more = browser.find_element(By.CSS_SELECTOR, '.lnk-tc')
btn_more.click()

(2) Analyzing the page and extracting data

1. Import the required libraries

import time
import json
from bs4 import BeautifulSoup

2. Request headers (note: Selenium drives a real browser that sends its own headers, so this dict is not used in the Selenium flow below; it is only needed if you fetch pages with the requests library instead)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
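If you do want Chrome itself to send this User-Agent, one option (a sketch, assuming the Chrome driver) is to pass it through ChromeOptions when creating the browser:

```python
from selenium import webdriver

# sketch: make the Chrome instance itself send the custom User-Agent
options = webdriver.ChromeOptions()
options.add_argument(
    'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36')
browser = webdriver.Chrome(options=options)
```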

3. Obtain the rendered page source (Selenium has already loaded the page, so no new request is sent here)

html = browser.page_source

4. Parse the page structure

soup = BeautifulSoup(html, 'html.parser')

5. Extract the reviewer names

names = []
name_tags = soup.select('.comment-item .comment-info a')
for name_tag in name_tags:
    names.append(name_tag.text.strip())

6. Extract the review timestamps

times = []
time_tags = soup.select('.comment-item .comment-info span:nth-of-type(2)')
for time_tag in time_tags:
    times.append(time_tag.text.strip())

7. Extract the review content

comments = []
comment_tags = soup.select('.comment-item .comment-content span')
for comment_tag in comment_tags:
    comments.append(comment_tag.text.strip())
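The three selectors used above can be exercised offline against a minimal fixture. The markup below is a simplified, hypothetical stand-in for Douban's real comment structure, just to show what each selector picks out:

```python
from bs4 import BeautifulSoup

# hypothetical markup mimicking one Douban comment item (not the real page)
html = '''
<div class="comment-item">
  <span class="comment-info">
    <a href="#">Alice</a>
    <span class="rating"></span>
    <span class="comment-time">2023-01-01</span>
  </span>
  <p class="comment-content"><span>Great movie!</span></p>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# the <a> inside .comment-info holds the reviewer name
names = [t.text.strip() for t in soup.select('.comment-item .comment-info a')]
# the second <span> inside .comment-info holds the timestamp
times = [t.text.strip() for t in
         soup.select('.comment-item .comment-info span:nth-of-type(2)')]
# the <span> inside .comment-content holds the review text
comments = [t.text.strip() for t in
            soup.select('.comment-item .comment-content span')]
print(names, times, comments)
```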

(3) Scraping multiple pages

1. Loop over the remaining pages (pages 2 and 3 here; page 1 was scraped above)

for i in range(2, 4):

2. Build the URL for each page

    url = 'https://movie.douban.com/subject/25868125/comments?start={}&limit=20&status=P&sort=new_score'.format((i - 1) * 20)

3. Fetch and parse each page

    browser.get(url)
    time.sleep(2)
    html = browser.page_source
    soup = BeautifulSoup(html, 'html.parser')
    # append to the lists built in section (2); re-creating the lists here
    # would discard the pages already collected
    name_tags = soup.select('.comment-item .comment-info a')
    for name_tag in name_tags:
        names.append(name_tag.text.strip())
    time_tags = soup.select('.comment-item .comment-info span:nth-of-type(2)')
    for time_tag in time_tags:
        times.append(time_tag.text.strip())
    comment_tags = soup.select('.comment-item .comment-content span')
    for comment_tag in comment_tags:
        comments.append(comment_tag.text.strip())
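The pagination above relies only on the start query parameter, which advances by 20 per page; the offsets the loop generates can be checked without a browser:

```python
base = ('https://movie.douban.com/subject/25868125/comments'
        '?start={}&limit=20&status=P&sort=new_score')

# each page shows 20 comments, so page i begins at offset (i - 1) * 20
urls = [base.format((i - 1) * 20) for i in range(2, 4)]
for u in urls:
    print(u)
```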

(4) Saving the data

1. Write the data to a file

data = []
for i in range(len(names)):
    item = {}
    item['name'] = names[i]
    item['time'] = times[i]
    item['comment'] = comments[i]
    data.append(item)
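The index-based loop above assumes names, times, and comments have equal length. An equivalent formulation with zip (placeholder data here, not real scraped values) makes the pairing explicit and simply stops at the shortest list if the selectors ever return uneven results:

```python
# placeholder data standing in for the scraped lists
names = ['Alice', 'Bob']
times = ['2023-01-01', '2023-01-02']
comments = ['Great!', 'Not bad']

# zip pairs the three lists element-wise
data = [{'name': n, 'time': t, 'comment': c}
        for n, t, c in zip(names, times, comments)]
print(data)
```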

with open('comments.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

2. Each record in the JSON file has the following shape

{
    "name": "reviewer name",
    "time": "review time",
    "comment": "review content"
}
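As a quick sanity check, records of that shape round-trip cleanly through the json module (the values below are placeholders, not scraped data):

```python
import json

# hypothetical record matching the format above
record = {'name': 'some reviewer', 'time': '2023-01-01', 'comment': 'example text'}

# ensure_ascii=False keeps non-ASCII characters readable in the output
encoded = json.dumps([record], ensure_ascii=False, indent=4)
decoded = json.loads(encoded)
print(decoded)
```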

The complete code is as follows:

from selenium import webdriver
import time
import json
from bs4 import BeautifulSoup

# Scrape the dynamic page

browser = webdriver.Chrome()
url = 'https://movie.douban.com/subject/25868125/comments?start=0&limit=20&status=P&sort=new_score'
browser.get(url)
from selenium.webdriver.common.by import By  # would normally sit with the imports above; find_element_by_* was removed in Selenium 4
btn_more = browser.find_element(By.CSS_SELECTOR, '.lnk-tc')
btn_more.click()

# Analyze the page and extract the data

# note: Selenium sends the browser's own headers, so this dict is unused here;
# it is only needed for a requests-based variant
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')

names = []
name_tags = soup.select('.comment-item .comment-info a')
for name_tag in name_tags:
    names.append(name_tag.text.strip())

times = []
time_tags = soup.select('.comment-item .comment-info span:nth-of-type(2)')
for time_tag in time_tags:
    times.append(time_tag.text.strip())

comments = []
comment_tags = soup.select('.comment-item .comment-content span')
for comment_tag in comment_tags:
    comments.append(comment_tag.text.strip())

# Scrape the remaining pages

for i in range(2, 4):
    url = 'https://movie.douban.com/subject/25868125/comments?start={}&limit=20&status=P&sort=new_score'.format((i - 1) * 20)
    browser.get(url)
    time.sleep(2)
    html = browser.page_source
    soup = BeautifulSoup(html, 'html.parser')
    name_tags = soup.select('.comment-item .comment-info a')
    for name_tag in name_tags:
        names.append(name_tag.text.strip())
    time_tags = soup.select('.comment-item .comment-info span:nth-of-type(2)')
    for time_tag in time_tags:
        times.append(time_tag.text.strip())
    comment_tags = soup.select('.comment-item .comment-content span')
    for comment_tag in comment_tags:
        comments.append(comment_tag.text.strip())

# Save the data

data = []
for i in range(len(names)):
    item = {}
    item['name'] = names[i]
    item['time'] = times[i]
    item['comment'] = comments[i]
    data.append(item)

with open('comments.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

browser.quit()  # close the browser when finished

Original source: http://www.cveoy.top/t/topic/g7k6 — copyright belongs to the author. Do not repost or scrape!
