Scraping and Analyzing Dangdang Data with Python: A Hands-On Tutorial with Code Examples

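Because this tutorial scrapes data from a website, we first need to understand the site's anti-scraping mechanisms and its relevant policies, so that we do not get into unnecessary trouble by violating its rules.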

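Anti-Scraping Mechanisms

Dangdang's anti-scraping mechanisms mainly include the following:

IP limits: when a single IP address accesses the site too frequently within a short period, the site blocks that IP.
CAPTCHAs: when a crawler accesses the site too frequently, the site shows a CAPTCHA that must be solved correctly before access can continue.
User-Agent checks: the site checks whether the request's User-Agent belongs to a browser, and refuses the request if it does not.
Dynamic pages: Dangdang's product pages are generated dynamically and load their data through JavaScript, so a tool such as selenium is needed to simulate a browser.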
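
The IP-limit and User-Agent checks above are usually handled by spacing requests out and by sending a browser-like User-Agent header. The following is a minimal sketch of that idea rather than anything specific to Dangdang: the `Throttle` class, its parameters, and the `HEADERS` name are hypothetical, and a suitable interval has to be chosen by the reader.

```python
import random
import time

# Browser-like request header; the exact string is only an example.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/58.0.3029.110 Safari/537.36"
    )
}


class Throttle:
    """Hypothetical helper: keep at least `min_interval` seconds between calls."""

    def __init__(self, min_interval, jitter=0.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.jitter = jitter      # extra random delay, in seconds
        self._clock = clock       # injectable so the class can be tested
        self._sleep = sleep
        self._last = None         # time of the previous call, None before the first

    def wait(self):
        """Sleep just long enough to respect the interval, then record the time."""
        if self._last is not None:
            target = self.min_interval + random.uniform(0, self.jitter)
            remaining = target - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

Calling `wait()` on a shared `Throttle` instance before each request, and passing `headers=HEADERS` to `requests.get`, keeps the request rate bounded. This does not guarantee that the site will not block the crawler.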

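Scraping Dangdang Data

Before scraping, we need to analyze Dangdang's site structure and data endpoints in order to decide on a scraping strategy.

Dangdang serves its data through ajax requests. With a packet-capture tool (or the browser's developer tools) we can inspect the URL and parameters of each data request, and then reproduce the request ourselves to fetch the data.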
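
As an illustration of reproducing a captured request, the sketch below rebuilds the comment-list URL used later in this tutorial from its individual query parameters. The function name `build_comment_url` and the constant `COMMENT_ENDPOINT` are hypothetical; the parameter names and values are copied from that captured URL, and their exact meaning on the server side is not documented here.

```python
from urllib.parse import urlencode

# Endpoint taken from the captured request; it may change over time.
COMMENT_ENDPOINT = "http://product.dangdang.com/index.php"


def build_comment_url(product_id, page_index=1):
    """Assemble the comment-list URL for one product and one page."""
    params = {
        "r": "comment/list",
        "productId": product_id,
        "categoryPath": "01.00.00.00.00.00",
        "mainProductId": product_id,
        "mediumId": 0,
        "pageIndex": page_index,
        "sortType": 1,
        "filterType": 1,
        "isSystem": 0,
        "tagId": 0,
        "tagFilterCount": 0,
        "template": "publish",
        "long_or_short": "short",
    }
    # urlencode percent-encodes the values, e.g. "/" becomes "%2F"
    return COMMENT_ENDPOINT + "?" + urlencode(params)
```

Building the URL from a dictionary makes it easy to change `productId` or `pageIndex` without editing a long string by hand.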

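Scraping the data can be broken down into the following steps:

Use selenium to open a Dangdang product page in a simulated browser
Get the page's HTML source and parse out the product's basic information
Analyze the ajax request and fetch the product's comment data
Process and analyze the comment data to produce summary statistics

The code below implements these steps: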

import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

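# Open the Dangdang product page with selenium
driver = webdriver.Chrome()
driver.get('https://product.dangdang.com/23627332.html')

# Give the JavaScript-rendered content time to load
time.sleep(5)

# Grab the rendered HTML, then close the browser so it does not linger
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, 'html.parser')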

# Parse the basic product information
# (these class names are tied to the page layout and will break if it changes)
info = soup.find('div', class_='messbox_info')
title = soup.find('div', class_='name_info').find('h1').text.strip()
author = info.find('span', class_='t1').find('a').text.strip()
publisher = info.find('span', class_='t2').find('a').text.strip()
pubdate = info.find('span', class_='t3').text.split('：')[-1].strip()
price = soup.find('div', class_='price_m').find('span', class_='price_n').text.strip()

# Request the comment data from the ajax endpoint found via packet capture
url = 'http://product.dangdang.com/index.php?r=comment%2Flist&productId=23627332&categoryPath=01.00.00.00.00.00&mainProductId=23627332&mediumId=0&pageIndex=1&sortType=1&filterType=1&isSystem=0&tagId=0&tagFilterCount=0&template=publish&long_or_short=short'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
comments = response.json()['data']['list']

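# Tally the comments by rating
# Assumption: scores are numbers on a 1-5 scale (4-5 positive, 3 neutral, 1-2 negative)
total_score = 0   # total number of comments
good_score = 0    # number of positive comments
normal_score = 0  # number of neutral comments
bad_score = 0     # number of negative comments
for comment in comments:
    score = float(comment['score'])
    if score >= 4:
        good_score += 1
    elif score == 3:
        normal_score += 1
    else:
        bad_score += 1
    total_score += 1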

# Print the results
print('Title:', title)
print('Author:', author)
print('Publisher:', publisher)
print('Publication date:', pubdate)
print('Price:', price)
print('Total comments:', total_score)
print('Positive comments:', good_score)
print('Neutral comments:', normal_score)
print('Negative comments:', bad_score)
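
The request above fetches only the first page of comments (`pageIndex=1`). A minimal sketch of extending the tally across several pages follows. The names `collect_comments`, `tally_scores`, `fetch_page`, and `max_pages` are hypothetical. `fetch_page` is assumed to be a caller-supplied function that returns the `['data']['list']` value for a given page index, and an empty list once there are no more comments; that behavior has not been verified against the live endpoint.

```python
from collections import Counter


def collect_comments(fetch_page, max_pages=5):
    """Gather comments page by page until a page comes back empty."""
    comments = []
    for page_index in range(1, max_pages + 1):
        page = fetch_page(page_index)
        if not page:
            break  # an empty page is taken to mean there is nothing left
        comments.extend(page)
    return comments


def tally_scores(comments):
    """Count comments per score value; assumes each comment has a 'score' key."""
    return Counter(comment["score"] for comment in comments)
```

Passing the fetch logic in as a function keeps the pagination loop independent of the network code, so it can be exercised with canned data before being pointed at the real site.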

Data Analysis

Once we have the Dangdang data, we can use the pandas library to process and analyze it.

The code below analyzes the data collected above:

import pandas as pd
import matplotlib.pyplot as plt

# Build the data: one row per rating category
data = {
    'score': [5, 3, 1],
    'count': [good_score, normal_score, bad_score]
}

# Convert the data to a DataFrame
df = pd.DataFrame(data)

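# Compute each category's share, plus the positive and negative rates
total = df['count'].sum()
df['rate'] = df['count'] / total
positive_rate = df.loc[0, 'count'] / total  # row 0 holds the positive count
negative_rate = df.loc[2, 'count'] / total  # row 2 holds the negative count

# Print the results
print(df)
print('Positive rate:', positive_rate)
print('Negative rate:', negative_rate)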

# Visualize the results as a pie chart
plt.figure(figsize=(6, 6))
plt.pie(df['count'], labels=['Positive', 'Neutral', 'Negative'], autopct='%1.1f%%')
plt.title('Dangdang Product Review Breakdown')
plt.show()

With the code above, we can visualize the comment data for a Dangdang product and get a more intuitive picture of how users rate it.
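
One edge case worth handling is a product with no comments at all, where dividing by the total count is undefined. The sketch below shows one possible way to compute the rates in plain Python with that case guarded. The function name `rating_rates` and the choice to return zeros for an empty input are assumptions made for illustration.

```python
def rating_rates(good, normal, bad):
    """Return the positive, neutral, and negative shares of all comments."""
    total = good + normal + bad
    if total == 0:
        # No comments yet: report zeros rather than dividing by zero
        return {"positive": 0.0, "neutral": 0.0, "negative": 0.0}
    return {
        "positive": good / total,
        "neutral": normal / total,
        "negative": bad / total,
    }
```

For example, `rating_rates(good_score, normal_score, bad_score)` could be called before plotting, and the pie chart skipped when every rate is zero.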