Tutorial: Parsing Web Page Content with BeautifulSoup in Python

BeautifulSoup is a Python library for parsing HTML and XML documents. It provides a simple API for navigating the parsed document and extracting content from it.

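1. Install BeautifulSoup

Install BeautifulSoup with pip. The package name on PyPI is beautifulsoup4. The examples in this tutorial also use requests, so install it at the same time:

pip install beautifulsoup4 requests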

2. Import BeautifulSoup and requests

Import BeautifulSoup from the bs4 package, along with requests, which is used to send the HTTP request:

from bs4 import BeautifulSoup
import requests

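3. Send an HTTP request to fetch the page

Use requests to download the page and read the response body as text:

url = 'https://www.example.com'
response = requests.get(url, timeout=10)  # a timeout keeps the call from hanging indefinitely
response.raise_for_status()  # raise an exception on 4xx/5xx responses
content = response.text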
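
Network requests can fail or hang, so the fetch step is often wrapped in a small helper. The following is a minimal sketch rather than the only way to do it; the function name fetch_html and the 10-second default timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, or None if the request fails.

    The name `fetch_html` and the default timeout are illustrative choices.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into an exception
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs, and bad status codes
        print(f"Request failed: {exc}")
        return None
    return response.text
```

Calling fetch_html('https://www.example.com') would return the page text on success and None on failure, so the caller can check the result before parsing it.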

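4. Parse the page with BeautifulSoup

Pass the HTML text to BeautifulSoup together with the name of a parser. 'html.parser' is the parser from Python's standard library, so it needs no extra installation: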

soup = BeautifulSoup(content, 'html.parser')
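
To see what the parser produces without making a network request, it can help to parse a small HTML string directly. The sketch below uses a made-up, deliberately unclosed snippet (the names broken_html and demo_soup are assumptions) to show that BeautifulSoup still builds a tree that can be navigated.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with unclosed <p> and <b> tags
broken_html = "<p>Hello <b>world"

demo_soup = BeautifulSoup(broken_html, "html.parser")

# The tree can be navigated even though the input was incomplete
paragraph_text = demo_soup.p.get_text()
bold_text = demo_soup.b.string

print(paragraph_text)
print(bold_text)
```

Real pages are frequently not well-formed, which is one reason to use a tolerant parser instead of string matching.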

5. Locate elements and read their content or attributes

Locate elements by tag name, class name, or id, then read their text or attributes:

# Locate an element by tag name
title = soup.title.string if soup.title else None  # soup.title is None when the page has no <title>
print(title)

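# Locate elements by class name
# (assumes every 'item' div contains an h3.name and a span.price)
items = soup.find_all('div', class_='item')
for item in items:
    # get_text() also works when the tag contains nested markup, where .string would be None
    name = item.find('h3', class_='name').get_text(strip=True)
    price = item.find('span', class_='price').get_text(strip=True)
    print(name, price)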

# Locate an element by id
logo = soup.find('img', id='logo')
if logo is not None:  # find() returns None when nothing matches
    src = logo.get('src')
    print(src)
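
Two details are easy to miss when locating elements: .string returns None when a tag contains nested tags, and find() returns None when nothing matches. The sketch below illustrates both on invented markup (the HTML and all variable names are assumptions), and also shows select_one(), which accepts a CSS selector as an alternative to find().

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example
html = """
<div class="item">
  <h3 class="name">Widget <em>Pro</em></h3>
  <span class="price">9.99</span>
</div>
<img id="logo" src="/static/logo.png">
"""

soup = BeautifulSoup(html, "html.parser")

name_tag = soup.find("h3", class_="name")
# .string is None because <h3> holds both text and a nested <em> tag
name_string = name_tag.string
# get_text() collects all nested text; strip=True trims each piece
name_text = name_tag.get_text(" ", strip=True)

# CSS selector alternative to find(): '#logo' matches id="logo"
logo = soup.select_one("#logo")
logo_src = logo.get("src") if logo is not None else None

# find() returns None when nothing matches, so check before using the result
missing = soup.find("div", id="does-not-exist")

print(name_string, name_text, logo_src, missing)
```

Checking for None before reading .string or an attribute avoids the AttributeError and TypeError exceptions that otherwise appear when a page does not have the expected structure.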

6. Other common operations

# Get all links
links = soup.find_all('a')
for link in links:
    href = link.get('href')  # None if the tag has no href attribute
    text = link.get_text(strip=True)
    print(href, text)
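
Links on real pages are often relative, and some a tags have no href at all. The following sketch, which assumes an invented base URL and invented markup, filters on href=True and resolves each link with urljoin from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and markup, invented for this example
base_url = "https://www.example.com/docs/"
html = """
<a href="/about">About</a>
<a href="guide.html">Guide</a>
<a name="top">Anchor without href</a>
"""

soup = BeautifulSoup(html, "html.parser")

absolute_links = []
# href=True keeps only <a> tags that actually have an href attribute
for link in soup.find_all("a", href=True):
    full_url = urljoin(base_url, link["href"])
    absolute_links.append((full_url, link.get_text(strip=True)))

print(absolute_links)
```

In practice base_url would be the URL the page was fetched from, so that relative paths resolve against the right location.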

# Get all images
imgs = soup.find_all('img')
for img in imgs:
    src = img.get('src')  # .get() returns None instead of raising KeyError
    alt = img.get('alt')
    print(src, alt)

# Extract table data
table = soup.find('table', class_='table')
rows = table.find_all('tr') if table is not None else []
for row in rows:
    cols = row.find_all('td')
    for col in cols:
        text = col.get_text(strip=True)
        print(text, end='\t')
    print()
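
Printing cells is fine for a quick look, but table data is usually easier to work with as a list of records. The sketch below is one possible approach, assuming an invented table whose first row holds th header cells; the markup and the names headers and records are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical table, invented for this example
html = """
<table class="table">
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Apple</td><td>1.20</td></tr>
  <tr><td>Pear</td><td>0.80</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="table")

records = []
if table is not None:
    rows = table.find_all("tr")
    # The first row holds the column names in <th> cells
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # Pair each cell with its column name
        records.append(dict(zip(headers, cells)))

print(records)
```

All values come back as strings, so numeric columns such as the price would still need to be converted before doing arithmetic on them.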

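Summary

BeautifulSoup is a powerful tool for parsing web page content. Combined with requests, it covers the workflow shown above: fetch a page, parse it, locate elements by tag name, class name, or id, and read their text or attributes.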