Scraping Images from Bilibili "饭拍图" Articles

This code uses Python to scrape the images from every article on page 2 of Bilibili's article search results for the keyword "饭拍图" (fancam photos) and save them to a local folder.

Code Implementation

import os
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from urllib.parse import urljoin

# Running mode flag (reserved; not referenced below)
RUNNING_MODE = 1

# Set the URL to scrape
page_url = 'https://search.bilibili.com/article?keyword=饭拍图&page=2'  # Start from page 2

# Specify the save path and WebDriver path
save_path = 'bilibili_images'
webdriver_path = '/Users/yangjunjie/Downloads/chromedriver-mac-x64/chromedriver'  # Update with the correct path

# If the directory doesn't exist, create it
if not os.path.exists(save_path):
    os.makedirs(save_path)

# Set Chrome WebDriver options
chrome_options = Options()
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--disable-extensions')

# Selenium 4 takes the driver path via a Service object, not via Options
service = Service(webdriver_path)

# Function to extract article URLs from the search page
def extract_article_urls(url):
    driver = webdriver.Chrome(service=service, options=chrome_options)
    driver.get(url)
    time.sleep(3)

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    article_links = soup.find_all('a', class_='article-title')

    article_urls = [urljoin(url, link['href']) for link in article_links]
    driver.quit()
    return article_urls

# Function to scrape and download images from a given URL
def scrape_and_download_images(url, folder):
    driver = webdriver.Chrome(service=service, options=chrome_options)
    driver.get(url)
    time.sleep(3)

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    image_elements = soup.find_all('img', {'data-src': True})

    for img_idx, img in enumerate(image_elements):
        image_url = img['data-src']

        # Protocol-relative URLs need an explicit scheme
        if image_url.startswith('//'):
            image_url = 'https:' + image_url

        filename = f'image_{img_idx + 1}.jpg'
        save_file_path = os.path.join(folder, filename)

        if os.path.exists(save_file_path):
            print(f'[{img_idx + 1}/{len(image_elements)}] Image already exists. Path: {save_file_path}')
            continue

        try:
            # Download first, then write, so a failed request never leaves an empty file behind
            response = requests.get(image_url, timeout=30)
            response.raise_for_status()
            with open(save_file_path, 'wb') as f:
                f.write(response.content)
            print(f'[{img_idx + 1}/{len(image_elements)}] Downloaded image: {image_url}. Path: {save_file_path}')
        except requests.RequestException as e:
            print(f'Request error: {e}')
            print('Waiting ten seconds before moving on to the next image...')
            time.sleep(10)

    driver.quit()

# Extract article URLs from the search page
article_urls = extract_article_urls(page_url)

# Scrape and download images from each article
for idx, article_url in enumerate(article_urls):
    print(f'Scraping images from article {idx + 1}/{len(article_urls)}')
    article_folder = os.path.join(save_path, f'article_{idx + 1}')
    if not os.path.exists(article_folder):
        os.makedirs(article_folder)
    scrape_and_download_images(article_url, article_folder)

Code Explanation

  1. Import the required libraries: os, time, requests, selenium, bs4, and urllib.parse.
  2. Set the running-mode flag and the target URL.
  3. Specify the image save path and the ChromeDriver path.
  4. Define the extract_article_urls function, which collects the URL of every article on the search page.
  5. Define the scrape_and_download_images function, which scrapes and downloads the images from a given URL.
  6. Call extract_article_urls to get the list of article URLs.
  7. Loop over the article URLs, calling scrape_and_download_images to download each article's images.
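The image-URL handling in scrape_and_download_images covers two cases: protocol-relative URLs (starting with //) and everything else. A minimal sketch of that normalization as a standalone helper (the default base URL here is only an illustrative assumption):

```python
from urllib.parse import urljoin

def normalize_image_url(raw_url, base='https://www.bilibili.com'):
    # Protocol-relative URLs (e.g. //i0.hdslb.com/...) only need a scheme prepended
    if raw_url.startswith('//'):
        return 'https:' + raw_url
    # Anything else (relative or absolute) is resolved against the page URL
    return urljoin(base, raw_url)

print(normalize_image_url('//i0.hdslb.com/bfs/article/pic.jpg'))
# https://i0.hdslb.com/bfs/article/pic.jpg
```

The same helper could replace the inline startswith check in the download loop.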

Notes

  1. Make sure the installed ChromeDriver version matches your Chrome browser, and update the webdriver_path variable to the correct path.
  2. This code may need adjusting if Bilibili changes its page structure.
  3. Follow Bilibili's terms of service; do not scrape abusively or misuse the data.
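If the article-title class name changes, the BeautifulSoup lookup silently returns an empty list. One hedged fallback, assuming Bilibili article pages still live under /read/cvNNN paths, is to pull those paths straight out of the raw HTML with a regex:

```python
import re

def extract_read_urls(html):
    # Match protocol-relative links to /read/cv<digits> article pages
    paths = re.findall(r'href="(//www\.bilibili\.com/read/cv\d+)"', html)
    # Prepend a scheme and drop duplicates while preserving order
    return list(dict.fromkeys('https:' + p for p in paths))
```

This is a sketch, not a replacement for proper selectors: it only works while the href pattern above still matches the markup.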

Result

After the code runs, folders named article_1, article_2, and so on are created under the specified save path, each containing the images from one article.

Extensions

  1. The code can be modified to scrape other pages or keywords.
  2. Image-processing steps, such as compression or format conversion, can be added.
  3. A crawler framework such as Scrapy can be used to improve scraping efficiency.
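For extension 1, the keyword must be percent-encoded before it goes into the query string. A small sketch of a URL builder for the same search endpoint used above:

```python
from urllib.parse import quote

def build_search_url(keyword, page=1):
    # quote() percent-encodes non-ASCII keywords such as '饭拍图'
    return f'https://search.bilibili.com/article?keyword={quote(keyword)}&page={page}'

print(build_search_url('饭拍图', page=2))
# https://search.bilibili.com/article?keyword=%E9%A5%AD%E6%8B%8D%E5%9B%BE&page=2
```

Looping this builder over a range of pages would extend the scraper beyond the single hard-coded page_url.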

Hopefully this code helps you quickly scrape the images from Bilibili "饭拍图" articles.


Original source: https://www.cveoy.top/t/topic/hw4e — copyright belongs to the author. Do not repost or scrape!
