Python Baidu Image Downloader: Download Images for Machine Learning

This Python script downloads images matching a keyword from Baidu Image Search. It uses the 'requests' library to query the Baidu image search endpoint, extracts the image URLs from the response with a regular expression, then fetches each image and saves it to a folder on the local machine.

The download_pic function takes two arguments: the keyword to search for and the number of images to download (default 10). It builds the request URL from the URL-encoded keyword and a result offset (each request returns up to 30 results), sends a GET request with the 'requests' library, and collects every "thumbURL" value from the response text, so the files it saves are the thumbnail versions of the search results. It then creates a folder named after the keyword and saves each image there as keyword_index.jpg.
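
The URL construction and extraction steps can be tried offline. The sketch below reuses the endpoint and the "thumbURL" pattern from the script; the helper names build_search_url and extract_thumb_urls and the sample response text are illustrative assumptions, not part of the original script, and a real response may be structured differently.

```python
import re
from urllib.parse import quote

# Endpoint and query parameters copied from the script; `pn` is the result offset.
URL_TEMPLATE = (
    'https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj'
    '&ct=201326592&fp=result&word={}&pn={}&rn=30'
)
PAGE_SIZE = 30


def build_search_url(keyword, page):
    """Return the search URL for a zero-based page of results (hypothetical helper)."""
    return URL_TEMPLATE.format(quote(keyword), page * PAGE_SIZE)


def extract_thumb_urls(text):
    """Pull every "thumbURL" value out of the raw response text (hypothetical helper)."""
    return re.findall(r'"thumbURL":"(.*?)",', text)


# Made-up response fragment, used only to exercise the pattern.
sample = (
    '{"thumbURL":"https://example.com/a.jpg","x":1},'
    '{"thumbURL":"https://example.com/b.jpg","x":2}'
)
print(build_search_url('cat', 1))
print(extract_thumb_urls(sample))
```

Keeping these two steps in small functions makes it possible to check the pattern against a saved response before any images are downloaded.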

The script is useful for collecting large numbers of images for training machine learning models or for data analysis. Respect the terms of service of the API: do not send requests in rapid succession or download more images than you need.
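
One way to avoid sending requests too quickly is to enforce a minimum interval between them. The following is a minimal sketch, not part of the original script: the Throttle class, its parameters, and the 0.05 second interval are assumptions chosen for illustration, and a suitable interval depends on the service.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=0.05)  # assumed value; tune to the service
for i in range(3):
    throttle.wait()
    print('request', i)  # a real script would call requests.get here
```

In the script, calling throttle.wait() immediately before each requests.get call would space out both the search requests and the image downloads.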

import re
import os
import requests
from urllib.parse import quote

def download_pic(keyword, num=10):
    """Download up to `num` thumbnails for `keyword` from Baidu Image Search."""
    url_template = 'https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&fp=result&word={}&pn={}&rn=30'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    pic_urls = []
    # Each request returns up to 30 results (rn=30); pn is the result offset.
    # Ceiling division avoids requesting an extra page when num is a multiple of 30.
    for i in range((num + 29) // 30):
        url = url_template.format(quote(keyword), i * 30)
        response = requests.get(url, headers=headers, timeout=5)
        response.raise_for_status()
        # Replace undecodable bytes instead of raising UnicodeDecodeError.
        html = response.content.decode('utf-8', errors='replace')
        pic_urls += re.findall(r'"thumbURL":"(.*?)",', html)

    # Create the output folder; do nothing if it already exists.
    folder_path = './{}'.format(keyword)
    os.makedirs(folder_path, exist_ok=True)

    for idx, pic_url in enumerate(pic_urls[:num]):
        try:
            pic = requests.get(pic_url, headers=headers, timeout=5)
            pic.raise_for_status()  # do not save an error page as a .jpg file
            img_name = '{}/{}_{}.jpg'.format(folder_path, keyword, idx)
            with open(img_name, 'wb') as f:
                f.write(pic.content)
            print('[INFO] Successfully downloaded: {}'.format(img_name))
        except (requests.RequestException, OSError) as e:
            print('[ERROR] Failed to download {}: {}'.format(pic_url, e))

if __name__ == '__main__':
    download_pic('cat', 50)
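
The script hard-codes its keyword and count in the main block. To take both from the command line instead, the standard library's argparse module can parse them. This is a sketch under assumptions: the parse_args helper and the -n/--num option name are hypothetical and do not appear in the original script.

```python
import argparse


def parse_args(argv=None):
    """Parse the keyword and image count (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description='Download thumbnails from Baidu Image Search.')
    parser.add_argument('keyword', help='search term, also used as the folder name')
    parser.add_argument('-n', '--num', type=int, default=10,
                        help='number of images to download (default: 10)')
    return parser.parse_args(argv)


# Parse an explicit argument list here; pass nothing to read sys.argv instead.
args = parse_args(['cat', '--num', '50'])
print(args.keyword, args.num)
# The main block could then call: download_pic(args.keyword, args.num)
```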