Learning Outcomes

  • To learn how to download multiple images in Python using synchronous and asynchronous code.

Automatically downloading images from a number of your HTML pages is an essential skill. In this guide, you’ll learn 4 methods for downloading images using Python!

Let’s begin with the easiest example. If we already have a list of image URLs, then we can follow this process:

  1. Change into a directory where we would like to store all of the images.
  2. Make a request to download all of the images, one by one.
  3. We will also include error handling so that if a URL no longer exists the code will still work.

Python Imports

!pip install tldextract
import requests
import os
import subprocess
import urllib.request
from bs4 import BeautifulSoup
import tldextract

!mkdir all_images
!ls

Next, change into the folder called all_images. This can be done with either of the following:


cd all_images
os.chdir('path')
os.chdir('all_images')
!pwd
/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images

Method One: How To Download Multiple Images From A Python List

In order to download multiple images, we’ll use the requests library. We’ll also create a Python list to store any broken image URLs that didn’t return a 200 status code:

broken_images = []
image_urls = ['https://sempioneer.com/wp-content/uploads/2020/05/dataframe-300x84.png',
             'https://sempioneer.com/wp-content/uploads/2020/05/json_format_data-300x72.png']
for img in image_urls:
    # We can split the file based upon / and extract the last split within the python list below:
    file_name = img.split('/')[-1]
    print(f"This is the file name: {file_name}")
    # Now let's send a request to the image URL:
    r = requests.get(img, stream=True)
    # We can check that the status code is 200 before doing anything else:
    if r.status_code == 200:
        # This command below will allow us to write the data to a file as binary:
        with open(file_name, 'wb') as f:
            for chunk in r:
                f.write(chunk)
    else:
        # Record any URL that did not return a 200 status code in the broken_images list:
        broken_images.append(img)
This is the file name: dataframe-300x84.png
This is the file name: json_format_data-300x72.png

☝️ See how simple that is! ☝️

If you check your folder, you will now have downloaded all of the images that returned a status code of 200!
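
One caveat with taking the last part of the URL as the file name: if an image URL carries a query string (for example ?s=35&r=g), the query string ends up in the file name too. Below is a minimal sketch of a more defensive helper, assuming we are happy to drop the query string. The function name file_name_from_url and the default fallback are illustrative and not part of the code above.

```python
import posixpath
from urllib.parse import unquote, urlparse


def file_name_from_url(url: str, default: str = "image") -> str:
    """Return the last path segment of a URL, ignoring any query string or fragment."""
    # urlparse separates the path from the query string and fragment.
    path = urlparse(url).path
    # unquote turns percent-escapes such as %20 back into plain characters.
    name = posixpath.basename(unquote(path))
    # URLs ending in "/" have no final segment, so fall back to a default name.
    return name or default


print(file_name_from_url("https://sempioneer.com/wp-content/uploads/2020/05/dataframe-300x84.png"))
print(file_name_from_url("https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g"))
```

Note that the resulting name may have no file extension (as with the avatar URL), so you may still want to add one yourself.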


[Image: downloading images correctly with Python]

Method Two: How To Download Multiple Images From Many HTML Web Pages

If we don’t yet have the exact image URLs, we will need to do the following:

  1. Download the HTML content of every web page.
  2. Extract all of the image URLs for every page.
  3. Create the file names.
  4. Check to see if the image status code is 200.
  5. Write all of images to your local computer.

The website internetingishard.com has some relative image URLs. Therefore, we will need to ensure that our code can handle both types of image source URL: absolute and relative. Here are the web pages we will use:


web_pages = ['https://understandingdata.com/', 
             'https://understandingdata.com/data-engineering-services/',
             'https://www.internetingishard.com/html-and-css/links-and-images/']

We will also extract the domain of every URL whilst we loop over the webpages like so:


for page in web_pages:
    domain_name = tldextract.extract(page).registered_domain
url_dictionary = {}
for page in web_pages:
    # 1. Extracting the domain name of the web page:
    domain_name = tldextract.extract(page).registered_domain
    print(f"The domain name: {domain_name}")    
    # 2. Request the web page:
    r = requests.get(page)
    # 3. Check to see if the web page returned a status_200:
    if r.status_code == 200:

        # 4. Create a URL dictionary entry for future use:
        url_dictionary[page] = []

        # 5. Parse the HTML content with BeautifulSoup and look for image tags:
        soup = BeautifulSoup(r.content, 'html.parser')

        # 6. Find all of the images per web page:
        images = soup.find_all('img')

        # 7. Store all of the images 
        url_dictionary[page].extend(images)

    else:
        print('failed!')
The domain name: understandingdata.com
The domain name: understandingdata.com
The domain name: internetingishard.com

Now let’s double check and filter our dictionary so that we only look at web pages where there was at least 1 image tag:

for key, value in url_dictionary.items():
    if len(value) > 0:
        print(f"This page: {key} has at least 1 image.")
This page: https://understandingdata.com/ has at least 1 image.
This page: https://understandingdata.com/data-engineering-services/ has at least 1 image.
This page: https://www.internetingishard.com/html-and-css/links-and-images/ has at least 1 image.

An easier way to write the above code would be via a dictionary comprehension:

cleaned_dictionary = {key: value for key, value in url_dictionary.items() if len(value) > 0}

We can now clean all of the image URLs inside of every dictionary key and change all of the relative URL paths to exact URL paths.

Let’s start by printing out all of the different image sources to see how we might need to clean up the data below:

for key, images in cleaned_dictionary.items():
    for image in images:
        print(image.attrs['src'])

For the scope of this tutorial, I have decided to:

  • Skip the logo links that start with // (protocol-relative URLs).
  • Add the domain to the relative URLs.
all_images = []

for key, images in cleaned_dictionary.items():
    # 1. Extracting the domain name for every page:
    domain_name = tldextract.extract(key).registered_domain
    # 2. Looping over every image per url:
    for image in images:
        # 3. Extracting the source (src) with .attrs, skipping tags without one:
        source_image_url = image.attrs.get('src')
        if not source_image_url:
            continue
        # 4. Clean the data:
        if source_image_url.startswith("//"):
            # Skip protocol-relative URLs such as the logo links:
            continue
        elif domain_name not in source_image_url and 'http' not in source_image_url:
            # Relative URL, so prepend the scheme and the domain:
            url = 'https://' + domain_name + source_image_url
            all_images.append(url)
        else:
            all_images.append(source_image_url)
print(all_images[0:5])
['https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg', 'https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg', 'https://understandingdata.com/wp-content/uploads/2020/05/Is-Web-Scraping-Illegal-370x370.png', 'https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g', 'https://understandingdata.com/wp-content/uploads/2020/03/web-scraping-tools-370x192.jpg']
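
The string concatenation above assumes that every relative path starts with a /. As an alternative sketch, the standard library’s urllib.parse.urljoin resolves relative, root-relative and protocol-relative (//) sources against the URL of the page. The helper name absolute_image_url and the sample image paths are made up for illustration.

```python
from urllib.parse import urljoin


def absolute_image_url(page_url: str, src: str) -> str:
    """Resolve an <img> src attribute against the URL of the page it came from."""
    return urljoin(page_url, src)


page = "https://www.internetingishard.com/html-and-css/links-and-images/"

# Root-relative path: resolved against the site root.
print(absolute_image_url(page, "/images/logo.png"))
# Relative path: resolved against the page's own directory.
print(absolute_image_url(page, "images/mochi.jpg"))
# Protocol-relative URL: inherits the https scheme from the page.
print(absolute_image_url(page, "//cdn.example.com/logo.png"))
# Absolute URL: returned unchanged.
print(absolute_image_url(page, "https://understandingdata.com/a.png"))
```

Unlike the cleaning step in this tutorial, this approach keeps protocol-relative URLs rather than skipping them, so filter those out first if you don’t want them.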

After cleaning the image URLs, we can now refer to method one for downloading the images to our computer!

This time let’s convert it into a function:

def extract_images(image_urls_list:list, directory_path):

    # Changing directory into a specific folder:
    os.chdir(directory_path)

    # Downloading all of the images
    for img in image_urls_list:
        file_name = img.split('/')[-1]

        # Let's try both of these versions in a loop [https:// and https://www.]
        url_paths_to_try = [img, img.replace('https://', 'https://www.')]
        for url_image_path in url_paths_to_try:
            print(url_image_path)
            try:
                # Request the URL variant we are currently trying:
                r = requests.get(url_image_path, stream=True)
                if r.status_code == 200:
                    with open(file_name, 'wb') as f:
                        for chunk in r:
                            f.write(chunk)
                    # Stop once one of the variants has been downloaded:
                    break
            except Exception as e:
                # Ignore the error and move on to the next variant:
                pass
!pwd
/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images
path = '/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images'

extract_images(image_urls_list=all_images, 
               directory_path=path)

Fantastic!

There are a few cases that we didn’t cover, including:

  • http:// only image URLs.
  • http://www. only image URLs.

But for the most part, you’ll be able to download images in bulk!
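
If you do want to cover those cases, one possible sketch is to generate every scheme and www. variant of a URL up front and try them in order. The helper name candidate_urls is hypothetical, and adding www. only makes sense for bare domains: for a host such as secure.gravatar.com the www. variants will most likely not resolve.

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_urls(url: str) -> list:
    """Return scheme and www. variants of a URL, original first, without duplicates."""
    parts = urlsplit(url)
    host = parts.netloc
    # Strip a leading "www." so both host variants can be rebuilt consistently.
    bare_host = host[4:] if host.startswith("www.") else host

    candidates = [url]
    for scheme in ("https", "http"):
        for netloc in (bare_host, "www." + bare_host):
            variant = urlunsplit((scheme, netloc, parts.path, parts.query, parts.fragment))
            if variant not in candidates:
                candidates.append(variant)
    return candidates


for variant in candidate_urls("http://example.com/a.png"):
    print(variant)
```

The returned list could replace url_paths_to_try in the function above.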


[Image: how to download multiple images within Python]

How To Speed Up Your Image Downloads

When working with hundreds or thousands of URLs, it’s important to avoid downloading images synchronously, one after another. An asynchronous approach means that we can download multiple web pages or multiple images concurrently, so the program isn’t sitting idle while it waits for each response.

This means that the overall execution time will be much quicker!


ThreadPoolExecutor()

ThreadPoolExecutor is a class in Python’s built-in concurrent.futures module that runs function calls concurrently across a pool of threads, which suits I/O-bound work such as downloading. In order to utilise it, we will make sure that our function only works on a single URL.

Then we will pass the image URL list into multiple workers 😉

def extract_single_image(img):
    file_name = img.split('/')[-1]

    # Let's try both of these versions in a loop [https:// and https://www.]
    url_paths_to_try = [img, img.replace('https://', 'https://www.')]
    for url_image_path in url_paths_to_try:
        try:
            # Request the URL variant we are currently trying:
            r = requests.get(url_image_path, stream=True)
            if r.status_code == 200:
                with open(file_name, 'wb') as f:
                    for chunk in r:
                        f.write(chunk)
                return "Completed"
        except Exception as e:
            # Move on to the next variant if the request raised an error:
            continue
    # Neither variant returned a 200 status code:
    return "Failed"
all_images[0:5]
['https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg',
 'https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg',
 'https://understandingdata.com/wp-content/uploads/2020/05/Is-Web-Scraping-Illegal-370x370.png',
 'https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g',
 'https://understandingdata.com/wp-content/uploads/2020/03/web-scraping-tools-370x192.jpg']

The code below creates a new directory and then makes it the current working directory:

try:
    os.mkdir('/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images_asnyc')

except FileExistsError as e:
    print('The file path already exists!')

os.chdir('/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images_asnyc')
import concurrent.futures
import urllib.request

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the download operations and map each future to its URL
    future_to_url = {executor.submit(extract_single_image, image_url): image_url for image_url in all_images}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))

You should’ve downloaded the images but at a much faster rate!
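
To see the speed-up without hitting a real website, here is a small sketch that replaces the download with time.sleep. The fake_download function and the example.com URLs are stand-ins invented for this demonstration. The exact timings will vary between machines, so none are quoted here.

```python
import concurrent.futures
import time


def fake_download(url: str) -> str:
    """Stand-in for a real download: sleep to mimic waiting on the network."""
    time.sleep(0.05)
    return url


urls = [f"https://example.com/image-{i}.png" for i in range(10)]

# Synchronous: each simulated download waits for the previous one to finish.
start = time.perf_counter()
sequential_results = [fake_download(url) for url in urls]
sequential_seconds = time.perf_counter() - start

# Threaded: up to 5 simulated downloads wait at the same time.
start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # executor.map returns results in the same order as the input URLs.
    threaded_results = list(executor.map(fake_download, urls))
threaded_seconds = time.perf_counter() - start

print(f"Sequential: {sequential_seconds:.2f}s, threaded: {threaded_seconds:.2f}s")
```

Because the threads spend almost all of their time waiting, the threaded version should finish in a fraction of the sequential time.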


Async Programming!

Just like JavaScript, Python has native support for coroutines through the async / await syntax (Python 3.5+) and the built-in asyncio library. Similar to NodeJS, there is also a way to create custom event loops for async code.

We will also need to install an asynchronous HTTP requests library called aiohttp:

!pip install aiohttp

We will also install aiofiles, which allows us to write multiple image files asynchronously:

!pip install aiofiles
import aiohttp
import aiofiles
import asyncio

try:
    os.mkdir('/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images_asnyc_event_loop')
except FileExistsError as e:
    print('The file path already exists!')

os.chdir('/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images_asnyc_event_loop')
!pwd

/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images_asnyc_event_loop

How To Download 1 File Asynchronously

print(all_images[0:1])
['https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg']
single_image = 'https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg'
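# Note: a top-level "async with" like this works in a Jupyter notebook cell,
# but in a regular .py file it must live inside an async function.
async with aiohttp.ClientSession() as session:
    async with session.get(single_image) as resp:
        # 1. Capturing the image file name like we did before:
        single_image_name = single_image.split('/')[-1]
        # 2. Only proceed further if the HTTP response is 200 (Ok)
        if resp.status == 200:
            # The file is closed automatically when the async with block exits:
            async with aiofiles.open(single_image_name, mode='wb') as f:
                await f.write(await resp.read())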
[Image: downloading one image with aiofiles]

We will need to structure our code slightly differently for the async version to work across multiple files:

  1. We will have a fetch function to query every image URL.
  2. We will have a main function that creates, then executes a series of co-routines.
async def fetch(session, url):
    async with session.get(url) as resp:
        # 1. Capturing the image file name like we did before:
        url_name = url.split('/')[-1]
        # 2. Only proceed further if the HTTP response is 200 (Ok)
        if resp.status == 200:
            async with aiofiles.open(url_name, mode='wb') as f:
                await f.write(await resp.read())
async def main(image_urls:list):
    tasks = []
    headers = {
        "user-agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}
    async with aiohttp.ClientSession(headers=headers) as session:
        for image in image_urls:
            # Collect the coroutines without awaiting them yet:
            tasks.append(fetch(session, image))
        # Run all of the downloads concurrently while the session is still open:
        data = await asyncio.gather(*tasks)
main(all_images)
<coroutine object main at 0x107746e60>

☝️☝️☝️ Notice how when we call this function, it doesn’t actually run and produces a co-routine! ☝️☝️☝️

We can then use asyncio as a method for executing all of the fetch callables that need to be completed:
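
As a rough illustration that needs no network access, the sketch below swaps the real request for asyncio.sleep. The names fake_fetch and fetch_all and the example.com URLs are invented for this example. It is written to run as a plain .py script.

```python
import asyncio
import time


async def fake_fetch(url: str) -> str:
    """Stand-in for a real request: asyncio.sleep mimics waiting on the network."""
    await asyncio.sleep(0.05)
    return url


async def fetch_all(urls: list) -> list:
    # Calling fake_fetch(url) only builds a coroutine object; nothing runs yet.
    coroutines = [fake_fetch(url) for url in urls]
    # asyncio.gather schedules them together and returns results in input order.
    return await asyncio.gather(*coroutines)


urls = [f"https://example.com/image-{i}.png" for i in range(10)]

start = time.perf_counter()
# asyncio.run starts an event loop, runs the coroutine to completion, then closes the loop.
results = asyncio.run(fetch_all(urls))
elapsed = time.perf_counter() - start

print(f"Fetched {len(results)} fake images in {elapsed:.2f}s")
```

Ten sleeps of 0.05 seconds would take about half a second one after another, so a noticeably shorter elapsed time shows that the coroutines were waiting concurrently.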

Error with asyncio.run

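If you receive a RuntimeError (asyncio.run() cannot be called from a running event loop) when running the following command:


asyncio.run(main(all_images))

It is likely because you’re trying to call asyncio.run from inside an event loop that is already running, which isn’t possible. (Jupyter notebook runs in an event loop!) In a notebook cell you can write await main(all_images) instead.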
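
If you are unsure which situation you are in, a small sketch like the one below can tell you. The helper names event_loop_is_running and check_inside_loop are made up for illustration; asyncio.get_running_loop is a standard library function (Python 3.7+) that raises a RuntimeError when no loop is running.

```python
import asyncio


def event_loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop raises RuntimeError when no loop is running.
        return False
    return True


async def check_inside_loop() -> bool:
    # This coroutine is executed by an event loop, so the check happens inside one.
    return event_loop_is_running()


# Called directly from a plain script, no loop is running yet.
print(event_loop_is_running())
# Called from a coroutine that asyncio.run is executing, a loop is running.
print(asyncio.run(check_inside_loop()))
```

When the first check is true (as in a Jupyter notebook), await the coroutine directly; otherwise asyncio.run is the right entry point.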


How To Download Multiple Images Inside Of A Python File (.py)

Let’s save the variable containing our URLs to a .txt file:

with open('images.txt', 'w') as f:
    for item in all_images:
        f.write(f"{item}\n")
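
As a quick sanity check, here is a round-trip sketch that writes URLs to a text file and reads them back. It uses a temporary directory so nothing is left on disk, and the two URLs are simply taken from the list above.

```python
import os
import tempfile

urls = [
    "https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg",
    "https://understandingdata.com/wp-content/uploads/2020/03/web-scraping-tools-370x192.jpg",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "images.txt")

    # Write one URL per line.
    with open(file_path, "w") as f:
        for url in urls:
            f.write(f"{url}\n")

    # Read them back; splitlines() drops the newline characters and does not
    # produce an empty trailing entry for the final line.
    with open(file_path, "r") as f:
        loaded_urls = f.read().splitlines()

print(loaded_urls == urls)
```

Using splitlines() rather than splitting on the newline character avoids ending up with an empty string at the end of the list.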

Create A Python File

Then you will need to create a python file and add the following code to it:

# Package / Module Imports
import aiohttp
import aiofiles
import asyncio
import os

# 1. Choose A Path - You will need to change this to your desired directory:
path = '/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images_asnyc_event_loop'

try:
    os.mkdir(path)
except FileExistsError as e:
    print('The file path already exists!')

# 2. Changing directory into that specific path:
os.chdir(path)

# 3. Reading the URLs from the text file (skipping any empty lines):
with open('images.txt', 'r') as f:
    image_urls = [line for line in f.read().split('\n') if line]

# 4. Creating the async functions:
async def fetch(session, url):
    async with session.get(url) as resp:
        # Capturing the image file name like we did before:
        url_name = url.split('/')[-1]
        # Only proceed further if the HTTP response is 200 (Ok)
        if resp.status == 200:
            async with aiofiles.open(url_name, mode='wb') as f:
                await f.write(await resp.read())

async def main(image_urls:list):
    tasks = []
    headers = {
        "user-agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}
    async with aiohttp.ClientSession(headers=headers) as session:
        for image in image_urls:
            # Collect the coroutines without awaiting them yet:
            tasks.append(fetch(session, image))
        # Run all of the downloads concurrently while the session is still open:
        data = await asyncio.gather(*tasks)

# 5. Executing all of the asyncio tasks:
try:
    asyncio.run(main(image_urls))
except Exception as e:
    print(e)

Then run the Python script in your terminal / command line with:


python3 python_file_name.py

Let’s break down what’s happening in the above code snippet:

  1. We are importing all of the relevant packages for async programming with files.
  2. Then we create a new directory.
  3. After creating the new folder we change that folder to be the active working directory.
  4. We then read the variable data which was previously saved from the file called images.txt
  5. Then we create a series of co-routines and execute them within a main() function with asyncio.
  6. As these co-routines are executed every file is asynchronously saved to your computer.

[Image: downloading multiple files with asyncio and aiohttp]

Finally let’s clear up and delete all of the folders to clean up our environment:

folders_to_delete = [
'/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images_asnyc_event_loop',
    '/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images',
    '/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images_asnyc'
]
import shutil
for folder in folders_to_delete:
    print(f"Deleting this folder directory: {folder}")
    print('------')
    try:
        shutil.rmtree(folder)
    except Exception as e:
        # Report the problem and carry on with the remaining folders:
        print(e)

Being able to download images with Python extends your automation capabilities and lets you feed that image data into other programs, APIs and so on!

Hopefully you now feel confident about downloading images with Python.