How to Extract Images from the Web in Python

A Python image scraper isn’t just a tool for honing your programming skills. You can also use it to source images for a machine learning project or generate site thumbnails. While there may be other ways to do similar things, nothing can beat the control you have using tools you build yourself.

Learn how to grab images from any website using Python and the BeautifulSoup library.

Like more general web scraping, image scraping is a method of downloading website content. It's not illegal, but there are some rules and best practices you should follow. First, you should avoid scraping a website if it explicitly states that it doesn't want to be scraped. You can find out by checking for a /robots.txt file on the target site.
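If you prefer to check this programmatically, the standard library's urllib.robotparser module can read a site's robots.txt for you. Here is a minimal sketch; the example.com address and the generic user agent are placeholders, not part of this tutorial:

from urllib.robotparser import RobotFileParser

# Hypothetical target site; replace with the site you want to scrape
robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

# True if the rules allow a generic crawler to fetch this page
print(robots.can_fetch("*", "https://example.com/gallery"))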

Most websites allow web crawling because they want search engines to index their content. You can scrape such websites because their images are publicly available.

However, just because you can download an image doesn't mean you can use it as your own. Most websites license their images to prevent you from republishing or reusing them in any other way. Always assume that you cannot reuse images unless there is a specific exemption.

Python Package Setup

You will need to install a few packages before you begin. If Python is not installed on your computer, visit the official python.org website to download and install the latest version.

Next, open a terminal in your project folder and activate a Python virtual environment to isolate your dependencies.
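If you haven't worked with virtual environments before, one common way to create and activate one looks like this; the folder name venv is just a convention, and the exact activation command depends on your operating system and shell:

python -m venv venv

# On Windows
venv\Scripts\activate

# On macOS and Linux
source venv/bin/activate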

Finally, install the requests and BeautifulSoup packages using pip:

pip install bs4 requests

Image scraping with Python

For this image scraping tutorial, you will use the requests library to fetch a web page containing the target images. You will then pass the response from that website to BeautifulSoup to grab all the image link addresses from img tags. Finally, you will write each image file into a folder to download the images.

How to Fetch Image URLs with Python’s BeautifulSoup

Now go ahead and create a Python file in your project’s root folder. Be sure to add the .py filename extension.

Each code snippet in this tutorial is a continuation of the previous one.

Open the Python file with any good code editor and use the following code to request a web page:

import requests
URL = "imagesiteURL"
getURL = requests.get(URL, headers={"User-Agent":"Mozilla/5.0"})
print(getURL.status_code)

If the above program prints a 200 response code, the request was successful. If not, check that your network connection is stable and that you provided a valid URL.
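As an optional extra, not part of the original steps, requests can also raise an exception for error status codes automatically, which saves you from checking the code by hand:

# Raises requests.exceptions.HTTPError for 4xx and 5xx responses
getURL.raise_for_status()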

Now use BeautifulSoup to read the content of the web page with html.parser:

from bs4 import BeautifulSoup

soup = BeautifulSoup(getURL.text, 'html.parser')

images = soup.find_all('img')
print(images)

This code creates a list of objects, each representing an image from the web page. However, what you need from this data is the text of each image's src attribute.

To extract the source from each img tag:

imageSources = []

for image in images:
    imageSources.append(image.get('src'))

print(imageSources)

Run your code again, and the image addresses should now appear in a new list (imageSources). You have successfully extracted each image source from the target web page.

How to save images with Python

First, create a download destination folder in your project root directory and name it images, since that is the folder the download code writes to.
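You can create the folder by hand or, if you prefer, let the script create it; this line is an optional addition using the standard library's os module:

import os

# Create the images folder if it doesn't already exist
os.makedirs('images', exist_ok=True)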

For Python to successfully download images, their paths must be full absolute URLs. In other words, they must include the “http://” or “https://” prefix, as well as the full domain of the website. If the web page refers to its images using relative URLs, you will need to convert them to absolute URLs.

In the simplest case, when the URLs are already absolute, initiating the download is simply a matter of requesting each image from the previously extracted sources:

for image in imageSources:
    webs = requests.get(image)
    open('images/' + image.split('/')[-1], 'wb').write(webs.content)

The image.split('/')[-1] expression splits the image link at each forward slash (/) and then retrieves the image file name, including any extension, from the last item.

Keep in mind that, in rare cases, image filenames may clash, resulting in downloads overwriting one another. Feel free to explore solutions to this problem as an extension of this example; one possible approach is sketched below.
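For instance, one possible fix (a sketch, not part of the original tutorial) is to prefix each saved file with its position in the list, which guarantees unique names even when two images share a file name:

for index, image in enumerate(imageSources):
    webs = requests.get(image)
    fileName = str(index) + '_' + image.split('/')[-1]
    open('images/' + fileName, 'wb').write(webs.content)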

Building absolute URLs by hand can get quite complicated, with many edge cases to cover. Fortunately, there is a useful method in the requests.compat module called urljoin. This method returns a complete URL from a base URL and a second URL that may be relative. It lets you resolve the values you find in href and src attributes.
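As a quick illustration, using a hypothetical example.com page rather than a URL from this tutorial, urljoin resolves a relative image path against the page it came from:

import requests

# A relative src value is resolved against the page's folder
print(requests.compat.urljoin("https://example.com/gallery/page.html", "cats/cat1.png"))
# https://example.com/gallery/cats/cat1.png

# A root-relative src value keeps only the scheme and domain of the base URL
print(requests.compat.urljoin("https://example.com/gallery/page.html", "/static/logo.png"))
# https://example.com/static/logo.png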

The final code looks like this:

import requests
from bs4 import BeautifulSoup

URL = "imagesiteURL"
getURL = requests.get(URL, headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(getURL.text, 'html.parser')

images = soup.find_all('img')
resolvedURLs = []

for image in images:
    src = image.get('src')
    resolvedURLs.append(requests.compat.urljoin(URL, src))

for image in resolvedURLs:
    webs = requests.get(image)
    open('images/' + image.split('/')[-1], 'wb').write(webs.content)

Never run out of image data

Many image recognition projects hit a brick wall due to an insufficient number of images to train a model. But you can always grab images from websites to boost your data repository. And luckily, Python gives you a powerful image scraper that you can use continuously without fear of being priced out.

If you want to grab other types of data from the web, you might want to learn how to use Python for general web scraping.