Scraping Google Image to build a dataset

Mar 19, 2018 |

In this post, we show how to create a image dataset for Deep Learning Projects using Google Images. The code is extensively based on this blog post: pyimagesearch. Thank you @Adrian Rosebrock for the great article.

I modified the code related to file saving so that we can check if an image has already been downloaded: i.e is it a duplicate. If it is a duplicate, move to the next url, otherwise, add the hash value of the image in a list and save the image on the local drive.

The first step is to create a text files with links to all images: urls.txt

run a query on Google Image - for example, I searched for images of racoons
scroll down until you go thru all the files you want
Open Javascript console: View => Developer => JavaScript Console
Select Tab console
Pull down jquery into the JavaScript console

var script = document.createElement('script');
script.src = "https://ajax.googleapis.com/ajax/libs/jquery/2.2.0/jquery.min.js";
document.getElementsByTagName('head')[0].appendChild(script);

Grab the URLs

var urls = $('.rg_di .rg_meta').map(function() { return JSON.parse($(this).text()).ou; });

write the URLs to file (one per line)

var textToSave = urls.toArray().join('\n');
var hiddenElement = document.createElement('a');
hiddenElement.href = 'data:attachment/text,' + encodeURI(textToSave);
hiddenElement.target = '_blank';
hiddenElement.download = 'urls.txt';
hiddenElement.click();

That’s what the text file contains:

http://cdn.natgeotv.com.au/subjects/headers/Animal-Racoon.jpg?v=28&azure=false&scale=both&width=1920&height=960&mode=crop
https://racoonprodbuilds.azureedge.net/images/29122016CrazyRacoon.jpg
https://static.independent.co.uk/s3fs-public/styles/article_small/public/thumbnails/image/2014/08/27/19/Raccoon-Getty.jpg
https://i.ytimg.com/vi/Gryv_Z7MSl0/maxresdefault.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Racoon_crossing_road.JPG/220px-Racoon_crossing_road.JPG
https://images.fineartamerica.com/images-medium-large-5/curious-racoon-sabrina-l-ryan.jpg
https://thumbs.dreamstime.com/b/racoon-926302.jpg
....
....

The last step is to download the images from the links in the text file

Let’s first import the libraries that we need:

import requests
import cv2
import os
import matplotlib.pyplot as plt
import hashlib
from PIL import Image
import io
import numpy as np

We then provide the name of the text file, and the folder where to save the images:

urls = "urls.txt"
output = "images"

We will be saving the images in the folder images/. Only unique images are saved, duplicates are skipped.

# open the text file that contains the urls
rows = open(urls).read().strip().split("\n")
#Keep track of the number of images downloaded
n_image = 0
ls_image_hash = list()
step_show_img = 50

def showMe(img_arr):
    '''
    Visualize image
    '''
    plt.imshow(img_arr)
    plt.show()

    
for url in rows:
    #Loop thru all the urls
    try:
    #try to download the image
        r = requests.get(url, timeout=60)
        # save the image locally
        fname = os.path.sep.join([output, "{}.jpg".format(str(n_image).zfill(8))])
        img_pil = Image.open(io.BytesIO(r.content))
        img_arr = np.array(img_pil)
        hash_object = hashlib.md5(img_arr)
        img_hash = hash_object.hexdigest()
        if img_hash in ls_image_hash:
            #This image is a duplicate -- skip
            print("Image is a duplicate - Not saved")
        else:
            #add to list
            ls_image_hash.append(img_hash)
            f = open(fname, "wb")
            f.close()
            n_image += 1
            if n_image % step_show_img == 0:
                showMe(img_pil)
            # update the counter
        print("[INFO] downloaded: {}".format(fname))

 
    except:
        print("[INFO] error downloading {}".format(url))