Scraping Google Images to build a dataset

In this post, we show how to create an image dataset for deep learning projects using Google Images. The code is largely based on this blog post from pyimagesearch. Thank you @Adrian Rosebrock for the great article.

I modified the file-saving code so that we can check whether an image has already been downloaded, i.e. whether it is a duplicate. If it is a duplicate, we move on to the next URL; otherwise, we add the hash value of the image to a list and save the image to the local drive.
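As a minimal sketch of that idea (the image_hash helper below is hypothetical, introduced only to illustrate the check; the full download loop appears later in the post):

import hashlib
import io
from PIL import Image

def image_hash(content):
    # hypothetical helper: hash the decoded pixel data so that two
    # byte-identical images map to the same MD5 digest
    img = Image.open(io.BytesIO(content))
    return hashlib.md5(img.tobytes()).hexdigest()

Note that MD5 only catches exact duplicates: a resized or re-encoded copy of the same photo produces a different hash and will still be saved.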

  • Run a query on Google Images - for example, I searched for images of raccoons
  • Scroll down until you have gone through all the images you want
  • Open the JavaScript console: View => Developer => JavaScript Console
  • Select the Console tab
  • Pull jQuery into the JavaScript console:
var script = document.createElement('script');
script.src = "https://ajax.googleapis.com/ajax/libs/jquery/2.2.0/jquery.min.js";
document.getElementsByTagName('head')[0].appendChild(script);
  • Grab the URLs (each .rg_meta element contains a JSON blob whose ou field is the original image URL):
var urls = $('.rg_di .rg_meta').map(function() { return JSON.parse($(this).text()).ou; });
  • Write the URLs to a file (one per line):
var textToSave = urls.toArray().join('\n');
var hiddenElement = document.createElement('a');
hiddenElement.href = 'data:attachment/text,' + encodeURI(textToSave);
hiddenElement.target = '_blank';
hiddenElement.download = 'urls.txt';
hiddenElement.click();

Here is what the text file contains:

http://cdn.natgeotv.com.au/subjects/headers/Animal-Racoon.jpg?v=28&azure=false&scale=both&width=1920&height=960&mode=crop
https://racoonprodbuilds.azureedge.net/images/29122016CrazyRacoon.jpg
https://static.independent.co.uk/s3fs-public/styles/article_small/public/thumbnails/image/2014/08/27/19/Raccoon-Getty.jpg
https://i.ytimg.com/vi/Gryv_Z7MSl0/maxresdefault.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Racoon_crossing_road.JPG/220px-Racoon_crossing_road.JPG
https://images.fineartamerica.com/images-medium-large-5/curious-racoon-sabrina-l-ryan.jpg
https://thumbs.dreamstime.com/b/racoon-926302.jpg
....
....
  • Let’s first import the libraries that we need:
import requests
import cv2
import os
import matplotlib.pyplot as plt
import hashlib
from PIL import Image
import io
import numpy as np
  • We then provide the name of the text file and the folder where the images will be saved:
urls = "urls.txt"
output = "images"
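The loop below assumes the images/ folder already exists; a one-liner such as os.makedirs can create it first if needed:

os.makedirs(output, exist_ok=True)  # create the output folder if it does not exist yet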
  • We will save the images in the images/ folder. Only unique images are saved; duplicates are skipped.
# open the text file that contains the URLs
rows = open(urls).read().strip().split("\n")
# keep track of the number of images downloaded
n_image = 0
# hashes of the images saved so far, used to detect duplicates
ls_image_hash = list()
# display every 50th saved image as a sanity check
step_show_img = 50

def showMe(img_arr):
    '''
    Visualize image
    '''
    plt.imshow(img_arr)
    plt.show()
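For example, to spot-check a saved image later (the file name here is just a hypothetical example):

showMe(np.array(Image.open("images/00000000.jpg")))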

    
for url in rows:
    # loop through all the URLs
    try:
        # try to download the image
        r = requests.get(url, timeout=60)
        # build the path where the image will be saved locally
        fname = os.path.sep.join([output, "{}.jpg".format(str(n_image).zfill(8))])
        # decode the image and hash its pixel data
        img_pil = Image.open(io.BytesIO(r.content))
        img_arr = np.array(img_pil)
        img_hash = hashlib.md5(img_arr.tobytes()).hexdigest()
        if img_hash in ls_image_hash:
            # this image is a duplicate -- skip it
            print("Image is a duplicate - Not saved")
        else:
            # add the hash to the list and save the image locally
            ls_image_hash.append(img_hash)
            with open(fname, "wb") as f:
                f.write(r.content)
            # update the counter
            n_image += 1
            if n_image % step_show_img == 0:
                showMe(img_pil)
            print("[INFO] downloaded: {}".format(fname))

    except Exception:
        print("[INFO] error downloading {}".format(url))