Web Scraping

How to download image files with RoboBrowser

In a previous post, we showed how RoboBrowser can be used to fill out online forms for getting historical weather data from Wunderground. This article will talk about how to use RoboBrowser to batch download collections of image files from Pexels, a site which offers free downloads. If you’re looking to work with images, or want to build a training set for an image classifier with Python, this post will help you do that.

In the first part of the code, we’ll load the RoboBrowser class from the robobrowser package, create a browser object which acts like a web browser, and navigate to the Pexels website.


# load the RoboBrowser class from robobrowser
from robobrowser import RoboBrowser

# define base site
base = "https://www.pexels.com/"

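# create browser object,
# which serves as an invisible web browser
# (naming a parser explicitly avoids BeautifulSoup's "no parser specified" warning)
browser = RoboBrowser(parser="html.parser")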

# navigate to pexels.com
browser.open(base)

If you actually go to the website, you’ll see there’s a search box. We can identify this search form using the get_form method. Once we have this form, we can check what fields it contains by printing the fields attribute.


form = browser.get_form()

print(form.fields)

Printing the fields shows us that the search field name associated with the search box is “s”. This means that if we want to programmatically type something into the search box, we need to set that field’s value to our search input. Thus, if we want to search for pictures of “water”, we can do it like so:


# set search input to water
form["s"] = "water"

# submit form
browser.submit_form(form)

After we’ve submitted the search form, we can see what our current URL is by checking the url attribute.


print(browser.url)

Next, let’s examine the links on the page of results.


# get links on page
links = browser.get_links()

# get the URL of each page by scraping the href attribute
# of each link
urls = [link.get("href") for link in links]

# filter out links without an href attribute;
# link.get("href") returns None for these
urls = [url for url in urls if url is not None]

Now that we have the URLs on the page, let’s filter them to only the ones showing the search result images on the page. We can do this by looking for where “/photo/” appears in each URL.


# filter to what we need -- photo URLs
photos = [url for url in urls if '/photo/' in url]

# get photo link objects
photo_links = [link for link in links if link.get("href") in photos]
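
The filtering step can be tried offline on a hand-written list of href values. This is only a sketch: the sample paths below are made up for illustration rather than taken from Pexels, and the helper name photo_urls is hypothetical. It also uses urljoin from the standard library to turn relative paths into absolute URLs, which is an addition to the approach above.

```python
from urllib.parse import urljoin

base = "https://www.pexels.com/"

# hypothetical href values, as they might come back from link.get("href");
# None stands in for an anchor tag without an href attribute
hrefs = ["/photo/blue-water-1234/", None, "/about/",
         "/photo/river-5678/", "/search/water/"]

def photo_urls(hrefs, base):
    """Keep hrefs containing '/photo/' and resolve them against base."""
    return [urljoin(base, href) for href in hrefs
            if href is not None and "/photo/" in href]

print(photo_urls(hrefs, base))
```

Checking for None and for “/photo/” in a single pass gives the same result as the two separate list comprehensions.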


For our first example, let’s click on the first photo link (the one at index 0) using the follow_link method.


browser.follow_link(photo_links[0])

Next, we want to find the link on the screen that says “Free Download.” This can be done using the get_link method of browser. All we need to do is pass the text of the link we’re looking for, which is “Free Download” in this case.


# get the download link
download_link = browser.get_link("Free Download")

# hit the download link
browser.open(download_link.get("href"))

Once we’ve hit the link, we can write the file data to disk using browser.response.content.


# get file name of picture from URL
# (photos[0] is the URL of the photo link we followed above)
file_name = photos[0].split("-")[-2]

with open(file_name + '.jpeg', "wb") as image_file:
    image_file.write(browser.response.content)
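
If two photos would produce the same name, the second file overwrites the first. One possible way around this, sketched below, is to use the whole last path segment of the photo URL as the file name. The helper name file_name_from_url and the sample path are hypothetical, and the “.jpg” extension is an assumption, since nothing here checks the actual image type.

```python
def file_name_from_url(url, extension=".jpg"):
    """Use the last non-empty path segment of a photo URL as the file name."""
    # dropping empty pieces makes this work with or without a trailing slash
    segments = [part for part in url.split("/") if part]
    return segments[-1] + extension

# made-up example path, not a real Pexels URL
print(file_name_from_url("/photo/blue-water-1234/"))
```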


Suppose we wanted to download all of the images on the results page. We can tweak our code above like this:


# download every image on the first screen of results:
for url in photos:
    
    full_url = "https://pexels.com" + url

    try:    
        browser.open(full_url)
    
        download_link = browser.get_link("Free Download")
        
        browser.open(download_link.get("href"))
        
        file_name = url.split("/")[-2] + '.jpg'
        with open(file_name, "wb") as image_file:
            image_file.write(browser.response.content)   
            
    except Exception:
        
        pass
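
Note that the bare pass silently skips any photo that fails, so we never find out which downloads were missed. A minimal sketch of one alternative is to record each failure and carry on. Here download_each and fake_fetch are hypothetical names; fake_fetch stands in for the open, get_link, and write steps so that the example runs without a network connection.

```python
def download_each(urls, fetch):
    """Call fetch(url) for every URL, collecting failures instead of hiding them."""
    saved, failed = [], []
    for url in urls:
        try:
            fetch(url)
            saved.append(url)
        except Exception as error:
            # remember which URL failed and why, then keep going
            failed.append((url, repr(error)))
    return saved, failed

# stand-in for the real download steps; fails on one URL on purpose
def fake_fetch(url):
    if "broken" in url:
        raise ValueError("no download link found")

saved, failed = download_each(
    ["/photo/a-1/", "/photo/broken-2/", "/photo/c-3/"], fake_fetch
)
print(saved)
print(failed)
```

The failed list can be printed or retried afterwards.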

Instead of just downloading the images for one particular search query, we can modify our code above to create a generalized function, which will download the search result images associated with whatever input we choose.

Generalized function


def download_search_results(search_query):
    
    browser.open(base)
    
    form = browser.get_form()
      
    form['s'] = search_query
    
    browser.submit_form(form)

    links = browser.get_links()
    urls = [link.get("href") for link in links]
    urls = [url for url in urls if url is not None]
    
    # filter to what we need -- photos
    photos = [url for url in urls if '/photo/' in url]
    
    for url in photos:
        
        full_url = "https://pexels.com" + url
    
        try:    
            browser.open(full_url)
        
            download_link = browser.get_link("Free Download")
            
            browser.open(download_link.get("href"))
            
            file_name = url.split("/")[-2] + '.jpg'
            with open(file_name, "wb") as image_file:
                image_file.write(browser.response.content)   
                
        except Exception:
            
            pass


Then, we can call our function like this:


download_search_results("water")

download_search_results("river")

download_search_results("lake")

As mentioned previously, using RoboBrowser to download images from Pexels can also be really useful if you’re looking to build your own image classifier and you need to collect images for a training set. For instance, if you want to build a classifier that predicts if an image contains a dog or not, you could scrape Pexels with our function above to get training images, like this:


download_search_results("dog")
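
For a training set, it can help to keep each search term’s images in their own folder so the folder name doubles as the label. The sketch below only builds the target path. The function name target_path, the root folder name, and the sample URL are all hypothetical, and the “.jpg” extension is assumed.

```python
from pathlib import Path

def target_path(root, label, url):
    """Build root/label/<last URL segment>.jpg for a downloaded image."""
    name = [part for part in url.split("/") if part][-1]
    return Path(root) / label / (name + ".jpg")

# made-up example values
path = target_path("training_images", "dog", "/photo/brown-dog-42/")
print(path.as_posix())
```

Before writing a file to such a path, the folder would need to exist, for example via path.parent.mkdir(parents=True, exist_ok=True).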

So far our code has one main drawback: our function currently only pulls one page’s worth of results. To pull additional pages, we need to adjust the search URL. For instance, to get the second page of dog images, we need to hit https://www.pexels.com/search/dog/?page=2. That means changing our function’s code to hit these sequential URLs (?page=2, ?page=3, …) until either we have all of the images downloaded or we’ve reached a certain page limit. Given that there might be thousands of images associated with a particular search query, you might want to set a limit on how many pages of search results you hit.
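
As a small offline sketch of how these page URLs can be built, the helper below sets the page parameter with the standard library’s urllib.parse rather than by string concatenation, so it still works if the search URL already carries a query string. The name page_url is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def page_url(search_url, page_index):
    """Return search_url with its page query parameter set to page_index."""
    parts = urlsplit(search_url)
    # keep any other query parameters, but drop an existing page value
    query = [(key, value) for key, value in parse_qsl(parts.query)
             if key != "page"]
    query.append(("page", str(page_index)))
    return urlunsplit(parts._replace(query=urlencode(query)))

for index in range(1, 4):
    print(page_url("https://www.pexels.com/search/dog/", index))
```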

Generalized function – download all page results up to a limit


def download_all_search_results(search_query, MAX_PAGE_INDEX = 30):
    
    browser.open(base)
    
    form = browser.get_form()
      
    form['s'] = search_query
    
    browser.submit_form(form)
    
    search_url = browser.url # new URL
    
    # create page index counter
    page_index = 1

    # define temporary holder for the photo links
    photos = [None]

    # set limit on page index
    # loop will break before MAX_PAGE_INDEX if there are fewer pages of results
    while photos != [] and page_index <= MAX_PAGE_INDEX:
        
        browser.open(search_url + "?page=" + str(page_index))
       
        links = browser.get_links()
        urls = [link.get("href") for link in links]
        urls = [url for url in urls if url is not None]
        
        # filter to what we need -- photos
        photos = [url for url in urls if '/photo/' in url]
        
        if photos == []:
            break
        
        
        for url in photos:
            
            full_url = "https://pexels.com" + url
        
            try:    
                browser.open(full_url)
            
                download_link = browser.get_link("Free Download")
                
                browser.open(download_link.get("href"), stream = True)
                
                file_name = url.split("/")[-2] + '.jpg'
                with open(file_name, "wb") as image_file:
                    image_file.write(browser.response.content)   
                    
            except Exception:
                
                pass

        print("page index ==> " + str(page_index))
        page_index += 1



Now we can call our function in the same way as before:


# download all the images from the first 30 pages
# of results for 'dog'
download_all_search_results("dog")

# Or -- just get first 10 pages of results
download_all_search_results("dog", 10)

You could also limit the number of downloads directly, for example by stopping after 100, 200, 300, or however many images you want to download.
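
One way to sketch such a cap, assuming the photo URLs arrive one page at a time, is itertools.islice. The function name limited and the sample paths are hypothetical; in the real function, each inner list would be the photos scraped from one page of results.

```python
from itertools import islice

def limited(photo_pages, max_images):
    """Return an iterator over photo URLs that stops after max_images in total."""
    all_photos = (url for page in photo_pages for url in page)
    return islice(all_photos, max_images)

# made-up pages of results
pages = [["/photo/a-1/", "/photo/b-2/"],
         ["/photo/c-3/", "/photo/d-4/"],
         ["/photo/e-5/"]]
print(list(limited(pages, 3)))
```

Because the iterator is lazy, pages beyond the limit are never requested if photo_pages is itself a generator that fetches each page on demand.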


Andrew Treadway
