Downloading Every File on an FTP Server

Getting Started

Before getting to the task in the title, I’ll give a short introduction to working with FTP sites in Python. The examples use (and extend) some of the code written in the yahoo_fin package.

To work with FTP servers, we can use ftplib, which comes with the Python standard library, so there is nothing extra to install.

To log into an FTP site, we first need to establish a connection. You can do this by creating an instance of the FTP class. Just replace ftp.nasdaqtrader.com with whatever FTP site you want to log into.

# Load the ftplib package
import ftplib

# Connect to the FTP site
ftp = ftplib.FTP('ftp.nasdaqtrader.com')


Once we have the connection, we can log into the FTP site using the aptly-named login method:


# Log in
ftp.login()

In our example, we don’t need to provide a username and password because this FTP server is open to anyone, but that won’t be the case for many FTP servers. You can pass a username and password to the login method like this:


ftp.login('USERNAME' , 'PASSWORD')

You just need to replace ‘USERNAME’ and ‘PASSWORD’ with your username and password, respectively. If your login is successful, you should get a message like this:

‘230 User logged in.’
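
If the server rejects the credentials, ftplib raises ftplib.error_perm instead of returning a message. Below is a minimal sketch of handling that case. The helper name try_login is hypothetical and not part of ftplib.

```python
import ftplib

def try_login(ftp, username='', password=''):
    """Attempt to log in; return the server's reply, or None if rejected.

    try_login is a hypothetical helper for illustration. With an empty
    username, ftplib logs in anonymously.
    """
    try:
        return ftp.login(username, password)
    except ftplib.error_perm as exc:
        # Permanent (5xx) replies, such as a rejected password, land here
        print('Login failed:', exc)
        return None
```

You would call it as try_login(ftp) for an anonymous login, or try_login(ftp, 'USERNAME', 'PASSWORD') otherwise.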

Once you’ve logged in, there is a collection of commands from the ftplib package at your disposal.

Navigating Around

One thing you might want to do when logged into an FTP server is to figure out your current working directory. You can do this using the pwd method, similar to Linux or PowerShell.


ftp.pwd()

Listing the contents of a directory can be done using the nlst method. To get the contents of the current directory, no parameters need to be passed. If you want to get the contents of another directory, you just need to pass its name like in the example below, where ‘SymbolDirectory’ is the name of a folder in the current working directory.


ftp.nlst()

ftp.nlst('SymbolDirectory')

You can also use ftp.dir to see what’s inside a folder. The difference is that ftp.dir prints the contents, along with a label showing whether each item is a sub-directory, whereas ftp.nlst returns the names of the items as a list.
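
If you would rather have that detailed listing in a variable than on the console, ftp.dir also accepts a callback as its last argument and calls it once per line. Here is a minimal sketch. The helper name dir_lines is hypothetical.

```python
def dir_lines(ftp, path=''):
    """Return the lines ftp.dir would print for path, as a list of strings.

    An empty path means the current working directory.
    """
    lines = []
    # When the last argument is callable, ftp.dir hands each line to it
    # instead of printing
    ftp.dir(path, lines.append)
    return lines
```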

To change the current working directory, we use the cwd method. If you like, you can confirm the change by calling the pwd method again.


ftp.cwd('SymbolDirectory')

Downloading a Single File

Now, what if we want to programmatically download a file from our FTP site? That’s done easily enough. We’re going to need the io module, which also comes with the Python standard library.


import io

Next, we use the io module to create a BytesIO object. This acts as a container that we will write the data from a file on the FTP server into. We call our object r. We then use r and the retrbinary method of the ftp object to download the nasdaqlisted.txt file, which contains a list of the tickers currently listed on the NASDAQ.


r = io.BytesIO()
ftp.retrbinary('RETR nasdaqlisted.txt', r.write)

The object r gives us access to the data downloaded from our selected file through its getvalue method. The downloaded result is a bytes object, so we convert it to a string using the decode method. Our downloaded result is now stored in the info variable.


info = r.getvalue().decode()

Our particular file is pipe (‘|’) delimited, so the code from yahoo_fin splits the data by ‘|’ and parses out the list of tickers currently listed on the NASDAQ.


splits = info.split('|')

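# Each matching piece holds the end of one row and, after the line break,
# the ticker that starts the next row. Splitting on the line break avoids
# strip('N\r\n'), which would also trim any leading or trailing N from a ticker.
tickers = [x for x in splits if 'N\r\n' in x]
tickers = [x.split('\r\n')[1] for x in tickers]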
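
Another way to read a pipe-delimited file is the csv module from the standard library. The sketch below assumes the first line of the file is a header row containing a column named Symbol. That column name, the helper name, and the sample text are assumptions for illustration, so check them against the actual file before relying on this.

```python
import csv
import io

def parse_pipe_delimited(text):
    """Parse pipe-delimited text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text, newline=''), delimiter='|')
    return list(reader)

# Made-up sample in the same shape as a pipe-delimited listing
sample = 'Symbol|Security Name\r\nAAA|Alpha Corp\r\nNNN|Nu Corp\r\n'
rows = parse_pipe_delimited(sample)
symbols = [row['Symbol'] for row in rows]
print(symbols)
```

With the real file, you would pass info in place of sample.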

Downloading Every File in an FTP Directory

Alright, so we have a process to get a single file’s contents from an FTP server. So, how do we get the contents of every file from a given folder? We just need to generalize the code we have above, like so:


file_mapper = {}
for file in ftp.nlst():

    r = io.BytesIO()
    ftp.retrbinary('RETR ' + file , r.write)

    # Keep the buffer open: getvalue() fails once a BytesIO object is closed
    file_mapper[file] = r

We create an empty dict, file_mapper, and fill it with the contents of each file in the directory we’re currently in (‘SymbolDirectory’ in our example). The keys of the dictionary are the filenames in the folder, while the values are the BytesIO objects that we can use to get the actual data within each file.

Thus, if we want to get the actual text from one of these files, say the options.txt file in the SymbolDirectory folder, we would just type:


file_mapper['options.txt'].getvalue().decode()

If you want to write this file to disk, you could pass the data you retrieve to the builtin open function, like below:


data = file_mapper['options.txt'].getvalue().decode()

with open('writing_options_to_disk.txt' , 'w') as f:
    f.write(data)

This will write a file to your current working directory (i.e. os.getcwd()). Now, some files you may want to download from an FTP site could be binary files, like PDFs or Excel files, rather than raw text files. In these cases, you’ll want to skip the decode method and pass the ‘wb’ mode to open, like this:


data = file_mapper['some_binary_file.pdf'].getvalue()

with open('writing_binary_file_to_disk.pdf' , 'wb') as f:
    f.write(data)
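
For large files, you may prefer not to hold the whole file in memory first. Since retrbinary only needs a function to call with each chunk of data, the write method of a local file works just as well as the one on a BytesIO object. Here is a minimal sketch, with a hypothetical helper name:

```python
def download_to_disk(ftp, remote_name, local_path):
    """Stream one file from the FTP server straight into a local file."""
    with open(local_path, 'wb') as f:
        # retrbinary calls f.write once for each chunk it receives
        ftp.retrbinary('RETR ' + remote_name, f.write)
```

Writing in binary mode works for text files too, because the bytes are saved exactly as the server sent them.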

Downloading Every File on an FTP Server

Now we’re ready. How do we get our code to download every single file in each folder available on the FTP server? There are a few ways we could go about this, but the gist is that we need to be able to list the contents of each directory, and then download those contents. We basically have the pieces we need above.

Putting those together, below is a function to recursively list all files and sub-directories living on an FTP server. As you’ll see, we also need the sys module to get this job done.


import sys

def list_all_ftp_files(ftp, dir):
    dirs = []
    non_dirs = []

    # Capture the print output of ftp.dir, then restore the real stdout
    stream = io.StringIO()
    original_stdout = sys.stdout
    sys.stdout = stream
    try:
        ftp.dir(dir)
    finally:
        sys.stdout = original_stdout
    streamed_result = stream.getvalue()

    reduced = [x for x in streamed_result.split(' ') if x != '']

    # Clean up list
    reduced = [x.split('\n')[0] for x in reduced]

    # Get the names of the folders by which ones are labeled <DIR>
    indexes = [ix + 1 for ix,x in enumerate(reduced) if x == '<DIR>']

    folders = [reduced[ix] for ix in indexes]

    if dir == '/':
        non_folders = [x for x in ftp.nlst() if x not in folders]
    else:
        non_folders = [x for x in ftp.nlst(dir) if x not in folders]
        non_folders = [dir + '/' + x for x in non_folders]
        folders = [dir + '/' + x for x in folders]

    # Add the folders found in this directory to our grand list
    dirs.extend(folders)
    # Similarly, do the same for files that are not folders
    non_dirs.extend(non_folders)

    # If there are sub-folders within this directory, keep searching
    # deeper for other sub-folders
    if len(folders) > 0:
        for sub_folder in sorted(folders):
            result = list_all_ftp_files(ftp, sub_folder)
            dirs.extend(result[0])
            non_dirs.extend(result[1])

    # Return the list of all directories on the FTP server, along
    # with the list of all non-folder files
    return dirs , non_dirs

Breaking this down, the first part of the function uses ftp.dir to get the contents of an input directory. As mentioned above, this doesn’t return a list or any other data structure that could be stored in a variable. Instead, it prints the directory’s contents to the console. However, by pointing sys.stdout at a StringIO object, we can capture this output in our variable, streamed_result. We then parse it to tell whether each item is a sub-directory or not.


    # Capture the print output of ftp.dir, then restore the real stdout
    stream = io.StringIO()
    original_stdout = sys.stdout
    sys.stdout = stream
    try:
        ftp.dir(dir)
    finally:
        sys.stdout = original_stdout
    streamed_result = stream.getvalue()

    reduced = [x for x in streamed_result.split(' ') if x != '']

    # Clean up list
    reduced = [x.split('\n')[0] for x in reduced]
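
Splitting the whole listing on single spaces breaks down if a folder name itself contains a space. One alternative is to parse the listing line by line. The sketch below assumes each line has the form date, time, then either <DIR> or a size, then the name, which is consistent with the <DIR> labels used above. Other servers format their listings differently, and the function name here is made up for illustration.

```python
def split_listing(listing):
    """Separate folder names from file names in a DOS-style directory listing."""
    folders, files = [], []
    for line in listing.splitlines():
        # Split into at most four fields so spaces inside the name survive
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # skip blank or unexpected lines
        marker, name = parts[2], parts[3].strip()
        if marker == '<DIR>':
            folders.append(name)
        else:
            files.append(name)
    return folders, files
```

You could pass streamed_result to this function in place of the two list comprehensions.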

Once we have a list of sub-directories, we continue searching through each of these for other potential sub-folders and non-folder files. If more sub-folders are found, the process is repeated until no more are found. The search is repeated by recursively calling our function, list_all_ftp_files, on the next deeper level of sub-directories.


    # If there are sub-folders within this directory, keep searching
    # deeper for other sub-folders
    if len(folders) > 0:
        for sub_folder in sorted(folders):
            result = list_all_ftp_files(ftp, sub_folder)
            dirs.extend(result[0])
            non_dirs.extend(result[1])
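
The same traversal can be written without recursion by keeping a list of folders that still need to be visited. The sketch below is independent of FTP. Here, list_dir is a hypothetical function that you supply, and it returns the folders and files (as full paths) found directly inside a given path.

```python
def walk_tree(list_dir, root):
    """Collect every folder and file under root without recursion.

    list_dir(path) must return a (folders, files) pair of full paths.
    """
    all_dirs, all_files = [], []
    pending = [root]  # folders we still need to look inside
    while pending:
        current = pending.pop()
        folders, files = list_dir(current)
        all_dirs.extend(folders)
        all_files.extend(files)
        pending.extend(folders)
    return all_dirs, all_files
```

This visits folders in a different order than the recursive version, but it ends up with the same set of folders and files.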

We can call our function like this:


# Connect to the FTP site
ftp = ftplib.FTP('ftp.nasdaqtrader.com')

# Log in
ftp.login()

# Get all folders and files on the FTP server
folders , files = list_all_ftp_files(ftp , '/')

Now, folders will contain a list of all the sub-directories on our FTP site, while files will hold a list of all the non-folder files on the site. If you check the length of files, you’ll see our example has over 26,000 files! We’re not actually going to download all of those in our example, but I will show you how so you can do it for your FTP site. As a warning, though, if you are downloading a large number of files, it’s a good idea to check with the operator of the FTP server to make sure you won’t hurt the server’s performance by downloading thousands of files in quick succession.

If you’re good to go on downloading all the files on your FTP server, you can do it basically like we did above for a single directory. Now, however, we just loop over every file in files, our list of all non-folders living on the server.


file_mapper = {}
for file in files:

    r = io.BytesIO()
    ftp.retrbinary('RETR ' + file , r.write)

    # Keep the buffer open: getvalue() fails once a BytesIO object is closed
    file_mapper[file] = r
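
To go easier on the server, you could pause briefly between downloads. A minimal sketch follows. The function name and the one-second default are arbitrary assumptions, so adjust the pause to whatever the server operator is comfortable with.

```python
import io
import time

def download_with_pause(ftp, files, pause_seconds=1.0):
    """Download each file into memory, sleeping between requests."""
    contents = {}
    for name in files:
        buffer = io.BytesIO()
        ftp.retrbinary('RETR ' + name, buffer.write)
        # Store the raw bytes so nothing depends on the buffer staying open
        contents[name] = buffer.getvalue()
        time.sleep(pause_seconds)
    return contents
```

Note that this variation stores bytes rather than BytesIO objects, so you would call decode directly on each value.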


You can then either write another loop to write these files out to disk, or you can incorporate logic in the for loop above to do that. As mentioned above, it’s a good idea to treat binary files differently from raw text files. Depending on how your files are structured, you should be able to do something like this:


import os

# Write all files on the FTP server to disk
for key,val in file_mapper.items():

    # Create directory structure if it doesn't exist
    # (files in the root directory have no folder to create)
    folder = os.path.dirname(key)
    if folder:
        os.makedirs(folder, exist_ok=True)

    # Check if file is a text file
    if key.endswith('.txt'):
        with open(key , 'w') as f:
            f.write(val.getvalue().decode())
    else:
        with open(key , 'wb') as f:
            f.write(val.getvalue())

The above code checks whether each file is a text file by its extension (you could also use the os.path.splitext function). It then writes each file from the FTP server out to disk appropriately.
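
As a variation, the sketch below uses os.path.splitext for the extension check and writes everything under a separate local folder. The function name, the downloads folder, and the set of text extensions are assumptions for illustration. It expects the raw bytes of each file, for example the result of calling getvalue on each BytesIO object.

```python
import os

TEXT_EXTENSIONS = {'.txt', '.csv'}  # assumed list; extend as needed

def write_files(contents, root='downloads'):
    """Write a {remote_path: bytes} mapping to disk under root."""
    for remote_path, data in contents.items():
        local_path = os.path.join(root, *remote_path.strip('/').split('/'))
        folder = os.path.dirname(local_path)
        if folder:
            os.makedirs(folder, exist_ok=True)
        extension = os.path.splitext(local_path)[1].lower()
        if extension in TEXT_EXTENSIONS:
            # newline='' keeps the original line endings unchanged
            with open(local_path, 'w', newline='') as f:
                f.write(data.decode())
        else:
            with open(local_path, 'wb') as f:
                f.write(data)
```

Keeping the downloads under their own folder avoids mixing them with whatever is already in your working directory.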

Full code

The full code is below:


import sys
import io
import ftplib
import os

def list_all_ftp_files(ftp, dir):
    dirs = []
    non_dirs = []

    # Capture the print output of ftp.dir, then restore the real stdout
    stream = io.StringIO()
    original_stdout = sys.stdout
    sys.stdout = stream
    try:
        ftp.dir(dir)
    finally:
        sys.stdout = original_stdout
    streamed_result = stream.getvalue()

    reduced = [x for x in streamed_result.split(' ') if x != '']

    # Clean up list
    reduced = [x.split('\n')[0] for x in reduced]

    # Get the names of the folders by which ones are labeled <DIR>
    indexes = [ix + 1 for ix,x in enumerate(reduced) if x == '<DIR>']

    folders = [reduced[ix] for ix in indexes]

    if dir == '/':
        non_folders = [x for x in ftp.nlst() if x not in folders]
    else:
        non_folders = [x for x in ftp.nlst(dir) if x not in folders]
        non_folders = [dir + '/' + x for x in non_folders]
        folders = [dir + '/' + x for x in folders]

    # Add the folders found in this directory to our grand list
    dirs.extend(folders)
    # Similarly, do the same for files that are not folders
    non_dirs.extend(non_folders)

    # If there are sub-folders within this directory, keep searching
    # deeper for other sub-folders
    if len(folders) > 0:
        for sub_folder in sorted(folders):
            result = list_all_ftp_files(ftp, sub_folder)
            dirs.extend(result[0])
            non_dirs.extend(result[1])

    # Return the list of all directories on the FTP server, along
    # with the list of all non-folder files
    return dirs , non_dirs



# Connect to the FTP site
ftp = ftplib.FTP('ftp.nasdaqtrader.com')

# Log in
ftp.login()

# Get all folders and files on the FTP server
folders , files = list_all_ftp_files(ftp , '/')


# Get IO streams for each file
file_mapper = {}
for file in files:

    r = io.BytesIO()
    ftp.retrbinary('RETR ' + file , r.write)

    # Keep the buffer open: getvalue() fails once a BytesIO object is closed
    file_mapper[file] = r


# Write all files on the FTP server to disk
for key,val in file_mapper.items():

    # Create directory structure if it doesn't exist
    # (files in the root directory have no folder to create)
    folder = os.path.dirname(key)
    if folder:
        os.makedirs(folder, exist_ok=True)

    # Check if file is a text file
    if key.endswith('.txt'):
        with open(key , 'w') as f:
            f.write(val.getvalue().decode())
    else:
        with open(key , 'wb') as f:
            f.write(val.getvalue())


That’s it for this post! To learn more about ftplib, see its official documentation.

See other posts of mine at http://theautomatic.net/blog/.

Andrew Treadway
