
File Manipulation with Python

Getting started

Python is great for automating file creation, deletion, and other kinds of file manipulation. Two of the main modules used for these tasks are os and shutil, both of which ship with Python’s standard library. We’ll be covering a few useful highlights from each of these.


import os
import shutil

How to get and change your current working directory

You can get your current working directory using os.getcwd:


os.getcwd()

Any action you take without specifying a directory is resolved relative to your current working directory. For example, if you create or search for a file without specifying a directory, Python will look in the folder returned by os.getcwd().

To change your working directory, use os.chdir:


os.chdir("C:/path/to/new/directory")
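Because os.chdir changes the working directory for the whole Python process, it can be useful to switch back once you are done. The following is a minimal sketch of that idea, not something from the os package itself: the helper name working_directory is hypothetical, and a temporary folder stands in for a real path.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily switch the current working directory (hypothetical helper)."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # always return to where we started, even if an error occurred
        os.chdir(previous)


start = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    with working_directory(tmp):
        inside = os.getcwd()  # now points at the temporary folder
after = os.getcwd()           # back to the original folder
```

Anything run inside the with block sees the new working directory, and the original one is restored afterwards.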

How to merge a directory name with a file name

The os package contains a nice function, os.path.join, to join together a directory name with a file’s name, like below:


os.path.join("C:/path/to/directory", "some_file.txt")

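Running the above code on Linux or macOS will result in “C:/path/to/directory/some_file.txt”. On Windows, os.path.join inserts a backslash as the separator, so the result is “C:/path/to/directory\some_file.txt”, which Windows treats as the same path.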

We will make use of this functionality in the next section of listing all the files in a directory.
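As a small illustration, os.path.join also accepts more than two pieces, which avoids hard-coding a separator at all. The folder and file names below are placeholders.

```python
import os

# build a path from several pieces without typing any separators
directory = os.path.join("path", "to", "directory")
full_path = os.path.join(directory, "some_file.txt")

# os.sep is "\\" on Windows and "/" on Linux / macOS
print(full_path)
```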

How to list all the files in a directory

To list the contents of a directory (both files and folders, but nothing inside any sub-folders), you can use os.listdir. Called without an argument, it lists your current working directory:


os.listdir()
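Since os.listdir returns files and folders together, you may want to separate the two. Here is one possible sketch using os.path.isfile and os.path.isdir; the temporary folder and the names inside it are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # set up one folder and one file to list
    os.mkdir(os.path.join(tmp, "sub_folder"))
    with open(os.path.join(tmp, "notes.txt"), "w") as f:
        f.write("example")

    everything = sorted(os.listdir(tmp))

    # os.listdir returns bare names, so join them with the directory before testing
    only_files = [name for name in everything
                  if os.path.isfile(os.path.join(tmp, name))]
    only_folders = [name for name in everything
                    if os.path.isdir(os.path.join(tmp, name))]
```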

To recursively list all the files / folders in a directory, we can use the os.walk function:


result_generator = os.walk("C:/path/to/directory")

files_result = [x for x in result_generator]

os.walk returns a generator. Using a list comprehension, we can get what files are contained within the input parent directory, as we do above. This list contains a collection of tuples, each of length 3. For each tuple, the first element is a folder. This folder may be the parent folder itself, or a sub-folder recursively found in the input parent directory.

The second element of the tuple is a list of directories immediately in the folder named in the first element of the tuple. The third, and last element of the tuple, contains any non-directory files immediately in the folder named in the first element of the tuple.

So to get just a list of all files and sub-folders by recursively searching through a directory (not a list of tuples, but a list of just file names), we can run the following code (extended from above):


full_paths = []
for folder, dir_list, file_list in files_result:
    
    # add in any recursively found sub-folders
    full_paths.extend([os.path.join(folder, sub) for sub in dir_list])
    
    # add in any recursively found non-folder files
    full_paths.extend([os.path.join(folder, file) for file in file_list])

print(full_paths)
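The same idea can be wrapped in a function that loops over os.walk directly, without building the intermediate list first. This is only a sketch; the function name list_all_paths is hypothetical, and the temporary folder is there so the example can run anywhere.

```python
import os
import tempfile


def list_all_paths(root):
    """Return the full path of every file and sub-folder under root."""
    paths = []
    for folder, dir_list, file_list in os.walk(root):
        for name in dir_list + file_list:
            paths.append(os.path.join(folder, name))
    return paths


with tempfile.TemporaryDirectory() as tmp:
    # build a tiny tree: a.txt, sub/, sub/b.txt
    os.mkdir(os.path.join(tmp, "sub"))
    for name in ["a.txt", os.path.join("sub", "b.txt")]:
        with open(os.path.join(tmp, name), "w") as f:
            f.write("example")

    # show paths relative to the starting folder to keep the output short
    found = sorted(os.path.relpath(p, tmp) for p in list_all_paths(tmp))

print(found)
```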

How to create new directories

If you want to create a handful of folders / directories, it’s not difficult to manually do so.  But creating a few dozen folders manually gets mundane really fast.

The os package contains a method, os.mkdir (similar to the mkdir command in Linux or Windows), that we can use in our situation.

You can create a single folder as below:


# create folder in current directory
os.mkdir("this_is_a_new_folder")

# or, create folder in another directory
os.mkdir("C:/path/to/new/directory")

One problem I’ve seen several times in the past is to create a collection of folders for each state in the US.  We can do this using the us package with os:


from us.states import STATES

# get a list of US state abbreviations
# (whether DC is included depends on your version of the us package)
all_states = [x.abbr for x in STATES]

# create a folder for each state
for state in all_states:
    os.mkdir(state)

Above, the STATES object from us.states contains attributes about each state. We get the state abbreviation using the “.abbr” attribute of each element in STATES.

If we wanted to create a list of folders with full state names, rather than abbreviations, we could use the “.name” attribute:


all_states = [x.name for x in STATES]

A similar problem is to create a collection of folders for the letters of the alphabet.  This might be used to organize files or data related to last names.  Here we’ll use the string package to get the letters of the alphabet.


import string

# create a folder for each letter in the alphabet
for letter in string.ascii_uppercase:
    os.mkdir(letter)
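One thing to keep in mind is that os.mkdir raises a FileExistsError if the folder already exists, and it will not create missing parent folders. A possible alternative is os.makedirs with exist_ok=True, sketched below. The parent folder name last_names is just an example, and a temporary folder is used so nothing is left behind.

```python
import os
import string
import tempfile

with tempfile.TemporaryDirectory() as parent:
    # run the loop twice: the second pass would fail with plain os.mkdir
    for _ in range(2):
        for letter in string.ascii_uppercase:
            # creates "last_names" on the first call and skips folders that exist
            os.makedirs(os.path.join(parent, "last_names", letter), exist_ok=True)

    created = sorted(os.listdir(os.path.join(parent, "last_names")))

print(len(created))
```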

How to delete files / folders

Next, what if you want to delete all the files in a directory? You can do this using the os.unlink function. Note that os.unlink only deletes files; to delete a folder along with its contents, use shutil.rmtree instead. Be careful though, because anything you delete this way is lost without going to the Recycle Bin (if you’re on Windows).

If you run into a “permission denied” error on Windows, you may need to run Python as an administrator. This can generally be done by right clicking on the icon of the IDE you’re using (or command prompt if you’re running a script from there), and clicking “Run as administrator.”


for name in os.listdir("C:/path/to/directory"):
    path = os.path.join("C:/path/to/directory", name)
    if os.path.isfile(path):
        os.unlink(path)

Above, we used os.listdir to list everything in the directory, os.path.join to build each full path, and os.path.isfile to skip any sub-folders.

Or, if you just want to delete a single file, do this:


os.unlink("C:/path/to/a/file.txt")
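To clear out sub-folders as well as files, os.unlink can be combined with shutil.rmtree, which removes a folder and everything inside it. The sketch below shows one way to do this; the function name clear_directory is hypothetical, and it runs against a temporary folder because the deletion is permanent.

```python
import os
import shutil
import tempfile


def clear_directory(directory):
    """Delete every file and sub-folder inside directory (not directory itself)."""
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)  # removes the folder and everything in it
        else:
            os.unlink(path)      # removes a single file (or symbolic link)


with tempfile.TemporaryDirectory() as tmp:
    # create a nested folder and a file so there is something to delete
    os.makedirs(os.path.join(tmp, "sub", "nested"))
    with open(os.path.join(tmp, "file.txt"), "w") as f:
        f.write("example")

    clear_directory(tmp)
    remaining = os.listdir(tmp)
```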

Getting created and modified times

It’s a relatively common task to want to clean up files that are old.  To accomplish this, we first need to get the age of the files in the directory we want to clean up.

This may be done by either examining the created dates or modified dates of files in a directory.


import time

# get the list of files in the current directory
# (folders are skipped because os.unlink cannot delete them)
contents = [elt for elt in os.listdir() if os.path.isfile(elt)]

# get the created time stamp of each file
created_times = [time.ctime(os.path.getctime(elt)) for elt in contents]
created_times = [time.strptime(elt) for elt in created_times]

# delete all files created prior to 2017
for index in range(len(created_times)):

    if created_times[index].tm_year < 2017:
        os.unlink(contents[index])

Let’s break this down.


created_times = [time.ctime(os.path.getctime(elt)) for elt in contents]

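Here, we’re using the os.path.getctime function to get the created time stamp of each file (or “elt”) in the root of the current directory. Note that this is the creation time on Windows; on Unix systems, os.path.getctime returns the time of the last metadata change instead.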

We’re wrapping this method inside of time.ctime to convert the object returned by os.path.getctime to a date string that looks something like this:

“Fri Aug  4 20:30:24 2017”

This is done because the object returned by os.path.getctime is a float, which represents the number of seconds since January 1, 1970. To read more about how times are represented in Python, see the documentation for the time module.

Once we have a time stamp string for each element in the directory,  we convert each string to a struct_time type, which will allow us to easily parse information about the created dates e.g. the year or month particular files were created.

You can do this by calling the time.strptime function on each element in the created_times list. When no format is given, time.strptime expects exactly the format that time.ctime produces.


created_times = [time.strptime(elt) for elt in created_times]

Lastly, we loop over the indices of created_times, and delete the corresponding file if the created year is prior to 2017. In other words, if the 5th element in created_times has a time stamp with a year before 2017, then the 5th element in the contents of the directory (stored in the variable contents) gets deleted.

You can get the year of a struct_time type by using the tm_year attribute:


for index in range(len(created_times)):

    if created_times[index].tm_year < 2017:
        os.unlink(contents[index])

So what if we want to delete everything last modified prior to 2017, rather than just created?

We can do almost exactly the same process, except we change os.path.getctime to os.path.getmtime:


mod_times = [time.ctime(os.path.getmtime(elt)) for elt in contents]
mod_times = [time.strptime(elt) for elt in mod_times]

for index in range(len(mod_times)):

    if mod_times[index].tm_year < 2017:
        os.unlink(contents[index])
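The round trip through a date string can be skipped, since time.localtime converts the float returned by os.path.getmtime straight into a struct_time. Below is a rough sketch of the same clean-up written that way. The function name files_modified_before is hypothetical, and the example backdates a file in a temporary folder with os.utime so there is something old to find.

```python
import os
import tempfile
import time


def files_modified_before(directory, year):
    """Return paths of regular files in directory last modified before year."""
    old_files = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip sub-folders, which os.unlink cannot remove
        modified = time.localtime(os.path.getmtime(path))
        if modified.tm_year < year:
            old_files.append(path)
    return old_files


with tempfile.TemporaryDirectory() as tmp:
    old_path = os.path.join(tmp, "old.txt")
    new_path = os.path.join(tmp, "new.txt")
    for path in (old_path, new_path):
        with open(path, "w") as f:
            f.write("example")

    # backdate one file to June 2015 (local time)
    stamp = time.mktime((2015, 6, 1, 12, 0, 0, 0, 0, -1))
    os.utime(old_path, (stamp, stamp))

    to_delete = files_modified_before(tmp, 2017)
    for path in to_delete:
        os.unlink(path)
    left = os.listdir(tmp)
```

Collecting the paths first and deleting afterwards also makes it easy to print the list and check it before anything is removed.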

How to rename a file

Rename a file using os.rename:


os.rename("C:/current/file/name.txt", "C:/new/file/name.txt")
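If the new name already exists, os.rename behaves differently across platforms: on Windows it raises an error, while on Linux and macOS it can silently replace an existing file. One cautious approach, sketched here with made-up file names in a temporary folder, is to check for the destination first.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    old_name = os.path.join(tmp, "draft.txt")
    new_name = os.path.join(tmp, "final.txt")
    with open(old_name, "w") as f:
        f.write("example")

    # only rename when nothing would be overwritten
    if not os.path.exists(new_name):
        os.rename(old_name, new_name)

    names = os.listdir(tmp)
```

If overwriting is what you want, os.replace does so consistently on every platform.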

How to copy a file

There are a few ways to copy a file.

One standard method is by using shutil.copy, like below:


shutil.copy("C:/path/to/file.txt", "C:/new/directory")

The first argument is the path of the file you want to copy. The second argument is the destination, which can be either a directory or a full file name. shutil.copy copies the file’s contents along with its permissions.

Another way to copy a file is shutil.copy2, like this:


shutil.copy2("C:/path/to/file.txt", "C:/new/directory")

shutil.copy2 copies the file’s contents and permissions, and also tries to preserve other metadata, such as the last modified time.
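The difference is easiest to see by comparing modification times after each kind of copy. The following sketch uses a temporary folder and placeholder file names; the source file is backdated with os.utime so the effect is visible.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "file.txt")
    with open(source, "w") as f:
        f.write("example")

    # give the source an old modification time (1 January 2016, UTC)
    os.utime(source, (1451606400, 1451606400))

    # both functions return the path of the new copy
    plain = shutil.copy(source, os.path.join(tmp, "copy.txt"))
    with_meta = shutil.copy2(source, os.path.join(tmp, "copy2.txt"))

    source_mtime = os.path.getmtime(source)
    plain_mtime = os.path.getmtime(plain)     # time the copy was made
    meta_mtime = os.path.getmtime(with_meta)  # carried over from the source
```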

How to move a file to a different folder

To move files from one directory to another, you can use the shutil.move function. To move a single file, we run code like below:


shutil.move("C:/path/to/file.txt", "C:/path/to/new/directory/file.txt")

The first parameter is the name of the file we want to move, while the second parameter is the new full path of the file.

Let’s say we want to move every file and sub-directory from a folder to a new destination.

We can do this using the shutil.move method in a loop like this:

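source = "C:/path/to/current_location"
for file in os.listdir(source):
    # os.listdir returns bare names, so join each one with its folder
    shutil.move(os.path.join(source, file), os.path.join("C:/path/to/new/directory", file))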
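The loop can also be wrapped in a small function that creates the destination folder when it does not exist yet. This is only a sketch: the function name move_contents is hypothetical, and the folders are created inside a temporary directory for the sake of the example.

```python
import os
import shutil
import tempfile


def move_contents(source, destination):
    """Move every file and sub-folder from source into destination."""
    os.makedirs(destination, exist_ok=True)
    for name in os.listdir(source):
        shutil.move(os.path.join(source, name), os.path.join(destination, name))


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "current_location")
    destination = os.path.join(tmp, "new", "directory")

    # create one sub-folder and one file to move
    os.makedirs(os.path.join(source, "sub"))
    with open(os.path.join(source, "file.txt"), "w") as f:
        f.write("example")

    move_contents(source, destination)
    moved = sorted(os.listdir(destination))
    left_behind = os.listdir(source)
```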

How to get the base name of a file

You can get the base name of a file using os.path.basename:


os.path.basename("C:/path/to/file.txt")

The above will return “file.txt”

How to get the directory name of a file

Similar to above, we can get the directory name of a file, using os.path.dirname:


os.path.dirname("C:/path/to/file.txt")

This returns “C:/path/to”
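Both pieces can be obtained in one call with os.path.split, and os.path.splitext separates the extension from the rest of the file name. A short illustration using the same example path:

```python
import os

path = "C:/path/to/file.txt"

# os.path.split returns the same pair as (dirname, basename)
directory, base = os.path.split(path)

# os.path.splitext splits off the extension, including the dot
stem, extension = os.path.splitext(base)

print(directory, base, stem, extension)
```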


So that’s an introduction to manipulating files with Python! To see an introduction on working with files in R, see this post.

Andrew Treadway
