Python

Reading from databases with Python

Background – Reading from Databases with Python

This post will talk about several packages for working with databases using Python. We’ll start by covering pyodbc, which is one of the more standard packages used for working with databases, but we’ll also cover a very useful module called turbodbc, which allows you to run SQL queries efficiently (and generally faster) within Python.

pyodbc

pyodbc can be installed using pip:


pip install pyodbc

Let’s start by writing a simple SQL query using pyodbc. To do that, we first need to connect to a specific database. In the examples laid out here, we will be using a SQLite database on my machine. However, you can do this with many other database systems, as well, such as SQL Server, MySQL, Oracle, etc. In the connection string, we specify the database driver (e.g. one of the database systems mentioned), the server, database name, and potentially username / password. If you have a trusted connection setup, then you can specify that (like in the first example below).


import pyodbc

channel = pyodbc.connect("DRIVER={SQLite3 ODBC Driver};SERVER=localhost;DATABASE=sample_database.db;Trusted_connection=yes")

# or if using a password

channel = pyodbc.connect("DRIVER={SQLite3 ODBC Driver};SERVER=localhost;DATABASE=sample_database.db;Uid=YourUsername;Pwd=YourPassword;'")

Reading SQL query with pandas

After we’ve made the connection, we can write a SQL query to retrieve data from this database. One way of doing that is using the pandas package. Below, we wrap the SQL code inside quotes as the first parameter of pd.read_sql. The second parameter contains our connection object.


import pandas as pd

pd.read_sql("select * from sample_table;", channel)

Reading SQL query with pyodbc

Besides using pandas, we can execute a SQL query with pyodbc alone. In this case, we need to create a cursor object. A cursor is an object used to process the results of a SQL query.


cursor = channel.cursor()

cursor.execute("select * from sample_table;")

Next, we can extract the results of the query by using the fetchall method. fetchall returns the rows of the query result into a list, rather than a data frame. Each element in the list corresponds to a row.


rows = cursor.fetchall()

Alternatively, we can pull one row at a time using the fetchone method, like below. Each time we call this method, Python returns the next row in the query result. This can be handy when you’re dealing with a large dataset that you don’t want to bring entirely into memory at once.


cursor.execute("select * from sample_table;")

cursor.fetchone()

For example, here we can print 10 rows, one by one, of the result dataset.


for i in range(10):
    print(cursor.fetchone())

When we’re done with our database connection, it’s a good idea to close it, which we can do like this:


channel.close()

pypyodbc

pypyodbc is very similar to pyodbc, except that it is written in pure Python under the hood. It can also be installed via pip.


pip install pypyodbc

Our code above can be run exactly the same way, except we replace pyodbc with pypyodbc.


import pyodbc

channel = pypyodbc.connect("DRIVER={SQLite3 ODBC Driver};SERVER=localhost;DATABASE=sample_database.db;Trusted_connection=yes")

pypyodbc has a handful of methods that do not currently exist in pyodbc, such as the ability to more easily create Access database files.

turbodbc

Now that we’ve reviewed pyodbc, let’s talk about the turbodbc. On the surface, these packages have similar syntax. However, a primary advantage of turbodbc is that it is usually faster in extracting data than pyodbc. For example, turbodbc uses buffers to speed up returning multiple rows. The use of buffers, together with NumPy on the backend, combines to make the data type conversions faster between the database and the Python results.

It’s recommended to install turbodbc using conda install, like below. This will install all necessary dependencies across different platforms. If you don’t have conda, see here.


conda install -c conda-forge turbodbc

After that is done, the next step is to import turbodbc package.


import turbodbc

Similar to the above examples, we can connect to our database using a one-liner command. We can use either the connection string, like previously, or directly specify the DSN (“data source name”).


turbodbc.connect(dsn = "SQLite3 Datasource")

After the connection is created, we can define a cursor object, like what we did earlier in this post.


cursor = channel.cursor()

Now, using the same syntax as with pyodbc, we can execute our same SQL query.


cursor.execute("select * from sample_table;")

Lastly, let’s fetch all of the rows using the familiar fetchall methowed.


rows = cursor.fetchall()

Conclusion

That’s it for this post! There’s much more to turbodbc and pyodbc. In a future post, we’ll explore their additional functionality.

Keep up with my latest posts by following my blog on Twitter!.

Andrew Treadway

Recent Posts

Software Engineering for Data Scientists (New book!)

Very excited to announce the early-access preview (MEAP) of my upcoming book, Software Engineering for…

2 years ago

How to stop long-running code in Python

Ever had long-running code that you don't know when it's going to finish running? If…

3 years ago

Faster alternatives to pandas

Background If you've done any type of data analysis in Python, chances are you've probably…

3 years ago

Automated EDA with Python

In this post, we will investigate the pandas_profiling and sweetviz packages, which can be used…

4 years ago

How to plot XGBoost trees in R

In this post, we're going to cover how to plot XGBoost trees in R. XGBoost…

4 years ago

Python collections tutorial

In this post, we'll discuss the underrated Python collections package, which is part of the…

4 years ago