Background – Reading from Databases with Python
This post will talk about several packages for working with databases using Python. We’ll start by covering pyodbc, which is one of the more standard packages used for working with databases, but we’ll also cover a very useful module called turbodbc, which allows you to run SQL queries efficiently (and generally faster) within Python.
pyodbc
pyodbc can be installed using pip:
pip install pyodbc
Let’s start by writing a simple SQL query using pyodbc. To do that, we first need to connect to a specific database. In the examples laid out here, we will be using a SQLite database on my machine. However, you can do this with many other database systems, as well, such as SQL Server, MySQL, Oracle, etc. In the connection string, we specify the database driver (e.g. one of the database systems mentioned), the server, database name, and potentially username / password. If you have a trusted connection setup, then you can specify that (like in the first example below).
import pyodbc channel = pyodbc.connect("DRIVER={SQLite3 ODBC Driver};SERVER=localhost;DATABASE=sample_database.db;Trusted_connection=yes") # or if using a password channel = pyodbc.connect("DRIVER={SQLite3 ODBC Driver};SERVER=localhost;DATABASE=sample_database.db;Uid=YourUsername;Pwd=YourPassword;'")
Reading SQL query with pandas
After we’ve made the connection, we can write a SQL query to retrieve data from this database. One way of doing that is using the pandas package. Below, we wrap the SQL code inside quotes as the first parameter of pd.read_sql. The second parameter contains our connection object.
import pandas as pd pd.read_sql("select * from sample_table;", channel)
Reading SQL query with pyodbc
Besides using pandas, we can execute a SQL query with pyodbc alone. In this case, we need to create a cursor object. A cursor is an object used to process the results of a SQL query.
cursor = channel.cursor() cursor.execute("select * from sample_table;")
Next, we can extract the results of the query by using the fetchall method. fetchall returns the rows of the query result into a list, rather than a data frame. Each element in the list corresponds to a row.
rows = cursor.fetchall()
Alternatively, we can pull one row at a time using the fetchone method, like below. Each time we call this method, Python returns the next row in the query result. This can be handy when you’re dealing with a large dataset that you don’t want to bring entirely into memory at once.
cursor.execute("select * from sample_table;") cursor.fetchone()
For example, here we can print 10 rows, one by one, of the result dataset.
for i in range(10): print(cursor.fetchone())
When we’re done with our database connection, it’s a good idea to close it, which we can do like this:
channel.close()
pypyodbc
pypyodbc is very similar to pyodbc, except that it is written in pure Python under the hood. It can also be installed via pip.
pip install pypyodbc
Our code above can be run exactly the same way, except we replace pyodbc with pypyodbc.
import pyodbc channel = pypyodbc.connect("DRIVER={SQLite3 ODBC Driver};SERVER=localhost;DATABASE=sample_database.db;Trusted_connection=yes")
pypyodbc has a handful of methods that do not currently exist in pyodbc, such as the ability to more easily create Access database files.
turbodbc
Now that we’ve reviewed pyodbc, let’s talk about the turbodbc. On the surface, these packages have similar syntax. However, a primary advantage of turbodbc is that it is usually faster in extracting data than pyodbc. For example, turbodbc uses buffers to speed up returning multiple rows. The use of buffers, together with NumPy on the backend, combines to make the data type conversions faster between the database and the Python results.
It’s recommended to install turbodbc using conda install, like below. This will install all necessary dependencies across different platforms. If you don’t have conda, see here.
conda install -c conda-forge turbodbc
After that is done, the next step is to import turbodbc package.
import turbodbc
Similar to the above examples, we can connect to our database using a one-liner command. We can use either the connection string, like previously, or directly specify the DSN (“data source name”).
turbodbc.connect(dsn = "SQLite3 Datasource")
After the connection is created, we can define a cursor object, like what we did earlier in this post.
cursor = channel.cursor()
Now, using the same syntax as with pyodbc, we can execute our same SQL query.
cursor.execute("select * from sample_table;")
Lastly, let’s fetch all of the rows using the familiar fetchall methowed.
rows = cursor.fetchall()
Conclusion
That’s it for this post! There’s much more to turbodbc and pyodbc. In a future post, we’ll explore their additional functionality.
Keep up with my latest posts by following my blog on Twitter!.