Blog

All about Python Sets

Python
See also my tutorials on lists and list comprehensions.

Background on sets

A set in Python is an unordered collection of unique elements. Sets are mutable and iterable (more on these properties later). Sets are useful when dealing with a unique collection of elements - e.g. finding the unique elements within a list to determine if there are any values which should be present. The operations built around sets are also handy when you need to perform mathematical set-like operations. For example, how would you figure out the common elements between two lists? Or what elements are in one list, but not another? With sets, it's easy!

How to create a set

We can define a set using curly braces, similar to how we define dictionaries.

[code lang="python"]…
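
The excerpt is cut off before the set operations themselves, so here is a minimal sketch of the kind of comparisons described above (the two lists are made up for illustration):

[code lang="python"]
# two example lists with some overlapping elements
a = [1, 2, 3, 4]
b = [3, 4, 5, 6]

# common elements between the two lists
print(set(a) & set(b))  # {3, 4}

# elements in a but not in b
print(set(a) - set(b))  # {1, 2}
[/code]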
3 ways to scrape tables from PDFs with Python

Python
This post will go through a few ways of scraping tables from PDFs with Python. To learn more about scraping tables and other data from PDFs with R, click here. Note, these options will only work for PDFs that are typed - not scanned-in images.

tabula-py

tabula-py is a very nice package that allows you to both scrape PDFs, as well as convert PDFs directly into CSV files. tabula-py can be installed using pip:

[code]
pip install tabula-py
[/code]

If you have issues with installation, check this. Once installed, tabula-py is straightforward to use. Below we use it to scrape all the tables from a paper on classification regarding the Iris dataset (available here).

[code lang="python"]
import tabula

file = "http://lab.fs.uni-lj.si/lasin/wp/IMIT_files/neural/doc/seminar8.pdf"
tables = tabula.read_pdf(file, pages = "all", multiple_tables = True)
[/code]
…
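
The excerpt also mentions converting PDFs directly into CSV files; a minimal sketch of that using tabula-py's convert_into function (the output filename here is just for illustration):

[code lang="python"]
import tabula

file = "http://lab.fs.uni-lj.si/lasin/wp/IMIT_files/neural/doc/seminar8.pdf"

# write every table found in the PDF out to a single CSV file
tabula.convert_into(file, "iris_tables.csv", output_format = "csv", pages = "all")
[/code]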
Four ways to reverse a string in R

R
R offers several ways to reverse a string, including some base R options. We go through a few of those in this post. We'll also compare the computational time for each method. Reversing a string can be especially useful in bioinformatics (e.g. finding the reverse complement of a DNA strand).

To get started, let's generate a random string of 10 million DNA bases (we can do this with the stringi package as well, but for our purposes here, let's just use base R functions).

[code lang="R"]
set.seed(1)
dna <- paste(sample(c("A", "T", "C", "G"), 10000000, replace = T), collapse = "")
[/code]

1) Base R with strsplit and paste

One way to reverse a string is to use strsplit with paste. This is the slowest method that will be shown, but…
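
The excerpt cuts off before the code for this first method; a minimal sketch of the strsplit-plus-paste approach it names (shown on a short string rather than the 10-million-base one):

[code lang="R"]
s <- "ATCG"

# split the string into single characters, reverse them, and paste back together
reversed <- paste(rev(strsplit(s, "")[[1]]), collapse = "")
reversed  # "GCTA"
[/code]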
How to get options data with Python

Python, Web Scraping
In a previous post, we talked about how to get real-time stock prices with Python. This post will go through how to download financial options data with Python. We will be using the yahoo_fin package.

The yahoo_fin package comes with a module called options. This module allows you to scrape option chains and get option expiration dates. To get started, we'll just import this module from yahoo_fin.

[code lang="python"]
from yahoo_fin import options
[/code]

How to get options expiration dates

Any option contract has an expiration date. To get all of the option expiration dates for a particular stock, we can use the get_expiration_dates method in the options module. This method is equivalent to scraping all of the date selection boxes on the options page for an individual stock (e.g.…
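
A minimal sketch of that call, using Netflix's ticker as an example (any valid ticker should work the same way):

[code lang="python"]
from yahoo_fin import options

# list of expiration dates for NFLX option contracts
dates = options.get_expiration_dates("nflx")
print(dates)
[/code]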

Don’t forget the “utils” package in R

R
With thousands of powerful packages, it's easy to glaze over the libraries that come preinstalled with R. Thus, this post will talk about some of the cool functions in the utils package, which comes with a standard installation of R. While utils comes with several familiar functions, like read.csv, write.csv, and help, it also contains over 200 other functions.

readClipboard and writeClipboard

One of my favorite duos of functions from utils is readClipboard and writeClipboard. If you're doing some manipulation to get a quick answer between R and Excel, these functions can come in handy. readClipboard reads in whatever is currently on the clipboard. For example, let's copy a column of cells from Excel. We can now run readClipboard() in R. The result of running this command is a vector…
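
A minimal sketch of the round trip described above (note these two functions are available in the Windows build of R):

[code lang="R"]
# read whatever is currently on the clipboard into a character vector
cells <- readClipboard()

# write a vector back to the clipboard, ready to paste into Excel
writeClipboard(as.character(1:5))
[/code]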
Speed Test: Sapply vs. Vectorization

R
The apply functions in R are awesome (see this post for some lesser known apply functions). However, if you can use pure vectorization, then you'll probably end up making your code run a lot faster than just depending upon functions like sapply and lapply. This is because apply functions like these still rely on looping through elements in a vector or list behind the scenes - one at a time. Vectorized functions, on the other hand, push that looping down into compiled code under the hood - allowing much faster computation. This post runs through a couple such examples involving string substitution and fuzzy matching.

String substitution

For example, let's create a vector that looks like this: test1, test2, test3, test4, ..., test1000000 with one million elements. With sapply, the code to create this would…
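
The excerpt stops before the code; a minimal sketch of the comparison it sets up (the exact timings will vary by machine):

[code lang="R"]
n <- 1000000

# sapply version - calls paste0 once per element
system.time(v1 <- sapply(1:n, function(i) paste0("test", i)))

# vectorized version - a single paste0 call over the whole vector
system.time(v2 <- paste0("test", 1:n))

identical(v1, v2)  # TRUE
[/code]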
Why defining constants is important – a Python example

Python
This post will walk through an example of why defining a known constant can save lots of computational time.

How to find the key with the maximum value in a Python dictionary

There are a few ways to go about getting the key associated with the max value in a Python dictionary. The two ways we'll show each involve using a list comprehension. First, let's set the scene by creating a dictionary with 100,000 key-value pairs. We'll just make the keys the integers between 0 and 99,999 and we'll use the random package to randomly assign values for each of these keys based off the uniform distribution between 0 and 100,000.

[code lang="python"]
import random
import time

vals = [random.uniform(0, 100000) for x in range(100000)]
mapping = dict(zip(range(100000), vals))
[/code]

Now,…
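
The excerpt stops before the comparison itself; a minimal sketch of the point being made - computing the max once and storing it as a constant versus recomputing it inside the comprehension (a smaller dictionary is used here so the slow version finishes quickly):

[code lang="python"]
import random
import time

vals = [random.uniform(0, 100000) for x in range(10000)]
mapping = dict(zip(range(10000), vals))

# slow: max(mapping.values()) is recomputed on every iteration
start = time.time()
slow = [key for key in mapping if mapping[key] == max(mapping.values())]
print(time.time() - start)

# fast: compute the max once, then compare against the stored constant
start = time.time()
max_val = max(mapping.values())
fast = [key for key in mapping if mapping[key] == max_val]
print(time.time() - start)
[/code]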
Creating a word cloud on R-bloggers posts

R, Web Scraping
This post will go through how to create a word cloud of article titles scraped from the awesome R-bloggers. Our goal will be to use R's rvest package to search through 50 successive pages on the site for article titles. The stringr and tm packages will be used for string cleaning and for creating a term document frequency matrix (with tm). We will then create a word cloud based off the words comprising these titles.

First, we'll load the packages we need.

[code lang="R"]
# load packages
library(rvest)
library(stringr)
library(tm)
library(wordcloud)
[/code]

Let's write a function that will take a webpage as input and return all the scraped article titles.

[code lang="R"]
scrape_post_titles <- function(site) {
  # scrape HTML from input site
  source_html <- read_html(site)
  # grab the title attributes…
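
The excerpt's function is cut off mid-definition; a minimal sketch of how such a scraper might look with rvest (the CSS selector is an assumption for illustration and may not match R-bloggers' current markup):

[code lang="R"]
library(rvest)

scrape_post_titles <- function(site) {
  # scrape HTML from the input site
  source_html <- read_html(site)

  # grab the title nodes and extract their text
  # NOTE: the "h3.loop-title a" selector is a guess for illustration
  titles <- source_html %>%
    html_nodes("h3.loop-title a") %>%
    html_text()

  titles
}
[/code]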
Scraping data from a JavaScript webpage with Python

Python, Web Scraping
This post will walk through how to use the requests_html package to scrape options data from a JavaScript-rendered webpage. requests_html serves as an alternative to Selenium and PhantomJS, and provides a clear syntax similar to the awesome requests package. The code we'll walk through is packaged into functions in the options module in the yahoo_fin package, but this article will show how to write the code from scratch using requests_html so that you can use the same idea to scrape other JavaScript-rendered webpages.

Note: requests_html requires Python 3.6+. If you don't have requests_html installed, you can download it using pip:

[code]
pip install requests_html
[/code]

Motivation

Let's say we want to scrape options data for a particular stock. As an example, let's look at Netflix (since it's well known). If…
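
A minimal sketch of the basic requests_html pattern the post builds on - fetching a page and rendering its JavaScript before parsing (the URL is just an example):

[code lang="python"]
from requests_html import HTMLSession

session = HTMLSession()

# fetch the page, then execute its JavaScript with a headless browser
resp = session.get("https://finance.yahoo.com/quote/NFLX/options")
resp.html.render()

# after rendering, the full HTML (including JS-built tables) is available
tables = resp.html.find("table")
[/code]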
2 packages for extracting dates from a string of text in Python

Pandas, Python
This post will cover two different ways to extract dates from strings with Python. The main point here is that the strings we will parse contain additional text - not just the date. Scraping a date out of text can be useful in many different situations.

Option 1) dateutil

The first option we'll show is using the dateutil package. Here's an example:

[code lang="python"]
from dateutil.parser import parse

parse("Today is 12-01-18", fuzzy_with_tokens=True)
[/code]

Above, we use a method in dateutil called parse. The first parameter to this method is the string of the text we want to use to search for a date. The second parameter, fuzzy_with_tokens, is set equal to True - this causes the method to return where the date is found in the string. In other words,…
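
The excerpt cuts off mid-explanation; a minimal sketch of what that call returns - a (datetime, tokens) tuple, where the tokens are the pieces of text that were not part of the date:

[code lang="python"]
from dateutil.parser import parse

dt, tokens = parse("Today is 12-01-18", fuzzy_with_tokens=True)

print(dt)      # the parsed datetime object
print(tokens)  # the leftover, non-date text (e.g. ("Today is ",))
[/code]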