BeautifulSoup vs. Rvest

Python, R, Web Scraping
This post will compare Python's BeautifulSoup package to R's rvest package for web scraping. We'll also talk about functionality available in rvest (that doesn't exist in BeautifulSoup) in comparison to a couple of other Python packages (including pandas and RoboBrowser).

Getting started

BeautifulSoup and rvest both involve creating an object that we can use to parse the HTML from a webpage. However, one immediate difference is that BeautifulSoup is just an HTML parser, so it doesn't connect to webpages. rvest, on the other hand, can connect to a webpage and scrape / parse its HTML in a single package. In BeautifulSoup, our initial setup looks like this:

[code lang="python"]
# load packages
from bs4 import BeautifulSoup
import requests

# connect to webpage
resp = requests.get("https://www.azlyrics.com/b/beatles.html")

# get BeautifulSoup object
soup…
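Since the excerpt above cuts off mid-snippet, here is a self-contained sketch of the same BeautifulSoup parsing step. To keep it runnable offline, it parses a small inline HTML snippet (the markup and the `album` class name here are made up for illustration) rather than requesting the live lyrics page.

```python
# A minimal sketch of parsing with BeautifulSoup, using inline HTML
# instead of a live request (the markup below is invented for the example)
from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <div class="album">Abbey Road</div>
    <div class="album">Let It Be</div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every tag matching the given name / attributes
albums = [div.get_text() for div in soup.find_all("div", class_="album")]
print(albums)  # → ['Abbey Road', 'Let It Be']
```

With a live page, the `html` string would simply be replaced by `resp.text` from the `requests.get` call shown in the excerpt.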
Read More
Web Browsing and Parsing with RoboBrowser and requests_html

Python, Web Scraping
Background

So you've learned all about BeautifulSoup. What's next? Python is a great language for automating web operations. In a previous article, we went through how to use BeautifulSoup and requests to scrape stock-related articles from Nasdaq's website. This post covers a couple of alternatives to using BeautifulSoup directly. One way of scraping and crawling the web is to use Python's RoboBrowser package, which is built on top of requests and BeautifulSoup. Because it's built on these packages, writing code to scrape the web is somewhat simplified, as we'll see below. RoboBrowser works similarly to the older Python 2.x package mechanize in that it allows you to simulate a web browser. A second option is using requests_html, which was also discussed here, and which we'll also…
Read More
All about Python Sets

Python
See also my tutorials on lists and list comprehensions.

Background on sets

A set in Python is an unordered collection of unique elements. Sets are mutable and iterable (more on these properties later). Sets are useful when dealing with a unique collection of elements - e.g. finding the unique elements within a list to determine if there are any values which should be present. The operations built around sets are also handy when you need to perform mathematical set-like operations. For example, how would you figure out the common elements between two lists? Or what elements are in one list, but not another? With sets, it's easy!

How to create a set

We can define a set using curly braces, similar to how we define dictionaries.

[code lang="python"]…
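As a quick sketch of the two list-comparison questions posed above, using only built-in set operations (the sample lists here are our own):

```python
# Two sample lists with some overlap and a duplicate
list_a = [1, 2, 2, 3, 4]
list_b = [3, 4, 5]

set_a = set(list_a)  # duplicates are dropped: {1, 2, 3, 4}
set_b = set(list_b)

common = set_a & set_b  # intersection: elements in both lists
only_a = set_a - set_b  # difference: in list_a but not list_b

print(common)  # {3, 4}
print(only_a)  # {1, 2}
```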
Read More
3 ways to scrape tables from PDFs with Python

Python
This post will go through a few ways of scraping tables from PDFs with Python. To learn more about scraping tables and other data from PDFs with R, click here. Note, these options will only work for PDFs that are typed - not scanned-in images.

tabula-py

tabula-py is a very nice package that allows you both to scrape tables from PDFs and to convert PDFs directly into CSV files. tabula-py can be installed using pip:

[code]
pip install tabula-py
[/code]

If you have issues with installation, check this. Once installed, tabula-py is straightforward to use. Below we use it to scrape all the tables from a paper on classification regarding the Iris dataset (available here).

[code lang="python"]
import tabula

file = "http://lab.fs.uni-lj.si/lasin/wp/IMIT_files/neural/doc/seminar8.pdf"
tables = tabula.read_pdf(file, pages = "all", multiple_tables = True)
[/code]…
Read More
How to get options data with Python

Python, Web Scraping
In a previous post, we talked about how to get real-time stock prices with Python. This post will go through how to download financial options data with Python. We will be using the yahoo_fin package, which comes with a module called options. This module allows you to scrape option chains and get option expiration dates. To get started, we'll just import this module from yahoo_fin.

[code lang="python"]
from yahoo_fin import options
[/code]

How to get options expiration dates

Any option contract has an expiration date. To get all of the option expiration dates for a particular stock, we can use the get_expiration_dates method in the options module. This method is equivalent to scraping all of the date selection boxes on the options page for an individual stock (e.g.…
Read More
Why defining constants is important – a Python example

Python
This post will walk through an example of why defining a known constant can save lots of computational time.

How to find the key with the maximum value in a Python dictionary

There are a few ways to go about getting the key associated with the max value in a Python dictionary. The two ways we'll show each involve using a list comprehension. First, let's set the scene by creating a dictionary with 100,000 key-value pairs. We'll just make the keys the integers between 0 and 99,999, and we'll use the random package to randomly assign values for each of these keys based on the uniform distribution between 0 and 100,000.

[code lang="python"]
import random
import time

vals = [random.uniform(0, 100000) for x in range(100000)]
mapping = dict(zip(range(100000), vals))
[/code]

Now,…
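To sketch the constant-defining idea with the same dictionary setup as above: recomputing the max inside the comprehension re-scans all the values on every iteration, while computing it once and storing it scans them only once. The variable names `max_val` and `best_keys` are ours, not necessarily the post's.

```python
import random

random.seed(0)  # seeded here so the sketch is reproducible
vals = [random.uniform(0, 100000) for x in range(100000)]
mapping = dict(zip(range(100000), vals))

# Slow version (avoid): max(mapping.values()) is recomputed on every
# iteration of the comprehension, making this O(n^2) overall:
# best_keys = [k for k, v in mapping.items() if v == max(mapping.values())]

# Fast version: compute the max once, store it as a constant, then scan - O(n)
max_val = max(mapping.values())
best_keys = [k for k, v in mapping.items() if v == max_val]
```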
Read More
Scraping data from a JavaScript webpage with Python

Python, Web Scraping
This post will walk through how to use the requests_html package to scrape options data from a JavaScript-rendered webpage. requests_html serves as an alternative to Selenium and PhantomJS, and provides a clear syntax similar to the awesome requests package. The code we'll walk through is packaged into functions in the options module of the yahoo_fin package, but this article will show how to write the code from scratch using requests_html so that you can use the same idea to scrape other JavaScript-rendered webpages.

Note: requests_html requires Python 3.6+. If you don't have requests_html installed, you can download it using pip:

[code]
pip install requests_html
[/code]

Motivation

Let's say we want to scrape options data for a particular stock. As an example, let's look at Netflix (since it's well known). If…
Read More
2 packages for extracting dates from a string of text in Python

Pandas, Python
This post will cover two different ways to extract dates from strings with Python. The main point here is that the strings we will parse contain additional text - not just the date. Scraping a date out of text can be useful in many different situations.

Option 1) dateutil

The first option we'll show uses the dateutil package. Here's an example:

[code lang="python"]
from dateutil.parser import parse

parse("Today is 12-01-18", fuzzy_with_tokens=True)
[/code]

Above, we use a method in dateutil called parse. The first parameter to this method is the string of text we want to search for a date. The second parameter, fuzzy_with_tokens, is set equal to True - this causes the method to return the parsed date along with the pieces of the string that were not part of it. In other words,…
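For reference, here is roughly what that call and its return value look like, assuming the python-dateutil package is installed (how the leftover tokens are split is dateutil's behavior, not ours):

```python
from dateutil.parser import parse

text = "Today is 12-01-18"

# fuzzy_with_tokens=True returns a (datetime, skipped_tokens) pair
dt, skipped = parse(text, fuzzy_with_tokens=True)

print(dt)       # 2018-12-01 00:00:00 (12-01-18 parsed month-first by default)
print(skipped)  # the non-date pieces of the string, e.g. ('Today is ',)
```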
Read More

Intro to Python Course

Python
For anyone in the NYC area, I am offering an in-person introductory Python course on January 7th, 2019. The description of the workshop is below. Please see here to register on Eventbrite. Want to learn Python? Consider attending this workshop! This hands-on class will be a great introduction to coding in Python and the important features of the language, and will help you build a strong foundation for learning more in the future!

Overview

This course provides a workshop introducing you to Python. We'll walk through how to write and run Python programs, when to use particular data structures, how to handle different data types, and more. The class will be a great start in learning one of the most versatile, powerful programming languages in use today! All…
Read More
How to measure DNA similarity with Python and Dynamic Programming

Python
*Note, if you want to skip the background / alignment calculations and go straight to where the code begins, just click here.

Dynamic Programming and DNA

Dynamic programming has many uses, including identifying the similarity between two different strands of DNA or RNA, protein alignment, and various other applications in bioinformatics (in addition to many other fields). For anyone less familiar, dynamic programming is a coding paradigm that solves recursive problems by breaking them down into sub-problems, using some type of data structure to store the sub-problem results. In this way, recursive problems (like the Fibonacci sequence, for example) can be programmed much more efficiently, because dynamic programming allows you to avoid duplicate (and hence wasteful) calculations in your code. Click here to read more about dynamic programming. Let's…
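The Fibonacci sequence mentioned above gives a minimal sketch of the memoization idea, with a dictionary as the data structure storing sub-problem results. This is our own illustration of the paradigm, not the post's DNA-alignment code.

```python
# Memoized Fibonacci: each sub-problem is computed once and stored,
# so the naive exponential recursion becomes linear in n.
def fib(n, memo=None):
    if memo is None:
        memo = {}
    if n in memo:
        return memo[n]          # reuse a stored sub-problem result
    if n < 2:
        return n                # base cases: fib(0) = 0, fib(1) = 1
    memo[n] = fib(n - 1, memo) + fib(n - 2, memo)
    return memo[n]

print(fib(50))  # → 12586269025, infeasible with the naive recursion
```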
Read More