2017 - Open Source Automation

30 Dec 2017 by Andrew Treadway

Underrated R Functions

R

I wanted to write a post about a couple of handy functions in R that don't always get the recognition they deserve. This article covers a few functions that form part of R's core functional programming capabilities. R has thousands of functions, so this is just a short list, and I'll probably write other articles like this in the future to discuss other R functions.

Reduce

Let's start with the Reduce function (note the capital "R"). Reduce takes a list or vector as input and reduces it to a single element. It works by applying a function to the first two elements of the vector or list, and then applying the same function to that result and the third element. This new result gets passed with…

11 Dec 2017 by Andrew Treadway

Vectorize Fuzzy Matching

R

One of the best things about R is its ability to vectorize code. This allows you to run code much faster than you would with a for or while loop. In this post, we're going to show you how to use vectorization to speed up fuzzy matching. First, we'll cover a little background. If you're familiar with vectorization and/or fuzzy matching, feel free to skip further down the post.

What is vectorization?

Vectorization works by performing operations on entire vectors (or, by extension, matrices) rather than iterating through a collection of objects one element at a time. A basic example is adding two vectors together, which can be done like this:

[code lang="R"]
a <- c(3, 4, 5)
b <-…

14 Oct 2017 by Andrew Treadway

Running R Code in Parallel

R

Background

Running R code in parallel can be very useful in speeding up performance. Basically, parallelization allows you to run multiple processes in your code simultaneously, rather than iterating over a list one element at a time or running a single process at a time. Thankfully, running R code in parallel is relatively simple using the parallel package. This package provides parallelized versions of apply-family functions such as sapply and lapply (parSapply and parLapply). Parallelizing code works best when you need to call a function or perform an operation on each element of a list or vector, and the result for any one element has no impact on the evaluation of any other. This could be running a large number of models across different elements of a list, scraping…

12 Oct 2017 by Andrew Treadway

Word Frequency Analysis

Python, Web Scraping

In a previous article, we talked about using Python to scrape stock-related articles from the web. As an extension of this idea, we’re going to show you how to use the NLTK package to figure out how often different words occur in text, using scraped stock articles.

Initial Setup

Let's import the NLTK package, along with requests and BeautifulSoup, which we'll need to scrape the stock articles.

[code language="python" style="font-size: 8px"]
'''load packages'''
import nltk
import requests
from bs4 import BeautifulSoup
[/code]

Pulling the data we'll need

Below, we're copying code from my scraping stocks article. This gives us a function, scrape_all_articles (along with two other helper functions), which we can use to pull the actual raw text from articles linked to from NASDAQ's website.

[code language="python"]
def scrape_news_text(news_url):
    news_html…
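As a rough illustration of the word-counting idea described above, here is a minimal sketch that uses only the Python standard library (re and collections.Counter) rather than NLTK. The function name word_frequencies and the sample sentence are made up for illustration, and the regular-expression tokenizer is a crude assumption; NLTK's tokenizers handle punctuation and edge cases more carefully.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercase word occurs in text."""
    # Crude tokenizer (an assumption, not NLTK's): runs of letters and apostrophes
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Made-up sample text standing in for a scraped article
sample = "Netflix stock rose today. Analysts say the stock may rise again."
freq = word_frequencies(sample)

# Show the most frequent words and their counts
print(freq.most_common(3))
```

A Counter behaves like a dictionary, so freq["stock"] looks up the count for a single word.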

03 Oct 2017 by Andrew Treadway

Running Python from the Task Scheduler

Python, System Administration

Background

Running Python from the Windows Task Scheduler is a really useful capability. It allows you to run Python in production on a Windows system, and can save countless hours of work. For instance, extracting data from a database on an automated, regular basis is a common need at many companies.

How to run Python from the command line

Before we go into how to schedule a Python script to run, you need to understand how to run Python from the command line. To open the command prompt (command line), press the Windows key and type cmd into the search box. Next, suppose your Python script is called cool_python_script.py, and is saved under C:\Users. You can run this script from the command prompt by typing the below…

29 Sep 2017 by Andrew Treadway

Downloading Every File on an FTP Server

File Manipulation, Python, Web Login

Getting Started

Before I get to the topic in the title of this article, I'm going to give an introduction to using Python to work with FTP sites. In our example, I will use (and extend) some of the code written in the yahoo_fin package. To work with FTP servers, we can use ftplib, which is part of the Python standard library, so it needs no separate installation. To log into an FTP site, we first need to establish a connection. You can do this using the FTP class. Just replace ftp.nasdaqtrader.com with whatever FTP site you want to log into.

[code lang="python"]
'''Load the ftplib package '''
import ftplib

'''Connect to FTP site'''
ftp…
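To make the idea in the title concrete, here is a minimal sketch of downloading every file from an FTP server's current directory using ftplib's nlst and retrbinary methods. The function names connect and download_all_files are hypothetical, anonymous login is an assumption, and the sketch assumes every name returned by nlst is a regular file rather than a subdirectory.

```python
import ftplib
import os


def connect(host):
    """Open a connection to host and log in anonymously (an assumption)."""
    ftp = ftplib.FTP(host)
    ftp.login()
    return ftp


def download_all_files(ftp, dest_dir):
    """Download every entry in the server's current directory into dest_dir.

    Assumes each name returned by nlst() is a regular file, not a subdirectory.
    """
    os.makedirs(dest_dir, exist_ok=True)
    downloaded = []
    for name in ftp.nlst():
        local_path = os.path.join(dest_dir, os.path.basename(name))
        with open(local_path, "wb") as fh:
            # RETR streams the remote file; each chunk is written straight to disk
            ftp.retrbinary("RETR " + name, fh.write)
        downloaded.append(local_path)
    return downloaded


# Example usage (needs network access, so it is left commented out):
# ftp = connect("ftp.nasdaqtrader.com")
# print(download_all_files(ftp, "ftp_downloads"))
# ftp.quit()
```

Passing the connection in as an argument keeps the download logic separate from the login step, which also makes it easy to try against a different server.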

19 Sep 2017 by Andrew Treadway

RoboBrowser: Automating Online Forms

Python, Web Scraping

Background

RoboBrowser is a Python 3.x package for crawling through the web and submitting online forms. It works similarly to the older Python 2.x package, mechanize. This post gives a simple introduction to using RoboBrowser to submit a form on Wunderground for scraping historical weather data.

Initial setup

RoboBrowser can be installed via pip:

[code lang="python"]
pip install robobrowser
[/code]

Let's do the initial setup of the script by loading the RoboBrowser package. We'll also load pandas, as we'll be using that a little bit later.

[code lang="python"]
from robobrowser import RoboBrowser
import pandas as pd
[/code]

Create RoboBrowser Object

Next, we create a RoboBrowser object. This object functions similarly to an actual web browser. It allows you to navigate to different websites, fill in forms, and get…

16 Sep 2017 by Andrew Treadway

Parsing Dates with Pandas

Pandas, Python

The pandas package is one of the most powerful Python packages available. One useful feature of pandas is its Timestamp class, which can convert strings in a variety of formats to dates. The problem we're trying to solve in this article is how to parse dates from strings that may contain additional text or words. We will look at this problem using pandas. In the first step, we'll load the pandas package.

[code lang="python"]
'''Load pandas package '''
import pandas as pd
[/code]

Next, let's create a sample string containing a made-up date with other text. For now, assume the dates will not contain spaces (we will re-examine this later). Under this assumption, we use the split method, available for strings in Python, to create a list of…
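The split-and-try approach described above can be sketched as follows. Note that this sketch swaps in datetime.strptime from the standard library in place of pandas, so it only recognizes the hand-picked formats listed in FORMATS (an assumption, far narrower than what pandas accepts). The function name find_dates and the sample sentence are made up for illustration.

```python
from datetime import datetime

# Formats to try; a hand-picked assumption for this sketch
FORMATS = ("%m/%d/%Y", "%Y-%m-%d")


def find_dates(text):
    """Return a datetime for each whitespace-separated token that parses as a date."""
    found = []
    for token in text.split():
        token = token.strip(".,;")  # drop surrounding punctuation
        for fmt in FORMATS:
            try:
                found.append(datetime.strptime(token, fmt))
                break  # stop at the first format that matches
            except ValueError:
                continue  # not a date in this format; try the next one
    return found


# Made-up sample string containing dates mixed with other words
sample = "The meeting was moved to 10/11/2017, not 2017-10-12."
print(find_dates(sample))
```

As in the article, this relies on each date containing no spaces, since the string is split on whitespace before any parsing is attempted.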

31 Aug 2017 by Andrew Treadway

File Manipulation with Python

File Manipulation, Python, System Administration

Getting started

Python is great for automating file creation, deletion, and other types of file manipulation. Two of the primary modules used to perform these types of tasks are os and shutil. We'll be covering a few useful highlights from each of these.

[code lang="python"]
import os
import shutil
[/code]

How to get and change your current working directory

You can get your current working directory using os.getcwd:

[code lang="python"]
os.getcwd()
[/code]

Any action you take without specifying a directory is assumed to apply to your current working directory. That is, if you create or search for a file without specifying a directory, Python will look in the directory returned by os.getcwd(). To change your working directory, use os.chdir:

[code lang="python"]
os.chdir("C:/path/to/new/directory")
[/code]

How to merge a directory name…
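To illustrate how the working directory affects file creation, here is a small sketch that changes into a temporary directory, creates a file without giving a path, and confirms where it landed. The use of tempfile and the file name example.txt are assumptions made so the example does not touch any real files.

```python
import os
import tempfile

original_dir = os.getcwd()

# A throwaway directory so the example does not touch real files
with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)
    # No directory is given, so the file is created in the current working directory
    with open("example.txt", "w") as fh:
        fh.write("hello")
    created_in_tmp = os.path.exists(os.path.join(tmp, "example.txt"))
    # Change back before the temporary directory is cleaned up
    os.chdir(original_dir)

print(created_in_tmp)
```

Saving the original directory first and restoring it afterwards avoids leaving the rest of a script running in an unexpected location.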

24 Aug 2017 by Andrew Treadway

Scraping Articles About Stocks

Python, Web Scraping

The following article will show you an example of how to scrape articles about stocks from the Web using Python 3. Specifically, we'll be looking at articles linked from http://www.nasdaq.com. If you're not familiar with list comprehensions, you may want to review them first, as we'll be using them in our code.

Initial, Specific Example

Let's start with a specific stock, say Netflix. Articles linked to a specific stock ticker from Nasdaq's website have the following pattern: http://www.nasdaq.com/symbol/TICKER/news-headlines, where TICKER is replaced with whatever ticker you want. In our case, we will start by dealing specifically with Netflix's (NFLX) stock. So our site of interest is: http://www.nasdaq.com/symbol/nflx/news-headlines

The first step is to load the requests and BeautifulSoup packages. Here, we'll also set the variable site equal to…
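The URL pattern above lends itself to a list comprehension. The following sketch builds one news-headlines URL per ticker; the list of tickers and the variable names are made up for illustration, and lowercasing the ticker simply follows the NFLX example.

```python
# Made-up list of tickers for illustration
tickers = ["NFLX", "AAPL", "MSFT"]

# URL pattern quoted above, with TICKER replaced by a format placeholder
pattern = "http://www.nasdaq.com/symbol/{}/news-headlines"

# List comprehension: one headlines URL per ticker, lowercased as in the NFLX example
urls = [pattern.format(ticker.lower()) for ticker in tickers]

print(urls[0])
```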

Year: 2017