Blog

How to change a file’s last modified date with R

File Manipulation, R
This quick post shows how to change a file's last modified date with base R.

How to change a file's modified time with R

Say we have a file, test.txt, and we want to change its last modified date (let's suppose the file's not that important). For instance, we might want to give the file a last modified date back in the 1980s. We can do that with one line of code. First, let's use file.info to check the current modified date of test.txt.

[code lang="R"]
file.info("test.txt")
[/code]

Looking at mtime in the output, we can see that this file was last modified December 4th, 2018. Now, we can use a function called Sys.setFileTime to change the modified date…
Read More
10 R functions for Linux commands and vice-versa

File Manipulation, R, System Administration
This post will go through 10 different Linux commands and their R alternatives. If you're interested in learning more R functions for working with files like some of those below, also check out this post.

How to list all the files in a directory

Linux                   R                                   What does it do?
ls                      list.files()                        Lists all the files in a directory
ls -R                   list.files(recursive = TRUE)        Recursively lists all the files in a directory and all sub-directories
ls | grep "something"   list.files(pattern = "something")   Lists all the files in a directory whose names match the regex "something"

R

[code lang="R"]
list.files("/path/to/directory")
list.files("/path/to/directory", recursive = TRUE)

# search for files containing "something" in their name
list.files("/path/to/directory", pattern = "something")

# search for all CSV files (escape the dot and anchor the match to the end of the name)
list.files("/path/to/directory", pattern = "\\.csv$")
[/code]

Linux

[code lang="bash"] ls /path/to/directory…
Read More

Intro to Python Course

Python
For anyone in the NYC area, I am offering an in-person introductory Python course on January 7th, 2019. The description of the workshop is below. Please see here to register on Eventbrite.

Want to learn Python? Consider attending this workshop! This hands-on class will be a great introduction to coding in Python and the important features of the language, and will help you build a strong foundation for learning more in the future!

Overview

This workshop introduces you to Python. We'll walk through how to write and run Python programs, when to use particular data structures, how to handle different data types, and more. The class will be a great start in learning one of the most versatile, powerful programming languages being used today! All…
Read More
How to measure DNA similarity with Python and Dynamic Programming

Python
*Note, if you want to skip the background / alignment calculations and go straight to where the code begins, just click here.

Dynamic Programming and DNA

Dynamic programming has many uses, including measuring the similarity between two strands of DNA or RNA, aligning proteins, and various other applications in bioinformatics (as well as in many other fields). For anyone less familiar, dynamic programming is a coding paradigm that solves recursive problems by breaking them down into sub-problems and using some type of data structure to store the sub-problem results. In this way, recursive problems (like the Fibonacci sequence, for example) can be programmed much more efficiently, because dynamic programming allows you to avoid duplicate (and hence wasteful) calculations in your code. Click here to read more about dynamic programming. Let's…
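To make the idea of storing sub-problem results concrete, here is a minimal Python sketch of the Fibonacci example mentioned above. It is not code from the post itself: the function names fib_naive and fib_memo and the calls counter are hypothetical, introduced only to show how many function calls each approach makes.

```python
def fib_naive(n, calls):
    """Plain recursion: the same sub-problems get recomputed many times."""
    calls["naive"] += 1
    if n < 2:
        return n
    return fib_naive(n - 1, calls) + fib_naive(n - 2, calls)


def fib_memo(n, memo, calls):
    """Memoized recursion: each sub-problem result is stored in a dict and reused."""
    calls["memo"] += 1
    if n in memo:
        return memo[n]
    if n < 2:
        result = n
    else:
        result = fib_memo(n - 1, memo, calls) + fib_memo(n - 2, memo, calls)
    memo[n] = result
    return result


# Count how many function calls each approach needs for the same input
calls = {"naive": 0, "memo": 0}
naive_result = fib_naive(20, calls)
memo_result = fib_memo(20, {}, calls)

print(naive_result, memo_result)  # both approaches return the same value
print(calls)                      # the memoized version makes far fewer calls
```

Both functions return the same number, but the memoized version looks up stored results instead of recomputing them, which is the saving that dynamic programming provides.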
Read More
Those “other” apply functions…

R
So you know lapply, sapply, and apply...but...what about rapply, vapply, or eapply? These are generally a little less well known members of the apply family of functions in R, so this post will explore how they work.

rapply

Let's start with rapply. This function has a couple of different purposes. One is to recursively apply a function to a list. We'll get to that in a moment. The other use of rapply is to apply a function to only those elements in a list (or columns in a data frame) that belong to a specified class. For example, let's say we have a data frame with a mix of categorical and numeric variables, but we want to evaluate a function only on the numeric variables. Use rapply to…
Read More
How to run R from the Task Scheduler

R, System Administration
In a prior post, we covered how to run Python from the Task Scheduler on Windows. This article is similar, but it'll show how to run R from the Task Scheduler instead. As before, let's first cover how to run R from the command line, as knowing this is useful for running it from the Task Scheduler.

Running R from the Command Line

To open up the command prompt, just press the Windows key and search for cmd. When R is installed, it comes with a utility called Rscript. This allows you to run R commands from the command line. If Rscript is in your PATH, then typing Rscript into the command line and pressing enter will not result in an error. Otherwise, you might get a message saying "'Rscript'…
Read More

Data Analysis with Python Course: How to read, wrangle, and analyze data

Pandas, Python
For anyone in the NYC area, I will be holding an in-person training session December 3rd on doing data analysis with Python. We will be covering the pandas, pyodbc, and matplotlib packages. Please register at Eventbrite here: https://www.eventbrite.com/e/data-analysis-with-python-how-to-read-wrangle-and-analyze-data-tickets-51945542516.

Overview

Learn how to apply Python to read, wrangle, visualize, and analyze data! This course provides a hands-on session where we'll walk through a prepared curriculum on doing data analysis with Python. All code and practice exercises during the session will be made available after the course is complete.

About the course

During this hands-on class, you will learn the fundamentals of doing data analysis in Python, the powerful pandas package, and pyodbc for connecting to databases. We will walk through using Python to analyze and answer key questions on sales…
Read More
How to build a logistic regression model from scratch in R

Machine Learning, R
Background

In a previous post, we showed how using vectorization in R can vastly speed up fuzzy matching. Here, we will show you how to use vectorization to efficiently build a logistic regression model from scratch in R. Now, we could just use the caret or stats packages to create a model, but building algorithms from scratch is a great way to develop a better understanding of how they work under the hood.

Definitions & Assumptions

In developing our code for the logistic regression algorithm, we will consider the following definitions and assumptions:

x = A d×n matrix of d predictor variables, where each column x_i represents the vector of predictors corresponding to one data point (with n such columns, i.e. n data points)
d = The number of predictor…
Read More
Dplython…dplyr for Python!

Python
If you're an avid R user, you probably use the famous dplyr package. Python has a package meant to be similar to dplyr, called dplython. This article gives an introduction to using dplython. For the examples below, we'll use a sample dataset that comes with R and gives attributes of the US states, including population, area, and income levels. You can see the dataset by clicking here.

Initial setup

dplython can be installed using pip:

pip install dplython

Once the package is installed, let's load a few methods from it, and read in our dataset.

[code lang="python"]
# load packages
from dplython import select, DplyFrame, X, arrange, count, sift, head, summarize, group_by, tail, mutate
import pandas as pd

# read in data
state_df = pd.read_csv("state_info.txt")
[/code]

After we've…
Read More
Getting data from PDFs the easy way with R

R
Earlier this year, a new package called tabulizer was released for R, which allows you to automatically pull out tables and text from PDFs. Note, this package only works if the PDF's text is highlightable (if it's typed) -- i.e. it won't work for scanned-in PDFs, or image files converted to PDFs. If you don't have tabulizer installed, just run install.packages("tabulizer") to get started.

Initial Setup

After you have tabulizer installed, we'll load it, and define a variable referencing an example PDF.

[code lang="R"]
library(tabulizer)

site <- "http://www.sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf"
[/code]

The PDFs you manipulate with this package don't have to be located on your machine -- you can use tabulizer to reference a PDF by a URL. For our first example, we're going to use a sample PDF file found here:…
Read More