Software Engineering for Data Scientists (New book!)

Machine Learning, Python
Very excited to announce that the early-access preview (MEAP) of my upcoming book, Software Engineering for Data Scientists, is available now! Check it out at this link. Use promo code au35tre to save 30% on this book and any products sold by Manning. Why Software Engineering for Data Scientists? Data science and software engineering have been merging more and more, especially over the last decade. Software Engineering for Data Scientists is my upcoming book that will help you learn more about software engineering and how it can make your life easier as a data scientist! This book covers the following key topics: source control, how to implement exception handling and write robust code, object-oriented programming for data scientists, how to monitor the progress of training machine learning models, scaling your Python…
Read More
What to study if you’re under quarantine

Python, R
If you're staying indoors more often recently because of the current COVID-19 outbreak and looking for new things to study, here are a few ideas! Free 365 Data Science Courses 365 Data Science is making all of their courses free until April 15. They have a variety of courses across R, Python, SQL, and more. Their platform also has courses that give a great mathematical foundation behind machine learning, which helps you a lot as you get deeper into data science. They also have courses on deep learning, which is a hot field right now. In addition to pure data science, 365 Data Science also covers material on Git/GitHub, which is essential for any data scientist nowadays. Another nice feature of 365 Data Science is that they also offer…
Read More
How to get an AUC confidence interval

Machine Learning, R
Background AUC is an important metric in machine learning for classification. It is often used as a measure of a model's performance. In effect, AUC is a measure between 0 and 1 of how well a model rank-orders its predictions. For a detailed explanation of AUC, see this link. Since AUC is widely used, being able to get a confidence interval around this metric is valuable, both to better demonstrate a model's performance and to better compare two or more models. For example, if model A has a higher AUC than model B, but the 95% confidence intervals around the two AUC values overlap, then the models may not be statistically different in performance. We can get a confidence interval around AUC using R's pROC package, which…
Read More
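The comparison described above can be sketched in code. The post itself uses R's pROC package; as an illustrative alternative, this Python snippet bootstraps a 95% confidence interval around AUC with scikit-learn (the data and variable names are toy stand-ins, not from the post):

```python
# Hedged sketch: bootstrap a 95% CI for AUC in Python.
# (The post uses R's pROC instead; this is an analogous approach.)
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Toy labels and predicted scores (stand-ins for a real model's output)
y_true = rng.integers(0, 2, size=500)
y_score = y_true * 0.5 + rng.normal(0, 0.5, size=500)

point_auc = roc_auc_score(y_true, y_score)

# Resample (with replacement) and recompute AUC on each bootstrap sample
boot_aucs = []
n = len(y_true)
for _ in range(2000):
    idx = rng.integers(0, n, size=n)
    if len(np.unique(y_true[idx])) < 2:  # AUC needs both classes present
        continue
    boot_aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

# Percentile-based 95% confidence interval
lower, upper = np.percentile(boot_aucs, [2.5, 97.5])
print(f"AUC = {point_auc:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The same overlap check from the excerpt follows directly: fit two models, bootstrap each AUC, and see whether the resulting intervals overlap.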
How to build a logistic regression model from scratch in R

Machine Learning, R
Background In a previous post, we showed how using vectorization in R can vastly speed up fuzzy matching. Here, we will show you how to use vectorization to efficiently build a logistic regression model from scratch in R. Now, we could just use the caret or stats packages to create a model, but building algorithms from scratch is a great way to develop a better understanding of how they work under the hood. Definitions & Assumptions In developing our code for the logistic regression algorithm, we will consider the following definitions and assumptions: x = a d×n matrix of d predictor variables, where each column xi represents the vector of predictors corresponding to one data point (with n such columns, i.e., n data points) d = the number of predictor…
Read More
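Following the definitions above (x as a d×n matrix whose columns are data points), here is a minimal vectorized sketch in Python/NumPy. The post builds its model in R, so this is an illustrative analogue with hypothetical names (w, b, lr), not the post's actual code:

```python
# Hedged sketch: vectorized logistic regression via gradient descent,
# keeping the post's convention that x is (d, n) with data points as columns.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(x, y, lr=0.1, n_iter=1000):
    """x: (d, n) predictors; y: (n,) labels in {0, 1}."""
    d, n = x.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(w @ x + b)     # predicted probabilities, shape (n,)
        grad_w = x @ (p - y) / n   # vectorized gradient: no per-point loop
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data: one informative predictor, one noise predictor
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 200))
y = (x[0] + rng.normal(0, 0.5, 200) > 0).astype(float)
w, b = fit_logistic(x, y)
```

The vectorization lives in the `w @ x` and `x @ (p - y)` matrix products, which replace an explicit loop over the n data points.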
ICA on Images with Python

Machine Learning, Python
Click here to see my recommended reading list. What is Independent Component Analysis (ICA)? If you're already familiar with ICA, feel free to skip below to how we implement it in Python. ICA is a type of dimensionality reduction algorithm that transforms a set of variables into a new set of components; it does so such that the statistical independence between the new components is maximized. This is similar to Principal Component Analysis (PCA), which maps a collection of variables to statistically uncorrelated components, except that ICA goes a step further by maximizing statistical independence rather than just producing components that are uncorrelated. Like other dimensionality reduction methods, ICA seeks to reduce the number of variables in a set of data while retaining key information. In the example we…
Read More
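As a small illustration of the idea above (recovering statistically independent components from mixed observations), here is a hedged Python sketch using scikit-learn's FastICA on toy 1-D signals rather than images:

```python
# Hedged sketch: FastICA recovering two independent sources from a
# linear mixture. The post applies ICA to images; toy signals keep the
# same idea without image I/O.
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)

# Two independent sources: a sine wave and a square wave
s1 = np.sin(2 * t)
s2 = np.sign(np.sin(3 * t))
S = np.c_[s1, s2]                    # true sources, shape (2000, 2)

# Observed data = linear mixture of the sources
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])
X = S @ A.T

# ICA estimates components that are maximally statistically independent
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)         # estimated independent components
```

Note that ICA recovers the sources only up to sign, scale, and ordering; each estimated component should correlate strongly with one of the true sources.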