If your laptop has any setup issues, please work with us to resolve them by Thursday. If your laptop has not yet been checked, you should come early on Thursday, or just walk through the setup checklist yourself (and let us know you have done so).
Resources:
For a useful look at the different types of data scientists, read Analyzing the Analyzers (32 pages).
For some thoughts on what it's like to be a data scientist, read these short posts from Win-Vector and Datascope Analytics.
Review the code from the beginner and intermediate Python workshops. If you don't feel comfortable with any of the content (excluding the "requests" and "APIs" sections), you should spend some time this weekend practicing Python:
Introduction to Python does a great job explaining Python essentials and includes tons of example code.
If you like learning from a book, Python for Informatics has useful chapters on strings, lists, and dictionaries.
If you prefer interactive exercises, try these lessons from Codecademy: "Python Lists and Dictionaries" and "A Day at the Supermarket".
If you've already mastered these topics and want more of a challenge, try solving Python Challenge number 1 (decoding a message) and send me your code in Slack.
Markdown Cheatsheet provides a thorough set of Markdown examples with concise explanations. GitHub's Mastering Markdown is a simpler and more attractive guide, but is less comprehensive.
Command Line Resources:
If you want to go much deeper into the command line, Data Science at the Command Line is a great book. The companion website provides installation instructions for a "data science toolbox" (a virtual machine with many more command line tools), as well as a long reference guide to popular command line tools.
If you want to do more at the command line with CSV files, try out csvkit, which can be installed via pip.
Lesson on file reading with airline safety data (code, data, article); a minimal file-reading sketch follows this list.
Data cleaning exercise
Walkthrough of Python homework with Chipotle data (code, data, article)
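If you want a starting point for the file reading lesson and the Chipotle homework (which should be done without Pandas), here is a minimal sketch using Python's built-in csv module. The file path assumes the DAT8 data directory layout; adjust it to wherever your data lives.

    import csv

    # read a tab-separated file into a list of lists (the path is a placeholder)
    with open('data/chipotle.tsv') as f:
        file_nested_list = [row for row in csv.reader(f, delimiter='\t')]

    # separate the header row from the data rows
    header = file_nested_list[0]
    data = file_nested_list[1:]
    print(header)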
Homework:
Complete the Python homework assignment with the Chipotle data, add a commented Python script to your GitHub repo, and submit a link using the homework submission form. You have until Tuesday (9/1) to complete this assignment. (Note: Pandas, which is covered in class 4, should not be used for this assignment.)
PEP 8 is Python's "classic" style guide, and is worth a read if you want to write readable code that is consistent with the rest of the Python community.
If you're not using Anaconda, install the Jupyter Notebook (formerly known as the IPython Notebook) using pip. (The Jupyter or IPython Notebook is included with Anaconda.)
If you want to go really deep into Pandas (and NumPy), read the book Python for Data Analysis, written by the creator of Pandas.
This notebook demonstrates the different types of joins in Pandas, for when you need to figure out how to merge two DataFrames; a short merge and pivot sketch follows this list.
This is a nice, short tutorial on pivot tables in Pandas.
For working with geospatial data in Python, GeoPandas looks promising. This tutorial uses GeoPandas (and scikit-learn) to build a "linguistic street map" of Singapore.
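To complement the joins notebook and the pivot table tutorial above, here is a minimal sketch of merging two DataFrames and building a pivot table in Pandas. The DataFrames are toy examples invented for illustration.

    import pandas as pd

    # two toy DataFrames that share a 'city' column
    weather = pd.DataFrame({'city': ['Austin', 'Dallas'], 'temp': [61, 55]})
    visits = pd.DataFrame({'city': ['Austin', 'Austin', 'Dallas'],
                           'month': ['Jan', 'Feb', 'Jan'],
                           'visitors': [139, 237, 456]})

    # inner join on the shared column (try how='left' or how='outer' as well)
    merged = pd.merge(visits, weather, on='city', how='inner')

    # pivot table: average visitors for each city/month combination
    pivot = merged.pivot_table(values='visitors', index='city',
                               columns='month', aggfunc='mean')
    print(pivot)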
Visualization Resources:
Watch Look at Your Data (18 minutes) for an excellent example of why visualization is useful for understanding your data.
For more on Pandas plotting, read this notebook or the visualization page from the official Pandas documentation.
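As a quick reference for the Pandas plotting resources above, here is a minimal sketch. The file name and column names are assumptions based on the drinks dataset; substitute your own DataFrame and numeric columns.

    import pandas as pd
    import matplotlib.pyplot as plt

    # file and column names below are placeholders; substitute your own
    df = pd.read_csv('data/drinks.csv')

    # histogram of a single numeric column
    df.beer_servings.plot(kind='hist', bins=20, title='Beer Servings')
    plt.xlabel('Servings')
    plt.show()

    # scatter plot of two numeric columns
    df.plot(kind='scatter', x='beer_servings', y='wine_servings', alpha=0.3)
    plt.show()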
Optional: Complete the bonus exercise listed in the human learning notebook. It will take the place of any one homework you miss, past or future! This is due on Tuesday (9/8).
If you're not using Anaconda, install requests and Beautiful Soup 4 using pip. (Both of these packages are included with Anaconda.)
For a more in-depth introduction to machine learning, read section 2.1 (14 pages) of Hastie and Tibshirani's excellent book, An Introduction to Statistical Learning. (It's a free PDF download!)
Real-World Active Learning is a readable and thorough introduction to "active learning", a variation of machine learning in which humans label only the most "important" observations.
Optional: Complete the homework exercise listed in the web scraping code. It will take the place of any one homework you miss, past or future! This is due on Tuesday (9/15).
Optional: If you're not using Anaconda, install Seaborn using pip. If you're using Anaconda, install Seaborn by running conda install seaborn at the command line. (Note that some students in past courses have had problems with Anaconda after installing Seaborn.)
API Resources:
This Python script to query the U.S. Census API was created by a former DAT student. It's a bit more complicated than the example we used in class, but it's very well commented and may provide a useful framework for writing your own code to query APIs. (A bare-bones request template follows this list.)
Mashape and Apigee allow you to explore tons of different APIs. Alternatively, a Python API wrapper is available for many popular APIs.
The Data Science Toolkit is a collection of location-based and text-related APIs.
Microsoft's Face Detection API, which powers How-Old.net, is a great example of how a machine learning API can be leveraged to produce a compelling web application.
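If you want a template for querying an API from Python, here is a minimal sketch using the requests library. The URL and parameters are hypothetical placeholders; substitute those of the API you're actually querying.

    import requests

    # base URL and parameters are hypothetical placeholders
    url = 'http://api.example.com/v1/data'
    params = {'q': 'washington dc', 'format': 'json'}

    response = requests.get(url, params=params)
    response.raise_for_status()    # raise an error for a bad status code
    data = response.json()         # parse the JSON response into a dict
    print(data)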
Web Scraping Resources:
The Beautiful Soup documentation is incredibly thorough, but is hard to use as a reference guide. However, the section on specifying a parser may be helpful if Beautiful Soup appears to be parsing a page incorrectly; a short parser-specification sketch follows this list.
For more Beautiful Soup examples and tutorials, see Web Scraping 101 with Python, a former DAT student's well-commented notebook on scraping Craigslist, this notebook from Stanford's Text As Data course, and this notebook and associated video from Harvard's Data Science course.
For a much longer web scraping tutorial covering Beautiful Soup, lxml, XPath, and Selenium, watch Web Scraping with Python (3 hours 23 minutes) from PyCon 2014. The slides and code are also available.
For more complex web scraping projects, Scrapy is a popular Python application framework for building web scrapers. It has excellent documentation, and here's a tutorial with detailed slides and code.
robotstxt.org has a concise explanation of how to write (and read) the robots.txt file.
import.io and Kimono claim to allow you to scrape websites without writing any code.
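Here is a minimal sketch of specifying a parser explicitly when creating a Beautiful Soup object, which can help when a page appears to be parsed incorrectly. The URL is just a convenient test page.

    import requests
    from bs4 import BeautifulSoup

    r = requests.get('http://www.example.com')

    # specify the parser explicitly: 'html.parser' is built in, while
    # 'lxml' and 'html5lib' must be installed separately via pip
    b = BeautifulSoup(r.text, 'html.parser')
    print(b.find('h1').text)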
This lecture on Image Classification shows how KNN could be used for detecting similar images, and also touches on topics we will cover in future classes (hyperparameter tuning and cross-validation).
Model evaluation using train/test split (notebook); a minimal train/test split sketch follows this list.
Exploring the scikit-learn documentation: module reference, user guide, class and function documentation
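Here is a minimal train/test split sketch in the spirit of the notebook above. The iris dataset and KNN are stand-ins for your own data and model; note that train_test_split moved to sklearn.model_selection in later versions of scikit-learn.

    from sklearn.datasets import load_iris
    from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn import metrics

    # load a built-in dataset as a stand-in for your own data
    iris = load_iris()
    X, y = iris.data, iris.target

    # hold out 30% of the rows as a testing set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

    # train on the training set, then evaluate on the unseen testing set
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print(metrics.accuracy_score(y_test, y_pred))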
Homework:
Watch Data science in Python (35 minutes) for an introduction to linear regression (and a review of other course content), or at the very least, read through the associated notebook.
For another explanation of training error versus testing error, the bias-variance tradeoff, and train/test split (also known as the "validation set approach"), watch Hastie and Tibshirani's video on estimating prediction error (12 minutes, starting at 2:34).
Software development skills for data scientists discusses the importance of writing functions and proper code comments (among other skills), which are highly useful for creating a reproducible analysis.
Your first project presentation is on Tuesday (9/22)! Please submit a link to your project repository (with slides, code, data, and visualizations) by 6pm on Tuesday.
For a brief introduction to confidence intervals, hypothesis testing, p-values, and R-squared, as well as a comparison between scikit-learn code and Statsmodels code, read my DAT7 lesson on linear regression. (A miniature version of that comparison follows this list.)
Earlier this year, a major scientific journal banned the use of p-values:
Scientific American has a nice summary of the ban.
This response to the ban in Nature argues that "decisions that are made earlier in data analysis have a much greater impact on results".
Andrew Gelman has a readable paper in which he argues that "it's easy to find a p < .05 comparison even if nothing is going on, if you look hard enough".
Science Isn't Broken includes a neat tool that allows you to "p-hack" your way to "statistically significant" results.
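To see the scikit-learn versus Statsmodels comparison from the linear regression lesson in miniature, here is a sketch. The DataFrame and its column names are toy examples invented for illustration.

    import pandas as pd
    import statsmodels.formula.api as smf
    from sklearn.linear_model import LinearRegression

    # toy data; substitute your own DataFrame
    df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [2.1, 3.9, 6.2, 8.1, 9.8]})

    # Statsmodels: focused on inference (coefficients, p-values, R-squared)
    lm = smf.ols(formula='y ~ x', data=df).fit()
    print(lm.params)
    print(lm.pvalues)
    print(lm.rsquared)

    # scikit-learn: focused on prediction
    linreg = LinearRegression()
    linreg.fit(df[['x']], df.y)
    print(linreg.intercept_)
    print(linreg.coef_)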
For a math-ier explanation of logistic regression, watch the first seven videos (71 minutes) from week 3 of Andrew Ng's machine learning course, or read the related lecture notes compiled by a student.
For more on interpreting logistic regression coefficients, read this excellent guide by UCLA's IDRE and these lecture notes from the University of New Mexico.
The scikit-learn documentation has a nice explanation of what it means for a predicted probability to be calibrated; a short predicted-probability sketch follows this list.
Supervised learning superstitions cheat sheet is a very nice comparison of four classifiers we cover in the course (logistic regression, decision trees, KNN, Naive Bayes) and one classifier we do not cover (Support Vector Machines).
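Here is a minimal sketch of examining predicted probabilities from logistic regression in scikit-learn, with the iris dataset standing in for your own data.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    iris = load_iris()
    X, y = iris.data, iris.target

    logreg = LogisticRegression()
    logreg.fit(X, y)

    # predict_proba returns one probability per class for each observation;
    # the coefficients can be interpreted as log-odds (see the guides above)
    print(logreg.predict_proba(X[:5]))
    print(logreg.coef_)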
Confirm that you have TextBlob installed by running import textblob from within your preferred Python environment. If it's not installed, run pip install textblob at the command line (not from within Python).
When applying Naive Bayes classification to a dataset with continuous features, it is better to use GaussianNB rather than MultinomialNB. This notebook compares their performance on such a dataset. Wikipedia has a short description of Gaussian Naive Bayes, as well as an excellent example of its usage. (A short comparison sketch follows this list.)
These slides from the University of Maryland provide more mathematical details on both logistic regression and Naive Bayes, and also explain how Naive Bayes is actually a "special case" of logistic regression.
Andrew Ng has a paper comparing the performance of logistic regression and Naive Bayes across a variety of datasets.
If you enjoyed Paul Graham's article, you can read his follow-up article on how he improved his spam filter and this related paper about state-of-the-art spam filtering in 2004.
Yelp has found that Naive Bayes is more effective than Mechanical Turks at categorizing businesses.
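Here is a minimal sketch comparing GaussianNB and MultinomialNB on a dataset with continuous features. The iris dataset stands in for your own data; since its features happen to be non-negative, MultinomialNB will run, though GaussianNB is the better fit.

    from sklearn.datasets import load_iris
    from sklearn.naive_bayes import GaussianNB, MultinomialNB
    from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions

    iris = load_iris()
    X, y = iris.data, iris.target

    # GaussianNB models each continuous feature as a normal distribution
    print(cross_val_score(GaussianNB(), X, y, cv=5).mean())

    # MultinomialNB is designed for counts (such as word counts in text)
    print(cross_val_score(MultinomialNB(), X, y, cv=5).mean())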
Create a Kaggle account, join the competition using the invitation link, download the sample submission, and then submit the sample submission (which will require SMS account verification).
Homework:
Your draft paper is due on Thursday (10/8)! Please submit a link to your project repository (with paper, code, data, and visualizations) before class.
Watch Kaggle: How it Works (4 minutes) for a brief overview of the Kaggle platform.
Download the competition files, move them to the DAT8/data directory, and make sure you can open the CSV files using Pandas. If you have any problems opening the files, you probably need to turn off real-time virus scanning (especially Microsoft Security Essentials).
Optional: Come up with some theories about which features might be relevant to predicting the response, and then explore the data to see if those theories appear to be true.
Optional: Watch my project presentation video (16 minutes) for a tour of the end-to-end machine learning process for a Kaggle competition, including feature engineering. (Or, just read through the slides.)
NLP Resources:
If you want to learn a lot more NLP, check out the excellent video lectures and slides from this Coursera course (which is no longer being offered).
You will be assigned to review the project drafts of two of your peers. You have until Tuesday (10/20) to provide them with feedback, according to the peer review guidelines.
Download and install Graphviz, which will allow you to visualize decision trees in scikit-learn. (A short visualization sketch follows this list.)
Windows users should also add Graphviz to their path: go to Control Panel, then System, then Advanced System Settings, then Environment Variables. Under system variables, edit "Path" to include the path to the "bin" folder, such as: C:\Program Files (x86)\Graphviz2.38\bin
Optional: Keep working on our Kaggle competition! You can make up to 5 submissions per day, and the competition doesn't close until 6:30pm ET on Tuesday 10/27 (class 21).
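Once Graphviz is installed, here is a minimal sketch of visualizing a fitted decision tree, with the iris dataset standing in for your own data.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_graphviz

    iris = load_iris()
    treeclf = DecisionTreeClassifier(max_depth=3, random_state=1)
    treeclf.fit(iris.data, iris.target)

    # write a Graphviz file, then convert it to an image at the command line:
    #   dot -Tpng tree.dot -o tree.png
    export_graphviz(treeclf, out_file='tree.dot',
                    feature_names=iris.feature_names)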
Resources:
Specialist Knowledge Is Useless and Unhelpful is a brief interview with Jeremy Howard (past president of Kaggle) in which he argues that data science skills are much more important than domain expertise for creating effective predictive models.
Learning from the best is an excellent blog post covering top tips from Kaggle Masters on how to do well on Kaggle.
Feature Engineering Without Domain Expertise (17 minutes), a talk by Kaggle Master Nick Kridler, provides some simple advice about how to iterate quickly and where to spend your time during a Kaggle competition.
scikit-learn's documentation on decision trees includes a nice overview of trees as well as tips for proper usage.
For a more thorough introduction to decision trees, read section 4.3 (23 pages) of Introduction to Data Mining. (Chapter 4 is available as a free download.)
The Science of Singing Along contains a neat regression tree (page 136) for predicting the percentage of an audience at a music venue that will sing along to a pop song.
Decision trees are common in the medical field for differential diagnosis, such as this classification tree for identifying psychosis.
scikit-learn's documentation on ensemble methods covers both "averaging methods" (such as bagging and Random Forests) and "boosting methods" (such as AdaBoost and Gradient Tree Boosting); a minimal Random Forest sketch follows this list.
MLWave's Kaggle Ensembling Guide is very thorough and shows the many different ways that ensembling can take place.
Browse the excellent solution paper from the winner of Kaggle's CrowdFlower competition for an example of the work and insight required to win a Kaggle competition.
Optional: Watch these two excellent (and related) videos from Caltech's Learning From Data course: bias-variance tradeoff (15 minutes) and regularization (8 minutes).
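Here is a minimal sketch of an ensemble model in scikit-learn, using a Random Forest as one example of the "averaging methods" described in the documentation above. The iris dataset stands in for your own data.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions

    iris = load_iris()
    X, y = iris.data, iris.target

    # a Random Forest averages many decision trees, each trained on a
    # bootstrap sample of the rows and a random subset of the features
    rfclf = RandomForestClassifier(n_estimators=100, random_state=1)
    print(cross_val_score(rfclf, X, y, cv=5).mean())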
scikit-learn Resources:
This is a longer example of feature scaling in scikit-learn, with additional discussion of the types of scaling you can use; a minimal scaling sketch follows this list.
Practical Data Science in Python is a long and well-written notebook that uses a few advanced scikit-learn features: pipelining, plotting a learning curve, and pickling a model.
Sebastian Raschka has a number of excellent resources for scikit-learn users, including a repository of tutorials and examples, a library of machine learning tools and extensions, a new book, and a semi-active blog.
scikit-learn has an incredibly active mailing list that is often much more useful than Stack Overflow for researching functions and asking questions.
If you forget how to use a particular scikit-learn function that we have used in class, don't forget that this repository is fully searchable!
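Here is a minimal sketch of feature scaling done properly: fit the scaler on the training set only, then apply the same transformation to both sets. The toy arrays stand in for your own data.

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # toy training and testing sets; substitute your own
    X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
    X_test = np.array([[1.5, 250.0]])

    scaler = StandardScaler()
    scaler.fit(X_train)                       # learn mean and std from training data only
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)  # apply the same transformation to the test set
    print(X_test_scaled)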
Clustering Resources:
For a very thorough introduction to clustering, read chapter 8 (69 pages) of Introduction to Data Mining (available as a free download), or browse through the chapter 8 slides. (A minimal K-means sketch in scikit-learn follows this list.)
This PowerPoint presentation from Columbia's Data Mining class provides a good introduction to clustering, including hierarchical clustering and alternative distance metrics.
The K-modes algorithm can be used for clustering datasets of categorical features without converting them to numerical values. Here is a Python implementation.
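Here is a minimal sketch of K-means clustering in scikit-learn, as a concrete companion to the clustering readings above. The iris features stand in for your own data; in practice, remember to scale your features before clustering.

    from sklearn.datasets import load_iris
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    iris = load_iris()
    X_scaled = StandardScaler().fit_transform(iris.data)

    # group the observations into 3 clusters
    km = KMeans(n_clusters=3, random_state=1)
    km.fit(X_scaled)
    print(km.labels_)           # cluster assignment for each observation
    print(km.cluster_centers_)  # coordinates of each cluster center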
For more details on lasso regression, read Tibshirani's original paper.
For a math-ier explanation of regularization, watch the last four videos (30 minutes) from week 3 of Andrew Ng's machine learning course, or read the related lecture notes compiled by a student.
There are some special considerations when using dummy encoding for categorical features with a regularized model. This Cross Validated Q&A debates whether the dummy variables should be standardized (along with the rest of the features), and a comment on this blog post recommends that the baseline level should not be dropped.
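Here is a minimal sketch of dummy encoding followed by a regularized model, reflecting the considerations above. The DataFrame is a toy example; whether to standardize the dummy variables or drop the baseline level is exactly the judgment call debated in the links.

    import pandas as pd
    from sklearn.linear_model import Lasso

    # toy data with one categorical and one numeric feature
    df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue', 'red'],
                       'size': [1.0, 2.0, 3.0, 4.0, 5.0],
                       'price': [10.0, 12.0, 15.0, 18.0, 14.0]})

    # create dummy variables; by default all levels are kept, which
    # some argue is preferable with regularized models
    X = pd.get_dummies(df[['color', 'size']], columns=['color'])
    y = df.price

    lasso = Lasso(alpha=0.1)
    lasso.fit(X, y)
    print(dict(zip(X.columns, lasso.coef_)))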
Regular Expressions Resources:
Google's Python Class includes an excellent introductory lesson on regular expressions (which also has an associated video).
Python for Informatics has a nice chapter on regular expressions. (If you want to run the examples, you'll need to download mbox.txt and mbox-short.txt.)
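Here is a minimal sketch of the core regular expression workflow in Python, in the spirit of the chapter above. The sample text and patterns are toy examples; to run the book's own examples, use mbox-short.txt.

    import re

    text = '''From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
    From louis@media.berkeley.edu Fri Jan  4 18:10:48 2008'''

    # find all email addresses: one or more non-whitespace characters
    # on either side of a literal '@'
    emails = re.findall(r'\S+@\S+', text)
    print(emails)

    # search for a pattern and extract a group: the hour of each timestamp
    for line in text.split('\n'):
        match = re.search(r' (\d\d):\d\d:\d\d', line)
        if match:
            print(match.group(1))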
This GA slide deck provides a brief introduction to databases and SQL. The Python script from that lesson demonstrates basic SQL queries, as well as how to connect to a SQLite database from Python and how to query it using Pandas. (A bare-bones version of that workflow follows this list.)
The repository for this SQL Bootcamp contains an extremely well-commented SQL script that is suitable for walking through on your own.
This GA notebook provides a shorter introduction to databases and SQL that helpfully contrasts SQL queries with Pandas syntax.
w3schools has a sample database that allows you to practice SQL from your browser. Similarly, Kaggle allows you to query a large SQLite database of Reddit Comments using their online "Scripts" application.
If you want to go deeper into databases and SQL, Stanford has a well-respected series of 14 mini-courses.
Blaze is a Python package enabling you to use Pandas-like syntax to query data living in a variety of data storage systems.
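Here is a minimal sketch of connecting to a SQLite database from Python and querying it with Pandas, as the GA lesson above demonstrates. The database file and table name are placeholders; point them at a real database.

    import sqlite3
    import pandas as pd

    # connect to a SQLite database file (the name is a placeholder)
    conn = sqlite3.connect('example.db')

    # run a SQL query and get the results back as a DataFrame
    df = pd.read_sql('SELECT * FROM orders LIMIT 10', conn)
    print(df.head())

    conn.close()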
Recommendation Systems Resources:
This GA slide deck provides a brief introduction to recommendation systems, and the Python script from that lesson demonstrates how to build a simple recommender. (A toy recommender sketch follows this list.)
Chapter 9 of Mining of Massive Datasets (36 pages) is a more thorough introduction to recommendation systems.
Chapters 2 through 4 of A Programmer's Guide to Data Mining (165 pages) provide a friendlier introduction, with lots of Python code and exercises.
The Netflix Prize was the famous competition for improving Netflix's recommendation system by 10%, and many useful articles have been written about it.
The People Inside Your Machine (23 minutes) is a Planet Money podcast episode about how Amazon Mechanical Turks can assist with recommendation engines (and machine learning in general).
Coursera has a course on recommendation systems, if you want to go even deeper into the material.
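To make the idea of a simple recommender concrete, here is a sketch of item-based recommendation using rating correlations in Pandas. The ratings DataFrame is a toy example; real systems use far more data and more robust similarity measures.

    import pandas as pd

    # toy ratings: rows are users, columns are items
    ratings = pd.DataFrame({'item_a': [5, 4, 1, 1],
                            'item_b': [4, 5, 2, 1],
                            'item_c': [1, 2, 5, 4]},
                           index=['user1', 'user2', 'user3', 'user4'])

    # similarity between items, measured by correlation across users
    similarity = ratings.corr()

    # recommend the items most similar to one a user already likes
    print(similarity['item_a'].drop('item_a').sort_values(ascending=False))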