This tutorial offers a basic introduction to practicing data science. We’ll walk through several typical projects that range from conceptualization to acquiring data, to analyzing and visualizing it, to drawing conclusions. We assume familiarity with the command line and the ability to use libraries and code.
Topics covered include:
Data acquisition and cleaning
Building practical data storage, analysis, and production systems
Joseph Adler has years of experience working with many popular data mining packages, including databases (Oracle, PostgreSQL, and MS Access), statistical analysis tools (SAS, SPSS, S-Plus, and R), and data mining tools (SAS Enterprise Miner, Insightful Miner, Oracle Data Mining, Weka, and SPSS Clementine). He is currently leading a project at Verisign to pick a data mining package for enterprise deployment.
Hilary Mason is the lead scientist at bit.ly, where she is finding sense in vast data sets. She is a former computer science professor with a background in machine learning and data mining, has published numerous academic papers, and regularly releases code on her personal site, www.hilarymason.com. She has discovered two new species, loves to bake cookies, and asks way too many questions.
Drew Conway is a PhD candidate in Politics at NYU. He studies international relations, conflict, and terrorism using the tools of mathematics, statistics, and computer science in an attempt to gain a deeper understanding of these phenomena. His academic curiosity is informed by his years as an analyst in the U.S. intelligence and defense communities.
Jake Hofman is a member of the Human Social Dynamics group at Yahoo! Research. His work involves data-driven modeling of social data, focusing on applications of machine learning and statistical inference to large-scale data. He holds a B.S. in Electrical Engineering from Boston University and a Ph.D. in Physics from Columbia University.