Sarah Reonomy OSCON 2014 ANALYZING DATA WITH PYTHON
Data scientist at Reonomy University of Michigan graduate NYC Python organizer PyGotham organizer ABOUT ME
Bird’s-eye overview: not a comprehensive explanation of these tools! Take data from start to finish Preprocessing: Pandas Analysis: scikit-learn Analysis: NLTK Data pipeline: MRJob Visualization: matplotlib What next? ABOUT THIS TALK
So many tools Preprocessing, analysis, statistics, machine learning, natural language processing, network analysis, visualization, scalability Community support “Easy” language to learn Both a scripting and production-ready language WHY PYTHON?
How to find the best tool(s)? The 90/10 rule Simple is better than complex FROM POINT A TO POINT…X?
Available resources Documentation, tutorials, books, videos Ease of use (with a grain of salt) Community support and continuous development Widely used WHY I CHOSE THESE TOOLS
The importance of data preprocessing AKA wrangling, munging, manipulating, and so on Preprocessing is also getting to know your data Missing values? Categorical/continuous? Distribution? PREPROCESSING
Data analysis and modeling Similar to R and Excel Easy-to-use data structures DataFrame Data wrangling tools Merging, pivoting, etc PANDAS
Keep everything in Python Community support/resources Use for preprocessing File I/O, cleaning, manipulation, etc. Combinable with other modules NumPy, SciPy, statsmodels, matplotlib PANDAS
File I/O PANDAS
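The code from this slide is not in the transcript, so here is a minimal sketch of reading and writing CSV data with Pandas. The column names and values are made up for illustration, and an in-memory buffer stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a CSV file on disk.
csv_text = """name,borough,units,price
A,Manhattan,10,500000
B,Brooklyn,,350000
C,Queens,4,
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Write the DataFrame back out; index=False leaves out the row index.
out = io.StringIO()
df.to_csv(out, index=False)
```

With a real file, the same calls would take a path string in place of the buffers.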
Finding missing values PANDAS
Removing missing values PANDAS
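A small sketch of how finding and removing missing values might look. The DataFrame below is hypothetical; only `isnull`, `any`, `sum`, and `dropna` are real Pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries (NaN).
df = pd.DataFrame({
    "borough": ["Manhattan", "Brooklyn", "Queens", "Bronx"],
    "units": [10, np.nan, 4, 8],
    "price": [500000, 350000, np.nan, 275000],
})

# Finding: count missing values in each column.
missing_per_column = df.isnull().sum()

# Finding: select rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]

# Removing: drop every row that has any missing value...
complete_rows = df.dropna()

# ...or drop only rows where a specific column is missing.
has_price = df.dropna(subset=["price"])
```

Whether to drop rows or fill them in (for example with `fillna`) depends on the data and the analysis.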
Pivoting PANDAS
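A minimal pivoting sketch, assuming long-format data with one row per (borough, year) pair. The column names and numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical long-format data: one row per borough and year.
sales = pd.DataFrame({
    "borough": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn"],
    "year": [2012, 2013, 2012, 2013],
    "price": [500, 550, 300, 330],
})

# Reshape to wide format: one row per borough, one column per year.
wide = sales.pivot(index="borough", columns="year", values="price")
```

If the index/column pairs are not unique, `pivot_table` with an aggregation function is the usual alternative.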
Other things Statistical methods Merge/join like SQL Time series Has some visualization functionality PANDAS
Application of algorithms that learn from examples Representation and generalization Useful in everyday life Especially useful in data analysis MACHINE LEARNING
Supervised learning Classification and regression Unsupervised learning Clustering and dimensionality reduction MACHINE LEARNING
Machine learning module Open-source Built-in datasets Good resources for learning SCIKIT-LEARN
Scikit-learn: your data has to be numeric (categorical values must be encoded as numbers first) Here’s what one observation/label looks like: SCIKIT-LEARN
Transform categorical values/labels SCIKIT-LEARN
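One way to transform categorical labels into numbers is scikit-learn's `LabelEncoder`, sketched below. The label values are hypothetical, and the original slide may have used a different approach. `LabelEncoder` is meant for target labels; for categorical input features, an encoder such as `OneHotEncoder` is usually the better fit.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical labels.
labels = ["condo", "co-op", "condo", "rental"]

encoder = LabelEncoder()

# fit_transform learns the sorted set of classes and maps each to an integer.
encoded = encoder.fit_transform(labels)

# inverse_transform maps the integers back to the original strings.
decoded = encoder.inverse_transform(encoded)
```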
Classification SCIKIT-LEARN
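The classification code from the slides is not in the transcript. As a stand-in, here is a sketch using one of scikit-learn's built-in datasets (iris) and a k-nearest-neighbors classifier. The choice of dataset, classifier, and parameters is an assumption, not necessarily what the talk showed.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Built-in dataset: 150 flowers, 4 numeric features, 3 classes.
iris = load_iris()

# Hold out a quarter of the data to test the fitted model.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

# Fit on the training split only.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# Predict labels for unseen observations and measure accuracy.
predictions = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
```

Most scikit-learn estimators share this `fit` / `predict` / `score` pattern, so swapping in another classifier changes little else.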
Other things Very comprehensive collection of machine learning algorithms Preprocessing tools Methods for testing the accuracy of your model SCIKIT-LEARN
Concerned with interactions between computers and human languages Derive meaning from text Many NLP algorithms are based on machine learning NATURAL LANGUAGE PROCESSING
Natural Language Toolkit Access to over 50 corpora Corpus: body of text NLP tools Stemming, tokenizing, etc. Resources for learning NLTK
Stopword removal NLTK
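A sketch of stopword removal. NLTK's English stopword list has to be downloaded once (`nltk.download("stopwords")`), so the code falls back to a tiny hand-written set when the list is unavailable. That fallback set and the sample sentence are made up for illustration.

```python
# Try NLTK's stopword list; fall back to a small illustrative set
# if NLTK or its stopwords corpus is not installed.
try:
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words("english"))
except (ImportError, LookupError):
    stop_words = {"the", "a", "an", "and", "was", "to", "her", "of", "in", "is"}

# Hypothetical sentence, lowercased and split on whitespace.
text = "miss taylor was the first to wed and her father was happy"
tokens = text.split()

# Keep only the tokens that are not stopwords.
filtered = [word for word in tokens if word not in stop_words]
```

A real pipeline would normally use a proper tokenizer (NLTK provides several) instead of `str.split`.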
Stemming NLTK
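A minimal stemming sketch using NLTK's `PorterStemmer`, which needs no extra data downloads. The word list is hypothetical, and the talk may have used a different stemmer.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical words to reduce to their stems.
words = ["wedding", "fathers", "running", "counted"]

# stem() strips common suffixes; the result is not always a dictionary word.
stems = [stemmer.stem(word) for word in words]
```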
Other things Lemmatizing, tokenization, tagging, parse trees Classification Chunking Sentence structure NLTK
Data that takes too long to process on your machine Not “big data” but larger data Solution: MapReduce! Processing large datasets with a parallel, distributed algorithm Map step Reduce step PROCESSING LARGE DATA
Map step Takes series of key/value pairs Ex. Word counts: break line into words, return word and count within line Reduce step Once for each unique key: iterates through values associated with that key Ex. Word counts: returns word and sum of all counts PROCESSING LARGE DATA
Write MapReduce jobs in Python Test code locally without installing Hadoop Lots of thorough documentation A few things to know Keep everything in one class MRJob program in a separate file Output to new file if doing something like word counts MRJOB
Stemmed file
Line 1: (‘miss’, 2), (‘taylor’, 1)
Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1)
And so on…
MRJOB
Map
Line 1: (‘miss’, 2), (‘taylor’, 1)
Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1)
Line 3: (‘first’, 1), (‘wed’, 1)
Line 4: (‘father’, 1)
Line 5: (‘father’, 1)
Reduce
(‘miss’, 2) (‘taylor’, 2) (‘first’, 2) (‘wed’, 2) (‘father’, 2)
MRJOB
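The map and reduce steps can be sketched in plain Python, without MRJob or Hadoop, to show what each step does. The five input lines are hypothetical stemmed text chosen to reproduce the pairs on the slide. The function names are illustrative and are not part of any library.

```python
from collections import Counter, defaultdict

# Hypothetical stemmed lines that yield the key/value pairs on the slide.
lines = [
    "miss taylor miss",
    "taylor first wed",
    "first wed",
    "father",
    "father",
]


def map_step(line):
    """Break a line into words; return (word, count within line) pairs."""
    return list(Counter(line.split()).items())


def reduce_step(word, counts):
    """Called once per unique word; sums all counts for that word."""
    return word, sum(counts)


# Group mapped values by key (a framework does this between map and reduce).
grouped = defaultdict(list)
for line in lines:
    for word, count in map_step(line):
        grouped[word].append(count)

totals = dict(reduce_step(word, counts) for word, counts in grouped.items())
```

In a real MapReduce framework, the map calls run in parallel across machines and the grouping happens automatically before the reduce step.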
Let’s count all words in the Gutenberg file Map step MRJOB
Reduce (and run) step MRJOB
Results Mapped counts reduced Key/val pairs MRJOB
Other things Run on Hadoop clusters Can write highly complex jobs Works with Amazon Elastic MapReduce (EMR) MRJOB
The “final step” Conveying your results in a meaningful way Literally see what’s going on DATA VISUALIZATION
2D visualization library Very VERY widely used Wide variety of plots Easy to feed in results from other modules (like Pandas, scikit-learn, NumPy, SciPy, etc) MATPLOTLIB
Remember this? MATPLOTLIB
Bar chart of distribution MATPLOTLIB
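A sketch of a bar chart with matplotlib. The categories and counts are invented, since the data behind the original chart is not in the transcript. The `Agg` backend is selected so the example also runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Hypothetical distribution: number of observations per category.
categories = ["condo", "co-op", "rental"]
counts = [12, 7, 21]

fig, ax = plt.subplots()
bars = ax.bar(categories, counts)
ax.set_xlabel("Category")
ax.set_ylabel("Number of observations")
ax.set_title("Distribution by category")

# Save to an in-memory buffer; pass a filename to write a PNG to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```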
Let’s graph our word count frequencies (Hint: It’s a power law distribution!) MATPLOTLIB
High frequency of low numbers, low frequency of high numbers MATPLOTLIB
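To illustrate "high frequency of low numbers, low frequency of high numbers", the sketch below counts how many words share each word count and plots that on log-log axes, where a power law appears roughly as a straight line. The word counts are hypothetical, not the Gutenberg results from the talk.

```python
import io
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Hypothetical word counts, e.g. the output of a word-count job.
word_counts = {
    "the": 40, "and": 5, "of": 5,
    "taylor": 2, "wed": 2, "father": 2,
    "miss": 1, "first": 1, "happy": 1, "house": 1, "walk": 1, "letter": 1,
}

# How many words occur exactly n times?
freq_of_counts = Counter(word_counts.values())
xs = sorted(freq_of_counts)
ys = [freq_of_counts[x] for x in xs]

fig, ax = plt.subplots()
ax.loglog(xs, ys, marker="o", linestyle="none")
ax.set_xlabel("Word count")
ax.set_ylabel("Number of words with that count")

buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```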
Other things Many different kinds of graphs Customizable Time series MATPLOTLIB
Phew! Which tool to choose depends on your needs Workflow: Preprocess Analyze Visualize WHAT NEXT?
Pandas scikit-learn NLTK MRJob matplotlib RESOURCES
Twitter LinkedIn NYC Python CONTACT ME!
Questions? THE END!