Published by Tyler Barry Cooper. Modified over 9 years ago.
1
ANALYZING DATA WITH PYTHON: Sarah Guido (@sarah_guido), Reonomy, OSCON 2014
2
ABOUT ME: Data scientist at Reonomy; University of Michigan graduate; NYC Python organizer; PyGotham organizer
3
ABOUT THIS TALK: Bird’s-eye overview, not a comprehensive explanation of these tools! Take data from start to finish: preprocessing (Pandas); analysis (scikit-learn); analysis (nltk); data pipeline (MRjob); visualization (matplotlib); what next?
4
WHY PYTHON? So many tools (preprocessing, analysis, statistics, machine learning, natural language processing, network analysis, visualization, scalability); community support; “easy” language to learn; both a scripting and production-ready language
5
FROM POINT A TO POINT…X? How to find the best tool(s)? The 90/10 rule; simple is better than complex
6
WHY I CHOSE THESE TOOLS: Available resources (documentation, tutorials, books, videos); ease of use (with a grain of salt); community support and continuous development; widely used
7
PREPROCESSING: The importance of data preprocessing (AKA wrangling, munging, manipulating, and so on); preprocessing is also getting to know your data (missing values? categorical/continuous? distribution?)
8
PANDAS: Data analysis and modeling; similar to R and Excel; easy-to-use data structures (DataFrame); data wrangling tools (merging, pivoting, etc.)
9
PANDAS: Keep everything in Python; community support/resources; use for preprocessing (file I/O, cleaning, manipulation, etc.); combinable with other modules (NumPy, SciPy, statsmodels, matplotlib)
10
PANDAS: File I/O
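The code shown on this slide did not survive in the transcript. As a rough sketch of what pandas file I/O can look like, the example below reads CSV data with `pd.read_csv` and writes it back with `DataFrame.to_csv`. The column names and values are invented for illustration, and an in-memory buffer stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """name,borough,price
A,Manhattan,100
B,Brooklyn,
C,Queens,80
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# to_csv returns the CSV text when no path is given;
# pass a path such as "cleaned.csv" to write a file instead.
out = df.to_csv(index=False)

print(df.head())
```

The empty `price` field in the second row is read as `NaN`, which is what the missing-value slides work with.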
11
PANDAS: Finding missing values
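One common way to find missing values in pandas, sketched here on a small made-up DataFrame (the column names and values are assumptions, not the talk's data):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "borough": ["Manhattan", "Brooklyn", None, "Queens"],
    "price": [100.0, np.nan, 75.0, 80.0],
})

# Boolean mask: True wherever a value is missing.
missing_mask = df.isnull()

# Number of missing values per column.
missing_per_column = missing_mask.sum()

# Rows that contain at least one missing value.
rows_with_missing = df[missing_mask.any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```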
12
PANDAS: Removing missing values
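A minimal sketch of removing missing values with `DataFrame.dropna`, with `fillna` shown as an alternative when dropping rows would lose too much data. The DataFrame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "borough": ["Manhattan", "Brooklyn", None, "Queens"],
    "price": [100.0, np.nan, 75.0, 80.0],
})

# Drop every row that has any missing value.
complete_rows = df.dropna()

# Drop rows only when a specific column is missing.
rows_with_price = df.dropna(subset=["price"])

# Alternative to dropping: fill missing prices with the column mean.
filled = df.fillna({"price": df["price"].mean()})

print(complete_rows)
```

Whether to drop or fill depends on the dataset; the choice here is only illustrative.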
13
PANDAS: Pivoting
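Pivoting reshapes long-format rows into a table with one row per index value and one column per category. A small sketch using `DataFrame.pivot_table`, with invented data:

```python
import pandas as pd

# Hypothetical long-format data: one row per (borough, year) observation.
sales = pd.DataFrame({
    "borough": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn"],
    "year": [2013, 2014, 2013, 2014],
    "price": [100, 120, 70, 90],
})

# One row per borough, one column per year, mean price in each cell.
table = sales.pivot_table(
    index="borough", columns="year", values="price", aggfunc="mean"
)

print(table)
```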
14
PANDAS (other things): statistical methods; merge/join like SQL; time series; some visualization functionality
15
MACHINE LEARNING: Application of algorithms that learn from examples; representation and generalization; useful in everyday life; especially useful in data analysis
16
MACHINE LEARNING: Supervised learning (classification and regression); unsupervised learning (clustering and dimensionality reduction)
17
SCIKIT-LEARN: Machine learning module; open-source; built-in datasets; good resources for learning
18
SCIKIT-LEARN: Your data has to be continuous. Here’s what one observation/label looks like:
19
SCIKIT-LEARN: Transform categorical values/labels
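The slide's code is not in the transcript. One way scikit-learn can turn categorical labels into numbers is `LabelEncoder`, sketched below with made-up label values; the talk may have used a different transformer.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical labels.
labels = ["condo", "co-op", "rental", "condo"]

encoder = LabelEncoder()

# Learn the set of classes and map each label to an integer code.
encoded = encoder.fit_transform(labels)

# Map the integer codes back to the original strings.
decoded = encoder.inverse_transform(encoded)

print(list(encoder.classes_))
print(list(encoded))
```

`LabelEncoder` assigns codes in sorted order of the class names, so the integers carry no meaning beyond identity.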
20
SCIKIT-LEARN: Classification
21
SCIKIT-LEARN: Classification
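As a stand-in for the classification code lost from these slides, here is a minimal sketch of the usual scikit-learn workflow: split the data, fit a classifier, and score it on held-out data. The built-in iris dataset and the k-nearest-neighbors model are assumptions chosen for brevity, not necessarily what the talk used.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# One of scikit-learn's built-in datasets: numeric features, integer labels.
iris = load_iris()

# Hold out 30% of the observations for testing.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

# Fit a k-nearest-neighbors classifier on the training split.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# Predict labels for unseen data and measure accuracy.
predictions = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)

print("accuracy:", accuracy)
```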
22
SCIKIT-LEARN (other things): very comprehensive collection of machine learning algorithms; preprocessing tools; methods for testing the accuracy of your model
23
NATURAL LANGUAGE PROCESSING: Concerned with interactions between computers and human languages; derive meaning from text; many NLP algorithms are based on machine learning
24
NLTK: Natural Language ToolKit; access to over 50 corpora (corpus: body of text); NLP tools (stemming, tokenizing, etc.); resources for learning
25
NLTK: Stopword removal
26
NLTK: Stopword removal
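The idea behind stopword removal can be sketched in a few lines. To keep the example self-contained, it uses a tiny hand-picked stopword set, which is an assumption for illustration. With NLTK, the list would typically come from `nltk.corpus.stopwords.words('english')` after downloading the stopwords corpus.

```python
# Tiny hand-picked stopword set (an assumption for illustration only);
# NLTK's English stopword corpus is much larger.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "was", "in"}


def remove_stopwords(text):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOPWORDS]


# Hypothetical input sentence.
filtered = remove_stopwords("Miss Taylor was the first to wed")
print(filtered)
```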
27
NLTK: Stemming
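Stemming reduces words to a common root so that variants are counted together. A minimal sketch with NLTK's `PorterStemmer`, which needs no corpus download; the choice of stemmer and the word list are assumptions, since the slide's code is missing.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical tokens, e.g. the output of stopword removal.
words = ["running", "weddings", "fathers"]

# Reduce each word to its stem.
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

Stems are not always dictionary words, which is the trade-off against lemmatizing.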
28
NLTK (other things): lemmatizing, tokenization, tagging, parse trees; classification; chunking; sentence structure
29
PROCESSING LARGE DATA: Data that takes too long to process on your machine; not “big data” but larger data; solution: MapReduce! Processing large datasets with a parallel, distributed algorithm (map step, reduce step)
30
PROCESSING LARGE DATA: Map step: takes a series of key/value pairs (e.g. word counts: break line into words, return word and count within line). Reduce step: runs once for each unique key and iterates through the values associated with that key (e.g. word counts: returns word and sum of all counts)
31
MRJOB: Write MapReduce jobs in Python; test code locally without installing Hadoop; lots of thorough documentation. A few things to know: keep everything in one class; put the MRJob program in a separate file; output to a new file if doing something like word counts
32
MRJOB: Stemmed file. Line 1: (‘miss’, 2), (‘taylor’, 1); Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1); and so on…
33
MRJOB: Map. Line 1: (‘miss’, 2), (‘taylor’, 1); Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1); Line 3: (‘first’, 1), (‘wed’, 1); Line 4: (‘father’, 1); Line 5: (‘father’, 1). Reduce: (‘miss’, 2), (‘taylor’, 2), (‘first’, 2), (‘wed’, 2), (‘father’, 2)
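The map and reduce steps in this worked example can be sketched in plain Python, without any MapReduce framework. The input lines below are a reconstruction chosen to reproduce the pairs on the slide (an assumption; the actual stemmed file is not in the transcript), and the grouping loop stands in for the shuffle that a framework would perform between the two steps.

```python
from collections import Counter, defaultdict

# Hypothetical stemmed lines that yield the (word, count) pairs on the slide.
lines = [
    "miss miss taylor",
    "taylor first wed",
    "first wed",
    "father",
    "father",
]


def map_step(line):
    """Return (word, count within this line) pairs for one line."""
    return list(Counter(line.split()).items())


def reduce_step(word, counts):
    """Return the word with the sum of all its per-line counts."""
    return word, sum(counts)


# Group mapped values by key, as a framework would between map and reduce.
grouped = defaultdict(list)
for line in lines:
    for word, count in map_step(line):
        grouped[word].append(count)

# Reduce runs once per unique key.
totals = dict(reduce_step(word, counts) for word, counts in grouped.items())
print(totals)
```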
34
MRJOB: Let’s count all words in the Gutenberg file. Map step
35
MRJOB: Reduce (and run) step
36
MRJOB: Results: mapped counts reduced; key/val pairs
37
MRJOB (other things): run on Hadoop clusters; can write highly complex jobs; works with Elasticsearch
38
DATA VISUALIZATION: The “final step”; conveying your results in a meaningful way; literally see what’s going on
39
MATPLOTLIB: 2D visualization library; very VERY widely used; wide variety of plots; easy to feed in results from other modules (like Pandas, scikit-learn, NumPy, SciPy, etc.)
40
MATPLOTLIB: Remember this?
41
MATPLOTLIB: Bar chart of distribution
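The chart itself is not in the transcript, so the data it showed is unknown. As a sketch of how a bar chart of a distribution can be drawn with matplotlib, the example below plots made-up category counts; the labels, values, and axis names are all assumptions.

```python
import io

import matplotlib

matplotlib.use("Agg")  # render without needing a display
import matplotlib.pyplot as plt

# Hypothetical category counts, e.g. from a pandas value_counts() call.
categories = ["A", "B", "C"]
counts = [12, 30, 7]

fig, ax = plt.subplots()
bars = ax.bar(categories, counts)
ax.set_xlabel("Category")
ax.set_ylabel("Count")
ax.set_title("Distribution of categories")

# Save to an in-memory buffer; pass a filename to write an image file instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```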
42
MATPLOTLIB: Let’s graph our word count frequencies (Hint: it’s a power law distribution!)
43
MATPLOTLIB: High frequency of low numbers, low frequency of high numbers
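One way to see this shape is to count how many words occur once, twice, and so on, then plot those tallies on log-log axes, where a power law appears roughly as a straight line. The word counts below are invented for illustration; in the talk they would come from the MRJob output.

```python
from collections import Counter

import matplotlib

matplotlib.use("Agg")  # render without needing a display
import matplotlib.pyplot as plt

# Hypothetical per-word counts: many rare words, few frequent ones.
word_counts = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 8]

# How many words occur exactly n times.
count_of_counts = Counter(word_counts)
occurrences = sorted(count_of_counts)
num_words = [count_of_counts[n] for n in occurrences]

fig, ax = plt.subplots()
ax.loglog(occurrences, num_words, marker="o", linestyle="none")
ax.set_xlabel("Times a word occurs")
ax.set_ylabel("Number of words")
```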
44
MATPLOTLIB (other things): many different kinds of graphs; customizable; time series
45
WHAT NEXT? Phew! Which tool to choose depends on your needs. Workflow: preprocess, analyze, visualize
46
RESOURCES: Pandas (http://pandas.pydata.org/), scikit-learn (http://scikit-learn.org/), NLTK (http://www.nltk.org/), MRJob (http://mrjob.readthedocs.org/), matplotlib (http://matplotlib.org/)
47
CONTACT ME! Twitter (@sarah_guido), LinkedIn (https://www.linkedin.com/in/sarahguido), NYC Python (http://www.meetup.com/nycpython/)
48
THE END! Questions?