Sarah Guido, Reonomy, OSCON 2014: ANALYZING DATA WITH PYTHON


1 Sarah Guido @sarah_guido Reonomy OSCON 2014 ANALYZING DATA WITH PYTHON

2  Data scientist at Reonomy  University of Michigan graduate  NYC Python organizer  PyGotham organizer ABOUT ME

3  Bird’s-eye overview: not comprehensive explanation of these tools!  Take data from start-to-finish  Preprocessing: Pandas  Analysis: scikit-learn  Analysis: nltk  Data pipeline: MRjob  Visualization: matplotlib  What next? ABOUT THIS TALK

4  So many tools  Preprocessing, analysis, statistics, machine learning, natural language processing, network analysis, visualization, scalability  Community support  “Easy” language to learn  Both a scripting and production-ready language WHY PYTHON?

5  How to find the best tool(s)?  The 90/10 rule  Simple is better than complex FROM POINT A TO POINT…X?

6  Available resources  Documentation, tutorials, books, videos  Ease of use (with a grain of salt)  Community support and continuous development  Widely used WHY I CHOSE THESE TOOLS

7  The importance of data preprocessing  AKA wrangling, munging, manipulating, and so on  Preprocessing is also getting to know your data  Missing values? Categorical/continuous? Distribution? PREPROCESSING

8  Data analysis and modeling  Similar to R and Excel  Easy-to-use data structures  DataFrame  Data wrangling tools  Merging, pivoting, etc PANDAS

9  Keep everything in Python  Community support/resources  Use for preprocessing  File I/O, cleaning, manipulation, etc.  Combinable with other modules  NumPy, SciPy, statsmodels, matplotlib PANDAS

10  File I/O PANDAS
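
The code shown on this slide did not survive in the transcript. A minimal sketch of Pandas file I/O, assuming a small made-up CSV (the column names and values are hypothetical, and `io.StringIO` stands in for a file on disk):

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice pass a file path to read_csv.
csv_text = """name,borough,units
A,Manhattan,10
B,Brooklyn,
C,Queens,7
"""

# Parse the CSV into a DataFrame; the empty field becomes NaN.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())

# Serialize back to CSV text (pass a path instead to write a file).
out = df.to_csv(index=False)
```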

11  Finding missing values PANDAS
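
One way to find missing values, sketched on a hypothetical DataFrame (not the data used in the talk):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "borough": ["Manhattan", "Brooklyn", None, "Queens"],
    "units": [10, np.nan, 7, 3],
})

# Boolean mask: True wherever a value is missing.
missing_mask = df.isnull()

# Number of missing values per column.
missing_per_column = missing_mask.sum()

# Rows that contain at least one missing value.
rows_with_missing = df[missing_mask.any(axis=1)]
print(missing_per_column)
```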

12  Removing missing values PANDAS
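
A sketch of the usual options for removing (or filling) missing values, again on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "borough": ["Manhattan", "Brooklyn", None, "Queens"],
    "units": [10, np.nan, 7, 3],
})

# Drop every row that has any missing value.
complete_rows = df.dropna()

# Drop rows only when a specific column is missing.
rows_with_units = df.dropna(subset=["units"])

# Alternative to dropping: fill missing values in one column.
filled = df.fillna({"units": 0})
```

Which option fits depends on the data: dropping is simplest, but filling keeps the other columns of a partially complete row.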

13  Pivoting PANDAS
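
Pivoting reshapes "long" rows into a "wide" table. A minimal sketch with made-up sales figures (all names and numbers are hypothetical):

```python
import pandas as pd

# Long format: one row per (borough, year) pair.
df = pd.DataFrame({
    "borough": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn"],
    "year": [2013, 2014, 2013, 2014],
    "sales": [5, 7, 3, 4],
})

# Wide format: one row per borough, one column per year.
wide = df.pivot(index="borough", columns="year", values="sales")
print(wide)
```

If an index/column pair can occur more than once, `pivot_table` with an aggregation function is the usual alternative.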

14  Other things  Statistical methods  Merge/join like SQL  Time series  Has some visualization functionality PANDAS

15  Application of algorithms that learn from examples  Representation and generalization  Useful in everyday life  Especially useful in data analysis MACHINE LEARNING

16  Supervised learning  Classification and regression  Unsupervised learning  Clustering and dimensionality reduction MACHINE LEARNING

17  Machine learning module  Open-source  Built-in datasets  Good resources for learning SCIKIT-LEARN

18  Scikit-learn: your data has to be numeric (categorical values must be encoded as numbers)  Here’s what one observation/label looks like: SCIKIT-LEARN

19  Transform categorical values/labels SCIKIT-LEARN
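
A sketch of transforming categorical labels into integers with scikit-learn's `LabelEncoder`. The label values are hypothetical, and the talk may have used a different transformer:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical labels.
labels = ["residential", "commercial", "residential", "industrial"]

encoder = LabelEncoder()
# Classes are sorted alphabetically, then mapped to 0..n-1.
encoded = encoder.fit_transform(labels)

# The mapping is reversible, which helps when reporting predictions.
decoded = encoder.inverse_transform(encoded)
print(list(encoder.classes_), list(encoded))
```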

20  Classification SCIKIT-LEARN

21  Classification SCIKIT-LEARN
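
The classification code from these slides is not in the transcript. As a stand-in, here is a minimal sketch using one of scikit-learn's built-in datasets (iris) and a k-nearest-neighbors classifier; the choice of dataset, classifier, and parameters is an assumption, not the talk's:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Built-in dataset: numeric features X, integer class labels y.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the observations to test the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)            # learn from labeled examples

predictions = clf.predict(X_test)    # predicted labels for unseen data
accuracy = clf.score(X_test, y_test) # fraction of correct predictions
print(accuracy)
```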

22  Other things  Very comprehensive collection of machine learning algorithms  Preprocessing tools  Methods for testing the accuracy of your model SCIKIT-LEARN

23  Concerned with interactions between computers and human languages  Derive meaning from text  Many NLP algorithms are based on machine learning NATURAL LANGUAGE PROCESSING

24  Natural Language ToolKit  Access to over 50 corpora  Corpus: body of text  NLP tools  Stemming, tokenizing, etc  Resources for learning NLTK

25  Stopword removal NLTK

26  Stopword removal NLTK
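
A sketch of stopword removal. NLTK ships an English stopword list in `nltk.corpus.stopwords`, but it has to be downloaded first (`nltk.download("stopwords")`), so this example falls back to a tiny hand-written set when the corpus is unavailable. The sentence and the fallback set are hypothetical:

```python
try:
    from nltk.corpus import stopwords
    STOPWORDS = set(stopwords.words("english"))
except (ImportError, LookupError):
    # Fallback when NLTK or its stopwords corpus is not installed.
    STOPWORDS = {"the", "of", "was", "a", "and", "to", "in"}

text = "The wedding of Miss Taylor was the first"

# Simple whitespace tokenization after lowercasing.
tokens = text.lower().split()

# Keep only the tokens that are not stopwords.
filtered = [token for token in tokens if token not in STOPWORDS]
print(filtered)
```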

27  Stemming NLTK
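
Stemming reduces words to a common root. A minimal sketch with NLTK's `PorterStemmer` (the talk may have used a different stemmer; the tokens are hypothetical):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

tokens = ["wedding", "missing", "running"]

# Each token is reduced to its stem by suffix-stripping rules.
stems = [stemmer.stem(token) for token in tokens]
print(stems)
```

Stems are not always dictionary words, which is the main difference from lemmatizing.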

28  Other things  Lemmatizing, tokenization, tagging, parse trees  Classification  Chunking  Sentence structure NLTK

29  Data that takes too long to process on your machine  Not “big data” but larger data  Solution: MapReduce!  Processing large datasets with a parallel, distributed algorithm  Map step  Reduce step PROCESSING LARGE DATA

30  Map step  Takes series of key/value pairs  Ex. Word counts: break line into words, return word and count within line  Reduce step  Once for each unique key: iterates through values associated with that key  Ex. Word counts: returns word and sum of all counts PROCESSING LARGE DATA

31  Write MapReduce jobs in Python  Test code locally without installing Hadoop  Lots of thorough documentation  A few things to know  Keep everything in one class  MRJob program in a separate file  Output to new file if doing something like word counts MRJOB

32  Stemmed file  Line 1: (‘miss’, 2), (‘taylor’, 1)  Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1)  And so on… MRJOB

33 Map
Line 1: (‘miss’, 2), (‘taylor’, 1)
Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1)
Line 3: (‘first’, 1), (‘wed’, 1)
Line 4: (‘father’, 1)
Line 5: (‘father’, 1)
Reduce
(‘miss’, 2)
(‘taylor’, 2)
(‘first’, 2)
(‘wed’, 2)
(‘father’, 2)
MRJOB
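
The map and reduce steps on this slide can be sketched in plain Python, without MRJob or Hadoop. The stemmed lines below are hypothetical, chosen only so that the per-line counts match the slide:

```python
from collections import Counter, defaultdict

# Hypothetical stemmed lines whose counts match the slide.
lines = [
    "miss taylor miss",
    "taylor first wed",
    "first wed",
    "father",
    "father",
]

def map_line(line):
    """Map step: emit (word, count within this line) pairs."""
    return list(Counter(line.split()).items())

def reduce_counts(word, counts):
    """Reduce step: sum all counts collected for one word."""
    return word, sum(counts)

# Shuffle: group the mapped counts by key (the word).
grouped = defaultdict(list)
for line in lines:
    for word, count in map_line(line):
        grouped[word].append(count)

# Reduce runs once per unique key.
totals = dict(reduce_counts(word, counts) for word, counts in grouped.items())
print(totals)
```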

34  Let’s count all words in the Gutenberg file  Map step MRJOB

35  Reduce (and run) step MRJOB

36  Results  Mapped counts reduced  Key/val pairs MRJOB

37  Other things  Run on Hadoop clusters  Can write highly complex jobs  Works with Elasticsearch MRJOB

38  The “final step”  Conveying your results in a meaningful way  Literally see what’s going on DATA VISUALIZATION

39  2D visualization library  Very VERY widely used  Wide variety of plots  Easy to feed in results from other modules (like Pandas, scikit-learn, NumPy, SciPy, etc) MATPLOTLIB

40  Remember this? MATPLOTLIB

41  Bar chart of distribution MATPLOTLIB
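
A sketch of a bar chart showing how labels are distributed. The labels are hypothetical, and the `Agg` backend is selected only so the example runs without a display:

```python
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt

# Hypothetical categorical labels.
labels = ["residential", "commercial", "residential",
          "industrial", "residential", "commercial"]

# Count how often each label occurs.
counts = Counter(labels)

fig, ax = plt.subplots()
ax.bar(list(counts.keys()), list(counts.values()))
ax.set_xlabel("label")
ax.set_ylabel("number of observations")
ax.set_title("Distribution of labels")
```

Call `fig.savefig(...)` to write the chart to a file, or `plt.show()` with an interactive backend.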

42  Let’s graph our word count frequencies  (Hint: It’s a power law distribution!) MATPLOTLIB

43  High frequency of low numbers, low frequency of high numbers MATPLOTLIB
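
A sketch of the same idea with made-up numbers: a histogram of per-word counts, where most words appear once or twice and only a few appear often. The counts are hypothetical, not the Gutenberg results from the talk:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt

# Hypothetical per-word counts: many rare words, a few frequent ones.
word_counts = [1] * 50 + [2] * 20 + [3] * 8 + [10] * 2 + [50]

fig, ax = plt.subplots()
# n holds the number of words that fall into each bin.
n, bins, patches = ax.hist(word_counts, bins=10)
ax.set_xlabel("times a word appears")
ax.set_ylabel("number of words")
ax.set_title("Word count frequencies")
```

With real text, plotting both axes on a log scale (`ax.set_xscale("log")`, `ax.set_yscale("log")`) usually makes the long tail easier to read.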

44  Other things  Many different kinds of graphs  Customizable  Time series MATPLOTLIB

45  Phew!  Which tool to choose depends on your needs  Workflow:  Preprocess  Analyze  Visualize WHAT NEXT?

46  Pandas  http://pandas.pydata.org/  scikit-learn  http://scikit-learn.org/  NLTK  http://www.nltk.org/  MRJob  http://mrjob.readthedocs.org/  matplotlib  http://matplotlib.org/ RESOURCES

47  Twitter  @sarah_guido  LinkedIn  https://www.linkedin.com/in/sarahguido  NYC Python  http://www.meetup.com/nycpython/ CONTACT ME!

48 Questions? THE END!

