Sarah Reonomy OSCON 2014 ANALYZING DATA WITH PYTHON.

Presentation transcript:

ABOUT ME
• Data scientist at Reonomy
• University of Michigan graduate
• NYC Python organizer
• PyGotham organizer

ABOUT THIS TALK
• Bird’s-eye overview: not a comprehensive explanation of these tools!
• Take data from start to finish
  ◦ Preprocessing: Pandas
  ◦ Analysis: scikit-learn
  ◦ Analysis: NLTK
  ◦ Data pipeline: MRJob
  ◦ Visualization: matplotlib
• What next?

WHY PYTHON?
• So many tools
  ◦ Preprocessing, analysis, statistics, machine learning, natural language processing, network analysis, visualization, scalability
• Community support
• “Easy” language to learn
• Both a scripting and production-ready language

FROM POINT A TO POINT…X?
• How to find the best tool(s)?
• The 90/10 rule
• Simple is better than complex

WHY I CHOSE THESE TOOLS
• Available resources
  ◦ Documentation, tutorials, books, videos
• Ease of use (with a grain of salt)
• Community support and continuous development
• Widely used

PREPROCESSING
• The importance of data preprocessing
  ◦ AKA wrangling, munging, manipulating, and so on
• Preprocessing is also getting to know your data
  ◦ Missing values? Categorical/continuous? Distribution?

PANDAS
• Data analysis and modeling
• Similar to R and Excel
• Easy-to-use data structures
  ◦ DataFrame
• Data wrangling tools
  ◦ Merging, pivoting, etc

PANDAS
• Keep everything in Python
• Community support/resources
• Use for preprocessing
  ◦ File I/O, cleaning, manipulation, etc
• Combinable with other modules
  ◦ NumPy, SciPy, statsmodels, matplotlib

PANDAS
• File I/O
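The code shown on this slide is not part of the transcript. As a minimal sketch of file I/O with Pandas, the snippet below reads a small comma-separated table and writes it back out. The column names and values are hypothetical, and an in-memory buffer stands in for a file on disk so the snippet runs as-is.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a path such as "data.csv".
csv_text = """address,borough,year_built,price
12 Main St,Brooklyn,1920,500000
48 Oak Ave,Queens,,350000
7 Pine Rd,Brooklyn,1985,
"""

# read_csv accepts a path or any file-like object and returns a DataFrame.
# Empty fields are read as missing values (NaN).
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())

# to_csv writes the DataFrame back out; index=False omits the row index column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
```

Passing a file name instead of the buffer to `read_csv` and `to_csv` works the same way.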

PANDAS
• Finding missing values
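One way to find missing values, sketched here on a small hypothetical DataFrame (the data is an assumption, not the dataset from the talk), is to count them per column and then pull out the affected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries.
df = pd.DataFrame({
    "borough": ["Brooklyn", "Queens", "Brooklyn"],
    "year_built": [1920, np.nan, 1985],
    "price": [500000, 350000, np.nan],
})

# Count the missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Select the rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```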

PANDAS
• Removing missing values
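A sketch of removing missing values, again on hypothetical data. `dropna` discards incomplete rows; `fillna` is shown as an alternative for cases where dropping rows would lose too much data. Whether the talk used either call is not recoverable from the transcript.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries.
df = pd.DataFrame({
    "borough": ["Brooklyn", "Queens", "Brooklyn"],
    "year_built": [1920, np.nan, 1985],
    "price": [500000, 350000, np.nan],
})

# Drop every row that has a missing value in any column.
complete_rows = df.dropna()

# Drop rows only when a specific column is missing.
priced_rows = df.dropna(subset=["price"])

# Alternative: keep the rows and fill the gap with the column median.
filled = df.fillna({"year_built": df["year_built"].median()})

print(complete_rows)
print(priced_rows)
print(filled)
```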

PANDAS
• Pivoting
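Pivoting reshapes long data into a grid, with one column's values as row labels and another's as column labels. The sketch below uses `pivot_table` on hypothetical sales records; the column names and numbers are assumptions for illustration.

```python
import pandas as pd

# Hypothetical long-format records: one row per sale.
sales = pd.DataFrame({
    "borough": ["Brooklyn", "Brooklyn", "Queens", "Queens", "Brooklyn"],
    "year": [2012, 2013, 2012, 2013, 2013],
    "price": [400000, 450000, 300000, 320000, 550000],
})

# One row per borough, one column per year, each cell the mean price.
pivoted = sales.pivot_table(
    index="borough", columns="year", values="price", aggfunc="mean"
)
print(pivoted)
```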

PANDAS
• Other things
  ◦ Statistical methods
  ◦ Merge/join like SQL
  ◦ Time series
  ◦ Has some visualization functionality

MACHINE LEARNING
• Application of algorithms that learn from examples
• Representation and generalization
• Useful in everyday life
• Especially useful in data analysis

MACHINE LEARNING
• Supervised learning
  ◦ Classification and regression
• Unsupervised learning
  ◦ Clustering and dimensionality reduction

SCIKIT-LEARN
• Machine learning module
• Open-source
• Built-in datasets
• Good resources for learning

SCIKIT-LEARN
• Scikit-learn: your data has to be numeric
• Here’s what one observation/label looks like:

SCIKIT-LEARN
• Transform categorical values/labels
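The transcript does not preserve how the talk transformed its categorical data. As one possible approach, scikit-learn's `LabelEncoder` maps class labels to integers and `OneHotEncoder` expands a categorical feature into indicator columns. The label and feature values below are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical class labels, one per observation.
labels = ["residential", "commercial", "residential", "industrial"]

# LabelEncoder assigns integers to the labels in sorted order.
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)
print(list(label_encoder.classes_), list(encoded_labels))

# Hypothetical categorical feature; the encoder expects a 2D input
# (rows are observations, columns are features).
boroughs = [["Brooklyn"], ["Queens"], ["Brooklyn"], ["Manhattan"]]

# OneHotEncoder returns a sparse matrix by default; toarray() makes it dense.
one_hot = OneHotEncoder().fit_transform(boroughs).toarray()
print(one_hot)
```

One-hot encoding avoids implying an order between categories, which a plain integer encoding of a feature would do.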

SCIKIT-LEARN
• Classification

SCIKIT-LEARN
• Classification
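The classifier used on these slides is not preserved in the transcript. The sketch below shows a generic supervised classification workflow on one of scikit-learn's built-in datasets: split the data, fit a model, predict, and measure accuracy on held-out observations. The choice of the iris dataset and of k-nearest neighbors is an assumption for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Built-in dataset: 150 flowers, 4 numeric features, 3 classes.
iris = load_iris()

# Hold out 30% of the observations to test the fitted model.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

# Fit the classifier on the training observations and their labels.
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

# Predict labels for unseen observations and measure accuracy.
predictions = classifier.predict(X_test)
accuracy = classifier.score(X_test, y_test)
print("Accuracy on held-out data:", accuracy)
```

Other scikit-learn classifiers expose the same `fit`, `predict`, and `score` methods, so swapping the algorithm only changes the line that constructs the classifier.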

SCIKIT-LEARN
• Other things
  ◦ Very comprehensive set of machine learning algorithms
  ◦ Preprocessing tools
  ◦ Methods for testing the accuracy of your model

NATURAL LANGUAGE PROCESSING
• Concerned with interactions between computers and human languages
• Derive meaning from text
• Many NLP algorithms are based on machine learning

NLTK
• Natural Language ToolKit
• Access to over 50 corpora
  ◦ Corpus: body of text
• NLP tools
  ◦ Stemming, tokenizing, etc
• Resources for learning

NLTK
• Stopword removal

NLTK
• Stopword removal
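A sketch of stopword removal with NLTK's English stopword corpus. The sample sentence is hypothetical, and plain whitespace splitting stands in for a real tokenizer. The corpus has to be downloaded once with `nltk.download("stopwords")`; if it is missing, the snippet falls back to a tiny hand-written list, which is an illustration aid and not part of NLTK.

```python
from nltk.corpus import stopwords

try:
    english_stopwords = set(stopwords.words("english"))
except LookupError:
    # Corpus not installed: use a small hand-written list for illustration only.
    english_stopwords = {"the", "a", "an", "of", "and", "to", "is", "was", "be", "in"}

# Hypothetical sample text, lowercased and split on whitespace.
text = "Miss Taylor was the first to be wed"
tokens = text.lower().split()

# Keep only the tokens that are not stopwords.
content_words = [token for token in tokens if token not in english_stopwords]
print(content_words)
```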

NLTK
• Stemming
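Stemming reduces inflected words to a common root so that they are counted together. The transcript does not say which stemmer the talk used; the sketch below assumes NLTK's `PorterStemmer`, which needs no corpus download. The word list is hypothetical.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical words to reduce to their stems.
words = ["running", "wedding", "fathers"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

A stem is not always a dictionary word; lemmatizing is the alternative when real words are needed.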

NLTK
• Other things
  ◦ Lemmatizing, tokenization, tagging, parse trees
  ◦ Classification
  ◦ Chunking
  ◦ Sentence structure

PROCESSING LARGE DATA
• Data that takes too long to process on your machine
  ◦ Not “big data” but larger data
• Solution: MapReduce!
  ◦ Processing large datasets with a parallel, distributed algorithm
  ◦ Map step
  ◦ Reduce step

PROCESSING LARGE DATA
• Map step
  ◦ Takes series of key/value pairs
  ◦ Ex. Word counts: break line into words, return word and count within line
• Reduce step
  ◦ Once for each unique key: iterates through values associated with that key
  ◦ Ex. Word counts: returns word and sum of all counts
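The two steps can be sketched in plain Python, without any MapReduce framework, to show the data flow. The function names and sample lines are hypothetical, and everything runs in a single process here; a real framework distributes the map and reduce calls across machines.

```python
from collections import defaultdict


def map_step(line):
    """Break a line into words and return (word, count within line) pairs."""
    counts = defaultdict(int)
    for word in line.lower().split():
        counts[word] += 1
    return list(counts.items())


def reduce_step(word, counts):
    """Called once per unique word: sum all counts emitted for that word."""
    return word, sum(counts)


def run_word_count(lines):
    # Shuffle phase: group the mapped values by key before reducing.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_step(line):
            grouped[word].append(count)
    return dict(reduce_step(word, counts) for word, counts in grouped.items())


# Hypothetical input: one string per line of text.
lines = ["miss miss taylor", "taylor first wed", "first wed"]
totals = run_word_count(lines)
print(totals)
```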

MRJOB
• Write MapReduce jobs in Python
• Test code locally without installing Hadoop
• Lots of thorough documentation
• A few things to know
  ◦ Keep everything in one class
  ◦ MRJob program in a separate file
  ◦ Output to new file if doing something like word counts

MRJOB
• Stemmed file
  ◦ Line 1: ('miss', 2), ('taylor', 1)
  ◦ Line 2: ('taylor', 1), ('first', 1), ('wed', 1)
  ◦ And so on…

MRJOB
Map
• Line 1: ('miss', 2), ('taylor', 1)
• Line 2: ('taylor', 1), ('first', 1), ('wed', 1)
• Line 3: ('first', 1), ('wed', 1)
• Line 4: ('father', 1)
• Line 5: ('father', 1)
Reduce
• ('miss', 2)
• ('taylor', 2)
• ('first', 2)
• ('wed', 2)
• ('father', 2)

MRJOB
• Let’s count all words in the Gutenberg file
• Map step

MRJOB
• Reduce (and run) step

MRJOB
• Results
  ◦ Mapped counts reduced
  ◦ Key/val pairs

MRJOB
• Other things
  ◦ Run on Hadoop clusters
  ◦ Can write highly complex jobs
  ◦ Works with Amazon Elastic MapReduce (EMR)

DATA VISUALIZATION
• The “final step”
• Conveying your results in a meaningful way
• Literally see what’s going on

MATPLOTLIB
• 2D visualization library
• Very VERY widely used
• Wide variety of plots
• Easy to feed in results from other modules (like Pandas, scikit-learn, NumPy, SciPy, etc)

MATPLOTLIB
• Remember this?

MATPLOTLIB
• Bar chart of distribution
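The chart from this slide is not preserved in the transcript. A minimal sketch of a bar chart of a distribution in matplotlib follows; the categories, counts, and axis labels are hypothetical. The non-interactive Agg backend and the in-memory buffer are used only so the snippet runs without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Hypothetical distribution: number of observations per category.
categories = ["Brooklyn", "Queens", "Manhattan"]
counts = [42, 27, 31]

fig, ax = plt.subplots()
bars = ax.bar(categories, counts)
ax.set_xlabel("Borough")
ax.set_ylabel("Number of properties")
ax.set_title("Distribution of properties by borough")

# Save to an in-memory buffer; pass a file name such as "chart.png" instead
# to write the image to disk.
image = io.BytesIO()
fig.savefig(image, format="png")
```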

MATPLOTLIB
• Let’s graph our word count frequencies
• (Hint: It’s a power law distribution!)

MATPLOTLIB
• High frequency of low numbers, low frequency of high numbers
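One way to see this pattern is to count how many words share each count value and plot the result on log-log axes, where a power law appears roughly as a straight line. The word counts below are hypothetical and far too few to show a real power law; they only illustrate the steps.

```python
import io
from collections import Counter

import matplotlib

matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Hypothetical word counts, such as a word-count job might produce.
word_counts = {
    "the": 120, "of": 60, "and": 40, "taylor": 3,
    "wed": 2, "father": 2, "first": 1, "miss": 1, "emma": 1,
}

# How many distinct words occur exactly n times, for each n.
count_frequency = Counter(word_counts.values())
xs = sorted(count_frequency)
ys = [count_frequency[x] for x in xs]

fig, ax = plt.subplots()
ax.loglog(xs, ys, marker="o", linestyle="none")
ax.set_xlabel("Word count")
ax.set_ylabel("Number of words with that count")

# Save to an in-memory buffer; a file name works as well.
image = io.BytesIO()
fig.savefig(image, format="png")
```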

MATPLOTLIB
• Other things
  ◦ Many different kinds of graphs
  ◦ Customizable
  ◦ Time series

WHAT NEXT?
• Phew!
• Which tool to choose depends on your needs
• Workflow:
  ◦ Preprocess
  ◦ Analyze
  ◦ Visualize

RESOURCES
• Pandas
• scikit-learn
• NLTK
• MRJob
• matplotlib

CONTACT ME!
• Twitter
• LinkedIn
• NYC Python

THE END!
Questions?