Text Analytics and Machine Learning Workshop: Machine Learning Session


Text Analytics and Machine Learning Workshop
Machine Learning Session
October 26, 2018
CPA Canada - Toronto, Canada

Machine Learning Session: Agenda
A little more on machine learning
Hands-on with WordStat 8:
  import text
  visualize term frequencies & associations
  pre-process text (stem, filter)
  create a term document matrix
Hands-on with RapidMiner:
  import a term document matrix
  develop and cross-validate several classification algorithms
  compare the accuracy using out-of-sample observations

The Machine Learning Process
Identify a task or prediction problem
Collect data
Prepare data
Identify “features” – X and Y variables
Train & cross-validate several models
Evaluate models – out-of-sample tests
Select & deploy model
Interpret, infer, predict, or decide
Re-evaluate and update the model

The Machine Learning Process
Tips:
Ask specific questions that can be answered with a yes/no, the name of a category/person/group, a number, or what action to take next.
Google “data-science-for-beginners-the-5-questions-data-science-answers” for Microsoft’s video suggestions for formulating questions.

The Machine Learning Process
For textual data, create a corpus:
  Collect documents from the Web etc.
  Save each as a string variable in a list
The “sample.csv” file downloaded for this workshop is a corpus of text.
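
The "each document as a string in a list" idea can be sketched in Python roughly as follows. This is only an illustration: the two rows are made up, and the column names `Text` and `AAA_label` are assumptions about how "sample.csv" is laid out, not something the slides confirm.

```python
import csv
import io

# Hypothetical stand-in for "sample.csv"; the real file's columns and rows may differ.
RAW_CSV = """Text,AAA_label
"The company corrected an accounting error in revenue.",error
"An investigation found intentional misstatement of expenses.",intentional
"""

def load_corpus(handle, text_column="Text", label_column="AAA_label"):
    """Return (documents, labels); each document is kept as one string in a list."""
    documents, labels = [], []
    for row in csv.DictReader(handle):
        documents.append(row[text_column])
        labels.append(row[label_column])
    return documents, labels

# For a real file, pass open("sample.csv", newline="", encoding="utf-8") instead.
documents, labels = load_corpus(io.StringIO(RAW_CSV))
print(len(documents), labels)
```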

WordStat 8 – import the sample text
Step 1: Start a new WordStat project
Step 2: Search and open the “sample.csv” file
Step 3: Click “save” to create a new project
Steps 4 and 5: Click “Analyze” to start the text analysis; click “next” then “import” on next 2 screens

WordStat 8: start with the simple interface
1. Select the “Text” box and click “Explore”. Three tabs (Frequencies, Phrases, and Topics) will be added beside the Data tab.
2. Click the “Frequencies” tab to view word frequencies and a word cloud.
We’ll switch to “Expert Mode” later … don’t click it for now or you’ll see different screens.

Challenge 1: can you discover …
How many times does the word “error” appear in the sample?
How many documents contain the word “investigation”?
What proportion of the restatements that correct unintentional errors, and of the restatements that correct intentional misstatements, contain each word?
  Hint: use the comparison panel at upper right and select the variable AAA_label.
Which company uses the infrequently occurring word “writedown”?
  Hint: check “leftover words” and click “frequency” to see infrequently used words; the three-horizontal-bar icon to the left of the Data tab allows drilling down to view “keyword-in-context”.
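
The counts asked for in Challenge 1 correspond to three different quantities: total occurrences of a word, number of documents containing it, and the share of documents in each class that contain it. A minimal Python sketch of the distinction, using three invented sentences and invented label values (the actual values of AAA_label are not given in the slides):

```python
from collections import Counter
import re

# Made-up mini corpus and labels, for illustration only.
docs = [
    "An error was found. The error was corrected.",
    "The investigation is ongoing.",
    "No error, no investigation.",
]
doc_labels = ["unintentional", "intentional", "unintentional"]

def tokens(text):
    """Lower-case the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

term_freq = Counter()  # total occurrences of each word
doc_freq = Counter()   # number of documents containing each word
for doc in docs:
    words = tokens(doc)
    term_freq.update(words)
    doc_freq.update(set(words))

def share_containing(word, label):
    """Proportion of documents with the given label that contain the word."""
    in_class = [d for d, l in zip(docs, doc_labels) if l == label]
    return sum(word in tokens(d) for d in in_class) / len(in_class)

print(term_freq["error"], doc_freq["error"])
print(share_containing("investigation", "unintentional"))
```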

The Machine Learning Process
Prepare & explore data: often 80% of process effort
“Clean”, transform, filter, aggregate, impute, structure
For text, this often includes:
  Remove noise
    Filter rare/common words
  Normalize
    Remove punctuation
    Convert to lower case
    Stem words
    Remove “stop words”
  Tokenize
    Split text strings
    Create term document matrix
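
The text preparation steps listed above can be sketched in plain Python. This is a rough illustration, not what WordStat does internally: the stop-word list is a tiny invented one, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer such as Porter's.

```python
import string

STOP_WORDS = {"the", "a", "an", "of", "and", "was", "to", "in"}  # tiny illustrative list

def crude_stem(word):
    """Strip a few common suffixes. A stand-in for a real stemmer, NOT Porter stemming."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lower-case, drop ASCII punctuation, split on whitespace, drop stop words, stem."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [crude_stem(w) for w in text.split() if w not in STOP_WORDS]

def term_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    tokenized = [preprocess(d) for d in documents]
    vocabulary = sorted({t for doc in tokenized for t in doc})
    matrix = [[doc.count(term) for term in vocabulary] for doc in tokenized]
    return vocabulary, matrix

documents = ["The company restated earnings.", "Earnings were restated."]
vocabulary, matrix = term_document_matrix(documents)
print(vocabulary)
print(matrix)
```

Each row of the matrix is one document and each column one word stem, which is the shape a classifier expects as its X variables.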

1. Click here to switch to “expert mode”: more advanced visualizations and analyses to preprocess text.
2. Later, we will come back to the “data” tab and click here to export a term document matrix.

Challenge 2: can you …
“Stem” words using English Porter stemming?
  Hint: click pre-processing.
Ignore words occurring in fewer than 10 documents or in more than 70% of the documents?
  Hint: click post-processing.
Determine how many word stems are “included” in subsequent analyses?
  Hint: click the Frequencies tab and check the bottom of the screen.
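
The rare/common word filter in Challenge 2 amounts to a document-frequency threshold on the columns of a term document matrix. A small sketch, assuming the matrix is a list of rows of counts; the function name `filter_terms` and the toy data are illustrative, and the toy example uses a minimum of 2 documents rather than 10 because it has only five documents.

```python
def filter_terms(vocabulary, matrix, min_docs=10, max_doc_share=0.70):
    """Keep terms found in at least min_docs documents and at most max_doc_share of them."""
    n_docs = len(matrix)
    keep = []
    for j in range(len(vocabulary)):
        doc_freq = sum(1 for row in matrix if row[j] > 0)
        if doc_freq >= min_docs and doc_freq / n_docs <= max_doc_share:
            keep.append(j)
    kept_vocabulary = [vocabulary[j] for j in keep]
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    return kept_vocabulary, kept_matrix

# Toy data: "audit" is in every document (too common), "rare" in only one (too rare).
toy_vocabulary = ["audit", "error", "rare"]
toy_matrix = [
    [1, 1, 0],
    [2, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
]
kept_vocabulary, kept_matrix = filter_terms(toy_vocabulary, toy_matrix, min_docs=2)
print(kept_vocabulary)
```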

The Machine Learning Process
Challenge 3: a gentle intro to one type of machine learning, classification
1. Click “classification”
2. Under classification options, select AAA_LABEL as the variable to predict and 10-fold cross-validation as the validation method
3. Click “Run”
4. Click “Learn and Test”
5. Observe Accuracy % and the number “incorrect” in the confusion matrix
Hand up when done … and help your neighbor (we’ll work together on the next step of this challenge)
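
To make "10-fold cross-validation", "accuracy", and "confusion matrix" concrete, here is a small self-contained Python sketch. It is not WordStat's algorithm: it uses a simple nearest-centroid classifier as a stand-in, round-robin folds without shuffling, and invented two-word count vectors in which one document is deliberately labeled against its pattern.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds by round-robin assignment (no shuffling)."""
    return [list(range(i, n, k)) for i in range(k)]

def train_centroids(X, y):
    """Average the feature vectors of each class."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, row):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[label]))
    return min(sorted(centroids), key=dist)

def cross_validate(X, y, k=10):
    """Return (accuracy, confusion), with confusion keyed by (actual, predicted)."""
    confusion = Counter()
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        centroids = train_centroids(train_X, train_y)
        for i in test_idx:
            confusion[(y[i], predict(centroids, X[i]))] += 1
    correct = sum(n for (actual, predicted), n in confusion.items() if actual == predicted)
    return correct / len(X), confusion

# Invented data: 20 documents, counts of two word stems, two hypothetical class labels.
X = [[3, 0], [0, 3]] * 10
y = ["error", "intentional"] * 10
X[0] = [0, 3]  # one document whose words look like the other class

accuracy, confusion = cross_validate(X, y, k=10)
print(accuracy, dict(confusion))
```

Every document is predicted exactly once, by a model that never saw it during training, so the off-diagonal entries of the confusion matrix are the "incorrect" count the challenge asks you to observe.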

So many different models to choose from!
Simple methods may underfit -> poor prediction
Complex methods may overfit -> not generalizable
Experiment to find the simplest one that predicts well enough

The Machine Learning Process
Export the term document matrix from WordStat:
1. Click the “Frequencies” tab
2. Click the small disc icon below “Frequencies”
3. Save the exported file to your Desktop
4. Start RapidMiner Studio and import the term document matrix saved to your Desktop (next slide).

Launch RapidMiner and import the term document matrix created with WordStat

Follow the prompts:
1. Click “Predict” to begin
2. When prompted, click the AAA_LABEL column to select the variable to predict
3. Click “next” after each subsequent choice to run several classification algorithms

Overview of results for sample data: We’ll “drill down” to see model details and open the RapidMiner detailed design for further modification.

The Machine Learning Process
Many models produce similar results. Model choice depends not only on the overall accuracy of out-of-sample tests but also on factors such as ease of interpretation and deployment in real-time systems (if applicable).
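
One way to act on this advice is to prefer the simplest model whose out-of-sample accuracy is close to the best. The sketch below is only an illustration of that rule of thumb: the model names, accuracies, complexity ranks, and the 0.02 tolerance are all invented, not results from the workshop data.

```python
def pick_model(results, tolerance=0.02):
    """results: list of (name, out_of_sample_accuracy, complexity_rank); lower rank = simpler.

    Returns the simplest model whose accuracy is within `tolerance` of the best one.
    """
    best = max(accuracy for _, accuracy, _ in results)
    candidates = [r for r in results if best - r[1] <= tolerance]
    return min(candidates, key=lambda r: r[2])[0]

# Hypothetical comparison table, for illustration only.
results = [
    ("naive_bayes", 0.84, 1),
    ("decision_tree", 0.80, 2),
    ("deep_network", 0.85, 3),
]
chosen = pick_model(results)
print(chosen)
```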

Interpretability
Ask yourself:
  Do I understand my data?
  Do I understand the model developed?
  Do I trust these answers?
Simpler models that are easier to interpret and deploy are often a wise choice.
“Unfortunately, the complexity that bestows the extraordinary predictive abilities on machine learning algorithms also makes the answers the algorithms produce hard to understand, and maybe even hard to trust.” Hall, Phan, and Ambati, “Ideas on interpreting machine learning,” O'Reilly Media, March 15, 2017.

A few final thoughts…
Tools are rapidly evolving
Currently Python and R are the most widely used
A team approach is best … hiring one data scientist won’t make it happen
Success depends on asking the right questions and hiring/training/retaining the best people
CPAs are well positioned to craft the right machine learning strategies and to ask the right questions … to plan investment in machine learning, to begin machine learning projects, and to interpret results