TagHelper: Basics Part 1 Carolyn Penstein Rosé Carnegie Mellon University Funded through the Pittsburgh Science of Learning Center and The Office of Naval Research, Cognitive and Neural Sciences Division

Outline  Setting up your data  Creating a trained model  Evaluating performance  Using a trained model  Overview of basic feature extraction from text

Setting Up Your Data

How do you know when you have coded enough data? What distinguishes Questions and Statements? Not all questions end in a question mark. Not all WH words occur in questions. "I" versus "you" is not a reliable predictor. You need to code enough to avoid learning rules that won't work.
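
A minimal sketch of why these cues are unreliable on their own: counting, on a tiny hand-coded sample, how often two surface cues fire on Questions versus Statements. The sample texts and cue lists are illustrative assumptions, not TagHelper's actual features. (Python is used here and in the sketches below as a stand-in; TagHelper itself is built on Weka in Java.)

    # Tiny hand-coded sample; texts and labels are invented for illustration.
    coded = [
        ("which is the answer?",           "Question"),
        ("is that the common denominator", "Question"),   # a question with no "?"
        ("what we got was 12.",            "Statement"),  # a statement with a wh-word
        ("ok sure",                        "Statement"),
    ]

    cues = [
        ("ends with ?", lambda t: t.strip().endswith("?")),
        ("has wh-word", lambda t: any(w in ("what", "which", "who", "why", "how")
                                      for w in t.lower().split())),
    ]

    for name, fires in cues:
        hits   = sum(fires(t) for t, label in coded if label == "Question")
        misses = sum(fires(t) for t, label in coded if label == "Statement")
        print(f"{name}: fires on {hits}/2 questions, misfires on {misses}/2 statements")

Each cue catches only some questions and misfires on some statements, which is why enough coded data is needed for the learner to weigh cues against each other.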

Creating a Trained Model

Training and Testing  Start TagHelper tools by double-clicking on the portal.bat icon in your TagHelperTools2 folder  You will then see the following tool palette  The idea is that you will train a prediction model on your coded data and then apply that model to uncoded data  Click on Train New Models

Loading a File First click on Add a File, then select a file.

Simplest Usage  Click “GO!”  TagHelper will use its default settings to train a model on your coded examples  It will use that model to assign codes to the uncoded examples

More Advanced Usage  The second option is to modify the default settings  You get to the options you can set by clicking on >> Options  After you finish that, click “GO!”

Output  You can find the output in the OUTPUT folder  There will be a text file named Eval_[name of coding dimension]_[name of input file].txt  This is a performance report  E.g., Eval_Code_SimpleExample.xls.txt  There will also be a file named [name of input file]_OUTPUT.xls  This is the coded output  E.g., SimpleExample_OUTPUT.xls

Using the Output file Prefix  If you use the Output file prefix, the text you enter will be prepended to the output files  There will be a text file named [prefix]_Eval_[name of coding dimension]_[name of input file].txt  E.g., Prefix1_Eval_Code_SimpleExample.xls.txt  There will also be a file named [prefix]_[name of input file]_OUTPUT.xls  E.g., Prefix1_SimpleExample_OUTPUT.xls

Evaluating Performance

Performance report  The performance report tells you:  What dataset was used  What the customization settings were  At the bottom of the file are reliability statistics and a confusion matrix that tells you which types of errors are being made
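
The slide above mentions reliability statistics and a confusion matrix; here is a hedged sketch of how such numbers are computed, using scikit-learn's metrics as stand-ins for the Weka-based code inside TagHelper (the label values are invented):

    from sklearn.metrics import cohen_kappa_score, confusion_matrix

    # Human codes vs. model predictions on held-out segments (illustrative values).
    actual    = ["Question", "Statement", "Statement", "Question", "Statement"]
    predicted = ["Question", "Statement", "Question",  "Question", "Statement"]

    # Cohen's kappa is the usual chance-corrected reliability statistic.
    print("kappa:", cohen_kappa_score(actual, predicted))

    # Rows = actual codes, columns = predicted codes: shows which errors are made.
    print(confusion_matrix(actual, predicted, labels=["Question", "Statement"]))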

Output File  The output file contains  The codes for each segment  Note that the segments that were already coded will retain their original code  The other segments will have their automatic predictions  The prediction column indicates the confidence of the prediction
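
A sketch of where a confidence value like the one in the prediction column can come from: class probabilities from the trained classifier. scikit-learn's predict_proba stands in here for TagHelper's Weka internals, and the training texts are invented examples.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    coded_texts  = ["which is the answer?", "the answer is 9.",
                    "what is a common denominator?", "ok sure"]
    coded_labels = ["Question", "Statement", "Question", "Statement"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(coded_texts, coded_labels)

    # For an uncoded segment, report the most probable code and its probability.
    probs = model.predict_proba(["you think the answer is 9?"])[0]
    best  = probs.argmax()
    print(model.classes_[best], f"confidence {probs[best]:.2f}")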

Using a Trained Model

Applying a Trained Model  Select a model file  Then select a testing file

Applying a Trained Model  Testing data should be set up with ? on uncoded examples  Click “GO!” to process the file
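
A minimal sketch of the same train-once, apply-later workflow. TagHelper saves models in its own format; joblib below is only a stand-in for persisting a trained model, and the "?" convention mirrors the slide: rows coded "?" are the ones the model fills in.

    import joblib
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Train on coded rows and persist the model (the file name is arbitrary).
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(["which is the answer?", "the answer is 9."],
              ["Question", "Statement"])
    joblib.dump(model, "code_model.joblib")

    # Later: reload the model and fill in only the rows coded "?".
    reloaded = joblib.load("code_model.joblib")
    rows = [("which flavor do you like", "?"), ("the answer is 12.", "Statement")]
    for text, code in rows:
        print(text, "->", reloaded.predict([text])[0] if code == "?" else code)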

Results

Overview of Basic Feature Extraction from Text

Customizations  To customize the settings:  Select the file  Click on Options

Setting the Language You can change the default language from English to German. Chinese requires an additional license from Academia Sinica in Taiwan.

Preparing to get a performance report You can decide whether you want TagHelper to prepare a performance report for you. (It runs faster when this is disabled.)

TagHelper Customizations  Typical classification algorithms  Naïve Bayes  SMO (Weka’s implementation of Support Vector Machines)  J48 (decision trees)  Rules of thumb:  SMO is state-of-the-art for text classification  J48 is best with small feature sets – also handles contingencies between features well  Naïve Bayes works well for models where decisions are made based on accumulating evidence rather than hard and fast rules
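
A hedged sketch contrasting the three algorithm families, using scikit-learn stand-ins: MultinomialNB for Naïve Bayes, LinearSVC for SMO (both train linear support vector machines), and DecisionTreeClassifier as a rough analogue of J48 (which implements C4.5). The toy corpus is invented; with data this small the scores mean nothing, only the workflow is the point.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC
    from sklearn.tree import DecisionTreeClassifier

    texts = ["which is the answer?", "what is the common denominator?",
             "do you multiply here?", "is 9 right?",
             "the answer is 9.", "you find the common denominator first.",
             "multiply both sides.", "ok sure."]
    labels = ["Question"] * 4 + ["Statement"] * 4

    # Each classifier sits behind the same bag-of-words feature extraction.
    for name, clf in [("Naive Bayes",  MultinomialNB()),
                      ("SVM (~ SMO)",  LinearSVC()),
                      ("Tree (~ J48)", DecisionTreeClassifier())]:
        pipe = make_pipeline(CountVectorizer(), clf)
        print(name, cross_val_score(pipe, texts, labels, cv=2).mean())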

TagHelper Customizations  Feature Space Design  Think like a computer!  Machine learning algorithms look for features that are good predictors, not features that are necessarily meaningful  Look for approximations  If you want to find questions, you don’t need to do a complete syntactic analysis  Look for question marks  Look for wh-terms that occur immediately before an auxiliary verb  Look for topics likely to be indicative of questions (if you’re talking about ice cream, and someone mentions flavor without mentioning a specific flavor, it might be a question)
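
The wh-before-auxiliary idea in this list is easy to approximate in a few lines. A sketch, where the wh-word and auxiliary lists are illustrative assumptions rather than TagHelper's actual feature definitions:

    WH  = {"what", "which", "who", "whom", "whose", "when", "where", "why", "how"}
    AUX = {"is", "are", "was", "were", "do", "does", "did",
           "can", "could", "will", "would", "should"}

    def question_cues(text):
        words = text.lower().rstrip("?.!").split()
        return {
            "ends_with_qmark": text.strip().endswith("?"),
            # A wh-word immediately before an auxiliary verb, e.g. "which is ..."
            "wh_before_aux": any(w in WH and nxt in AUX
                                 for w, nxt in zip(words, words[1:])),
        }

    print(question_cues("which is the answer"))         # wh_before_aux fires
    print(question_cues("you think the answer is 9?"))  # only the question mark fires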

TagHelper Customizations  Feature Space Design  Punctuation can be a “stand in” for mood  “you think the answer is 9?”  “you think the answer is 9.”  Bigrams capture simple lexical patterns  “common denominator” versus “common multiple”  POS bigrams capture stylistic information  “the answer which is …” vs “which is the answer”  Line length can be a proxy for explanation depth
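
A sketch of extracting these feature types. Word bigrams and line length need only scikit-learn and plain Python; for POS bigrams I assume nltk is installed with its 'averaged_perceptron_tagger' data downloaded (an assumption about setup, not a TagHelper detail):

    import nltk
    from sklearn.feature_extraction.text import CountVectorizer

    texts = ["the answer which is 9", "which is the answer"]

    # Word bigrams: simple lexical patterns like "common denominator".
    bigrams = CountVectorizer(ngram_range=(2, 2)).fit(texts)
    print(bigrams.get_feature_names_out())

    for t in texts:
        # POS bigrams capture style: the same words in a different order
        # produce different tag sequences.
        tags = [tag for _, tag in nltk.pos_tag(t.split())]
        print(t, "->", list(zip(tags, tags[1:])), "| line length:", len(t.split()))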

TagHelper Customizations  Feature Space Design  “Contains non-stop word” can be a predictor of whether a conversational contribution is contentful  “ok sure” versus “the common denominator”  “Remove stop words” removes some distracting features  Stemming allows some generalization  Multiple, multiply, multiplication  Removing rare features is a cheap form of feature selection  Features that only occur once or twice in the corpus won’t generalize, so they are a waste of time to include in the vector space
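
A sketch of these last options in scikit-learn/nltk terms: stop-word removal and a minimum document frequency mirror the “remove stop words” and “remove rare features” settings (the min_df threshold is an assumption), and a Porter stemmer shows how related word forms collapse together:

    from nltk.stem.porter import PorterStemmer
    from sklearn.feature_extraction.text import CountVectorizer

    # Stemming: related forms reduce to similar stems, allowing generalization.
    stemmer = PorterStemmer()
    print([stemmer.stem(w) for w in ["multiple", "multiply", "multiplication"]])

    docs = ["the common denominator", "a common multiple",
            "ok sure", "the common denominator again"]

    # stop_words drops distracting function words; min_df=2 drops features
    # that occur in fewer than two documents (a cheap form of feature selection).
    vec = CountVectorizer(stop_words="english", min_df=2).fit(docs)
    print(vec.vocabulary_)   # only "common" and "denominator" survive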