Hands-on Classification with Learning Based Java
Gourab Kundu
Adapted from a talk by Vivek Srikumar



Goals of this tutorial
At the end of these lectures, you will be able to:
1. Get started with Learning Based Java
2. Use a generic, black box text classifier for different applications, and write your own text classifier if needed
3. Understand how features can impact classifier performance, and add features to improve your application
4. Build a badge classifier based on character features

A Quick Recap
Given: examples (x, f(x)) of some unknown function f
Find: a good approximation of f
x provides some representation of the input
• The process of mapping a domain element into a representation is called feature extraction. (Hard; ill-understood; important)
• x ∈ {0,1}^n or x ∈ R^n
The target function (label)
• f(x) ∈ {-1,+1}: binary classification
• f(x) ∈ {1,2,3,...,k-1}: multi-class classification
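To make the representation x ∈ {0,1}^n concrete, here is a minimal Java sketch of feature extraction as a binary bag of words. The class name `BinaryBagOfWords`, the tiny vocabulary, and the whitespace tokenizer are all assumptions made for illustration; they are not part of LBJ or of the original talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: maps a document to a binary vector x in {0,1}^n. */
class BinaryBagOfWords {
    private final List<String> vocabulary;

    BinaryBagOfWords(List<String> vocabulary) {
        this.vocabulary = new ArrayList<>(vocabulary);
    }

    /** Returns x where x[i] = 1 if vocabulary word i occurs in the text, else 0. */
    int[] extract(String text) {
        // Assumption: lower-casing and splitting on whitespace is a good enough tokenizer here.
        List<String> tokens = Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+"));
        int[] x = new int[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            x[i] = tokens.contains(vocabulary.get(i)) ? 1 : 0;
        }
        return x;
    }

    public static void main(String[] args) {
        BinaryBagOfWords fx = new BinaryBagOfWords(Arrays.asList("save", "software", "meeting"));
        // Prints [1, 1, 0]: "save" and "software" occur, "meeting" does not.
        System.out.println(Arrays.toString(fx.extract("Save over 70 % on name brand software")));
    }
}
```

Here n is the vocabulary size, and choosing that vocabulary is exactly the "hard; ill-understood; important" part of the problem.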

What is text classification?
[Figure: a document is passed to a classifier (black box), which accepts one of some candidate labels (✓) and rejects the others (✗)]

Several applications fit this framework
• Spam detection
• Sentiment classification
What else could you do if you had such a black box system that can classify text?
Try to spend 30 seconds brainstorming

Outline of this session
• Getting started with LBJ
• Writing our first classifier: Spam/Ham
• Playing with features
• Looking inside the black box classifier for feature weights

LEARNING BASED JAVA Writing classifiers

What is Learning Based Java?
A modeling language for learning and inference
Supports
• Programming using learned models
• High level specification of features and constraints between classifiers
• Inference with constraints
• Different learning algorithms
The learning operator
• Classifiers are functions defined in terms of data
• Learning happens at compile time

What does LBJ do for you?
• Abstracts away the feature representation, learning and inference
• Allows you to write learning based programs
• Application developers can reason about the application at hand

Demo
A learning based program
First, we will write an application that assumes the existence of a black box classifier

SPAM DETECTION

Spam detection
Which of these (if any) are spam?

Subject: save over 70 % on name brand software
ppharmacy devote fink tungstate brown lexicon pawnshop crescent railroad distaff cytosine barium cain application elegy donnelly hydrochloride common embargo shakespearean bassett trustee nucleolus chicano narbonne telltale tagging swirly lank delphinus bragging bravery cornea asiatic susanne

Subject: please keep in touch
just like to say that it has been great meeting and working with you all. i will be leaving enron effective july 5 th to do investment banking in hong kong. i will initially be based in new york and will be moving to hong kong after a few months. do contact me when you are in the vicinity.

How do you know?

What do we need to build a classifier?
1. Annotated documents *
2. A feature representation of the documents
3. A learning algorithm
* Here we are dealing with supervised learning

Our first LBJ program

    /** A learned text classifier; its definition comes from data. */
    discrete TextClassifier(Document d) <-
    learn TextLabel
      using WordFeatures
      from new DocumentReader("data/spam/train")
      with SparseAveragedPerceptron {
        learningRate = 0.1;
        thickness = 3.5;
      }
      5 rounds
      testFrom new DocumentReader("data/spam/test")
    end

Annotations on the program:
• discrete TextClassifier: defines a classifier
• Document d: the object being classified
• TextLabel: the function being learned
• WordFeatures: the feature representation
• DocumentReader("data/spam/train"): the source of the training data
• SparseAveragedPerceptron: the learning algorithm
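To give a rough feel for what the `learningRate`, `thickness` and `5 rounds` settings control, the following is a simplified, self-contained Java sketch of an averaged perceptron with a thick separator. It is an illustration written for this transcript, not LBJ's `SparseAveragedPerceptron` implementation: the class `ThickAveragedPerceptron`, its methods, and the toy data are all hypothetical, and details such as the bias term are left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Simplified illustration only; NOT the LBJ implementation. */
class ThickAveragedPerceptron {
    private final double learningRate;
    private final double thickness;
    private final Map<String, Double> weights = new HashMap<>();
    private final Map<String, Double> weightSums = new HashMap<>();

    ThickAveragedPerceptron(double learningRate, double thickness) {
        this.learningRate = learningRate;
        this.thickness = thickness;
    }

    /** Dot product of a sparse binary feature set with a weight map. */
    private static double score(Set<String> features, Map<String, Double> w) {
        double total = 0.0;
        for (String f : features) {
            total += w.getOrDefault(f, 0.0);
        }
        return total;
    }

    /** Labels are +1 (spam) or -1 (ham). */
    void train(List<Set<String>> docs, List<Integer> labels, int rounds) {
        for (int r = 0; r < rounds; r++) {
            for (int i = 0; i < docs.size(); i++) {
                Set<String> x = docs.get(i);
                int y = labels.get(i);
                // Update on mistakes AND on correct predictions inside the thick margin.
                if (y * score(x, weights) <= thickness) {
                    for (String f : x) {
                        weights.merge(f, learningRate * y, Double::sum);
                    }
                }
                // Accumulate the current weights; their sum has the same sign as the average.
                for (Map.Entry<String, Double> e : weights.entrySet()) {
                    weightSums.merge(e.getKey(), e.getValue(), Double::sum);
                }
            }
        }
    }

    /** Predicts with the accumulated (averaged) weights. */
    int predict(Set<String> x) {
        return score(x, weightSums) >= 0 ? 1 : -1;
    }

    static Set<String> doc(String... words) {
        return new HashSet<>(Arrays.asList(words));
    }

    public static void main(String[] args) {
        List<Set<String>> docs = Arrays.asList(
            doc("save", "software"), doc("cheap", "software"),
            doc("meeting", "hong", "kong"), doc("keep", "touch"));
        List<Integer> labels = Arrays.asList(1, 1, -1, -1);
        ThickAveragedPerceptron p = new ThickAveragedPerceptron(0.1, 3.5);
        p.train(docs, labels, 5);
        System.out.println(p.predict(doc("save", "cheap")));
        System.out.println(p.predict(doc("meeting", "touch")));
    }
}
```

In this sketch a larger thickness forces more updates, because even correctly classified examples keep moving the weights until they clear the margin, while averaging smooths out the effect of the last few updates.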

Demo
Let's build a spam detector
• How to train?
• How do different learning algorithms perform? Does this choice matter much?

Features
Our current spam detector uses words as features
Can we do better? Let's try it out
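One common way to go beyond single words is to add pairs of consecutive words (bigrams). The Java sketch below shows the idea outside of LBJ; the class `NgramFeatures`, the `w:` and `bg:` prefixes, and the whitespace tokenizer are assumptions for illustration, and whether bigrams actually help on the spam data is something to check in the demo rather than take for granted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical feature extractor: word unigrams plus word bigrams. */
class NgramFeatures {
    static List<String> unigramsAndBigrams(String text) {
        List<String> features = new ArrayList<>();
        String cleaned = text.trim().toLowerCase(Locale.ROOT);
        if (cleaned.isEmpty()) {
            return features; // no tokens, no features
        }
        String[] tokens = cleaned.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            features.add("w:" + tokens[i]); // the word itself
            if (i + 1 < tokens.length) {
                features.add("bg:" + tokens[i] + "_" + tokens[i + 1]); // word plus its successor
            }
        }
        return features;
    }

    public static void main(String[] args) {
        // Prints [w:keep, bg:keep_in, w:in, bg:in_touch, w:touch]
        System.out.println(unigramsAndBigrams("keep in touch"));
    }
}
```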

MORE TEXT CLASSIFICATION

Sentiment classification
Which of these product reviews is positive?

I recently made the switch from PC to Mac, and I can say that I'm not sure why I waited so long. Considering that I have only had my computer a few weeks I can't say much about the durability and longevity of the hardware, but I can say that the operating system (mine shipped with Lion) and software is top notch.

I've been an Apple user for a long time, but my most recent MacBook Pro purchase has convinced me to reconsider. I've had several hardware issues, including a failed keyboard, battery failure, and a bad DVD drive. Now, the backlight on the display fails to turn on when waking from sleep

How do you know?

Classifying news groups
Which mailing list should this message be posted to?

I am looking for Quick C or Microsoft C code for image decoding from file for VGA viewing and saving images from/to GIF, TIFF, PCX, or JPEG format. I have scoured the Internet, but its like trying to find a Dr. Seuss spell checker TSR. It must be out there, and there's no need to reinvent the wheel.

How do you know?

alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc

Demo
Converting our spam classifier into
• A sentiment classifier
• A newsgroup classifier
Note: How different are these at the implementation level?

Most of the engineering lies in the features
[Figure: a document is passed to a classifier (black box), which accepts one of some candidate labels (✓) and rejects the others (✗)]

Summary
• What is LBJ? How do we use it?
• Writing a simple spam detector
• Playing with features
• How much do we need to change to move to a different application?

Assignment before Next Class (Not Graded)
Download the code & data ( ) for this class and play with it
Try to solve the Badges game puzzle with LBJ
• Think about what features are needed
• Write a parser for reading the data
• Write a classifier for solving the puzzle

Next Class
• We will solve the Badges Game puzzle using machine learning
• We will look at more text classification examples
• We will think about a famous people classifier
Questions?

Badge Classifier
Brainstorm the possible features
• Characters in the entire name
• Two consecutive characters
• Character as vowel, character as consonant
• …
Feature engineering is important (especially if labeled data is small)
What is the baseline? 70 +, 24 -
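The brainstormed character features can be sketched in plain Java as follows. The class `BadgeFeatures` and its feature-name strings are hypothetical, and the example name is made up. The `majorityBaseline` method assumes that "baseline" here means always predicting the more frequent label; under that assumption, 70 positive and 24 negative examples give 70 / 94, or roughly 74.5% accuracy, which any learned classifier should beat.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical character-level feature extractor for the Badges game. */
class BadgeFeatures {
    static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    static List<String> extract(String name) {
        String s = name.toLowerCase(Locale.ROOT);
        List<String> features = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isLetter(c)) {
                continue; // skip spaces and punctuation
            }
            features.add("char=" + c); // character anywhere in the name
            features.add("pos" + i + (isVowel(c) ? "=vowel" : "=consonant")); // vowel/consonant by position
            if (i + 1 < s.length() && Character.isLetter(s.charAt(i + 1))) {
                features.add("bigram=" + c + s.charAt(i + 1)); // two consecutive characters
            }
        }
        return features;
    }

    /** Accuracy of always predicting the more frequent label (an assumed baseline). */
    static double majorityBaseline(int positives, int negatives) {
        return (double) Math.max(positives, negatives) / (positives + negatives);
    }

    public static void main(String[] args) {
        // Prints [char=a, pos0=vowel, bigram=al, char=l, pos1=consonant]
        System.out.println(extract("Al"));
        System.out.println(majorityBaseline(70, 24));
    }
}
```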

THE FAMOUS PEOPLE CLASSIFIER

The Famous People Classifier
f( ) = Politician
f( ) = Athlete
f( ) = Corporate Mogul

The NLP version of the fame classifier
Each person is represented by:
• All sentences in the news in which the string Barack Obama occurs
• All sentences in the news in which the string Roger Federer occurs
• All sentences in the news in which the string Bill Gates occurs

Our goal
Find famous athletes, corporate moguls and politicians
Athlete: Michael Schumacher, Michael Jordan, …
Politician: Bill Clinton, George W. Bush, …
Corporate Mogul: Warren Buffett, Larry Ellison, …

Let's brainstorm
How do we build a fame classifier?
Remember, we start off with just raw text from a news website

One solution
Let us label entities using features defined on mentions
• Identify mentions using the named entity recognizer
• Define features based on the words, parts of speech and dependency trees
• Train a classifier
[Figure: all sentences in the news in which the string Barack Obama occurs]
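As a rough illustration of "features defined on mentions", the Java sketch below counts the words that appear within a small window around each mention of an entity. It is a deliberately reduced stand-in: mentions are found by exact string matching rather than by a named entity recognizer, only word features are used (no parts of speech or dependency trees), and the class `MentionContextFeatures`, the window size, and the example sentences are all invented for this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical word-context features around mentions of one entity. */
class MentionContextFeatures {
    private static boolean matchesAt(String[] tokens, int start, String[] entity) {
        for (int j = 0; j < entity.length; j++) {
            if (!tokens[start + j].equals(entity[j])) {
                return false;
            }
        }
        return true;
    }

    /** Counts words occurring within `window` tokens before or after each mention. */
    static Map<String, Integer> contextWordCounts(String entity, List<String> sentences, int window) {
        String[] entityTokens = entity.toLowerCase(Locale.ROOT).split("\\s+");
        Map<String, Integer> counts = new TreeMap<>();
        for (String sentence : sentences) {
            // Assumption: strip punctuation and split on whitespace instead of real tokenization.
            String[] tokens = sentence.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9 ]", " ").trim().split("\\s+");
            for (int start = 0; start + entityTokens.length <= tokens.length; start++) {
                if (!matchesAt(tokens, start, entityTokens)) {
                    continue;
                }
                int end = start + entityTokens.length;
                int from = Math.max(0, start - window);
                int to = Math.min(tokens.length, end + window);
                for (int i = from; i < to; i++) {
                    if (i >= start && i < end) {
                        continue; // skip the mention itself
                    }
                    counts.merge(tokens[i], 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> news = Arrays.asList(
            "Roger Federer won the final.",
            "The champion Roger Federer won again.");
        // Prints {champion=1, won=2}
        System.out.println(contextWordCounts("Roger Federer", news, 1));
    }
}
```

The resulting counts could then serve as the feature vector for one entity, with the label (athlete, politician, corporate mogul) supplied by annotation.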

Summary
1. Get started with Learning Based Java
2. Use a generic, black box text classifier for different applications, and write your own text classifier if needed
3. Understand how features can impact classifier performance, and add features to improve your application
4. Build a badge classifier based on character features
Questions?