An Introduction to Machine Learning and Natural Language Processing Tools
Presented by: Mark Sammons, Vivek Srikumar
(Many slides courtesy of Nick Rizzolo)
8/24/2010 – 8/26/2010

Some reasonably reliable facts…*
- 1.7 ZB (= 10^21 bytes) of new digital information added worldwide in one year
- The large majority of this is unstructured (e.g. not database entries)
- 25% is images
- 6 EB is email (1,000 EB = 1 ZB)
How to manage/access the tiny relevant fraction of this data?
*Source: Dr. Joseph Kielman, DHS; projected figures taken from a presentation in Summer 2010

Keyword search is NOT the answer to every problem…
Abstract/aggregative queries:
- E.g. find news reports about visits by heads of state to other countries
- E.g. find reviews for movies that some person might like, given a couple of examples
Enterprise search:
- Private collections of documents, e.g. all the documents published by a corporation
- Much lower redundancy than web documents
- Need to search for concepts, not words
  - E.g. looking for proposals with similar research goals: the wording may be very different (different scientific disciplines, different emphasis, different methodology)

Even when keyword search is a good start…
- Waaaaay too many documents match the keywords
- Solution: filter out data that is irrelevant (for the task at hand)
Some examples…
- Different languages
- Spam vs. non-spam
- Forum/blog post topic
How to solve the Spam/non-Spam problem? Suggestions?

Machine Learning could help… Instead of writing/coding rules by hand (“expert systems”), use statistical methods to “learn” rules that perform a classification task, e.g.
- Given a blog post, which of N different topics is it most relevant to?
- Given an email, is it spam or not?
- Given a document, which of K different languages is it in?
… and now, a demonstration… The demonstration shows what a well-designed classifier can achieve. Here’s a very high-level view of how classifiers work in the context of NLP.

Motivating example: blog topics
[Diagram: a blog crawl producing a stream of posts to be classified by topic]

What we need: a function f from documents to topic labels:
- f(blog post 1) = “politics”
- f(blog post 2) = “sports”
- f(blog post 3) = “business”

Where to get it: Machine Learning
[Diagram: Data → Feature Functions → Learning Algorithm → labels such as “politics”, “sports”, “business”]

So, what are “feature functions”?
- Take the same input as f (the document to be classified)
- Indicate some property of the input, a.k.a. a feature
Typical NLP feature functions:
- Binary
  - Appearance of a given word
  - Appearance of two words consecutively, a.k.a. a bigram
  - Appearance of a word with a given part of speech
  - Appearance of a named entity (e.g. “Barack Obama”)
- Real-valued
  - Counts of binary features
  - TF-IDF (a statistical measure of how important a word is to a document)
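To make these concrete, here is a minimal Python sketch of the two simplest binary feature functions above (word and bigram indicators). It is illustrative only: the function name, the "word="/"bigram=" naming scheme, and the crude whitespace tokenization are my own choices, and part-of-speech or named-entity features would need a real NLP pipeline.

```python
def extract_features(text):
    """Return the set of binary features active in `text`."""
    words = text.lower().split()                  # crude whitespace tokenization
    features = set()
    for w in words:
        features.add("word=" + w)                 # unigram indicator
    for w1, w2 in zip(words, words[1:]):
        features.add("bigram=" + w1 + "_" + w2)   # consecutive-word indicator
    return features

print(sorted(extract_features("Barack Obama visited the White House")))
```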

What does the Learning Algorithm do?
- Training input: a feature-based representation of examples, together with the labels we want to be able to predict
  - E.g. 1,000 feature sets labeled either “spam” or “non-spam”
  - Each feature set is extracted from an email using the feature functions
  - Labels are typically assigned by a human annotator
- Computes statistics over the features, relating features to labels
- Generates a classifier (statistical model) that predicts a label based on the features that are active
  - E.g. if the feature “word-’VIAGRA’-is-present” is active, predict “SPAM”
- Training output: the classifier. It will now take feature representations of new emails (no label!) and predict a label.
  - Sometimes, it will be wrong!
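To illustrate the train → classifier → predict contract just described (this is not the tutorial's actual algorithm), here is a deliberately simple learner that merely counts how often each feature co-occurs with each label, then predicts the label whose features score highest. The toy email data is invented.

```python
from collections import Counter, defaultdict

def train(labeled_examples):
    """Count feature/label co-occurrences; return a classifier function."""
    counts = defaultdict(Counter)                 # label -> feature counts
    for features, label in labeled_examples:
        counts[label].update(features)
    def classify(features):                       # the learned "classifier"
        return max(counts, key=lambda lbl: sum(counts[lbl][f] for f in features))
    return classify

emails = [({"word=VIAGRA", "word=buy"},     "spam"),
          ({"word=meeting", "word=agenda"}, "non-spam"),
          ({"word=buy", "word=huge"},       "spam")]
classify = train(emails)
print(classify({"word=VIAGRA", "word=huge"}))     # -> "spam"
```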

Update Rules
- Decision rule: a linear threshold function. Predict positive iff w · x ≥ θ, where θ is the activation threshold.
- Winnow: mistake-driven multiplicative update rules
  - Promotion: if w · x < θ but the true label is positive, set w_i ← α · w_i for every active feature x_i = 1
  - Demotion: if w · x ≥ θ but the true label is negative, set w_i ← β · w_i for every active feature x_i = 1
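Below is a minimal, self-contained sketch of these update rules in Python. It is a transcription of standard Winnow, not SNoW's actual code; the class and variable names are mine, and the default parameters mirror those used in the worked examples that follow.

```python
class Winnow:
    """Linear threshold unit with mistake-driven multiplicative updates."""

    def __init__(self, alpha=2.0, beta=0.5, theta=3.5):
        self.alpha, self.beta, self.theta = alpha, beta, theta
        self.w = {}                                  # feature -> weight

    def activation(self, features):
        # Unseen features enter with the initial weight of 1.
        return sum(self.w.setdefault(f, 1.0) for f in features)

    def predict(self, features):
        return self.activation(features) >= self.theta

    def update(self, features, label):
        if self.predict(features) == label:
            return                                   # no mistake: no update
        factor = self.alpha if label else self.beta  # promote or demote
        for f in features:                           # only active features change
            self.w[f] *= factor

clf = Winnow()
clf.update({"word=VIAGRA", "word=huge"}, True)   # activation 2 < 3.5: promote
print(clf.predict({"word=VIAGRA", "word=huge"})) # activation now 4 >= 3.5: True
```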

One Basic System: One vs. All
- One linear threshold function per target (concept)
- Features connect to targets via weighted edges, instead of separate weight vectors
- Prediction is “one vs. all”: the target with the highest activation wins
(A code sketch of this architecture follows the worked example below.)

A Training Example – 3 Newsgroups
Targets: Movies, Health, Computers. Features: disease, process, crime, blood, exceed, review, terminal, linux, gun. All weights start at 1.
Training sequence (target: active features):
- Health: disease, review
- Computers: process, terminal, linux
- Movies: blood, review
- Health: blood, terminal
- Movies: blood, exceed, gun
Update rule: Winnow, α = 2, β = ½, θ = 3.5
Test: ?, with active features disease, exceed, terminal. Each target’s activation is computed (the slide shows 4, 2, and 1), and the highest-scoring target is predicted.
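The example above can be reproduced approximately with a one-vs-all extension of the Winnow sketch: one weight vector per target, the same per-target mistake-driven updates, and prediction by highest activation. This is my reconstruction, not the SNoW system, and the slide animation's intermediate weights aren't fully recoverable from the transcript, but the sketch does predict "Health" for the test features.

```python
class OneVsAllWinnow:
    """One Winnow-style weight vector per target; highest activation wins."""

    def __init__(self, targets, alpha=2.0, beta=0.5, theta=3.5):
        self.alpha, self.beta, self.theta = alpha, beta, theta
        self.w = {t: {} for t in targets}        # target -> weighted edges

    def activation(self, target, features):
        return sum(self.w[target].setdefault(f, 1.0) for f in features)

    def predict(self, features):
        return max(self.w, key=lambda t: self.activation(t, features))

    def update(self, features, label):
        for t in self.w:
            above = self.activation(t, features) >= self.theta
            if above and t != label:             # demote a wrongly active target
                for f in features:
                    self.w[t][f] *= self.beta
            elif not above and t == label:       # promote the true target
                for f in features:
                    self.w[t][f] *= self.alpha

examples = [("Health",    {"disease", "review"}),
            ("Computers", {"process", "terminal", "linux"}),
            ("Movies",    {"blood", "review"}),
            ("Health",    {"blood", "terminal"}),
            ("Movies",    {"blood", "exceed", "gun"})]

net = OneVsAllWinnow(["Movies", "Health", "Computers"])
for label, features in examples:
    net.update(features, label)

print(net.predict({"disease", "exceed", "terminal"}))  # -> "Health"
```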

A Training Example, abstracted…
The same example, with targets and features replaced by integer IDs. The learning algorithm operates on IDs alone; it never needs to know what the words were:
- 1: 1001, 1006
- 2: 1002, 1007, 1008
- 3: 1004, 1006
- 1: 1004, 1007
- 3: 1004, 1005, 1009
Update rule: Winnow, α = 2, β = ½, θ = 3.5
Test: ?, with active features 1001, 1005, 1007.

Adapting to a New Task: Spam Filtering
- What if we want to learn Spam vs. Non-Spam?
- The demonstration showed that the same black box can be adapted to multiple problems… so what happens internally?
- Feature functions are generic pattern extractors…
  - For a new set of documents, new features will be extracted
  - E.g. for Spam, we’d expect to see features like “word-’VIAGRA’-is-present”
- New documents come with their own set of labels (again, assigned by human annotators)
- So we reuse the same code, but generate a new classifier…

A Training Example – Spam Filtering
Targets: Non-Spam, Spam. Features: V1AGRA, balding, benefit, buy, time, medicine, huge, panic, avoid. All weights start at 1.
Training sequence (target: active features):
- Spam: V1AGRA, medicine
- Non-Spam: buy, medicine
- Spam: buy, huge
- Non-Spam: buy, time, avoid
Update rule: Winnow, α = 2, β = ½, θ = 3.5
Test: ?, with active features V1AGRA, time, huge. Again the target activations (the slide shows 4 and 1) are compared, and the higher-scoring target is predicted.
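The same point in code: reusing the OneVsAllWinnow class from the newsgroups sketch above unchanged, only the data differs, and training produces a new classifier. (This snippet assumes that earlier class definition is in scope; the data is transcribed from the slide, including the "V1AGRA" spelling.)

```python
spam_examples = [("Spam",     {"V1AGRA", "medicine"}),
                 ("Non-Spam", {"buy", "medicine"}),
                 ("Spam",     {"buy", "huge"}),
                 ("Non-Spam", {"buy", "time", "avoid"})]

net = OneVsAllWinnow(["Non-Spam", "Spam"])   # class from the earlier sketch
for label, features in spam_examples:
    net.update(features, label)

print(net.predict({"V1AGRA", "time", "huge"}))  # -> "Spam"
```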

Some Analysis…
- We defined a very generic feature set: “bag of words”
- We did reasonably well on three different tasks
- Can we do better on each task? …of course. If we add good feature functions, the learning algorithm will find more useful patterns.
- Suggestions for patterns for…
  - Spam filtering?
  - Newsgroup classification?
  - …are the features we add for Spam filtering good for Newsgroups?
  - When we add specialized features, are we “cheating”?
- In fact, a lot of time is usually spent engineering good features for individual tasks. It’s one way to add domain knowledge.

A Caveat
- It’s often a lot of work to learn to use a new tool set
- It can be tempting to think it would be easier to just implement what you need yourself
  - Sometimes, you’ll be right
  - But probably not this time
- Learning a tool set is an investment: the payoff comes later
  - It’s easy to add new functionality: it may already be a method in some class in a library; if not, there’s infrastructure to support it
  - You will avoid certain errors: someone already made them and coded against them
  - Doing it yourself is probably a lot more work than you think

Homework (!)
To prepare for tomorrow’s tutorial, you should:
- Log in to the DSSI server via SSH
- Check that you can transfer a file to your home directory from your laptop
- If you have any questions, email Tim or Yuancheng
Bring your laptop to the tutorial!!!