Exploring Massive Learning via a Prediction System
Omid Madani, Yahoo! Research
www.omadani.net

Goal
Convey a taste of: the motivations, considerations, assumptions, speculations, hopes, ... and then the game, a first system, and its algorithms.

Talk Overview
1. Motivational part
2. The approach: the game (categories, ...), algorithms, some experiments

Fill in the Blank(s)!
Would ---- like ---- ---- ---- ----? (the scattered answer words on the slide: you, your, coffee, with, sugar)

What is this object?

Categorization is Fundamental!
“Well, categorization is one of the most basic functions of living creatures. We live in a categorized world – table, chair, male, female, democracy, monarchy – every object and event is unique, but we act towards them as members of classes.” From an interview with Eleanor Rosch (psychologist, a pioneer on the phenomenon of “basic level” concepts)
“Concepts are the glue that holds our mental world together.” From “The Big Book of Concepts”, Gregory Murphy

“Rather, the formation and use of categories is the stuff of experience.” Philosophy in the Flesh, Lakoff and Johnson.

Two Questions Arise
Repeated and rapid classification ... in the presence of myriad classes. In the presence of myriad categories:
1. How to categorize efficiently?
2. How to efficiently learn to categorize efficiently?

Now, a 3rd Question...
How can so many inter-related categories be acquired? Programming them is unlikely to succeed or scale, given:
- The limits of our explicit/conscious knowledge
- Unknown/unfamiliar domains
- The required scale
- Making the system operational

Learn? ... How?
“Supervised” learning (explicit human involvement) is likely inadequate:
- The required scale, or a good signpost: ~millions of categories and beyond, billions of weights and beyond
- Inaccessible “knowledge” (see the previous slide!)
Other approaches likely do not meet the needs (incomplete, different goals, etc.): active learning, semi-supervised learning, clustering, density learning, RL, etc.

Desiderata/Requirements (or Speculations)
Higher intelligence, such as advanced pattern recognition/generation (e.g. vision), may require:
- Long-term learning (weeks, months, years, ...)
- Cumulative learning (learn these first, then these, then these, ...)
- Massive learning: myriad inter-related categories/concepts
- Systems learning
- Autonomy (relatively little human involvement)
What's the learning task?

This Work: An Exploration
An avenue: “prediction games in infinitely rich worlds”.
The exciting part: the world provides unbounded learning opportunity! (The world is the validator, the system is the experimenter, and the system actively builds much of its own concepts.) The world enjoys many regularities (e.g. “hierarchical” ones).
Based in part on “supervised” techniques!! (“discriminative”, “feedback driven”; the supervisory signal doesn't originate from humans)

In a Nutshell
The prediction system cycles between predicting and observing & updating. It begins with low-level or “hard-wired” categories (text: characters; vision: edges, curves, ...). After a while (much learning), it predicts with higher-level categories, i.e., bigger chunks (e.g. words, digits, phrases, phone numbers, faces, visual objects, home pages, sites, ...).

The Game
Repeat:
- Hide part(s) of the stream
- Predict (use context)
- Update
- Move on
Objective: predict better, subject to efficiency constraints. In the process, categories at different levels of size and abstraction should be learned. (A toy sketch of the loop follows.)
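A minimal runnable sketch of the game loop in Python (my illustration, not the system's actual code): here the “system” is just a table from contexts to target weights, whereas the real system predicts through a learned network and also composes new categories.

    from collections import defaultdict

    def play(stream, k=1):
        """Hide each position in turn, predict it from k categories of
        context on each side, then update; returns the number of hits."""
        table = defaultdict(lambda: defaultdict(float))  # context -> {target: weight}
        hits = 0
        for t in range(k, len(stream) - k):
            context = tuple(stream[t - k:t] + stream[t + 1:t + 1 + k])  # stream[t] is hidden
            guesses = table[context]
            best = max(guesses, key=guesses.get) if guesses else None   # predict
            hits += (best == stream[t])
            guesses[stream[t]] += 1.0                                   # update
        return hits

    print(play(list("abababab")))  # 4: each context is missed once, then predicted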

Research Goals
Conjecture: there is much value to be attained from this task, beyond language modeling: more advanced pattern recognition/generation. If so, it should yield a wealth of new problems (=> fun).

Overview
Goal: convey a taste of the motivations/considerations, the system and algorithms, ...
- Motivation
- The approach: the game (categories, ...), algorithms, some experiments

Upshot
- Takes streams of text
- Makes categories (strings)
- Approximately three hours on 800k documents
- Large-scale discriminative learning (evidence it does better than language modeling)

Caveat Emptor!
This is exploratory research, with many open problems (many I'm not aware of ...). The chosen algorithms, system organization, objective/performance measures, etc., are likely not even near the best possible.

Categories
Building blocks (atoms!) of intelligence? Patterns that frequently occur, external and internal, and that are useful for predicting other categories! They can have structure/regularities, in particular:
1. Composition (~conjunctions) of other categories (Part-Of)
2. Grouping (~disjunctions) (Is-A relations)

Categories
Low-level “primitive” examples: 0 and 1, or characters (“a”, “b”, ..., “0”, “-”, ...); these are provided to the system (easy to detect).
Higher/composite levels: sequences of bits/characters, words, phrases. More general: phone number, contact info, resume, ...

Example Concept
Area code is a concept that involves both composition and grouping:
- Composition of 3 digits
- A digit is a grouping, i.e., the set {0,1,2,...,9} (2 is a digit)
Other example concepts: phone number, address, resume page, face (in the visual domain), etc. (A toy encoding follows.)
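As a toy illustration (my own encoding, not the system's actual data structures), grouping can be modeled as a set of interchangeable members and composition as a sequence of such groupings:

    # Grouping = Is-A set; composition = Part-Of sequence of groupings.
    DIGIT = set("0123456789")           # "2 is a digit"
    AREA_CODE = [DIGIT, DIGIT, DIGIT]   # three digits in a row

    def matches(concept, s):
        """True if string s instantiates the composed concept."""
        return len(s) == len(concept) and all(ch in part for ch, part in zip(s, concept))

    assert matches(AREA_CODE, "408")
    assert not matches(AREA_CODE, "4o8")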

Again, our goal, informally, is to build a system that acquires millions of useful concepts on its own.

Questions for a First System
- Functionality? Architecture? Organization?
- Would many-class learning scale to millions of concepts?
- Choice of concept-building methods?
- How would the various learning processes interact?

Expedition: a First System
- Plays the game in text
- Begins at the character level; no segmentation, just a stream
- Makes and predicts larger sequences via composition
- No grouping yet

Research … New Jersey in … predictors (active categories) window containing context and target target (category to predict) … New Jersey in … next time step predictors target Learning Episodes In this example, context contains one category on each side

.. Some Time Later ..
... loves New York life ...
A window again contains the context (predictors) and the target (the category to predict). In terms of supervised learning/classification, in this learning activity (prediction games):
- The set of concepts grows over time
- The same holds for features/predictors (concepts ARE the predictors!)
- The instance representation (segmentation of the data stream) changes/grows over time

Prediction/Recall
(In the slide's figure, features f1..f4 connect to categories c1..c5 in a weighted bipartite graph.)
1. Features are “activated”
2. Edges are activated
3. Receiving categories are activated
4. Categories are sorted/ranked
This is like the use of inverted indices: sparse dot products. (A minimal sketch follows.)
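A minimal sketch of recall as a sparse dot product over an inverted index (the index contents and weights here are illustrative):

    from collections import defaultdict

    # index[f]: the outgoing edges of feature f, as {category: weight}
    index = {"f1": {"c1": 0.6, "c2": 0.3},
             "f2": {"c2": 0.5, "c4": 0.2}}

    def rank(active_features):
        scores = defaultdict(float)
        for f in active_features:                 # steps 1-2: activate features, then edges
            for c, w in index.get(f, {}).items():
                scores[c] += w                    # step 3: receiving categories accumulate
        return sorted(scores.items(), key=lambda kv: -kv[1])  # step 4: sort/rank

    print(rank(["f1", "f2"]))  # [('c2', 0.8), ('c1', 0.6), ('c4', 0.2)]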

Updating a Feature's Connections
1. Identify the connection
2. Increase its weight
3. Normalize/weaken the other weights
4. Drop tiny weights
Out-degrees are constrained. (The slide's update formula involves a Kronecker delta, i.e., only the edge to the observed target receives the reward term.) A sketch of one plausible form of the update follows.
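A sketch of one plausible form of the update, with illustrative constants (additive reward on the target edge, multiplicative weakening elsewhere, pruning, and a degree cap); the actual rule in the system may differ:

    REWARD, DECAY, MIN_W, MAX_DEGREE = 0.2, 0.95, 0.04, 25   # illustrative values

    def update(edges, target):
        """edges: {category: weight}, the outgoing connections of one feature."""
        edges[target] = edges.get(target, 0.0) + REWARD   # 1-2: strengthen the target edge
        for c in list(edges):
            if c != target:
                edges[c] *= DECAY                         # 3: weaken the other weights
            if edges[c] < MIN_W:
                del edges[c]                              # 4: drop tiny weights
        if len(edges) > MAX_DEGREE:                       # keep the out-degree bounded
            for c in sorted(edges, key=edges.get)[:len(edges) - MAX_DEGREE]:
                del edges[c]

The 0.04 pruning threshold echoes the “connection weight less than: 0.04” cutoff visible in the recall-path examples later in the talk.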

Example Category Node: “ther ” (from Jane Austen's)
Its prediction weights relate it to categories appearing before it, e.g. “and ”, “heart”, “nei”, “toge”, “ far”, “ bro”, “love ”, “by ” (with weights such as 0.10).
A category node keeps track of various weights, such as edge (or prediction) weights and predictiveness weights, along with other statistics (e.g. frequency, first/last time seen), and updates them when it is activated as a predictor or target (statistics are kept locally).

Network
Categories and their edges form a network (a directed weighted graph, with different kinds of edges ...). The network grows over time: millions of nodes and beyond.

When and How to Compose?
Two major approaches:
1. Pre-filter: don't compose if certain conditions are not met (simplest: only consider possibilities that you actually see)
2. Post-filter: compose and use, but remove if certain conditions are not met (e.g. if not seen recently enough, remove)
I expect both are needed ...

Some Composition (Pre-filter) Heuristics
- FRAC: if you see c1 then c2 in the stream, then, with some probability, add c = c1c2
- MU: use the pointwise mutual information between c1 and c2
- IMPROVE: take string lengths into account and check whether joining is better
- BOUND: generate all strings under a length L_t
(A sketch of the MU filter follows.)
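For instance, the MU heuristic can be sketched as follows, with PMI(c1, c2) = log(P(c1 c2) / (P(c1) P(c2))); the counts would come from the stream, and the threshold here is an illustrative assumption:

    import math

    def pmi(n_pair, n_c1, n_c2, total):
        """Pointwise mutual information of the pair c1c2, from raw counts."""
        return math.log((n_pair / total) / ((n_c1 / total) * (n_c2 / total)))

    def should_compose(n_pair, n_c1, n_c2, total, threshold=2.0):
        """MU pre-filter: compose c1c2 only if the pair's PMI clears a threshold."""
        return n_pair > 0 and pmi(n_pair, n_c1, n_c2, total) >= threshold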

Prediction Objective
Desirable: learn higher-level categories (bigger/more abstract categories are useful externally).
Question: how does this relate to improving predictions?
1. Higher-level categories improve the “context” and can save memory
2. Bigger categories save time in playing the game (categories are atomic)

Objective (evaluation criterion)
The matching performance: the number of bits (characters) correctly predicted per unit time, or per prediction action, subject to constraints (space, time, ...).
How about entropy/perplexity? Categories are structured, so perplexity seems difficult to use. (A minimal computation of matching performance follows.)
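A minimal sketch of matching performance per prediction action, under my simplifying assumption that an action earns the length of a fully correct prediction and nothing otherwise (how partial matches are credited is a design choice the slide leaves open):

    def matching_performance(predictions, truths):
        """Characters correctly predicted per prediction action.
        predictions/truths: parallel lists of strings, one pair per action."""
        correct = sum(len(p) for p, t in zip(predictions, truths) if p == t)
        return correct / len(predictions)

    print(matching_performance(["new", "yo"], ["new", "in"]))  # 1.5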

Linearity and Non-Linearity (a motivation for new concept creation)
“n e w” versus “new”: which one predicts better, i.e., better constrains what comes next after “new”? Aggregating the votes of the separate predictors “n”, “e”, and “w” is one option; predicting from the single composite category “new” is the other.

Data
- Reuters RCV1: 800k news articles
- Several online books (Jane Austen, etc.)
- Web search query logs

Some Observations
Ran on Reuters RCV1 (text body) (simply zcat dir/file*):
- ~800k articles, >= 150 million learning/prediction episodes
- Over 10 million categories built
- 3-4 hours per pass (depends on parameters)

Observations
Performance on held-out data (one of the Reuters files): the text to predict is 8-9 characters long on average, and almost two characters are correct on average per prediction action. The system can overfit/memorize (long categories)! Current remedy: stop category generation after the first pass.

Some Example Categories (in order of first appearance and increasing length)
cat name= "<"
cat name= " t"
cat name= ".</"
cat name= "p>- "
cat name= " the "
cat name= "ation "
cat name= "of the "
cat name= "ing the "
cat name= ""The "
cat name= "company said "
cat name= ", the company "
cat name= "said on Tuesday"
cat name= "," said one "
cat name= "," he said.
cat name= " "
cat name= " "
cat name= "
cat name= ". Reuters has not verified these stories and does not vouch for their accuracy.
cat name= "press on Tuesday. Reuters has not verified these stories and does not vouch for their accuracy.
cat name= "press on Thursday. Reuters has not verified these stories and does not vouch for their accuracy.
cat name= "press on Wednesday. Reuters has not verified these stories and does not vouch for their accuracy.
cat name= "within 10 percentage points in either direction of the key 225-share Nikkei average over the next six month"
cat name= "ing and selling rates for leading world currencies and gold against the dollar on the London foreign exchange and bullion "

Example “Recall” Paths
From processing one month of Reuters:
"Sinn Fei" (0.128) "n a seat" (0.527) " in the " (0.538) "talks." (0.468) " B" (0.0185) "rokers " **** The end: connection weight less than: 0.04
" Gas in S" (1) "cotland" (1.04) " and north" (1.18) "ern E" (0.572) "ngland" (0.165) "," a " (0.0542) "spokeswo" (0.551) "man said " (0.044) "the idea" (0.0869) " was to " (0.144) "quot" (0.164) "e the d" (0.0723) "ivision" (0.0671) " in N" (0.397) "ew York" (0.062) " where " (0.0557) "the main " (0.0474) "marque" (0.229) "s were " (0.253) "base" (0.264) "d. "" (0.0451) "It will " (0.117) "certain" (0.0691) "ly b" (0.0892) "e New " (0.353) "York" (0.112) " party" (0.0917) "s is goin" (0.559) "g to " (0.149) "end."" (0.239) " T" (0.104) "wedish " (0.125) "Export" (0.0211) " Credi" **** The end: connection weight less than: 0.04

Search Query Logs
"bureoofi" (1) "migration" (1.13) "andci" (1.04) "tizenship." (0.31) "com www," (0.11) "ictions" (0.116) "zenship." **** The end: this concept wasn't seen in last time points.
Random Recall:
"bureoofi" (1) "migration" (0.0129) "dept.com" **** The end: this concept wasn't seen in last time points.

Much Related Work!
Online learning, cumulative learning, feature and concept induction, neural networks, clustering, Bayesian methods, language modeling, deep learning, “hierarchical” learning, the importance/ubiquity of prediction/anticipation in the brain (“On Intelligence”, “natural computations”, ...), models of neocortex (“circuits of the mind”), concepts and conceptual phenomena (e.g. “The Big Book of Concepts”), compression, ...

Summary
- Large-scale learning and classification (data hungry, efficiency paramount)
- A systems approach: integration of multiple learning processes; the system makes its own classes
- Driving objective: improve prediction (currently: “matching” performance)
- The underlying goal: effectively acquire complex concepts
See

Current/Future
Much work remains:
- Integrate learning of groupings
- Recognize/use “structural” categories? (learn to “parse”/segment?)
- Is the prediction objective OK? Control over the input stream, etc.
- Category generation: what are good methods?
- Other domains (vision, ...)
- Compare: language modeling, etc.