
Slide 1: Exploring Massive Learning via a Prediction System. Omid Madani, Yahoo! Research, www.omadani.net

Slide 2: Goal. Convey a taste of the motivations, considerations, assumptions, speculations, hopes, …; and the game, a first system, and its algorithms.

Slide 3: Talk Overview. 1. Motivation. 2. The approach: the game (categories, …), algorithms, some experiments.

Slide 4: Fill in the Blank(s)! "Would ---- like ---- ------ ---- -----?" (The words hidden by the blanks, revealed as overlays on the slide: you, your, coffee, with, sugar, i.e. "Would you like your coffee with sugar?")

Slide 5: What is this object? (Image slide.)

Slide 6: Categorization is Fundamental! “Well, categorization is one of the most basic functions of living creatures. We live in a categorized world – table, chair, male, female, democracy, monarchy – every object and event is unique, but we act towards them as members of classes.” (From an interview with Eleanor Rosch, psychologist and a pioneer on the phenomenon of “basic level” concepts.) “Concepts are the glue that holds our mental world together.” (From “The Big Book of Concepts”, Gregory Murphy.)

Slide 7: “Rather, the formation and use of categories is the stuff of experience.” (Philosophy in the Flesh, Lakoff and Johnson.)

Slide 8: Two Questions Arise. Repeated and rapid classification, in the presence of myriad classes. (Diagram: items flowing through a classification system.) In the presence of myriad categories: 1. How to categorize efficiently? 2. How to efficiently learn to categorize efficiently?

Slide 9: Now, a Third Question. How can so many inter-related categories be acquired? Programming them is unlikely to succeed or to scale: the limits of our explicit/conscious knowledge, unknown/unfamiliar domains, the required scale, and making the system operational.

Slide 10: Learn? … How? "Supervised" learning (explicit human involvement) is likely inadequate: the required scale (or a good signpost for it) is ~millions of categories and beyond, billions of weights and beyond, and much of the needed "knowledge" is inaccessible (see the last slide!). Other approaches likely do not meet the needs (incomplete, different goals, etc.): active learning, semi-supervised learning, clustering, density learning, RL, etc.

Slide 11: Desiderata/Requirements (or Speculations). Higher intelligence, such as advanced pattern recognition/generation (e.g. vision), may require: long-term learning (weeks, months, years, …); cumulative learning (learn these first, then these, then these, …); massive learning (myriad inter-related categories/concepts); systems learning; autonomy (relatively little human involvement). What's the learning task?

Slide 12: This Work: An Exploration. An avenue: "prediction games in infinitely rich worlds". The exciting part: the world provides unbounded learning opportunity! (The world is the validator; the system is the experimenter, and it actively builds much of its own concepts.) The world enjoys many regularities (e.g. "hierarchical" ones). Based in part on "supervised" techniques ("discriminative", "feedback driven"), but the supervisory signal doesn't originate from humans.

Slide 13: In a Nutshell. (Diagram: a prediction system observes a stream such as "…0011101110000…", predicts, then observes and updates.) At first the system works with low-level or "hard-wired" categories (text: characters; vision: edges, curves, …). After a while (much learning), it predicts and observes higher-level categories, i.e. bigger chunks (e.g. words, digits, phrases, phone numbers, faces, visual objects, home pages, sites, …).

Slide 14: The Game. Repeat: hide part(s) of the stream; predict (using context); update; move on. Objective: predict better, subject to efficiency constraints. In the process, categories at different levels of size and abstraction should be learned. (A loop sketch follows.)
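The loop is simple enough to sketch. A minimal Python rendition, purely illustrative: `segment_next`, `predict`, and `update` are hypothetical names standing in for whatever the system actually does, not its real API.

```python
# A minimal sketch of the prediction game over a character stream.
# All names here are illustrative placeholders, not Expedition's API.

def play(stream, network, context_size=1):
    """Repeat: hide the next chunk, predict it from context, update, move on."""
    context = []
    while True:
        target = network.segment_next(stream)    # the hidden part of the stream
        if target is None:                       # stream exhausted
            break
        ranked = network.predict(context)        # rank candidate categories
        network.update(context, target, ranked)  # adjust edge weights
        context = (context + [target])[-context_size:]  # slide the context
```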

Slide 15: Research Goals. Conjecture: there is much value to be attained from this task, beyond language modeling (more advanced pattern recognition/generation). If so, it should yield a wealth of new problems (=> fun).

Slide 16: Overview. Goal: convey a taste of the motivations/considerations, the system and algorithms, … Next up, the approach: the game (categories, …), algorithms, some experiments.

Slide 17: Upshot. Takes streams of text and makes categories (strings); roughly three hours over 800k documents; large-scale discriminative learning (with evidence of doing better than language modeling).

Slide 18: Caveat Emptor! Exploratory research, with many open problems (many I'm not aware of …). The chosen algorithms, system organization, objective/performance measures, etc., are likely not even near the best possible.

Slide 19: Categories. Building blocks (atoms!) of intelligence? Patterns that frequently occur, external and internal, and are useful for predicting other categories! They can have structure/regularities, in particular: 1. composition (~conjunctions) of other categories (Part-Of relations); 2. grouping (~disjunctions) (Is-A relations).

Slide 20: Categories. Low-level "primitive" examples: 0 and 1, or characters ("a", "b", …, "0", "-", …); these are provided to the system (easy to detect). Higher/composite levels: sequences of bits/characters, words, phrases. More general: phone number, contact info, resume, …

Slide 21: Example Concept. Area code is a concept that involves both composition and grouping: it is a composition of 3 digits, and a digit is a grouping, i.e. the set {0,1,2,…,9} ("2 is a digit"). Other example concepts: phone number, address, resume page, face (in the visual domain), etc. (One way to encode this is sketched below.)
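To make the composition/grouping distinction concrete, here is one hypothetical way to encode the area-code concept in Python; the encoding is an editor's illustration, not the system's actual representation.

```python
# A grouping is a set of alternatives (~disjunction, Is-A);
# a composition is an ordered tuple of parts (~conjunction, Part-Of).
DIGIT = set("0123456789")            # grouping: "2 is a digit"
AREA_CODE = (DIGIT, DIGIT, DIGIT)    # composition of three digit groupings

def matches(s, composition):
    """Check that each character of s belongs to the corresponding grouping."""
    return len(s) == len(composition) and all(
        ch in group for ch, group in zip(s, composition))

assert matches("408", AREA_CODE)
assert not matches("4o8", AREA_CODE)
```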

Slide 22: Again, our goal, informally, is to build a system that acquires millions of useful concepts on its own.

Slide 23: Questions for a First System. Functionality? Architecture? Organization? Would many-class learning scale to millions of concepts? Which concept-building methods to choose? How would the various learning processes interact?

Slide 24: Expedition, a First System. Plays the game in text. Begins at the character level: no segmentation given, just a stream. Makes and predicts larger sequences via composition. No grouping yet.

Slide 25: Learning Episodes. (Diagram: the stream "… New Jersey in …", with a window containing context and target.) The predictors are the active categories in the context; the target is the category to predict. At the next time step the window slides forward, and predictors and target shift with it. In this example, the context contains one category on each side of the target. (A sliding-window sketch follows.)
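A sliding-window reading of such episodes might look like the sketch below; it assumes the stream has already been segmented into categories, whereas in the real system the segmentation itself evolves as categories are composed.

```python
# Sketch: generate (predictors, target) episodes from a segmented stream,
# with one category of context on each side, as in the slide's example.
def episodes(categories):
    for i in range(1, len(categories) - 1):
        predictors = (categories[i - 1], categories[i + 1])  # the context
        target = categories[i]                               # hidden category
        yield predictors, target

for ctx, tgt in episodes(["New", " ", "Jersey", " ", "in"]):
    print(ctx, "->", tgt)   # e.g. ('New', 'Jersey') -> ' '
```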

Slide 26: Some Time Later. (Diagram: the stream "… loves New York life …", with predictors and a window containing context and target.) In terms of supervised learning/classification, in this learning activity (prediction games): the set of concepts grows over time; the same holds for features/predictors (concepts ARE the predictors!); and the instance representation (the segmentation of the data stream) changes and grows over time.

Slide 27: Prediction/Recall. (Diagram: features f1–f4 with weighted edges to categories c1–c5.) 1. Features are "activated". 2. Edges are activated. 3. Receiving categories are activated. 4. Categories are sorted/ranked. This is like the use of inverted indices; the scoring amounts to sparse dot products. (Sketch below.)
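The ranking step is essentially a sparse dot product driven by an inverted index from active features to the categories they predict. A toy sketch; the data layout and the weights are invented for illustration:

```python
from collections import defaultdict

# Inverted index: each feature maps to the categories it predicts,
# with an edge weight; only the active features' lists are touched.
edges = {
    "f1": {"c1": 0.5, "c3": 0.25},
    "f2": {"c1": 0.25, "c4": 0.5},
}

def rank(active_features):
    """Accumulate sparse votes, then sort categories by total score."""
    scores = defaultdict(float)
    for f in active_features:
        for c, w in edges.get(f, {}).items():
            scores[c] += w
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(rank(["f1", "f2"]))  # [('c1', 0.75), ('c4', 0.5), ('c3', 0.25)]
```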

Slide 28: Updating a Feature's Connections. (Same features-to-categories diagram.) 1. Identify the connection. 2. Increase its weight. 3. Normalize/weaken the weights. 4. Drop tiny weights. Degrees are constrained. (The increase can be written with a Kronecker delta: only the edge to the observed target gets the bump. Sketch below.)
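Spelled out, the update moves weight toward the observed target, renormalizes (which weakens all other edges), prunes tiny weights, and caps the node's degree. A sketch under those assumptions; the learning rate, pruning threshold, and degree cap below are invented for illustration:

```python
def update_feature(weights, target, rate=0.1, drop_below=0.01, max_degree=25):
    """weights: category -> edge weight for one predictor feature.
    Only the observed target's edge is bumped (the Kronecker-delta part);
    normalization then weakens every other edge."""
    weights[target] = weights.get(target, 0.0) + rate
    total = sum(weights.values())
    for c in list(weights):
        weights[c] /= total                  # normalize
        if weights[c] < drop_below:
            del weights[c]                   # drop tiny weights
    if len(weights) > max_degree:            # constrain the out-degree
        weakest = sorted(weights, key=weights.get)[:len(weights) - max_degree]
        for c in weakest:
            del weights[c]
```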

Slide 29: Example Category Node, "ther " (from Jane Austen's works). (Diagram: the node with weighted edges to neighboring categories, among them "and " 0.087, "heart" 0.07, "nei" 0.057, "toge" 0.052, " far" 0.13, " bro" 0.11, "love " 0.10, and "by "; categories such as "nei", "toge", " far", and " bro" appear before it.) A category node keeps track of various weights, such as edge (or prediction) weights and predictiveness weights (e.g. 7.1 and 0.41 on the slide), plus other statistics (e.g. frequency, first/last time seen), i.e. it keeps local statistics, and it updates them when it is activated as a predictor or target.

Slide 30: Network. Categories and their edges form a network (a directed weighted graph, with different kinds of edges …). The network grows over time: millions of nodes and beyond.

Slide 31: When and How to Compose? Two major approaches: 1. Pre-filter: don't compose if certain conditions are not met (simplest: only consider possibilities that you actually see). 2. Post-filter: compose and use, but remove if certain conditions are not met (e.g. remove if not seen recently enough). I expect both are needed …

Slide 32: Some Composition (Pre-filter) Heuristics. FRAC: if you see c1 then c2 in the stream, then, with some probability, add c = c1c2. MU: use the pointwise mutual information between c1 and c2. IMPROVE: take string lengths into account and check whether joining is better. BOUND: generate all strings under a length threshold Lt. (The first two are sketched below.)
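The FRAC and MU heuristics are easy to sketch; the probability, counts, and threshold below are stand-ins, not the system's actual settings:

```python
import math
import random

def frac(c1, c2, p=0.1):
    """FRAC: on seeing c1 followed by c2, compose with probability p."""
    return c1 + c2 if random.random() < p else None

def mu(c1, c2, counts, pair_counts, n, threshold=3.0):
    """MU: compose when the pointwise mutual information of the pair,
    estimated from n observed episodes, clears a threshold."""
    p1, p2 = counts[c1] / n, counts[c2] / n
    p12 = pair_counts[(c1, c2)] / n
    return c1 + c2 if math.log(p12 / (p1 * p2)) > threshold else None
```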

Slide 33: Prediction Objective. Desirable: learn higher-level categories (bigger/more abstract categories are useful externally). Question: how does this relate to improving predictions? 1. Higher-level categories improve the "context" and can save memory. 2. Bigger categories save time in playing the game (categories are atomic).

Slide 34: Objective (Evaluation Criterion). The matching performance: the number of bits (characters) correctly predicted per unit time, or per prediction action, subject to constraints (space, time, …). How about entropy/perplexity? Categories are structured, so perplexity seems difficult to use. (A sketch of one reading of the metric follows.)
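One plausible reading of the matching score, sketched below: for each prediction action, count the characters of the predicted string that match the true continuation, and average per action. The exact matching rule in the system may differ.

```python
def matching_performance(predictions, actuals):
    """Characters correctly predicted per prediction action."""
    correct = sum(
        sum(p == a for p, a in zip(pred, act))
        for pred, act in zip(predictions, actuals))
    return correct / len(predictions)

# Two actions: "the " matches 4 characters, "care" vs "cat " matches 2.
print(matching_performance(["the ", "care"], ["the ", "cat "]))  # 3.0
```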

Slide 35: Linearity and Non-Linearity (a motivation for new concept creation). Compare the separate characters "n", "e", "w" against the single category "new": which one predicts better, i.e. better constrains what comes next ("new????")? One option aggregates the votes of "n", "e", and "w" to predict what comes next; the other lets the composed category "new" predict directly.

Slide 36: Data. Reuters RCV1 (800k news articles); several online books, by Jane Austen and others; web search query logs.

Slide 37: Some Observations. Ran on the Reuters RCV1 text body (simply zcat dir/file*): ~800k articles; >= 150 million learning/prediction episodes; over 10 million categories built; 3–4 hours per pass (depending on parameters).

Slide 38: Observations. Performance on held-out data (one of the Reuters files): targets to predict average 8–9 characters in length; almost two characters correct on average, per prediction action. The system can overfit/memorize (long categories)! Current remedy: stop category generation after the first pass.

Slide 39: (Figure slide; no transcript text survives.)

Slide 40: Some Example Categories (in order of first appearance and increasing length):
cat name= "<"
cat name= " t"
cat name= ".</"
cat name= "p>- "
cat name= " the "
cat name= "ation "
cat name= "of the "
cat name= "ing the "
cat name= ""The "
cat name= "company said "
cat name= ", the company "
cat name= "said on Tuesday"
cat name= "," said one "
cat name= "," he said.
cat name= "--------------------------------"
cat name= "--------------------------------------------------------"
cat name= "---------------------------------------------------------------
cat name= ". Reuters has not verified these stories and does not vouch for their accuracy.
cat name= "press on Tuesday. Reuters has not verified these stories and does not vouch for their accuracy.
cat name= "press on Thursday. Reuters has not verified these stories and does not vouch for their accuracy.
cat name= "press on Wednesday. Reuters has not verified these stories and does not vouch for their accuracy.
cat name= "within 10 percentage points in either direction of the key 225-share Nikkei average over the next six month"
cat name= "ing and selling rates for leading world currencies and gold against the dollar on the London foreign exchange and bullion "

Slide 41: Example "Recall" Paths, from processing one month of Reuters:
"Sinn Fei" (0.128) "n a seat" (0.527) " in the " (0.538) "talks." (0.468) " B" (0.0185) "rokers " **** The end: connection weight less than: 0.04
" Gas in S" (1) "cotland" (1.04) " and north" (1.18) "ern E" (0.572) "ngland" (0.165) "," a " (0.0542) "spokeswo" (0.551) "man said " (0.044) "the idea" (0.0869) " was to " (0.144) "quot" (0.164) "e the d" (0.0723) "ivision" (0.0671) " in N" (0.397) "ew York" (0.062) " where " (0.0557) "the main " (0.0474) "marque" (0.229) "s were " (0.253) "base" (0.264) "d. "" (0.0451) "It will " (0.117) "certain" (0.0691) "ly b" (0.0892) "e New " (0.353) "York" (0.112) " party" (0.0917) "s is goin" (0.559) "g to " (0.149) "end."" (0.239) " T" (0.104) "wedish " (0.125) "Export" (0.0211) " Credi" **** The end: connection weight less than: 0.04

Slide 42: Search Query Logs.
"bureoofi" (1) "migration" (1.13) "andci" (1.04) "tizenship." (0.31) "com www," (0.11) "ictions" (0.116) "zenship." **** The end: this concept wasn't seen in last 1000000 time points.
Random Recall: "bureoofi" (1) "migration" (0.0129) "dept.com" **** The end: this concept wasn't seen in last 1000000 time points.

Slide 43: Much Related Work! Online learning, cumulative learning, feature and concept induction, neural networks, clustering, Bayesian methods, language modeling, deep learning, "hierarchical" learning, the importance and ubiquity of prediction/anticipation in the brain ("On Intelligence", "natural computations", …), models of the neocortex ("Circuits of the Mind"), concepts and conceptual phenomena (e.g. "The Big Book of Concepts"), compression, …

Slide 44: Summary. Large-scale learning and classification (data hungry; efficiency paramount). A systems approach: the integration of multiple learning processes; the system makes its own classes. Driving objective: improve prediction (currently, "matching" performance). The underlying goal: effectively acquire complex concepts. See www.omadani.net.

Slide 45: Current/Future. Much work remains: integrate the learning of groupings; recognize and use "structural" categories (learn to "parse"/segment?); is the prediction objective adequate?; control over the input stream, etc.; category generation (what are good methods?); other domains (vision, …); comparisons against language modeling, etc.

