And you can too!

SBS

 Introduction  Evidence for Statistics  Bayes' Law  Informative Priors  Joint Models  Inference  Conclusion

Two examples that seem to indicate that the brain is indeed processing statistical information

 Saffran, Aslin, Newport. "Statistical Learning in 8-Month-Old Infants"  The infants listen to strings of nonsense words with no auditory cues to word boundaries.  E.g., "bidakupa …" where "bidaku" is the first word.  They learn to distinguish words from other combinations that occur (with less frequency) over word boundaries.
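The statistical cue the infants are thought to exploit is the transitional probability between adjacent syllables: high inside a word, lower across a word boundary. Below is a minimal sketch of that computation; the three-word nonsense lexicon and the stream length are made up for illustration.

```python
import random
from collections import Counter

random.seed(0)

# Illustrative nonsense lexicon in the style of the Saffran et al. stimuli.
words = ["bidaku", "padoti", "golabu"]
syllables = lambda w: [w[i:i + 2] for i in range(0, len(w), 2)]

# Concatenate randomly chosen words into one unsegmented syllable stream.
stream = [s for _ in range(300) for s in syllables(random.choice(words))]

# Estimate transitional probability P(next syllable | current syllable).
bigrams = Counter(zip(stream, stream[1:]))
unigrams = Counter(stream[:-1])
tp = lambda a, b: bigrams[(a, b)] / unigrams[a]

print("within a word  :", tp("bi", "da"))   # ~1.0
print("across boundary:", tp("ku", "pa"))   # ~0.33
```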

[Figure: experimental setup with speaker, light, and child]

 Based on Rosenholtz et al. (2011) [demo figure: letters A, B]

 Based on Rosenholtz et al. (2011) [demo figure: letters A N O B E L]

 A much better idea than spatial subsampling. [Figure: original patch, ~1000 pixels]

 A rich set of statistics can capture a lot of useful information. [Figure: original patch vs. a patch synthesized to match ~1000 statistical parameters (Portilla & Simoncelli, 2000)]

 Balas, Nakano, & Rosenholtz, JoV, 2009

To my mind, at least, it packs a lot of information

P(M|E) = P(M) P(E|M) / P(E)
M = Learned model of the world
E = Learner's environment (sensory input)

P(M|E) = P(M) P(E|M) / P(E)
It divides up responsibility correctly. It requires a generative model (big and joint). It (obliquely) suggests that, as far as learning goes, we ignore the programs that use the model. But which M?

 Don’t pick M. Integrate over all of them.  Pick the M that maximizes P(M)P(E|M).  Pick the average P(M) (Gibbs sampling). P(E) = Σ P(M)P(E|M) M

Don’t sweat it.

Three examples where they are critical

[Image region labeling example: sky, trees, skyscraper, bell, dome, temple, buildings]

Cut random surfaces (samples from a GP) with thresholds (as in level-set methods). Assign each pixel to the first surface which exceeds threshold (as in layered models). Duan, Guindani, & Gelfand, Generalized Spatial DP, 2007
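Here is a toy sketch of that construction. I approximate GP samples with Gaussian-smoothed white noise (an assumption purely for illustration), then apply the "assign each pixel to the first surface exceeding threshold" rule; the grid size, number of surfaces, and threshold are arbitrary.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
H, W, K = 64, 64, 5       # grid size and number of candidate surfaces
threshold = 0.0

# Stand-in for GP samples: Gaussian-smoothed white-noise fields.
surfaces = np.stack([gaussian_filter(rng.standard_normal((H, W)), sigma=8)
                     for _ in range(K)])

# Layered assignment: each pixel goes to the first surface above threshold;
# pixels where no surface exceeds it get a catch-all label K.
exceeds = surfaces > threshold
labels = np.where(exceeds.any(axis=0), exceeds.argmax(axis=0), K)

print(np.unique(labels, return_counts=True))   # a few spatially coherent regions
```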

Comparison: Potts Markov Random Field

 Based on the work of Goldwater et al.  Separate one "word" from the next in child-directed speech.  E.g., yuwanttusiD6bUk ("You want to see the book")

 Generative Story: For each utterance: for each word w (or STOP), pick it with probability P(w); if w = STOP, break.  If we pick M to maximize P(E|M), the model memorizes the data. I.e., it creates one "word" which is the concatenation of all the words in that sentence.
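A minimal sketch of that generative story; the unigram lexicon and the STOP probability are made up, and each generated utterance is emitted unsegmented, as the learner hears it.

```python
import random
random.seed(0)

# Made-up unigram lexicon P(w) and STOP probability.
p_word = {"you": 0.3, "want": 0.2, "to": 0.2, "see": 0.15, "the": 0.1, "book": 0.05}
p_stop = 0.3

def generate_utterance():
    words = []
    while True:
        if random.random() < p_stop:    # picked STOP: utterance ends
            return "".join(words)        # emitted unsegmented
        w = random.choices(list(p_word), weights=list(p_word.values()))[0]
        words.append(w)

print([generate_utterance() for _ in range(3)])
```

Under a pure maximum-likelihood fit, treating each whole utterance as a single "word" gives it the highest possible probability, which is exactly the memorization failure described above.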

Precision: 61.6  Recall: 47.6  Example: youwant to see thebook

 Primarily based on Clark (2003)  Given a sequence of words, deduce their parts of speech (e.g., DT, NN, etc.)  Generative story: for each word position i in the text, 1) propose a part of speech t_i with probability P(t_i | t_i-1), then 2) propose a word w_i using P(w_i | t_i)
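This generative story is essentially a hidden Markov model over tags. A minimal sketch with made-up transition and emission tables:

```python
import random
random.seed(0)

# Made-up transition P(t | t_prev) and emission P(w | t) tables.
trans = {"START": {"DT": 0.6, "NN": 0.4},
         "DT":    {"NN": 0.9, "DT": 0.1},
         "NN":    {"VB": 0.5, "NN": 0.3, "DT": 0.2},
         "VB":    {"DT": 0.6, "NN": 0.4}}
emit  = {"DT": {"the": 0.7, "a": 0.3},
         "NN": {"dog": 0.5, "bone": 0.5},
         "VB": {"likes": 0.6, "eats": 0.4}}

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate(n_words=5):
    tags, words, t = [], [], "START"
    for _ in range(n_words):
        t = sample(trans[t])              # 1) propose a part of speech from P(t | t_prev)
        tags.append(t)
        words.append(sample(emit[t]))     # 2) propose a word from P(w | t)
    return list(zip(words, tags))

print(generate())
```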

 We could put a Dirichlet prior on P(w|t).  But what we really want is a sparse P(t|w):  almost all words (by type) have only one part of speech, and we do best by only allowing this.  E.g., "can" is only a modal verb (we hope!)  Putting a sparse prior on P(word-type|t) also helps.
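A quick numerical illustration of why a symmetric Dirichlet prior with a small concentration parameter encodes the "almost every word type takes one tag" preference (the number of tags and the alpha values are arbitrary):

```python
import numpy as np
rng = np.random.default_rng(0)

K = 10  # e.g., number of part-of-speech tags a word type could take

# Symmetric Dirichlet draws: small alpha concentrates mass on few tags,
# large alpha spreads it out.
sparse = rng.dirichlet([0.01] * K)
flat   = rng.dirichlet([1.0] * K)

print("alpha=0.01:", np.round(sparse, 3))  # nearly all mass on one tag
print("alpha=1.0 :", np.round(flat, 3))    # mass spread across tags
```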

Two examples that show the strengths of modeling many phenomena jointly.

 The Clark POS tagger also includes something sort of like a morphology model.  It assumes POS tags are correlated with spelling.  True morphology would recognize that "ride", "riding", and "rides" share a root.  I do not know of any true joint tagging-morphology model.

 Based on Haghighi & Klein (2010)
Example: "Weiner said the problems were all Facebook's fault. They should never have given him an account."
Type 1 (person): Obama, Weiner, father
Type 2 (organization): IBM, Facebook, company

Otherwise known as hardware.

 More generally, it cannot be any mechanism that requires tracking all expectations.  Consider word boundaries: between every two phonemes there may or may not be a boundary. abcde, a|bcde, ab|cde, abc|de, abcd|e, a|b|cde, …
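The combinatorics make the point concrete: with an independent yes/no boundary decision between each pair of adjacent phonemes, an n-phoneme string has 2^(n-1) segmentations, far too many to track explicitly. A small enumeration sketch:

```python
from itertools import product

def segmentations(s):
    """Enumerate every way to place boundaries between adjacent phonemes."""
    n = len(s)
    for cuts in product([False, True], repeat=n - 1):
        parts, start = [], 0
        for i, cut in enumerate(cuts, start=1):
            if cut:
                parts.append(s[start:i])
                start = i
        parts.append(s[start:])
        yield "|".join(parts)

segs = list(segmentations("abcde"))
print(len(segs))   # 16 = 2**(5-1)
print(segs[:5])    # 'abcde', 'abcd|e', 'abc|de', ...
```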

 Start out with random guesses. Do (roughly) forever: pick a random point; compute p(split) and p(join); pick r, 0 < r < 1: if p(split) / (p(split) + p(join)) > r, split; else join.
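A schematic sketch of that sampler. The p_split and p_join functions below are placeholders returning fixed scores; in a real Goldwater-style sampler they would be computed from the model's probability of the words created or merged at that point, given the rest of the current segmentation.

```python
import random
random.seed(0)

phonemes = "yuwanttusiD6bUk"
# One boundary variable between each pair of adjacent phonemes.
boundaries = [random.random() < 0.5 for _ in range(len(phonemes) - 1)]

def p_split(i):
    return 0.4   # placeholder; really: model probability of the two split words

def p_join(i):
    return 0.6   # placeholder; really: model probability of the single merged word

for _ in range(10000):                        # "do (roughly) forever"
    i = random.randrange(len(boundaries))     # pick a random point
    ps, pj = p_split(i), p_join(i)
    boundaries[i] = ps / (ps + pj) > random.random()   # split with prob ps / (ps + pj)

# Read off the current segmentation.
words, start = [], 0
for i, b in enumerate(boundaries, start=1):
    if b:
        words.append(phonemes[start:i])
        start = i
words.append(phonemes[start:])
print("|".join(words))
```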

 First, the nice properties only hold for "exchangeable" distributions, and it seems likely that most of the ones we care about are not (e.g., Haghighi & Klein).  But critically, it assumes we have all the training data at once and go over it many times.

 Or something like it.  At the level of detail here, just think “beam search.”

[Figure: parse tree for "Dogs like bones": S with NP (NNS Dogs) and VP (VBS like, NP (NNS bones)), marked with an information barrier]

 Or something like it.  At the level of detail here, just think "beam search."  Candidate partial parses (particles):
(ROOT (S (NP (NNS Dogs)
(ROOT (NP (NNS Dogs)
(ROOT (S (NP (NNS Dogs)) (VP (VBS eat)
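A schematic sketch of the beam-search view: keep only the k best-weighted partial hypotheses as each word arrives. The extend function here is a stand-in that proposes fixed scored actions; a real incremental parser would propose scored partial parses like the particles above.

```python
import heapq

def extend(hypothesis, word):
    # Stand-in for a parser proposing ways to incorporate the next word;
    # a real incremental parser would return scored partial parses.
    return [(hypothesis + [(word, action)], score)
            for action, score in [("shift", 0.6), ("attach", 0.3), ("reduce", 0.1)]]

def beam_parse(words, k=3):
    beam = [([], 1.0)]                      # (partial hypothesis, weight)
    for w in words:
        candidates = [(hyp2, weight * s)
                      for hyp, weight in beam
                      for hyp2, s in extend(hyp, w)]
        beam = heapq.nlargest(k, candidates, key=lambda c: c[1])  # prune to k particles
    return beam

for hyp, weight in beam_parse(["Dogs", "eat", "bones"]):
    print(round(weight, 4), [action for _, action in hyp])
```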

 The brain operates by manipulating probabilities.  World-model induction is governed by Bayes' Law.  This implies we have a large joint generative model.  It seems overwhelmingly likely that we have a very informative prior.  Something like particle filtering is the inference/use mechanism.