Presentation on theme: "CS 194-10 Fall 2011, Stuart Russell" — Presentation transcript:

1 CS 194-10, Fall 2011: Introduction to Machine Learning. Machine Learning: An Overview

2 People
Avital Steinitz, 2nd-year CS PhD student; Stuart Russell, 30th-year CS PhD student; Mert Pilanci, 2nd-year EE PhD student
Lecture 1, 8/25/11, CS 194-10 Fall 2011, Stuart Russell

3 Administrative details
Web page; newsgroup

4 Course outline
Overview of machine learning (today)
Classical supervised learning: linear regression, perceptrons, neural nets, SVMs, decision trees, nearest neighbors, and all that; a little bit of theory, a lot of applications
Learning probabilistic models: probabilistic classifiers (logistic regression, etc.); unsupervised learning, density estimation, EM; Bayes net learning; time series models; dimensionality reduction; Gaussian process models; language models
Bandits and other exciting topics

5 Lecture outline
Goal: provide a framework for understanding all the detailed content to come, and why it matters
Learning: why and how
Supervised learning. Classical: finding simple, accurate hypotheses; probabilistic: finding likely hypotheses; Bayesian: updating belief in hypotheses
Data and applications
Expressiveness and cumulative learning
CTBT

6 Learning is...
... a computational process for improving performance based on experience

9 Learning: Why?
The baby, assailed by eyes, ears, nose, skin, and entrails at once, feels it all as one great blooming, buzzing confusion ... [William James, 1890]
Learning is essential for unknown environments, i.e., when the designer lacks omniscience

10 Learning: Why?
Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child's? If this were then subjected to an appropriate course of education one would obtain the adult brain. Presumably the child brain is something like a notebook as one buys it from the stationer's. Rather little mechanism, and lots of blank sheets. [Alan Turing, 1950]
Learning is useful as a system construction method, i.e., expose the system to reality rather than trying to write it down

11 Learning: How?

15 Structure of a learning agent

16 Design of learning element
Key questions: What agent design will implement the desired performance? Which piece of the agent system should be improved, and how is that piece represented? What data are available that are relevant to that piece? (In particular, do we know the right answers?) What knowledge is already available?

17 Examples
Agent design / Component / Representation / Feedback / Knowledge
Alpha-beta search / evaluation function / linear polynomial / win-loss / rules of game; coefficient signs
Logical planning agent / transition model (observable envt) / successor-state axioms / action outcomes / available actions; argument types
Utility-based patient monitor / physiology-sensor model / dynamic Bayesian network / observation sequences / general physiology; sensor design
Satellite image pixel classifier / classifier (policy) / Markov random field / partial labels / coastline; continuity scales
Supervised learning: correct answers for each training instance. Reinforcement learning: reward sequence, no correct answers. Unsupervised learning: "just make sense of the data"

18 Supervised learning
To learn an unknown target function f
Input: a training set of labeled examples (x_j, y_j) where y_j = f(x_j). E.g., x_j is an image and f(x_j) is the label "giraffe"; or x_j is a seismic signal and f(x_j) is the label "explosion"
Output: hypothesis h that is "close" to f, i.e., predicts well on unseen examples (the "test set")
Many possible hypothesis families for h: linear models, logistic regression, neural networks, decision trees, examples (nearest-neighbor), grammars, kernelized separators, etc.
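A minimal runnable sketch of this setup, using the nearest-neighbor family from the list above. The tiny 2-D "image feature" vectors and their labels are invented purely for illustration.

```python
# Minimal supervised-learning sketch: learn h from labeled (x_j, y_j) pairs,
# then predict on unseen x. Hypothesis family: 1-nearest-neighbor.
# The 2-D feature vectors and labels below are invented for illustration.

def nearest_neighbor_fit(examples):
    """'Training' for 1-NN is just memorizing the labeled examples."""
    def h(x):
        # Predict the label of the closest training point (squared Euclidean).
        def dist2(pair):
            return sum((a - b) ** 2 for a, b in zip(pair[0], x))
        return min(examples, key=dist2)[1]
    return h

train = [((1.0, 1.0), "giraffe"), ((1.2, 0.9), "giraffe"),
         ((5.0, 4.8), "llama"),   ((5.2, 5.1), "llama")]

h = nearest_neighbor_fit(train)
print(h((1.1, 1.0)))   # close to the giraffe cluster
print(h((5.1, 5.0)))   # close to the llama cluster
```

Note that h is only "close" to f: a test point far from both clusters would still get one of the two labels, however unreliable that guess is.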

22 Example: object recognition
x: [six training images]; f(x): giraffe, giraffe, giraffe, llama, llama, llama
New input X: [image]; f(X) = ?

23 Example: curve fitting
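As a concrete instance of curve fitting, here is a least-squares straight-line fit, the simplest hypothesis family discussed in the later slides. The data points are made up and lie near the line y = 2x + 1.

```python
# Curve-fitting sketch: fit a straight line h(x) = w0 + w1*x to (x_j, y_j)
# pairs by least squares (closed form for one input variable).
# The data points are invented; they lie near the line y = 2x + 1.

def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope = sample covariance / sample variance; intercept from the means.
    w1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    w0 = my - w1 * mx
    return w0, w1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]   # roughly 2x + 1
w0, w1 = fit_line(xs, ys)
print(round(w0, 2), round(w1, 2))
```

A higher-degree polynomial would fit these five points even more closely, which is exactly the fit-versus-complexity tension the next slides take up.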

28 Basic questions
Which hypothesis space H to choose?
How to measure degree of fit?
How to trade off degree of fit vs. complexity? ("Ockham's razor")
How do we find a good h?
How do we know whether a good h will predict well?

29 Philosophy of Science (Physics)
Which hypothesis space H to choose? Deterministic hypotheses, usually mathematical formulas and/or logical sentences; implicit relevance determination
How to measure degree of fit? Ideally, h will be consistent with the data
How to trade off degree of fit vs. complexity? Theory must be correct up to "experimental error"
How do we find a good h? Intuition, imagination, inspiration (invent new terms!!)
How do we know if a good h will predict well? Hume's Problem of Induction: most philosophers give up

30 Kolmogorov complexity (also MDL, MML)
Which hypothesis space H to choose? All Turing machines (or programs for a UTM)
How to measure degree of fit? Fit is perfect (the program has to output the data exactly)
How to trade off degree of fit vs. complexity? Minimize the size of the program
How do we find a good h? Undecidable (unless we bound the time complexity of h)
How do we know if a good h will predict well? (Recent theory borrowed from PAC learning)
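A toy illustration of the MDL recipe above. Character counts stand in for program length on a universal machine (a deliberately crude assumption), and the candidate "programs" are a literal copy and a hypothetical repeat-encoder; MDL requires the fit to be exact, so an encoder that does not reproduce the data is ruled out.

```python
# Toy MDL sketch: trade hypothesis size against fit, as on the slide.
# We "encode" a string either literally or as (pattern, repeat-count) and
# pick the shorter total description. Costs are in characters, a crude
# stand-in for program length on a universal machine.

def literal_cost(s):
    return len(s)

def repeat_cost(pattern, s):
    # Valid only if the pattern actually reproduces the data exactly
    # (MDL uses a lossless code: fit must be perfect).
    if pattern * (len(s) // len(pattern)) != s:
        return float("inf")
    return len(pattern) + len(str(len(s) // len(pattern)))

data = "AB" * 20
candidates = {"literal": literal_cost(data),
              "repeat-AB": repeat_cost("AB", data),
              "repeat-ABA": repeat_cost("ABA", data)}
best = min(candidates, key=candidates.get)
print(best, candidates[best])
```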

31 Classical stats/ML: minimize a loss function
Which hypothesis space H to choose? E.g., linear combinations of features: h_w(x) = w^T x
How to measure degree of fit? A loss function, e.g., squared error Σ_j (y_j − w^T x_j)^2
How to trade off degree of fit vs. complexity? Regularization: a complexity penalty, e.g., ||w||^2
How do we find a good h? Optimization (closed-form, numerical); discrete search
How do we know if a good h will predict well? Try it and see (cross-validation, bootstrap, etc.)
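The loss-plus-penalty recipe has a closed form in the simplest case. This is a sketch for a one-weight model h_w(x) = w*x with invented data: setting the derivative of Σ_j (y_j − w·x_j)^2 + λw^2 to zero gives w = Σ_j x_j·y_j / (Σ_j x_j^2 + λ).

```python
# Classical-ML sketch matching the slide: squared-error loss plus an L2
# complexity penalty, for a one-weight linear model h_w(x) = w*x.
# Minimizing sum_j (y_j - w*x_j)**2 + lam*w**2 has the closed form below.

def ridge_1d(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.0]          # roughly y = 2x (invented data)
w_unreg = ridge_1d(xs, ys, 0.0)
w_reg   = ridge_1d(xs, ys, 10.0)
print(round(w_unreg, 3), round(w_reg, 3))  # the penalty shrinks w toward 0
```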

32 Probabilistic: max. likelihood, max. a posteriori
Which hypothesis space H to choose? A probability model P(y | x, h), e.g., Y ~ N(w^T x, σ^2)
How to measure degree of fit? Data likelihood Π_j P(y_j | x_j, h)
How to trade off degree of fit vs. complexity? Regularization or a prior: argmax_h P(h) Π_j P(y_j | x_j, h) (MAP)
How do we find a good h? Optimization (closed-form, numerical); discrete search
How do we know if a good h will predict well? Empirical process theory (generalizes Chebyshev, CLT, PAC...); key assumption is (i)id
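A sketch of the MAP recipe for a one-weight model: Gaussian likelihood Y ~ N(w·x, σ^2) and a Gaussian prior w ~ N(0, τ^2), maximized here in log space by a crude grid search rather than in closed form. All numbers are invented.

```python
import math

# MAP sketch for the model on the slide: Y ~ N(w*x, sigma^2) with a
# Gaussian prior w ~ N(0, tau^2). We maximize log P(w) + sum_j log P(y_j|x_j,w)
# by grid search over w; all numbers are invented for illustration.

def log_gauss(v, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)

def map_estimate(xs, ys, sigma2, tau2):
    grid = [i / 1000.0 for i in range(-5000, 5001)]   # w in [-5, 5]
    def objective(w):
        prior = log_gauss(w, 0.0, tau2)
        lik = sum(log_gauss(y, w * x, sigma2) for x, y in zip(xs, ys))
        return prior + lik
    return max(grid, key=objective)

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.0]
w_map = map_estimate(xs, ys, sigma2=1.0, tau2=0.1)
print(w_map)   # pulled toward 0 relative to the ML estimate (about 1.99)
```

With this model the MAP objective is exactly the regularized squared-error loss of the previous slide, with λ = σ^2/τ^2; the grid search lands on the same ridge solution.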

33 Bayesian: computing the posterior over H
Which hypothesis space H to choose? All hypotheses with nonzero a priori probability
How to measure degree of fit? Data probability, as for MLE/MAP
How to trade off degree of fit vs. complexity? Use the prior, as for MAP
How do we find a good h? Don't! The Bayes predictor averages over hypotheses: P(y|x,D) = Σ_h P(y|x,h) P(h|D) ∝ Σ_h P(y|x,h) P(D|h) P(h)
How do we know if a good h will predict well? Silly question! Bayesian prediction is optimal!!
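The Bayes predictor can be computed exactly when H is small. Here is a toy sketch with three coin-bias hypotheses, a uniform prior, and invented data (8 heads in 10 flips): rather than committing to one h, predictions are averaged under the posterior.

```python
# Bayes-predictor sketch for the slide's formula: instead of picking one h,
# average predictions over the posterior, P(y|D) = sum_h P(y|h) P(h|D).
# Toy setup (invented): three coin-bias hypotheses, uniform prior, and
# observed data D = 8 heads out of 10 flips.

hypotheses = [0.25, 0.5, 0.75]          # P(heads) under each h
prior = {h: 1 / 3 for h in hypotheses}

def likelihood(h, heads, tails):
    return (h ** heads) * ((1 - h) ** tails)

heads, tails = 8, 2
unnorm = {h: prior[h] * likelihood(h, heads, tails) for h in hypotheses}
z = sum(unnorm.values())
posterior = {h: p / z for h, p in unnorm.items()}

# Posterior-predictive probability that the next flip is heads.
p_next_heads = sum(h * posterior[h] for h in hypotheses)
print(round(p_next_heads, 3))
```

The answer sits between 0.5 and 0.75 because the posterior still assigns some weight to the fair coin; a MAP learner would simply predict 0.75.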

35 Neon sculpture at Autonomy Corp.


37 Lots of data
Web: estimated Google index 45 billion pages
Clickstream data: TB/day
Transaction data: 5-50 TB/day
Satellite image feeds: ~1 TB/day/satellite
Sensor networks/arrays: CERN Large Hadron Collider ~100 petabytes/day
Biological data: 1-10 TB/day/sequencer
TV: 2 TB/day/channel; YouTube: 4 TB/day uploaded
Digitized telephony: ~100 petabytes/day

38 This is what an ICU looks like: ventilator, fluids, monitors; ~200 medical procedures per day, many potentially fatal

39 Real data are messy

40 Arterial blood pressure (high/low/mean), 1 s

41 Application: satellite image analysis

42 Application: discovering DNA motifs
...TTGGAACAACCATGCACGGTTGATTCGTGCCTGTGACCGCGCGCCTCACACGGAAGACGCAGCCACCGGTTGTGATG TCATAGGGAATTCCCCATGTCGTGAATAATGCCTCGAATGATGAGTAATAGTAAAACGCAGGGGAGGTTCTTCAGTAGTA TCAATATGAGACACATACAAACGGGCGTACCTACCGCAGCTCAAAGCTGGGTGCATTTTTGCCAAGTGCCTTACTGTTAT CTTAGGACGGAAATCCACTATAAGATTATAGAAAGGAAGGCGGGCCGAGCGAATCGATTCAATTAAGTTATGTCACAAGG GTGCTATAGCCTATTCCTAAGATTTGTACGTGCGTATGACTGGAATTAATAACCCCTCCCTGCACTGACCTTGACTGAAT AACTGTGATACGACGCAAACTGAACGCTGCGGGTCCTTTATGACCACGGATCACGACCGCTTAAGACCTGAGTTGGAGTT GATACATCCGGCAGGCAGCCAAATCTTTTGTAGTTGAGACGGATTGCTAAGTGTGTTAACTAAGACTGGTATTTCCACTA GGACCACGCTTACATCAGGTCCCAAGTGGACAACGAGTCCGTAGTATTGTCCACGAGAGGTCTCCTGATTACATCTTGAA GTTTGCGACGTGTTATGCGGATGAAACAGGCGGTTCTCATACGGTGGGGCTGGTAAACGAGTTCCGGTCGCGGAGATAAC TGTTGTGATTGGCACTGAAGTGCGAGGTCTTAAACAGGCCGGGTGTACTAACCCAAAGACCGGCCCAGCGTCAGTGA...

44 Application: user website behavior from clickstream data (from P. Smyth, UCI)
, -, 3/22/00, 10:35:11, W3SVC, SRVR1, , 781, 363, 875, 200, 0, GET, /top.html, -,
, -, 3/22/00, 10:35:16, W3SVC, SRVR1, , 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
, -, 3/22/00, 10:35:17, W3SVC, SRVR1, , 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
, -, 3/22/00, 16:18:50, W3SVC, SRVR1, , 60, 425, 72, 304, 0, GET, /top.html, -,
, -, 3/22/00, 16:18:58, W3SVC, SRVR1, , 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
, -, 3/22/00, 16:18:59, W3SVC, SRVR1, , 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
, -, 3/22/00, 20:54:37, W3SVC, SRVR1, , 140, 199, 875, 200, 0, GET, /top.html, -,
, -, 3/22/00, 20:54:55, W3SVC, SRVR1, , 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
, -, 3/22/00, 20:54:55, W3SVC, SRVR1, , 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
, -, 3/22/00, 20:55:07, W3SVC, SRVR1, , 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
, -, 3/22/00, 20:55:36, W3SVC, SRVR1, , 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
, -, 3/22/00, 20:55:36, W3SVC, SRVR1, , 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
, -, 3/22/00, 20:55:39, W3SVC, SRVR1, , 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
, -, 3/22/00, 20:56:03, W3SVC, SRVR1, , 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
, -, 3/22/00, 20:56:04, W3SVC, SRVR1, , 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
, -, 3/22/00, 20:56:33, W3SVC, SRVR1, , 0, 262, 72, 304, 0, GET, /top.html, -,
, -, 3/22/00, 20:56:52, W3SVC, SRVR1, , 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
User 1: 2 3 2 2 3 3 3 1 1 1 3 1 3 3 3 3
User 2: 3 3 3 1 1 1
User 3: 7 7 7 7 7 7 7 7
User 4: 1 5 1 1 1 5 1 5 1 1 1 1 1 1
User 5: 5 1 1 5

45 Application: social network analysis
HP Labs data: 500 users, 20k connections, evolving over time

46 Application: spam filtering
200 billion spam messages sent per day
Asymmetric cost of false positives vs. false negatives
Weak label: discarded without reading; strong label ("this is spam") hard to come by
Standard iid assumption violated: spammers alter spam generators to evade or subvert spam filters (an "adversarial learning" task)
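The asymmetric-cost point corresponds to a standard decision-theoretic threshold: flag a message as spam only when P(spam|message) exceeds C_fp / (C_fp + C_fn). The cost numbers below are invented for illustration.

```python
# Sketch of the asymmetric-cost point: with separate costs for the two error
# types, the expected-cost-minimizing rule flags a message as spam only when
# P(spam|message) exceeds cost_fp / (cost_fp + cost_fn).

def flag_as_spam(p_spam, cost_fp, cost_fn):
    # Expected cost of flagging = (1 - p)*cost_fp (might lose real mail);
    # expected cost of passing  = p*cost_fn (user sees spam). Flag if cheaper.
    return p_spam > cost_fp / (cost_fp + cost_fn)

# Suppose losing a legitimate mail is 50x worse than seeing one spam;
# the threshold is then 50/51, roughly 0.98:
print(flag_as_spam(0.90, cost_fp=50, cost_fn=1))
print(flag_as_spam(0.99, cost_fp=50, cost_fn=1))
```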

50 Learning
[Diagram: data and prior knowledge feed the Learning process, which produces knowledge]
Crucial open problem: weak intermediate forms of knowledge that support future generalizations

51 Example

54 Example: arriving at Sao Paulo, Brazil
Bem-vindo! ("Welcome!")

60 Weak prior knowledge
In this case: people in a given country (and city) tend to speak the same language
Where did this knowledge come from? Experience with other countries; "common sense", i.e., knowledge of how societies and languages work
And where did that knowledge come from?

61 Knowledge? What is knowledge? All I know is samples!! [V. Vapnik]
All knowledge derives, directly or indirectly, from the experience of individuals
Knowledge serves as a directly applicable shorthand for all that experience: better than requiring constant review of the entire sensory/evolutionary history of the human race

62 Expressiveness

63 The world has things in it!!
Expressive language => concise models => fast learning, sometimes fast reasoning
E.g., the rules of chess: 1 page in first-order logic (On(color,piece,x,y,t)); ~ pages in propositional logic (WhiteKingOnC4Move12); ~ pages as an atomic-state model (R.B.KB.RPPP..PPP..N..N…..PP….q.pp..Q..n..n..ppp..pppr.b.kb.r)
[Note: chess is a tiny problem compared to the real world]

69 Brief history of expressiveness
Probability: atomic, 17th C; propositional, 20th C; first-order/relational, 21st C
Logic: propositional, 5th C B.C.; first-order/relational, 19th C

70 Brief history of expressiveness
Probability: atomic: Bernoulli, Categorical, univariate Gaussian; propositional: (H)MMs, Bayes nets, MRFs, multivariate Gaussians, DBNs, Kalman filters; first-order/relational: RPMs, BLOG, MLNs, (DBLOG)
Logic: atomic: finite automata; propositional: OBDDs, k-CNF, decision trees, perceptrons, propositional STRIPS, register circuits; first-order/relational: first-order logic, database systems, programs, first-order STRIPS, temporal logic

71 CTBT: Comprehensive Nuclear-Test-Ban Treaty
Bans testing of nuclear weapons on earth
Allows for outside inspection of 1000 km²
182/195 states have signed; 153/195 have ratified
Need 9 more ratifications, including US and China; US Senate refused to ratify in 1998 ("too hard to monitor")

72 2053 nuclear explosions


74 254 monitoring stations


76 The problem
Given waveform traces from all seismic stations, figure out what events occurred, when, and where
Traces at each sensor station may be preprocessed to form "detections" (90% are not real)
[Sample detection records with fields ARID, ORID, STA, PH, BEL, DELTA, SEAZ, ESAZ, TIME, TDEF, AZRES, ADEF, SLORES, SDEF, WGT, VMODEL, LDDATE, for stations WRA, FITZ, MKAR, ASAR (phase P, model IASP); numeric values lost in transcription]

77 What do we know?
Events happen randomly; each has a time, location, depth, and magnitude; seismicity varies with location
Seismic waves of many kinds ("phases") travel through the Earth
Travel time and attenuation depend on phase and source/destination
Arriving waves may or may not be detected, depending on the sensor and the local noise environment
Local noise may also produce false detections

88 Generative model for seismic monitoring (BLOG-style):
# SeismicEvents ~ Poisson[TIME_DURATION*EVENT_RATE];
IsEarthQuake(e) ~ Bernoulli(.999);
EventLocation(e) ~ If IsEarthQuake(e) then EarthQuakeDistribution() Else UniformEarthDistribution();
Magnitude(e) ~ Exponential(log(10)) + MIN_MAG;
Distance(e,s) = GeographicalDistance(EventLocation(e), SiteLocation(s));
IsDetected(e,p,s) ~ Logistic[SITE_COEFFS(s,p)](Magnitude(e), Distance(e,s));
#Arrivals(site = s) ~ Poisson[TIME_DURATION*FALSE_RATE(s)];
#Arrivals(event=e, site) = If IsDetected(e,s) then 1 else 0;
Time(a) ~ If (event(a) = null) then Uniform(0,TIME_DURATION) else IASPEI(EventLocation(event(a)), SiteLocation(site(a)), Phase(a)) + TimeRes(a);
TimeRes(a) ~ Laplace(TIMLOC(site(a)), TIMSCALE(site(a)));
Azimuth(a) ~ If (event(a) = null) then Uniform(0, 360) else GeoAzimuth(EventLocation(event(a)), SiteLocation(site(a))) + AzRes(a);
AzRes(a) ~ Laplace(0, AZSCALE(site(a)));
Slow(a) ~ If (event(a) = null) then Uniform(0,20) else IASPEI-SLOW(EventLocation(event(a)), SiteLocation(site(a))) + SlowRes(site(a));

89 Learning with prior knowledge
Instead of learning a mapping from detection histories to event bulletins, learn local pieces of an overall structured model: event location prior (A6); predictive travel time model (A1); phase type classifier (A2)

90 Event location prior (A6)

91 Travel time prediction (A1)
How long does it take for a seismic signal to get from A to B? This is the travel time T(A,B)
If we know T accurately, and we know the arrival times t_1, t_2, t_3, ... at several stations B_1, B_2, B_3, ..., we can find an accurate estimate of the location A and time t of the event, such that T(A,B_i) ≈ t_i - t for all i
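The estimation idea on this slide can be sketched as a brute-force search in a toy one-dimensional world with a constant wave speed; every number below is invented, and a real system would use a proper travel-time model and a smarter optimizer.

```python
# Sketch of the localization idea: given arrival times t_i at stations B_i
# and a travel-time model T(A, B), search for the event location A and
# origin time t minimizing sum_i (T(A, B_i) - (t_i - t))**2.
# Toy 1-D world with a constant wave speed; all numbers invented.

SPEED = 8.0                      # km/s, assumed uniform

def travel_time(a, b):
    return abs(a - b) / SPEED

stations = [0.0, 100.0, 400.0]   # station positions in km
true_loc, true_t0 = 250.0, 10.0
arrivals = [true_t0 + travel_time(true_loc, b) for b in stations]

def residual(loc, t0):
    return sum((travel_time(loc, b) - (t - t0)) ** 2
               for b, t in zip(stations, arrivals))

# Brute-force grid search over candidate locations and origin times.
best = min(((loc, t0) for loc in range(0, 501)
                      for t0 in [i / 10.0 for i in range(0, 201)]),
           key=lambda p: residual(p[0], p[1]))
print(best)   # recovers (250, 10.0)
```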

92 Earth 101

93 Seismic "phases" (wave types/paths)
Seismic energy is emitted in different types of waves; there are also qualitatively distinct paths (e.g., direct vs. reflected from the surface vs. refracted through the core). P and S are the direct waves; P is faster


95 IASP91 reference velocity model
Spherically symmetric, V_phase(depth); from this, obtain T_predicted(A,B)

96 IASP91 inaccuracy is too big!
The Earth is inhomogeneous: variations in crust thickness and rock properties ("fast" and "slow")

97 Travel time residuals (T_actual - T_predicted)
The residual surface (w.r.t. a particular station) is locally smooth; estimate it by local regression
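Local regression of a smooth residual surface can be sketched with kernel-weighted averaging (Nadaraya-Watson smoothing, one simple form of local regression). The 1-D source locations and residual values below are invented for illustration.

```python
import math

# Local-regression sketch for smoothing travel-time residuals: estimate the
# residual surface at a query point as a Gaussian-kernel-weighted average of
# nearby observed residuals (Nadaraya-Watson smoothing). Data are invented.

def local_estimate(query, points, bandwidth=1.0):
    """points: list of (location, residual) pairs; locations are 1-D here."""
    weights = [math.exp(-((x - query) ** 2) / (2 * bandwidth ** 2))
               for x, _ in points]
    return sum(w * r for w, (_, r) in zip(weights, points)) / sum(weights)

# Observed residuals (seconds) at several source locations along a line:
obs = [(0.0, 0.5), (1.0, 0.7), (2.0, 0.6), (8.0, -1.2), (9.0, -1.0)]
print(round(local_estimate(1.0, obs), 2))   # near the +0.6 cluster
print(round(local_estimate(8.5, obs), 2))   # near the -1.1 cluster
```

The bandwidth controls the locality: a small bandwidth tracks each cluster of observations, while a very large one would flatten the estimate toward the global mean.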

