A Tutorial on ML Basics and Embedding Chong Ruan

Machine Learning Basics What is machine learning? Traditionally: Input + Model => Output What ML does: Input (+ Output) => Model

Machine Learning Basics Example: how to write a classifier that can distinguish between a watermelon and an apple? The traditional way: use hand-crafted features and manually code how they relate to the expected output. Feature: express a sample as some values (typically a vector), e.g. the color, shape, and weight of a fruit. If (weight > 2kg) print "Watermelon"; else print "Apple". Difficulty: sometimes the relation between inputs and outputs is too complicated to be written down by hand. Say, how would you write a program to recognize a cat?

Machine Learning Basics Example: how to write a classifier that can distinguish between a watermelon and an apple? The ML way: collect some data with labels, say (7.5, watermelon), (8.1, watermelon), (0.5, apple), (0.6, apple). Propose a hypothesis (model space): if (weight > T) print "Class A"; else print "Class B" (T: unknown parameter). You may want a more complicated hypothesis for more difficult tasks. Training/learning the model: typically achieved by optimizing an objective function, i.e. find the best threshold T such that most samples are classified correctly (a minimal sketch follows below).
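The sketch below is a hedged illustration of this training step, not code from the talk: it tries candidate thresholds on the four labeled weights above and keeps the one with the highest training accuracy. The data comes from the slide; the helper names are invented for this example.

data = [(7.5, "watermelon"), (8.1, "watermelon"), (0.5, "apple"), (0.6, "apple")]

def accuracy(T):
    # classify with the hypothesis "weight > T => watermelon" and count hits
    correct = sum((("watermelon" if w > T else "apple") == label) for w, label in data)
    return correct / len(data)

# candidate thresholds: midpoints between consecutive sorted weights
weights = sorted(w for w, _ in data)
candidates = [(a + b) / 2 for a, b in zip(weights, weights[1:])]
best_T = max(candidates, key=accuracy)
print(best_T, accuracy(best_T))   # here T = 4.05 with accuracy 1.0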

Machine Learning Basics The difficulties of the ML approach: how to collect (labelled) data, what kind of hypothesis to use, … The first (and most important) step is: how to choose features.

Machine Learning Basics Look back at the fruit example above. How to represent a fruit: its weight, color, flavor, size, etc. Feature selection: which subset of features is useful for our purpose (classification)? In this example, weight (and/or size) is a great choice (thanks to our prior knowledge).

Machine Learning Basics More on feature selection: sometimes we do not need to select (or construct) proper features by hand; just feed all available features to a model and let the model learn automatically! But too many features may slow down the training and prediction procedures and cause the model to overfit or misfit the dataset. So we want to reduce the dimension of the collected data…

Machine Learning Basics Examples of huge feature sets: in NLP, use TF-IDF features to express a document; in music recommendation, use the user-music matrix to predict users' preferences, typically a really huge matrix; …

Dimension Reduction PCA: Principal Component Analysis, a well-known dimension reduction algorithm. An example:

Dimension Reduction PCA: Principal Component Analysis Algorithm: center the data, compute the covariance matrix, perform an eigenvalue decomposition, and project the data onto the top-k eigenvectors (a sketch follows below).
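As a hedged reference, here is a minimal Python/NumPy sketch of the standard PCA procedure the next two slides walk through (centering, covariance, eigenvalue decomposition, projection). The data matrix below is a hypothetical zero-mean example, not necessarily the one shown in the talk.

import numpy as np

def pca(X, k):
    # X: d x m, one sample per column (so the projection "left-multiplies X")
    X = X - X.mean(axis=1, keepdims=True)   # make each dimension zero-mean
    C = X @ X.T / X.shape[1]                # covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalue decomposition of C
    order = np.argsort(eigvals)[::-1]       # sort eigenvalues, largest first
    P = eigvecs[:, order[:k]].T             # top-k eigenvectors as rows of P
    return P @ X                            # k x m reduced data

X = np.array([[-1.0, -1.0, 0.0, 2.0, 0.0],   # hypothetical zero-mean data
              [-2.0,  0.0, 0.0, 1.0, 1.0]])
print(pca(X, k=1))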

Dimension Reduction PCA: Principal Component Analysis For the aforementioned example: matrix X (the mean value of each dimension is already 0); covariance matrix C = (1/m) X X^T.

Dimension Reduction PCA: Principal Component Analysis For the aforementioned example: perform an eigenvalue decomposition of C, choose the top k rows of the eigenvector matrix P, and left-multiply X; in this case k = 1.

Dimension Reduction

Embedding

Graph embedding Suppose you have a graph with m vertices, where each vertex represents a data point (say, a person, a word, etc.), and a similarity matrix W, where W_ij measures to what extent vertices i and j are similar (say, the number of mutual friends, mutual information, etc.). Using the row itself to represent each vertex is too cumbersome: a social network may have millions of users, and a corpus may have tens of thousands of words. The purpose: assign a low-dimensional vector to each vertex of the graph that preserves the similarities between vertex pairs.
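One classical way to compute such an embedding (a hedged sketch, not necessarily the method the talk has in mind) is spectral embedding with the graph Laplacian: vertices connected by large similarities end up close together in the low-dimensional space.

import numpy as np

def spectral_embedding(W, k):
    # W: m x m symmetric similarity matrix; returns one k-dim row per vertex
    D = np.diag(W.sum(axis=1))             # degree matrix
    L = D - W                              # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    # drop the trivial constant eigenvector, keep the next k
    return eigvecs[:, 1:k + 1]

W = np.array([[0, 3, 1, 0],                # hypothetical similarity matrix
              [3, 0, 2, 0],
              [1, 2, 0, 4],
              [0, 0, 4, 0]], dtype=float)
print(spectral_embedding(W, k=2))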

Embedding Word embedding (word vector) Suppose you have a corpus. The patterns in which words occur in the corpus reflect their meanings: similar/related words are likely to appear together. The purpose: assign a low-dimensional vector to each word which preserves the information of the word, say its polarity, grammatical function, semantic properties, etc. If two words are similar, their word embeddings are similar.

Embedding The advantages of embedding: compact (word embedding vs. one-hot expression); handy for numerical operations, say calculating similarity between samples (see the sketch below); easy to visualize: use PCA/LDA/… to transform data points to a 2- or 3-dimensional space and plot them.
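A hedged example of such a numerical operation is cosine similarity between two embedding vectors; the vectors below are random placeholders, not real word embeddings.

import numpy as np

def cosine_similarity(u, v):
    # cosine of the angle between two embedding vectors
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
u, v = rng.random(50), rng.random(50)      # placeholder 50-dim embeddings
print(cosine_similarity(u, v))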

Embedding Obtain embeddings with neural networks. Wishful thinking (consider: how do you write a recursive function?): suppose we already have an embedding (random initialization, or some heuristics), set a proper objective function and optimize it, and the embeddings will converge to reasonable positions. An example: PageRank. When considering a (probabilistic) model, be sure to distinguish between: Representation (Modelling), Learning (Training), Inference (Predicting).

Neural Networks (Modelling) What is a neural network: a special kind of function, inspired by neurons and their connections. To recap, what is ML: collect data, propose a hypothesis, fit the model to the data by optimizing an objective function. Say, minimize the misclassification rate (for a classification problem), minimize intra-cluster distance while maximizing inter-cluster distance (for a clustering problem), minimize reconstruction error (representation learning), etc.

Neural Networks (Modelling)

Illustration (Photo credit: Andrew Ng):

Neural Networks (Modelling)

Validation: what if the relation between inputs and outputs is not of this form? No need to worry: a 3-layer neural network is a universal approximator; it can approximate any continuous function to any desired accuracy. Analogy: if you have some 2-dimensional data and you want to fit it, you can always use a polynomial, and any accuracy can be achieved provided the degree of the polynomial is high enough. So we can always use a neural network (of 3 or more layers) to fit any data, say the function that predicts the next word given its previous words, if it exists.

Neural Networks (Training) Set an objective function (cost function), say mean squared error, cross entropy, etc. Find a set of parameters which best explains the data, using numerical optimization methods, typically gradient descent; this requires computing the cost function and its gradient.

Neural Networks (Training) Gradient descent (illustration)
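The rule behind the illustration is the standard gradient-descent update: repeatedly step against the gradient of the cost function $J$ with a learning rate $\eta$,

$\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)$,

repeated until the cost stops decreasing.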

Neural Networks (Training) Calculate the cost function by forward propagation. z: value before activation; a: value after activation.
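A hedged Python sketch of forward propagation in this notation; the layer sizes, sigmoid activation, and variable names are illustrative assumptions, not from the talk.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # returns the network output and the (z, a) values of every layer
    a, cache = x, [(None, x)]
    for W, b in zip(weights, biases):
        z = W @ a + b          # z: value before activation
        a = sigmoid(z)         # a: value after activation
        cache.append((z, a))
    return a, cache            # cache is reused by back propagation

rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [np.zeros(4), np.zeros(2)]
output, cache = forward(np.array([0.1, -0.2, 0.3]), weights, biases)
print(output)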

Neural Networks (Training)

Calculate the gradient of the objective function Back Propagation: the chain rule

Neural Networks (Training)

Remark on Back Propagation Intuition: propagate the error backwards over the computation graph. A more formal and neat way to understand back propagation: view the network as a composition of functions, then use the chain rule.
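In the z/a notation from the forward-propagation slide, a standard way to write the chain rule out (not necessarily the exact formula shown in the talk), for an elementwise activation $\sigma$:

$\delta^{(L)} = \nabla_{a^{(L)}} J \odot \sigma'(z^{(L)}), \qquad \delta^{(l)} = \big((W^{(l+1)})^\top \delta^{(l+1)}\big) \odot \sigma'(z^{(l)})$

$\frac{\partial J}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^\top, \qquad \frac{\partial J}{\partial b^{(l)}} = \delta^{(l)}$

so the error $\delta$ is propagated backwards layer by layer, reusing the z and a values cached during forward propagation.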

Neural Networks (Training) Take-home message: Modelling: propose a hypothesis that the relation between the inputs and outputs can be approximated by a special kind of parameterized function. Training: find a set of parameters which explains the observed data best, achieved by optimization algorithms, typically gradient descent; the gradient can be computed efficiently by back propagation (the chain rule). Use random initialization (not all zeros!).

Neural Networks (Inference) Given a new example, feed it to the trained neural network, use forward propagation to compute the output, and take the output as your prediction.

Neural Networks (Application) Multi-class classification

Neural Networks (Application) What is flexible: The network architecture

Neural Networks (Application)

Neural Networks in NLP How to express a word as a vector: A unique integer as word ID: can be used for indexing, but not a semantic embedding, because an improper order is introduced. One-hot expression: use a vector of length |V| to represent a word; only one component is 1, the rest are 0. Compare: 酸, 甜, 苦, 辣 (sour, sweet, bitter, spicy — no natural order) vs. 不辣, 微辣, 中辣, 特辣 (not spicy, mildly spicy, medium spicy, extra spicy — a natural order).
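A hedged illustration of the one-hot expression with a tiny hypothetical vocabulary:

vocab = ["apple", "watermelon", "king", "queen"]   # hypothetical vocabulary V

def one_hot(word):
    # length-|V| vector with a single 1 at the word's index
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("king"))   # [0, 0, 1, 0]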

Neural Networks in NLP Distributed expression: express a word as an n-dimensional real vector (more on this later). Pros: compact (a dimensionality of several hundred works well); can encode semantic properties of words: similar words have similar vectors. Cons: hard to interpret the "meaning" of its components.

Neural Networks in NLP How to obtain word vectors? Hypothesis: words in a sentence are not chosen randomly; there is a formula that, given the first n words, gives the distribution of the next word, a.k.a. a language model (which models the probability of a given sentence). And the formula can be approximated by a neural network, a.k.a. a neural network language model (NNLM).
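For reference, a language model factorizes the probability of a sentence $w_1, \dots, w_T$ with the chain rule and, in the setting above, truncates the history to the previous $n-1$ words:

$P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1}) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \dots, w_{t-1})$

and an NNLM uses a neural network to compute the conditional distribution on the right.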

Neural Networks in NLP More on language models: suppose we have 我 爱 ("I love") … What should be the next word? P(你 "you") = …, P(北京 "Beijing") = …, P(睡觉 "sleep") = …, P(我 "I") = 3.8e-9, P(在 "at") = 1.2e-10, … Useful for machine translation, input methods, spell checking, etc.

Neural Networks in NLP

Neural network language model: use a neural network to define the distribution of the next word. The network structure is flexible (up to you).

Neural Networks in NLP

An example (Bengio et al., 2003): hidden layer: concatenate the input word embeddings.

Neural Networks in NLP

An example (Bengio et al., 2003): objective function: maximize the likelihood of the corpus.
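A hedged reconstruction of that objective: over a training corpus $w_1, \dots, w_T$ with a context of $n-1$ previous words, maximize the (regularized) average log-likelihood

$L(\theta) = \frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid w_{t-n+1}, \dots, w_{t-1}; \theta) + R(\theta)$

where $\theta$ contains both the network weights and the word embeddings, and $R(\theta)$ is an optional regularization term.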

Neural Networks in NLP Don’t be too afraid of it Keep in mind: A neural network is a special kind of function We want to find a group of parameters to optimize the objective function Now we have two kinds of parameters: network weight and word embeddings The learning algorithm: Randomly initialization Use gradient descent to update parameters Note: include both network weight and word embeddings Use back propagation to calculate gradients Pros: Automatically smoothing

Neural Networks in NLP Visualize word embeddings using t-SNE: ACL2010/embeddings-mostcommon.EMBEDDING_SIZE=50.png

Neural Networks in NLP Visualize word embeddings using t-SNE: bilingual embeddings (Socher et al., 2013)

Neural Networks in NLP A more efficient network Word2vec (Mikolov et al., 2013) Continuous bag of words (CBOW) + hierarchical softmax (HS)

Neural Networks in NLP CBOW Given context, predict current word
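A hedged sketch of the CBOW computation: average (or sum) the embeddings of the context words, then score the current word against that hidden vector. With a plain softmax output (word2vec replaces this with hierarchical softmax or negative sampling):

$h = \frac{1}{2c} \sum_{-c \le j \le c,\ j \ne 0} v_{w_{t+j}}, \qquad P(w_t \mid \text{context}) = \operatorname{softmax}(U h)_{w_t}$

where $c$ is the context window size, $v_w$ are the input word embeddings, and $U$ is the output weight matrix.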

Neural Networks in NLP

HS, an example: for the leaf node "足球" ("football"), the probability is the product of the branch probabilities along the path from the root to that leaf.
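A hedged reconstruction of the standard hierarchical-softmax probability (notation varies across write-ups): if the path from the root to the leaf of word $w$ passes through internal nodes $n_1, \dots, n_{L-1}$, then

$P(w \mid h) = \prod_{j=1}^{L-1} \sigma\!\left( [\![\, n_{j+1} \text{ is the left child of } n_j \,]\!] \cdot \theta_{n_j}^{\top} h \right)$

where $h$ is the hidden (context) vector, $\theta_{n_j}$ is the parameter vector of internal node $n_j$, $\sigma$ is the sigmoid, and $[\![\cdot]\!]$ is $+1$ for a left branch and $-1$ otherwise. One binary decision is made at each internal node, so computing $P(w \mid h)$ costs $O(\log |V|)$ instead of $O(|V|)$.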

Neural Networks in NLP HS Use a Huffman tree: intuitively, frequent words should have larger probabilities, so give them shorter paths.

Neural Networks in NLP The whole model

Neural Networks in NLP Learning procedure: SGD (details are omitted). Quite efficient: the computation cost from the hidden layer to the output layer is reduced greatly. Open source: for Linux; works on Mac with minor modifications; for Windows, requires C++11 support. See http://blog.csdn.net/heyongluoyao8/article/details/ for reference.

Neural Networks in NLP Something more appealing: Analogy property Man – Woman = King – Queen = …
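A hedged example of querying this analogy property with the gensim library; the vector file name is a placeholder, and any word2vec-format embeddings would do.

from gensim.models import KeyedVectors

# load pretrained word2vec-format vectors (file name is hypothetical)
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# king - man + woman should land near "queen" if the analogy property holds
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))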

Something more crazy Embedding words and images into the same space Socher et al. 2013

Something more crazy Embedding sentences/paragraphs/documents… Cho et al. 2014

The End Thanks for listening!