CS 179 Lecture 15 Set 5 & machine learning.

Set 5 goals Practice overlapping computation with data movement on a streaming workload… while building a machine learning system

Set 5 description Cluster a stream of business reviews from Yelp.

Representing text as vectors Yelp reviews are text. How can we quantify the similarity between two snippets of text? “The doctor has horrible bedside manner” “The enchiladas were the best I’ve ever had!” One solution is to represent the snippets as numerical vectors.

Bag of words approach A very simple but very common approach: count occurrences of each word. Loses all information on the ordering of words.

document-term matrix:

                  cats  dogs  hats  love  eat
  cats love hats    1     0     1     1    0
  cats eat hats     1     0     1     0    1
  cat eat cats      2     0     0     0    1   (counting "cat" as "cats")

Want to normalize words when computing the document-term matrix: cats = cat, runs = run = running. This is called stemming. Also do things like lowercasing. For the Yelp dataset there were about 350,000 total terms (we didn't do stemming). Threw out all words appearing fewer than 10 times and got down to 65,000 terms. 1.5 million documents, stored as a sparse matrix.
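The counting step above can be sketched in a few lines of Python. This is a minimal illustration (function name and `min_count` parameter are mine, and it does no stemming, just lowercasing):

```python
from collections import Counter

def document_term_matrix(docs, min_count=1):
    """Build a bag-of-words document-term matrix.

    Each document is lowercased and split on whitespace; word order
    is discarded, only counts are kept.
    """
    counts = [Counter(doc.lower().split()) for doc in docs]
    # Vocabulary: every term appearing at least min_count times overall
    # (the Yelp preprocessing used a threshold of 10).
    total = Counter()
    for c in counts:
        total.update(c)
    vocab = sorted(t for t, n in total.items() if n >= min_count)
    # One row per document, one column per term.
    return vocab, [[c[t] for t in vocab] for c in counts]

vocab, dtm = document_term_matrix(["cats love hats", "cats eat hats"])
# vocab -> ['cats', 'eat', 'hats', 'love']
# dtm   -> [[1, 0, 1, 1], [1, 1, 1, 0]]
```

At Yelp scale (1.5M documents, 65,000 terms) the rows are mostly zeros, which is why the real matrix is stored in a sparse format.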

Latent Semantic Analysis Idea: compute the singular value decomposition (SVD) of the document-term matrix. The top singular vectors group words that commonly appear together, which often corresponds to our notion of a topic. This is called latent semantic analysis. Represent each document by its coefficients on the top-k singular vectors. For set 5, we did an SVD on the term-document matrix to represent each review as a 50-component vector.
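The projection can be sketched with a dense SVD in NumPy (a toy illustration only; the helper name is mine, and at 1.5M-document scale you would use a sparse truncated SVD, e.g. scipy.sparse.linalg.svds, rather than a dense factorization):

```python
import numpy as np

def lsa(dtm, k):
    """Project each document onto its top-k LSA components.

    dtm: (documents x terms) matrix.  Returns one k-dimensional
    vector per document: its coefficients on the top-k singular
    vectors, scaled by the singular values.
    """
    U, s, Vt = np.linalg.svd(dtm, full_matrices=False)
    return U[:, :k] * s[:k]

dtm = np.array([[1., 0., 1., 1.],
                [1., 1., 1., 0.]])
docs_k = lsa(dtm, k=2)   # one 2-component vector per document
```

For set 5 the same idea was applied with k = 50, so each Yelp review becomes a 50-component vector.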

Clustering Group data points based on some sort of similarity. Clustering is a very common unsupervised learning problem. Many variants of clustering:
hard/soft: does each point belong to one cluster, or to multiple clusters to different degrees (confidences)?
hierarchical: clusters are part of other clusters; top-down (divisive) and bottom-up (agglomerative) algorithms exist
centroid: represent each cluster as a single point
distribution: a cluster is a probability distribution (e.g. Gaussian mixture models)
density: a cluster is an area with a higher density of data points next to areas of lower density
Can think of LSA as a form of soft clustering (component values indicate how much a point belongs to each cluster).

k-means clustering Very popular centroid-based hard clustering algorithm.

def k_means(data, k):
    randomly initialize k "cluster centers"
    while not converged:
        for each data point:
            assign data point to closest cluster center
        for each cluster center:
            move cluster center to average of points in cluster
    return cluster centers

Hard parts: choice of k, random initialization. It is NP-hard to find the optimal solution (minimizing distance to cluster centers); this just finds local optima and can give bad outputs. How could we parallelize this?
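One iteration of the loop above can be written in NumPy; the assignment step is independent per point, which is exactly what a GPU would parallelize. A sketch (function name is mine):

```python
import numpy as np

def kmeans_step(points, centers):
    """One k-means iteration: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    # (n, k) matrix of squared distances -- each row is independent,
    # so the assignment is embarrassingly parallel over points.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    new_centers = centers.copy()
    for j in range(len(centers)):
        members = points[assign == j]
        if len(members):           # keep the old center if cluster is empty
            new_centers[j] = members.mean(axis=0)
    return new_centers, assign

pts = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
ctrs = np.array([[0., 0.], [10., 10.]])
ctrs, assign = kmeans_step(pts, ctrs)
# assign -> [0, 0, 1, 1]; centers move to [0, 0.5] and [10, 10.5]
```

The per-cluster averaging is a segmented reduction, which also parallelizes well.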

Stream clustering Can we cluster if we can only see each data point once?

def sloppy_k_means(int k):
    randomly initialize k "cluster centers"
    for each data point:
        assign data point to closest cluster center
        update closest cluster center with respect to data point
    return cluster centers

This algorithm will give poor clusterings.

Cheating at streaming algorithms Idea: use a streaming algorithm to create a smaller dataset (a "sketch") with properties similar to the stream, then run the expensive algorithm on the sketch:

k_means(sloppy_k_means(stream, 20 * k), k)

This is the strategy used by Apache Mahout. If the length n of the stream is known (and finite), use k*log(n) clusters for the sketch.

Parallelization What can we parallelize here?

def sloppy_k_means(int k):
    randomly initialize k "cluster centers"
    for each data point:
        assign data point to closest cluster center
        update closest cluster center with respect to data point
    return cluster centers

Batching for parallelization

def sloppy_k_means(int k):
    randomly initialize k "cluster centers"
    for each batch of data points:
        par-for point in batch:
            assign data point to closest cluster center
        update cluster centers with respect to batch
    return cluster centers

Batching Changing the computation slightly to process a batch of data at a time is a common theme in parallel algorithms. Although batching does change the computation slightly, it still converges to some local minimum of the loss function. When you already aren't going to find an optimal solution, cutting corners isn't that bad :)
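One possible batched update is a mini-batch step in the style above: assign the whole batch in parallel, then nudge each center toward the batch points assigned to it. This is a sketch, not the required set 5 implementation; the function name, `counts` state, and decaying step size are my illustrative choices:

```python
import numpy as np

def sloppy_kmeans_batch(centers, counts, batch):
    """One mini-batch streaming k-means update.

    centers: (k, d) current centers; counts: points seen per center.
    Each touched center moves toward the mean of its batch members,
    with a step size that shrinks as the center accumulates data.
    """
    # Assignment is independent per point -- the parallel part.
    d2 = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    for j in np.unique(assign):
        members = batch[assign == j]
        counts[j] += len(members)
        eta = len(members) / counts[j]   # decaying per-center learning rate
        centers[j] += eta * (members.mean(axis=0) - centers[j])
    return centers, counts

centers = np.array([[0.]])
counts = np.zeros(1)
centers, counts = sloppy_kmeans_batch(centers, counts, np.array([[2.], [4.]]))
# the single center moves to the batch mean, 3.0
```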

Data transfer issues Your program for set 5 will read LSA representations of reviews from stdin and will sloppy-cluster them. Your tasks:
overlap data transfer and computation between host and device (hopefully saturating the interface on haru)
implement the sloppy clustering algorithm
analyze the latency and throughput of the system

Final comments on set Set 5 should be out Tuesday evening. Details are still getting worked out. Might involve using multiple GPUs. Will be relatively open-ended.

General machine learning on GPU Already seen k-means clustering… Many machine learning numerical algorithms rely on just a few common computational building blocks:
element-wise operations on vectors/matrices/tensors
reductions along axes of vectors/matrices/tensors
matrix multiplication
solving linear systems
convolution (FFT)

Computational building blocks These building blocks are why scientific scripting (MATLAB or NumPy) is so successful: have super fast implementations of the building blocks, and then use a convenient language to tie the blocks together. Often you want to use the GPU through a framework rather than writing your own CUDA code. (image from Todd Lehman)

When to NOT write your own CUDA Heuristic: if you could write it efficiently in "clean" MATLAB, you could likely accelerate it solely through libraries, either from NVIDIA (cuBLAS, cuSPARSE, cuFFT) or from the community (Theano, Torch). Better to write less CUDA and call into libraries a lot. Scientists generally want code to enable their research, not code for its own sake: the less time they spend programming, the better, and writing C/CUDA is more time consuming than writing Python. Think of CUDA as a way to make libraries rather than applications. Less CUDA code is often better. To get truly maximum performance, you might need to move larger and larger components to CUDA; abstractions aren't perfect.

When to write your own CUDA Vectorized scripting languages must use a lot of memory when considering all combinations of inputs. Example: what is the maximum distance between pairs of n points? Most vectorized MATLAB implementations will store all n² distances and then compute the maximum over these: quadratic memory and time. An implementation in a language with loops (that you actually want to use) takes linear memory and quadratic time. k-means clustering is an example of this.
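The trade-off can be shown in Python (function name is mine): looping over one point at a time keeps only one row of the distance matrix alive, so memory stays O(n) even though time is still O(n²):

```python
import numpy as np

def max_pairwise_dist_linear_memory(points):
    """Maximum distance between any pair of n points.

    Quadratic time like the fully vectorized version, but only O(n)
    memory: one row of the n x n distance matrix exists at a time
    instead of all n^2 entries at once.
    """
    best = 0.0
    for i in range(len(points)):
        # Distances from point i to all later points -- an O(n) temporary.
        d = np.linalg.norm(points[i + 1:] - points[i], axis=1)
        if len(d):
            best = max(best, d.max())
    return best

pts = np.array([[0., 0.], [3., 4.], [1., 1.]])
max_pairwise_dist_linear_memory(pts)   # -> 5.0
```

A fully vectorized version would materialize the entire (n, n, d) difference tensor, which is exactly the memory blow-up the slide warns about.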

What this week will cover Today: Theano (Python library that makes GPU computing easy) Rest of week: parallelizing machine learning yourself, how to write your own CUDA code

Theano Python library designed primarily for neural nets, but useful for any sort of machine learning that involves training by gradient descent. Key idea: Express computation as a directed graph through a Numpy-like interface.

A Theano computation graph (upper part computes exp(w*x - b))

Benefits of computational graph Due to abstraction, can execute graph on any hardware for which all graph nodes are implemented. Multi-threaded C, GPU, etc. Can also optimize graph itself (and correct for numerical instability). Can automatically compute gradient of any graph node output with respect to any other graph node (automatic differentiation) Computational graphs are the future in my opinion. Need to generate code at runtime to do as well as possible.

Simple Theano example

import theano
import theano.tensor as T

x = T.matrix()
output = 1 / (1 + T.exp(-x))
logistic = theano.function([x], output)
logistic([[0.0, 1.0], [3.14, -1.0]])

Conclusion Set 5 will hopefully be fun (especially since set 6 will be based on the same code). When working on an application, don't write CUDA unless the complexity is necessary. Theano is a useful Python library for doing scientific scripting on the GPU.