CS 179 Project Ideas.

Slides:



Advertisements
Similar presentations
Fast Fourier Transform for speeding up the multiplication of polynomials an Algorithm Visualization Alexandru Cioaca.
Advertisements

Yafeng Yin, Lei Zhou, Hong Man 07/21/2010
Advanced Topics in Algorithms and Data Structures Lecture 7.2, page 1 Merging two upper hulls Suppose, UH ( S 2 ) has s points given in an array according.
Junction Trees And Belief Propagation. Junction Trees: Motivation What if we want to compute all marginals, not just one? Doing variable elimination for.
Least Squares example There are 3 mountains u,y,z that from one site have been measured as 2474 ft., 3882 ft., and 4834 ft.. But from u, y looks 1422 ft.
Including Uncertainty Models for Surrogate-based Global Optimization
Computational Methods for Management and Economics Carla Gomes
MAE 552 – Heuristic Optimization Lecture 26 April 1, 2002 Topic:Branch and Bound.
reconstruction process, RANSAC, primitive shapes, alpha-shapes
L12: Sparse Linear Algebra on GPUs CS6963. Administrative Issues Next assignment, triangular solve – Due 5PM, Monday, March 8 – handin cs6963 lab 3 ”
Lecture 17 Today: Start Chapter 9 Next day: More of Chapter 9.
11/26/02CSE FFT,etc CSE Algorithms Polynomial Representations, Fourier Transfer, and other goodies. (Chapters 28-30)
Space-Filling DOEs Design of experiments (DOE) for noisy data tend to place points on the boundary of the domain. When the error in the surrogate is due.
JPEG C OMPRESSION A LGORITHM I N CUDA Group Members: Pranit Patel Manisha Tatikonda Jeff Wong Jarek Marczewski Date: April 14, 2009.
Exercise problems for students taking the Programming Parallel Computers course. Janusz Kowalik Piotr Arlukowicz Tadeusz Puzniakowski Informatics Institute.
 Optimal Packing of High- Precision Rectangles By Eric Huang & Richard E. Korf 25 th AAAI Conference, 2011 Florida Institute of Technology CSE 5694 Robotics.
Distributed Constraint Optimization Michal Jakob Agent Technology Center, Dept. of Computer Science and Engineering, FEE, Czech Technical University A4M33MAS.
Identifying Reversible Functions From an ROBDD Adam MacDonald.
CS 6068 Parallel Computing Fall 2013 Lecture 10 – Nov 18 The Parallel FFT Prof. Fred Office Hours: MWF.
CS6963 L15: Design Review and CUBLAS Paper Discussion.
12/4/2001CS 638, Fall 2001 Today Managing large numbers of objects Some special cases.
Scientific Communication
MATH 685/ CSI 700/ OR 682 Lecture Notes Lecture 4. Least squares.
CSE 589 Part VI. Reading Skiena, Sections 5.5 and 6.8 CLR, chapter 37.
CAS 721 Course Project Implementing Branch and Bound, and Tabu search for combinatorial computing problem By Ho Fai Ko ( )
Methods for Automatic Evaluation of Sentence Extract Summaries * G.Ravindra +, N.Balakrishnan +, K.R.Ramakrishnan * Supercomputer Education & Research.
Week 8 - Wednesday.  What did we talk about last time?  Level order traversal  BST delete  2-3 trees.
NP-COMPLETE PROBLEMS. Admin  Two more assignments…  No office hours on tomorrow.
INCLUDING UNCERTAINTY MODELS FOR SURROGATE BASED GLOBAL DESIGN OPTIMIZATION The EGO algorithm STRUCTURAL AND MULTIDISCIPLINARY OPTIMIZATION GROUP Thanks.
Barnes Hut N-body Simulation Martin Burtscher Fall 2009.
Week 9 - Monday.  What did we talk about last time?  Practiced with red-black trees  AVL trees  Balanced add.
Algorithms For Solving History Sensitive Cascade in Diffusion Networks Research Proposal Georgi Smilyanov, Maksim Tsikhanovich Advisor Dr Yu Zhang Trinity.
INTRO TO OPTIMIZATION MATH-415 Numerical Analysis 1.
1a.1 Parallel Computing and Parallel Computers ITCS 4/5145 Cluster Computing, UNC-Charlotte, B. Wilkinson, 2006.
Barnes Hut N-body Simulation Martin Burtscher Fall 2009.
Introduction to Algorithm Complexity Bit Sum Problem.
1 ”MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs” John A. Stratton, Sam S. Stone and Wen-mei W. Hwu Presentation for class TDT24,
Lecture 10 CUDA Instructions
CMPT 438 Algorithms.
Generalized and Hybrid Fast-ICA Implementation using GPU
Analysis of Sparse Convolutional Neural Networks
CUDA Interoperability with Graphical Environments
CPS216: Data-intensive Computing Systems
Multiplicative updates for L1-regularized regression
GPU Computing CIS-543 Lecture 10: CUDA Libraries
Understanding Communication with a Robot? Activity (60 minutes)
CALCULATION OF DECIMAL LOGORATHIM
Bisection and Twisted SVD on GPU
CS 179 Lecture 15 Set 5 & machine learning.
Lecture 5: GPU Compute Architecture
L21: Putting it together: Tree Search (Ch. 6)
Neural Networks and Backpropagation
An Introduction to Support Vector Machines
Introduction to cuBLAS
Lecture 5: GPU Compute Architecture for the last time
CS 179 Project Intro.
Academic Communication Lesson 3
Quadtrees 1.
CS 179: GPU Programming Lecture 19: Projects 1
Logistic Regression & Parallel SGD
CS 201 Fundamental Structures of Computer Science
Singular Value Decomposition
Augmented Data Structures and the Test
Parallelization of Sparse Coding & Dictionary Learning
~ Least Squares example
~ Least Squares example
Understanding the TigerSHARC ALU pipeline
Major Design Strategies
Parallel Programming in C with MPI and OpenMP
Derivatives and Gradients
Presentation transcript:

CS 179 Project Ideas

Proposal Guidelines 1-3 sentence summary of project team members (pair or solo) Hoping to do 3 week or 5 week? 1-3 paragraph explanation of project with background Why is this challenging? Has it been done before? What tricky things are you going to have to figure out? 1-2 paragraphs What are the deliverables? Goals? 1 paragraph Week by week timeline: What are you going to do each week?

Available CUDA libraries cuBLAS: dense linear algebra cuSPARSE: sparse linear algebra cuRAND: random numbers, good for MCMC simulations cuFFT: Fast Fourier Transform cuSOLVER: dense and sparse factorization and system solvers cuDNN: common operations for deep neural nets Might be useful for project planning to to check out what they provide!

16 bit matrix multiplication Some applications (such as deep neural nets) don’t need 32 bits of precision. 16 bit advantages: speed up matrix multiply due to less IO, fit twice as much data in GPU memory If you create a fast implementation, there’s a good chance the deep learning community will use it a lot! Talk to Eric for

Cryptocurrency Find some cryptocurrency with an proof of work algorithm that hasn’t been optimized to death, and then optimize it to death An interesting read on the topic

Randomized Matrix Factorizations Can quickly approximate SVD, QR decomposition, etc using randomized algorithms! Good project for someone who has taken ACM 106a Method for least squares solutions, ultra-high dim’l spaces that represent highly constrained systems Can compare performance to cuSOLVER, for instance, for different size and different types of problems Good survey paper on the subject (PDF)

Branch and Bound Systems Make global B&B solution-finding environment. An N-dimensional box-like “volume” is tested for a criterion. If the box passes the test, it is put into a list of “solution boxes.” If the box does not pass, it is subdivided into children boxes which are then tested. Method finds all such boxes. The criterion/test will be run once per box, suitable to run on GPU Can include octree or K-D tree algorithms for representing surfaces and solids, for instance, http://www.nvidia.com/docs/IO/88889/laine2010i3d_paper.pdf or N-body gravitational systems that subdivide space with one body per box http://www.cs.nyu.edu/courses/spring12/CSCI-GA.3033-012/nbody-problem.pdf

Interval Analysis “Corner Form” A method for global root finding -- an interval-based B&B test that guarantees that a “box” does not contain a root of f(x,y,z,w) = 0 where f() is a polynomial in x, y, z, w (or more variables). Intervals of f, given input intervals, are computed by “inclusion function” The Corner Taylor Form (inclusion function) is more accurate than the Midpoint Taylor Form for large input regions, eliminating many boxes early in the process. See http://thesis.library.caltech.edu/view/author/Gavriliu-M.html

Finding global roots of cos2x sin3y + sin3x cos2y - cos2x cos3y + sin3xsin2y = 0 (First turn function into polynomial, with error term). Sol’ns in yellow. On left, “Natural Extension” inclusion function: too many potential solutions. In the middle, “Centered Form” is much improved: On right, “Corner Form” is even better. Note upper right corner. Large region excluded quickly, without need for further subdivision.

Hash Table (and/or malloc) Implement a concurrent hash table that lives in global or shared memory. Implement malloc for global or shared memory. These will be tricky parallel programming problems!

Build an assembler Reverse engineer Fermi or Kepler binaries and build an assembler. Already done for Maxwell: https://github.com/NervanaSystems/maxas

“Speed dating” Talk to another person about your ideas for 3 minutes. Will cycle several times. Goals: hear lots of ideas, connect people with similar ideas, offer suggestions to each other ideas