Programming with Data Lab 6


Programming with Data Lab 6 Wednesday, 28 Nov. 2018 Stelios Sotiriadis Prof. Alessandro Provetti

Optimization

General format
Instance: a collection.
Solution: (often) a choice from the collection, subject to some constraints.
Measure: a goal, i.e., a cost function to be minimized or a utility function to be maximized.
For this class of problems, a mathematical assessment should precede any coding effort: subtle changes to the specification can bring huge changes in computational cost.

Typical strategies for solving optimization problems:
Greedy
Randomized methods, e.g., gradient descent
Dynamic programming
Approximation

The Greedy principle
Make the local choice that maximizes a local (easy to check) criterion, in the hope that the solution generated this way will also maximize the global (costly to check) criterion.
Local: take as much as possible of the most precious metal, ounce by ounce, among the bars/bullion available.
Global: take the combination of bars that gives the maximum aggregate value within the weight limit W.
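To make the local rule concrete, here is a minimal sketch of greedy for the fractional setting (bars can be split). The function name, the capacity of 100, and the reading of each pair as (weight, value) are my own assumptions, not the lab's reference solution:

# Greedy fractional knapsack: take the best value-per-weight item first.
# Capacity and the (weight, value) reading of the pairs are assumed for illustration.
def greedy_fractional(items, capacity):
    items = sorted(items, key=lambda t: t[2] / t[1], reverse=True)  # by value/weight
    total_value = 0.0
    for name, weight, value in items:
        if capacity <= 0:
            break
        take = min(weight, capacity)            # take as much as possible of this metal
        total_value += value * (take / weight)
        capacity -= take
    return total_value

bars = [('Platinum', 20, 711), ('Gold', 15, 960), ('Silver', 2000, 12), ('Palladium', 130, 735)]
print(greedy_fractional(bars, 100))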

Does it always work?
Greedy does not work on KNAPSACK 0-1.
Underlying principle: greedy works when local minimization/maximization does not prevent us from later reaching the global optimum.
[Slide example: three items with value-to-weight ratios p/w = 6, 5, 4.]

Does it always work?
In the example, the choice of Item 1 excludes the actually optimal solution from consideration.
Only some sufficient conditions are known for the applicability of greedy.
Approximation and randomization are the methods of choice for KNAPSACK 0-1.
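To make the failure concrete, here is a hypothetical 0-1 instance with the ratios 6, 5, 4 from the previous slide (the exact weights and values are invented for illustration). Greedy by ratio grabs Item 1 and nothing else fits, while the optimum skips Item 1 entirely:

from itertools import combinations

# Hypothetical 0-1 knapsack: capacity 10, items as (weight, value) with ratios 6, 5, 4
items = [(6, 36), (5, 25), (5, 20)]
capacity = 10

# Greedy by value/weight: takes (6, 36), then nothing else fits -> value 36
left, greedy_value = capacity, 0
for w, v in sorted(items, key=lambda t: t[1] / t[0], reverse=True):
    if w <= left:
        greedy_value += v
        left -= w
print(greedy_value)   # 36

# Exhaustive search over all subsets: the optimum takes items 2 and 3 -> value 45
best = max(sum(v for _, v in combo)
           for r in range(len(items) + 1)
           for combo in combinations(items, r)
           if sum(w for w, _ in combo) <= capacity)
print(best)           # 45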

A look at the solution: Class5-knapsack-list of pairs
elements = ['Platinum', 'Gold', 'Silver', 'Palladium']
instance = [[20, 711], [15, 960], [2000, 12], [130, 735]]
Problem: sorting instance breaks the positional connection between 'Platinum' and [20, 711].
Possible solution: Python's powerful zip operation.
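A minimal sketch of the zip idea (the variable names and the choice of sort key are mine): pair each name with its numbers before sorting, so the two lists stay aligned.

elements = ['Platinum', 'Gold', 'Silver', 'Palladium']
instance = [[20, 711], [15, 960], [2000, 12], [130, 735]]

paired = list(zip(elements, instance))                   # [('Platinum', [20, 711]), ...]
paired.sort(key=lambda item: item[1][1], reverse=True)   # e.g. sort by the second number
for name, data in paired:
    print(name, data)

names, numbers = zip(*paired)   # zip(*...) splits back into two aligned tuples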

Gradient Descent

Glance at machine learning…
In linear algebra we have the equation and compute the data: y = 2x + 3, so x = [1, 2, 3, 4, …] gives y = [5, 7, 9, 11, …].
In machine learning we have the data and look for the equation: departments = [1, 2, 3, 4, …], sales = [5, 7, 9, 11, …].
We are looking for the best-fit line y = mx + b, e.g., y = 2x + 3.
[Slide figure: scatter plot of sales vs. departments with a fitted line.]

Which line is the best-fit line?
Draw a (random) line and calculate the error between each point and the line.
Mean squared error (mse): mse = (1/n) * Σ_{i=1}^{n} (e_i)^2 = (1/n) * Σ_{i=1}^{n} (y_i − y'_i)^2, where y'_i is the line's prediction and e_i = y_i − y'_i.
This is our cost function!
[Slide figure: data points with vertical errors e_1 … e_4 between each point and the line; axes: sales vs. years.]
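This cost function is a one-liner in Python. A minimal sketch, assuming plain lists of equal length:

# Mean squared error between the observed y values and the line's predictions
def mse(y, y_pred):
    return sum((yi - ypi) ** 2 for yi, ypi in zip(y, y_pred)) / len(y)

x = [1, 2, 3, 4]
y = [5, 7, 9, 11]
print(mse(y, [2 * xi + 3 for xi in x]))   # 0.0: the line y = 2x + 3 fits this data exactly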

What is gradient descent? A method to optimize a function; in our example, to minimize the error (mse) and find the best-fit line!

Another example
f(x) = x^2 − 2x + 2 = (x − 1)^2 + 1
When x = 1, f(1) = 1: this is our minimum!
I know from calculus: minimize where the derivative of f(x) equals 0.
dy/dx = 2x − 2 = 0, so x = 1. Minimum!

With gradient descent
Step 1: Take a random point, e.g., x = 3.
Step 2: Take the derivative of (x − 1)^2 + 1 at this point: dy/dx = 2(3) − 2 = 4. The value 4 is positive, so the function gets larger as x grows. If instead we were at x = −1, then dy/dx = 2(−1) − 2 = −4, so the function gets smaller as x grows.
Step 3: Next guess (continuing the x = 3 example): x_{i+1} = x_i − a * dy/dx(x_i), where a is a small step, e.g., a = 0.2. So x_1 = x_0 − a * dy/dx(x_0) = 3 − 0.2 * 4 = 2.2, then dy/dx(x_1) = 2(2.2) − 2 = 2.4 and x_2 = 2.2 − 0.2 * 2.4 = 1.72. We moved closer to the minimum at x = 1!
Step 4: Repeat, again and again… We need software to calculate this!
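These four steps fit in a few lines of Python. A minimal sketch of my own (not the lab's Class6 script), for f(x) = (x − 1)^2 + 1 with step size a = 0.2:

# Gradient descent on f(x) = (x - 1)^2 + 1, whose derivative is 2x - 2
def derivative(x):
    return 2 * x - 2

a = 0.2       # step size
x = 3.0       # Step 1: starting point (fixed here for reproducibility)
for i in range(25):
    x = x - a * derivative(x)   # Steps 2-3: move against the slope
    print(i, round(x, 4))       # Step 4: repeat; 2.2, 1.72, ... approaching the minimum at 1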

Gradient descent for the best-fit example
You take small steps to minimize the error. With an arbitrary step such as 0.5 the step can be too big: we jump past the minimum and lose it!
[Slide figure: mse plotted against b, with an update overshooting the minimum.]
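To see the overshooting on the simpler f(x) = (x − 1)^2 + 1 from before (the slide's picture shows the mse curve instead), a step of a = 1.1 is already too big for that function: the iterates jump over x = 1 and drift further away. This is my own toy demo, not the lab's code:

a = 1.1       # too-big step for f(x) = (x - 1)^2 + 1
x = 3.0
for i in range(6):
    x = x - a * (2 * x - 2)
    print(i, round(x, 4))   # -1.4, 3.88, -2.456, 5.1472, ... oscillating away from 1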

Gradient descent
You take small steps to minimize the error, and the steps shrink as we go.
We need to find the slopes, which means we need derivatives (see the derivatives introduction).
[Slide figure: mse plotted against b, with progressively smaller steps towards the minimum.]

Gradient descent
You take small steps to minimize the error, and the steps shrink as we go. We need to find the slopes, so we calculate partial derivatives.
mse = (1/n) * Σ_{i=1}^{n} (y_i − ypred_i)^2, where ypred_i = m * x_i + b
mse = (1/n) * Σ_{i=1}^{n} (y_i − (m * x_i + b))^2
∂mse/∂m = (2/n) * Σ_{i=1}^{n} −x_i * (y_i − (m * x_i + b))
∂mse/∂b = (2/n) * Σ_{i=1}^{n} −(y_i − (m * x_i + b))
Update rule (shown for b on the slide): b2 = b1 − learning_rate * b1', where b1' is the slope of mse at b1; the same rule is used for m.
[Slide figure: mse plotted against b, with the step from b1 to b2.]
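Putting the two partial derivatives into a loop gives a small gradient-descent fit for m and b. This is a sketch of the idea rather than the contents of Class6-grad_descent(ax+b).py; the toy data, learning rate, and iteration count are my own choices, and the result should approach m = 2, b = 3:

# Fit y = m*x + b by gradient descent on the mse cost
x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 11, 13]            # generated by y = 2x + 3

m, b = 0.0, 0.0
learning_rate = 0.05
n = len(x)
for step in range(2000):
    pred = [m * xi + b for xi in x]
    dm = (2 / n) * sum(-xi * (yi - pi) for xi, yi, pi in zip(x, y, pred))
    db = (2 / n) * sum(-(yi - pi) for yi, pi in zip(y, pred))
    m -= learning_rate * dm      # same update rule as b2 = b1 - learning_rate * b1'
    b -= learning_rate * db
print(round(m, 3), round(b, 3))  # close to 2.0 and 3.0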

Gradient descent in Python! Let's see: Class6-grad_descent(ax+b).py and Class6-gradient_descent_visualize.py

Study Chapter 8!
In Chapter 8 of his book, Grus introduces minimization and gradient descent. The intended application is error minimization. But let's see the details…
The book chapter is online: https://www.dcs.bbk.ac.uk/~stelios/pwd2018/code/Class6-grus-ch8-gradient_descent.pdf

Understanding *args and **kwargs

# *args collects a variable number of positional arguments
def myFun(*args):
    for arg in args:
        print(arg)

myFun('Hi!', 'I', 'pass', 'many', 'args!')
# Output: Hi! I pass many args!

# **kwargs collects a variable number of keyword arguments
def myFun(**kwargs):
    for key, value in kwargs.items():
        print("%s == %s" % (key, value))

myFun(first='Key', mid='value', last='pair')
# Output:
# first == Key
# mid == value
# last == pair

Functionals
Negate a function:

def negate(f):
    """return a function that for any input x returns -f(x)"""
    return lambda *args, **kwargs: -f(*args, **kwargs)

Example:

def myincrementor(n):
    return n + 1

g = negate(myincrementor)   # g is a new function
print(g(6))                 # prints -7

List comprehensions and mappings

unit_prices = [711, 960, 12, 735]
print([int(price * 1.10) for price in unit_prices])

OR:

def myinflator(n):
    return int(n * 1.10)

new_unit_prices = map(myinflator, unit_prices)
print([i for i in new_unit_prices])

Both print the same!

Objectives of Chapter 8
Grus forgot his Maths and now would like to find the minimum value of the function x^2 for values around 0: argmin_{x ∈ [−1, 1]} f(x)

def square(x):
    return x * x

def derivative(x):
    return 2 * x

Let's see how it works using Class6-gradient_descent_visualize.py.

Further reading: learn gradient descent
Try gradient_descent.py, with its companion linear algebra module, on functions of your choice.
Try f(x) = x^3 + 3x^2 − 2x + 1 over [−4, 2].
Hint: the derivative is 3x^2 + 6x − 2.
Hint: over this interval the global minimum is at x = −4.
Code: Class6-gradient_descent-Gruss_code.py
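A hedged sketch of what that experiment could look like (my own loop and parameters, not Grus's gradient_descent.py). Started inside the interval, plain gradient descent either settles at the interior local minimum near x ≈ 0.29 or slides off below −4, so the iterates are clamped to [−4, 2] here; that is why the boundary point x = −4 ends up as the global minimum:

# Gradient descent on f(x) = x^3 + 3x^2 - 2x + 1, restricted to [-4, 2]
def f(x):
    return x ** 3 + 3 * x ** 2 - 2 * x + 1

def df(x):
    return 3 * x ** 2 + 6 * x - 2

x = -3.0        # try different starting points in [-4, 2]
a = 0.01
for _ in range(500):
    x = max(-4.0, min(2.0, x - a * df(x)))   # gradient step, clamped to the interval
print(x, f(x))  # from -3.0 this ends at the boundary: x = -4, f(-4) = -7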

More on gradient descent…
Batch gradient descent: use all the data; in Class6-grad_descent(ax+b).py we have 5 points. But what happens if we have 1000 or 1 billion points? The algorithm becomes very slow!
Mini-batch gradient descent: instead of going over all examples, sum over a smaller number of examples given by the batch size.
Stochastic gradient descent: shuffle the training data and use a single randomly picked training example per update.
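A minimal sketch of the stochastic variant (my own code, not the lab's): each update looks at one randomly picked point, so every step is cheap but noisy.

import random

# Stochastic gradient descent for y = m*x + b: one random training example per update
x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 11, 13]

m, b = 0.0, 0.0
learning_rate = 0.01
for step in range(20000):
    i = random.randrange(len(x))            # pick a single random example
    error = y[i] - (m * x[i] + b)
    m += learning_rate * 2 * x[i] * error   # gradient of (y_i - (m*x_i + b))^2 w.r.t. m
    b += learning_rate * 2 * error
print(round(m, 2), round(b, 2))             # noisy, but close to 2 and 3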

When comparing:
Batch gradient descent: much slower, more accurate.
Stochastic gradient descent: much faster, slightly off (noisy updates).

Resources: try an online function visualization tool.