Recitation for BigData
Jay Gu
Jan 10
HW1 preview and Java Review
Outline
– HW1 preview
– Review of Java basics
– An example of gradient descent for linear regression in Java
HW1 Preview
On data of size ~1 million.
– Warm-up exercise
– Stochastic Gradient Descent (SGD) for Logistic Regression
– SGD with Hashing Kernel
– Extra credit: Personalized Logistic Regression
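The "SGD with Hashing Kernel" item can be illustrated with a minimal sketch. This is not the homework solution: the class name HashedLogisticRegression, the string feature names, the bucket count and the step size are all assumptions made for illustration. Each feature name is hashed into a fixed-size weight array, so memory stays bounded regardless of how many distinct tokens appear.

```java
// Sketch only: names, feature format and hyperparameters are illustrative assumptions.
class HashedLogisticRegression {
    final double[] weights; // one weight per hash bucket
    final double stepSize;

    HashedLogisticRegression(int numBuckets, double stepSize) {
        this.weights = new double[numBuckets];
        this.stepSize = stepSize;
    }

    // Map an arbitrary feature name to a bucket index in [0, numBuckets).
    int bucket(String feature) {
        return Math.floorMod(feature.hashCode(), weights.length);
    }

    // Predicted probability of a click for an instance with binary features.
    double predict(String[] features) {
        double score = 0.0;
        for (String f : features) {
            score += weights[bucket(f)];
        }
        return 1.0 / (1.0 + Math.exp(-score));
    }

    // One SGD step on a single example; label is 0 or 1.
    void update(String[] features, int label) {
        double error = label - predict(features);
        for (String f : features) {
            weights[bucket(f)] += stepSize * error;
        }
    }

    public static void main(String[] args) {
        HashedLogisticRegression model = new HashedLogisticRegression(1 << 10, 0.1);
        String[] clicked = {"query:123", "title:7"};
        String[] skipped = {"query:456", "title:9"};
        for (int i = 0; i < 200; i++) {
            model.update(clicked, 1);
            model.update(skipped, 0);
        }
        System.out.println(model.predict(clicked));
        System.out.println(model.predict(skipped));
    }
}
```

Two features that hash to the same bucket share one weight. That collision is the price paid for a weight vector whose size is fixed in advance.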
Starter Code
– Class for parsing the input file and iterating over the dataset.
    Dataset dataset = new Dataset(your_path, is_training, size);
    while (dataset.hasNext()) {
        DataInstance d = dataset.next();
        … some action on d …
    }
Starter Code
    public class DataInstance {
        int clicks;      // number of clicks, -1 if it is testing data.
        int impressions; // number of impressions, -1 if it is testing data.
        // Feature of the session
        int depth;       // depth of the session.
        int[] query;     // List of token ids in the query field
        // Feature of the ad
        ….
        // Feature of the user
        ….
    }
Starter Code
    public class Weights {
        double w0;
        /*
         * query.get("123") will return the weight for the feature:
         * "token 123 in the query field".
         */
        Map query;
        Map title;
        Map keyword;
        Map description;
        double wPosition;
        double wDepth;
        double wAge;
        double wGender;
    }
BigData is often sparse
Be as lazy as you can …
Update only when necessary …
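Avoid O(d): Sparse and lazy update
Although the dimension d of the feature space is huge, each data point contains only a few tokens.
– Update only the weights of the tokens that appear in the current data point.
Even so, regularization nominally applies to all d weights at every step.
– Delay the regularization of each weight and apply it in one batch the next time that weight is touched.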
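One way to delay and batch the regularization is sketched below. Everything here is an assumption for illustration, not the starter code: the class LazyWeights, the L2 penalty (which shrinks a weight by the factor 1 - stepSize * lambda per step), and the convention that the caller supplies the loss gradient. Each weight remembers the step at which it was last brought up to date, and the shrinkage it missed is applied in one multiplication when its token next appears.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: the class, its fields and the L2 penalty are illustrative assumptions.
class LazyWeights {
    final Map<Integer, Double> weight = new HashMap<>();       // token id -> weight
    final Map<Integer, Integer> lastUpdated = new HashMap<>(); // token id -> steps already applied
    final double stepSize;
    final double lambda;
    int step = 0; // number of SGD steps taken so far

    LazyWeights(double stepSize, double lambda) {
        this.stepSize = stepSize;
        this.lambda = lambda;
    }

    // Return the up-to-date weight, applying in one batch the shrinkage
    // that was skipped while this token was absent.
    double current(int token) {
        double w = weight.getOrDefault(token, 0.0);
        int missed = step - lastUpdated.getOrDefault(token, step);
        w *= Math.pow(1.0 - stepSize * lambda, missed);
        weight.put(token, w);
        lastUpdated.put(token, step);
        return w;
    }

    // One SGD step that touches only the tokens of the current example.
    // Assumes each token id appears at most once in tokens; gradient is the
    // loss gradient with respect to each present token's weight.
    void update(int[] tokens, double gradient) {
        for (int token : tokens) {
            double w = current(token);
            w = w * (1.0 - stepSize * lambda) - stepSize * gradient;
            weight.put(token, w);
            lastUpdated.put(token, step + 1);
        }
        step++;
    }

    public static void main(String[] args) {
        LazyWeights model = new LazyWeights(0.1, 0.5);
        model.update(new int[] {1}, -1.0); // token 1 is seen once
        model.update(new int[] {2}, -1.0); // then only token 2 appears
        model.update(new int[] {2}, -1.0);
        System.out.println(model.current(1)); // two missed shrink steps applied here
        System.out.println(model.current(2));
    }
}
```

The cost of a step is proportional to the number of tokens in the example rather than to d. Reading weights through current() before predicting keeps the result identical to the eager update that shrinks every weight at every step.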
Java Review
Not required but good to know: Interface, Inheritance, Access Modifier, I/O, …
Language: Class, Object, variable, method
Data Structure: Java Collections
– Array
– List: ArrayList
– Map: HashMap
Class
    public class DataInstance {
        // Members or fields
        // Feature of the session
        int[] query; ….
        // Feature of the ad
        int[] title; …

        // Constructor
        DataInstance(String line, … ) {
            // parse the line, and set the fields
        }

        // Method
        public void print() {
            System.out.println("title: ");
            for (int token : title)
                System.out.print(token + "\t");
        }
    }
Object
    DataInstance data = new DataInstance(line, … );
    int clicks = data.clicks;
    data.print();
Collections
Array: fixed length, most compact
– int[] tokens
– double[] weights
ArrayList: grows dynamically (capacity increases geometrically, by about 1.5x per resize in OpenJDK)
– ArrayList
HashMap: constant-time key-value lookup; grows dynamically, uses more memory
– HashMap
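The three containers can be compared side by side in a short sketch. The class CollectionsDemo, the helper countTokens and the sample values are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: class name, helper and sample values are illustrative assumptions.
class CollectionsDemo {
    // Count how often each token id occurs, using a HashMap for constant-time lookup.
    static Map<Integer, Integer> countTokens(int[] tokens) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int token : tokens) {
            counts.merge(token, 1, Integer::sum); // insert 1 or add 1 to the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] tokens = {3, 7, 3};                // array: length is fixed at creation
        List<Double> losses = new ArrayList<>(); // list: grows as elements are added
        losses.add(0.69);
        losses.add(0.52);
        Map<Integer, Integer> counts = countTokens(tokens);
        System.out.println(tokens.length); // prints 3
        System.out.println(losses.size()); // prints 2
        System.out.println(counts.get(3)); // prints 2
    }
}
```

Note that ArrayList and HashMap store boxed objects (Double, Integer) rather than primitives, which is one reason a plain array is the most compact choice.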
Variables
“Everything” in Java is an Object
– Except for primitive types such as int and double
All object variables are references (pointers) to the Object
Methods pass arguments by value; for an object variable, the value that is passed is the reference
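The consequences of passing by value can be checked with a small sketch. The class PassByValueDemo and its three helper methods are hypothetical names chosen for illustration.

```java
// Sketch only: class and method names are illustrative assumptions.
class PassByValueDemo {
    // The parameter is a copy of the caller's int, so the caller's variable is unchanged.
    static void increment(int x) {
        x = x + 1;
    }

    // The reference is copied, but both copies point to the same array,
    // so a write through the parameter is visible to the caller.
    static void setFirst(double[] w) {
        w[0] = 1.0;
    }

    // Reassigning the parameter only changes the local copy of the reference.
    static void replace(double[] w) {
        w = new double[] {9.0};
    }

    public static void main(String[] args) {
        int count = 0;
        increment(count);
        double[] weights = {0.0};
        setFirst(weights);
        replace(weights);
        System.out.println(count);      // prints 0
        System.out.println(weights[0]); // prints 1.0
    }
}
```

This is why a method can update a weight array or a Map in place, yet cannot make the caller's variable point to a different object.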
Example: SGD for linear regression
Demo
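The demo itself is not reproduced in these notes. As a stand-in, here is a minimal sketch of SGD for one-dimensional linear regression. The class LinearRegressionSgd, the toy data, the step size, the epoch count and the random seed are all assumptions and are not taken from the recitation.

```java
import java.util.Random;

// Sketch only: the data, step size, epoch count and seed are illustrative assumptions.
class LinearRegressionSgd {
    double w = 0.0; // slope
    double b = 0.0; // intercept

    double predict(double x) {
        return w * x + b;
    }

    // One SGD step on the squared error 0.5 * (prediction - y)^2 of a single example.
    void update(double x, double y, double stepSize) {
        double error = predict(x) - y;
        w -= stepSize * error * x;
        b -= stepSize * error;
    }

    // Each epoch draws xs.length examples uniformly at random, with replacement.
    void train(double[] xs, double[] ys, double stepSize, int epochs, long seed) {
        Random random = new Random(seed);
        for (int e = 0; e < epochs; e++) {
            for (int i = 0; i < xs.length; i++) {
                int j = random.nextInt(xs.length);
                update(xs[j], ys[j], stepSize);
            }
        }
    }

    public static void main(String[] args) {
        double[] xs = {0.0, 1.0, 2.0, 3.0};
        double[] ys = {1.0, 3.0, 5.0, 7.0}; // generated from y = 2x + 1, no noise
        LinearRegressionSgd model = new LinearRegressionSgd();
        model.train(xs, ys, 0.05, 2000, 42L);
        System.out.printf("w = %.3f, b = %.3f%n", model.w, model.b);
    }
}
```

On this noise-free toy data the learned slope and intercept should approach 2 and 1. With noisy data a decaying step size is usually needed for the iterates to settle.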