CPS 100 4.1 Data processing example l Scan a large (~ 10 7 bytes) file l Print the 20 most frequently used words together with counts of how often they.

Slides:



Advertisements
Similar presentations
Data Structures: A Pseudocode Approach with C
Advertisements

Razdan with contribution from others 1 Algorithm Analysis What is the Big ‘O Bout? Anshuman Razdan Div of Computing.
Algorithmic Complexity Nelson Padua-Perez Bill Pugh Department of Computer Science University of Maryland, College Park.
Introduction to Analysis of Algorithms
 Last lesson  Arrays for implementing collection classes  Performance analysis (review)  Today  Performance analysis  Logarithm.
Computer Science 2 Data Structures and Algorithms V section 2 Intro to “big o” Lists Professor: Evan Korth New York University 1.
Data Structures CS 310. Abstract Data Types (ADTs) An ADT is a formal description of a set of data values and a set of operations that manipulate the.
Elementary Data Structures and Algorithms
Analysis of Algorithms COMP171 Fall Analysis of Algorithms / Slide 2 Introduction * What is Algorithm? n a clearly specified set of simple instructions.
CPS 100, Spring From data to information to knowledge l Data that’s organized can be processed  Is this a requirement?  What does “organized”
COMP s1 Computing 2 Complexity
SIGCSE Tradeoffs, intuition analysis, understanding big-Oh aka O-notation Owen Astrachan
Comp 249 Programming Methodology Chapter 15 Linked Data Structure - Part B Dr. Aiman Hanna Department of Computer Science & Software Engineering Concordia.
Analysis of Algorithm Lecture 3 Recurrence, control structure and few examples (Part 1) Huma Ayub (Assistant Professor) Department of Software Engineering.
Week 2 CS 361: Advanced Data Structures and Algorithms
CPS What’s easy, hard, first? l Always do the hard part first. If the hard part is impossible, why waste time on the easy part? Once the hard part.
CPS 100, Spring Analysis: Algorithms and Data Structures l We need a vocabulary to discuss performance and to reason about alternative algorithms.
Nyhoff, ADTs, Data Structures and Problem Solving with C++, Second Edition, © 2005 Pearson Education, Inc. All rights reserved ADT Implementation:
1 Chapter 24 Developing Efficient Algorithms. 2 Executing Time Suppose two algorithms perform the same task such as search (linear search vs. binary search)
CS 1704 Introduction to Data Structures and Software Engineering.
Analysis of Algorithms
Chapter 13 Recursion. Learning Objectives Recursive void Functions – Tracing recursive calls – Infinite recursion, overflows Recursive Functions that.
Analysis of Algorithms These slides are a modified version of the slides used by Prof. Eltabakh in his offering of CS2223 in D term 2013.
CPS 100, Spring From data to information l Data that’s organized can be processed  Is this a requirement?  What does “organized” means l Purpose.
Program Efficiency & Complexity Analysis. Algorithm Review An algorithm is a definite procedure for solving a problem in finite number of steps Algorithm.
CSC 211 Data Structures Lecture 13
Algorithm Analysis CS 400/600 – Data Structures. Algorithm Analysis2 Abstract Data Types Abstract Data Type (ADT): a definition for a data type solely.
Dynamic Array. An Array-Based Implementation - Summary Good things:  Fast, random access of elements  Very memory efficient, very little memory is required.
CompSci 100e 7.1 Plan for the week l Review:  Union-Find l Understand linked lists from the bottom up and top-down  As clients of java.util.LinkedList.
CPS From sets, toward maps, via big-Oh l Consider the MemberCheck weekly problem  Find elements in common to two vectors  Alphabetize/sort resulting.
(c) University of Washington16-1 CSC 143 Java Linked Lists Reading: Ch. 20.
APS105 Lists. Structures Arrays allow a collection of elements –All of the same type How to collect elements of different types? –Structures; in C: struct.
CompSci Analyzing Algorithms  Consider three solutions to SortByFreqs, also code used in Anagram assignment  Sort, then scan looking for changes.
1/6/20161 CS 3343: Analysis of Algorithms Lecture 2: Asymptotic Notations.
1 Today’s Material List ADT –Definition List ADT Implementation: LinkedList.
CompSci 100E 15.1 What is a Binary Search?  Magic!  Has been used a the basis for “magical” tricks  Find telephone number ( without computer ) in seconds.
CS 150: Analysis of Algorithms. Goals for this Unit Begin a focus on data structures and algorithms Understand the nature of the performance of algorithms.
Algorithm Analysis. What is an algorithm ? A clearly specifiable set of instructions –to solve a problem Given a problem –decide that the algorithm is.
Searching Topics Sequential Search Binary Search.
CPS 100e 5.1 Inheritance and Interfaces l Inheritance models an "is-a" relationship  A dog is a mammal, an ArrayList is a List, a square is a shape, …
CompSci 100E 19.1 Getting in front  Suppose we want to add a new element  At the back of a string or an ArrayList or a …  At the front of a string.
Compsci 100, Fall From data to information to knowledge l Data that’s organized can be processed  Is this a requirement?  What does “organized”
Analysis of Algorithms Spring 2016CS202 - Fundamentals of Computer Science II1.
Chapter 15 Running Time Analysis. Topics Orders of Magnitude and Big-Oh Notation Running Time Analysis of Algorithms –Counting Statements –Evaluating.
CompSci 100e 7.1 Plan for the week l Review:  Boggle  Coding standards and inheritance l Understand linked lists from the bottom up and top-down  As.
CPS 100e 6.1 What’s the Difference Here? l How does find-a-track work? Fast forward?
CPS 100, Spring Tools: Solve Computational Problems l Algorithmic techniques  Brute-force/exhaustive, greedy algorithms, dynamic programming,
Data Structures and Algorithm Analysis Dr. Ken Cosh Linked Lists.
CC 215 DATA STRUCTURES LINKED LISTS Dr. Manal Helal - Fall 2014 Lecture 3 AASTMT Engineering and Technology College 1.
AP National Conference, AP CS A and AB: New/Experienced A Tall Order? Mark Stehlik
TAFTD (Take Aways for the Day)
Algorithm Analysis 1.
Analysis: Algorithms and Data Structures
Analysis of Algorithms
Plan for the Week Review Big-Oh
From data to information
What’s the Difference Here?
Compsci 201, Mathematical & Emprical Analysis
Owen Astrachan Jeff Forbes October 19, 2017
What's a pointer, why good, why bad?
Algorithm Analysis CSE 2011 Winter September 2018.
Map interface Empty() - return true if the map is empty; else return false Size() - return the number of elements in the map Find(key) - if there is an.
Algorithm Analysis (not included in any exams!)
Building Java Programs
Algorithm design and Analysis
Building Java Programs
Searching, Sorting, and Asymptotic Complexity
Data Structures & Algorithms
Linked lists Low-level (concrete) data structure, used to implement higher-level structures Used to implement sequences/lists (see CList in Tapestry) Basis.
Java Basics – Arrays Should be a very familiar idea
Presentation transcript:

CPS Data processing example l Scan a large (~ 10 7 bytes) file l Print the 20 most frequently used words together with counts of how often they occur l Need more specification? l How do you do it?

CPS Possible solutions 1. Use heavy duty data structures (Knuth)  Hash tries implementation  Randomized placement  Lots o’ pointers  Several pages 2. UNIX shell script (Doug Mclroy) tr -cs “[:alpha:]” “[\n*]” < FILE | \ sort | \ uniq -c | \ sort -n -r -k 1,1 | \ head -20 Which is better?  K.I.S.?

CPS Analyzing Algorithms l Consider three solutions to SortByFreqs or ZipfsLaw  Sort, then scan looking for changes  Insert into Set, then count each unique string  Find unique elements without sorting, sort these, then count each unique string  Use a Map (TreeMap or HashMap) l We want to discuss trade-offs of these solutions  Ease to develop, debug, verify  Runtime efficiency  Vocabulary for discussion

CPS Dropping Glass Balls l Tower with N Floors l Given 2 glass balls l Want to determine the lowest floor from which a ball can be dropped and will break l How? l What is the most efficient algorithm? l How many drops will it take for such an algorithm (as a function of N)?

CPS Glass balls revisited (more balls) l Assume the number of floors is 100 l In the best case how many balls will I have to drop to determine the lowest floor where a ball will break? l In the worst case, how many balls will I have to drop? If there are n floors, how many balls will you have to drop? ( roughly )

CPS What is big-Oh about? (preview) l Intuition: avoid details when they don’t matter, and they don’t matter when input size (N) is big enough  For polynomials, use only leading term, ignore coefficients y = 3x y = 6x-2 y = 15x + 44 y = x 2 y = x 2 -6x+9 y = 3x 2 +4x The first family is O(n), the second is O(n 2 )  Intuition: family of curves, generally the same shape  More formally: O(f(n)) is an upper-bound, when n is large enough the expression cf(n) is larger  Intuition: linear function: double input, double time, quadratic function: double input, quadruple the time

CPS Recall adding to list (class handout) l Add one element to front of ArrayList  Shift all elements  Cost N for N -element list  Cost … + N = N(N+1)/2 if repeated l Add one element to front of LinkedList  No shifting, add one link  Cost is independent of N, constant-time cost  Cost … + 1 = N if repeated

CPS More on O-notation, big-Oh l Big-Oh hides/obscures some empirical analysis, but is good for general description of algorithm  Allows us to compare algorithms in the limit 20N hours vs N 2 microseconds: which is better? O-notation is an upper-bound, this means that N is O(N), but it is also O(N 2 ) ; we try to provide tight bounds. Formally:  A function g(N) is O(f(N)) if there exist constants c and n such that g(N) n cf(N) g(N) x = n

CPS Which graph is “best” performance?

CPS Big-Oh calculations from code l Search for element in an array:  What is complexity of code (using O-notation)?  What if array doubles, what happens to time? for(int k=0; k < a.length; k++) { if (a[k].equals(target)) return true; }; return false; l Complexity if we call N times on M-element vector?  What about best case? Average case? Worst case?

CPS Amortization: Expanding ArrayLists Expand capacity of list when add() called Calling add N times, doubling capacity as needed l What if we grow size by one each time? Item #Resizing costCumulative cost Resizing Cost per item Capacity After add m m+1 2 m+1 2 m+2 -2around 22 m+1

CPS Some helpful mathematics l … + N  N(N+1)/2, exactly = N 2 /2 + N/2 which is O(N 2 ) why? l N + N + N + …. + N (total of N times)  N*N = N 2 which is O(N 2 ) l N + N + N + …. + N + … + N + … + N (total of 3N times)  3N*N = 3 N 2 which is O(N 2 ) l … + 2 N  2 N+1 – 1 = 2 x 2 N – 1 which is O(2 N ) l Impact of last statement on adding 2 N +1 elements to a vector  … + 2 N + 2 N+1 = 2 N+2 -1 = 4x2 N -1 which is O(2 N ) resizing + copy = total (let x = 2 N )

CPS Running 10 6 instructions/sec NO(log N)O(N)O(N log N)O(N 2 ) , , min 100, hr 1,000, day 1,000,000, min18.3 hr318 centuries

CPS Getting in front l Suppose we want to add a new element  At the back of a string or an ArrayList or a …  At the front of a string or an ArrayList or a …  Is there a difference? Why? What's complexity? l Suppose this is an important problem: we want to grow at the front (and perhaps at the back)  Think editing film clips and film splicing  Think DNA and gene splicing l Self-referential data structures to the rescue  References, reference problems, recursion, binky

CPS What’s the Difference Here? l How does find-a-track work? Fast forward?

CPS Contrast LinkedList and ArrayList l See ISimpleList, SimpleLinkedList, SimpleArrayList  Meant to illustrate concepts, not industrial-strength  Very similar to industrial-strength, however l ArrayList --- why is access O(1) or constant time?  Storage in memory is contiguous, all elements same size  Where is the 1 st element? 40 th ? 360 th ?  Doesn’t matter what’s in the ArrayList, everything is a pointer or a reference (what about null?)

CPS What about LinkedList? l Why is access of N th element linear time? l Why is adding to front constant-time O(1)? front

CPS ArrayLists and linked lists as ADTs l As an ADT (abstract data type) ArrayLists support  Constant-time or O(1) access to the k-th element  Amortized linear or O(n) storage/time with add Total storage used in n-element vector is approx. 2n, spread over all accesses/additions (why?)  Adding a new value in the middle of an ArrayList is expensive, linear or O(n) because shifting required l Linked lists as ADT  Constant-time or O(1) insertion/deletion anywhere, but…  Linear or O(n) time to find where, sequential search l Good for sparse structures: when data are scarce, allocate exactly as many list elements as needed, no wasted space/copying (e.g., what happens when vector grows?)

CPS Linked list applications l Remove element from middle of a collection, maintain order, no shifting. Add an element in the middle, no shifting  What’s the problem with a vector (array)?  Emacs visits several files, internally keeps a linked-list of buffers  Naively keep characters in a linked list, but in practice too much storage, need more esoteric data structures l What’s (3x 5 + 2x 3 + x + 5) + (2x 4 + 5x 3 + x 2 +4x) ?  As a vector (3, 0, 2, 0, 1, 5) and (0, 2, 5, 1, 4, 0)  As a list ((3,5), (2,3), (1,1), (5,0)) and ________?  Most polynomial operations sequentially visit terms, don’t need random access, do need “splicing” l What about (3x ) ?

CPS Linked list applications continued l If programming in C, there are no “growable-arrays”, so typically linked lists used when # elements in a collection varies, isn’t known, can’t be fixed at compile time  Could grow array, potentially expensive/wasteful especially if # elements is small.  Also need # elements in array, requires extra parameter  With linked list, one pointer used to access all the elements in a collection l Simulation/modeling of DNA gene-splicing  Given list of millions of CGTA… for DNA strand, find locations where new DNA/gene can be spliced in Remove target sequence, insert new sequence

CPS Linked lists, CDT and ADT l As an ADT  A list is empty, or contains an element and a list  ( ) or (x, (y, ( ) ) ) l As a picture l As a CDT (concrete data type) public class Node { Node p = new Node(); String value; p.value = “hello”; Node next; p.next = null; }; p 0

CPS Building linked lists l Add words to the front of a list (draw a picture)  Create new node with next pointing to list, reset start of list public class Node { String value; Node next; Node(String s, Node link){ value = s; next = link; } }; // … declarations here Node list = null; while (scanner.hasNext()) { list = new Node(scanner.next(), list); } l What about adding to the end of the list?

CPS Dissection of add-to-front l List initially empty l First node has first word l Each new word causes new node to be created  New node added to front l Rhs of operator = completely evaluated before assignment list A list = new Node(word,list); B Node(String s, Node link) { info = s; next = link;} list

CPS Standard list processing (iterative) l Visit all nodes once, e.g., count them or process them public int size(Node list){ int count = 0; while (list != null) { count++; list = list.next; } return count; } l What changes in code above if we change what “process” means?  Print nodes?  Append “s” to all strings in list?

CPS Nancy Leveson: Software Safety Founded the field l Mathematical and engineering aspects  Air traffic control  Microsoft word "C++ is not state-of-the-art, it's only state-of-the-practice, which in recent years has been going backwards" l Software and steam engines: once extremely dangerous? l l THERAC 25: Radiation machine that killed many people l

CPS Building linked lists continued l What about adding a node to the end of the list?  Can we search and find the end?  If we do this every time, what’s complexity of building an N-node list? Why? l Alternatively, keep pointers to first and last nodes of list  If we add node to end, which pointer changes?  What about initially empty list: values of pointers? Will lead to consideration of header node to avoid special cases in writing code l What about keeping list in order, adding nodes by splicing into list? Issues in writing code? When do we stop searching?

CPS Standard list processing (recursive) l Visit all nodes once, e.g., count them public int recsize(Node list) { if (list == null) return 0; return 1 + recsize(list.next); } l Base case is almost always empty list: null pointer  Must return correct value, perform correct action  Recursive calls use this value/state to anchor recursion  Sometimes one node list also used, two “base” cases l Recursive calls make progress towards base case  Almost always using list.next as argument

CPS Recursion with pictures l Counting recursively int recsize(Node list){ if (list == null) return 0; return 1 + recsize(list.next); } recsize(Node list) return 1+ recsize(list.next) recsize(Node list) return 1+ recsize(list.next) recsize(Node list) return 1+ recsize(list.next) recsize(Node list) return 1+ recsize(list.next) System.out.println(recsize(ptr)); ptr

CPS Recursion and linked lists l Print nodes in reverse order  Print all but first node and… Print first node before or after other printing? public void print(Node list) { if (list != null) { } print(list.next); System.out.println(list.info); print(list.next);

CPS Complexity Practice l What is complexity of Build ? (what does it do?) public Node build(int n) { if (null == n) return null; Node first = new Node(n, build(n-1)); for(int k = 0; k < n-1; k++) { first = new Node(n,first); } return first; } l Write an expression for T(n) and for T(0), solve.  Let T(n) be time for build to execute with n-node list  T(n) = T(n-1) + O(n)

CPS Changing a linked list recursively l Pass list to method, return altered list, assign to list  Idiom for changing value parameters list = change(list, “apple”); public Node change(Node list, String key) { if (list != null) { list.next = change(list.next, key); if (list.info.equals(key)) return list.next; else return list; } return null; } l What does this code do? How can we reason about it?  Empty list, one-node list, two-node list, n -node list  Similar to proof by induction

CPS The Power of Recursion: Brute force l Consider a possible APT: What is minimum number of minutes needed to type n term papers given page counts and three typists typing one page/minute? (assign papers to typists to minimize minutes to completion)  Example: {3, 3, 3,5,9,10,10} as page counts l How can we solve this in general? Suppose we're told that there are no more than 10 papers on a given day.  How does the constraint help us?  What is complexity of using brute-force?