Sets of Digital Data CSCI 2720 Fall 2005 Kraemer.

Digital Data
• In earlier work with BSTs and various balanced trees, we compared keys for order or equality
• Here, we take advantage of the structure of the key
    • Use it as an index, or
    • Decompose a string key into characters, or
    • Treat the key as a numerical quantity on which we can perform operations

Assumptions
• We will construct and manipulate sets that
    • Are drawn from a universe U of size N
    • U = {u_0, …, u_{N-1}}
• A relatively simple procedure exists by which we can compute, for an element u ∈ U, the index i such that u = u_i
    • Easy if U is a set of integers
    • Also easy if U is a set of characters with character codes in a contiguous interval

Bit Vector
• Used to represent a subset S ⊆ U
• A table of N bits, Bits[0..N-1]
    • Bits[i] == 1 if u_i ∈ S
    • Bits[i] == 0 if u_i ∉ S
• Example: today’s attendance
    Bits:            1 1 0 1 0 1 1
    Student number:  0 1 2 3 4 5 6
    (1 = present, 0 = absent)
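A minimal sketch of the bit vector, assuming elements are already identified by their index i in 0..N-1. The class name `BitVector` and the byte-packed layout are illustrative choices, not part of the slides.

```python
class BitVector:
    """Subset of a universe {u_0, ..., u_{N-1}}, stored as one bit per element."""

    def __init__(self, n):
        self.n = n
        # bytearray(k) is zero-filled, so the set starts out empty.
        self.bits = bytearray((n + 7) // 8)

    def _check(self, i):
        if not 0 <= i < self.n:
            raise IndexError(f"index {i} is outside a universe of size {self.n}")

    def insert(self, i):
        self._check(i)
        self.bits[i >> 3] |= 1 << (i & 7)

    def delete(self, i):
        self._check(i)
        self.bits[i >> 3] &= ~(1 << (i & 7)) & 0xFF

    def member(self, i):
        self._check(i)
        return bool(self.bits[i >> 3] & (1 << (i & 7)))


# The attendance example: students 0, 1, 3, 5 and 6 are present.
attendance = BitVector(7)
for student in (0, 1, 3, 5, 6):
    attendance.insert(student)
```

Each operation touches a single byte, so its cost does not depend on how many elements the set currently holds.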

Bit Vectors
• Assume:
    • Determining an element's index takes constant time
    • Accessing a position in the table takes constant time
    • These may actually take several operations and depend somewhat on N (the size of the universe), but not on the size of the set represented
• Then:
    • Insert, Delete, and Member are constant-time operations

Bit Vectors
• A subset of a set of size N always takes N bits to represent, independent of the size of the subset
• Makes sense if:
    • N is not too large
    • We need to represent sets of size comparable to N

Storage Efficiency
• Bit Vector vs. Binary Trees
• Binary tree, set of size n
    • Requires n(2p + K) bits
    • K >= lg N, the size of the field that represents the key value
    • p = number of bits in a pointer
• Bit vector takes N bits
• If n ≈ N, then the bit vector is more efficient
• If p = K = 32, then the tree becomes more space efficient when n/N falls below about 1%
    • Exactly: the break-even point is n(2p + K) = N, which is when n/N = 1/96
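The break-even arithmetic can be checked with a few lines of Python. The function names are illustrative, and the formulas are the ones stated on this slide.

```python
from fractions import Fraction


def tree_bits(n, p, K):
    """Bits for a binary tree of n nodes: two p-bit pointers plus a K-bit key each."""
    return n * (2 * p + K)


def bit_vector_bits(N):
    """Bits for a bit vector over a universe of size N."""
    return N


def break_even_ratio(p, K):
    """The ratio n/N at which n(2p + K) = N."""
    return Fraction(1, 2 * p + K)


ratio = break_even_ratio(32, 32)   # 1/96 for p = K = 32
N = 96_000
n = int(N * ratio)                 # 1000 elements: both structures use N bits
```

With fewer than `n` elements the tree is smaller; with more, the bit vector is smaller.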

When to use Bit Vectors?
• When the universe is relatively small
• When sets are large in relation to the size of the universe

Advantages of Bit Vectors
• O(1) implementation of Insert, Delete, Member
• Union and Intersection are easy
    • Implement via Boolean OR and AND operations
    • May actually take less than one operation per element, as operations are performed on full machine words
    • If the machine word is 32 bits, then one machine operation handles 32 potential elements of the set
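A small sketch of word-at-a-time Union and Intersection, assuming 32-bit words held in a Python list. Python integers are not fixed-width, so `WORD` only models the machine word, and the helper names are illustrative.

```python
WORD = 32  # assumed machine word size in bits


def make_set(indices, n_words):
    """Pack element indices into a list of WORD-bit words."""
    words = [0] * n_words
    for i in indices:
        words[i // WORD] |= 1 << (i % WORD)
    return words


def union(a, b):
    # One OR per word covers WORD potential elements.
    return [x | y for x, y in zip(a, b)]


def intersection(a, b):
    # One AND per word covers WORD potential elements.
    return [x & y for x, y in zip(a, b)]


def members(words):
    """List the indices whose bit is set, in increasing order."""
    return [w * WORD + b
            for w, word in enumerate(words)
            for b in range(WORD)
            if (word >> b) & 1]


a = make_set([1, 5, 40], n_words=2)
b = make_set([5, 40, 63], n_words=2)
```

For a universe of 64 elements, each set operation here takes two word operations rather than 64 per-element steps.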

Disadvantages of Bit Vectors
• On some computers, access to individual bits can require shifting and masking operations (expensive)
    • Result is that Member may be much more expensive than Union
• Initialization takes Θ(N): zero all the bits in the vector
    • But can use a constant-time initialization algorithm
    • But that makes the storage requirement go to 2p + 1 bits per element
    • So, in practice, just use machine operations to set the vector to zero, which are efficient
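The slide does not spell out the constant-time initialization algorithm. One well-known technique that matches the 2p + 1 bits per element is to keep two pointer arrays alongside the bits and trust a bit only if the two arrays vouch for it. The sketch below assumes that technique. Python always initializes memory, so the "uninitialized" arrays are simulated with random garbage (which itself costs O(N) here); the names `when`, `which` and `top` are illustrative.

```python
import random


class LazyBitVector:
    """Bit vector whose only real initialization is setting `top` to 0."""

    def __init__(self, n, seed=0):
        if n <= 0:
            raise ValueError("universe must be non-empty")
        rng = random.Random(seed)
        self.n = n
        # Stand-ins for uninitialized memory: arbitrary garbage contents.
        self.bits = [rng.randrange(2) for _ in range(n)]
        self.when = [rng.randrange(n) for _ in range(n)]   # p bits per element
        self.which = [rng.randrange(n) for _ in range(n)]  # p bits per element
        self.top = 0  # number of positions written so far

    def _valid(self, i):
        # bits[i] is trustworthy only if when[i] points into the used part
        # of `which` and that entry points back at i.
        j = self.when[i]
        return 0 <= j < self.top and self.which[j] == i

    def insert(self, i):
        if not self._valid(i):
            self.when[i] = self.top
            self.which[self.top] = i
            self.top += 1
        self.bits[i] = 1

    def delete(self, i):
        if self._valid(i):
            self.bits[i] = 0

    def member(self, i):
        return self._valid(i) and self.bits[i] == 1
```

Every operation stays O(1), at the cost of the two extra pointer-sized fields per element that the slide mentions.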

Tries and Digital Search Trees
• If the key can be decomposed into characters, then the characters of the key can be used as indices
• Tries are based on this idea
• “trie” comes from the middle letters of “retrieval”; it is a pun on “tree”, but pronounced “try”

Tries
• Assume k possible character values
• A trie is a (k+1)-ary tree
    • Each node is a table of k+1 pointers
    • One pointer for each possible character
    • One for the end-of-string character

Trie Example

Tries
• The path for a key of m characters has length m, with a pointer at the end-of-string entry
• Don’t need to store the key itself: it is the path followed
• An info field might be pointed to by the end-of-string element
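A minimal sketch of such a trie, assuming an alphabet of the uppercase letters A to Z (so k = 26) and using slot k of each node for the end-of-string entry. The function names and the `info` parameter are illustrative.

```python
K = 26    # assumed alphabet: uppercase A-Z
END = K   # slot k of each node is the end-of-string entry


def new_node():
    return [None] * (K + 1)   # k+1 pointers, all initially null


def index(ch):
    i = ord(ch) - ord("A")
    if not 0 <= i < K:
        raise ValueError(f"character {ch!r} is outside the assumed alphabet")
    return i


def insert(root, key, info=True):
    node = root
    for ch in key:
        i = index(ch)
        if node[i] is None:
            node[i] = new_node()
        node = node[i]
    node[END] = info   # the key itself is never stored, only its path


def lookup(root, key):
    """Return the info stored for key, or None if key is absent."""
    node = root
    for ch in key:
        node = node[index(ch)]
        if node is None:
            return None
    return node[END]


def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node[:K] if c is not None)


root = new_node()
for name in ("MENDEL", "MENDELEEV", "TURING"):
    insert(root, name)
```

Lookup follows one pointer per character, so its cost depends only on the key length. With these three keys the trie has 16 nodes of 27 pointers each, almost all of them null.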

Tries: Analysis
• Let:
    • n be the number of keys stored in a trie
    • l be the length (in characters) of the longest key
    • s be the number of nodes in the trie
    • k be the size of the alphabet
• Pro:
    • Access time is O(l), independent of k, n and s
• Con:
    • Size: requires (k+1) * s * p bits
    • Most pointers are null, so lots of wasted space

Strategies for reducing storage requirements of tries
1. Implement a k-ary trie with m nodes as a 2-D, m by k table
[Figure: table with one row per node and one column per character (A B C D E … M … P … T …); each entry holds a child node number or is empty]

Table approach
• Number the nodes in the diagram of slide 13 from 1 to m
• The table entry corresponding to the j-th child of the i-th node is the index of the child node
• How does that save space? There are just as many nodes and elements as on slide 13
    • … each entry needs only ceil(lg(m)) bits to represent, smaller than a pointer …
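A sketch of the table representation, assuming the uppercase alphabet A to Z plus one end-of-string column. Nodes are numbered from 1 with the root as node 1, and 0 is reserved here to mean "no child". Because of that reserved value this sketch needs ceil(lg(m+1)) bits per entry rather than the slide's ceil(lg(m)). The class and method names are illustrative.

```python
K = 26    # assumed alphabet: uppercase A-Z
END = K   # extra column flagging that a key ends at this node


def col(ch):
    c = ord(ch) - ord("A")
    if not 0 <= c < K:
        raise ValueError(f"character {ch!r} is outside the assumed alphabet")
    return c


class TableTrie:
    def __init__(self):
        # Row 0 is unused so that entry 0 can mean "empty"; row 1 is the root.
        self.table = [[0] * (K + 1), [0] * (K + 1)]

    def insert(self, key):
        node = 1
        for ch in key:
            c = col(ch)
            if self.table[node][c] == 0:
                self.table.append([0] * (K + 1))
                self.table[node][c] = len(self.table) - 1
            node = self.table[node][c]
        self.table[node][END] = 1

    def member(self, key):
        node = 1
        for ch in key:
            node = self.table[node][col(ch)]
            if node == 0:
                return False
        return self.table[node][END] == 1

    def node_count(self):
        return len(self.table) - 1

    def bits_per_entry(self):
        # m.bit_length() equals ceil(lg(m + 1)) for m >= 1.
        return self.node_count().bit_length()


trie = TableTrie()
for name in ("MENDEL", "MENDELEEV", "TURING"):
    trie.insert(name)
```

The shape of the trie is unchanged; the saving comes from each entry being a small node number instead of a full pointer.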

Patricia Tree: Another strategy for reducing space in a trie
• Patricia tree
    • Practical Algorithm to Retrieve Information Coded in Alphanumeric
• Eliminate nodes with only one nonempty child
    • Can now skip right from T to the end-of-string marker in TURING in our example
    • Skip from MA …. to E or the end-of-string marker in the MENDEL, MENDELEEV chain
• But need to store with each node the index of the character on which it discriminates
• And need to store the key itself at the leaf
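A sketch of the compressed structure, built in one pass from a fixed set of keys rather than by incremental insertion. Internal nodes are `(index, children)` tuples and leaves are the keys themselves. The marker `END` and both function names are assumptions made for illustration.

```python
END = "\0"  # assumed end-of-string marker; must not occur inside a key


def build(keys, start=0):
    """Build a path-compressed trie from keys that all agree before `start`."""
    keys = sorted(set(keys))
    if len(keys) == 1:
        return keys[0]                        # leaf: store the key itself
    padded = [k + END for k in keys]
    d = start
    while len({p[d] for p in padded}) == 1:   # skip positions where all agree
        d += 1
    groups = {}
    for k, p in zip(keys, padded):
        groups.setdefault(p[d], []).append(k)
    # Internal node: the discriminating index plus one child per character.
    return (d, {ch: build(group, d + 1) for ch, group in groups.items()})


def lookup(node, key):
    padded = key + END
    while isinstance(node, tuple):
        d, children = node
        if d >= len(padded) or padded[d] not in children:
            return False
        node = children[padded[d]]
    # Skipped characters were never examined, so compare against the leaf key.
    return node == key


root = build(["MENDEL", "MENDELEEV", "TURING"])
```

Here the root discriminates on character 0 and leads straight to the leaf TURING, while the M branch jumps to character 6, where MENDEL ends and MENDELEEV continues with E.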

Patricia tree

de la Briandais trees
• Another strategy to save space vs. standard tries
• Use a linked list instead of a table at the node level
    • Each pointer is labeled with the character it indexes
• Longer search time than tries; depends on the size of the character set
• Saves significant amounts of memory
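A minimal sketch of a de la Briandais tree, in which each node carries one character, a pointer to the next sibling, and a pointer to its first child. The end-of-string marker `END`, the sentinel root, and all names are illustrative assumptions.

```python
END = "\0"  # assumed end-of-string marker; must not occur inside a key


class Node:
    def __init__(self, char):
        self.char = char
        self.sibling = None  # next alternative at the same character position
        self.child = None    # first alternative at the next character position


def find(head, ch):
    """Linear scan of a sibling list; cost grows with the alphabet size."""
    node = head
    while node is not None and node.char != ch:
        node = node.sibling
    return node


class DLBTree:
    def __init__(self):
        self.root = Node(None)  # sentinel; its child list holds first characters

    def insert(self, key):
        parent = self.root
        for ch in key + END:
            node = find(parent.child, ch)
            if node is None:
                node = Node(ch)
                node.sibling = parent.child   # prepend to the sibling list
                parent.child = node
            parent = node

    def member(self, key):
        parent = self.root
        for ch in key + END:
            parent = find(parent.child, ch)
            if parent is None:
                return False
        return True


tree = DLBTree()
for name in ("MENDEL", "MENDELEEV", "TURING"):
    tree.insert(name)
```

Only characters that actually occur get a node, which is where the memory saving comes from; the price is the list scan in `find` at every level.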

de la Briandais

Another strategy …
• Use tries at the first few levels
• Use ordinary BSTs or de la Briandais trees at the lower levels
• Reasoning:
    • Speed advantage at the top, but not too much extra memory required
    • Save space at lower levels

Digital Search Trees
• Treat keys as bit strings (strings over the alphabet {0,1})
• Binary tree: search is directed left on 0, right on 1
• Each node contains two pointers and also a key whose leading bits match the path to that node
• Compare for equality before searching left or right
• If frequencies are known, store higher-frequency keys nearer the root
• Can be grown dynamically
• Expected search time: O(log n)
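A sketch of insertion and search in a digital search tree, assuming keys are unsigned integers of a fixed width `W` whose bits are examined from the most significant end. The width and all names are illustrative.

```python
W = 8  # assumed key width in bits


class DSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None    # followed when the current bit is 0
        self.right = None   # followed when the current bit is 1


def check(key):
    if not 0 <= key < (1 << W):
        raise ValueError(f"key {key} does not fit in {W} bits")


def bit(key, depth):
    """Bit number `depth` of key, counting from the most significant bit."""
    return (key >> (W - 1 - depth)) & 1


def insert(root, key):
    """Insert key and return the (possibly new) root."""
    check(key)
    if root is None:
        return DSTNode(key)
    node, depth = root, 0
    while node.key != key:                 # compare for equality first
        if bit(key, depth):
            if node.right is None:
                node.right = DSTNode(key)
                break
            node = node.right
        else:
            if node.left is None:
                node.left = DSTNode(key)
                break
            node = node.left
        depth += 1
    return root


def search(root, key):
    check(key)
    node, depth = root, 0
    while node is not None and node.key != key:
        node = node.right if bit(key, depth) else node.left
        depth += 1
    return node is not None


root = None
for k in (0b10100000, 0b01100000, 0b11000000, 0b10110000):
    root = insert(root, k)
```

Unlike a trie, every node holds a key, so a search can stop as soon as the equality test succeeds, and the first key inserted sits at the root.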

Digital Search Tree
