Presentation is loading. Please wait.

Presentation is loading. Please wait.

Indexing and Data Exchange Formats Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems October 17, 2015.

Similar presentations


Presentation on theme: "Indexing and Data Exchange Formats Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems October 17, 2015."— Presentation transcript:

1 Indexing and Data Exchange Formats Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems October 17, 2015

2 Reminders  Homework 1 Milestone 1 due tonight @ 11:59:59 2

3 3 Content Indexing: Unordered and Ordered Lists  Assume that we have entries such as:  What does ordering buy us?  Assume that we adopt a model in which we use:  Do we get any additional benefits?  How about: where we fix the size of the keyword and the number of items?

4 4 A Common Disk-Based Lookup Scheme: Tree-Based Indices Trees have several benefits over lists:  Potentially, logarithmic search time, as with a well- designed sorted list, IF it’s balanced  Ability to handle variable-length records We’ve already seen how trees might make a natural way of distributing data, as well How does a binary search tree fare?  Cost of building?  Cost of finding an item in it?

5 B+ Tree: A Flexible, Height-Balanced, High-Fanout Tree  Insert/delete at log F N cost  (F = fanout, N = # leaf pages)  Keep tree height-balanced  Minimum 50% occupancy (except for root)  Each node contains d <= m <= 2d entries d is called the order of the tree  Can search efficiently based on equality (or also range, though we don’t need that here) Index Entries Data Entries ("Sequence set") (Direct search)

6 Example B+ Tree  Data (inverted list ptrs) is at leaves; intermediate nodes have copies of search keys  Search begins at root, and key comparisons direct it to a leaf  Search for be ↓, bobcat ↓...  Based on the search for bobcat*, we know it is not in the tree! Root best but dog a↓a↓ am ↓ an↓ ant↓art↓be↓ best↓bit↓ bob↓but↓can↓ cry↓dog↓dry↓ elf↓ fox↓ art

7 Inserting Data into a B+ Tree  Find correct leaf L  Put data entry onto L  If L has enough space, done!  Else, must split L (into L and a new node L2)  Redistribute entries evenly, copy up middle key  Insert index entry pointing to L2 into parent of L  This can happen recursively  To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.)  Splits “grow” tree; root split increases height  Tree growth: gets wider or one level taller at top

8 8 Inserting “and ↓ ” Example: Copy up Want to insert here; no room, so split & copy up: a↓ am ↓ an↓ ant↓ and↓ an Entry to be inserted in parent node. (Note that key “an” is copied up and continues to appear in the leaf.) and↓ Root best but dog a↓a↓ am ↓ an↓ ant↓art↓be↓ best↓bit↓ bob↓but↓can↓ cry↓dog↓dry↓ elf↓ fox↓ art

9 9 Inserting “and ↓ ” Example: Push up 1/2 Root art↓be↓ best↓bit↓ bob↓but↓can↓ cry↓ an Need to split node & push up best but dog art a↓ am ↓ dog↓dry↓ elf↓ fox↓ an↓ ant↓ and↓

10 10 Inserting “and ↓ ” Example: Push up 2/2 Root art↓be↓ best↓bit↓ bob↓but↓can↓ cry↓ an butdog best art Entry to be inserted in parent node. (Note that best is pushed up and only appears once in the index. Contrast this with a leaf split.) a↓ am ↓ dog↓dry↓ elf↓ fox↓ an↓ ant↓ and↓

11 11 Copying vs. Splitting, Summarized  Every keyword (search key) appears in at most one intermediate node  Hence, in splitting an intermediate node, we push up  Every inverted list entry must appear in the leaf  We may also need it in an intermediate node to define a partition point in the tree  We must copy up the key of this entry  Note that B+ trees easily accommodate multiple occurrences of a keyword

12 Virtues of the B+ Tree  B+ tree and other indices are quite efficient:  Height-balanced; log F N cost to search  High fanout (F) means depth rarely more than 3 or 4  Almost always better than maintaining a sorted file  Typically, 67% occupancy on average  A very similar structure: ISAM trees (big difference on updates)  Berkeley DB library (C, C++, Java; Oracle) is a toolkit for B+ trees that you will be using later in the semester:  Interface: open B+ Tree; get and put items based on key  Handles concurrency, caching, etc.

13 13 How Do We Distribute a B+ Tree? A Simple Method  We need to host the root at one machine and distribute the rest  What are the implications for scalability?  Consider building the index as well as searching

14 14 Eliminating the Root  Sometimes we don’t want a tree-structured system because the higher levels can be a central point of congestion or failure  Two strategies:  Modified tree structure (e.g., BATON, Jagadish et al.)  Non-hierarchical structure (distributed hash table, discussed in a couple of weeks)

15 Kinds of Content  Keyword search and inverted indices are great for locating text documents  … But what if we want to index and/or share other kinds of content?  Spreadsheets  Maps  Purchase records  Objects  etc.  Let’s talk about structured data representation and transport, then later indexing and retrieval… 15

16 16 Sending Data  How do we send data within a program?  What is the implicit model?  How does this change when we need to make the data persistent?  What happens when we are coupling systems?  How do we send data between programs on the same machine?  Between different machines?

17 17 Marshalling  Converting from an in-memory data structure to something that can be sent elsewhere  Pointers -> something else  Specific byte orderings  Metadata  Note that the same logical data gets a different physical encoding  A specific case of Codd’s idea of logical-physical separation  “Data model” vs. “data”

18 18 Communication and Streams  When storing data to disk, we have a combination of sequential and random access  When sending data on “the wire”, data is only sequential  “Stream-based communication” based on packets  What are the implications here?  Pipelining, incremental evaluation, …

19 19 Why Data Interchange Is Hard Need to be able to understand:  Data encoding (physical data model)  May have syntactic heterogeneity  Endian-ness, marshalling issues  Impedance mismatches  Data representation (logical data model)  May have semantic heterogeneity  Imprecise and ambiguous values/descriptions

20 20 Examples MP3 ID3 format – record at end of file offsetlengthdescription 03"TAG" identifier string. 330Song title string. 3330Artist string. 6330Album string. 934Year string. 9728Comment string. 1251Zero byte separator. 1261Track byte. 1271Genre byte.

21 21 Examples JPEG “JFIF” header:  Start of Image (SOI) marker -- two bytes (FFD8)  JFIF marker (FFE0)  length -- two bytes  identifier -- five bytes: 4A, 46, 49, 46, 00 (the ASCII code equivalent of a zero terminated "JFIF" string)  version -- two bytes: often 01, 02  the most significant byte is used for major revisions  the least significant byte for minor revisions  units -- one byte: Units for the X and Y densities  0 => no units, X and Y specify the pixel aspect ratio  1 => X and Y are dots per inch  2 => X and Y are dots per cm  X density -- two bytes  Y density -- two bytes  X thumbnail -- one byte: 0 = no thumbnail  Y thumbnail -- one byte: 0 = no thumbnail  (RGB)n -- 3n bytes: packed (24-bit) RGB values for the thumbnail pixels, n = X thumbnail * Y thumbnail

22 22 Finding File Formats  http://www.wikipedia.org/ http://www.wikipedia.org/  http://www.wotsit.org/ http://www.wotsit.org/  etc.

23 23 The Problem  You need to look into a manual to find file formats  (At best, e.g., MS.DOC file format)  The Web is about making data exchange easier… Maybe we can do better!  “The mother of all file formats”

24 24 Desiderata for Data Interchange  Ability to represent many kinds of information Different data structures  Hardware-independent encoding Endian-ness, UTF vs. ASCII vs. EBCDIC  Standard tools and interfaces  Ability to define “shape” of expected data With forwards- and backwards-compatibility!  That’s XML…

25 25 Consumers of XML A myriad of tools and interfaces, including:  DOM – document object model  Standard OO representation of an XML tree  SAX – simple API for XML  An event-driven parser interface for XML  startElement, endElement, etc.  Ant – Java-based “make” tool with XML “makefile”  XPath, XQuery, XSL, XSLT  Web service standards  Anything AJAX (“mash-ups”)

26 26 XML as a Data Model XML “information set” includes 7 types of nodes:  Document (root)  Element  Attribute  Processing instruction  Text (content)  Namespace:  Comment XML data model includes this, plus typing info, plus order info and a few other things

27 27 Example XML Document Kurt P. Brown PRPL: A Database Workload Specification Language 1992 Univ. of Wisconsin-Madison Paul R. McJones The 1995 SQL Reunion Digital System Research Center Report SRC1997-018 1997 db/labs/dec/SRC1997-018.html http://www.mcjones.org/System_R/SQL_Reunion_95/ Processing Instr. Element Attribute Close-tag Open-tag

28 28 XML Data Model Visualized (~ Document Object Model) Root ?xml dblp mastersthesis article mdate key authortitleyearschool editortitleyearjournalvolumeee mdate key 2002… ms/Brown92 Kurt P…. PRPL… 1992 Univ…. 2002… tr/dec/… Paul R. The… Digital… SRC… 1997 db/labs/dec http://www. attribute root p-i element text

29 29 A Few Common Uses of XML Serves as an extensible HTML  Allows custom tags (e.g., used by MS Word, openoffice)  Supplement it with stylesheets (XSL) to define formatting Provides an exchange format for data (still need to agree on terminology)  Tables, objects, etc. Format for marshalling and unmarshalling data in Web Services

30 30 XML as a Super-HTML (MS Word) CIS 550: Database and Information Systems Fall 2003 311 Towne, Tuesday/Thursday 1:30PM – 3:00PM

31 31 XML Easily Encodes Relations idcoursegrade 1330-f03B 23455-s04A 1 330-f03 B 23 455-s04 A Student-course-grade

32 32 It Also Encodes Objects (with Pointers Represented as IDs) Programming Joan Jill www…. …

33 33 XML and Code  Web Services (.NET, Java web service toolkits) are using XML to pass parameters and make function calls – marshalling as part of remote procedure calls  SOAP + WSDL  Why?  Easy to be forwards-compatible  Easy to read over and validate (?)  Generally firewall-compatible  Drawbacks? XML is a verbose and inefficient encoding!  But if the calls are only sending a few 100s of bytes, who cares?

34 34 XML When Tags Are Used by Different Sources  Namespaces allow us to specify a context for different tags  Two parts:  Binding of namespace to URI  Qualified names http://www.fictitious.com/mypath is in default namespace this a different tag

35 35 XML Isn’t Enough on Its Own  It’s too unconstrained for many cases!  How will we know when we’re getting garbage?  How will we query?  How will we understand what we got?


Download ppt "Indexing and Data Exchange Formats Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems October 17, 2015."

Similar presentations


Ads by Google