A In-Memory Compressed XML Representation of Astronomical Data PPARC UK e-Science Postgraduate School ’05 O’Neil Delpratt – PhD Student University of Leicester.

Slides:



Advertisements
Similar presentations
IVOA, Pune India September Data Access Layer Working Group Pune Workshop Summary Doug Tody National Radio Astronomy Observatory International.
Advertisements

Introduction to the BinX Library eDIKT project team Ted Wen Robert Carroll
A PPARC funded project The Grid Data Warehouse Description of prototype work in progress by AstroGrid. Access-Grid lecture to Universities of Leeds and.
Part IV: Memory Management
Database management system (DBMS)  a DBMS allows users and other software to store and retrieve data in a structured way  controls the organization,
Binary Trees CSC 220. Your Observations (so far data structures) Array –Unordered Add, delete, search –Ordered Linked List –??
5th July 2004CPM A Simple Optimal Representation for Balanced Parentheses Richard Geary, Naila Rahman, Rajeev Raman (University of Leicester, UK)
Advanced Databases: Lecture 2 Query Optimization (I) 1 Query Optimization (introduction to query processing) Advanced Databases By Dr. Akhtar Ali.
E-Science Data Information and Knowledge Transformation The BinX Language.
A New Compressed Suffix Tree Supporting Fast Search and its Construction Algorithm Using Optimal Working Space Dong Kyue Kim 1 andHeejin Park 2 1 School.
A PPARC funded project Tony Linde Programme Manager eScience meets eFrameworks 28 th April 2006 NeSC, Edinburgh.
Chapter 11: File System Implementation
Front and Back End: Webpage and Database Management Prepared by Nailya Galimzyanova and Brian J Kapala Supervisor: Prof. Adriano Cavalcanti, PhD College.
File System Implementation
File System Implementation: beyond the user’s view A possible file system layout on a disk.
Anatomy of a Large-Scale Hypertextual Web Search Engine (e.g. Google)
Managing XML and Semistructured Data Lecture 19: Compressing XML Data Prof. Dan Suciu Spring 2001.
CS294-6 Reconfigurable Computing Day 3 September 1, 1998 Requirements for Computing Devices.
Outline Chapter 1 Hardware, Software, Programming, Web surfing, … Chapter Goals –Describe the layers of a computer system –Describe the concept.
1 A Lempel-Ziv text index on secondary storage Diego Arroyuelo and Gonzalo Navarro Combinatorial Pattern Matching 2007.
1 ES 314 Advanced Programming Lec 2 Sept 3 Goals: Complete the discussion of problem Review of C++ Object-oriented design Arrays and pointers.
Compact Representations of Separable Graphs From a paper of the same title submitted to SODA by: Dan Blandford and Guy Blelloch and Ian Kash.
BinX and Astronomy Bob Mann Institute for Astronomy and National e-Science Centre.
Efficient XML Interchange. XML Why is XML good? A widely accepted standard for data representation Fairly simple format Flexible It’s not used by everyone,
C o n f i d e n t i a l Developed By Nitendra NextHome Subject Name: Data Structure Using C Title: Overview of Data Structure.
CHP - 9 File Structures. INTRODUCTION In some of the previous chapters, we have discussed representations of and operations on data structures. These.
An Extension to XML Schema for Structured Data Processing Presented by: Jacky Ma Date: 10 April 2002.
Nov China-VO 架起 VO 与桌面应用的桥梁 崔辰州 China-VO Project 中科院国家天文台 The Chinese V IRTUAL O BSERVATORY.
S. Derriere et al., ESSW03 Budapest, 2003 May 20 UCDs - metadata for astronomy Sébastien Derriere François Ochsenbein Thomas Boch CDS, Observatoire astronomique.
Lecturer: Ghadah Aldehim
Object and component “wiring” standards This presentation reviews the features of software component wiring and the emerging world of XML-based standards.
Astronomical Data Query Language Simple Query Protocol for the Virtual Observatory Naoki Yasuda 1, William O'Mullane 2, Tamas Budavari 2, Vivek Haridas.
MIPS coding. SPIM Some links can be found such as:
Functions and Demo of Astrogrid 1.1 China-VO Haijun Tian.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
The Anatomy of a Large-Scale Hypertextual Web Search Engine Presented By: Sibin G. Peter Instructor: Dr. R.M.Verma.
Introduction n – length of text, m – length of search pattern string Generally suffix tree construction takes O(n) time, O(n) space and searching takes.
Compression of Inverted Indexes for Fast Query Evaluation Falk Scholer Hugh Williams John Yiannis Justin Zobel (RMIT University, Melbourne, Australia)
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
Adaptive Hypermedia Tutorial System Based on AHA Jing Zhai Dublin City University.
Operating Systems (CS 340 D) Dr. Abeer Mahmoud Princess Nora University Faculty of Computer & Information Systems Computer science Department.
Data Structure & File Systems Hun Myoung Park, Ph.D., Public Management and Policy Analysis Program Graduate School of International Relations International.
Search engines 2 Øystein Torbjørnsen Fast Search and Transfer.
Succinct Data Structures Ian Munro University of Waterloo Joint work with David Benoit, Andrej Brodnik, D, Clark, F. Fich, M. He, J. Horton, A. López-Ortiz,
Succinct Dynamic Cardinal Trees with Constant Time Operations for Small Alphabet Pooya Davoodi Aarhus University May 24, 2011 S. Srinivasa Rao Seoul National.
1 5. Abstract Data Structures & Algorithms 5.2 Static Data Structures.
1 UNIT 13 The World Wide Web Lecturer: Kholood Baselm.
Compressed Prefix Sums O’Neil Delpratt Naila Rahman Rajeev Raman.
Lecture by: Prof. Pooja Vaishnav.  Language Processor implementations are highly influenced by the kind of storage structure used for program variables.
Markup basics. Markup languages add commentary to text files –so that the commentary can be ignored if not understood eg HyperText Markup Language –adds.
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition File System Implementation.
Performance of Compressed Inverted Indexes. Reasons for Compression  Compression reduces the size of the index  Compression can increase the performance.
A PPARC funded project Tony Linde Programme Manager AstroGrid in the wider world AstroGrid Consortium meeting Cambridge, UK 10-July-2005.
Efficient Processing of Updates in Dynamic XML Data Changqing Li, Tok Wang Ling, Min Hu.
Create Element, Remove Child. The Document Tree Document Element Root Element Element Element Element Element Text: HelloWorld Attribute “href”
© 2006 Pearson Addison-Wesley. All rights reserved15 A-1 Chapter 15 External Methods.
XML Extensible Markup Language
Introduction to COMP9319: Web Data Compression and Search Search, index construction and compression Slides modified from Hinrich Schütze and Christina.
1 UNIT 13 The World Wide Web. Introduction 2 The World Wide Web: ▫ Commonly referred to as WWW or the Web. ▫ Is a service on the Internet. It consists.
Challenges in XML It’s good… but is it good enough? Siddhesh Bhobe Persistent eBusiness Solutions.
COMP9319: Web Data Compression and Search
In this session, you will learn to:
CHP - 9 File Structures.
CS 315 Data Structures B. Ravikumar Office: 116 I Darwin Hall Phone:
What is FITS? FITS = Flexible Image Transport System
COMP 430 Intro. to Database Systems
Huffman Coding CSE 373 Data Structures.
XML Problems and Solutions
Google Sky.
ASN.1 Compiler for text-based protocols!
Presentation transcript:

A In-Memory Compressed XML Representation of Astronomical Data PPARC UK e-Science Postgraduate School ’05 O’Neil Delpratt – PhD Student University of Leicester O’Neil Delpratt – PhD Student University of Leicester Supervisors: Rajeev Raman, Clive Page, Tony Linde & Mike Watson – University of Leicester Supervisors: Rajeev Raman, Clive Page, Tony Linde & Mike Watson – University of Leicester

Topics: –Outline of Project –Role in Project –Current Work –Conclusion Topics: –Outline of Project –Role in Project –Current Work –ConclusionIntroductionIntroduction

Outline of Project Building a data-grid for UK astronomy. Consist of a wide variety of web/grid services, serving up data to be combined and processed, and ultimately displayed to the user. Astronomers Previously used FITS binary.Astronomers Previously used FITS binary. International Virtual Observatory Alliance ( IVOA) Introduced VOTable xml format.International Virtual Observatory Alliance ( IVOA) Introduced VOTable xml format. Building a data-grid for UK astronomy. Consist of a wide variety of web/grid services, serving up data to be combined and processed, and ultimately displayed to the user. Astronomers Previously used FITS binary.Astronomers Previously used FITS binary. International Virtual Observatory Alliance ( IVOA) Introduced VOTable xml format.International Virtual Observatory Alliance ( IVOA) Introduced VOTable xml format. AstroGrid Project

Main Problem -XML-based files are larger than equivalent formats, -Network bandwidth is limited. -Adhoc approach -Slow searching -XML-based files are larger than equivalent formats, -Network bandwidth is limited. -Adhoc approach -Slow searching Concern: Using the VOTable XML-based astronomical data format we know that it: Is Basic (XML representation of a 2d binary table. ). Has Static Data (features for big-data and grid computing) Allows data to be stored separately, with the remote data link Using the VOTable XML-based astronomical data format we know that it: Is Basic (XML representation of a 2d binary table. ). Has Static Data (features for big-data and grid computing) Allows data to be stored separately, with the remote data link VOTable is designed as a flexible storage and exchange format for tabular data, with emphasis on astronomical tables.

Role in Project Compressed file Representation supporting sequential searching Compression tool In-memory XML representation supporting queries Provide software tool for the Astrogrid Group. Remove the adhoc approach of accessing astronomical data in XML documents. Remove the limitation of parsing large XML documents, allowing us to make ‘proper’ flexible XML files rather than representation of 2D tables.

An existing XML-specific compression tools are: XMILL [H. Liefke and D. Suciu, 2000] Files are inaccessible until uncompressed. Therefore we need to develop a software tool that provides both: Compression Powerful operations on the compressed XML documents.

Requirements Produce a software tool for efficient storage and processing of large XML files that arise in the context of IVOA. Remove adhoc way of accessing XML data. Provide efficient searching facilities of VOTable files. Rapid querying operations. Reduce bandwidth. Allow for merging of XML files.

Topics: Succinct DOM Using PrefixSum Data Structures in DOM Experimental Results Topics: Succinct DOM Using PrefixSum Data Structures in DOM Experimental Results Current Work

Succinct DOM DOM is a specification for representing XML documents in main memory. XML file is parsed and represented using our succinct DOM implementation What is Succinctness? A succinct data structure is one that uses space approaching the information theoretic minimum for the required structure yet ideally supports the desired operations in constant time, unconstrained by the size of the structure.

Parenthesis Data Structure: Components of Succinct DOM <Directory> Basic Basic Basic Basic /05/05 10/05/05 Files details Files details XML files XML files </Directory> Tree Representation Directory NameFileNote FNameCreated Note TimeDate Hour Minutes Currently using pointers Parenthesis we have 2 bits per node. ( ( ) ( ( ) ( ( ( ) ( ) ) ( ) ) ( ) ) ( ) ) As a BitVector: [Geary, Rahman, R. Raman, V Raman. 2004]

Components of Succinct DOM PrefixSum Data Structure Motivation: We can store the text node data from XML files as a concatenated string. Store the offsets for each text node data in an integer array. Problem Here: The text nodes often take most of the XML tree, sometimes 50%. Therefore, the offsets as 32-bit integers in memory would not be space efficient. Text node1 …….....““ offset1offset2offset3offset4 offset5offset6

Components of Succinct DOM PrefixSum Data Structure The requirements for the PrefixSum Data structure to represent with an Abstract Data Type:

PrefixSum Data Structure Continued... We consider three integer-coding methods that are used in inverted file compression: Gamma Code – requiresbits. Delta Code –requires bits. Golomb – requires bits. There are existing integer-coding methods that we can use to encode the number 1,…,x compactly.

Components of Succinct DOM Some Of the DOM methods we will implement: childNodes parentNode Attributes nextSibling previousSibling nodeType nodeName nodeValue

Experimental Results

Challenges that remain Improve the results of the succinct DOM and that of the Prefix-sum data structures. Challenges to bring space usage closer to bzip2. Develop a compressed file and in-memory representation. Relate the XML representation data structures to VOTable files for further compression improvements.