Unit 06 : Index and Distributed Caching

Unit 06 : Index and Distributed Caching COMP 5323 Web Database Technologies and Applications 2014

Doctrine of Fair Use
This PowerPoint is prepared for educational purposes and is used strictly for classroom lecturing. We have adopted the "Fair Use" doctrine, which allows limited copying of copyrighted works for educational and research purposes.

Learning Objectives
- Understand different path index techniques for improving query performance.
- Learn about a distributed memory caching system that improves the performance of web database applications.

Outline
- Index for Semi-structured Data
- Distributed Caching

1 Index for Semi-structured Data

Why is Indexing Needed?
- An index allows fast access to data by replicating portions of the data in special-purpose structures.
- Despite the additional cost (storage, maintenance and complexity), indexes have proven useful in evaluating queries.

Index Types
- Structural index
  - Accessing all elements of a given name
  - Ancestor-descendant and parent-child relationships between elements
- Content index
  - Accessing elements containing given keywords
  - Supporting most text search functionalities

Classical Content Index
- Classically based on inverted lists: for each term, the index gives the doc.ID plus the localization.
- Several variations allow different search types: offset, relative, proximity.
- Generally stored in a B+-tree to optimize search for a given word.
- Size is an important issue, both in memory and on disk.

Word   Localization
t1     doc1-100, doc1-300, doc3-200, …
t2     doc2-30, doc4-70, …
t3     doc4-87, doc5-754, …

Entry formats:
- (word, localization): fixed entry (word repeated)
- (word, frequency, (localization)*): variable-length entry

Short reference: http://www.igi-global.com/dictionary/inverted-index/15654
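The variable-length entry format (word, frequency, (localization)*) can be sketched in C as follows. This is only an illustration under stated assumptions: the type and function names (`Posting`, `Entry`, `InvertedIndex`, `index_add`, `index_find`) and the fixed capacities are hypothetical, and lookup is a linear scan rather than the B+-tree the slide mentions.

```c
#include <string.h>

#define MAX_TERMS 64      /* assumed capacity, for illustration only */
#define MAX_POSTINGS 32   /* assumed capacity, for illustration only */

/* One localization: the document and the offset where the term occurs. */
typedef struct {
    int doc_id;
    int offset;
} Posting;

/* Variable-length entry: (word, frequency, (localization)*). */
typedef struct {
    char word[32];
    int frequency;
    Posting postings[MAX_POSTINGS];
} Entry;

typedef struct {
    Entry entries[MAX_TERMS];
    int count;
} InvertedIndex;

/* Return the entry for word, or NULL if the word is not indexed. */
Entry *index_find(InvertedIndex *idx, const char *word) {
    for (int i = 0; i < idx->count; i++) {
        if (strcmp(idx->entries[i].word, word) == 0) {
            return &idx->entries[i];
        }
    }
    return NULL;
}

/* Record one occurrence of word. Returns 0 on success, -1 if a table is full. */
int index_add(InvertedIndex *idx, const char *word, int doc_id, int offset) {
    Entry *e = index_find(idx, word);
    if (e == NULL) {
        if (idx->count == MAX_TERMS) {
            return -1;
        }
        e = &idx->entries[idx->count++];
        strncpy(e->word, word, sizeof e->word - 1);
        e->word[sizeof e->word - 1] = '\0';
        e->frequency = 0;
    }
    if (e->frequency == MAX_POSTINGS) {
        return -1;
    }
    e->postings[e->frequency].doc_id = doc_id;
    e->postings[e->frequency].offset = offset;
    e->frequency++;
    return 0;
}
```

Adding (doc1, 100), (doc1, 300) and (doc3, 200) for t1 produces a single entry with frequency 3, which mirrors the first row of the table above.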

Problem with XML
- Support of element addressing
  - Doc.ID should include NodeId (XPath) + offset
  - Index size becomes very large
  - XPaths are long
- Support of typed data
  - Integer, float, simple types of XML Schema
  - Requires classical indexes for certain elements
- Query processing
  - Structural joins
  - Text search
  - Exact search
- Support of updates
  - Incremental updates would be a plus

Path-based approach
- Represent the XML document as a tree or graph structure.
- Index the XML document directly, without the support of a DTD.
- Mainly use memory as the index storage.
- Properties
  - Keeps the structural information to improve query performance
  - Easily supports queries with regular path expressions

Different Approaches
- Patricia Trie (Cooper et al., 2001)
- DataGuides (J. McHugh et al., 1997)
- T-Index (Tova Milo and Dan Suciu)
- APEX, Adaptive Path Index for XML Data (C. W. Chung et al., 2002)
- Dewey Structure
- K-ary Table
- Path Table
- OrdPath

Patricia Tries
- A compact representation of a trie in which any node that is an only child is merged with its parent.
- Also known as a radix tree.

Patricia Tries
Cooper et al., 2001
- Idea: partitioned Patricia tries to index strings.
- Encode XPath expressions as strings (encode names, encode atomic values).

<book>
  <author>Whoever</author>
  <author>Not me</author>
  <title>No Kidding</title>
</book>

Encoded strings:
B A 1 Whoever
B A 2 Not me
B T No Kidding

DataGuides
- The World-Wide Web demonstrates that much of the information available online is semistructured.
- DataGuides are built on a graph-based data model called OEM (Object Exchange Model).

A sample OEM database

Lore Language (Lorel Query)
Select Restaurant.Entree
- Returns all entrees served by any restaurant: the set of objects {6, 10, 11}.
Select Restaurant.Name where Restaurant.Entree = "Burger"
- The answer to the query is the single object 5.

DataGuide

T-index

1-Index 1-Index DataGuides

2-Index

APEX

Representation of XML Data Structure

DataGuide

APEX HAPEX GAPEX

Dewey - Structure
- Each node is assigned a label that represents the path from the document's root to the node.
- Each component of the label represents the local order of an ancestor node.
- Nodes with the same number of delimiters (".") in their label are at the same level.
[Figure: example tree with nodes Bib, book, paper, author, Tim, Sarah and labels (0), (0.0), (0.0.0), (0.0.0.0), (0.1), (0.2), (0.2.0), (0.2.0.0)]

Dewey – Supported Queries (1/3)
Ancestors / Descendants
- Node "X" is an ancestor of node "Y" if the label of node "X" is a proper prefix of the label of node "Y", ending at a delimiter. For example, (0.2) is an ancestor of (0.2.0.0).
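The ancestor test can be sketched in C as a prefix check on the dotted label strings. The function name `dewey_is_ancestor` is hypothetical, and labels are assumed to be written without parentheses (for example "0.2.0"). The extra check for a delimiter right after the prefix prevents "0.1" from being treated as an ancestor of "0.10".

```c
#include <string.h>

/* Returns 1 if Dewey label x is a proper ancestor of label y, otherwise 0.
   Labels are dot-separated strings such as "0.2.0". */
int dewey_is_ancestor(const char *x, const char *y) {
    size_t n = strlen(x);
    /* x must be a prefix of y, and the prefix must end at a "." in y. */
    return strncmp(x, y, n) == 0 && y[n] == '.';
}
```

With the labels from the Dewey example tree, `dewey_is_ancestor("0.2", "0.2.0.0")` holds, while `dewey_is_ancestor("0.0", "0.2.0")` does not.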

Dewey – Supported Queries (2/3)
Parent / Child
Node "X" is the parent of node "Y" if:
- The label of node "X" is a prefix of the label of node "Y"
- And frags(X) = frags(Y) – 1, where frags(X) is the number of delimiters in the label of node X and frags(Y) is the number of delimiters in the label of node Y.

Dewey – Supported Queries (3/3)
Siblings
Nodes "X" and "Y" are siblings if:
- They have the same number of delimiters in their labels
- And X.prefix = Y.prefix, where prefix is the label of the node without its positional identifier
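The parent/child and sibling rules can be sketched in C in the same way. All function names here are hypothetical (`frags` follows the notation of the parent/child rule), labels are assumed to be dotted strings without parentheses, and a node is not counted as its own sibling.

```c
#include <string.h>

/* frags(X): number of "." delimiters in a Dewey label. */
int frags(const char *label) {
    int n = 0;
    for (; *label != '\0'; label++) {
        if (*label == '.') {
            n++;
        }
    }
    return n;
}

/* Length of the label without its last component (the positional identifier).
   Returns 0 for a root label that has no delimiter. */
size_t dewey_prefix_len(const char *label) {
    const char *dot = strrchr(label, '.');
    return dot != NULL ? (size_t)(dot - label) : 0;
}

/* X is the parent of Y if X is a prefix of Y and frags(X) = frags(Y) - 1. */
int dewey_is_parent(const char *x, const char *y) {
    size_t n = strlen(x);
    int is_prefix = strncmp(x, y, n) == 0 && y[n] == '.';
    return is_prefix && frags(x) == frags(y) - 1;
}

/* X and Y are siblings if they are distinct non-root nodes with the same
   number of delimiters and the same prefix. */
int dewey_are_siblings(const char *x, const char *y) {
    size_t px = dewey_prefix_len(x);
    size_t py = dewey_prefix_len(y);
    if (strcmp(x, y) == 0 || frags(x) == 0 || frags(y) == 0) {
        return 0;
    }
    return frags(x) == frags(y) && px == py && strncmp(x, y, px) == 0;
}
```

For the example labels, "0.2" is the parent of "0.2.0", and "0.0" and "0.2" are siblings, whereas "0.0.0" and "0.2.0" are at the same level but have different prefixes.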

Dewey – Updates
Insertion of a new node
- The labels of the nodes in the subtrees rooted at the following siblings need to be updated.
- O(n) nodes need relabeling, where n is the number of nodes of the XML file.
[Figure: the example tree (Bib, book, paper, author, Tim, Sarah) after an insertion, with labels ranging from (0) to (0.3.0.0)]

Dewey
- Not efficient for dynamic XML files with many updates
  - Many nodes need to be re-labeled
- As the depth of the tree increases:
  - The label size of a node increases rapidly
  - Storage size increases rapidly
  - It becomes more costly to infer the supported queries between any two nodes (the string prefix matching becomes longer)
- Overflow problem
  - The original fixed number of bits assigned to store the size of the label is not enough.

Document Tree
- Lee et al., ACM DL 1996.
- Represent each document as a k-ary complete tree and assign a UID to each node.
[Figure: a 3-ary tree with nodes a, c, d, e, f, g, h, i, j, distinguishing real nodes from virtual nodes]

K-ary table
- Each document is assigned k, which is the maximum number of siblings in the document tree.
- Each element has an entry (row) in the K-ary table.
- When a query is issued, the result set has pointers to the K-ary table.

Level and Element Type Number
- Level
  - The level of the element in the document tree
  - It indicates how many times the parent function must be applied to reach a target element
- Element type number
  - A unique number is assigned to each element type in the DTD
  - It makes it possible to filter out unnecessary elements and accumulate the correct frequencies
- Element location
  - The unique position of an element instance in a document tree

Element Labeling Document (1) Chapter (4) Title (2) Abstract (3) Section (6) Para (5) Para (7)

Result of assigning UIDs
- Unique element identifier (UID)
- parent(i) = ⌊(i − 2) / k⌋ + 1
[Figure: a 3-ary tree with nodes d, c, s, p, e showing the result of assigning UIDs]
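The parent formula can be checked with a small C sketch. It assumes the root has UID 1 and that UIDs are assigned level by level in a complete k-ary tree. The function names are hypothetical, and the child formula `k * (i - 1) + j + 1` is not on the slide: it is derived here as the inverse of the parent formula.

```c
/* Parent UID of node i in a complete k-ary tree whose root has UID 1.
   Returns 0 for the root, which has no parent. */
long uid_parent(long i, long k) {
    if (i <= 1) {
        return 0;
    }
    return (i - 2) / k + 1;  /* integer division floors for non-negative values */
}

/* UID of the j-th child (j = 1..k) of node i; derived, not from the slide. */
long uid_child(long i, long j, long k) {
    return k * (i - 1) + j + 1;
}

/* Level of node i: how many times the parent function is applied to reach the root. */
int uid_level(long i, long k) {
    int level = 0;
    while (i > 1) {
        i = uid_parent(i, k);
        level++;
    }
    return level;
}
```

In a 3-ary tree the root's children are 2, 3 and 4, node 2 has children 5, 6 and 7, and `uid_parent(7, 3)` gives 2 again.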

XML Index Path Table (Oracle)
Note: some typos.

<po>
  <data>
    <item>foo</item>
    <pkg>123</pkg>
    <item>bar</item>
  </data>
</po>

Columns: BaseRid, Path, OrderKey, Value, Locator, NumValue
Rows (as extracted from the slide): Rid1 po po.data 1 11 po.data.item 1.1 "foo" 17 po.data.pkg 1.2 "123" 32 123 1.3 "bar" 46

OrdPath
- ORDPATHs: Insert-Friendly XML Node Labels
- Patrick O'Neil, Elizabeth O'Neil, Shankar Pal, Istvan Cseri, Gideon Schaller, Nigel Westbury
- SIGMOD 2004
- SQL Server 2005 implementation

OrdPath
- Aims to provide efficient insertion at any position of an XML tree, and also supports extremely high performance query plans for native XML queries.
- Tree modifications
  - New nodes may be inserted
  - Sub-trees may be deleted
  - Sub-trees may be moved around within the tree

OrdPath
- Encodes the parent-child relationship by extending the parent's ORDPATH label with a component for the child. E.g., 1.5.3.9 might be the parent ORDPATH and 1.5.3.9.1 the child.
- The various child components reflect the children's relative sibling order, so that byte-by-byte comparison of the ORDPATH labels of two nodes yields the proper document order.
- A new node (possibly a root node of a sub-tree) can be inserted under any designated parent node in an existing tree. Its label is generated using an additional intermediate "careting" component that falls between the components of its left and right siblings.

OrdPath
At the beginning
- Only positive, odd integers are assigned during an initial load; even-numbered and negative integer component values are reserved for later insertions into an existing tree.
Inserting in the middle
- Even numbers are used as carets only. They do not count as components that increase the depth of the nodes.
- E.g., new nodes in between 3.5.5 and 3.5.7:
  - New siblings: 3.5.6.1, 3.5.6.3, …
  - A subtree: 3.5.6.1, 3.5.6.1.1, 3.5.6.3, 3.5.6.3.1, 3.5.6.3.3, 3.5.6.3.3.1, 3.5.6.3.3.3, 3.5.6.3.5, 3.5.6.5, 3.5.6.5.1
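The caret rule can be illustrated with a short C sketch. It works on the dotted decimal form of the labels and compares component values numerically, which is an assumption made for readability: the actual ORDPATH labels are stored in a compressed binary form and compared byte by byte. The function names are hypothetical.

```c
#include <stdlib.h>

/* Depth of a node: only odd components count, even "caret" components do not. */
int ordpath_depth(const char *label) {
    int depth = 0;
    char *end;
    while (*label != '\0') {
        long component = strtol(label, &end, 10);
        if (end == label) {
            break;  /* not a number: stop parsing */
        }
        if (component % 2 != 0) {
            depth++;
        }
        label = (*end == '.') ? end + 1 : end;
    }
    return depth;
}

/* Compare two labels in document order.
   Returns a negative, zero or positive value, like strcmp. */
int ordpath_compare(const char *a, const char *b) {
    char *end_a;
    char *end_b;
    while (*a != '\0' && *b != '\0') {
        long ca = strtol(a, &end_a, 10);
        long cb = strtol(b, &end_b, 10);
        if (end_a == a || end_b == b) {
            break;  /* malformed label: stop comparing */
        }
        if (ca != cb) {
            return ca < cb ? -1 : 1;
        }
        a = (*end_a == '.') ? end_a + 1 : end_a;
        b = (*end_b == '.') ? end_b + 1 : end_b;
    }
    /* One label is a prefix of the other: the ancestor comes first. */
    if (*a != '\0') {
        return 1;
    }
    if (*b != '\0') {
        return -1;
    }
    return 0;
}
```

Under these assumptions 3.5.6.1 sorts between 3.5.5 and 3.5.7 and has the same depth (3) as both of them, which is why it can act as their new sibling.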

XML

ORDPATH Label of Nodes

Node         ORDPATH
BOOK         1
@ISBN        1.1
Section      1.3
Title        1.3.1
Nobody…      1.3.3
Figure       1.3.5
CAPTION      1.3.5.1
Section      1.5
Title        1.5.1
All right…   1.5.3
Figure       1.5.5
tree frogs   1.5.7

Infoset Table

2 Memcached

What is memcached briefly?
- memcached is a high-performance, distributed memory object caching system, generic in nature.
- It is a key-based cache daemon that stores data and objects wherever dedicated or spare RAM is available, for very quick access.
- It is a dumb distributed hash table. It does not provide redundancy, failover or authentication. If needed, the client has to handle that.

Why was memcached made?
- It was originally developed by Danga Interactive to enhance the speed of LiveJournal.com.
- It dropped the database load to almost nothing, yielding faster page load times for users, better resource utilization, and faster access to the databases on a memcache miss.
http://www.danga.com/memcached/

Memcached

Where does memcached reside?
- Memcached is not part of the database but sits outside it on the server(s).
- It can run over a pool of servers.

Architecture

When should I use memcached?
- When your database is optimized to the hilt and you still need more out of it.
  - Lots of SELECTs are using resources that could be better used elsewhere in the DB.
  - Locking issues keep coming up.
- When table listings in the query cache are torn down so often that it becomes useless.
- To get maximum "scale out" of minimum hardware.

Hit-rate Management
- Anything that is more expensive to fetch from elsewhere, and has a sufficient hit rate, can be placed in memcached.
- How often will the object or data be used?
- How expensive is it to generate the data?
- What is the expected hit rate?
- Will the application invalidate the data itself, or will a TTL (time to live) be used?
- How much development work has to be done to embed it?

Why use memcached?
- To reduce the load on the database by caching data BEFORE it hits the database.
- It can be used for more than just holding database results (objects) and can improve the entire application response time.
- Feel the need for speed: memcached is in RAM, which is much faster than hitting the disk or the database.

Why not use memcached?
- Memcached is held in RAM. This is a finite resource.
- Adding complexity to a system just for complexity's sake is a waste. If the system can respond within the requirements without it, leave it alone.

What are the limits of memcached?
- Keys can be no more than 250 characters.
- Stored data cannot exceed 1 MB (largest typical slab size).
- There are generally no limits to the number of nodes running memcached.
- There are generally no limits to the amount of RAM used by memcached over all nodes.
  - 32-bit machines do have a limit of 4 GB though.
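A client can check these limits before sending a request. The C sketch below is a minimal illustration: the function and macro names are hypothetical, and the whitespace/control-character check reflects the text protocol's rule for keys rather than anything stated on the slide. The 1 MB value check is an upper bound only, since item overhead may reduce the usable size.

```c
#include <ctype.h>
#include <string.h>

#define MEMCACHED_MAX_KEY_LEN 250              /* limit from the slide */
#define MEMCACHED_MAX_VALUE_LEN (1024 * 1024)  /* 1 MB, largest typical slab size */

/* Returns 1 if key is non-empty, at most 250 characters long and contains
   no whitespace or control characters, otherwise 0. */
int memcached_key_ok(const char *key) {
    size_t n = strlen(key);
    if (n == 0 || n > MEMCACHED_MAX_KEY_LEN) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        unsigned char c = (unsigned char)key[i];
        if (isspace(c) || iscntrl(c)) {
            return 0;
        }
    }
    return 1;
}

/* Returns 1 if a value of len bytes is within the 1 MB upper bound. */
int memcached_value_ok(size_t len) {
    return len <= MEMCACHED_MAX_VALUE_LEN;
}
```

For example, "user:42" passes the key check while "bad key" (contains a space) does not.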

Platform
You can build and install memcached from the source code directly, or you can use an existing operating system package or installation.
- On a RedHat, Fedora or CentOS host, use yum:
  root-shell> yum install memcached
- On a Debian or Ubuntu host, use apt-get:
  root-shell> apt-get install memcached
- On a Gentoo host, use emerge:
  root-shell> emerge install memcached
- On OpenSolaris, use the pkg for SUNWmemcached:
  root-shell> pkg install SUNWmemcached

Port 11211
- Get the source from the website: http://www.danga.com/memcached/download.bml
- Memcached has a dependency on libevent, so make sure you have that also.
- Decompress, cd into the directory, then run: ./configure; make; make install
- Memcached listens on port 11211 by default; this can be changed with the -p option.
http://www.ajohnstone.com/archives/installing-memcached/

How do I start memcached?
- Memcached can be run as a non-root user if it will not be on a restricted port (<1024), though the user cannot have a memory limit restriction.
  shell> memcached
- Default configuration: memory 64 MB, all network interfaces, port 11211, max simultaneous connections 1024.

Memcached options
You can change the default configuration with various options.
- -u <user> : run as <user> if started as root
- -m <num> : maximum <num> MB of memory to use for items
  - If this is more than the available RAM, memcached will use swap
  - Don't forget the 4 GB limit on 32-bit machines
- -d : run as a daemon
- -l <ip_addr> : listen on <ip_addr>; defaults to INADDR_ANY
- -p <num> : port

How can I connect to memcached?
- Memcached uses a protocol that many languages implement with an API.
- Languages that implement it: Perl, PHP, Python, Ruby, Java, C#, C, Lua, Postgres, MySQL, Chicken Scheme
- And yes, because it is a protocol, you can even use telnet:
  shell> telnet localhost 11211
- Protocol at http://code.sixapart.com/svn/memcached/trunk/server/doc/protocol.txt

Memcached protocol
Three types of commands:
- Storage: ask the server to store some data identified by a key
  - set, add, replace, append, prepend and cas
- Retrieval: ask the server to retrieve data corresponding to a set of keys
  - get, gets

Memcached protocol (cont'd)
- All others: commands that don't involve unstructured data
  - Deletion: delete
  - Statistics: stats
  - flush_all: always succeeds; invalidates all existing items immediately (by default) or after the expiration specified
  - version, verbosity, quit
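To make the storage and retrieval commands concrete, the C sketch below formats requests in the memcached text protocol, where a storage command has the form `set <key> <flags> <exptime> <bytes>` followed by the data block, and each line ends with "\r\n". The helper names `format_set` and `format_get` are hypothetical, the sketch only builds the request strings (it does not open a socket), and the value is assumed to be a NUL-terminated text string.

```c
#include <stdio.h>
#include <string.h>

/* Write a text-protocol "set" request into buf.
   Returns the request length, or -1 if it does not fit in cap bytes. */
int format_set(char *buf, size_t cap, const char *key, unsigned flags,
               long exptime, const char *value) {
    int n = snprintf(buf, cap, "set %s %u %ld %zu\r\n%s\r\n",
                     key, flags, exptime, strlen(value), value);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}

/* Write a text-protocol "get" request for a single key into buf.
   Returns the request length, or -1 if it does not fit in cap bytes. */
int format_get(char *buf, size_t cap, const char *key) {
    int n = snprintf(buf, cap, "get %s\r\n", key);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

Calling `format_set(buf, sizeof buf, "mykey", 0, 60, "foobar")` yields the request line `set mykey 0 60 6`: flags 0, a 60-second expiration and a 6-byte data block. On success the server answers `STORED`.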

PHP and Memcached
- Make sure you have a working Apache/PHP install.
- PHP has a memcache extension available through PECL.
- Installation:
  shell> pecl install memcache
- Make sure that PEAR is installed (Debian: apt-get install php-pear).
- Make sure that you also have php5-dev installed for phpize:
  shell> apt-get install php5-dev

PHP Script Example
Information about the PHP API is at http://www.php.net/memcache

<?php
// make a memcache object
$memcache = new Memcache;

// connect to memcache
$memcache->connect('localhost', 11211) or die("Could not connect");

// get the memcache version
$version = $memcache->getVersion();
echo "Server's version: " . $version . "<br/>\n";

PHP Script (cont'd)

// test data
$tmp_object = new stdClass;
$tmp_object->str_attr = 'test';
$tmp_object->int_attr = 123;

// set the test data in memcache
// (third argument is the flag: false here, or MEMCACHE_COMPRESSED to compress the value;
//  fourth argument is the expiration time in seconds)
$memcache->set('key', $tmp_object, false, 10) or die("Failed to save data at the server");
echo "Store data in the cache (data will expire in 10 seconds)<br/>\n";

// get the data
$get_result = $memcache->get('key');
echo "Data from the cache:<br/>\n";
echo '<pre>', var_dump($get_result), '</pre>';

PHP Script (cont'd)

// modify the data
$tmp_object->str_attr = 'boo';
$memcache->replace('key', $tmp_object, false, 10) or die("Failed to save new data to the server<br/>\n");
echo "Stored data in the cache changed<br/>\n";

// get the new data
$get_result = $memcache->get('key');
echo "New data from the cache:<br/>\n";
echo '<pre>', var_dump($get_result), "</pre>\n";

// delete the data
$memcache->delete('key') or die("Data not deleted<br/>\n");

MySQL's memcached
- The API is consistent with the other APIs.
- Connect:
  mysql> SELECT memc_servers_set('192.168.0.1:11211,192.168.0.2:11211');
  - The list of servers used by the memcached UDFs is not persistent over restarts of the MySQL server.
- Set:
  mysql> SELECT memc_set('myid', 'myvalue');
- Retrieve:
  mysql> SELECT memc_get('myid');
- A full listing of functions is at http://dev.mysql.com/doc/refman/5.0/en/ha-memcached-interfaces-mysqludf.html

Possible ways to secure memcached
- It has no authentication system, so protection is important.
- Run as a non-privileged user to minimize potential damage.
- Specify the IP address to listen on using -l
  - 127.0.0.1, 192.168.0.1, or another specific IP address
- Use a non-standard port.
- Use a firewall.

Memcached

Memcached Memory Management
- Slab allocation: when you start to store data into the cache, memcached does not allocate the memory for the data on an item-by-item basis. Instead, slab allocation is used to optimize memory usage and prevent memory fragmentation when information expires from the cache.
- Lazy expiration + LRU
  - Lazy expiration: when an item is requested (a get request), memcached checks the expiration time to see if the item is still valid before returning it to the client.
  - LRU (least recently used): memcached is LRU per slab class, but not globally LRU.
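Lazy expiration can be sketched in C as a check performed at read time. This is an illustration of the idea only, not memcached's actual implementation: the `Item` struct, its fields and `lazy_get` are hypothetical, and an expiration time of 0 is taken here to mean "never expires".

```c
#include <stddef.h>
#include <time.h>

/* A simplified cache item. */
typedef struct {
    const char *value;
    time_t expires_at;  /* absolute expiry time; 0 means the item never expires */
    int in_use;         /* 1 while the slot holds a live item */
} Item;

/* Return the item's value if it is still valid at time now, otherwise NULL.
   No background task removes expired items: an expired item is only
   detected, and its slot released, when a get touches it. */
const char *lazy_get(Item *item, time_t now) {
    if (!item->in_use) {
        return NULL;
    }
    if (item->expires_at != 0 && now >= item->expires_at) {
        item->in_use = 0;  /* reclaim the slot on access */
        return NULL;
    }
    return item->value;
}
```

An item with `expires_at = 100` is still returned at time 50, but a get at time 100 or later returns NULL and frees the slot.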

Slab Allocation

Memcached Distributed Architecture

// Obtain a server ID based on the key value
int getServerId(const char *key, int serverTotal) {
    int c, hash = 0;
    while ((c = *key++) != 0) {
        hash += c;
    }
    return hash % serverTotal;
}

// a list of servers
node[0] => 192.168.0.1:11211
node[1] => 192.168.0.2:11211
node[2] => 192.168.0.3:11211

// get id
int id = getServerId("test", 3);

// get ip address and port number
node[id] == node[1]
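One consequence of modulo-based server selection is that a key may map to a different server when the size of the pool changes, so cached entries for that key are no longer found. The C sketch below illustrates this with the same additive hash. The names `server_id` and `moved_keys` are hypothetical, and the helper only counts affected keys; it does not talk to any server.

```c
#include <stddef.h>

/* Additive hash as in the slide's example: sum of the key's bytes,
   modulo the number of servers. */
int server_id(const char *key, int server_total) {
    unsigned hash = 0;
    while (*key != '\0') {
        hash += (unsigned char)*key++;
    }
    return (int)(hash % (unsigned)server_total);
}

/* Count how many of the n keys map to a different server
   when the pool size changes from old_total to new_total. */
size_t moved_keys(const char *const keys[], size_t n, int old_total, int new_total) {
    size_t moved = 0;
    for (size_t i = 0; i < n; i++) {
        if (server_id(keys[i], old_total) != server_id(keys[i], new_total)) {
            moved++;
        }
    }
    return moved;
}
```

The bytes of "test" sum to 448, so the key maps to node[1] with three servers, as in the example above, but to node[0] once a fourth server is added.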

Memcached Distributed Architecture

SQL Server Cache
The SQL Server cache mechanism caches:
- Query plans
- Pages from the database files
but it does NOT cache:
- Exact results from a query
Reference: http://searchsqlserver.techtarget.com/tip/SQL-Server-memory-configurations-for-procedure-cache-and-buffer-cache

Reference
- DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases
- Index Structures for Path Expressions
- APEX: An Adaptive Path Index for XML Data

Reference Only
https://blog.couchbase.com/memcached-144-windows-32-bit-binary-now-available
Memcached should run on Linux. It may not work on some versions of Windows.

Reference

Reference
- 0: <flag>
- 60: timeout
- 6: length