Unit 06 : Index and Distributed Caching

Name: Unit 06 : Index and Distributed Caching
Uploaded: 2017-07-17T10:58:05+00:00
Duration: PTM31S16
Channel: Kristopher Baker
Description: Unit 06 : Index and Distributed Caching

Unit 06 : Index and Distributed Caching
COMP 5323 Web Database Technologies and Applications 2014

Doctrine of Fair Use This PowerPoint is prepared for educational purpose and is strictly used in the classroom lecturing. We have adopted the "Fair Use" doctrine in this PowerPoint which allows limited copying of copyrighted works for educational and research purposes.

Learning Objectives Understand different path index techniques to improve the performance. Learn a distributed memory caching system which improves the performance of web database applications

Outline Index for Semi-structured data Distributed Caching

1 Index for Semi-structured Data

Why is Indexing Needed? Allows fast access to data by replicating portions of the data in special purpose structures. Despite the additional cost (storage, maintenance and complexity) they have shown to be useful in evaluating queries.

Index Types Structural index Content index
Accessing all elements of given name Ancestor-descendant and parent-child relationship between elements Content index Accessing elements containing given keywords Supporting most text search functionalities

Classical Content Index
Classically based on inverted lists For each term, gives the doc.ID + localization Several variations allows different search types Offset, Relative, Proximity Generally stored in a B+-Tree to optimize search for a given word Size is an important issue Memory and Disk Words Localization - t1 : doc1-100, doc1-300, doc3-200, … - t2 : doc2-30, doc4-70, … - t3 : doc4-87, doc5-754, … (word, localization) Fixed entry (word repeated) (word, Frequency, (localization)*) Variable length entry Short Reference:

Problem with XML Query processing Support of updates
Support of element addressing Doc.ID should include NodeId (Xpath) + Offset Index size becomes very large XPath are long Support of typed data Integer, float, simple types of XML schema Requires classical indexes for certain elements Query processing Structural joins Text search Exact search Support of updates Incremental updates would be a plus

Path-based approach Represent XML document into tree or graph structure Index XML document directly Without the support of DTD Mainly use the memory as the index storage Properties Keep the structural information to improve query performance Easy to support query with regular path expression

Different Approaches Patricia Trie Cooper et al. 2001 DataGuides
J. McHugh et al., 1997 T-Index Tova Milo and Dan Sucin APEX (Adaptive Path Index for XML Data) C. W. Chung et al., 2002 Dewey Structure K-ary Table Path Table OrdPath

Partricia Tries A compact representation of a trie in which any node that is an only child is merged with its parent. Also known as radix tree

Partricia Tries Cooper et al. 2001 Idea:
Partitioned Partricia Tries to index strings Encode XPath expressions as strings (encode names, encode atomic values) <book> <author>Whoever</author> <author>Not me</author> <title>No Kidding</title> </book> B A 1 Whoever B A 2 Not me B T No Kidding

DataGuides World-Wide Web demonstrates that much of the information available online is semistructured. Graph-based data model called OEM, for Object Exchange Model

A sample OEM database

Lore Language (Lorel Query)
Select Restaurant.Entree returns all entrees served by any restaurant, the set of objects {6, 10, 11} Select Restaurant.Name where Restaurant.Entree = “Burger” The answer to the query is the single object 5.

DataGuide

T-index

1-Index 1-Index DataGuides

2-Index

Representation of XML Data Structure

DataGuide

APEX HAPEX GAPEX

Dewey - Structure Each node is assigned a label that represents the path from the document’s root to the node. Each component of the label represents the local order of an ancestor node. Nodes with the same number of delimiters (“.”) in their label are in the same level. Bib book paper author Tim Sarah (0) (0.0) (0.0.0) ( ) (0.1) (0.2) (0.2.0) ( )

Dewey – Supported Queries (1/3)
Ancestors / Descendants Node “X” is an ancestor of node “Y” if the label of node “X” is a substring of the label of node “Y”. Bib book paper author Tim Sarah (0) (0.0) (0.0.0) ( ) (0.1) (0.2) (0.2.0) ( )

Parent / Child Node “X” is parent of node “Y” if: - The label of node “X” is a substring of the label of node “Y” - And frags(X) = frags(Y) – 1, where frags(X) is the number of delimiters of the label of node X and frags(Y) is the number of delimiters of label of node Y. Bib book paper author Tim Sarah (0) (0.0) (0.0.0) ( ) (0.1) (0.2) (0.2.0) ( )

Siblings Nodes “X” and “Y” are siblings if: - They have the same number of delimiters in their labels - And X.prefix = Y.prefix, where prefix is the label of the node without its positional identifier Bib book paper author Tim Sarah (0) (0.0) (0.0.0) ( ) (0.1) (0.2) (0.2.0) ( )

Dewey – Updates Insertion of new node
The label of the nodes in the subtree rooted at the following sibling need to be updated O(n) nodes need relabeling, where n is the number of nodes of the XML file (0) Bib (0.3) (0.0) (0.2) book paper (0.3.0) (0.2.0) (0.0.0) paper paper (0.2) author (0.1) author ( ) ( ) ( ) Tim Sarah

Dewey Not efficient for dynamic XML files with many updates
Need to re-label many nodes As the depth of the tree increases: Label size of a node increases rapidly Storage size increases rapidly It becomes more costly to infer the supported queries between any two nodes (the string prefix matching becomes longer) Overflow problem The original fixed length of bits assigned to store the size of the label is not enough.

Document Tree Lee et al, ACM DL 1996.
Represent each document as a k-ary complete tree and assign a UID to each node a 3-ary tree c d e f g h i j real node virtual node e

K-ary table Each document is assigned k, which is the maximum number of siblings in the document tree. Each element has an entry (row) in the K-ary table When a query is issued, the result set has pointers to the K-ary table.

Level and Element Type Number
Level means the level in the document tree It gives a clue how many parent function is applied to get to a target element Element type number A unique number is assigned to each element type in DTD It enables to filter out unnecessary elements and accumulate the correct frequencies Element location The unique position of an element instance in a document tree

Element Labeling Document (1) Chapter (4) Title (2) Abstract (3)
Section (6) Para (5) Para (7)

Result of assigning UIDs
Unique element identifier parent(i) = [(i-2)/k+1] d 3-ary tree c s p e Result of assigning UIDs

XML Index Path Table (Oracle)
Some Typos XML Index Path Table (Oracle) <po> <data> <item>foo</item> <pkg>123</pkg> <item>bar</item> </data> </po> BaseRid Path OrderKey Value Locator NumValue Rid1 po po.data 1 11 po.data.item 1.1 “foo” 17 po.data.pkg 1.2 “123” 32 123 1.3 “bar” 46

OrdPath ORDPATHs: Insert-Friendly XML Node Labels
Patrick O’Neil, Elizabeth O’Neil1, Shankar Pal, Istvan Cseri, Gideon Schaller, Nigel Westbury SIGMOD 2004 SQL Server 2005 implementation

OrdPath Aims to provide efficient insertion at any position of an XML tree, and also supports extremely high performance query plans for native XML queries. Tree modifications new may be inserted sub-trees be deleted sub-trees may be moved around within the tree

OrdPath Encodes the parent-child relationship by extending the parent’s ORDPATH label with a component for the child. E.g.: might be the parent ORDPATH, the child. The various child components reflect the children’s relative sibling order, so that byte-by-byte comparison of the ORDPATH labels of two nodes yields the proper document order. A new node (possibly a root node of a sub-tree) can be inserted under any designated parent node in an existing tree. Its label is generated using an additional intermediate “careting” component that falls between the components of its left and right siblings.

OrdPath At the beginning Inserting in the middle
Only positive, odd integers are assigned during an initial load; even-numbered and negative integer component values are reserved for later insertions into an existing tree Inserting in the middle Even numbers are used as carets only. Do not count as components that increase the depth of the nodes. E.g. new nodes in between and 3.5.7 New siblings: , , … A subtree: , , , , , , , , ,

ORDPATH Label of Nodes BOOK 1 Section 1.5 @ISBN 1.1 Section 1.3
tree frogs 1.5.7 Title 1.5.1 Figure 1.5.5 Nobody… 1.3.3 Title 1.3.1 Figure 1.3.5 All right… 1.5.3 CAPTION

Infoset Table

2 Memcached

What is memcached briefly?
memcached is a high-performance, distributed memory object caching system, generic in nature It is a key-based cache daemon that stores data and objects wherever dedicated or spare RAM is available for very quick access It is a dumb distributed hash table. It does not provide redundancy, failover or authentication. If needed the client has to handle that.

Why was memcached made? It was originally developed by Danga Interactive to enhance the speed of LiveJournal.com It dropped the database load to almost nothing, yielding faster page load times for users, better resource utilization, and faster access to the databases on a memcache miss

Memcached

Where does memcached reside?
Memcache is not part of the database but sits outside it on the server(s). Over a pool of servers

Architecture

When should I use memcached?
When your database is optimized to the hilt and you still need more out of it. Lots of SELECTs are using resources that could be better used elsewhere in the DB. Locking issues keep coming up When table listings in the query cache are torn down so often it becomes useless To get maximum “scale out” of minimum hardware

Hit-rate Management anything what is more expensive to fetch from elsewhere, and has sufficient hitrate, can be placed in memcached How often will object or data be used? How expensive is it to generate the data? What is the expected hitrate? Will the application invalidate the data itself, or will TTL be used? How much development work has to be done to embed it? TTL - time to live

Why use memcached? To reduce the load on the database by caching data BEFORE it hits the database Can be used for more then just holding database results (objects) and improve the entire application response time Feel the need for speed Memcache is in RAM - much faster then hitting the disk or the database

Why not use memcached? Memcache is held in RAM. This is a finite resource. Adding complexity to a system just for complexities sake is a waste. If the system can respond within the requirements without it - leave it alone

What are the limits of memcached?
Keys can be no more then 250 characters Stored data can not exceed 1M (largest typical slab size) There are generally no limits to the number of nodes running memcache There are generally no limits the the amount of RAM used by memcache over all nodes 32 bit machines do have a limit of 4GB though

Platform You can build and install memcached from the source code directly, or you can use an existing operating system package or installation. on a RedHat, Fedora or CentOS host, use yum: root-shell> yum install memcached on a Debian or Ubuntu host, use apt-get: root-shell> apt-get install memcached on a Gentoo host, use emerge: root-shell> emerge install memcached on OpenSolaris, use the pkg for SUNWmemcached: root-shell> pkg install SUNWmemcached

Port 11211 Get the source from the website: Memcache has a dependancy on libevent so make sure you have that also. Decompress, cd into the dir ./configure;make;make install; Memcached listens on port by default, this can be changed with –p option.

How do I start memcached?
Memcached can be run as a non-root user if it will not be on a restricted port (<1024) - though the user can not have a memory limit restriction shell> memcached Default configuration - Memory: 64MB, all network interfaces, port:11211, max simultaneous connections: 1024

Memcached options You can change the default configuration with various options. -u <user> : run as user if started as root -m <num> : maximum <num> MB memory to use for items If more then available RAM - will use swap Don’t forget 4G limit on 32 bit machines -d : Run as a daemon -l <ip_addr> : Listen on <ip_addr>; default to INDRR_ANY -p <num> : port

How can I connect to memcached?
Memcached uses a protocol that many languages implement with an API. Languages that implement it: Perl, PHP, Python, Ruby, Java, C#, C, Lua, Postgres, MySQL, Chicken Scheme And yes - because it is a protocol you can even use telnet shell> telnet localhost 11211 Protocol at

Memcached protocol Three types of commands
Storage - ask the server to store some data identified by a key set, add, replace, append, prepend and cas Retrieval - ask the server to retrieve data corresponding to a set of keys get, gets

Memcached protocol (con’t)
All others that don’t involve unstructured data Deletion:delete Statistics: stats, flush_all: always succeeds, invalidate all existing items immediately (by default) or after the expiration specified. version, verbosity, quit

PHP and Memcached Make sure you have a working Apache/PHP install
PHP has a memcached extension available through pecl. Installation: shell> pecl install memcache Make sure the pear is installed (debian: apt-get install php-pear) Make sure that you also have php5-dev installed for phpize. shell> apt-get install php5-dev

PHP Script Example Information about the PHP API at <?php // make a memcache object $memcache = new Memcache; // connect to memcache $memcache->connect('localhost', 11211) or die ("Could not connect"); //get the memcache version $version = $memcache->getVersion(); echo "Server's version: ".$version." \n";

PHP Script (con’t) // test data $tmp_object = new stdClass;
$tmp_object->str_attr = 'test'; $tmp_object->int_attr = 123; // set the test data in memcache $memcache->set('key', $tmp_object, false, 10) or die ("Failed to save data at the server"); echo "Store data in the cache (data will expire in 10 seconds) \n"; // get the data $get_result = $memcache->get('key'); echo "Data from the cache: \n"; echo ‘<pre>’, var_dump($get_result), ‘</pre>’; MEMCACHE_COMPRESSED

PHP Script (con’t) // modify the data
$tmp_object->str_attr = ‘boo’; $memcache->replace(‘key’, $tmp_object, false, 10) or die(“Failed to save new data to the server \n”); Echo “Stored data in the cache changed \n”; // get the new data $get_result = $memcache->get(‘key’); Echo “New data from the cache: \n”; Echo ‘<pre>’, var_dump($get_result), “</pre>\n”; // delete the data $memcache->delete(‘key’) or die(“Data not deleted \n”);

MySQL’s memcached The API is consistent with the other API’s
Connect: mysql> SELECT memc_servers_set(' :11211, :11211'); The list of servers used by the memcached UDFs is not persistent over restarts of the MySQL server. Set: mysql> SELECT memc_set('myid', 'myvalue'); Retreive: mysql> SELECT memc_get('myid'); Full listing of functions are at

Possible ways to secure memcached
It has no authentication system - so protection is important Run as a non-priveledged user to minimize potential damage Specify the ip address to listen on using -l , , specific ip address Use a non-standard port Use a firewall

Memcached

Memcached Memory Management
Slab Allocation: When you start to store data into the cache, memcached does not allocate the memory for the data on an item by item basis. Instead, a slab allocation is used to optimize memory usage and prevent memory fragmentation when information expires from the cache. Lazy Expiration + LRU Lazy Expiration: When an item is requested (a get request) Memcached checks the expiration time to see if the item is still valid before returning it to the client. LRU (least recently used): Memcached is LRU per slab class, but not globally LRU.

Slab Allocation

Memcached Distributed Architecture
//Obtain a server ID based on Key value int getServerId(char *key, int serverTotal) { int c, hash = 0; while (c = *key++) { hash += c; } return hash % serverTotal; //a list of servers node[0] => :11211 node[1] => :11211 node[2] => :11211 //get id int id = getServerId("test", 3); //get ip address and port number node[id] == node[1]

Memcached Distributed Architecture

SQL Server Cache SQL Server cache mechanism : - Query plans - pages from the database files but it does NOT cache: - exact results from a query REFERENCE

Reference DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases Index Structure for Path Expressions APEX: An Adaptive Path Index for XML Data

Reference Only Memcached should run on Linux. It may not work on some windows

Reference

Reference 0: <flag> 60: timeout 6: length

Unit 06 : Index and Distributed Caching

Similar presentations

Presentation on theme: "Unit 06 : Index and Distributed Caching"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Unit 06 : Index and Distributed Caching

Similar presentations

Presentation on theme: "Unit 06 : Index and Distributed Caching"— Presentation transcript:

Similar presentations

About project

Feedback