DB2 for Linux, Unix, and Windows

pureXML Indexing Overview
DB2 9, 9.5, and 9.7 for Linux, Unix, and Windows
Christina (Tina) Lee, IBM Silicon Valley Laboratory
September 2009

Agenda
pureXML Basics
Regions and Paths Indexes
Index on XML column
DB2 9.5 Reject Invalid Values option
Common User Errors
Queries and XML Indexes
Catalog Changes
DB2 9.7
  XML indexes on Range Partitioned Tables
  Online Index Create and Online Index Reorg
  Index Compression

pureXML Basics
XML stored in a parsed hierarchical format
No fixed XML schema per XML column required
XML Schema validation is optional, per document
XML indexes for specific elements/attributes
XQuery and SQL/XML integration
  create table dept (deptID char(8),…, deptdoc xml);
(Figure: DB2 storage, showing relational storage pages alongside pureXML storage)

DB2 9 introduced pureXML support, which enables well-formed XML documents to be stored in their hierarchical form within columns of a table. XML columns are defined with the XML data type. By storing XML data in XML columns, the data is kept in its native hierarchical form, rather than stored as text or mapped to a different data model. The XML data is stored as an XML document tree and can be queried in its inherent hierarchical format.

Storing the XML Document
The DAT object holds the base table rows. The descriptor in the XML column contains a logical reference to the XML data. This logical reference is used to access the system-generated XML Regions Index which resolves the logical reference into the physical addresses of the pages that contain the XML document tree stored in the XDA (XML Data Area) object.

XML Regions and Column Path Indexes
XML Regions Index
  System generated when first XML column created or added to table
  Nodes and subtrees in a data page form regions in a document
  Provides logical mapping of regions to retrieve document data
XML Column Path Index
  System generated for each XML column created or added to table
  Maps unique paths to path IDs for each XML column
  Used to improve performance during queries

Index on an XML Column
An index on an XML column can be used to improve the efficiency of queries on XML documents that are stored in an XML column. In contrast to traditional relational indexes, where index keys are composed of one or more table columns you specify, an index on an XML column uses a particular XML pattern expression to index paths and values in XML documents stored within a single column. The data type of that column must be XML. Because multiple parts of an XML document can satisfy an XML pattern, multiple index keys may be inserted into the index for a single document.

A query accesses the path index first to find the path ID for the path used in the query predicate. The path ID is then used to access the index on the XML column, and the qualifying keys in that index are used to access the base table rows. Next, the descriptor for the XML column is used to access the XML regions index, which contains the physical pointers to the XML document tree in the XDA. The XML document is then traversed either to serialize the document for output or to apply predicates.

For the query in this example, we want to find the documents where Person has an Age = '17' and then return the Name of that Person. The index on the XML column provides a quick way to find the qualifying document, but the document tree is then traversed to find the value of Name to return.
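The lookup sequence described above can be traced with a small Python toy model. This is only a sketch: the dictionaries, row IDs, page names, and sample people are all invented for illustration and say nothing about how DB2 lays out these structures internally.

```python
# Toy stand-ins for the structures named in the text; every name and value is made up.
path_index = {"/Person/Age": 1, "/Person/Name": 2}         # path -> path ID
xml_index = {(1, 17.0): ["row-A"], (1, 42.0): ["row-B"]}   # (path ID, key) -> row IDs
base_table = {"row-A": "ref-1", "row-B": "ref-2"}          # row ID -> XML column descriptor
regions_index = {"ref-1": "page-10", "ref-2": "page-11"}   # descriptor -> XDA page
xda = {
    "page-10": {"Name": "Ann", "Age": "17"},               # stand-in for a document tree
    "page-11": {"Name": "Bob", "Age": "42"},
}


def names_with_age(age):
    """Return the Name of every Person whose Age equals `age`."""
    path_id = path_index["/Person/Age"]                 # 1. path index: path -> path ID
    row_ids = xml_index.get((path_id, float(age)), [])  # 2. index on the XML column
    names = []
    for row_id in row_ids:
        descriptor = base_table[row_id]                 # 3. base table row
        page = regions_index[descriptor]                # 4. regions index -> XDA page
        document = xda[page]                            # 5. document tree in the XDA
        names.append(document["Name"])                  # 6. traverse to extract Name
    return names
```

Note that step 6 still reads the document: the index only narrows down which rows to visit, and the returned Name comes from traversing the stored tree.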

The Big XML Indexing Picture

CREATE INDEX for Index on XML Column
Index created on single XML column
  Composite keys not supported
Only indexes document nodes that satisfy XML pattern
XML index specification
  GENERATE KEY USING XMLPATTERN
  XML pattern expression
  Data type

Key Generation
Relational index inserts one key per table row
Index on XML Column may insert multiple keys per table row
  Multiple parts of document may satisfy XML pattern
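To make the one-row, many-keys point concrete, here is a minimal Python sketch that collects every node matching the pattern /company/emp/@id from a single document. The sample document is hypothetical (the deck's own example documents did not survive in this transcript), and the function hard-codes that one pattern rather than parsing XML patterns in general.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, standing in for one row's XML column value.
DOC = """
<company name="Company1">
  <emp id="31201"><name>Ann</name></emp>
  <emp id="31664"><name>Bob</name></emp>
</company>
"""


def generate_keys(xml_text):
    """Collect one key per node matching the pattern /company/emp/@id."""
    root = ET.fromstring(xml_text)
    if root.tag != "company":
        return []
    # Each <emp> child carrying an id attribute contributes its own key.
    return [emp.get("id") for emp in root.findall("emp") if emp.get("id") is not None]


keys = generate_keys(DOC)
print(len(keys), "keys generated from one document:", keys)
```

A relational index over the same table would hold exactly one entry for this row, whereas the sketch yields one key per matching emp element.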

XML Documents for Examples

XML Pattern: Path Expression Steps
Supports subset of XQuery path expressions
Path expression steps separated by forward slash (/)
Double forward slash (//) is abbreviated syntax for /descendant-or-self::node()/
Each step has a forward axis with the child axis as the default
  child:: (contains children of the context node)
  attribute:: (contains attributes of the context node)
  @ (abbreviated syntax for attribute::)
  descendant:: (contains descendants of the context node such as child, grandchild, etc.)
  self:: (contains just the context node itself)
  descendant-or-self:: (contains the context node and the descendants of the context node)

The CREATE INDEX statements above could also be written using the unabbreviated syntax:
  CREATE INDEX empindex on company(companydocs) GENERATE KEY USING XMLPATTERN '/child::company/child::emp/attribute::id' AS SQL DOUBLE
  CREATE INDEX idindex on company(companydocs) GENERATE KEY USING XMLPATTERN '/descendant-or-self::node()/attribute::id'
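The abbreviation rules listed above are mechanical enough to sketch in a few lines of Python. The helper below, `expand`, is a hypothetical illustration and not a DB2 facility; it handles only the shorthand covered here (`//`, `@`, and the default child axis).

```python
def expand(pattern):
    """Rewrite an abbreviated XML pattern in unabbreviated axis syntax."""
    # '//' is shorthand for '/descendant-or-self::node()/'.
    pattern = pattern.replace("//", "/descendant-or-self::node()/")
    steps = []
    for step in pattern.split("/"):
        if step == "" or "::" in step:
            steps.append(step)                      # leading slash, or axis already explicit
        elif step.startswith("@"):
            steps.append("attribute::" + step[1:])  # '@' is shorthand for 'attribute::'
        else:
            steps.append("child::" + step)          # child is the default axis
    return "/".join(steps)
```

Applied to '/company/emp/@id' and '//@id', the helper reproduces the two unabbreviated patterns used in the CREATE INDEX statements.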

Qualifying Paths and Nodes
Set of nodes may qualify if single path specified
Set of paths and nodes may qualify if wildcard, descendant axis, or descendant-or-self axis specified

Specifying text()
If text() is specified, then only the text from the context element node qualifies. In this example, no index entries are generated because the “name” element does not contain any text nodes itself. Text nodes are found only in the child elements “first” and “last”, which do not qualify.

The presence or absence of the text() kind test in the XML pattern expression affects index entry generation for non-leaf elements but not for leaf elements. In general, using text() is not recommended: it requires more typing, and queries must also use text() for successful index matching. Schema validation applies only to elements and not to text nodes.

If text() not specified
If text() is not specified, the value of the element node is the text from the context element node and the text from all descendant element nodes concatenated together. In this example, the value of the element “name” is “DarthVapor”, which is the concatenation of the text from the “first” and “last” child elements.

Specifying elements without text() is good for leaf node elements because there is less typing and it will index what you expect. Be careful when specifying elements that are non-leaf nodes (especially //*). This is usually not useful, and the resulting concatenation may not be what you want. In some cases, the concatenation can be useful for VARCHAR indexes, such as in this example:
  <title>This is a <bold>great</bold> book about XML</title>
In this case, an index on /title might be useful to "ignore" the bold formatting indicator.
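The difference between the two cases can be reproduced with Python's standard XML library. This is a sketch of the semantics only: `string_value` and `own_text_nodes` are hypothetical helper names, and the "name" document is reconstructed from the values quoted in the text.

```python
import xml.etree.ElementTree as ET


def string_value(elem):
    """Value indexed without text(): all descendant text concatenated."""
    return "".join(elem.itertext())


def own_text_nodes(elem):
    """Values that qualify with text(): only text nodes directly under the element."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [t for t in nodes if t]


name = ET.fromstring("<name><first>Darth</first><last>Vapor</last></name>")
title = ET.fromstring("<title>This is a <bold>great</bold> book about XML</title>")

print(repr(string_value(name)))   # concatenation of first and last
print(own_text_nodes(name))       # name has no text nodes of its own
print(repr(string_value(title)))  # bold markup disappears from the value
```

For the non-leaf "name" element the two helpers disagree (a concatenated value versus nothing at all), while for a leaf element such as "first" they would return the same text, which is why text() only matters for non-leaf elements.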

Data Types
Four SQL data types are supported
  VARCHAR
  DOUBLE
  DATE
  TIMESTAMP
Native XML data types are not supported yet
Multiple indexes on the same XML pattern can be created with different data types

VARCHAR(n)
VARCHAR(n) is used to store varying-length character data
Maximum length "n" specified in number of bytes is a constraint
  Length "n" ranges from 1 to a page size-dependent maximum
  Index guaranteed to store entire character string values
Values longer than specified length (n) are not indexed
  Document insertion or index creation will fail
Index can support both range scans and equality look-ups
Trailing blanks are significant during string comparisons

XQuery semantics are used for string comparisons, where trailing blanks are significant. This differs from SQL semantics, where trailing blanks are insignificant during comparisons.
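Two of the VARCHAR(n) points are easy to get wrong in practice: the limit counts bytes rather than characters, and trailing blanks matter in comparisons. The Python sketch below illustrates both. It assumes values are measured in UTF-8, and all three function names are invented for illustration; none of this is DB2 code.

```python
def check_varchar(value, n):
    """Raise if the value does not fit in VARCHAR(n); n counts bytes, not characters."""
    size = len(value.encode("utf-8"))  # assumption: length is measured on UTF-8 bytes
    if size > n:
        raise ValueError(f"value needs {size} bytes, index allows {n}")
    return value


def xquery_equal(a, b):
    """XQuery-style comparison: trailing blanks are significant."""
    return a == b


def sql_equal(a, b):
    """SQL-style comparison, shown only for contrast: trailing blanks are ignored."""
    return a.rstrip(" ") == b.rstrip(" ")
```

With these helpers, "naive" fits in VARCHAR(5) but "naïve" does not, because the accented character occupies two bytes in UTF-8; and "abc" equals "abc " only under the SQL-style comparison.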

VARCHAR HASHED
Has no length limit and can index arbitrary length character strings
System generates an 8-byte hash code over entire string
Only used for equality look-ups and not range scans
Index contains hash codes instead of the actual character data
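A short Python sketch shows why a hashed index suits equality look-ups on strings of any length. BLAKE2b with an 8-byte digest is used purely as a stand-in; the hash function DB2 actually uses is not stated in this deck, and the row IDs and values are made up.

```python
import hashlib


def hash_key(value):
    """8-byte hash over the whole string (BLAKE2b is a stand-in, not DB2's function)."""
    return hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()


# Build a toy hashed index: hash code -> row IDs. The index never stores the strings.
hashed_index = {}
for row_id, value in [(1, "x" * 5000), (2, "short"), (3, "short")]:
    hashed_index.setdefault(hash_key(value), []).append(row_id)


def lookup_equal(value):
    """Equality look-up: hash the probe value and fetch the matching row IDs."""
    return hashed_index.get(hash_key(value), [])
```

Because the stored hash codes bear no relation to the sort order of the original strings, a predicate such as "greater than" has nothing to scan, which is the reason range scans are not supported.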

DOUBLE
Unbounded decimal types and 64 bit integers may lose precision
  Unique values in the document may be converted to the same DOUBLE value
All numeric values will be converted and stored in the index as the DOUBLE data type
Special numeric values (NaN, INF, -INF, +0, -0) indexed even though not supported by SQL DOUBLE data type
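The precision loss is a property of 64-bit floating point itself, so it can be demonstrated with Python's float, which is an IEEE 754 double. This is an illustration only; `to_double` is a hypothetical name, and Python's float() accepts some spellings that the XML lexical form does not.

```python
def to_double(lexical):
    """Convert a numeric XML value to a double (Python float as a stand-in)."""
    return float(lexical)


a = to_double("9007199254740992")  # 2**53
b = to_double("9007199254740993")  # 2**53 + 1, not representable as a double
collide = (a == b)                 # two distinct document values, one index key

# The special values named above all have double representations.
special = [to_double(s) for s in ("NaN", "INF", "-INF", "-0")]
print("distinct integers collapse to one key:", collide)
```

Two values that are distinct in the document therefore produce the same key, which matters in particular when the index is declared unique.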

DATE and TIMESTAMP
Indexing xs:time
  Since XQuery disallows casting from xs:time to xs:dateTime, xs:time can’t be indexed using the TIMESTAMP data type
  Use the VARCHAR data type to index xs:time instead
Timezones
  DB2 SQL date and timestamp don’t have a timezone
  DB2 SQL assumes the local timezone (from OS)
  XQuery date/time/dateTime values can have timezones
  Default XQuery implicit timezone is UTC
  If timezone not specified, original value stored in index
  If timezone specified, DATE and TIMESTAMP data type values are normalized to UTC (Coordinated Universal Time) before storing in index
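The last two timezone rules can be sketched with Python's datetime module. `timestamp_key` is a hypothetical helper that models the described behavior for xs:dateTime values; it is not DB2's conversion code, and it expects explicit offsets such as -07:00.

```python
from datetime import datetime, timezone


def timestamp_key(lexical):
    """Model of the index key for an xs:dateTime value (illustrative only)."""
    value = datetime.fromisoformat(lexical)
    if value.tzinfo is None:
        return value  # no timezone: the original value is stored as is
    # Timezone present: normalize to UTC, then drop the offset.
    return value.astimezone(timezone.utc).replace(tzinfo=None)


with_zone = timestamp_key("2009-09-15T10:00:00-07:00")
without_zone = timestamp_key("2009-09-15T10:00:00")
print(with_zone, without_zone)
```

The two inputs show the same wall-clock time, but only the one carrying an offset is shifted, so they end up as different keys.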

Document Rejection or CREATE INDEX Failures
DB2 does not support the entire range of XML values
  date or dateTime values with year > 9999 or < 0
  date or dateTime values with fractional second precision > 6 digits
  out of range numeric values
Errors causing document rejection for INSERT or UPDATE statements and CREATE INDEX failure if table already populated
  VARCHAR(n): Value length exceeds length constraint
  Conversion errors: Valid XML value but can't convert to DB2's representation for the data type because of DB2 limitations

Correct results for a query may include a value beyond DB2 limits. If such a value were ignored and not indexed, a query that uses the index could miss a valid value and return incorrect results. DB2 must therefore issue an error and reject the document to maintain consistent results.

Invalid XML Values
For DOUBLE, DATE, and TIMESTAMP indexes
  XML values without a valid lexical form for the target index XML data type are invalid
  DB2 9 XML indexes always ignore invalid XML values
  Invalid XML values can be rejected or ignored through a new CREATE INDEX option in DB2 9.5

For indexes using the data types DOUBLE, DATE, and TIMESTAMP, an XML pattern value is converted to the index XML data type using the XQuery cast expression. XML values that do not have a valid lexical form for the target index XML data type are considered to be invalid XML values. For example, “ABC” is an invalid XML value for the xs:double data type. How the index handles the invalid XML values depends on the option specified in the xmltype-clause of the CREATE INDEX statement.

DB2 9.5 Reject Invalid Values
New REJECT INVALID VALUES option for DB2 9.5
If XML value can’t be cast to index XML data type, error returned
  If index does not exist, index is not created
  XML data not inserted or updated in the table if index exists

If the REJECT INVALID VALUES option is specified, all XML pattern values must be valid for the index XML data type. If any XML pattern value cannot be cast to the index XML data type, an error is returned. XML data is not inserted or updated in the table if the index already exists. If the index does not exist, the index is not created.

DB2 9.5 Ignore Invalid Values
Invalid values for index XML data type ignored and not indexed
No error or warning is issued
Default option

If the IGNORE INVALID VALUES option is specified, invalid XML pattern values for the target index XML data type are ignored. The corresponding values in the stored XML documents are not indexed by the CREATE INDEX statement. This is the default behavior for indexes created in DB2 9 and the default option for indexes created in DB2 9.5. During insert and update operations, the invalid XML pattern values are not indexed, but the XML documents will still be inserted into the table. No error or warning is returned because specifying these data types is not considered a constraint on the XML pattern values, primarily because XQuery expressions that are searching for the specific XML index data type will never consider these values.
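The contrast between the two options can be modeled in a few lines of Python for a DOUBLE index. This is a sketch under stated assumptions: `build_double_keys` and its `reject_invalid` flag are invented names, and Python's float() only approximates the xs:double lexical rules.

```python
def build_double_keys(values, reject_invalid=False):
    """Convert XML pattern values to DOUBLE keys, ignoring or rejecting invalid ones."""
    keys = []
    for value in values:
        try:
            keys.append(float(value))  # stand-in for the XQuery cast to xs:double
        except ValueError:
            if reject_invalid:
                # REJECT INVALID VALUES: the whole operation fails.
                raise ValueError(f"invalid value for DOUBLE index: {value!r}") from None
            # IGNORE INVALID VALUES (default): skip the value silently.
    return keys
```

With the default, the values "17", "ABC", and "42" yield two keys and the document is still stored; with `reject_invalid=True`, the same input raises, mirroring a failed insert or a failed CREATE INDEX.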

Unique Keyword
Values unique in the XML document may result in duplicate errors during insertion when the values are converted to the index data type
  VARCHAR HASHED: Unique character strings may hash to same value
  DOUBLE: Unbounded decimal types and 64 bit integers may lose precision when stored as DOUBLE
Uniqueness enforced across all documents within a single XML column
Enforced within index data type, XML path to node, and value of node after value cast to index data type

Query Operators for XML
XSCAN (XML Document Scan)
  Traverses XML document trees and may evaluate predicates and extract document values
XISCAN (XML Index Scan)
  Performs probes and scans on XML indexes and can evaluate predicates
XANDOR (XML Index ANDing and ORing)
  Evaluates two or more equality predicates by driving multiple XISCANs

  CREATE INDEX AgeIndex on t1(XMLDoc) GENERATE KEY USING XMLPATTERN '/Person/Age' AS SQL DOUBLE;
  XQUERY for $i in db2-fn:xmlcolumn('T1.XMLDOC')/Person where $i/Age = 17 return $i;

XSCAN (XML Document Scan): DB2 uses the XSCAN operator to traverse XML document trees, evaluate predicates, and extract document fragments and values. XSCAN is not an "XML table scan", but it can appear in an execution plan after a table scan to process each of the documents.

XISCAN (XML Index Scan): The XISCAN operator performs lookups or scans on XML indexes. The XISCAN takes a value predicate as input, such as a path-value pair like /book[price = 29] or where $i/book/price = 29. It returns a set of row IDs and node IDs. The row IDs identify the rows that contain the qualifying documents, and the node IDs identify the qualifying nodes within these documents.

XANDOR (XML Index ANDing and ORing): The XANDOR operator evaluates two or more equality predicates simultaneously by driving multiple XISCANs. It returns the row IDs of those documents that satisfy all of the predicates. DB2 9 only supports XML index ANDing. ORing support for XML indexes was added in a later DB2 release. Index ANDing and ORing can’t be combined in the same plan.

See Susanne Englert’s talk for more information on explain output.

Index Eligibility
Requirements for an XML index to be used for a query:
  Index “contains” the query predicate, i.e. is equally or less restrictive than the predicate
  Query predicate matches the index data type
  /text() is used consistently in query predicate and index definition: both specify /text() or neither does
Even if these requirements are satisfied, the optimizer can still decide NOT to use an eligible index!

The index must be equally or less restrictive than the predicate so that it contains everything that matches the predicate. The index cannot be more restrictive than the predicate because then it may not return all the matching results. Although using wildcards in the index definition such as //* will cause the index to match a wider range of queries, it may also cause more nodes to be indexed than is necessary and result in unwanted index growth. Wherever possible, it is recommended to use the exact path to the desired elements or attributes in index definitions and queries, without wildcards. XML index patterns such as //* or //text() are possible but should be used with caution.

The data type of the query predicate and the index must match for an index to be chosen. Value predicates also have a data type that is determined by the type of the literal value. A value in double quotes is always a string, but a numeric value without quotes is interpreted as a number. For example, if the query predicate uses a numeric value, then the index must have the DOUBLE data type.

A predicate with /text() can only be evaluated by an index that also specifies /text() in its XML pattern. If your index does not use /text(), your predicates should also not use /text().

The optimizer can decide not to use an index even if it is eligible, for example when the index does not significantly reduce the number of rows retrieved from the table and the cost of the index access outweighs the savings in I/O to the table.
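The three requirements can be turned into a rough Python checker. This is a deliberately simplified sketch and not the optimizer's matching logic: it assumes the query predicate uses a fully specified path without wildcards, understands only `/`, `//`, `*`, and plain names in the index pattern, and covers only the VARCHAR and DOUBLE data types. The function names are invented.

```python
import re


def pattern_regex(xmlpattern):
    """Compile a simple XML pattern ('/', '//', '*', names) into a regex over paths."""
    pieces = []
    for token in re.split(r"(//|/|\*)", xmlpattern):
        if token == "//":
            pieces.append(r"/(?:[^/]+/)*")   # any number of intermediate steps
        elif token == "*":
            pieces.append(r"[^/@]+")         # any element name
        else:
            pieces.append(re.escape(token))  # '/', a literal name, or ''
    return re.compile("".join(pieces) + r"\Z")


def eligible(index_pattern, index_type, query_path, literal):
    """Check the three eligibility requirements for one predicate."""
    # text() must be used consistently by the index and the predicate.
    if index_pattern.endswith("text()") != query_path.endswith("text()"):
        return False
    # A quoted literal is a string; an unquoted one is a number.
    literal_type = "VARCHAR" if isinstance(literal, str) else "DOUBLE"
    if index_type != literal_type:
        return False
    # The index pattern must be equally or less restrictive than the query path.
    strip = len("text()") if query_path.endswith("text()") else 0
    index_part = index_pattern[:len(index_pattern) - strip]
    query_part = query_path[:len(query_path) - strip]
    return pattern_regex(index_part).match(query_part) is not None
```

For example, an index on '//@id' AS SQL VARCHAR passes for a predicate on /company/emp/@id compared with the string "42366", but the same pattern declared AS SQL DOUBLE does not, and neither does an index ending in /text() when the predicate omits it. Passing this check only means the index could be used; the cost-based decision is a separate step.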

Queries using an Index on an XML Column
Some sample queries using equality and range predicates

The first query uses an equality predicate to find the salary of the employee with id = '42366'. The index contains the equality predicate because its XML pattern is equally as restrictive as the predicate. The index data type of VARCHAR matches the string data type since '42366' is in quotes. Neither the index nor the query specifies text(), so all 3 requirements for index eligibility are satisfied.

The second query uses a range predicate (> 50000). The index contains the range predicate because its XML pattern is less restrictive than the predicate. The index data type of DOUBLE matches the numeric data type of the query since 50000 is not in quotes. Neither the index nor the query specifies text(), so all 3 requirements for index eligibility are satisfied.

SYSCAT.INDEXXMLPATTERNS
Catalog view externalizes information on the XML pattern specified for an Index on an XML Column

The slide shows a subset of the columns in SYSCAT.INDEXXMLPATTERNS. Here is the full description of all the columns. Each row represents a pattern clause in an index on an XML column.

#  Column Name  Data Type     Null  Description
0  INDSCHEMA    VARCHAR(128)        Logical index schema
1  INDNAME      VARCHAR(128)        Logical index name
2  PINDNAME     VARCHAR(128)        Physical index name
3  PINDID       SMALLINT            Internal physical index ID
4  TYPEMODEL    CHAR(1)             Q = SQL DATA TYPE
5  DATATYPE     VARCHAR(128)        Name of data type
6  HASHED       CHAR(1)             Y = Yes, value is hashed; N = No, value not hashed
7  LENGTH       SMALLINT            Length of VARCHAR(n), else zero
8  PATTERNID    SMALLINT            Internal pattern ID
9  PATTERN      CLOB(2M)      Y     Pattern definition

SYSCAT.INDEXES
Index on an XML Column has a logical and physical index
  Logical index contains XML pattern created by user
  Physical index contains index values
    DB2 system generated key columns

When you create an index on an XML column, 2 indexes actually get created: a logical index and a physical index. The logical index contains the XML pattern information specified in the CREATE INDEX statement. The physical index has DB2 generated key columns to support the logical index and contains the actual index values. The user works with an index on an XML column at the logical level for the CREATE INDEX and DROP INDEX statements. Processing of the underlying physical index by DB2 is transparent to the user.

The logical index has the index name specified in the CREATE INDEX statement and has the indextype = XVIL. The physical index has a system generated name and has the indextype = XVIP. The logical index is always created and assigned an index ID first. The physical index is created immediately afterwards and is assigned the next consecutive index ID.

4 index types have been added to SYSCAT.INDEXES:
  XVIL = Index on an XML column (logical)
  XVIP = Index on an XML column (physical)
  XPTH = XML paths index
  XRGN = XML regions index

Both the XML regions index and XML column path index are internal, system-generated indexes associated with XML columns. Although they are recorded in SYSCAT.INDEXES, these indexes are not recognized by any application programming interface that returns index metadata.

DB2 9.7 XML Indexes on Range Partitioned Tables
  CREATE INDEX zipcode ON sales(customer_info) GENERATE KEY USING XMLPATTERN '/Customer/Address/Zipcode' AS SQL varchar(10) NOT PARTITIONED;
  CREATE INDEX zipcode ON sales(customer_info) GENERATE KEY USING XMLPATTERN '/Customer/Address/Zipcode' AS SQL varchar(10) PARTITIONED;

Relational Indexes may be not partitioned or partitioned in DB2 9.7
User-defined XML Indexes may be not partitioned (DB2 9.7 GA) or partitioned (DB2 9.7 FP1)
System generated XML Paths Indexes are always not partitioned
System generated XML Regions Indexes are always partitioned

DB2 9.7 XML Indexes on Range Partitioned Tables
(Figure: a range partitioned table with three base table partitions. Each partition is shown with its own partitioned relational index or index on XML column, its own partitioned regions index, and its own XDA. A non-partitioned relational index or index on XML column and the non-partitioned XML path index are shown for the table as a whole.)

DB2 9.7 Online XML Index Create and Reorg
  create table employee (empid integer, info XML);

Online index create
  Transaction 1: create index empidx on employee(info) generate key using xmlpattern '/employeeinfo/addr' as sql varchar(50);
    Create Index will delete the index entry before it completes.
  Transaction 2: Delete from employee where empid = 31201
    Delete will not wait for the Create Index and will complete successfully

Online index reorg
  Transaction 1: reorg indexes all for table employee allow write access
    Reorg Indexes command will delete the index entry before it completes.
  Transaction 2: ... where empid = 31664
    Delete will not wait for the Reorg Indexes and will complete successfully

Insert/Update/Delete transactions no longer need to wait until the CREATE INDEX/REORG INDEXES/REORG INDEX statement completes
Results in increased throughput and faster response time for concurrent transactions

DB2 9.7 Index Compression
Default for relational and XML indexes enables compression if data row compression enabled
New COMPRESS keyword on CREATE/ALTER INDEX can override default behavior
Index can be compressed even if data rows not compressed
MDC Block Indexes and XML Paths Indexes not compressed

What Did You Learn Today?
What the difference is between XML and relational indexes
How to create an index on an XML column
How to avoid common user errors
What the requirements are for queries to use XML indexes
How the XML indexes are defined in the catalog
DB2 9.5 and DB2 9.7 XML index enhancements
