XML Transformations and Content-based Crawling Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems August 7, 2015
Reminders Homework 2 “release” version is now on the Web site Simple web crawling XPath XSLT Storage (Berkeley DB) Milestone 1 due March 1 Milestone 2 due March 8 2
More than XPath XPath identifies or extracts subtrees from an XML document … But there are lots of cases where we want to convert from XML XML, or something else XML text (document extraction) XML HTML XML SVG etc. Here we need something more – often XSLT 3
4 A Functional Language for XML XSLT is based on a series of templates that match different parts of an XML document There’s a policy for what rule or template is applied if more than one matches (it’s not what you’d think!) XSLT templates can invoke other templates XSLT templates can be nonterminating (beware!) XSLT templates are based on XPath “match”es, and we can also apply other templates (potentially to “select”ed XPaths) Within each template, directly describe what should be output
5 An XSLT Template An XML document itself XML tags create output OR are XSL operations All XSL tags are prefixed with “xsl” namespace All non-XSL tags are part of the XML output Common XSL operations: template with a match XPath Recursive call to apply-templates, which may also select where it should be applied Attach to XML document with a processing-instruction:
6 An Example XSLT Stylesheet This is DBLP …
7 XML Data Root ?xml dblp mastersthesis article mdate key authortitleyearschool editortitleyearjournalvolumeee mdate key 2002… ms/Brown92 Kurt P…. PRPL… 1992 Univ…. 2002… tr/dec/… Paul R. The… Digital… SRC… 1997 db/labs/dec attribute root p-i element text
8 XSLT Processing Model List of source nodes result tree fragment(s) Start with root Find all template rules with matching patterns from root Find “best” match according to some heuristics Set the current node list to be the set of things it maches Iterate over each node in the current node list Apply the operations of the template “Append” the results of the matching template rule to the result tree structure Repeat recursively if specified to by apply-templates
9 What If There’s More than One Match? Eliminate rules of lower precedence due to importing Break a rule into any | branches and consider separately Choose rule with highest computed or specified priority Simple rules for computing priority based on “precision”: QName preceded by XPath child/axis specifier: priority 0 NCName preceded by child/axis specifier: priority NodeTest preceded by child/axis specifier: pririty -0.5 else priority 0.5
10 Other Common Operations Iteration: Conditionals: Copying current node and children to the result set:
11 Creating Output Nodes Return text/attribute data (this is a default rule): Create an element from text (attribute is similar): Copy nodes matching a path
12 Embedding Stylesheets You can “import” or “include” one stylesheet from another: “Include”: the rules get same precedence as in including template “Import”: the rules are given lower precedence
13 XSLT Summary A very powerful, template-based transformation language for XML document other structured document Commonly used to convert XML PDF, SVG, GraphViz DOT format, HTML, WML, … Primarily useful for presentation of XML or for very simple conversions What if we want to: Manage and combine collections of XML documents? Make Web service requests for XML? “Glue together” different Web service requests? Query for keywords within documents, with ranked answers This is where XQuery plays a role – see CIS 330 / 550 for details
Now… How Do We Crawl the Web and Get Data? A few remarks on basic crawlers… … Then an XML-specific crawler 14
15 Crawling the Web: The Basic Process Start with some initial page P 0 Collect all URLs from P 0 and add to the crawler queue Consider tag, anchor links, optionally image links, CSS, DTDs, scripts Considerations: What order to traverse (polite to do BFS – why?) How deep to traverse What to ignore (coverage) How to escape “spider traps” and avoid cycles How often to crawl
16 Essential Crawler Etiquette Robot exclusion protocols First, ignore pages with: Second, look for robots.txt at root of web server See To exclude all robots from a server: User-agent: * Disallow: / To exclude one robot from two directories: User-agent: BobsCrawler Disallow: /news/ Disallow: /tmp/
Suppose We Want to Crawl XML Documents Based on User Interests We need several parts: A list of “interests” – expressed in an executable form, perhaps XPath queries A crawler – goes out and fetches XML content A filter / routing engine – matches XML content against users’ interests, sends them the content if it matches 17
18 XML-Based Information Dissemination Basic model (XFilter, YFilter, Xyleme): Users are interested in data relating to a particular topic, and know the schema /politics/usa//body A crawler-aggregator reads XML files from the web (or gets them from data sources) and feeds them to interested parties
19 Engine for XFilter [Altinel & Franklin 00]
20 How Does It Work? Each XPath segment is basically a subset of regular expressions over element tags Convert into finite state automata Parse data as it comes in – use SAX API Match against finite state machines Most of these systems use modified FSMs because they want to match many patterns at the same time
21 Path Nodes and FSMs XPath parser decomposes XPath expressions into a set of path nodes These nodes act as the states of corresponding FSM A node in the Candidate List denotes the current state The rest of the states are in corresponding Wait Lists Simple FSM for politics usabody Q1_1 Q1_2 Q1_3
22 Decomposing Into Path Nodes Query ID Position in state machine Relative Position (RP) in tree: 0 for root node if it’s not preceded by “//” -1 for any node preceded by “//” Else =1+ (no of “*” nodes from predecessor node) Level: If current node has fixed distance from root, then 1+ distance Else if RP = –1, then –1, else 0 Finaly, NextPathNodeSet points to next node Q Q1-1Q1-2Q1-3 Q Q2-1Q2-2Q2-3 Q2=//usa/*/body/p
23 Query Index Query index entry for each XML tag Two lists: Candidate List (CL) and Wait List (WL) divided across the nodes “Live” queries’ states are in CL; “pending” queries + states are in WL Events that cause state transition are generated by the XML parser politics usa body p Q1-1 Q2-1 Q1-3Q2-2 Q2-3 X X X X X X X X CL WL Q1-2
24 Encountering an Element Look up the element name in the Query Index and all nodes in the associated CL Validate that we actually have a match Q Q1-1 politics Q1-1 X X WL startElement: politics CL Query ID Position Rel. Position Level Entry in Query Index: NextPathNodeSet
25 Validating a Match We first check that the current XML depth matches the level in the user query: If level in CL node is less than 1, then ignore height else level in CL node must = height This ensures we’re matching at the right point in the tree! Finally, we validate any predicates against attributes (e.g.,
26 Processing Further Elements Queries that don’t meet validation are removed from the Candidate Lists For other queries, we advance to the next state We copy the next node of the query from the WL to the CL, and update the RP and level When we reach a final state (e.g., Q1-3), we can output the document to the subscriber When we encounter an end element, we must remove that element from the CL
27 Publish-Subscribe Model Summarized Well-suited to an XML format called RSS (Rich Site Summary or Really Simple Syndication) Many news sites, web logs, mailing lists, etc. use RSS to publish daily articles Seems like a perfect fit for publish-subscribe models!