Vijayshankar Raman, CS294-7, Spring

Querying the WWW
Alberto O. Mendelzon, George A. Mihaila, Tova Milo

Scenarios...
§Find about PCs from IBM
  - query: +IBM +“personal computer” +price
  - can we restrict search to ?
§Find a good music store
  - should I ask yahoo or hotbot or lycos or … ?
§Find pages about databases within 2 links from Joe’s webpage
§Find recent web pages with title “Bob’s Music Store”

Problems
§Queries don’t exploit structure of data
§Queries don’t exploit link topology of data
§Source selection hard
  - different search engines have different functionalities, idiosyncratic behaviour
  - different search engines good at different tasks

Outline
§Motivation
§WebSQL
§Nuts and Bolts
§Query Locality
§Good, Bad and Ugly

WebSQL
Integrate structure/topology constraints with textual retrieval
§Virtual graph model of document network
§Need to combine navigation and querying
§Query language that utilizes a document’s structure and can accept constraints on link topology

Data Model
Relational
§Each web object is a tuple in Document
  - {url, title, text, type, length, modification info}
§Hyperlinks are tuples in Anchor
  - {base, href, label}
  - interior links ( ): within same document
  - local links ( ): within same server
  - global links ( ): across servers

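The two relations and the three link kinds above can be sketched in Python. This is only an illustration, not WebSQL's implementation: the class layout, the field name `modif`, and the `link_kind` helper are assumptions made for this example, and "same server" is approximated by comparing host names.

```python
from dataclasses import dataclass
from urllib.parse import urldefrag, urljoin, urlsplit


@dataclass(frozen=True)
class Document:
    url: str
    title: str
    text: str
    type: str
    length: int
    modif: str  # modification info (field name is an assumption)


@dataclass(frozen=True)
class Anchor:
    base: str   # URL of the document containing the link
    href: str   # link target, possibly relative to base
    label: str  # anchor text


def link_kind(anchor: Anchor) -> str:
    """Classify a hyperlink as 'interior', 'local' or 'global'."""
    target = urljoin(anchor.base, anchor.href)  # resolve relative hrefs
    base_doc, _ = urldefrag(anchor.base)        # drop any #fragment
    target_doc, _ = urldefrag(target)
    if base_doc == target_doc:
        return "interior"                       # same document
    same_host = urlsplit(base_doc).netloc.lower() == urlsplit(target_doc).netloc.lower()
    return "local" if same_host else "global"   # same server vs. across servers
```

For instance, an anchor with base `http://a.example/x.html` and href `#sec2` is classified as interior, href `y.html` as local, and href `http://b.example/` as global.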
Examples
§SELECT x.url, x.title, y.url, y.title
 FROM Document x SUCH THAT x MENTIONS “Computer Science”,
      Document y SUCH THAT x = y
 -- docs within 2 links from something on CS.
§SELECT d.url, d.title
 FROM Document d SUCH THAT “ = d
 WHERE d.title CONTAINS “database”;
 -- docs within 2 links of CS homepage.
§MENTIONS: search engine, CONTAINS: checked locally

More examples
from Toronto
§Job Opportunities for Software Engineers
 SELECT e.url
 FROM Document d SUCH THAT d MENTIONS "Career Opportunities",
      Document e SUCH THAT d = | -> e
 WHERE e.text CONTAINS "Software Engineer";
§this query is useful, but...

Outline
§Motivation
§WebSQL
§Nuts and Bolts
§Query Locality
§Good, Bad and Ugly

Nuts and bolts
§SELECT Fields(x1, x2, …, xn)
 FROM Obj x1 SUCH THAT A1,
      Obj x2 SUCH THAT A2,
      …
 WHERE Condition(x1, x2, …, xn)
§nested loops join algorithm:
 for all x1 such that A1 is true
   for all x2 such that A2 is true
     …

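The nested loops evaluation above can be sketched in Python over a toy in-memory "web". Everything here is hypothetical and for illustration only: the `WEB` dictionary, the helper names `mentions` and `empty_or_one_link`, and the `evaluate` signature are assumptions, and a real system would call search engines and fetch pages instead of reading a dictionary.

```python
# Toy web: url -> (text, locally linked urls). All data is made up.
WEB = {
    "a": ("Career Opportunities at A", ["a/jobs", "a/about"]),
    "a/jobs": ("We need a Software Engineer", []),
    "a/about": ("About us", []),
    "b": ("Career Opportunities: Software Engineer wanted", []),
    "c": ("Nothing relevant", ["a/jobs"]),
}


def mentions(phrase):
    """Stand-in for a search-engine lookup (MENTIONS)."""
    def domain(bindings):
        return [url for url, (text, _) in WEB.items() if phrase in text]
    return domain


def empty_or_one_link(source_var):
    """Stand-in for the path expression `= | ->` starting at source_var."""
    def domain(bindings):
        src = bindings[source_var]
        return [src] + WEB[src][1]  # the node itself, or one link away
    return domain


def evaluate(from_clauses, where, select):
    """Nested loops: one loop level per (variable, domain) FROM clause."""
    results = []

    def loop(i, bindings):
        if i == len(from_clauses):          # all variables bound
            if where(bindings):
                results.append(select(bindings))
            return
        var, domain = from_clauses[i]
        for value in domain(bindings):      # "for all x_i such that A_i"
            loop(i + 1, {**bindings, var: value})

    loop(0, {})
    return results


job_pages = evaluate(
    [("d", mentions("Career Opportunities")), ("e", empty_or_one_link("d"))],
    where=lambda b: "Software Engineer" in WEB[b["e"]][0],
    select=lambda b: b["e"],
)
```

Each domain function receives the bindings made by the outer loops, which is what lets the condition for `e` depend on the current value of `d`.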
§each atomic condition A1 … Am is of form
  - Path(from_node, path_expression, to_node)
    x5 = | (->*) x7
    enumerate links to check these
  - NodePredicate(node)
    CONTAINS “Bob’s Coffee Place” (x5)
    query a “customizable set of known” search engines
§what queries are computable?
  - those that don’t have to explore the entire web
  - “safe” queries: every variable must be either directly solvable in some atomic condition, OR directly derivable from another in some atomic condition

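One possible reading of the "safe" query rule is a fixpoint check: a variable becomes bound either by a node predicate that a search engine can answer, or by a path condition whose starting node is a constant or an already bound variable. The sketch below encodes that reading. The tuple encoding of conditions and the name `is_safe` are assumptions of this example, not part of WebSQL.

```python
def is_safe(variables, conditions):
    """Check that every variable can be bound without exploring the whole web.

    conditions is a list of tuples (a hypothetical encoding):
      ("mentions", var)         var is directly solvable via a search engine
      ("path", source, target)  target is derivable once source is known;
                                source is a variable name or a URL constant
    """
    variables = set(variables)
    bound = set()
    changed = True
    while changed:                       # iterate until no new variable is bound
        changed = False
        for cond in conditions:
            if cond[0] == "mentions":
                newly_bound = cond[1]
            elif cond[0] == "path":
                _, source, target = cond
                if source in variables and source not in bound:
                    continue             # start node not known yet
                newly_bound = target
            else:
                raise ValueError(f"unknown condition kind: {cond[0]!r}")
            if newly_bound not in bound:
                bound.add(newly_bound)
                changed = True
    return variables <= bound
```

Under this reading, a query whose only condition is a path between two unbound variables is rejected, since answering it would require enumerating every document on the web as a possible starting point.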
Query Locality
§distinguish between access to local and remote documents
§model communication cost of a query based on
  - “expected” number of results from search engines
  - “expected” size of documents
  - “expected” number of exterior, interior, remote links per document
  - “expected” cost of network access
§can identify potentially expensive components of a query and warn user

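The slide does not give a cost formula, so the following is only a rough sketch of how such "expected" quantities might be combined. It assumes each path step multiplies the number of documents to fetch by an expected fan-out and that every fetch costs size times a per-kilobyte network cost. The names `Expectations`, `step_costs`, `expensive_steps` and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Expectations:
    search_results: float  # expected number of results from a search engine
    doc_size_kb: float     # expected document size
    fanout: dict           # expected number of links per document, by link kind
    cost_per_kb: float     # expected network cost per kilobyte fetched


def step_costs(exp, path):
    """Estimated cost of each step of a path, given as a list of link kinds."""
    frontier = exp.search_results       # documents reached so far
    costs = []
    for kind in path:
        frontier *= exp.fanout[kind]    # documents reached by following this step
        costs.append(frontier * exp.doc_size_kb * exp.cost_per_kb)
    return costs


def expensive_steps(exp, path, budget):
    """Indices of path steps whose estimated cost exceeds the budget."""
    return [i for i, cost in enumerate(step_costs(exp, path)) if cost > budget]


# Made-up numbers, purely for illustration.
exp = Expectations(search_results=10, doc_size_kb=4.0,
                   fanout={"local": 5, "global": 2}, cost_per_kb=0.5)
costs = step_costs(exp, ["local", "local"])
warnings = expensive_steps(exp, ["local", "local"], budget=200)
```

With these numbers the second step is estimated to be five times as expensive as the first, which is the kind of component a system could flag to the user before running the query.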
The Good
§Idea of using structure in answering queries
§topologies can be useful, with a better interface...
§can be used for link maintenance

The Bad
§Too complicated (especially syntax)
  - easy to write queries that explore the entire web
§does the end user care about topology constraints, beyond domain constraints?
§Remote accesses cause a huge slowdown
  - check topology constraints at search engine?
§availability

The Ugly
§How to avoid back links?
§Fuzzy queries
  - find me “good”, “inexpensive” Chilean restaurants that are “close by”

Issues
§What kinds of path-based queries are useful, intuitive?
§How to check the path constraints at the search engine?
§Can hypertext links be viewed as yet another kind of link in a semi-structured model?

Other Work
§Other, generic intra-document structure can be useful
§Topology, structure can be used by system (instead of by end user)
  - use links to determine quality of site content
  - authority sites -- find for query on harvard
  - classification -- Cha-Cha
§Store links at search engine for proximity searches
  - can generalize to arbitrary links in a directed graph model --- Goldman et al. ’98
  - get “see also” info