1
March 2000 Gio XIT 1
Increasing the Precision when Obtaining Information from the Web
Gio Wiederhold, Stanford University
4 April 2000, EPFL seminar
Related report: www-db.stanford.edu/pub/gio/1999/miti.htm
Supported by the AFOSR New World Vistas Program
2
Growth Factors
[Diagram labels] Consumer Pull; Research & Innovation; Tool building; Product building & marketing; General Technology Push; Business needs; Government responsibilities; Information Technology
3
Trends, 1998 : 1999
Users of the Internet: 40% → 52% of U.S. population
Growth of Net Sites (now 2.2M public sites with 288M pages)
Expected growth in E-commerce by Internet users [BW, 6 Sep. 1999]
segment          1998    1999
– books          7.2% → 16.0%
– music & video  6.3% → 16.4%
– toys           3.1% → 10.3%
– travel         2.6% →  4.0%
– tickets        1.4% →  4.2%
– Overall        8.0% → 33.0% = $9.5 billion
An unsustainable trend cannot be sustained [Herbert Stein] → new services
[Chart: E-penetration (%) by year, Toys]
Year:  98   99  00  01  02  03  04
%:     0.3  1   3   9   27  81  **
Centroid, in 1999: ~1% of total market
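The chart series on this slide (0.3, 1, 3, 9, 27, 81 for 1998 to 2003) roughly triples every year. As a minimal sketch, assuming exact tripling from the 1999 value of 1% (an idealization; the function name and parameters are hypothetical), the projection shows why the trend cannot be sustained: the next step would exceed the whole market.

```python
def project_penetration(start_year=1999, start_pct=1.0, factor=3.0, years=6):
    """Project penetration (%) assuming a constant yearly growth factor."""
    return {start_year + k: start_pct * factor ** k for k in range(years)}

projection = project_penetration()
for year, pct in projection.items():
    # A share above 100% is impossible, so the growth rate has to break first.
    note = "  <- exceeds the whole market" if pct > 100 else ""
    print(f"{year}: {pct:g}%{note}")
```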
4
Expect continuing growth
– Hardware technology will continue to lead and encourage broader usage
– Communication technology will continue to lead and become more economical
– User interfaces will improve and not be a barrier to the acceptance of technology
– Government policies will not hinder open interaction, or not be able to
5
The Problem of Information Growth:
"We are drowning in information but starved for knowledge. This level of information is clearly impossible to be handled by present means. Uncontrolled and unorganized information is no longer a resource in an information society, instead it becomes the enemy."
-- John Naisbitt, author of the 1982 bestseller Megatrends
... and it's not getting better
Dealing with this issue requires Precision:
– Helpful for casual users -- reduce human filtering when browsing
– Essential for business -- regular tasks require automation
[Callout] needs Knowledge
6
Knowledge from Experts encoded for reuse
The product: Information
[Diagram labels] Observations; Filters; Analyses; Aggregation of instances; Integration of sources; Data; Data + Knowledge; Information
7
Precision to be improved in
Relevance of Information for the Customer
– modeling the customer
Timeliness of Information
– resolving temporal mismatch for past data
Search for Information
– precision versus recall
Meaning of the Information (our focus here)
– resolving semantic mismatch
Service model to achieve these objectives: services add value by increasing precision
8
Search techniques add value
– Yahoo: humans catalog and organize useful web sites.
– Junglee: integrates diverse sources using wrappers.
– AltaVista: automatically surfs and indexes the web.
– Excite: also tracks queries and classifies customers.
– Firefly: provides customer control over their profiles.
– Cookies: track users' activities between sessions.
– Alexa: collects webpages and their usage.
– Google: ranks the reference importance of web pages.
...
9
Problems for search engines and progress
Unsuitable source representations
– part classification: HTML --- XML
– print formats: PostScript, Adobe PDF
– non-text: images, sound, video
– hidden in databases behind CGI scripts
Inconsistent semantics
– context distinct / scope / view
Naïve modeling of customers
– roles & growth
Search engines cannot solve all problems
Being improved. Rate?
10
Large quantities affect cost
The human genome: ~4 000 000 000 base pairs
Genes, and gene abnormalities: 1 human ~10 000 proteins, ? diseases
Everybody's genes: 6 000 000 000 humans
Small organic molecules (affect proteins, suitable for drugs): ~2 000 000 molecules
Metabolic pathways: <1000 systems
[Diagram labels] Nature; Progress
11
Need for precision
[Chart, adapted from Warren Powell, Princeton Un.] Labels: data errors; information quantity; human limit; acceptable limit; human with tools?; Information Wall
More precision is needed as data volume increases: a small error rate still leads to too many errors.
False Positives have to be investigated (an attractive-looking supplier that makes toys, not real cars; an apparent drug-target with poor annotation).
False Negatives cause lost opportunities, suboptimal to some degree.
False positives (= poor precision) typically cost more than false negatives (= poor recall).
Testing a false lead in pharmaceutics costs > $100 000 in stage 1.
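The point that a small error rate still produces too many errors at volume can be illustrated with a short sketch. It uses the standard definitions of precision and recall; the 1% false-positive rate and the candidate counts are hypothetical values chosen for illustration, not figures from the talk.

```python
def precision(true_pos, false_pos):
    """Fraction of retrieved items that are actually relevant."""
    retrieved = true_pos + false_pos
    return true_pos / retrieved if retrieved else 0.0

def recall(true_pos, false_neg):
    """Fraction of relevant items that were actually retrieved."""
    relevant = true_pos + false_neg
    return true_pos / relevant if relevant else 0.0

def expected_false_positives(n_candidates, false_positive_rate):
    """Expected number of false leads at a given per-item false-positive rate."""
    return n_candidates * false_positive_rate

# Same (hypothetical) 1% false-positive rate, growing data volume.
for n in (1_000, 100_000, 10_000_000):
    leads = expected_false_positives(n, 0.01)
    print(f"{n:>10,} candidates -> {leads:,.0f} false leads to investigate")
```

Each false lead has to be investigated by a person or a costly process, so the absolute count matters more than the rate once the volume passes what humans can absorb.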
12
Heterogeneity among Domains
If interoperation involves distinct domains, mismatch ensues
Autonomy conflicts with consistency:
– Local Needs have Priority
– Outside uses are a Byproduct
Heterogeneity must be addressed:
– Platform and Operating Systems ✓✓
– Representation and Access Conventions ✓
– Naming and Ontology :
13
Semantic Mismatches
Information comes from many autonomous sources
Differing viewpoints (by source)
– differing terms for similar items: { lorry, truck }
– same terms for dissimilar items: trunk (luggage, car)
– differing coverage: vehicles (DMV, AIA)
– differing granularity: trucks (shipper, manuf.)
– different scope: student museum fee, Stanford
Hinders use of information from disjoint sources
– missed linkages: loss of information, opportunities
– irrelevant linkages: overload on user or application program
Poor precision when merged: OK for web browsing, poor for business
14
Proposed Solutions
Specify and standardize terminology usage: ontology
Globally, for all interacting sources
– wonderful for users and their programs
– long time to achieve: 2 sources (UAL, BA), 3 (+ trucks), 4, … all?
– costly maintenance, since all sources evolve
– who has the authority to dictate conformance?
Domain-specific (XML DTD assumption)
– small, focused, cooperating groups
– high quality; some examples: genomics, arthritis, Shakespeare plays
– allows sharable, formal tools
– ongoing, local maintenance affecting users: annual updates
– poor interoperation; users still face inter-domain mismatches
Solves only part of the problem
15
Domains and Consistency
– a domain will contain many objects
– the object configuration is consistent within a domain
– all terms are consistent, and relationships among objects are consistent
– context is implicit
No committee is needed to forge compromises * within a domain
* Compromises hide valuable details
[Diagram labels] Domain; Ontology
16
Objective of Scalable Knowledge Composition (SKC)
Provide for Maintainable Application Ontologies
– devolve maintenance onto many domain-specific experts / authorities
– provide an algebra to compute composed ontologies that are limited to their articulation terms
– enable interpretation within the source contexts
17
Sample Operation: INTERSECTION
Source Domain 1: owned and maintained by Store
Source Domain 2: owned and maintained by Factory
Articulation: result contains shared terms, useful for purchasing
18
Tools to create articulations
Graph matcher for the Articulation-creating Expert
[Diagram labels] Vehicle ontology; Transport ontology; Suggestions for articulations
19
Continue from initial point
Tool suggests terms for further articulation:
– by spelling similarity
– by graph position
– by term match nexus
Expert response:
1. Okay
2. False
3. Irrelevant to this articulation
All results are recorded
Okays are converted into articulation rules
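The suggest-and-confirm loop on this slide can be sketched for the spelling-similarity case only. This is not the tool described in the talk: the similarity measure (Python's difflib ratio), the threshold, the term lists, and the function names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

def suggest_matches(terms_a, terms_b, threshold=0.7):
    """Suggest term pairs whose spelling similarity reaches the threshold."""
    suggestions = []
    for a in terms_a:
        for b in terms_b:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                suggestions.append((a, b, score))
    # Most similar pairs first, so the expert sees the best candidates early.
    return sorted(suggestions, key=lambda s: s[2], reverse=True)

def record_responses(suggestions, responses):
    """Log every expert verdict; keep only the 'okay' pairs as articulation rules."""
    log = [(a, b, responses.get((a, b), "unreviewed")) for a, b, _ in suggestions]
    rules = [(a, b) for a, b, verdict in log if verdict == "okay"]
    return log, rules

# Hypothetical terms from a vehicle ontology and a transport ontology.
vehicle_terms = ["Truck", "Car", "Trunk"]
transport_terms = ["Trucks", "Cargo", "Lorry"]

suggestions = suggest_matches(vehicle_terms, transport_terms)
expert = {
    ("Truck", "Trucks"): "okay",
    ("Car", "Cargo"): "false",
    ("Trunk", "Trucks"): "false",
}
log, rules = record_responses(suggestions, expert)
```

Spelling similarity alone proposes look-alikes such as Car/Cargo that the expert has to reject, and it never proposes synonyms such as Truck/Lorry, which is where graph position and a term match nexus would have to contribute.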
20
Candidate Match Nexus
Term linkages automatically extracted from Webster's (freely available) / Oxford dictionary (restricted)
Based on processing headwords and definitions using algebra primitives
Notice presence of 2 domains: chemistry, transport
21
Using the Match Nexus
Experiment on government structures of NATO countries: the SKEIN system resolved over 70% of unmatched terms
22
Using the Match Nexus
23
An Ontology Algebra
A knowledge-based algebra for ontologies
The Articulation Ontology (AO) consists of matching rules that link domain ontologies
– Intersection: create a subset ontology; keep sharable entries
– Union: create a joint ontology; merge entries
– Difference: create a distinct ontology; remove shared entries
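A deliberately simplified sketch of the three operations follows. It assumes each ontology is a flat set of terms (the talk works with richer ontology graphs) and that the articulation ontology is a set of matching rules given as term pairs; the function names and the store/factory terms are hypothetical.

```python
def intersection(onto_a, onto_b, rules):
    """Subset ontology: the term pairs linked by a matching rule."""
    return {(a, b) for a, b in rules if a in onto_a and b in onto_b}

def difference(onto_a, onto_b, rules):
    """Distinct ontology: terms of onto_a with the shared entries removed."""
    shared = {a for a, _ in intersection(onto_a, onto_b, rules)}
    return onto_a - shared

def union(onto_a, onto_b, rules):
    """Joint ontology: every term of both, with linked terms merged into one entry."""
    linked = intersection(onto_a, onto_b, rules)
    merged = {frozenset(pair) for pair in linked}
    only_a = {frozenset([t]) for t in difference(onto_a, onto_b, rules)}
    only_b = {frozenset([t]) for t in onto_b - {b for _, b in linked}}
    return merged | only_a | only_b

# Hypothetical source ontologies, prefixed to keep their contexts apart.
store = {"store:truck", "store:item", "store:shelf"}
factory = {"factory:lorry", "factory:product", "factory:assembly line"}
# Hypothetical matching rules forming the articulation ontology.
rules = {("store:truck", "factory:lorry"), ("store:item", "factory:product")}

shared = intersection(store, factory, rules)
local_only = difference(store, factory, rules)
joint = union(store, factory, rules)
```

Because each operation takes ontologies and rules and returns a plain set, results can be fed into further operations, and they can be recomputed from the recorded rules when a source changes.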
24
INTERSECTION support
[Diagram labels] Store Ontology; Factory Ontology; Articulation ontology
Matching rules that use terms from the 2 source domains
Terms useful for purchasing
25
Other Basic Operations
Typically prior intersections
– UNION: merging entire ontologies
– DIFFERENCE: material fully under local control
[Diagram label] Articulation ontology
26
Features of an algebra
– Operations can be composed
– Operations can be rearranged
– Alternate arrangements can be evaluated
– Optimization is enabled
– The record of past operations can be kept and reused when sources change
27
Knowledge Composition
[Diagram]
Knowledge resources: A, B, C, D, E
Articulation knowledge for: (A ∩ B), (B ∩ C), (C ∩ D), (C ∩ E)
Composed knowledge for applications using A, B, C, E: (A ∩ B) ∪ (B ∩ C) ∪ (C ∩ E)
Legend: ∪ : union, ∩ : intersection
28
SKC Primitive Operations
Model and Instance
Unary
– Summarize: abstract
– Glossarize: list terms
– Filter: reduce instances
– Extract: move into context
Binary
– Match: data corroboration
– Difference: distance measure
– Intersect: use of articulation
– Union: search broadening
Constructors: create object; create set
Connectors: match object; match set
Editors: insert value; edit value; move value; delete value
Converters: object - value; object indirection; reference indirection
29
Exploiting the result
– Processing & query evaluation is best performed within Source Domains & by their engines
– Result has links to source
– Avoid the n² problem of interpreter mapping
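The n² problem mentioned here can be made concrete with a small count. The sketch assumes one interpreter per ordered pair of sources, which is an illustrative assumption and not a figure from the talk; pairwise_interpreters is a hypothetical name.

```python
def pairwise_interpreters(n_sources):
    """Interpreters needed if every source is mapped directly to every other one."""
    return n_sources * (n_sources - 1)

# The count grows quadratically with the number of sources.
for n in (2, 5, 10, 50):
    print(f"{n:>3} sources -> {pairwise_interpreters(n):>5} direct interpreters")
```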
30
Domain Specialization
Knowledge Acquisition (20% effort) & Knowledge Maintenance (80% effort *) to be performed by:
– Domain specialists
– Professional organizations
– Field teams of modest size
Empowerment: autonomously maintainable
* based on experience with software
31
Summary
To sustain the growth of web usage:
1. The value of the results has to keep increasing: precision, relevance; not volume, nor recall
2. Value is provided by experts, encoded as models of diverse resources, customers
Problems to be addressed: mismatches; quality; maintenance
+ Tools for these tasks
Clear, scalable models
32
Acknowledgments
Supported by AF Office of Scientific Research
– New World Vistas program
Participants
– David Maluf, postdoc, PhD EE, McGill Univ., 1997.
– Jan Jannink, PhD candidate, CS, grad. June 2000?
– Shrish Agarwal, MS graduate, CS, 1999.
– Prasenjit Mitra, PhD candidate, EE, grad. 2001?
– Martin Kerstens, PhD, summer visitor from CWI.
– Stefan Decker, postdoc, PhD Univ. Karlsruhe 1999.
33
Seminar Course on Intelligent Information Systems
April-June 2000, at 14:15 - 15:15, room ?
Presentations in English -- but I'll try to manage discussions in French and/or German.
I plan to cover the material in an integrating fashion, drawing from concepts in databases, artificial intelligence, software engineering, and business principles.
1. 13/4 Historical background, enabling technology: ARPA, Internet, DB, OO, AI, IR, XML.
2. 27/4 Search engines and methods (recall, precision, overload, semantic problems).
3. 4/5 Digital libraries, information resources. Value of services, copyright.
4. 11/5 E-commerce. Client-servers. Portals. Payment mechanisms, dynamic pricing.
5. 19/5 Mediated systems. Functions, interfaces, and standards. Intelligence in processing. Role of humans and automation, maintenance.
6. 26/5 Software composition. Distribution of functions. Parallelism. [ww D.Beringer]
7. 31/5 Application to Bioinformatics.
8. 15/6 Educational challenges. Expected changes in teaching and learning.
9. 22/6 Privacy protection and security. Security mediation.
10. 29/6 Summary and projection for the future.
Feedback and comments are appreciated.