1
Content Management and the role of taxonomies
Judith Molka-Danielsen
Oct. 13, 2003
2
Primary Challenges for Content Management Systems
Heterogeneous Data Sources
- Goal: create a normalized representation of the data so that it is equally readable by humans and machines.
- RDBMS: retrieving data requires programmatic access (ODBC, SQL).
- HTML files: tagged text that mixes stylistic and structural information. Browsers interpret the same code in different ways, which confuses automated programs even though human readers cope.
- Word processing and document applications (Word, Acrobat): binary data is converted to text by a proprietary interpreter and an associated viewer. We want these viewers to interoperate with other formats.
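A minimal sketch of what a normalized representation could look like, assuming a common record with `source`, `kind` and `text` fields. The field names, the `notes` table and the sample page are illustrative assumptions, not part of the original slides.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def from_html(source_id, html):
    """Tagged text: strip the markup and keep the readable content."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"source": source_id, "kind": "html", "text": " ".join(parser.chunks)}


def from_rdbms(conn, query):
    """Programmatic access: rows only become readable text after a query."""
    rows = conn.execute(query).fetchall()
    return [{"source": f"db:{row[0]}", "kind": "rdbms", "text": row[1]} for row in rows]


# Hypothetical in-memory database standing in for an RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES ('Container 17 loaded in Molde')")

records = from_rdbms(conn, "SELECT id, body FROM notes")
records.append(
    from_html("page-1", "<html><body><h1>Schedule</h1><p>Ship departs <b>Monday</b></p></body></html>")
)
```

Both sources end up as records of the same shape, so a downstream indexer or classifier does not need to know where the text came from.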
3
Primary Challenges for Content Management Systems
Distribution of Data Sources
- Access involves protocols (HTTP, HTTPS, FTP, SCP, …) that can pass through firewalls.
- Business applications still need security and must limit views to selected individuals and groups.
- Additional technologies (XML, IIOP, SOAP and Web Services) are being used to build tools for integrating systems.
- Delivering messages to components over HTTP requires a messaging protocol. The Simple Object Access Protocol (SOAP), whose messages are written in XML, is emerging as that protocol.
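To illustrate that a SOAP message is just XML, here is a small sketch that builds a SOAP 1.1 envelope with the Python standard library. The application namespace `urn:example:content` and the `GetDocument` operation are hypothetical; only the envelope namespace comes from the SOAP 1.1 specification.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example:content"  # hypothetical application namespace


def build_request(doc_id):
    """Return a SOAP envelope (as a string) asking for one document."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}GetDocument")  # hypothetical operation
    ET.SubElement(call, f"{{{APP_NS}}}id").text = doc_id
    return ET.tostring(envelope, encoding="unicode")


message = build_request("doc-42")
```

In practice the resulting string would be sent as the body of an HTTP POST to the service endpoint, which is why it passes through firewalls that already allow web traffic.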
4
Primary Challenges for Content Management Systems
Identifying distributed data sources: distributed directories and protocols
- The Domain Name System (DNS) is a hierarchically distributed directory of names (home.himolde.no) and IP addresses.
- The X.500 directory service is a hierarchically distributed directory of objects. Object attribute-value pairs can be stored and looked up.
- LDAP is a protocol for accessing a directory service.
- Most visions of the Web imagine “federated” servers that help find objects.
- UDDI is one protocol for advertising and discovering services.
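The idea of looking up objects by attribute-value pairs in a hierarchical directory can be sketched with a toy in-memory directory. This is not an LDAP client; the distinguished names and attributes below are invented for illustration.

```python
# Hypothetical directory: distinguished name -> attribute-value pairs.
DIRECTORY = {
    "cn=Report Server,ou=Services,o=Example": {"objectClass": "service", "protocol": "http"},
    "cn=Archive,ou=Services,o=Example": {"objectClass": "service", "protocol": "ftp"},
    "cn=Ada,ou=People,o=Example": {"objectClass": "person", "mail": "ada@example.org"},
}


def search(base_dn, **wanted):
    """Return the names under base_dn whose attributes match every wanted pair."""
    hits = []
    for dn, attrs in DIRECTORY.items():
        in_subtree = dn == base_dn or dn.endswith("," + base_dn)
        if in_subtree and all(attrs.get(key) == value for key, value in wanted.items()):
            hits.append(dn)
    return sorted(hits)


http_services = search("ou=Services,o=Example", protocol="http")
```

The hierarchy lives in the names themselves: narrowing `base_dn` restricts the search to one branch of the tree, much as a DNS lookup descends from the root to a specific zone.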
5
The Web Today
[Diagram: Web Server, Client, DN Server. Steps: 1. Location Lookup, 2. Object Request]
6
The Web with Object Directories
[Diagram: Web Server, Client, DN Server, LDAP Server, Web Server. Steps: 1. Registration, 2. Attribute/Value Request and Object/Location Response, 3. The Rest]
7
Primary Challenges for Content Management Systems
Data Size and the Relevance Factor
- Large repositories, such as the WWW, need a system to drill down to subsets of relevant information.
- Speed and automation are critical. (Find not just more results, but better ones.)
- Find a particular needle in a haystack with a billion needles.
- Find all the needles that are similar to a needle that has already been discovered.
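Finding "needles similar to a known needle" can be sketched as a similarity search. The example below assumes Jaccard similarity over word sets, which is one simple choice among many and is not named in the slides; the sample corpus is invented.

```python
def tokens(text):
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(seed, corpus, top=2):
    """Return up to `top` documents that share vocabulary with the seed."""
    seed_tokens = tokens(seed)
    scored = [(jaccard(seed_tokens, tokens(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep corpus order
    return [doc for score, doc in scored[:top] if score > 0]


corpus = [
    "container ship arrives in port",
    "rose garden opens in spring",
    "ship carries container to port",
]
similar = most_similar("container ship leaves port", corpus)
```

A linear scan like this does not scale to a billion documents; it only shows the ranking idea that an index would need to reproduce.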
8
What can help? Semantic web technology
- XML and the Resource Description Framework (RDF) will allow XML tags to be labeled in conjunction with a referential knowledge representation.
- Machine-based inference engines should replace today's search engines.
- New editors are needed to infuse semantic information into content easily, just as some editors let users who do not know HTML syntax create web pages.
9
Syntactic Integration
10
Structural Integration
11
Semantic Integration
12
RDF
- RDF provides a simple data model for expressing statements as (subject, predicate, object) triples, and an associated serialization syntax in XML.
- All three elements of the triple can be defined within the current document or refer to another resource on the Web.
- As an example of RDF applied in a logistics context, we model three entities: ship, container and item.
13
RDF in use
- In RDF we can express relations between entities, such as: a ship transports a container, and a container contains an item.
- These relations can, but need not, be hierarchical. For example, a business can be the owner of the transported item and at the same time the user of the container.
- These relations can change over time: ownership moves from one business to another, and containers move from ships to trucks for further transportation.
- These transitions may trigger events, such as financial transactions or notifications.
- An ontology can be used to define all the concepts used in a certain schema (or set of schemas), together with their meaning.
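The ship, container and item example can be sketched as a small set of triples in plain Python. This is an illustration of the data model only, not an RDF library; the `http://example.org/logistics#` namespace and the identifiers such as `ship1` and `acme` are hypothetical.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])
EX = "http://example.org/logistics#"  # hypothetical namespace

graph = {
    Triple(EX + "ship1", EX + "transports", EX + "container7"),
    Triple(EX + "container7", EX + "contains", EX + "item42"),
    Triple(EX + "acme", EX + "owns", EX + "item42"),        # non-hierarchical relation
    Triple(EX + "acme", EX + "uses", EX + "container7"),    # same business, second role
}


def match(graph, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every part that is not None."""
    return sorted(
        t for t in graph
        if subject in (None, t.subject)
        and predicate in (None, t.predicate)
        and obj in (None, t.object)
    )


def move_container(graph, container, old_carrier, new_carrier):
    """Relations change over time: replace the 'transports' statement."""
    graph.discard(Triple(old_carrier, EX + "transports", container))
    graph.add(Triple(new_carrier, EX + "transports", container))


move_container(graph, EX + "container7", EX + "ship1", EX + "truck3")
carriers = [t.subject for t in match(graph, predicate=EX + "transports")]
```

A real system would hook notifications or financial transactions onto a function like `move_container`, since that is where the transition happens.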
14
Components of Semantic Technology
- Classification
- Metadata
- Ontologies (taxonomies)
15
Classification
- General keyword searches lead to many irrelevant results.
- An automatic classification system could, for example, divide 1000 stories into 5 categories so that keyword searches within a category are more relevant.
- Techniques for classification:
  - Statistical analysis and pattern matching
  - Rule-based methods
  - Linguistic analysis
  - Bayesian theory (probabilistic)
  - Ontology driven: name-entity and domain-phrase recognition
  - Committee-based approaches, which combine various techniques
- Classification is more precise if documents are tagged with metadata and conform to a predetermined schema.
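As one concrete instance of the Bayesian approach listed above, here is a minimal multinomial naive Bayes sketch with Laplace smoothing. The two categories and the training sentences are invented; a real system would train on far more data.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for label, text in labelled_docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace smoothing keeps unseen words from zeroing the product.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set.
model = train([
    ("shipping", "ship container port cargo"),
    ("shipping", "cargo ship delayed at port"),
    ("finance", "stock price and ticker symbol"),
    ("finance", "bank reports quarterly price"),
])
label = classify(model, "container ship at port")
```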
16
Metadata
Data about the data
Levels of metadata:
- Syntactic
- Structural
- Semantic
17
Syntactic Metadata
- General information that offers little for determining context: document size, location, date of creation, ...
- Used in:
  - Assessment of the document’s relevance
  - Version tracking
  - User-level access policies
- Email and documents in file systems carry this information.
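For a document in a file system, this kind of metadata can be read without opening the content at all. The sketch below uses `os.stat`; note that it reports the modification time, since a creation time is not available portably, and the dictionary keys are illustrative names.

```python
import os
import tempfile
from datetime import datetime, timezone


def syntactic_metadata(path):
    """Collect size, location and a timestamp without reading the content."""
    info = os.stat(path)
    return {
        "location": os.path.abspath(path),
        "size_bytes": info.st_size,
        # Modification time; creation time is platform dependent.
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
    }


# Create a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("quarterly report")

meta = syntactic_metadata(handle.name)
os.remove(handle.name)
```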
18
Structural Metadata
- Information about the structure of content
- Varies widely with document type
- XML allows creators to enclose content within meaningful tags.
- Can make associations between content from multiple documents.
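A small sketch of how meaningful tags make associations across documents possible: two XML documents with different layouts are linked because both enclose the same value in a `company` tag. The documents and tag names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents with different structures.
DOC_A = "<report><company>Acme</company><summary>Record shipments.</summary></report>"
DOC_B = "<news><headline>Port strike</headline><company>Acme</company></news>"


def tagged_values(xml_text):
    """Map each leaf tag name to the set of text values it encloses."""
    values = {}
    for element in ET.fromstring(xml_text).iter():
        if len(element) == 0 and element.text:
            values.setdefault(element.tag, set()).add(element.text.strip())
    return values


def shared(xml_a, xml_b, tag):
    """Values that appear under the same tag in both documents."""
    a, b = tagged_values(xml_a), tagged_values(xml_b)
    return a.get(tag, set()) & b.get(tag, set())


common_companies = shared(DOC_A, DOC_B, "company")
```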
19
Semantic Metadata Semantic Metadata is “data which may be associated explicitly or implicitly with a given piece of content (such as a document) and whose relevance for that content is determined by its ontological position (its context) within one or more domains of knowledge.”
20
Semantic Metadata
- Metadata receives its contextual information from a reference knowledge base.
- Metadata that is extracted from any document may be stored as a snapshot of that document’s relevant information.
- The metadata contained within this snapshot simply references the instances of name-entities, which are stored in the ontology.
- Each name-entity has related information stored: synonyms, attributes, related entities.
21
Semantic Metadata
Documents can link to each other in several ways:
- Explicit metadata: documents that mention exactly the same metadata.
- Implicitly related metadata: documents that contain synonyms or hierarchically related name-entities.
- Ontological associations: links through name-entity associations, for example when one document mentions a company name while another mentions its ticker symbol.
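The company-name and ticker-symbol case can be sketched as follows, assuming a tiny reference knowledge base in which each name-entity stores its alternative names and its ticker. The entity `ExampleCorp` and the ticker `EXC` are made up.

```python
import re

# Hypothetical reference knowledge base: name-entity -> related information.
ONTOLOGY = {
    "ExampleCorp": {"names": ["example corp", "example corporation"], "ticker": "EXC"},
}


def entities_in(text):
    """Return the name-entities whose name or ticker occurs as whole words."""
    lowered = text.lower()
    found = set()
    for entity, info in ONTOLOGY.items():
        forms = info["names"] + [info["ticker"].lower()]
        if any(re.search(r"\b" + re.escape(form) + r"\b", lowered) for form in forms):
            found.add(entity)
    return found


def ontologically_related(doc_a, doc_b):
    """Entities that both documents refer to, under any surface form."""
    return entities_in(doc_a) & entities_in(doc_b)


link = ontologically_related(
    "Example Corporation opened a new terminal.",
    "Shares of EXC rose 3% today.",
)
```

The two documents share no keyword, yet they are linked because both resolve to the same entity in the knowledge base.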
22
Standards
DCML defines a generic element set that is not specific to any domain of knowledge. It can be used as a top domain.
23
Forms of knowledge representation
- Dictionary: terms are the keys and definitions are the values. There are no links between terms.
- Thesaurus: includes antonyms and synonyms. The pieces of knowledge are linked.
- Taxonomy: includes etymological information (derivation), and synonyms are organized hierarchically (inheritance). Flower is a subclass of plant. But a rose may also be related to love: associations may be emotional, cultural or temporal. Relevant associations can be discovered by a data-analysis system that uses a reference knowledge base.
- Ontology: the labeling of the relationships in the taxonomy.
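The difference between a taxonomy's hierarchy and labeled associations can be sketched with two small structures. The relation label `symbolizes` is an invented example of a cultural association; the slides only say that a rose "may be related to love".

```python
# Taxonomy: every link is the same unlabeled "is a subclass of" relation.
PARENT = {"rose": "flower", "flower": "plant", "tree": "plant"}

# Ontology-style statements: each relationship carries its own label.
RELATIONS = {("rose", "symbolizes", "love")}  # hypothetical cultural association


def is_a(term, ancestor):
    """Walk up the hierarchy; True if ancestor is reached (inheritance)."""
    while term in PARENT:
        term = PARENT[term]
        if term == ancestor:
            return True
    return False


def related(term):
    """Labeled, non-hierarchical associations of a term."""
    return sorted((label, target) for source, label, target in RELATIONS if source == term)


rose_is_plant = is_a("rose", "plant")
```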
24
Types of Metadata
25
Ontology Description Languages
- Knowledge model building in a given domain is subjective.
- This causes problems when combining independently developed ontologies.
- The Resource Description Framework (RDF) and RDF Schema (RDF-S) data model tries to address this:
  - Resource: an item of interest at the atomic level, such as an entity, concept or document. Each resource is uniquely identified by a URI.
  - Properties: descriptive characteristics and attributes of a resource. They may be associative, relating one resource to another.
  - Statement: what is known as an RDF triple. It contains a reference to a resource, a property name, and that property’s value. These identifiers take the form of link addresses.
26
Ontology Description Languages
- RDF-S (specification for ontology modeling): http://www.w3.org/TR/2000/CR-rdf-schema-20000327/
- Dublin Core Metadata Initiative: http://dc2003.ischool.washington.edu/program.html
- DARPA Agent Markup Language + Ontology Inference Layer (DAML+OIL) expands on RDF-S. Classes are defined as elements and can be related to other classes by disjunction, union or equality.
- The W3C has a Web Ontology Language (OWL) that builds on DAML+OIL.
27
Meta-data Interpretation
DAML (DARPA) endeavors to interpret a simple ontology to infer information about resources. Put very simply:
- If people have names
- If students are people
- If resource X is about a student
- Then resource X should have a name
This kind of inference could easily be constructed within the context of an object-oriented directory.
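The student example can be written out as a tiny inference over a class hierarchy. This is a sketch of the reasoning only, not DAML syntax; the class names, the `enrolled_in` property and its value are illustrative.

```python
# "Students are people": subclass links.
SUBCLASS_OF = {"Student": "Person"}

# "People have names": properties expected of every member of a class.
EXPECTED_PROPERTIES = {"Person": {"name"}}


def expected_properties(cls):
    """Collect expected properties from the class and all of its ancestors."""
    props = set()
    while cls is not None:
        props |= EXPECTED_PROPERTIES.get(cls, set())
        cls = SUBCLASS_OF.get(cls)
    return props


# "Resource X is about a student" (hypothetical description).
resource_x = {"type": "Student", "properties": {"enrolled_in": "IS-101"}}

# Inference: X should have a name, even though its description lacks one.
missing = expected_properties(resource_x["type"]) - set(resource_x["properties"])
```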
28
Schema Interpretation and Integration
Consider two sets of resources:
- For set A, the attributes are structured in accord with the kind of metadata described on the previous slide.
- Imagine the same for set B, but using different attribute names and values.
- Accept that the attribute-value pairs are called resource descriptions, and that a document called a resource description schema defines the relations for each set.
- Imagine that the two schemas are related through a third schema.
- Finally, imagine an engine that relates resources in set A to resources in set B based on schema-level inferences.
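A minimal sketch of relating two resource sets through a third schema, assuming each local schema simply maps its own attribute names onto shared terms. The attribute names and the shared terms `creator` and `title` are invented for the example.

```python
# Each local schema maps its attribute names onto terms of a shared third schema.
SCHEMA_A = {"author": "creator", "headline": "title"}
SCHEMA_B = {"written_by": "creator", "name": "title"}


def normalize(description, schema):
    """Rewrite a resource description using the shared terms."""
    return {schema[key]: value for key, value in description.items() if key in schema}


def related_on(desc_a, desc_b):
    """Shared terms on which a set-A and a set-B resource agree."""
    a, b = normalize(desc_a, SCHEMA_A), normalize(desc_b, SCHEMA_B)
    return {term for term in a.keys() & b.keys() if a[term] == b[term]}


# Hypothetical resource descriptions from the two sets.
resource_a = {"author": "J. Smith", "headline": "Port strike ends"}
resource_b = {"written_by": "J. Smith", "name": "Quarterly cargo volumes"}
agreement = related_on(resource_a, resource_b)
```

The engine never compares `author` with `written_by` directly; it only compares values after both have been expressed in the third schema.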
29
The Semantic Web Vision
[Diagram: Web Server, Client, DN Server, LDAP Server, Web Server, LDAP Server, Schema Server. Steps: 1. Schema Registration, 2. Description Association, 3. Object Query, 4. Inferencing, 5. The Rest]
30
Sample Knowledge Bases
- WordNet is a networked thesaurus, developed at Princeton, in the form of a lexical matrix. It maps word forms to word meanings in a many-to-many relationship. The set of word forms that share one meaning is a synset. It is not an ontology because it does not contain the real-world information required in labeled relationships, such as: a “branch” is an administrative division with a chairman above it.
- Open Directory Project: http://www.dmoz.org/
- The National Library of Medicine has an ontology system, the Unified Medical Language System (UMLS), with researchers and institutions contributing to it: http://www.nlm.nih.gov/research/umls/
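The many-to-many mapping of a lexical matrix can be sketched as a set of (word form, meaning) pairs, treating a synset as the set of word forms that share one meaning (the usual WordNet sense). The entries are a hand-made miniature and are not taken from WordNet itself.

```python
# Hypothetical miniature lexical matrix: (word form, meaning) pairs.
LEXICAL_MATRIX = {
    ("branch", "limb of a tree"),
    ("bough", "limb of a tree"),
    ("branch", "administrative division"),
    ("division", "administrative division"),
}


def meanings(word_form):
    """One word form can carry several meanings."""
    return sorted(meaning for form, meaning in LEXICAL_MATRIX if form == word_form)


def synset(meaning):
    """One meaning can be expressed by several word forms."""
    return sorted(form for form, m in LEXICAL_MATRIX if m == meaning)


branch_meanings = meanings("branch")
```

Nothing here says that a division has a chairman above it; that kind of labeled, real-world relationship is what would turn the thesaurus into an ontology.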
31
Toolkits should provide for:
- Establishing configurable parameters
- Extraction agents and classifier modules
- Training: the system should accept training sets of data and learn from patterns, so that future items are classified without a manual trigger
- An easily navigable visual environment
- Tracking the date and time of data entry
ROADS provides tools for creating subject gateways: http://www.ilrt.bristol.ac.uk/roads/
32
Extracting Wrapper Technologies
- WysiWyg Web Wrapper Factory (W4F): crawls and retrieves data from web pages to create wrappers that represent the content of the pages.
- ANDES: uses XPath rules.
- XWRAP toolkit: has interactive rule formulation.
- S-CREAM (semiautomatic creation of metadata): lets the user annotate documents.
- Ontoprise (product by Semagix): http://www.ontoprise.com
BUT, an ontology-driven classifier and domain-specific metadata annotator allows searching on classification by keyword AND on implied entity association. (SEE example on next slide.)
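The idea of a wrapper driven by XPath rules can be sketched with the limited XPath subset in Python's `xml.etree.ElementTree`. This does not reproduce the API of W4F, ANDES or XWRAP; the page, the rule names and the paths are invented, and the page is assumed to be well-formed XHTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed page to be wrapped.
PAGE = """<html><body><table>
<tr class="item"><td class="name">Container</td><td class="price">120</td></tr>
<tr class="item"><td class="name">Pallet</td><td class="price">15</td></tr>
</table></body></html>"""

# Extraction rules: field name -> path relative to one row.
RULES = {"name": "td[@class='name']", "price": "td[@class='price']"}


def wrap(page, row_path, rules):
    """Apply the rules to every matching row and return structured records."""
    root = ET.fromstring(page)
    return [
        {field: row.find(path).text for field, path in rules.items()}
        for row in root.findall(row_path)
    ]


rows = wrap(PAGE, ".//tr[@class='item']", RULES)
```

Real web pages are rarely well-formed, so a production wrapper would first repair the markup with a tolerant HTML parser before applying rules like these.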
34
Semagix Visualizer – is a visualization tool for viewing an ontology or schema.
35
Related References
- The E-Speak Initiative at the University of Pittsburgh: http://bazaar.sis.pitt.edu/
- E-Speak Overview: http://bazaar.sis.pitt.edu/es_ppt_over/AIntrotoESpeak_files/frame.htm
- E-Speak Revised: http://bazaar.sis.pitt.edu/es_ppt_over/AESpeakRevisited_files/frame.htm
- Oracle9i Data Mining Concepts
- Oracle9i AS Personalization, which is used to build data mining models