1 The BT Digital Library A case study in intelligent content management Paul Warren
2 Semantics in content management; the starting point; limitations of conventional technology; enhancing the experience; the users’ view; using the technology
3 Semantics in content management Intelligent content management
4 The need for semantics Content management systems need to: index by meaning, not just text combine information from heterogeneous sources Users need information: identified by semantics, not just keywords precise and complete selected by their interests and their task context defined semantically from heterogeneous sources, accessed uniformly semantics in content management
5 Higher precision, greater recall Precision Find me information about Washington the man, not the state or city Find me information about a company called X which operates in industry Y Recall Finding all relevant documents E.g. ask for information about ‘George W Bush’ and be given documents on ‘the President’ semantics in content management
6 Interests and context Need information about Jaguar? interested in cars, the natural world, South America … with a context defined by current activities Not just about searching interest & context to share information … … and to push information to user … plus many integrated applications semantics in content management
7 Too much relevant information Documents with duplicate information. Goal to: extract what is unique from each document help users prioritise their reading Need to: aggregate from disparate sources remove duplication present meaningfully classified summarised semantics in content management
8 The starting point The BT digital library before SEKT
9 The BT digital library the starting point Two major document databases 5 million articles – abstracts plus some full text Originally text-based with some attribute-based querying: e.g. author, date information spaces defined by queries
10 An information space the starting point Query-defined alerts e-mailed weekly as database updated Public info spaces anyone can subscribe forming communities Private info spaces defined by user
11 Personalisation the starting point Personalised entry page shows user’s info spaces, journals of interest, recent reading and ‘jottings’ (bookmarks)
12 Limitations of conventional technology Why we need semantics
13 Queries Text string ‘knowledge management’: 4161 ABI Inspec records Descriptor ‘knowledge management’: 3213 ABI Inspec records So careful query formulation needed … … but average query length is 1.8 words Little use of ‘advanced’ functions … … 80% of queries use no query modifier limitations of conventional technology
14 Poor relevancy of results A simple keyword search tends to offer high recall and low precision. Ambiguity in the query, e.g. synonymy, where several terms could describe the same concept; homonymy, where a word has many different meanings. Diagram of relevant and retrieved documents: |A| relevant documents retrieved; |B| non-relevant documents retrieved; |C| non-relevant documents not retrieved; |D| relevant documents not retrieved. Recall = |A|/(|A|+|D|) (proportion of relevant documents retrieved) Precision = |A|/(|A|+|B|) (proportion of retrieved documents that are relevant) limitations of conventional technology
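The two ratios on this slide can be checked with a few lines of Python. This is a minimal sketch, not part of the BT Digital Library: the function name `precision_recall` and the document identifiers are made up for illustration. It treats |A| as the relevant documents retrieved, |B| as the non-relevant documents retrieved, and |D| as the relevant documents that were missed, which is what the recall formula implies.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant = set(relevant)
    retrieved = set(retrieved)

    a = len(relevant & retrieved)   # |A|: relevant documents retrieved
    b = len(retrieved - relevant)   # |B|: non-relevant documents retrieved
    d = len(relevant - retrieved)   # |D|: relevant documents not retrieved

    # Guard against empty result sets or empty relevance judgements.
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + d) if (a + d) else 0.0
    return precision, recall


# Hypothetical example: four relevant documents, six retrieved, two in common.
relevant_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = {"d1", "d2", "d5", "d6", "d7", "d8"}
precision, recall = precision_recall(relevant_docs, retrieved_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case only two of the six retrieved documents are relevant, so precision is one third, while two of the four relevant documents are found, so recall is one half.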
15 Presenting results Searches Only 17% results read after 1st page … no more than 10 results checked Same query, same results regardless of user’s preference & context Document descriptors Lots – many irrelevant to readership Where relevant, not fine-grained e.g. knowledge management limitations of conventional technology
16 Enhancing the experience What semantics can offer a digital library
17 A new experience enhancing the experience Hybrid searching concepts, instances, information spaces, and text search results meaningfully classified Automatic annotation identifying companies, people, … hyperlinked to a knowledgebase Topics – finer grained than document descriptors semi-automatically generated automatic document classification An extended corpus crawling the Web for related pages Web pages added to share knowledge
18 A better experience Semantics to improve precision & recall Washington the man, not city or state references to the President not just George W Bush Information spaces defined on semantic queries not just text queries Taking account of interests and context semantically defined Natural language results enhancing the experience
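A toy comparison shows why searching on annotated entities can help both measures. This is only a sketch under stated assumptions: the documents, the entity identifiers such as `person:george_w_bush`, and the functions `keyword_search` and `entity_search` are all hypothetical, and the annotations are assumed to have been produced beforehand by some automatic annotation step. It is not the library's actual implementation.

```python
# Hypothetical corpus: each document carries its text plus entity annotations
# assumed to come from a prior automatic annotation step.
DOCS = {
    "doc1": {"text": "George W Bush signed the bill.",
             "entities": {"person:george_w_bush"}},
    "doc2": {"text": "The President spoke in Washington.",
             "entities": {"person:george_w_bush", "city:washington_dc"}},
    "doc3": {"text": "Washington crossed the Delaware.",
             "entities": {"person:george_washington"}},
    "doc4": {"text": "Rainfall in Washington state was high.",
             "entities": {"state:washington"}},
}


def keyword_search(term):
    """Plain text search: case-insensitive substring match on the text."""
    term = term.lower()
    return sorted(d for d, doc in DOCS.items() if term in doc["text"].lower())


def entity_search(entity_id):
    """Semantic search: match on the annotated entity, not the wording."""
    return sorted(d for d, doc in DOCS.items() if entity_id in doc["entities"])


# Recall: the text query misses the document that only says "The President".
print(keyword_search("George W Bush"))
print(entity_search("person:george_w_bush"))

# Precision: the text query mixes the man, the city and the state.
print(keyword_search("Washington"))
print(entity_search("person:george_washington"))
```

The design choice illustrated here is that disambiguation happens once, at annotation time, so the query can name a concept or instance rather than a string.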
19 The users’ view What users want
20 Initial questionnaire & focus group Users want: Improved searching and indexing based on a user’s profile integrated into working environment To stay in control advise but not decide frustrated by too many alerts the users’ view
21 Features – what the users think very important / important summarising results of search with personal interests and preferences advanced attribute-based search looking beyond the library suggesting candidate topic areas highlighting & hyperlinking named entities natural language queries the users’ view
22 After that … Important / minor importance retrieving similar articles re-using old queries agent searches access from a range of devices the users’ view
23 Using the technology Applying semantics to the BT Digital Library
24 Search: knowledge management using the technology knowledge management as: info space topic term With clustered results
25 A complex query using the technology microsoft 2 companies term semantic web info space topic term sem web info space Microsoft-authored Microsoft as term
26 Querying a concept alloy a term but also - concept in ontology … with properties … definition … sub-concepts using the technology
27 Document with markup using the technology Identified: Bhargava Waterbury Connecticut USA IEE Click for related documents, e.g. by Bhargava
28 Categorising results … using the technology
29 … and more categories using the technology
30 In summary Semantic technology - provides intelligence in content management - enhances the user experience - satisfies proven user needs