
1 Semantic Search: different meanings

2 Semantic search: different meanings
Definition 1: Semantic search as the problem of searching documents beyond the syntactic level of matching keywords
– Hakia, PowerSet, SearchMonkey
Definition 2: Semantic search as the problem of searching large semantic web datasets
– Watson, PowerAqua, Swoogle, Sindice, SWSE

3 Facing keyword-based search problems
Relations between search terms:
– “books about recommender systems” vs. “systems that recommend books”
Polysemy:
– “mouth” as part of the body vs. “mouth” as part of a stream
Synonymy:
– “movies” vs. “films”
Documents about individuals where the query keywords do not appear:
– “English banks”, individual “Abbey”
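The problems on this slide can be reproduced with a few lines of code. The following is a minimal sketch, not any real search engine's matching logic: the tiny synonym table, stem table, and stop-word list are hypothetical and exist only to make the example run.

```python
# Toy illustration of keyword-matching problems; all tables are made up.
SYNONYMS = {"movies": {"films"}, "films": {"movies"}}  # hypothetical thesaurus
STEMS = {"recommender": "recommend"}                   # hypothetical stem table
STOP_WORDS = {"about", "that"}


def terms(text):
    """Lowercase, split on whitespace, drop stop words, apply the stem table."""
    return {STEMS.get(t, t) for t in text.lower().split() if t not in STOP_WORDS}


def keyword_match(query, document):
    """True if every query term literally occurs in the document."""
    return terms(query) <= terms(document)


def expanded_match(query, document):
    """Like keyword_match, but a query term may also match via a synonym."""
    doc_terms = terms(document)
    return all(({t} | SYNONYMS.get(t, set())) & doc_terms for t in terms(query))


doc = "a list of classic films"
synonymy_miss = keyword_match("movies", doc)   # literal matching misses the document
synonymy_hit = expanded_match("movies", doc)   # synonym expansion finds it

# Word order carries the relation between terms, but a bag of words discards it,
# so the two queries from the slide become indistinguishable.
same_bag = terms("books about recommender systems") == terms("systems that recommend books")
```

Synonym expansion addresses only the synonymy case. The relation between terms and the polysemy of “mouth” need information about meaning that a bag of keywords does not carry.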

4 Several attempts from the IR community
Early 80s: elaboration of conceptual frameworks and their introduction in IR models
– Taxonomies (categories + hierarchical relations), e.g., the ODP (Open Directory Project)
– Thesauri (categories + fixed hierarchical & associative relations), e.g., WordNet (used by linguistic approaches)
– Algebraic methods such as LSA
Limitations: the level of conceptualization is often shallow (especially at the level of relations)

5 The emergence of the SW
Late 90s: introduction of ontologies as conceptual frameworks (classes + instances (KBs) + arbitrary semantic relations + rules)
– Semantic search: exploiting ontologies as richer conceptualizations & formal languages to enhance traditional keyword-based document retrieval
– Semantic search: need to search this emergent and continuously growing structured information space (the Web of Data)
DBLP, Geonames, DBPedia, BBC Music, ... ( penData/DataSets)
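To make the first use of ontologies concrete, here is a small sketch of how a knowledge base could help with the “English banks” / “Abbey” case mentioned earlier. It assumes a toy in-memory set of (subject, predicate, object) triples; the predicate names `type` and `locatedIn`, the helper functions, and the two documents are all hypothetical and not taken from any real vocabulary or dataset.

```python
# Toy knowledge base as (subject, predicate, object) triples; all names are illustrative.
TRIPLES = {
    ("Abbey", "type", "Bank"),
    ("Abbey", "locatedIn", "England"),
    ("Bank", "subClassOf", "Organization"),
}


def match(pattern, triples):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]


def instances_of(cls, located_in):
    """Instances of a class that are also located in the given place."""
    typed = {s for s, _, _ in match((None, "type", cls), TRIPLES)}
    located = {s for s, _, _ in match((None, "locatedIn", located_in), TRIPLES)}
    return typed & located


# The query "English banks" is mapped by hand to a class and a place (an assumption).
english_banks = instances_of("Bank", "England")

documents = {
    "doc1": "Abbey announces new mortgage rates",
    "doc2": "Weather forecast for the weekend",
}

# Retrieve documents that mention any matching instance, even though the
# words "English" and "banks" never appear in them.
hits = sorted(
    doc_id for doc_id, text in documents.items()
    if any(name.lower() in text.lower().split() for name in english_banks)
)
```

The mapping from the keyword query to a class and a place is done by hand here; automating that step is part of the semantic search problem itself.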

6 The Web of Data  2007  2008  2009 Extracted from: Linked Data Tutorial (Florianópolis)

7 LOD cloud May 2007
Figure from [4]
Facts:
Focal points:
– DBPedia: RDFized version of Wikipedia; many ingoing and outgoing links
– Music-related datasets
Big datasets include FOAF, US Census data
Size approx. 1 billion triples, 250k links
Extracted from: Linked Data Tutorial (Florianópolis)

8 LOD cloud September 2008
Facts:
– More than 35 datasets interlinked
– Commercial players joined the cloud, e.g., BBC
– Companies began to publish and host datasets, e.g., OpenLink, Talis, or Garlik
– Size approx. 2 billion triples, 3 million links
Extracted from: Linked Data Tutorial (Florianópolis)

9 LOD cloud March 2009
Facts:
– Big part from Linking Open Drug cloud and the BIO2RDF project
– Notable new datasets: Freebase, OpenCalais, ACM/IEEE
– Size > 10 billion triples
Extracted from: Linked Data Tutorial (Florianópolis)

10 The LOD clouds Extracted from: Linked Data Tutorial (Florianópolis)

11 Commercial interest by publishers

12 Commercial interest by search engines 2007 Yahoo! Presents Search Monkey

13 Commercial interest by search engines July-2008 Microsoft buys Powerset

14 Commercial interest by search engines April 2010 Facebook announced the use of the Open Graph protocol

15 Commercial interest by search engines May-2009 Google announces Rich Snippets and its official use of RDFa and Microformats

16 Commercial interest by search engines July-2010 Google buys Metaweb (the company behind FreeBase)

17 Commercial interest by search engines November-2010 Google announced the support of the GoodRelations vocabulary for Google Rich Snippets.

18 Challenges
Exploiting this new information space for semantic search purposes opens new research challenges:
– Scalability
– Heterogeneity
– Uncertainty

19 Scalability Effective exploitation of the linked data requires infrastructure that scales to a large and ever growing collection of interlinked data!

20 Heterogeneity
Data level (reconcile, combine): Dbpedia:Rudi_Studer, Dblp:Studer:Rudi.html, SW:/en/rudi_studer
Schema level (align): Dblp:~ley/db/../author, SW:Person, Dbpedia:Professor
Effective exploitation of the data web requires an effective mechanism for
– finding the relevant data sources
– integrating data sources
– combining elements from different data sources
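As a rough illustration of data-level reconciliation, the sketch below groups the identifiers from this slide by a normalized name key. This is only one naive heuristic, chosen for brevity: the normalization rule and the list of ignored tokens (`en`, `html`) are assumptions, and real reconciliation systems use far more evidence than string similarity.

```python
import re
from collections import defaultdict

# Identifiers as they appear on the slide, in prefix:local-name form.
IDENTIFIERS = [
    "Dbpedia:Rudi_Studer",
    "Dblp:Studer:Rudi.html",
    "SW:/en/rudi_studer",
    "Dbpedia:Professor",
]

# Tokens treated as noise rather than as part of a name (an assumption).
IGNORED_TOKENS = {"en", "html"}


def name_key(identifier):
    """Order-insensitive set of name tokens from the local part of an identifier."""
    _prefix, local = identifier.split(":", 1)
    tokens = re.split(r"[^a-z0-9]+", local.lower())
    return frozenset(t for t in tokens if t and t not in IGNORED_TOKENS)


def reconcile(identifiers):
    """Group identifiers whose name keys coincide."""
    groups = defaultdict(list)
    for identifier in identifiers:
        groups[name_key(identifier)].append(identifier)
    return dict(groups)


groups = reconcile(IDENTIFIERS)
rudi = groups[frozenset({"rudi", "studer"})]
```

Here the three identifiers for Rudi Studer end up in one group while `Dbpedia:Professor` stays on its own. Schema-level alignment (for example, relating `SW:Person` to `Dbpedia:Professor`) is a separate problem that name matching does not solve.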

21 Uncertainty
Incomplete representation of the user’s needs and of content meanings
– The user cannot completely specify the need
– The semantic information in the search space is incomplete
Effective exploitation requires
– matching the user’s needs to data in an imprecise way
– ranking the results
– being flexible enough to adjust to changes in constraints
Example: “Find action films directed by some Hong Kong film director and starring Chinese martial actors”
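One simple way to picture imprecise matching and ranking is to score each candidate by how many query constraints it satisfies, giving partial credit when the data is missing. The sketch below assumes this scoring scheme for illustration only; the property names, the 0.5 credit for unknown values, and the three film records are invented and do not describe real films.

```python
# Constraints loosely derived from the example query; property names are hypothetical.
CONSTRAINTS = {
    "genre": "action",
    "director_origin": "Hong Kong",
    "actor_type": "Chinese martial artist",
}

# Made-up records; a missing property means the knowledge base is incomplete.
FILMS = [
    {"name": "Film C", "genre": "drama", "director_origin": "Hong Kong"},
    {"name": "Film A", "genre": "action", "director_origin": "Hong Kong",
     "actor_type": "Chinese martial artist"},
    {"name": "Film B", "genre": "action", "director_origin": "Hong Kong"},
]


def score(item, constraints, unknown_credit=0.5):
    """Fraction of constraints met; unknown values earn partial credit."""
    total = 0.0
    for prop, wanted in constraints.items():
        value = item.get(prop)
        if value is None:
            total += unknown_credit   # data missing: neither confirmed nor refuted
        elif value == wanted:
            total += 1.0              # constraint satisfied
    return total / len(constraints)


def rank(items, constraints):
    """Order items from best to worst match instead of filtering them out."""
    return sorted(items, key=lambda item: score(item, constraints), reverse=True)


ranked_names = [film["name"] for film in rank(FILMS, CONSTRAINTS)]
```

Because the result is a ranking rather than a strict filter, changing or dropping a constraint only requires calling `rank` again with a different dictionary.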

22 The Search Space: different representations

23 The search space: different representations
Unstructured search space
– The Web of documents (textual and multimedia content)
Structured search space
– The Web of data (ontologies + Knowledge Bases)
Hybrid search space
– Unstructured content is enriched with metadata: embedded annotations, not embedded annotations

24 The unstructured search space The Web of human-understandable content. The Web of documents and links – CC License Documents Search space

25 Search engines

26 The structured search space The Web of machine understandable content. The Web of objects and relations – Creative Commons License objects Search space

27 Search engines

28 The hybrid search space Enriching documents with metadata Objects Documents How to interlink documents and data? Search space

29 Two ways of interlinking metadata and documents
– Information extraction
– By relying on Web publishers (more in the section Data on the (Semantic) Web)

30 Search engines

