Making the Web Searchable Peter Mika Researcher, Data Architect Yahoo! Research.

Presentation transcript:

Making the Web Searchable Peter Mika Researcher, Data Architect Yahoo! Research

Yahoo! Research (research.yahoo.com)

Yahoo! Research Barcelona
Established January, 2006
Led by Ricardo Baeza-Yates
Research areas
–Web Mining
  content, structure, usage
–Distributed Web retrieval
–Multimedia retrieval
–NLP and Semantics

Yahoo! by numbers (April, 2007)
There are approximately 500 million users of Yahoo! branded services, meaning we reach 50 percent – or 1 out of every 2 users – online, the largest audience on the Internet (Yahoo! Internal Data).
Yahoo! is the most visited site online with nearly 4 billion visits and an average of 30 visits per user per month in the U.S. and leads all competitors in audience reach, frequency and engagement (comScore Media Metrix, US, Feb. 2007).
Yahoo! accounts for the largest share of time Americans spend on the Internet with 12 percent (comScore Media Metrix, US, Feb. 2007) and approximately 8 percent of the world’s online time (comScore WorldMetrix, Feb. 2007).
Yahoo! is the #1 home page with 85 million average daily visitors on Yahoo! homepages around the world, an increase of nearly 5 million visitors in a month (comScore WorldMetrix, Feb. 2007).
Yahoo!’s social media properties (Flickr, delicious, Answers, 360, Video, MyBlogLog, Jumpcut and Bix) have 115 million unique visitors worldwide (comScore WorldMetrix, Feb. 2007).
Yahoo! Answers is the largest collection of human knowledge on the Web with more than 90 million unique users and 250 million answers worldwide (Yahoo! Internal Data).
There are more than 450 million photos in Flickr in total and 1 million photos are uploaded daily. 80 percent of the photos are public (Yahoo! Internal Data).
Yahoo! Mail is the #1 Web mail provider in the world with 243 million users (comScore WorldMetrix, Feb. 2007) and nearly 80 million users in the U.S. (comScore Media Metrix, US, Feb. 2007).
Interoperability between Yahoo! Messenger and Windows Live Messenger has formed the largest IM community approaching 350 million user accounts (Yahoo! Internal Data).
Yahoo! Messenger is the most popular in time spent with an average of 50 minutes per user, per day (comScore WorldMetrix, Feb. 2007).
Nearly 1 in 10 Internet users is a member of a Yahoo! Groups (Yahoo! Internal Data).
Yahoo! is one of only 26 companies to be on both the Fortune 500 list and the Fortune’s “Best Place to Work” List (2006).

Agenda The Annotated Web SearchMonkey –Demo –Technology DataRSS format Query language Lessons learned Toward Semantic Search BOSS –Build your Own Search Service Y!OS 1.0 –Yahoo! Open Strategy

The Annotated Web

Previously in search
Horizontal search
–Yahoo…
–Keyword-based indexing
–Minimal natural language processing
–Limited experiments with ontologies (query expansion)
Vertical search
–e.g. shopping.com, Kelkoo
–Faceted search, browsing
–Fixed ontology
Combinations
–Google Base, Google Co-op
  Web-scale, but fixed ontologies
  Proprietary technology
Can we do better with the Semantic Web?
–Address the long tail of queries (88% of queries)
–Use standard technology
Not a new question. But the answer may be new.

Which Semantic Web?
Two visions
–Data Web
  Bringing the content of databases to the Web (linkeddata.org)
  Rich data, heavyweight semantics
  Deep Web
–Annotated Web
  Annotating the content of Web resources (documents, multimedia)
  Simple data, lightweight semantics
  Shallow Web
This presentation is about the Annotated Web.

Brief history of the Annotated Web
1995: HTML meta tags
1996: Simple HTML Ontology Extensions (SHOE)
1998: RDF/XML
–RDF/XML in HTML
–RDF linked from HTML
2003: Web 2.0
–Tagging
–Microformats
–Metadata in Wikipedia
–Machine tags in Flickr
2005: eRDF
2008: RDFa

HTML meta tags <LINK rel="meta" type="application/rdf+xml" title="FOAF" href= " …

SHOE example (Heflin & Hendler, 1996) My name is George Cook and I live at...

SHOE system

SHOE Text-based query interface

SHOE Graphical Query Interface

Example: Creative Commons Embedding CC license in HTML (now deprecated): … … <!-- <rdf:RDF xmlns=" xmlns:dc=" xmlns:rdf=" The Law of Averages...because eventually i&apos;ll be right... -->

Example: Creative Commons Current: rel attribute (HTML4) This work is licensed under a Creative Commons Attribution 3.0 United States License. Use of the “rel” attribute for semantic annotation is the birth of the microformat…

Example: microformats <a class="fn url" rel="friend colleague met" href=" Meyer wrote a post ( Tax Relief ) about an unintentionally humorous letter he received from the Internal Revenue Service. Joe Friday Area Administrator, Assistant

microformats
microformats.org
Originated by Tantek Çelik and others
Agreements on the way to encode certain kinds of metadata in HTML
–Reuse of semantic-bearing HTML elements
–Based on existing standards
–Community process
–Persons, events, listings etc. but also syntactic metadata: licenses, tags
Microformats have no shared syntax
–Each microformat has a separate syntax tailored to the vocabulary
Microformats are not ontologies
–No formal descriptions of schema, only text
–Limited reuse, extensibility of schemas
–No datatypes
No namespaces, unique identifiers (URIs)
–No interlinking
–Mapping between instances is required
Relationship to page context is unclear
Widely used in millions of documents
–User-generated as well as automatically generated
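To make the "reuse of semantic-bearing HTML elements" point concrete, here is a minimal sketch of pulling microformat-style data out of a page using only the Python standard library. The sample markup, the href and the name "Jane Doe" are hypothetical; only the class value "fn" and the rel values "friend colleague met" come from the slides. This is not a conformant microformats parser: it ignores nesting of same-named tags and every property other than fn.

```python
from html.parser import HTMLParser


class MicroformatSketch(HTMLParser):
    """Collects text of elements with class 'fn' and rel values of links."""

    def __init__(self):
        super().__init__()
        self.names = []      # text content of elements marked class="fn"
        self.rels = []       # (href, [rel values]) for each <a rel="...">
        self._fn_tag = None  # tag name of the fn element being captured
        self._buf = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        classes = (a.get("class") or "").split()
        if tag == "a" and a.get("rel"):
            self.rels.append((a.get("href"), a["rel"].split()))
        if "fn" in classes and self._fn_tag is None:
            self._fn_tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._fn_tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        # Simplification: the first matching end tag closes the capture.
        if tag == self._fn_tag:
            self.names.append("".join(self._buf).strip())
            self._fn_tag = None


# Hypothetical sample markup, for illustration only.
SAMPLE = ('<p><a class="fn url" rel="friend colleague met" '
          'href="http://example.org/jane">Jane Doe</a> wrote a post.</p>')

parser = MicroformatSketch()
parser.feed(SAMPLE)
print(parser.names, parser.rels)
```

Note that nothing in the markup says which vocabulary "fn" or "met" belongs to; the consumer has to know the conventions in advance, which is the "no namespaces" limitation listed above.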

Example: tags and machine tags

Example: Tags and machine tags
Tags
–User defined keywords
–Minimal agreement
  Is ‘rock’ on Flickr the same as ‘rock’ on MySpace?
  Is ‘rock’ by me on Flickr the same as ‘rock’ by you on Flickr?
  Is ‘rock’ by me on Flickr today the same as ‘rock’ by me on MySpace tomorrow?
Machine tags
–User defined values for user defined properties
–Possibility to define the namespace (but not enforced)
–Limited use
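The difference between a plain tag and a machine tag can be sketched in a few lines. The snippet assumes the namespace:predicate=value shape used for Flickr machine tags; the exact character rules in the regular expression and the sample tags are assumptions made for illustration.

```python
import re
from typing import NamedTuple, Optional


class MachineTag(NamedTuple):
    namespace: str
    predicate: str
    value: str


# Assumed shape: namespace:predicate=value, names starting with a letter.
_PATTERN = re.compile(r"^([A-Za-z][A-Za-z0-9_]*):([A-Za-z][A-Za-z0-9_]*)=(.*)$")


def parse_tag(tag: str) -> Optional[MachineTag]:
    """Return a MachineTag for machine tags, or None for plain keyword tags."""
    match = _PATTERN.match(tag)
    return MachineTag(*match.groups()) if match else None


print(parse_tag("geo:lat=41.38"))  # a machine tag with namespace 'geo'
print(parse_tag("rock"))           # a plain tag, so None
```

The namespace is only a user-chosen string, so two users can still mean different things by the same namespace; that is the "not enforced" caveat on the slide.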

RDF-based annotation #1: eRDF
eRDF
–Ian Davis (Talis)
–Embedding RDF in HTML
  Straightforward mapping to RDF triples (XSLT available)
  HTML4 compatible
–More complex than microformats
  Use any RDF/OWL vocabulary
  Reuse of semantic-bearing HTML elements is limited
–More limited than RDF
  No blank nodes
  No data types
  No statements about subjects other than the current document
–Limited usage

RDF-based annotation #2: RDFa
RDFa
–World Wide Web Consortium (W3C) last call document
–Similar intent to eRDF, but full RDF support
  Requires XHTML
–Big question: user complexity (→ data quality)
Jo Smith. Web hacker at Example.org. You can contact me via ....
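As a rough illustration of how RDFa attributes turn into triples, the sketch below handles only a small subset: `about` sets the subject, `property` names the predicate, and the object is the `content` attribute or, failing that, the element text. The sample markup and the foaf property names are assumptions (the original slide markup did not survive), prefixes are not expanded, subjects are not resolved against the base URL, and well-formed XHTML is assumed. It is not a conformant RDFa processor.

```python
from html.parser import HTMLParser


class RdfaSketch(HTMLParser):
    """Extracts (subject, property, literal) triples from a tiny RDFa subset."""

    def __init__(self, base):
        super().__init__()
        self.triples = []
        self._subjects = [base]  # stack of current subjects
        self._open = []          # stack of (tag, pushed_a_subject)
        self._pending = None     # (subject, property, tag) awaiting text
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        pushed = a.get("about") is not None
        if pushed:
            self._subjects.append(a["about"])
        self._open.append((tag, pushed))
        prop = a.get("property")
        if prop:
            if a.get("content") is not None:
                self.triples.append((self._subjects[-1], prop, a["content"]))
            else:
                self._pending = (self._subjects[-1], prop, tag)
                self._text = []

    def handle_data(self, data):
        if self._pending is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._pending is not None and self._pending[2] == tag:
            subject, prop, _ = self._pending
            self.triples.append((subject, prop, "".join(self._text).strip()))
            self._pending = None
        # Unwind the element stack down to the matching open tag.
        while self._open:
            open_tag, pushed = self._open.pop()
            if pushed:
                self._subjects.pop()
            if open_tag == tag:
                break


# Hypothetical markup in the spirit of the "Jo Smith" example.
SAMPLE = """
<div about="#jo" xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <span property="foaf:name">Jo Smith</span>.
  <span property="foaf:title">Web hacker</span> at Example.org.
</div>
"""

rdfa = RdfaSketch("http://example.org/page")
rdfa.feed(SAMPLE)
for triple in rdfa.triples:
    print(triple)
```

Even this toy version needs a subject stack and prefix declarations, which hints at why user complexity is the open question for RDFa.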

SearchMonkey

SearchMonkey
Creating an ecosystem of publishers, developers and end-users
–Motivating and helping publishers to implement semantic annotation
–Providing tools for developers to create compelling applications
–Focusing on end-user experience
Rich abstracts as a first application
Addressing the long tail of query and content production
Standard Semantic Web technology
–dataRSS = Atom + RDFa
–Industry standard vocabularies

What is SearchMonkey?
An open platform for using structured data to build more useful and relevant search results
[Screenshots: Before, After]

Enhanced Result
[Screenshot labels: image, deep links, name/value pairs or abstract]

Infobar

SearchMonkey
1 Site owners/publishers share structured data with Yahoo!.
2 Site owners & third-party developers build SearchMonkey apps.
3 Consumers customize their search experience with Enhanced Results or Infobars.
[Diagram labels: Acme.com’s database, Acme.com’s Web Pages, RDF/Microformat Markup, DataRSS feed, Web Services, Page Extraction, Index]

Developer tool

Gallery

Example apps LinkedIn –hCard plus feed data Creative Commons by Ben Adida –CC in RDFa

Example apps. II. Other me by Dan Brickley –Google Social Graph API wrapped using a Web Service

DataRSS <feed xmlns:xsi=" xsi:schemaLocation=" Peter Mika Example data feed for social T04:05:06+07:00 Peter Mika T04:05:06+07:00 John Doe male Jane Doe female Atom 1.0 XML + RDFa

The data part <adjunct version="1.0" id=“com.yahoo.page.rdfa" xmlns=" updated=“ T04:05:06+07:00”> John Doe male Jane Doe female

DataRSS An Atom extension for structured data Why a new format? –A feed format is required by publishers Exclusive content (e.g. partnerships, paid inclusion) No changes necessary to the web page No standard named graph format for the Semantic Web –Needed to capture meta-metadata such as source and timestamp of information –Not really a new format An Atom extension Use any RDFa parser to get the triples out cf. Google Base feeds
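The need for "meta-metadata such as source and timestamp" can be pictured with a small data structure: triples are grouped per source, and each group carries its own update time. This is only a sketch of the idea, not the DataRSS schema; the class and function names, the second source id and the dates are hypothetical, and only the id com.yahoo.page.rdfa appears in the slides.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Adjunct:
    """A group of triples plus where and when they came from."""
    source_id: str
    updated: datetime
    triples: list = field(default_factory=list)  # (subject, property, object)


def latest_value(adjuncts, subject, prop):
    """Return (object, source_id) from the most recently updated group, or None."""
    best = None
    for adj in adjuncts:
        for s, p, o in adj.triples:
            if s == subject and p == prop:
                if best is None or adj.updated > best[0]:
                    best = (adj.updated, adj.source_id, o)
    return None if best is None else (best[2], best[1])


# Hypothetical data: one group extracted from the page, one from a feed.
adjuncts = [
    Adjunct("com.yahoo.page.rdfa",
            datetime(2008, 1, 1, tzinfo=timezone.utc),
            [("#john", "foaf:gender", "male")]),
    Adjunct("com.example.feed",
            datetime(2008, 6, 1, tzinfo=timezone.utc),
            [("#john", "foaf:gender", "male"),
             ("#john", "foaf:name", "John Doe")]),
]

print(latest_value(adjuncts, "#john", "foaf:name"))
```

Keeping the source next to each group of triples is what lets a consumer prefer fresher or more trusted data when two sources disagree.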

What happened since the launch?
It’s starting to work!
–Click rates improve → Publishers are willing to invest → More structured data → More applications → More users → Click rates improve
Increasing excitement all around
–Standardization of RDFa is bringing new energy
–Good market for companies that help publishers to ‘semantify’ or support developers in extracting structured data from web pages
  OpenCalais, Dapper, AdaptiveBlue, Intel MashMaker, Zemanta…
There have been some lessons learned…

Data quality Publishers/developers want the quick and dirty answer, not the long and clean one Resource or literal? – – Webpage or resource? –Should we allow a resource to have the same URI as an existing webpage? –This is the default in eRDF/RDFa! Peter Mika Types vs. datatypes – Extensibility –rdfs:movies Complexity of the formalism = Data quality down

Vocabularies Coverage is small –Books, movies, stuff people care about… Competing proposals –Versions floating around Not maintained –I cannot maintain your vocabulary for you Vocabularies for microformats –A must The role of the W3C –Ontologies as member submission…. Vocabularies not designed for the annotated Web Distributed ontology development = Mess

eRDF Difficult for complex pages and dangerous in non-expert hands –Serious limitations No datatypes No subjects other than identifiers within the current page Reuse of the id attribute Peter Mika ….

RDFa A huge improvement –E.g. no repurposing of HTML attributes Still, not everything is intuitive to the uninitiated: Peter Mika jpg … </span Peter Mika jpg </span

Semantic Search

Semantics and IR Hard searches that cannot be solved with a purely syntactic approach –Ontologies in IR shown to work in limited domains –In Web IR most attempts (e.g. query expansion) have failed What is new? The scale and breadth. –Growth in annotations, all domains (Web 2.0) –Data Web vs. Deep Web

Hard searches
Ambiguous searches
–Paris Hilton
Multimedia search
–Images of Paris Hilton
Imprecise or overly precise searches
–Publications by Jim Hendler
–Find images of strong and adventurous people (Lenat)
Searches for descriptions
–Search for yourself without using your name
–Product search (ads!)
Searches that require aggregation
–Size of the Eiffel Tower (Lenat)
–Public opinion on Britney Spears
Queries that require a deeper understanding of the query, the content and/or the world at large
–Note: some of these are so hard that users don’t even try them any more

Example…

Application: query intent Paris Hilton is a person!

Application: query intent #2 Hugo is a person!

Time to experiment! Future work in IR –Ranking documents –Ranking resources/triples Ranking resources in the presence of documents –New interfaces Query interface Result presentation Metrics –Precision/recall –CTR –ROI
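For reference, the first two metrics named above can be computed as follows. This is a minimal sketch using the standard set-based definitions of precision and recall and the usual clicks-over-impressions definition of CTR; the document ids and counts are made up.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def click_through_rate(clicks, impressions):
    """Fraction of result impressions that led to a click."""
    return clicks / impressions if impressions else 0.0


# Hypothetical judgments: 4 documents returned, 3 relevant overall, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
ctr = click_through_rate(30, 1000)
print(p, r, ctr)
```

Precision and recall need relevance judgments, while CTR comes straight from logs, which is one reason click rates are the first signal reported for enhanced results.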

Time to experiment! #2
Future work in Semantic Web
–(Semi-)automated ways of metadata creation (NLP!)
–Entity resolution
–Scale
–Data quality
  We allow providing metadata for other people’s sites!
–Reasoning
  To the extent that it’s useful
Constraints
–Keyword-based search cannot suffer
  Preference for very conservative solutions
–Metadata in the context of documents
  Assumption is still that the user will want to see/visit the source (what to do with linked data?)

Going even further… NLP, Information Extraction New types of application –Aggregators –Stateful apps –Intent-driven apps –Mobile apps –… Beyond search –Analysis, design, diagnosis etc. on top of aggregated data Personalization –Building rich user profiles –Re-ranking results based on interests Monetization –No more “buy virgins on eBay”

BOSS: Build your Own Search Service
Unlimited queries per day
Ability to re-order results and blend in additional content
No restrictions on presentation
No branding or attribution
Access to multiple verticals (web search, image, news)
Ability to monetize
40+ supported language and region pairs
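What "re-order results and blend in additional content" might look like on the consumer side can be sketched without touching the service itself. The code below does not call BOSS; it works on plain lists of dictionaries whose shape (a "title" key) is an assumption, and the boosting and interleaving rules are one arbitrary choice among many.

```python
def blend(web_results, own_results, boost_terms, every=3):
    """Move boosted results to the front, then insert one own item after every N results."""

    def boosted(result):
        title = result["title"].lower()
        return any(term.lower() in title for term in boost_terms)

    # sorted() is stable, so the original order is kept within each group.
    reordered = sorted(web_results, key=lambda r: not boosted(r))

    blended = []
    extra = iter(own_results)
    for position, result in enumerate(reordered, start=1):
        blended.append(result)
        if position % every == 0:
            item = next(extra, None)
            if item is not None:
                blended.append(item)
    blended.extend(extra)  # append any own content that was not placed
    return blended


# Hypothetical results, for illustration only.
web = [{"title": "A"}, {"title": "Barcelona guide"}, {"title": "C"}, {"title": "D"}]
own = [{"title": "Our pick"}]
titles = [r["title"] for r in blend(web, own, ["barcelona"], every=2)]
print(titles)
```

Because there are no restrictions on presentation, the blending policy is entirely up to the developer; the interleaving interval here is just a parameter.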

Y!OS 1.0
Yahoo! Open Strategy
–Yahoo! Social Platform
  Profiles, Connections, Updates, Contacts and Status
–Yahoo! Query Language (YQL)
  Access (other) web services using a SQL-like language
–Yahoo! Application Platform (YAP)
  Developer hosted execution of applications with access to Yahoo's Social APIs and YQL
  Support for OpenSocial's JavaScript API
  Support for server-side YML tags
  Future: run applications on Y! sites
OpenID, OAuth

Contact Peter Mika –Come to Barcelona and stop by SearchMonkey –developer.yahoo.com/searchmonkey/ –mailing lists –forums –Semantic Web FAQ

the monkey is out!