Presentation is loading. Please wait.

Presentation is loading. Please wait.

Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)

Similar presentations


Presentation on theme: "Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)"— Presentation transcript:

1 Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)

2 - 2 - Research Outline Evolution of Online Communities –Social Search –PeopleWeb Trends in Search and Information Discovery –Move towards task-centricity –Need to interpret content –Community Information Management Web Data Infrastructure –Massively distributed computing and hosted “cloud” services

3 Evolution of Online Communities

4 - 4 - Research

5 - 5 - Research Rate of content creation Estimated growth of content –Published content from traditional sources: 3-4 Gb/day –Professional web content: ~2 Gb/day –User-generated content: 8-10 Gb/day –Private text content: ~3 Tb/day (200x more) –Upper bound on typed content: ~700 Tb/day (Towards a PeopleWeb, Ramakrishnan & Tomkins, IEEE Computer, August 2007)

6 - 6 - Research Metadata Estimated growth of metadata –Anchortext: 100Mb/day –Tags: 40Mb/day –Pageviews: 100-200Gb/day –Reviews: Around 10Mb/day –Ratings: Drove most advances in search from 1996-present Increasingly rich and available, but not yet useful in search This is in spite of the fact that interactions on the web are currently limited by the fact that each site is essentially a silo

7 - 7 - Research PeopleWeb: Site-Centric People-Centric Common web-wide id for objects (incl. users) –Even common attributes? (e.g., pixels for camera objects) As users move across sites, their personas and social networks will be carried along Increased semantics on the web through community activity (another path to the goals of the Semantic Web) Global Object Model Portable Social Environment Community Search (Towards a PeopleWeb, Ramakrishnan & Tomkins, IEEE Computer, August 2007)

8 - 8 - Research Content Access and Ownership (Slide courtesy Andrew Tomkins)

9 - 9 - Research Facebook Apps, Open Social Web site provides canvas –Third party apps can paint on this canvas –“Paint” comes from data on and off-network Via APIs that each site chooses to expose What is the core asset of a web portal? What are the computational implications? –App hosting and caching –Dynamic, personalized content –Searching over “spaghetti” information threads

10 Trends in Search

11 - 11 - Research Search and Content Supply Premise: –People don’t want to search –People want to get tasks done I want to book a vacation in Tuscany. StartFinish Broder 2002, A Taxonomy of web search

12 - 12 - Research “seafood san francisco” Category: restaurant Location: San Francisco Reserve a table for two tonight at SF’s best Sushi Bar and get a free sake, compliments of OpenTable! Category: restaurant Location: San Francisco Alamo Square Seafood GrillAlamo Square Seafood Grill - (415) 440-2828 803 Fillmore St, San Francisco, CA - 0.93mi - mapmap Category: restaurant Location: San Francisco Structure Intent

13 - 13 - Research Y! Shortcuts

14 - 14 - Research Google Base

15 - 15 - Research Search as Killer App for Web Data Semantics Publishers and search engine collaborate –Example: Abstracts surfacing structured content Users see richer search experience –Accomplish their tasks faster and more effectively

16 Social Search

17 - 17 - Research Social Search Explicitly open up search –Enable communities, sites and consumers to explicitly re-define search results (e.g., SearchMonkey, Boss) What is the right unit for a “search result”? Can we intelligently “stitch together” more informative abstracts, possibly from multiple sources? Facilitate creation of specialized ranking engines based on different kinds of tasks, or aimed at different communities of users Implicitly leverage socially engaged users and their interactions –Learning from shared community interactions, and leveraging community interactions to create and refine content Expanding search results to include sources of information –E.g., Experts, sub-communities of shared interest, particular search engines (in a world with many, this is valuable!) Reputation, Quality, Trust, Privacy

18 - 18 - Research Opening Up Yahoo! Search Phase 1 Phase 2 Giving site owners and developers control over the appearance of Yahoo! Search results. BOSS takes Yahoo!’s open strategy to the next level by providing Yahoo! Search infrastructure and technology to developers and companies to help them build their own search experiences. (Slide courtesy Prabhakar Raghavan)

19 - 19 - Research What Is It? BeforeAfter An open platform for using structured data to build more useful and relevant search results (Slide courtesy Amit Kumar)

20 - 20 - Research What’s New? task links buy this user reviews best trips task links buy this user reviews best trips structured data review ratings product prices hours of operation structured data review ratings product prices hours of operation favicon send result share this rich result with others send result share this rich result with others media product images business photos profile pictures media product images business photos profile pictures user choice remove report spam user choice remove report spam (Slide courtesy Amit Kumar)

21 - 21 - Research How Does It Work? Acme.com’s database Index RDF/Microformat Markup Site owners/publishers share structured data with Yahoo!. 1 Consumers customize their search experience with Enhanced Results or Infobars 3 Site owners & third-party developers build SearchMonkey apps. 2 DataRSS feed Web Services Page Extraction Acme.com’s Web Pages (Slide courtesy Amit Kumar)

22 - 22 - Research Publishing Structured Data: Support for Emerging Semantic Web Standards ++ Microformats –hCard, hEvent, hReview, hAtom, XFN –More as they get adopted RDFa and eRDF markup OpenSearch –+extensions to return structured data Atom/RSS Feeds –+extensions to embed structured data markup (crawl) apis (pull) push (Slide courtesy Andrew Tomkins)

23 - 23 - Research Infobars: Integrating 3 rd Party Data Pull in data from any web service (Slide courtesy Amit Kumar)

24 - 24 - Research babycenter epicurious Search Results of the Future yelp.com answers.com LinkedIn webmd Gawker New York Times (Slide courtesy Andrew Tomkins)

25 - 25 - Research BOSS Offerings API A self-service, web services model for developers and start-ups to quickly build and deploy new search experiences. BOSS offers two options for companies and developers and has partnered with top technology universities to drive search experimentation, innovation and research into next generation search. University of Illinois Urbana Champaign Carnegie Mellon University Stanford University Purdue University MIT Indian Institute of Technology Bombay University of Massachusetts CUSTOM Working with 3rd parties to build a more relevant, brand/site specific web search experience. This option is jointly built by Yahoo! and select partners. ACADEMIC Working with the following universities to allow for wide-scale research in the search field: (Slide courtesy Prabhakar Raghavan)

26 - 26 - Research BOSS Could Enable Custom Search Experiences Social Search Vertical Search Visual Search (Slide courtesy Prabhakar Raghavan)

27 - 27 - Research Partner Examples

28 - 28 - Research Web Search Results for “Lisa” Latest news results for “Lisa”. Mostly about people because Lisa is a popular name Web search results are very diversified, covering pages about organizations, projects, people, events, etc. 41 results from My Web!

29 - 29 - Research Save / Tag Pages You Like You can save / tag pages you like into My Web from toolbar / bookmarklet / save buttons You can pick tags from the suggested tags based on collaborative tagging technology Type-ahead based on the tags you have used Enter your note for personal recall and sharing purpose You can specify a sharing mode You can save a cache copy of the page content (Courtesy: Raymie Stata)

30 - 30 - Research My Web 2.0 Search Results for “Lisa” Excellent set of search results from my community because a couple of people in my community are interested in Usenix Lisa- related topics

31 - 31 - Research Google Co-Op This query matches a pattern provided by Contributor… …so SERP displays (query- specific) links programmed by Contributor. Subscribed Link edit | remove Query-based direct-display, programmed by Contributor Users “opts-in” by “subscribing” to them

32 - 32 - Research

33 - 33 - Research Tech Support at COMPAQ “In newsgroups, conversations disappear and you have to ask the same question over and over again. The thing that makes the real difference is the ability for customers to collaborate and have information be persistent. That’s how we found QUIQ. It’s exactly the philosophy we’re looking for.” “Tech support people can’t keep up with generating content and are not experts on how to effectively utilize the product … Mass Collaboration is the next step in Customer Service.” – Steve Young, VP of Customer Care, Compaq

34 - 34 - Research KNOWLEDGE BASE QUESTION Answer added to power self service SELF SERVICE ANSWER KNOWLEDGE BASE QUESTION SELF SERVICE - -Partner Experts - Customer Champions -Employees Customer How It Works Support Agent Answer added to power self service

35 - 35 - Research 65% (3,247) 77% (3,862) 86% (4,328) 6,845 74% answered Answers provided in 12h Answers provided in 24h 40% (2,057) Answers provided in 3h Answers provided in 48h Questions No effort to answer each question No added experts No monetary incentives for enthusiasts Timely Answers 77% of answers provided within 24h

36 - 36 - Research Power of Knowledge Creation ~80% Support IncidentsAgent Cases 5-10 % Self- Service *) Customer Mass Collaboration *) Knowledge Creation SHIELD 1 SHIELD 2 *)Averages from QUIQ implementations SUPPORT

37 - 37 - Research Mass Contribution Users who on average provide only 2 answers provide 50% of all answers 7 % (120) 93 % (1,503) 50 % (3,329) 100 % (6,718) Answers Contributing Users Top users Contributed by mass of users

38 - 38 - Research Interesting Problems Question categorization Detecting undesirable questions & answers Identifying “trolls” Ranking results in Answers search Finding related questions Estimating question & answer quality (Byron Dom: SIGIR talk)

39 - 39 - Research Supplying Structured Search Content Semantic Web? Unleash community computing—PeopleWeb! Three ways to create semantically rich summaries that address the user’s information needs: –Editorial, Extraction, UGC Challenge: Design social interactions that lead to creation and maintenance of high-quality structured content

40 - 40 - Research Better Search via Information Extraction Extract, then exploit, structured data from raw text: For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Name Title Organization Bill Gates CEO Microsoft Bill Veghte VP Microsoft Richard Stallman Founder Free Soft.. PEOPLE Select Name From PEOPLE Where Organization = ‘Microsoft’ Bill Gates Bill Veghte (from Cohen’s IE tutorial, 2003)

41 - 41 - Research Community Information Management (CIM) Many real-life communities have a Web presence –Database researchers, movie fans, stock traders Each community = many data sources + people Members want to query and track at a semantic level: –Any interesting connection between researchers X and Y? –List all courses that cite this paper –Find all citations of this paper in the past one week on the Web –What is new in the past 24 hours in the database community? –Which faculty candidates are interviewing this year, where?

42 - 42 - Research DBLife  Integrated information about a (focused) real- world community  Collaboratively built and maintained by the community  Semantic web via extraction & community

43 - 43 - Research DBLife Faculty: AnHai Doan & Raghu Ramakrishnan Students: P. DeRose, W. Shen, F. Chen, R. McCann, Y. Lee, M. Sayyadian Prototype system up and running since early 2005 Plan to release a public version of the system in Spring 2007 1164 sources, crawled daily, 11000+ pages / day 160+ MB, 121400+ people mentions, 5600+ persons See DE overview article, CIDR 2007 demo

44 - 44 - Research DBLife Papers Efficient Information Extraction over Evolving Text Data, F. Chen, A. Doan, J. Yang, R. Ramakrishnan. ICDE-08. Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach, P. DeRose, W. Shen, F. Chen, A. Doan, R. Ramakrishnan. VLDB-07. Declarative Information Extraction Using Datalog with Embedded Extraction Predicates, W. Shen, A. Doan, J. Naughton, R. Ramakrishnan. VLDB-07. Source-aware Entity Matching: A Compositional Approach, W. Shen, A. Doan, J.F. Naughton, R. Ramakrishnan: ICDE 2007. OLAP over Imprecise Data with Domain Constraints, D. Burdick, A. Doan, R. Ramakrishnan, S. Vaithyanathan. VLDB-07. Community Information Management, A. Doan, R. Ramakrishnan, F. Chen, P. DeRose, Y. Lee, R. McCann, M. Sayyadian, and W. Shen. IEEE Data Engineering Bulletin, Special Issue on Probabilistic Databases, 29(1), 2006. Managing Information Extraction, A. Doan, R. Ramakrishnan, S. Vaithyanathan. SIGMOD-06 Tutorial.

45 - 45 - Research DBLife Integrate data of the DB research community 1164 data sources Crawled daily, 11000+ pages = 160+ MB / day

46 - 46 - Research Entity Extraction and Resolution Raghu Ramakrishnan co-authors = A. Doan, Divesh Srivastava,...

47 - 47 - Research Resulting ER Graph “Proactive Re-optimization Jennifer Widom Shivnath Babu SIGMOD 2005 David DeWitt Pedro Bizarro coauthor advise write PC-Chair PC-member

48 - 48 - Research Challenges Extraction –Domain-level vs. site-level extraction “templates” Compositional, customizable approach to extraction planning –Blending extraction with other sources (feeds, wiki-style user edits) Maintenance of extracted information –Managing information Extraction –Incremental maintenance of “extracted views” at large scales –Mass Collaboration—community-based maintenance Exploitation –Search/query over extracted structures in a community –Search across communities—Semantic Web through the back door! –Detect interesting events and changes

49 - 49 - Research Mass Collaboration We want to leverage user feedback to improve the quality of extraction over time. –Maintaining an extracted “view” on a collection of documents over time is very costly; getting feedback from users can help –In fact, distributing the maintenance task across a large group of users may be the best approach

50 - 50 - Research Mass Collaboration: A Simplified Example Not David! Picture is removed if enough users vote “no”.

51 - 51 - Research Mass Collaboration Meets Spam Jeffrey F. Naughton swears that this is David J. DeWitt

52 - 52 - Research Incorporating Feedback A. Gupta, D. Smith, Text mining, SIGMOD-06 System extracted “Gupta, D” as a person name System extracted “Gupta, D” using rules: (R1) David Gupta is a person name (R2) If “first-name last-name” is a person name, then “last-name, f” is also a person name. Knowing this, system can potentially improve extraction accuracy. (1)Discover corrective rules (2)Find and fix other incorrect applications of R1 and R2 A general framework for incorporating feedback? User says this is wrong

53 - 53 - Research Collaborative Editing Users should be able to –Correct/add to the imported data –E.g., User imports a paper, system provides bib item Challenges –Incentives, reputation –Handling malicious/spam users –Ownership model My home page vs. a citation that appears on it –Reconciliation Extracted vs. manual input Conflicting input from different users

54 - 54 - Research The Purple SOX Project Operator Library Extraction Management System (e.g Vertex, Societek) Shopping, Travel, Autos Academic Portals (DBLife/Me Yahoo) Enthusiast Platform …and many others Application Layer (SOcial eXtraction)

55 Web Data Management: Massively Distributed Hosted Systems

56 - 56 - Research Two Key Subsystems Serving system –Takes queries and returns results –Supports low-latency updates Content system –Gathers input of various kinds (including crawling) –Generates the data sets used by serving system Both highly parallel Serving System Content System Data sets Users Logs Web sites Data updates Goal: speedup. Hardware increments speed computations. Goal: scaleup. Hardware increments support larger loads. (Courtesy: Raymie Stata)

57 - 57 - Research An Example Web App uploads tags as “flower” » Friend activity » Your Photos Sonja uploaded Brandon tagged a photo » Photos tagged as “flower” Updates Queries Heavy use of simple database operations

58 - 58 - Research The Problem What does it take to build the next big app?

59 - 59 - Research Why Hosted? simple API  No maintenance worries for application  Single ops team  Resource sharing leads to savings  No maintenance worries for application  Single ops team  Resource sharing leads to savings

60 - 60 - Research Data Analysis Platforms User Tags Understanding online communities, and provisioning their data needs –Exploratory analysis over massive data sets Challenges: Analyze shared, evolving social networks of users, content, and interactions to learn models of individual preferences and characteristics; community structure and dynamics; and to develop robust frameworks for evolution of authority and trust; extracting and exploiting structure from web content … Examples: –Bigtable, Map-Reduce, Hadoop, PIG

61 - 61 - Research The Bigger Picture Software-as-a-service –E.g., Salesforce.com Hosted data systems –E.g., Amazon’s S3/Dynamo and EC2 Web application development –Ning, Ruby-on-rails Change tracking –Stream management

62 - 62 - Research Implications Data management as a service –Scientists and others who’ve resisted (installing, maintaining, and) using DBMSs will find it much easier to reap the benefits –“Data centers” and “Computing Centers” will come into vogue again Hosted back-ends and RAD tools will make Web application development accessible to all –The Web is becoming open E.g., OpenSocial, OpenID Ideas will be the most valuable currency, not the wherewithal to build complex systems Paradigm shifts possible for how we do research in many fields –Build applications that embed your algorithms and test them directly in the field— Computer Scientists can interact directly with users (ironically, this would still be a breakthrough of sorts after four decades!) –Many other disciplines (e.g., Sociology, microeconomics) can design and conduct online experiments involving unprecedented numbers of participants

63 - 63 - Research Summary Online communities represent a tremendous resource for organizing information online –Open APIs and cloud services = mass engagement –Extraction + mass collaboration = semantics Web is becoming –More people-centric, less site-centric –Highly intertwined, distributed, dynamic, personalized –Models of ownership, trust, incentives? –Next generation of search algorithms and infrastructure?

64 - 64 - Research Further Reading Content, Metadata, and Behavioral Information: Directions for Yahoo! Research, The Yahoo! Research Team, IEEE Data Engineering Bulletin, Dec 2006 (Special Issue on Web-Scale Data, Systems, and Semantics) Systems, Communities, Community Systems on the Web, Community Systems Group at Yahoo! Research, SIGMOD Record, Sept 2007 Towards a PeopleWeb, R. Ramakrishnan and A. Tomkins, IEEE Computer, August 2007 (Special Issue on Web Search)


Download ppt "Web Communities: The World Online Raghu Ramakrishnan Chief Scientist for Audience and Cloud Computing Research Fellow Yahoo! (On leave, Univ. of Wisconsin-Madison)"

Similar presentations


Ads by Google