Information Management on the World-Wide Web. Junghoo “John” Cho, UCLA Computer Science.

Presentation transcript:

1 Information Management on the World-Wide Web Junghoo “John” Cho UCLA Computer Science

2 The Web and Information Galore

3 10 Years Ago
Reading papers for research
–Stacks of papers
–Long wait

4 With Web

5 Challenges (1)
Information overload
–Too much information, too little time

6 Information Overload
“XML” to Google
–14 million matching documents!
“XML” to Amazon
–464 matching books!
Which one to read?

7 Challenges (2)
Hidden Web
–Not indexed by search engines
–“Hidden” from an average user
–Browse every site manually? …

8 Challenges (3)
Transience

9 Challenges (4)
Scattered & unstructured data
–All Computer Science faculty members and graduate students in the US?

10 Projects In Our Group
Web Archive
Hidden Web Integration
Page Ranking Algorithm
User Recommendation System

11 User Recommendation System
464 books on XML
Which one to read?
–The one that my colleagues and friends recommend?

12 Amazon’s Recommendation System
1 – 5 star rating by individual users
Books can be sorted by “average user rating”

13 My Typical Scenario
Sort books by their average user rating
Browse top 20 books to decide what to read

14 Questions
Is “5 star” by one user better than “4.9 star” by 100 users?
–Intuitively, I prefer 4.9 star by 100 users
–More “reliable” rating
How much can I trust the rating of a particular person?
–How do I know that the person’s rating is reliable?

15 Our Approach
“Inherent quality” or “rating” of a book
–How many users would recommend the book (i.e., give a high rating) if all users had read the book?
More user ratings → more information on the “quality” of the book
–An average user is likely to give a high rating to a high-quality book

16 Probabilistic Rating Model
How likely is the book of “4 star rating”?
–Rating probability distribution
[Figure: probability density (y-axis) over book rating/quality (x-axis)]

17 Update of Rating Probability
As more users provide rating, we update our probability distribution
[Figure: probability density over book rating/quality]

18 Update of Rating Probability
As more users provide rating, we update our probability distribution
[Figure: probability density over book rating/quality, after five-star rating by a user]

19 Update of Rating Probability
As more users provide rating, we update our probability distribution
[Figure: probability density over book rating/quality, after one-star rating by a user]

20 Update of Rating Probability
As more users provide rating, we update our probability distribution
[Figure: probability density over book rating/quality, after many ratings]

21 Bayesian Inference Theory
Given a user rating UR, what is the inherent rating IR?
$$P(IR \mid UR) = \frac{P(UR \mid IR)\,P(IR)}{P(UR)}$$
–P(IR): probability of book rating BEFORE user rating
–P(IR | UR): probability of book rating AFTER user rating
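
The Bayesian update of the rating distribution can be sketched on a discrete grid of candidate inherent ratings. This is only an illustration, not the model used in the talk: the names `QUALITIES`, `likelihood`, and `bayes_update` are hypothetical, and the form of P(UR | IR) (reports become half as likely with each star of distance from the inherent rating) is an assumption made purely so the example runs.

```python
# Hypothetical discretization: candidate inherent ratings IR, in stars.
QUALITIES = [1, 2, 3, 4, 5]

def likelihood(user_rating, inherent_rating):
    """Assumed P(UR | IR): a report is half as likely for each star of
    distance from the inherent rating (illustrative choice only)."""
    normalizer = sum(2.0 ** -abs(r - inherent_rating) for r in QUALITIES)
    return 2.0 ** -abs(user_rating - inherent_rating) / normalizer

def bayes_update(prior, user_rating):
    """Return P(IR | UR) from the prior P(IR), both as lists over QUALITIES."""
    unnormalized = [likelihood(user_rating, q) * p
                    for q, p in zip(QUALITIES, prior)]
    evidence = sum(unnormalized)  # this is P(UR)
    return [u / evidence for u in unnormalized]

# Start from a flat prior and fold in ratings one at a time.
prior = [0.2] * 5
after_five = bayes_update(prior, 5)      # mass shifts toward high quality

posterior = prior
for rating in [4] * 10:                  # many consistent ratings
    posterior = bayes_update(posterior, rating)
```

With a flat prior, a single five-star rating shifts probability toward the high end without committing to it, while ten consistent four-star ratings concentrate nearly all of the mass on 4 stars, mirroring the "after many ratings" picture.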

22 User Model
The characteristics of a user
Sensitivity: Slope of the curve
+1: good, –1: bad, 0: not useful
[Figures: user rating vs. book quality, one curve for a “good” user and one for a “bad” user]

23 User Model
The characteristics of a user
Bias: Average “height” of the curve
[Figures: user rating vs. book quality, one curve with positive bias and one with negative bias]
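
One way to make sensitivity and bias concrete, assuming book quality is already known, is to fit a straight line to a user's ratings. The sketch below takes sensitivity to be the least-squares slope and bias to be the gap between the user's mean rating and the mean quality; both definitions and the function name `fit_user_model` are assumptions for illustration, not the talk's definitions.

```python
def fit_user_model(qualities, ratings):
    """Estimate (sensitivity, bias) for one user.

    sensitivity: least-squares slope of rating against book quality.
    bias: mean rating minus mean quality (average "height" offset).
    """
    n = len(qualities)
    mean_q = sum(qualities) / n
    mean_r = sum(ratings) / n
    var_q = sum((q - mean_q) ** 2 for q in qualities)
    cov = sum((q - mean_q) * (r - mean_r)
              for q, r in zip(qualities, ratings))
    # With no spread in quality the slope is undefined; report 0.
    sensitivity = cov / var_q if var_q > 0 else 0.0
    bias = mean_r - mean_q
    return sensitivity, bias

good = fit_user_model([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])      # tracks quality
contrary = fit_user_model([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])  # inverts it
flat = fit_user_model([1, 2, 3, 4, 5], [3, 3, 3, 3, 3])      # carries no signal
generous = fit_user_model([1, 2, 3, 4], [2, 3, 4, 5])        # one star too high
```

Under these definitions the four users come out with sensitivity +1, –1, 0, and +1 respectively, and only the generous user has a nonzero bias (+1).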

24 Iterative Model Refinement
As more users rate a book, we get better estimates of book quality
As we estimate a book’s quality better, we get a better idea of a user’s sensitivity and bias

25 Iterative Model Refinement
[Diagram: User-provided Rating, Book Rating Estimate, User Characteristics]
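
The refinement loop can be sketched as an alternation between two estimates. The code below is a simplified stand-in, not the talk's algorithm: it assumes book quality is a sensitivity-weighted mean of bias-corrected ratings, and that each user is then re-fit with a least-squares line. The names `fit_user` and `refine` and the sample data are hypothetical.

```python
def fit_user(qualities, ratings):
    """Least-squares slope (sensitivity) and mean offset (bias) for one user."""
    n = len(qualities)
    mean_q = sum(qualities) / n
    mean_r = sum(ratings) / n
    var_q = sum((q - mean_q) ** 2 for q in qualities)
    cov = sum((q - mean_q) * (r - mean_r) for q, r in zip(qualities, ratings))
    sensitivity = cov / var_q if var_q > 0 else 0.0
    return sensitivity, mean_r - mean_q

def refine(ratings, iterations=10):
    """ratings maps (user, book) -> stars. Returns (quality, sensitivity, bias)."""
    users = sorted({u for u, _ in ratings})
    books = sorted({b for _, b in ratings})
    sensitivity = {u: 1.0 for u in users}   # start by trusting everyone
    bias = {u: 0.0 for u in users}
    quality = {}
    for _ in range(iterations):
        # Step 1: book quality from bias-corrected ratings, weighted by
        # sensitivity (negative sensitivities are clipped to zero weight).
        for b in books:
            rated = [(u, r) for (u, bk), r in ratings.items() if bk == b]
            weights = [max(sensitivity[u], 0.0) for u, _ in rated]
            if sum(weights) == 0:
                weights = [1.0] * len(rated)  # fall back to a plain average
            quality[b] = sum(w * (r - bias[u])
                             for w, (u, r) in zip(weights, rated)) / sum(weights)
        # Step 2: re-fit every user against the current quality estimates.
        for u in users:
            pairs = [(quality[b], r) for (us, b), r in ratings.items() if us == u]
            sensitivity[u], bias[u] = fit_user([q for q, _ in pairs],
                                               [r for _, r in pairs])
    return quality, sensitivity, bias

# Made-up data: u2 rates one star below u1; u3 gives every book 3 stars.
sample = {("u1", "A"): 2, ("u1", "B"): 3, ("u1", "C"): 4, ("u1", "D"): 5,
          ("u2", "A"): 1, ("u2", "B"): 2, ("u2", "C"): 3, ("u2", "D"): 4,
          ("u3", "A"): 3, ("u3", "B"): 3, ("u3", "C"): 3, ("u3", "D"): 3}
quality, sensitivity, bias = refine(sample)
```

On this toy data the uninformative user u3 ends up with zero sensitivity and therefore zero weight, and u2 is assigned a lower bias than u1. A real system would also need to handle users with very few ratings, which this sketch simply treats as having zero sensitivity.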

26 Final Recommendation
Recommend the book with the highest expected rating

27 Initial Results
Our system prefers a 4.9-star book by 100 people to a 5-star book by 1 user
If a user gives random ratings, the system ignores the user’s rating
More thorough evaluation on the way
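
The preference for a well-supported 4.9-star book over a single 5-star rating falls out of almost any Bayesian treatment. As a worked illustration using a deliberately simple substitute model (a symmetric Dirichlet prior over the five star levels, not the talk's model), the expected rating can be computed as below; `expected_rating` and `prior_count` are hypothetical names.

```python
STARS = [1, 2, 3, 4, 5]

def expected_rating(star_counts, prior_count=1.0):
    """Posterior-mean rating when each star level starts with
    `prior_count` pseudo-ratings (symmetric Dirichlet prior)."""
    smoothed = [c + prior_count for c in star_counts]
    total = sum(smoothed)
    return sum(s * c for s, c in zip(STARS, smoothed)) / total

# Counts of 1..5 star ratings for two hypothetical books.
book_a = [0, 0, 0, 10, 90]  # 100 ratings, raw average 4.9
book_b = [0, 0, 0, 0, 1]    # a single 5-star rating, raw average 5.0

candidates = {"A": book_a, "B": book_b}
recommended = max(candidates, key=lambda name: expected_rating(candidates[name]))
```

Book A's expected rating is 505/105, about 4.81, while book B's is 20/6, about 3.33, so A is recommended even though B has the higher raw average. The single rating barely moves B away from the prior mean of 3 stars.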

28 Other Projects
Web Archive
Hidden Web Integration
Page Ranking Algorithm

29 Ph.D. Students on the Projects
Alex Ntoulas
Rob Adams
Victor Liu
–In Dr Chu’s group

30 Thank You
Questions?