Date: 2012/11/15 Author: Jin Young Kim, Kevyn Collins-Thompson,


1 Characterizing Web Content, User Interest, and Search Behavior by Reading Level and Topic
Date: 2012/11/15
Authors: Jin Young Kim, Kevyn Collins-Thompson, Paul Bennett and Susan Dumais
Source: WSDM 2012
Advisor: Jia-ling Koh
Speaker: Chen-Yu Huang

2 Outline
Introduction
Reading Level & Topic Profiles
Characterizing the Web
Applications
Conclusion

3 Introduction
Web search and user interest
Dimensions: topic, reading level

4 Introduction
Estimate probabilistic profiles that describe users, queries, or websites, and use them to analyze user behavior.
Profile dimensions: topic, reading level

5 Outline
Introduction
Reading Level & Topic Profiles
Characterizing the Web
Applications
Conclusion

6 Reading Level & Topic Profiles
Entity: website (s), user (u), query (q)
Dimensions: reading level (R), topic (T), reading level and topic (RT)
Profile: a probability distribution over reading level and topic (RLT profile)
Examples:
Reading level and topic profile of a user: P(RT | u)
Reading level and topic profile of a query: P(RT | q)
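
As an illustration of what a profile such as P(RT | u) might look like in practice, the sketch below stores a joint distribution as a dictionary keyed by (reading level, topic) pairs and recovers the reading-level and topic profiles by marginalizing. The data values, the topic names, and the helper `marginal` are hypothetical and are not taken from the slides.

```python
from collections import defaultdict

# Hypothetical joint profile P(R, T | u) for one user.
# Keys are (reading_level, topic) pairs; values are probabilities summing to 1.
profile_u = {
    (4, "Kids_and_Teens"): 0.25,
    (4, "Sports"): 0.25,
    (9, "Science"): 0.5,
}


def marginal(joint, axis):
    """Sum a joint profile over the other variable.

    axis=0 keeps the reading level, axis=1 keeps the topic.
    """
    out = defaultdict(float)
    for key, p in joint.items():
        out[key[axis]] += p
    return dict(out)


reading_level_profile = marginal(profile_u, 0)  # P(R | u)
topic_profile = marginal(profile_u, 1)          # P(T | u)
```

Keeping the joint form and deriving the marginals on demand means one stored object can serve as an R, T, or RT profile.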

7 Predicting Reading Level and Topic for a URL
Represent the reading difficulty of a document as a random variable R_d taking values over a range of reading levels.
Reading level classifier: based on a language model.
Topic classifier: trained using the URLs in each Open Directory Project (ODP) category.

8 Building Reading Level and Topic Profiles
Profiles based on the entity itself:
Given a set of URLs associated with each entity, the joint distribution of reading level and topic is built by aggregating the distributions of the individual URLs, as computed by the URL-level classifiers.
To prevent bias, 25 URLs are chosen to estimate each site-level or user-level profile.
For a query, the top URLs are used as the basis of the profile.
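
The aggregation step can be sketched as follows. The slide only says that URL-level distributions are aggregated and that 25 URLs are chosen; the uniform averaging, the random sampling, and the function name `build_profile` are assumptions made for this example.

```python
import random


def build_profile(url_distributions, max_urls=25, seed=0):
    """Average per-URL distributions into one entity-level profile.

    Assumption: URLs are sampled uniformly at random when an entity has
    more than max_urls, and each sampled URL gets equal weight.
    """
    rng = random.Random(seed)
    dists = list(url_distributions)
    if len(dists) > max_urls:
        dists = rng.sample(dists, max_urls)
    profile = {}
    for dist in dists:
        for outcome, p in dist.items():
            profile[outcome] = profile.get(outcome, 0.0) + p / len(dists)
    return profile


# Hypothetical classifier outputs for two URLs of one website.
url_dists = [
    {(4, "Sports"): 1.0},
    {(9, "Science"): 1.0},
]
site_profile = build_profile(url_dists)
```

Capping the number of URLs per entity keeps a heavily visited site or a very active user from being estimated on a different footing than the rest.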

9 Building Reading Level and Topic Profiles
Profiles based on entity relationships:
Circular dependency: avoided by using profiles based only on the entity itself.
Relationship diagram: a user issues a query, a query surfaces a website, and a user visits a website.

10 Characterizing and Comparing Profiles
Characterizing an individual entity:
E[R|e]: expectation of reading level for a given entity e
H(R|e): entropy of reading level for the entity e
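
Both summary statistics are straightforward to compute from a reading-level profile. The sketch below is a minimal version, assuming the profile is a dictionary mapping reading level to probability and that entropy is measured in bits (the slide does not state the logarithm base). The function names and sample profiles are hypothetical.

```python
import math


def expected_level(profile):
    """E[R|e]: probability-weighted mean reading level."""
    return sum(level * p for level, p in profile.items())


def entropy(profile):
    """H(R|e) in bits; zero-probability levels contribute nothing."""
    return -sum(p * math.log2(p) for p in profile.values() if p > 0)


# Two hypothetical entities with the same expected level.
mixed = {4: 0.5, 8: 0.5}   # content split across two levels
focused = {6: 1.0}         # content concentrated at one level

mixed_stats = (expected_level(mixed), entropy(mixed))
focused_stats = (expected_level(focused), entropy(focused))
```

The two example entities share an expected level of 6 but differ in entropy, which is why the expectation and the entropy are reported together.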

11 Characterizing and Comparing Profiles
Characterizing a group of entities:
Build the profile of an entity group by aggregating the distributions of its individual members, taken as the weighted centroid of the individual distributions.
Example: the reading level profile of a user group U.
The group profile can also represent the diversity of the group in terms of its members.
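
A weighted centroid of member distributions can be sketched as below. How the weights are chosen is not given in the transcript, so they are left as an argument that defaults to equal weights; `group_profile` and the sample data are hypothetical.

```python
def group_profile(member_profiles, weights=None):
    """Weighted centroid of member distributions.

    Assumption: weights are non-negative and are normalized to sum to 1,
    so the result is again a probability distribution.
    """
    if weights is None:
        weights = [1.0] * len(member_profiles)
    total = sum(weights)
    centroid = {}
    for profile, w in zip(member_profiles, weights):
        for outcome, p in profile.items():
            centroid[outcome] = centroid.get(outcome, 0.0) + (w / total) * p
    return centroid


# Hypothetical reading-level profiles for two users in a group U.
users = [{4: 1.0}, {8: 1.0}]
profile_U = group_profile(users, weights=[3, 1])
```

Because the centroid is itself a distribution, the same expectation, entropy, and divergence measures used for single entities apply to groups unchanged.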

12 Characterizing and Comparing Profiles
Comparing entities and groups:
Simplest metric of comparison

13 Characterizing and Comparing Profiles
Comparing entities and groups:
Similarity between the full probability distributions of two entities
Kullback-Leibler (KL) divergence
Jensen-Shannon (JS) divergence
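
The two divergences named above follow their standard textbook definitions. The sketch below assumes profiles are dictionaries over a shared outcome space and uses base-2 logarithms, which is a choice made for this example rather than something the slide specifies.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; infinite if q gives zero mass where p does not."""
    total = 0.0
    for outcome, pi in p.items():
        if pi == 0:
            continue
        qi = q.get(outcome, 0.0)
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total


def js_divergence(p, q):
    """JS(p, q): symmetrized KL against the midpoint distribution m."""
    support = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in support}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical reading-level profiles with no overlap.
user_profile = {4: 1.0}
site_profile = {8: 1.0}

kl_value = kl_divergence(user_profile, site_profile)
js_value = js_divergence(user_profile, site_profile)
```

KL divergence is asymmetric and unbounded when the supports differ, while JS divergence is symmetric and bounded by 1 bit, which makes it the safer choice when the two profiles may not overlap.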

14 Outline
Introduction
Reading Level & Topic Profiles
Characterizing the Web
Applications
Conclusion

15 Data Set
Session log data:
Contains the anonymized logs of URLs visited by users
Web page visits from users who visited at least 25 pages
Collected over 10 weeks (2010.8)
Web content dataset:
Reading level and ODP topic predictions
8 billion web documents

16 Characterizing web content

17 Characterizing websites
Topic-specific analysis

18 Characterizing web queries

19 Characterizing websites
Joint analysis of reading level and topic

20 Characterizing web users
Users’ deviation from their own profiles
Stretch reading
Future work

21 Outline
Introduction
Reading Level & Topic Profiles
Characterizing the Web
Applications
Conclusion

22 Application
Compare expert vs. non-expert URLs

23 Application
Predict expert websites
Result

24 Outline
Introduction
Reading Level & Topic Profiles
Characterizing the Web
Applications
Conclusion

25 Conclusion
Provides novel characterizations of websites, users, and queries by combining distributions of reading level and topic.
The profiles can be used for a variety of search-related tasks, such as predicting whether the content of a URL or site is targeted at domain experts or non-experts.
Features derived from RLT profiles can be used to predict a user’s preference for websites in search results.

26 Conclusion
The divergence metrics developed in this paper can be evaluated for their effectiveness as features for personalized re-ranking.
The techniques developed for expert vs. novice site classification can be applied for both recommendation and ranking purposes.

27 Thank you for listening

