Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic
Jin Young Kim*, Kevyn Collins-Thompson, Paul Bennett and Susan Dumais


1 Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic
Jin Young Kim*, Kevyn Collins-Thompson, Paul Bennett and Susan Dumais
*Work done during internship at Microsoft Research

2 Search and recommendation are about matching.
[Diagram: Queries, Documents, Websites, Users]

3 Term-space matching is not always a good idea.
• Granularity
• Sparsity
• Efficiency

4 Can we build representations beyond term vectors?
• Topic Category
• Reading Level
• Sentiment
• Style

5 What would be their implications for search and recommendations?
[Diagram: Queries, Documents, Websites, Users; Topic Category, Reading Level, Sentiment, Style]

6 In a Nutshell
WHAT WE DID:
• Build Profiles of Reading Level and Topic (RLT)
• For queries, websites, users and search sessions
• In order to characterize and compare entities
WHAT WE FOUND:
• Profile matching predicts user's content preference
• Profiles can indicate when not to personalize
• Profile features can predict expert content

7 Building Reading Level and Topic Profiles

8 Predicting Reading Level and Topic for URL
• Reading Level Classifier
  - Based on language model and other sources
• Topic Classifier
  - Trained using URLs in each Open Directory Project category
• Profile
  - Distribution over reading level, topic, or reading level and topic (RLT)
[Diagram: P(R|d1), P(T|d1)]

9 Entity Profile Built from Related URLs
• Entities and Related URLs
  - Websites: content vs. user-viewed URLs
  - Users: URLs visited during search sessions
  - Queries: top-10 retrieved URLs
• Example:
  - Site profile made from URLs visited during search sessions
[Diagram: P(R|d1), P(T|d1) per related URL, combined into P(R,T|s)]
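A minimal sketch of how URL-level predictions might be combined into an entity profile. The slide does not say how the per-URL distributions are aggregated, so the uniform average used here is an assumption, and the function names and the example data are hypothetical.

```python
from collections import defaultdict


def build_entity_profile(url_profiles):
    """Average per-URL joint distributions P(R,T|d) into one profile P(R,T|s).

    url_profiles: list of dicts mapping (reading_level, topic) -> probability.
    Uniform averaging over URLs is an assumption of this sketch.
    """
    if not url_profiles:
        raise ValueError("need at least one URL profile")
    totals = defaultdict(float)
    for profile in url_profiles:
        for key, prob in profile.items():
            totals[key] += prob
    n = len(url_profiles)
    return {key: total / n for key, total in totals.items()}


def marginal(profile, axis):
    """Marginalize a joint profile: axis=0 gives P(R|s), axis=1 gives P(T|s)."""
    out = defaultdict(float)
    for key, prob in profile.items():
        out[key[axis]] += prob
    return dict(out)


# Hypothetical classifier outputs for two URLs visited on the same site.
d1 = {(5, "Sports"): 0.7, (9, "Science"): 0.3}
d2 = {(5, "Sports"): 0.5, (6, "News"): 0.5}

site = build_entity_profile([d1, d2])
print(site)               # joint profile P(R,T|s)
print(marginal(site, 0))  # reading-level profile P(R|s)
print(marginal(site, 1))  # topic profile P(T|s)
```

The same function could serve users (URLs visited during search sessions) or queries (top-10 retrieved URLs) by changing which URLs are passed in.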

10 Entity Profile Built with Related Entities
• Entity and related entities
  - User: websites visited
  - Website: surfacing queries
  - Query: issuing users
• Example:
  - Site profile made from the profiles of its visitors
[Diagram: User, Query and Website linked by Visit, Issue and Surface; P(R,T|s), P(R,T|u)]

11 Characterizing and Comparing Profiles
• Characterizing an Individual Entity
  - Mean: expectation
  - Variance: entropy
• Characterizing a Group of Entities
  - Build a group centroid from its members
  - Variance: divergence among members
• Comparing Entities and Groups
  - Difference in mean
  - Divergence in profile (distribution)
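The measures listed on this slide can be sketched as below. The slide names the quantities but not their exact formulas, so several choices here are assumptions: entropy in bits, KL divergence as the profile divergence, the member average as the centroid, and mean KL from the centroid as the group variance. All function names are hypothetical.

```python
import math


def expected_level(p_r):
    """Mean of a reading-level profile: E[R] = sum over r of r * P(r)."""
    return sum(level * prob for level, prob in p_r.items())


def entropy(p):
    """Spread of a single profile, in bits."""
    return -sum(prob * math.log2(prob) for prob in p.values() if prob > 0)


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in bits; q is floored at eps to avoid division by zero."""
    return sum(
        prob * math.log2(prob / max(q.get(key, 0.0), eps))
        for key, prob in p.items()
        if prob > 0
    )


def centroid(profiles):
    """Group centroid: the average of the member profiles."""
    keys = set().union(*profiles)
    return {k: sum(p.get(k, 0.0) for p in profiles) / len(profiles) for k in keys}


def group_divergence(profiles):
    """Group spread: mean KL divergence of each member from the centroid."""
    c = centroid(profiles)
    return sum(kl_divergence(p, c) for p in profiles) / len(profiles)


# Hypothetical reading-level profiles for a user and a site.
user = {4: 0.5, 8: 0.5}
site = {4: 0.25, 8: 0.75}

print(expected_level(user), expected_level(site))  # difference in mean
print(entropy(user))                               # spread of one profile
print(kl_divergence(user, site))                   # divergence in profile
print(group_divergence([user, site]))              # spread of a group
```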

12 Characterizing Web Content, User Interests, and Search Behavior

13 Data Set
• Session Log Data
  - 2,281,150 URL visits (1,218,433 SERP clicks)
  - Collected from 8,841 users
• Profiles of Entities
  - 4,715 websites with 25+ clicked URLs
  - 7,613 users with 25+ URL visits
  - 141,325 unique queries

14 Reading Level Distribution for Top ODP Categories
• Each topic has a different reading level distribution
Category | P(R|T) for R1 to R12 (-- marks a blank cell) | E[R|T]
Reference | 0.00 -- 0.02 0.17 0.10 0.15 0.04 0.02 0.03 0.20 0.27 | 8.80
Health | 0.00 -- 0.03 0.18 0.08 0.13 0.04 -- 0.10 0.27 0.11 | 8.53
Science | 0.00 -- 0.06 0.23 0.09 0.07 0.02 0.01 0.08 0.27 0.17 | 8.44
Computers | 0.00 -- 0.06 0.24 0.19 0.03 0.01 -- 0.02 0.32 0.12 | 8.11
Business | 0.00 -- 0.05 0.22 0.16 0.09 0.03 0.02 0.04 0.26 0.12 | 8.08
Society | 0.00 -- 0.02 0.23 0.07 0.35 0.03 0.01 -- 0.22 0.06 | 7.62
Adult | 0.00 -- 0.05 0.28 0.26 0.14 0.05 0.02 0.01 0.13 0.06 | 6.98
Kids and Teens | 0.00 -- 0.02 0.23 0.26 0.13 0.09 0.02 0.01 0.02 0.15 0.08 | 6.60
Games | 0.00 -- 0.19 0.36 0.10 0.11 0.02 -- 0.03 0.12 0.03 | 6.39
Recreation | 0.00 -- 0.11 0.44 0.19 0.08 0.02 -- 0.09 0.02 | 6.18
Arts | 0.00 -- 0.08 0.40 0.27 0.10 0.05 0.01 -- 0.06 0.02 | 6.18
Home | 0.00 -- 0.02 0.19 0.41 0.14 0.04 0.03 0.01 0.03 0.09 0.04 | 6.08
News | 0.00 -- 0.04 0.41 0.33 0.14 0.02 -- 0.01 0.03 0.01 | 5.99
Shopping | 0.00 -- 0.01 0.22 0.29 0.24 0.09 0.03 0.01 0.02 0.07 0.02 | 5.98
Sports | 0.00 -- 0.09 0.56 0.11 0.10 0.03 -- 0.02 0.06 0.02 | 5.94

15 Topic and reading level characterize websites in each category

16 Profile matching predicts user's preference over search results
• Metric
  - % of user's preferences predicted by profile matching, for each clicked website over the skipped website above
• Results
  - By degree of focus in user profile: H(R,T|u)
  - By the distance metric between user and website: KL_R(u,s) / KL_T(u,s) / KL_RLT(u,s)
User Group | #Clicks | KL_R(u,s) | KL_T(u,s) | KL_RLT(u,s)
↑ Focused | 5,960 | 59.23% | 60.79% | 65.27%
 | 147,195 | 52.25% | 54.20% | 54.41%
↓ Diverse | 197,733 | 52.75% | 53.36% | 53.63%
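The metric on this slide can be sketched as follows: for each pair of a clicked website and the skipped website ranked above it, the preference counts as predicted when the clicked site's profile is closer to the user's profile. Treating "closer" as a strictly smaller KL(u,s), and the flooring of zero probabilities, are assumptions of this sketch; the function names and profiles are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in bits; q is floored at eps to avoid division by zero."""
    return sum(
        prob * math.log2(prob / max(q.get(key, 0.0), eps))
        for key, prob in p.items()
        if prob > 0
    )


def preference_accuracy(user_profile, pairs):
    """Fraction of (clicked, skipped) pairs where the clicked site is closer.

    pairs: list of (clicked_site_profile, skipped_site_profile) tuples.
    """
    correct = 0
    for clicked, skipped in pairs:
        if kl_divergence(user_profile, clicked) < kl_divergence(user_profile, skipped):
            correct += 1
    return correct / len(pairs)


# Hypothetical topic profiles for one user and two websites.
user = {"Sports": 0.8, "News": 0.2}
site_a = {"Sports": 0.7, "News": 0.3}
site_b = {"Sports": 0.1, "News": 0.9}

# First pair: the user clicked site_a and skipped site_b (predicted correctly).
# Second pair: the user clicked site_b and skipped site_a (not predicted).
accuracy = preference_accuracy(user, [(site_a, site_b), (site_b, site_a)])
print(accuracy)
```

Passing reading-level, topic, or joint (R,T) profiles to the same function would correspond to the three distance columns in the results table.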

17 Users' Deviation from Their Own Profiles
• Stretch reading
  - Session-level reading level >> Long-term reading level
• Casual reading
  - Session-level reading level << Long-term reading level
URL Title Words for Stretch Reading | Log ratio | URL Title Words for Casual Reading | Log ratio
tests | 2.22 | best | -0.42
test | 1.99 | football | -0.45
sample | 1.94 | store | -0.46
digital | 1.88 | great (deals) | -0.47
(tuition) options | 1.87 | items | -0.52
(financial) aid | 1.87 | new | -0.53
(medication) effects | 1.84 | sale | -0.61
education | 1.77 | games | -0.65
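One way a per-word log ratio like the one in this table could be computed is sketched below. The slide does not define the ratio, so this version is an assumption: the natural log of a word's smoothed relative frequency in one set of URL titles over its smoothed relative frequency in a comparison set. The function name, the add-one smoothing and the example titles are hypothetical.

```python
import math
from collections import Counter


def title_word_log_ratios(target_titles, background_titles, smoothing=1.0):
    """Log ratio of smoothed word frequencies: target titles vs. background titles."""
    target = Counter(w for title in target_titles for w in title.lower().split())
    background = Counter(w for title in background_titles for w in title.lower().split())
    vocab = set(target) | set(background)
    # Add-k smoothing keeps words seen in only one set from producing log(0).
    target_total = sum(target.values()) + smoothing * len(vocab)
    background_total = sum(background.values()) + smoothing * len(vocab)
    return {
        word: math.log((target[word] + smoothing) / target_total)
        - math.log((background[word] + smoothing) / background_total)
        for word in vocab
    }


# Hypothetical URL titles from stretch-reading and casual-reading sessions.
stretch_titles = ["sample tests", "financial aid options"]
casual_titles = ["football games", "great deals sale"]

ratios = title_word_log_ratios(stretch_titles, casual_titles)
for word, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{word:10s} {ratio:+.2f}")
```

Positive values mark words over-represented in the first set of titles, negative values words over-represented in the second.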

18 Comparing Expert vs. Non-expert URLs
• Expert vs. Non-expert URLs taken from [White'09]

19 Predicting Expert vs. Novice Websites
• Results
Baseline (predict most likely class) | 65.8%
Classifier accuracy | 82.2%
• Features
Feature | Correl. with Expertness | Description
E[R|Qs] | +0.34 | Expectation of surfacing query's RL
E[R|Us] | +0.44 | Expectation of visitor's RL
Div_RLT(U,s) | -0.56 | Distance of visitors' RLT profile from site's
Div_T(U,s) | -0.55 | Distance of visitors' topic profile from site's
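The two kinds of numbers reported on this slide can be illustrated with a short sketch: a majority-class baseline accuracy, and the correlation of one feature with the expert label. The slide does not say which correlation coefficient was used, so Pearson correlation is an assumption, and the labels and feature values below are made up for illustration.

```python
import math


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / len(labels)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data: 1 = expert site, 0 = novice site, with each site's
# expected visitor reading level E[R|Us] as the feature.
is_expert = [1, 1, 0, 1, 0, 1]
visitor_level = [9.1, 8.4, 5.9, 8.8, 6.3, 7.9]

print(majority_baseline_accuracy(is_expert))
print(pearson(visitor_level, is_expert))
```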

20 Thank you for your attention!
WHAT WE DID:
• Build Profiles of Reading Level and Topic (RLT)
• For Queries, Websites, Users and Search Sessions
• To characterize and compare entities
WHAT WE FOUND:
• Profile matching predicts user's content preference
• Profiles can indicate when not to personalize
• Profile features can predict expert content
More at: @jin4ir / cs.umass.edu/~jykim

21 Optional Slides

22 Correlation between Site vs. Visitor Profiles
• Website reading level vs. visitor diversity
• Breakdown per topic reveals stronger relationship
Website Reading Level | Visitor Profile Diversity
 | Div_R(U|s) | Div_T(U|s) | Div_RT(U|s)
E[R|s] | 0.052 | 0.081 | 0.095

23 Query / User Reading Level against P(Topic)
• User profile shows different trends in Computers

