Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic
Jin Young Kim*, Kevyn Collins-Thompson, Paul Bennett and Susan Dumais
*Work done during internship at Microsoft Research

Search and recommendation are about matching.
Entities shown: Queries, Documents, Websites, Users

Term-space matching is not always a good idea.
Issues shown: Granularity, Sparsity, Efficiency

Can we build representations beyond term vectors?
Candidates shown: Topic Category, Reading Level, Sentiment, Style

What would these representations imply for search and recommendation?
Entities shown: Queries, Documents, Websites, Users
Representations shown: Topic Category, Reading Level, Sentiment, Style

In a Nutshell
WHAT WE DID:
• Build profiles of reading level and topic (RLT)
• For queries, websites, users and search sessions
• In order to characterize and compare entities
WHAT WE FOUND:
• Profile matching predicts user's content preference
• Profiles can indicate when not to personalize
• Profile features can predict expert content

Building Reading Level and Topic Profiles

Predicting Reading Level and Topic for a URL
• Reading Level Classifier
  - Based on language model and other sources
• Topic Classifier
  - Trained using URLs in each Open Directory Project category
• Profile
  - Distribution over reading level, topic, or reading level and topic (RLT)
Diagram labels: P(R|d1), P(T|d1)

Entity Profile Built from Related URLs
• Entities and related URLs
  - Websites: content vs. user-viewed URLs
  - Users: URLs visited during search sessions
  - Queries: top-10 retrieved URLs
• Example
  - Site profile made from URLs visited during search sessions
Diagram labels: three URLs, each with P(R|d1), P(T|d1); site profile P(R,T|s)
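The slide does not say how per-URL predictions are combined into an entity profile, so the following is only a minimal sketch under two loudly stated assumptions: (1) the joint P(R,T|d) of a URL is taken as the product of its reading-level and topic distributions, and (2) the entity profile is the uniform average over the entity's related URLs. The function names and the toy numbers are hypothetical.

```python
import numpy as np

def url_joint_profile(p_r, p_t):
    """Joint P(R,T|d) for one URL.

    ASSUMPTION: R and T are treated as independent given the URL,
    so the joint is the outer product of the two marginals.
    """
    return np.outer(np.asarray(p_r, dtype=float), np.asarray(p_t, dtype=float))

def entity_profile(url_profiles):
    """Entity profile P(R,T|s) from its related URLs.

    ASSUMPTION: every related URL gets equal weight (uniform average).
    """
    return np.stack(url_profiles).mean(axis=0)

# Toy data: 3 reading levels, 2 topics, 2 URLs related to one site.
urls = [
    url_joint_profile([0.7, 0.2, 0.1], [0.9, 0.1]),
    url_joint_profile([0.1, 0.3, 0.6], [0.5, 0.5]),
]
site = entity_profile(urls)

# Marginal profiles recovered from the joint.
p_r_site = site.sum(axis=1)  # P(R|s)
p_t_site = site.sum(axis=0)  # P(T|s)
```

A weighted average (for example by visit count) would be an equally plausible choice; the uniform mean is used here only because it is the simplest.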

Entity Profile Built with Related Entities
• Entity and related entities
  - User: websites visited
  - Website: surfacing queries
  - Query: issuing users
• Example
  - Site profile made from the profiles of its visitors
Diagram labels: User, Query, Website; relations Visit, Issue, Surface; profiles P(R,T|s), P(R,T|u)

Characterizing and Comparing Profiles
• Characterizing an individual entity
  - Mean: expectation
  - Variance: entropy
• Characterizing a group of entities
  - Build a group centroid from its members
  - Variance: divergence among members
• Comparing entities and groups
  - Difference in mean
  - Divergence in profile (distribution)
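The measures listed above can be sketched as follows. This is an illustration, not the authors' implementation: the slides name expectation, entropy, centroid, and divergence but do not fix the formulas, so the sketch assumes reading levels numbered 1..n, Shannon entropy in bits, a centroid that is the plain mean of member profiles, and KL divergence with a small smoothing constant. All function names are hypothetical.

```python
import numpy as np

def expectation(p_r):
    """Mean reading level E[R], assuming levels are numbered 1..n."""
    p_r = np.asarray(p_r, dtype=float)
    levels = np.arange(1, len(p_r) + 1)
    return float(np.dot(levels, p_r))

def entropy(p):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

def kl(p, q, eps=1e-12):
    """KL(p || q) in bits, with additive smoothing (an assumption) to avoid log(0)."""
    p = np.asarray(p, dtype=float).ravel() + eps
    q = np.asarray(q, dtype=float).ravel() + eps
    p = p / p.sum()
    q = q / q.sum()
    return float((p * np.log2(p / q)).sum())

def centroid(members):
    """Group centroid as the unweighted mean of member profiles (assumption)."""
    return np.mean(np.stack([np.asarray(m, dtype=float) for m in members]), axis=0)

def group_divergence(members):
    """Average KL divergence of each member from the group centroid."""
    c = centroid(members)
    return float(np.mean([kl(m, c) for m in members]))

# Toy profiles over 4 reading levels.
a = [0.5, 0.5, 0.0, 0.0]
b = [0.0, 0.0, 0.5, 0.5]
mean_a = expectation(a)
focus_a = entropy(a)
spread = group_divergence([a, b])
```

With these definitions, a low entropy indicates a focused profile and a high group divergence indicates members that differ strongly from one another.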

Characterizing Web Content, User Interests, and Search Behavior

Data Set
• Session log data
  - 2,281,150 URL visits (1,218,433 SERP clicks)
  - Collected from 8,841 users
• Profiles of entities
  - 4,715 websites with 25+ clicked URLs
  - 7,613 users with 25+ URL visits
  - 141,325 unique queries

Reading Level Distribution for Top ODP Categories
• Each topic has a different reading level distribution
Columns: Category, E[R|T], then the P(R|T) cells for R1 to R12 in left-to-right order (-- marks blank cells in the transcript)
Category         E[R|T]   P(R|T) cells
Reference        8.80     0.00 -- 0.02 0.17 0.10 0.15 0.04 0.02 0.03 0.20 0.27
Health           8.53     0.00 -- 0.03 0.18 0.08 0.13 0.04 -- 0.10 0.27 0.11
Science          8.44     0.00 -- 0.06 0.23 0.09 0.07 0.02 0.01 0.08 0.27 0.17
Computers        8.11     0.00 -- 0.06 0.24 0.19 0.03 0.01 -- 0.02 0.32 0.12
Business         8.08     0.00 -- 0.05 0.22 0.16 0.09 0.03 0.02 0.04 0.26 0.12
Society          7.62     0.00 -- 0.02 0.23 0.07 0.35 0.03 0.01 -- 0.22 0.06
Adult            6.98     0.00 -- 0.05 0.28 0.26 0.14 0.05 0.02 0.01 0.13 0.06
Kids and Teens   6.60     0.00 -- 0.02 0.23 0.26 0.13 0.09 0.02 0.01 0.02 0.15 0.08
Games            6.39     0.00 -- 0.19 0.36 0.10 0.11 0.02 -- 0.03 0.12 0.03
Recreation       6.18     0.00 -- 0.11 0.44 0.19 0.08 0.02 -- 0.09 0.02
Arts             6.18     0.00 -- 0.08 0.40 0.27 0.10 0.05 0.01 -- 0.06 0.02
Home             6.08     0.00 -- 0.02 0.19 0.41 0.14 0.04 0.03 0.01 0.03 0.09 0.04
News             5.99     0.00 -- 0.04 0.41 0.33 0.14 0.02 -- 0.01 0.03 0.01
Shopping         5.98     0.00 -- 0.01 0.22 0.29 0.24 0.09 0.03 0.01 0.02 0.07 0.02
Sports           5.94     0.00 -- 0.09 0.56 0.11 0.10 0.03 -- 0.02 0.06 0.02

Topic and reading level characterize websites in each category

Profile matching predicts users' preferences over search results
• Metric
  - % of a user's preferences predicted by profile matching, where each preference is a clicked website over a skipped website ranked above it
• Results
  - By degree of focus in the user profile: H(R,T|u)
  - By the distance metric between user and website: KL_R(u,s) / KL_T(u,s) / KL_RLT(u,s)
User Group    #Clicks    KL_R(u,s)   KL_T(u,s)   KL_RLT(u,s)
↑ Focused       5,960    59.23%      60.79%      65.27%
              147,195    52.25%      54.20%      54.41%
↓ Diverse     197,733    52.75%      53.36%      53.63%
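A minimal sketch of the metric described on this slide, under stated assumptions: a preference pair is (clicked site, skipped site ranked above it), profile matching "predicts" the pair when the clicked site's profile is closer to the user's profile, and closeness is KL(user || site) with additive smoothing. The direction of the divergence, the smoothing constant, and all names and toy numbers are assumptions, not details taken from the slides.

```python
import numpy as np

def kl(p, q, eps=1e-9):
    """KL(p || q) with additive smoothing (an assumption) to avoid log(0)."""
    p = np.asarray(p, dtype=float).ravel() + eps
    q = np.asarray(q, dtype=float).ravel() + eps
    p = p / p.sum()
    q = q / q.sum()
    return float((p * np.log(p / q)).sum())

def preference_accuracy(user_profile, pairs):
    """Fraction of (clicked, skipped) pairs where the clicked site is closer to the user.

    pairs: list of (clicked_site_profile, skipped_site_profile).
    """
    correct = sum(
        1
        for clicked, skipped in pairs
        if kl(user_profile, clicked) < kl(user_profile, skipped)
    )
    return correct / len(pairs)

# Toy data: profiles over 3 bins (could be reading levels, topics, or flattened RLT cells).
user = [0.8, 0.1, 0.1]
pairs = [
    ([0.7, 0.2, 0.1], [0.1, 0.1, 0.8]),  # clicked site resembles the user: predicted
    ([0.2, 0.2, 0.6], [0.6, 0.3, 0.1]),  # skipped site resembles the user: not predicted
]
accuracy = preference_accuracy(user, pairs)
```

Running the same function with reading-level, topic, or joint RLT profiles would give one accuracy per distance metric, mirroring the three result columns in the table.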

Users’ Deviation from Their Own Profiles
• Stretch reading
  - Session-level reading level >> long-term reading level
• Casual reading
  - Session-level reading level << long-term reading level
URL Title Words for Stretch Reading        URL Title Words for Casual Reading
Title word              Log ratio          Title word       Log ratio
tests                   2.22               best             -0.42
test                    1.99               football         -0.45
sample                  1.94               store            -0.46
digital                 1.88               great (deals)    -0.47
(tuition) options       1.87               items            -0.52
(financial) aid         1.87               new              -0.53
(medication) effects    1.84               sale             -0.61
education               1.77               games            -0.65
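The stretch/casual distinction above can be sketched as a comparison of expected reading levels. The slide only says ">>" and "<<", so the `margin` threshold below is a hypothetical parameter chosen for illustration, as are the function names and the "typical" label for sessions that fall in between.

```python
import numpy as np

def expected_level(p_r):
    """Expected reading level, assuming levels are numbered 1..n."""
    p_r = np.asarray(p_r, dtype=float)
    return float(np.dot(np.arange(1, len(p_r) + 1), p_r))

def label_session(session_p_r, longterm_p_r, margin=2.0):
    """Label a session relative to the user's long-term profile.

    ASSUMPTION: 'much greater/less than' is modeled as a fixed margin
    in expected reading level; the slides do not give a threshold.
    """
    diff = expected_level(session_p_r) - expected_level(longterm_p_r)
    if diff >= margin:
        return "stretch"
    if diff <= -margin:
        return "casual"
    return "typical"

def one_hot(level, n_levels=12):
    """Toy profile with all mass on a single reading level."""
    p = np.zeros(n_levels)
    p[level - 1] = 1.0
    return p

long_term = one_hot(6)
labels = [label_session(one_hot(k), long_term) for k in (10, 3, 7)]
```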

Comparing Expert vs. Non-expert URLs
• Expert vs. non-expert URLs taken from [White’09]

Predicting Expert vs. Novice Websites
• Results
  - Baseline (predict most likely class): 65.8%
  - Classifier accuracy: 82.2%
• Features
Feature          Correl. with Expertness   Description
E[R|Qs]          +0.34                     Expectation of surfacing query's RL
E[R|Us]          +0.44                     Expectation of visitor's RL
Div_RLT(U,s)     -0.56                     Distance of visitors’ RLT profile from site's
Div_T(U,s)       -0.55                     Distance of visitors’ Topic profile from site's

Thank you for your attention!
WHAT WE DID:
• Build profiles of reading level and topic (RLT)
• For queries, websites, users and search sessions
• To characterize and compare entities
WHAT WE FOUND:
• Profile matching predicts user’s content preference
• Profiles can indicate when not to personalize
• Profile features can predict expert content
More at: @jin4ir / cs.umass.edu/~jykim

Optional Slides

Correlation between Site vs. Visitor Profiles
• Website reading level vs. visitor diversity
• Breakdown per topic reveals stronger relationship
Website Reading Level    Visitor Profile Diversity
                         Div_R(U|s)   Div_T(U|s)   Div_RT(U|s)
E[R|s]                   0.052        0.081        0.095

Query / User Reading Level against P(Topic)
• User profile shows different trends in Computers
