Published by Ricardo Jagoe. Modified over 9 years ago.
1
Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic
Jin Young Kim*, Kevyn Collins-Thompson, Paul Bennett and Susan Dumais
*Work done during internship at Microsoft Research
2
Search and recommendation are about matching.
Queries, Documents, Websites, Users
3
Term-space matching is not always a good idea.
Granularity, Sparsity, Efficiency
4
Can we build representations beyond term vectors?
Topic Category Reading Level Sentiment Style
5
What would be their implications for search and recommendations? Queries Documents Websites Users Topic Category Reading Level Sentiment Style
6
In a Nutshell
WHAT WE DID:
  Build Profiles of Reading Level and Topic (RLT)
  For queries, websites, users and search sessions
  In order to characterize and compare entities
WHAT WE FOUND:
  Profile matching predicts user’s content preference
  Profiles can indicate when not to personalize
  Profile features can predict expert content
7
Building Reading Level and Topic Profiles
8
Predicting Reading Level and Topic for URL
Reading Level Classifier: based on language model and other sources
Topic Classifier: trained using URLs in each Open Directory Project category
Profile: distribution over reading level, topic, or reading level and topic (RLT)
  P(R|d1), P(T|d1)
9
Entity Profile Built from Related URLs
Entities and related URLs:
  Websites: content vs. user-viewed URLs
  Users: URLs visited during search sessions
  Queries: top-10 retrieved URLs
Example: site profile made from URLs visited during search sessions
  Per-URL profiles P(R|d), P(T|d) are combined into the site profile P(R,T|s)
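The slides do not say how per-URL profiles are combined into an entity profile, so the following is only a minimal sketch under two assumptions: the joint profile of a URL is the product of its marginals (independence of reading level and topic, which the talk does not claim), and the entity profile is the unweighted mean of its related URLs' profiles. The function names `joint_from_marginals` and `build_entity_profile` and the example numbers are hypothetical.

```python
from collections import defaultdict


def joint_from_marginals(p_r, p_t):
    """Assumption: P(R,T|d) = P(R|d) * P(T|d), i.e. R and T independent given d."""
    return {(r, t): pr * pt for r, pr in p_r.items() for t, pt in p_t.items()}


def build_entity_profile(url_profiles):
    """Assumption: the entity profile is the unweighted mean of its URL profiles."""
    if not url_profiles:
        raise ValueError("need at least one URL profile")
    totals = defaultdict(float)
    for profile in url_profiles:
        for key, prob in profile.items():
            totals[key] += prob
    n = len(url_profiles)
    return {key: total / n for key, total in totals.items()}


# Hypothetical data: two URLs visited on one site.
d1 = joint_from_marginals({4: 0.5, 10: 0.5}, {"Health": 1.0})
d2 = joint_from_marginals({4: 1.0}, {"Health": 0.5, "Science": 0.5})
site_profile = build_entity_profile([d1, d2])
print(site_profile)
```

Because each input is a proper distribution, the averaged profile also sums to 1. Weighting URLs by visit count would be an equally plausible choice.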
10
Entity Profile Built with Related Entities
Entity and related entities:
  User: websites visited
  Website: surfacing queries
  Query: issuing users
Relations: user visits website, user issues query, query surfaces website
Example: site profile made from the profiles of its visitors
  Visitor profiles P(R,T|u) are combined into the site profile P(R,T|s)
11
Characterizing and Comparing Profiles
Characterizing an individual entity:
  Mean: expectation
  Variance: entropy
Characterizing a group of entities:
  Build a group centroid from its members
  Variance: divergence among members
Comparing entities and groups:
  Difference in mean
  Divergence in profile (distribution)
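The summary statistics named on this slide can be sketched as follows. This is an illustration, not the authors' code: the slide does not say which divergence is used here, so KL divergence (which appears later in the talk as a user-to-site distance) is assumed, the centroid is taken to be the unweighted mean, and the `eps` smoothing is an added convenience to avoid division by zero. All function names and example profiles are hypothetical.

```python
import math


def expected_level(p_r):
    """Mean of a reading-level profile: E[R] = sum over r of r * P(r)."""
    return sum(level * prob for level, prob in p_r.items())


def entropy(profile):
    """Spread of a profile, in bits."""
    return -sum(p * math.log2(p) for p in profile.values() if p > 0)


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in bits; q is smoothed by eps (an assumption, not from the slides)."""
    keys = set(p) | set(q)
    norm = 1.0 + eps * len(keys)
    total = 0.0
    for k in keys:
        pk = p.get(k, 0.0)
        if pk > 0:
            qk = (q.get(k, 0.0) + eps) / norm
            total += pk * math.log2(pk / qk)
    return total


def centroid(profiles):
    """Group centroid, assumed here to be the unweighted mean of member profiles."""
    if not profiles:
        raise ValueError("need at least one profile")
    keys = set().union(*profiles)
    n = len(profiles)
    return {k: sum(p.get(k, 0.0) for p in profiles) / n for k in keys}


def group_divergence(profiles):
    """Assumed group variance: mean KL divergence of members from the centroid."""
    c = centroid(profiles)
    return sum(kl_divergence(p, c) for p in profiles) / len(profiles)


# Hypothetical reading-level profiles.
focused = {4: 1.0}
diverse = {4: 0.5, 12: 0.5}
print(expected_level(diverse), entropy(focused), entropy(diverse))
print(group_divergence([focused, diverse]))
```

A focused profile has entropy near zero, while a profile split evenly over two levels has entropy of one bit, which is the sense in which entropy plays the role of variance for an individual entity.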
12
Characterizing Web Content, User Interests, and Search Behavior
13
Data Set
Session log data:
  2,281,150 URL visits (1,218,433 SERP clicks)
  Collected from 8,841 users
Profiles of entities:
  4,715 websites with 25+ clicked URLs
  7,613 users with 25+ URL visits
  141,325 unique queries
14
Each topic has a different reading level distribution
Reading Level Distribution for Top ODP Categories
Columns: Category, P(R|T) for R1 to R12, E[R|T]. Values are listed left to right as they appear in the transcript; "…" marks a position where one or more cells are missing, so values after it cannot be assigned to a specific grade column.
Category        R1 to R12                                               E[R|T]
Reference       0.00 … 0.02 0.17 0.10 0.15 0.04 0.02 0.03 0.20 0.27      8.80
Health          0.00 … 0.03 0.18 0.08 0.13 0.04 … 0.10 0.27 0.11         8.53
Science         0.00 … 0.06 0.23 0.09 0.07 0.02 0.01 0.08 0.27 0.17      8.44
Computers       0.00 … 0.06 0.24 0.19 0.03 0.01 … 0.02 0.32 0.12         8.11
Business        0.00 … 0.05 0.22 0.16 0.09 0.03 0.02 0.04 0.26 0.12      8.08
Society         0.00 … 0.02 0.23 0.07 0.35 0.03 0.01 … 0.22 0.06         7.62
Adult           0.00 … 0.05 0.28 0.26 0.14 0.05 0.02 0.01 0.13 0.06      6.98
Kids and Teens  0.00 … 0.02 0.23 0.26 0.13 0.09 0.02 0.01 0.02 0.15 0.08 6.60
Games           0.00 … 0.19 0.36 0.10 0.11 0.02 … 0.03 0.12 0.03         6.39
Recreation      0.00 … 0.11 0.44 0.19 0.08 0.02 … 0.09 0.02              6.18
Arts            0.00 … 0.08 0.40 0.27 0.10 0.05 0.01 … 0.06 0.02         6.18
Home            0.00 … 0.02 0.19 0.41 0.14 0.04 0.03 0.01 0.03 0.09 0.04 6.08
News            0.00 … 0.04 0.41 0.33 0.14 0.02 … 0.01 0.03 0.01         5.99
Shopping        0.00 … 0.01 0.22 0.29 0.24 0.09 0.03 0.01 0.02 0.07 0.02 5.98
Sports          0.00 … 0.09 0.56 0.11 0.10 0.03 … 0.02 0.06 0.02         5.94
15
Topic and reading level characterize websites in each category
16
Profile matching predicts user’s preference over search results
Metric: % of user’s preferences predicted by profile matching, for each clicked website over the skipped website above
Results:
  By degree of focus in user profile: H(R,T|u)
  By the distance metric between user and website: KL_R(u,s) / KL_T(u,s) / KL_RLT(u,s)
User Group   #Clicks   KL_R(u,s)   KL_T(u,s)   KL_RLT(u,s)
↑ Focused    5,960     59.23%      60.79%      65.27%
             147,195   52.25%      54.20%      54.41%
↓ Diverse    197,733   52.75%      53.36%      53.63%
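The metric on this slide can be sketched as a pairwise test: for each click, check whether the clicked website's profile is closer to the user's profile than the profile of the skipped website ranked above it. The sketch below assumes the direction KL(u || s), a small `eps` smoothing of the site profile, and a strict "closer than" rule for ties; none of these details are stated in the slides. The function names and the toy profiles are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in bits; q is smoothed by eps (an assumption)."""
    keys = set(p) | set(q)
    norm = 1.0 + eps * len(keys)
    total = 0.0
    for k in keys:
        pk = p.get(k, 0.0)
        if pk > 0:
            qk = (q.get(k, 0.0) + eps) / norm
            total += pk * math.log2(pk / qk)
    return total


def preference_accuracy(pairs):
    """Fraction of (user, clicked, skipped) triples where the clicked site is closer.

    Each element of `pairs` is a tuple of three profiles:
    (user_profile, clicked_site_profile, skipped_site_profile).
    """
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(
        1
        for user, clicked, skipped in pairs
        if kl_divergence(user, clicked) < kl_divergence(user, skipped)
    )
    return correct / len(pairs)


# Hypothetical topic profiles.
user = {"Health": 0.8, "Sports": 0.2}
health_site = {"Health": 0.9, "Sports": 0.1}
sports_site = {"Health": 0.1, "Sports": 0.9}

pairs = [
    (user, health_site, sports_site),  # user clicked the closer site
    (user, sports_site, health_site),  # user clicked the farther site
]
accuracy = preference_accuracy(pairs)
print(accuracy)
```

With this reading, an accuracy of 50% corresponds to profile matching carrying no information about the user's choice, which is the reference point for the percentages in the table.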
17
Users’ Deviation from Their Own Profiles
Stretch reading: session-level reading level >> long-term reading level
Casual reading: session-level reading level << long-term reading level
URL title words for stretch reading and for casual reading:
Stretch title word     Log ratio   Casual title word   Log ratio
tests                  2.22        best                -0.42
test                   1.99        football            -0.45
sample                 1.94        store               -0.46
digital                1.88        great (deals)       -0.47
(tuition) options      1.87        items               -0.52
(financial) aid        1.87        new                 -0.53
(medication) effects   1.84        sale                -0.61
education              1.77        games               -0.65
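The slides do not define "log ratio", so the following is only one plausible reading: the natural log of a word's smoothed relative frequency in titles from stretch-reading sessions divided by its smoothed relative frequency in titles from casual-reading sessions. The add-one smoothing, whitespace tokenization, function name, and toy titles are all assumptions for illustration, and the output is not meant to reproduce the numbers in the table.

```python
import math
from collections import Counter


def title_word_log_ratios(stretch_titles, casual_titles, smoothing=1.0):
    """Log ratio of smoothed relative word frequencies (stretch vs. casual)."""
    stretch = Counter(w for title in stretch_titles for w in title.lower().split())
    casual = Counter(w for title in casual_titles for w in title.lower().split())
    vocab = set(stretch) | set(casual)
    stretch_total = sum(stretch.values()) + smoothing * len(vocab)
    casual_total = sum(casual.values()) + smoothing * len(vocab)
    return {
        w: math.log((stretch[w] + smoothing) / stretch_total)
        - math.log((casual[w] + smoothing) / casual_total)
        for w in vocab
    }


# Hypothetical URL titles from the two kinds of sessions.
stretch_titles = ["practice tests", "sample tests"]
casual_titles = ["football store", "store sale"]

ratios = title_word_log_ratios(stretch_titles, casual_titles)
for word, value in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {value:+.2f}")
```

Positive values mark words over-represented in stretch-reading sessions and negative values mark words over-represented in casual-reading sessions, matching the sign convention of the table.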
18
Comparing Expert vs. Non-expert URLs Expert vs. Non-expert URLs taken from [White’09]
19
Predicting Expert vs. Novice Websites
Results:
  Baseline (predict most likely class): 65.8%
  Classifier accuracy: 82.2%
Features:
Feature        Correl. with Expertness   Description
E[R|Qs]        +0.34                     Expectation of surfacing query's RL
E[R|Us]        +0.44                     Expectation of visitor's RL
Div_RLT(U,s)   -0.56                     Distance of visitors’ RLT profile from site's
Div_T(U,s)     -0.55                     Distance of visitors’ Topic profile from site's
20
Thank you for your attention!
WHAT WE DID:
  Build Profiles of Reading Level and Topic (RLT)
  For queries, websites, users and search sessions
  To characterize and compare entities
WHAT WE FOUND:
  Profile matching predicts user’s content preference
  Profiles can indicate when not to personalize
  Profile features can predict expert content
More at: @jin4ir / cs.umass.edu/~jykim
21
Optional Slides
22
Correlation between Site vs. Visitor Profiles
Website reading level vs. visitor diversity
Breakdown per topic reveals stronger relationship
Website Reading Level   Visitor Profile Diversity
                        Div_R(U|s)   Div_T(U|s)   Div_RT(U|s)
E[R|s]                  0.052        0.081        0.095
23
Query / User Reading Level against P(Topic) User profile shows different trends in Computers