Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic. Jin Young Kim*, Kevyn Collins-Thompson, Paul Bennett and Susan Dumais.

Presentation transcript:

Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic
Jin Young Kim*, Kevyn Collins-Thompson, Paul Bennett and Susan Dumais
*Work done during internship at Microsoft Research

Search and recommendation are about matching.
- Queries
- Documents
- Websites
- Users

Term-space matching is not always a good idea.
- Granularity
- Sparsity
- Efficiency

Can we build representations beyond term vectors?
- Topic category
- Reading level
- Sentiment
- Style

What would be their implications for search and recommendation?
- Entities: queries, documents, websites, users
- Representations: topic category, reading level, sentiment, style

In a Nutshell
WHAT WE DID:
- Built profiles of reading level and topic (RLT)
- For queries, websites, users and search sessions
- In order to characterize and compare entities
WHAT WE FOUND:
- Profile matching predicts a user’s content preference
- Profiles can indicate when not to personalize
- Profile features can predict expert content

Building Reading Level and Topic Profiles

Predicting Reading Level and Topic for URL
- Reading level classifier
  - Based on language model and other sources
- Topic classifier
  - Trained using URLs in each Open Directory Project category
- Profile
  - Distribution over reading level, topic, or reading level and topic (RLT)
Figure labels: P(R|d_1), P(T|d_1)

Entity Profile Built from Related URLs
- Entities and related URLs
  - Websites: content vs. user-viewed URLs
  - Users: URLs visited during search sessions
  - Queries: top-10 retrieved URLs
- Example:
  - Site profile made from URLs visited during search sessions
Figure labels: P(R|d_1), P(T|d_1) (repeated three times); P(R,T|s)
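
The slides do not say how the per-URL distributions are combined into an entity profile. The following is a minimal sketch under two loudly stated assumptions: reading level and topic are treated as independent given a URL, and the entity profile is the unweighted average over its related URLs. The function names are hypothetical and the numbers are toy values.

```python
def url_joint_profile(p_r, p_t):
    """Joint P(R,T|d) for one URL, ASSUMING R and T are independent given the URL."""
    return [[r * t for t in p_t] for r in p_r]


def entity_profile(url_profiles):
    """Entity profile P(R,T|s) as the unweighted mean of per-URL joints (an assumption)."""
    joints = [url_joint_profile(p_r, p_t) for p_r, p_t in url_profiles]
    n = len(joints)
    rows, cols = len(joints[0]), len(joints[0][0])
    return [[sum(j[i][k] for j in joints) / n for k in range(cols)]
            for i in range(rows)]


# Toy input: 3 reading levels, 2 topics, two URLs related to one site.
urls = [
    ([0.7, 0.2, 0.1], [0.9, 0.1]),
    ([0.1, 0.3, 0.6], [0.5, 0.5]),
]
site = entity_profile(urls)
```

Because each per-URL joint sums to one, the averaged profile is again a distribution; a weighted average (for example by visit count) would be an equally plausible choice.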

Entity Profile Built with Related Entities
- Entity and related entities
  - User: websites visited
  - Website: surfacing queries
  - Query: issuing users
- Example:
  - Site profile made from the profiles of its visitors
Figure labels: User, Query, Website; Visit, Issue, Surface; P(R,T|s), P(R,T|u)

Characterizing and Comparing Profiles
- Characterizing an individual entity
  - Mean: expectation
  - Variance: entropy
- Characterizing a group of entities
  - Build a group centroid from its members
  - Variance: divergence among members
- Comparing entities and groups
  - Difference in mean
  - Divergence in profile (distribution)
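
The three quantities named on this slide (expectation, entropy, divergence) can be sketched as follows. This is an illustration only: it assumes reading levels are numbered from 1, uses base-2 logarithms, and adds a small `eps` guard that is not mentioned in the slides. The example profiles are made up.

```python
import math


def expected_level(p_r):
    """Mean reading level E[R], with levels numbered 1..len(p_r)."""
    return sum(level * p for level, p in enumerate(p_r, start=1))


def entropy(p):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in bits; eps (an assumption) guards against zeros in q."""
    return sum(x * math.log2(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


focused = [0.0, 0.1, 0.8, 0.1]      # mass concentrated on one level
diverse = [0.25, 0.25, 0.25, 0.25]  # mass spread evenly

mean_focused = expected_level(focused)
spread_gap = entropy(diverse) - entropy(focused)
distance = kl_divergence(focused, diverse)
```

A focused profile has lower entropy than a diverse one, which is the sense in which entropy plays the role of variance here.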

Characterizing Web Content, User Interests, and Search Behavior

Data Set
- Session log data
  - 2,281,150 URL visits (1,218,433 SERP clicks)
  - Collected from 8,841 users
- Profiles of entities
  - 4,715 websites with 25+ clicked URLs
  - 7,613 users with 25+ URL visits
  - 141,325 unique queries

Reading Level Distribution for Top ODP Categories
- Each topic has a different reading level distribution
Table columns: Category, R1 through R12, E[R|T]
Table rows: Reference, Health, Science, Computers, Business, Society, Adult, Kids and Teens, Games, Recreation, Arts, Home, News, Shopping, Sports

Topic and reading level characterize websites in each category

Profile matching predicts user’s preference over search results
- Metric
  - % of user’s preferences predicted by profile matching, for each clicked website over the skipped website above
- Results
  - By degree of focus in user profile: H(R,T|u)
  - By the distance metric between user and website: KL_R(u,s) / KL_T(u,s) / KL_RLT(u,s)
User Group   #Clicks   KL_R(u,s)   KL_T(u,s)   KL_RLT(u,s)
↑ Focused    5,        %           60.79%      65.27%
             147,      %           54.20%      54.41%
↓ Diverse    197,      %           53.36%      53.63%
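
One way to read the metric above is as pairwise accuracy: a preference pair counts as predicted when the clicked site's profile is closer to the user's profile than the skipped site's. The sketch below assumes that reading, assumes the divergence is taken as KL(user || site), and uses hypothetical function names and toy profiles; none of these details are confirmed by the slides.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) with natural log; eps (an assumption) guards zeros in q."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


def preference_accuracy(user_profile, pairs):
    """Fraction of (clicked, skipped) site-profile pairs where the clicked
    site is closer to the user profile than the skipped site."""
    correct = sum(
        1 for clicked, skipped in pairs
        if kl(user_profile, clicked) < kl(user_profile, skipped)
    )
    return correct / len(pairs)


user = [0.1, 0.7, 0.2]
pairs = [
    ([0.1, 0.6, 0.3], [0.6, 0.2, 0.2]),  # clicked site is closer: predicted
    ([0.5, 0.3, 0.2], [0.2, 0.6, 0.2]),  # skipped site is closer: missed
]
acc = preference_accuracy(user, pairs)  # 0.5 on this toy data
```

Swapping in a reading-level-only, topic-only, or joint profile for `user` and the site profiles would correspond to the three KL columns in the table.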

Users’ Deviation from Their Own Profiles
- Stretch reading
  - Session-level reading level >> long-term reading level
- Casual reading
  - Session-level reading level << long-term reading level
URL title words for stretch reading:
Title word              Log ratio
tests                   2.22
test                    1.99
sample                  1.94
digital                 1.88
(tuition) options       1.87
(financial) aid         1.87
(medication) effects    1.84
education               1.77
URL title words for casual reading:
Title word              Log ratio
best                    -0.42
football                -0.45
store                   -0.46
great (deals)           -0.47
items                   -0.52
new                     -0.53
sale                    -0.61
games                   -0.65
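
The stretch and casual definitions above compare a session's reading level with the user's long-term level. A minimal sketch of that comparison follows; the `margin` threshold standing in for ">>" and "<<" is hypothetical, as are the function name and the "typical" label for sessions that fall in between.

```python
def classify_session(session_level, long_term_level, margin=1.0):
    """Label a session relative to the user's long-term reading level.

    margin is a hypothetical threshold standing in for '>>' and '<<';
    the slides do not say how large the gap must be.
    """
    gap = session_level - long_term_level
    if gap > margin:
        return "stretch"
    if gap < -margin:
        return "casual"
    return "typical"


labels = [
    classify_session(9.0, 6.5),  # well above the long-term level
    classify_session(4.0, 6.5),  # well below the long-term level
    classify_session(6.8, 6.5),  # within the margin
]
```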

Comparing Expert vs. Non-expert URLs
- Expert vs. non-expert URLs taken from [White’09]

Predicting Expert vs. Novice Websites
- Results
  - Baseline (predict most likely class): 65.8%
  - Classifier accuracy: 82.2%
- Features
Feature         Correl. with Expertness   Description
E[R|Qs]         +0.34                     Expectation of surfacing query's RL
E[R|Us]         +0.44                     Expectation of visitor's RL
Div_RLT(U,s)    -0.56                     Distance of visitors’ RLT profile from site's
Div_T(U,s)      -0.55                     Distance of visitors’ topic profile from site's
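
For orientation, the two kinds of numbers on this slide can be computed as sketched below: a majority-class baseline, and a correlation between one feature and a binary expert label. The slides do not name the correlation measure, so Pearson correlation is an assumption here, and the data are invented purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient (ASSUMED measure; not named in the slides)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(labels.count(0), labels.count(1)) / len(labels)


# Made-up toy data: 1 = expert site, 0 = novice site.
is_expert = [1, 1, 0, 0, 0, 1]
visitor_level = [9.0, 8.5, 5.0, 6.0, 5.5, 8.0]  # stands in for E[R|Us]

baseline = majority_baseline(is_expert)
corr = pearson(visitor_level, is_expert)
```

On the toy data the feature correlates positively with the label, mirroring the sign reported for E[R|Us]; the magnitudes are not comparable to the study's.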

Thank you for your attention!
WHAT WE DID:
- Built profiles of reading level and topic (RLT)
- For queries, websites, users and search sessions
- To characterize and compare entities
WHAT WE FOUND:
- Profile matching predicts a user’s content preference
- Profiles can indicate when not to personalize
- Profile features can predict expert content
More at / cs.umass.edu/~jykim

Optional Slides

Correlation between Site vs. Visitor Profiles
- Website reading level vs. visitor diversity
- Breakdown per topic reveals stronger relationship
Table headers: Website Reading Level (row E[R|s]); Visitor Profile Diversity (columns Div_R(U|s), Div_T(U|s), Div_RT(U|s))

Query / User Reading Level against P(Topic)
- User profile shows different trends in Computers