1
Information Analysis in a Dynamic and Data-Rich World
Susan Dumais, Microsoft Research
http://research.microsoft.com/~sdumais
NSA - Nov 4, 2010
2
Change is Everywhere in Information Systems
- New documents appear all the time
- Query volume changes over time
- Document content changes over time
  - So does metadata and extracted info
- User interaction changes over time
  - E.g., anchor text, query-click streams, page visits
- What's relevant to a query changes over time
  - E.g., U.S. Open 2010 (in May vs. Sept)
  - E.g., Hurricane Earl (in Sept 2010 vs. before/after)
Change is pervasive in IR systems ... yet, we're not doing much about it!
3
Information Dynamics
[Timeline figure, 1996-2009: content changes and user visitation/revisitation. Today's browse and search experiences ignore both.]
4
Digital Dynamics
- Easy to capture
- But ... few tools support dynamics
5
Overview
- Characterize change in digital content
  - Content changes over time
  - People re-visit and re-find over time
- Improve retrieval and understanding
  - Data -> Models/Systems -> Decision Analysis
- Examples
  - Desktop: Stuff I've Seen; Memory Landmarks; LifeBrowser
  - News: analysis of novelty (e.g., NewsJunkie)
  - Web: tools for understanding change (e.g., Diff-IE)
  - Web: IR models that leverage dynamics
Examples are from our work on search and browser support ... but the ideas are more general.
6
Stuff I've Seen (SIS) [Dumais et al., SIGIR 2003]
- Many silos of information
- SIS: unified access to distributed, heterogeneous content (mail, files, web, tablet notes, RSS, etc.)
  - Index full content + metadata
  - Fast, flexible search
  - Information re-use
- SIS -> Windows Desktop Search
7
Example Searches
- Looking for: recent email from Fedor that contained a link to his new demo. Initiated from: Start menu. Query: from:Fedor
- Looking for: the pdf of a SIGIR paper on context and ranking (not sure it used those words) that someone (don't remember who) sent me a month ago. Initiated from: Outlook. Query: SIGIR
- Looking for: meeting invite for the last intern handoff. Initiated from: Start menu. Query: intern handoff kind:appointment
- Looking for: C# program I wrote a long time ago. Initiated from: Explorer pane. Query: QCluster*.*
8
Stuff I've Seen: Lessons Learned
- Personal stores: 5–1500k items
- Age of retrieved items: today (5%), last week (21%), last month (47%)
  -> need to support episodic access to memory
- Information needs:
  - People are important: 25% of queries involve names/aliases
  - Date is by far the most common sort order, even for people who had best-match rank as the default
  - Few searches for the "best" matching object; many other criteria (e.g., time, people, type), depending on task
  -> need to support flexible access
- Abstractions are important: "useful" date, people, pictures
- Desktop search != Web search
9
Beyond Stuff I've Seen
- Better support for human memory
  - Memory Landmarks
  - LifeBrowser
  - Phlat
- Beyond search
  - Proactive retrieval: Stuff I Should See (IQ), Temporal Gadget
  - Using the desktop index as a rich "user model": PSearch
  - NewsJunkie
  - Diff-IE
10
Memory Landmarks
- Importance of episodes in human memory
  - Memory organized into episodes (Tulving, 1983)
  - People-specific events as anchors (Smith et al., 1978)
  - Time of events often recalled relative to other events, historical or autobiographical (Huttenlocher & Prohaska, 1997)
- Identify and use landmarks to facilitate search and information management
  - Timeline interface, augmented w/ landmarks
  - Learn Bayesian models to identify memorable events
  - Extensions beyond search, e.g., LifeBrowser
11
Memory Landmarks [Ringel et al., 2003]
[Screenshot: search results on a timeline with memory landmarks, both general (world, calendar) and personal (appts, photos), linked to results by time; distribution of results over time]
12
Memory Landmarks: Learned Models of Memorability
13
LifeBrowser [Horvitz & Koch, 2010]
[Screenshot: timeline combining images & videos, appts & events, desktop & search activity, whiteboard capture, and locations]
14
LifeBrowser: Learned Models of Selective Memory
15
NewsJunkie: Personalized News Using Information Novelty [Gabrilovich et al., WWW 2004]
- Stream of news -> cluster groups of related articles
- Characterize what a user knows
- Compute the novelty of new articles relative to this background (relevant & novel)
  - Novelty = KL divergence(article || current_knowledge) (see the sketch below)
- Novelty score and user preferences guide what to show, when, and how
- Evolution of context over time
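A minimal sketch of the novelty computation above, under my own assumptions about tokenization and smoothing (the NewsJunkie system's exact estimator may differ): score an article by the KL divergence between its word distribution and a background model of what the user has already read.

```python
from collections import Counter
import math

def word_dist(text):
    """Maximum-likelihood unigram distribution over whitespace tokens."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_novelty(article, background, eps=1e-6):
    """KL(article || current_knowledge); eps smooths words unseen in background."""
    p, q = word_dist(article), word_dist(background)
    return sum(pw * math.log(pw / q.get(w, eps)) for w, pw in p.items())

# Toy check: an offshoot story scores higher than a recap of known content.
known = "sars outbreak hong kong hospital quarantine patients sars hospital"
print(kl_novelty("swiss company develops sars vaccine", known))  # high
print(kl_novelty("sars patients quarantine hospital", known))    # low
```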
16
NewsJunkie: Types of Novelty, via Intra-Article Novelty Dynamics
[Plots of novelty score vs. word position for four article types: offshoot (SARS impact on Asian stock markets); on-topic recap; on-topic elaboration (SARS patient's wife held under quarantine); offshoot (Swiss company develops SARS vaccine)]
17
Characterizing Web Change [Adar et al., WSDM 2009]
Large-scale Web crawls, over time:
- Revisited pages: 55,000 pages crawled hourly for 18+ months; unique users, visits/user, time between visits
- Pages returned by a search engine (for ~100k queries): 6 million pages crawled every two days for 6 months
18
Measuring Web Page Change
- Summary metrics: number of changes, amount of change, time between changes
- Change curves: fixed starting point; measure similarity over different time intervals
- Within-page changes
19
Measuring Web Page Change: Summary Metrics
- 33% of all Web pages change; 66% of visited Web pages change, and 63% of those change every hour
- Avg. Dice coefficient = 0.80 (see the sketch below); avg. time between changes = 123 hrs.
- .edu and .gov pages change infrequently, and not by much
- Popular pages change more frequently, but not by much
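For concreteness, a sketch of the Dice coefficient quoted above, computed here over the word sets of two crawls of the same page (the study may have used other units, such as shingles; that detail is my assumption):

```python
def dice(old_text, new_text):
    """Dice coefficient between two snapshots: 1.0 = identical, 0.0 = disjoint."""
    a, b = set(old_text.lower().split()), set(new_text.lower().split())
    if not a and not b:
        return 1.0  # two empty pages count as identical
    return 2 * len(a & b) / (len(a) + len(b))

v1 = "weather forecast for seattle rain likely tuesday"
v2 = "weather forecast for seattle sun expected wednesday"
print(dice(v1, v2))  # ~0.57: the page changed moderately between crawls
```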
20
Measuring Web Page Change: Change Curves
- Fixed starting point; measure similarity over different time intervals
[Plot: similarity vs. time from starting point, with a knot point where similarity levels off]
21
Measuring Within-Page Change
- DOM-level changes
- Term-level changes
  - Divergence from norm (e.g., cookbooks, salads, cheese, ingredient, bbq, ...)
  - "Staying power" of terms in the page (see the sketch below)
[Plot: term presence over time, Sep.-Dec.]
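A hypothetical illustration of "staying power" (my own simplification, not the paper's measure): given a sequence of crawls of one page, score each term by the fraction of snapshots it appears in. Values near 1.0 mark stable page content; values near 0.0 mark transient terms.

```python
from collections import Counter

def staying_power(snapshots):
    """snapshots: page texts over time -> {term: fraction of snapshots with it}."""
    seen = Counter()
    for text in snapshots:
        seen.update(set(text.lower().split()))
    return {term: n / len(snapshots) for term, n in seen.items()}

crawls = [
    "cookbooks salads cheese ingredient of the day bbq",
    "cookbooks salads cheese ingredient of the day soup",
    "cookbooks salads cheese ingredient of the day stew",
]
for term, sp in sorted(staying_power(crawls).items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {sp:.2f}")  # cookbooks 1.00 ... bbq 0.33
```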
22
Example Term Longevity Graphs
23
Revisitation on the Web [Adar et al., CHI 2009]
- What was the last Web page you visited? Why did you visit (re-visit) the page?
- Revisitation patterns
  - Log analyses: toolbar logs for revisitation; query logs for re-finding
  - User survey to understand intent in revisitations
24
Measuring Revisitation
- 60-80% of the Web pages you visit, you've visited before
- Many motivations for revisits
- Summary metrics: unique visitors, visits/user, time between visits
- Revisitation curves: histogram of revisit intervals, normalized (see the sketch below)
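A sketch of a revisitation curve under an assumed log format of (user, url, timestamp-in-hours) triples (the format and bin edges are mine): a normalized histogram of gaps between successive visits to the same page.

```python
from collections import defaultdict

def revisit_curve(visits, edges=(1, 24, 24 * 7, 24 * 30)):
    """visits: (user, url, hour) triples -> normalized histogram of revisit gaps,
    binned as <1h, <1d, <1wk, <1mo, and longer."""
    times = defaultdict(list)
    for user, url, t in visits:
        times[(user, url)].append(t)
    bins = [0] * (len(edges) + 1)
    total = 0
    for ts in times.values():
        ts.sort()
        for gap in (b - a for a, b in zip(ts, ts[1:])):
            bins[sum(gap >= e for e in edges)] += 1
            total += 1
    return [n / total for n in bins] if total else bins

log = [("u1", "mail.example.com", t) for t in (0, 0.5, 1.5, 30, 400)]
print(revisit_curve(log))  # mass in the early bins suggests a "fast" pattern
```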
25
Four Revisitation Patterns
- Fast: hub-and-spoke; navigation within site
- Hybrid: high-quality fast pages
- Medium: popular homepages; mail and Web applications
- Slow: entry pages, bank pages; accessed via search engine
26
Revisitation and Search (Re-Finding) [Teevan et al., SIGIR 2007; Tyler et al., WSDM 2010]
- Repeat query (33%), e.g., Q: microsoft research
- Repeat click (39%), e.g., http://research.microsoft.com clicked for queries like "microsoft research", "msr", ...
- Big opportunity (43%)
- 24% are "navigational revisits"
27
Building Support for Web Dynamics
[Timeline figure, 1996-2009: content changes (Diff-IE) and user visitation/revisitation (Temporal IR)]
28
Diff-IE [Teevan et al., UIST 2009; Teevan et al., CHI 2010]
- Diff-IE toolbar highlights changes to a page since your last visit
29
Interesting Features of Diff-IE
- Always on
- In situ
- New to you
- Non-intrusive
30
Examples of Diff-IE in Action
31
Expected New Content
32
Monitor
33
Unexpected Important Content
34
Serendipitous Encounters
35
Understand Page Dynamics
36
Types of Diff-IE Use, from Expected to Unexpected
[Diagram: monitor; expected new content; attend to activity; edit; understand page dynamics; serendipitous encounter; unexpected important content; unexpected unimportant content]
37
Studying Diff-IE
- Feedback buttons
- Survey: prior to installation, and after a month of use
- Logging: URLs visited, amount of change when revisited
- Experience interviews
Study design: in situ, representative, experience, longitudinal
38
People Revisit More
- Perception of revisitation remains constant
  - "How often do you revisit?" "How often are revisits to view new content?"
- Actual revisitation increases
  - First week: 39.4% of visits are revisits
  - Last week: 45.0% of visits are revisits (a 14% relative increase)
- Why are people revisiting more with Diff-IE?
39
Revisited Pages Change More
- Perception of change increases
  - "What proportion of pages change regularly?" "How often do you notice unexpected change?" [Chart callouts: 51+%, 17%, 8%]
- Amount of change seen increases
  - First week: 21.5% of visits were revisits, changed by 6.2%
  - Last week: 32.4% were revisits, changed by 9.5%
- Diff-IE is driving visits to changed pages, and it supports people in understanding change
40
Other Examples of Dynamics and User Experience
- Content changes
  - Diff-IE
  - Zoetrope (Adar et al., 2008)
  - Diffamation (Chevalier et al., 2010)
  - Temporal summaries and snippets ...
- Interaction changes
  - Explicit annotations, ratings, wikis, etc.
  - Implicit interest via interaction patterns
  - Edit wear and read wear (Hill et al., 1992)
41
Leveraging Dynamics for Retrieval
[Timeline figure, 1996-2009: content changes and user visitation/revisitation -> Temporal IR]
42
Temporal Retrieval Models [Elsas et al., WSDM 2010]
- Current IR algorithms look at only a single snapshot of a page
- But Web pages change over time; can we leverage this to improve retrieval?
- Pages have different rates of change
  - Different document priors (using change vs. link structure): the change prior
- Terms have different longevity ("staying power"): some are always on the page, some are transient
- Language modeling approach to ranking
43
Relevance and Page Change
- Page change is related to human relevance judgments
  - Judgments on a 5-point scale: Perfect/Excellent/Good/Fair/Bad
  - Rate of change: 60% for Perfect pages vs. 30% for Bad pages
- Use change rate as a document prior (vs. priors based on link structure like PageRank); a sketch follows
- Shingle prints to measure change
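A hedged sketch of how a change-rate document prior could be combined with query likelihood (my own simplification; the paper's estimators for the prior and for smoothing differ). Change is measured here with word shingles, echoing the shingle prints above; the vocabulary size and weights are assumptions.

```python
import math
from collections import Counter

def shingles(text, k=4):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def change_rate(snapshots, k=4):
    """Mean fraction of shingles replaced between successive crawls of a page."""
    pairs = list(zip(snapshots, snapshots[1:]))
    return sum(
        1 - len(shingles(a, k) & shingles(b, k)) / max(len(shingles(a, k) | shingles(b, k)), 1)
        for a, b in pairs
    ) / max(len(pairs), 1)

def score(query, doc, snapshots, vocab=10_000, mu=100, alpha=1.0):
    """log P(q|d) with Dirichlet smoothing, plus alpha * log change-rate prior."""
    tf, n = Counter(doc.lower().split()), len(doc.split())
    log_lik = sum(math.log((tf[w] + mu / vocab) / (n + mu)) for w in query.lower().split())
    return log_lik + alpha * math.log(change_rate(snapshots) + 1e-6)

old = "intern handoff schedule posted for spring interns meeting room details"
new = "intern handoff schedule posted for fall interns meeting room details updated"
print(score("intern handoff", new, [old, new]))
```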
44
Relevance and Term Change
- Term patterns vary over time
- Represent a document as a mixture of terms with different "staying power": long, medium, short (see the sketch below)
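A simplified, assumption-laden sketch of the mixture idea (not the WSDM 2010 formulation; the thresholds and weights are hypothetical): partition a page's terms into long/medium/short staying-power classes and score P(w|d) as a weighted mixture of per-class unigram models.

```python
from collections import Counter

def longevity_models(snapshots, hi=0.8, lo=0.3):
    """Build one unigram model per staying-power class from the latest crawl."""
    n = len(snapshots)
    appears = Counter()
    for text in snapshots:
        appears.update(set(text.lower().split()))
    latest = Counter(snapshots[-1].lower().split())
    models = {"long": Counter(), "medium": Counter(), "short": Counter()}
    for w, c in latest.items():
        sp = appears[w] / n  # fraction of snapshots containing w
        cls = "long" if sp >= hi else "medium" if sp >= lo else "short"
        models[cls][w] = c
    return {cls: {w: c / max(sum(m.values()), 1) for w, c in m.items()}
            for cls, m in models.items()}

def p_mixture(w, models, weights={"long": 0.6, "medium": 0.3, "short": 0.1}):
    """P(w|d) as a weighted mixture over longevity classes."""
    return sum(weights[c] * models[c].get(w, 0.0) for c in weights)
```

Weighting long-lived terms more heavily rewards pages whose stable content matches the query, rather than content that happened to be on the page at crawl time.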
45
Evaluation: Queries & Documents
- 18K queries, 2.5M judged documents
  - 5-level relevance judgments (Perfect ... Bad)
  - 2.5M documents crawled weekly for 10 weeks
- Navigational queries
  - 2K queries identified with a "Perfect" judgment
  - Assume these relevance judgments are consistent over time
46
Experimental Results
[Chart: retrieval effectiveness for four rankers: baseline static model, dynamic model, dynamic model + change prior, and change prior alone]
47
Temporal Retrieval: Ongoing Work [Kulkarni et al., WSDM 2011]
- Initial evaluation/model focused on navigational queries and assumed their relevance is "static" over time
- But there are many other cases ...
  - E.g., US Open 2010 (in June vs. Sept)
  - E.g., World Cup results (in 2010 vs. 2006)
- Ongoing evaluation: collecting explicit relevance judgments, interaction data, page content, and query frequency over time
48
Relevance over Time
[Plot: normalized relevance judgments over time (in weeks) for the query "march madness" (Mar 15 – Apr 4, 2010), with judgments collected Mar 25, 2010 and May 28, 2010]
49
Other Examples of Dynamics in IR Models/Systems
- Temporal retrieval models: Elsas & Dumais (2010); Liu & Croft (2004); Efron (2010); Aji et al. (2010)
- Document dynamics for crawling and indexing: efficient crawling and updates
- Query dynamics: Jones & Diaz (2004); Kulkarni et al. (2011); Kotov et al. (2010)
- Integration of news into Web results: Diaz (2009)
- Extraction of temporal entities within documents
- Protocols for retrieving versions over time, e.g., Memento (Van de Sompel et al., 2010)
50
Summary
- Characterize change in information systems
  - Content changes over time
  - People re-visit and re-find over time
- Develop new algorithms and systems to improve retrieval and understanding of dynamic collections
  - Desktop: Stuff I've Seen; Memory Landmarks; LifeBrowser
  - News: analysis of novelty (e.g., NewsJunkie)
  - Web: tools for understanding change (e.g., Diff-IE)
  - Web: retrieval models that leverage dynamics
51
Key Themes
- Dynamics are pervasive in information systems
- Data-driven approaches are increasingly important
  - Data -> Models/Systems -> Decision Analysis
  - Large volumes of data are available; data scarcity is being replaced by data surfeit
  - Data exhaust is a critical resource in the internet economy, and can be leveraged more broadly
- Machine learning and probabilistic models are replacing hand-coded rules
- Broadly applicable: search ... plus machine translation, spam filtering, traffic flow, evidence-based health care, skill assessment, etc.
52
Thank You! Questions/Comments ...
More info: http://research.microsoft.com/~sdumais
Diff-IE (coming soon): http://research.microsoft.com/en-us/projects/diffie/default.aspx