Published by Lester Turner. Modified over 9 years ago.
1
2015 Hany SalahEldeen Dissertation Defense
Detecting, Modeling, & Predicting User Temporal Intention in Social Media
Hany M. SalahEldeen
Doctor of Philosophy Dissertation Defense
Old Dominion University, Department of Computer Science
Advisor: Dr. Michael L. Nelson
Committee: Dr. Michele C. Weigle, Dr. Hussein M. Abdel-Wahab, Dr. M’Hammed Abdous
May 5th, 2015
2
2015 Hany SalahEldeen Dissertation Defense2 All tweets are equal… …but some are more equal than the others
3
It is imperative to know…
1. How long would these last?
2. And if lost, is there a backup somewhere?
3. Is this what the author intended?
4
2015 Hany SalahEldeen Dissertation Defense4 To maintain historical integrity Since tweets are considered the first draft of history… the historical integrity of the tweets could be compromised.
5
Motivation | Background | Related Research | Research Question | User-Time-Shared Resource | Conclusions
6
People rely on social media for the most up-to-date information
7
Social media is more than kitty photos
Marie Colvin (January 12, 1956 – February 22, 2012)
Rémi Ochlik (16 October 1983 – 22 February 2012)
Ahmed Assem (1987 – July 8, 2013)
8
For the web is dark, and full of missing content…
3 out of 8 external links on Rémi’s Wikipedia page return 404 (accessed in July 2014)
9
2015 Hany SalahEldeen Dissertation Defense9 even for content shared in social media Accessed in July 2014
10
2015 Hany SalahEldeen Dissertation Defense10 News sites are also prone to change Accessed in July 2014
11
2015 Hany SalahEldeen Dissertation Defense11 So are specialized sites Accessed in July 2014
12
2015 Hany SalahEldeen Dissertation Defense12 Research Problem: Author’s Intention ≠ Reader’s Experience
13
2015 Hany SalahEldeen Dissertation Defense13 Research Implication Author’s Intention ≠ Reader’s Experience Broken Inconsistent Web and Historical Records
14
Motivation | Background | Related Research | Research Question | User-Time-Shared Resource | Conclusions
15
2015 Hany SalahEldeen Dissertation Defense15 Social Post
16
The anatomy of a tweet
Figure labels: Author’s username; Other user mention; Tweet body; Hashtag; Shortened URL to resource; Publishing timestamp; Social post; Shared resource; Interaction options
17
2015 Hany SalahEldeen Dissertation Defense17 3 URIs = 3 Chances to fail
18
URL shortening and aliasing

curl -L -I http://bit.ly/losing_revolution

HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Mon, 07 Jul 2014 18:19:48 GMT
Cache-Control: private; max-age=90
Location: http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
Mime-Version: 1.0
Set-Cookie: _bit=53bae4c4-00328-04f10-cb1cf10a;domain=.bit.ly;expires=Sat Jan 3 18:19:48 2015;path=/; HttpOnly
Content-Type: text/html;charset=utf-8
Content-Length: 167

HTTP/1.1 200 OK
Expires: Mon, 07 Jul 2014 18:19:52 GMT
Date: Mon, 07 Jul 2014 18:19:52 GMT
Cache-Control: private, max-age=0
Last-Modified: Mon, 07 Jul 2014 18:19:07 GMT
ETag: "e3555826-b103-4daa-a3f2-d0509ebab51f"
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Server: GSE
Alternate-Protocol: 80:quic
Content-Type: text/html;charset=UTF-8
Content-Length: 0
19
2015 Hany SalahEldeen Dissertation Defense19 Life cycle of a social post
20
2015 Hany SalahEldeen Dissertation Defense20 Life cycle of a social post tweets
21
2015 Hany SalahEldeen Dissertation Defense21 Life cycle of a social post tweets Links to
22
2015 Hany SalahEldeen Dissertation Defense22 Life cycle of a social post tweets What the reader receives Links to Same state the author intended
23
2015 Hany SalahEldeen Dissertation Defense23 Life cycle of a social post tweets What the reader receives Links to Same state the author intended Ideally!
24
2015 Hany SalahEldeen Dissertation Defense24 Life cycle of a social post tweets What the reader receives Links to Same state the author intended After a period of time
25
2015 Hany SalahEldeen Dissertation Defense25 Life cycle of a social post tweets What the reader receives Links to Same state the author intended The resource has disappeared After a period of time
26
2015 Hany SalahEldeen Dissertation Defense26 Life cycle of a social post tweets What the reader receives Links to Same state the author intended The resource has disappeared The resource has changed After a period of time
27
Memento framework
* http://mementoweb.org/guide/rfc/
28
Motivation | Background | Related Research | Research Question | User-Time-Shared Resource | Conclusions
29
Related Work
Social media analysis: Understanding Microblogging: Zhao 2009, Yang 2010, Newman 2003, Kwak 2010, Java 2007, Cha 2009
History Narration: Vieweg 2010, Starbird 2010-2012, Qu 2011, Neubig 2011, Lehman and Lalmas 2012-2013
User’s Web Search Intention: Ashkan 2009, Lee 2005, Loser 2008, Azzopardi 2009, Baeza-Yates 2006, Dai 2011
Commercial Intention: Guo 2010, Benczur 2007
Sentiment Analysis: Mishne 2006, Bollen 2011
Access to Archives: Van de Sompel 2009
Persistence of shared resources: Nelson 2002, Sanderson 2011, McCown 2007
URL Shortening: Antoniades 2011
Tweeting, Micro-blogging and Popularity: Wu 2011, Java 2007, Kwak 2010
Social Networks Growth and Evolution: Meeder 2011
Further details: refer to chapter 3
30
Motivation | Background | Related Research | Research Question | User-Time-Shared Resource | Conclusions
31
2015 Hany SalahEldeen Dissertation Defense31 Research Question: Can we estimate the users’ intention at the time of posting and reading to predict and maintain temporal consistency?
32
Research Goals
Detect the temporal intention of:
1. The author at sharing time
2. The reader at dereferencing time
Further details: refer to chapter 6
Model this intention as a function of time, the nature of the resource, and its context.
Further details: refer to chapter 7
Predict how resources change with time and the intention behind sharing them to minimize inconsistency.
Further details: refer to chapter 8
Implement the prediction model to automatically preserve vulnerable social content that is prone to change or loss and provide a smooth temporal navigation of the social web.
Further details: refer to chapter 9
33
Motivation | Background | Related Research | Research Question | User-Time-Shared Resource | Conclusions
34
Our analysis covers three angles: Shared Resource, Time, User
35
Shared Resource | Time | User
Loss and Persistence of Shared Resources
36
Shared Resource | Time | User
Alive
First: Estimate social media content loss
37
Six socially significant events
Event                        Source                     Year
Iranian Election             SNAP Dataset               2009
H1N1 Virus Outbreak          SNAP Dataset               2009
Michael Jackson’s Death      SNAP Dataset               2009
Obama’s Nobel Peace Prize    SNAP Dataset               2009
The Egyptian Revolution      Twitter, Websites, Books   2011
The Syrian Uprising          Twitter API                2012
38
2015 Hany SalahEldeen Dissertation Defense38 Twitter tag expansion and filtration
39
2015 Hany SalahEldeen Dissertation Defense39 Twitter tag expansion increases precision
40
2015 Hany SalahEldeen Dissertation Defense40 What are people sharing?
41
Existence on the live web and in the archives
For each unique URL we resolved the final HTTP response and considered 2 classes:
Success: 200 OK
Failure: 4XX and 50X families, 30X redirect loops, or soft 404s
We utilize the Memento aggregator:
Archived: the resource has at least one memento in the TimeMap
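The two-class rule on this slide can be sketched as a pair of small functions. This is a minimal illustration, not the code used in the study: the function names are hypothetical, redirect-loop and soft-404 detection are assumed to happen elsewhere and arrive as flags, and treating every non-200 final status as a failure is an assumption (the slide only names the 4XX, 50X, and looping 30X cases).

```python
def classify_live(final_status: int,
                  redirect_loop: bool = False,
                  soft_404: bool = False) -> str:
    """Classify the final HTTP response of a resolved URL."""
    # Redirect loops and soft 404s count as failures regardless of status.
    if redirect_loop or soft_404:
        return "failure"
    # Only a final 200 OK counts as success (assumption: every other
    # final status is treated as a failure).
    return "success" if final_status == 200 else "failure"


def is_archived(memento_count: int) -> bool:
    """A resource is 'archived' if its TimeMap lists at least one memento."""
    return memento_count >= 1
```

For example, `classify_live(404)` yields `"failure"`, and a page that answers 200 but is flagged as a soft 404 is also a failure.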
42
Resources Missing and Archived
Collection         Percentage Missing   Percentage Archived
H1N1 Outbreak      23.49%               41.65%
Michael Jackson    36.24%               39.45%
Iran               26.98%               43.08%
Obama              24.59%               47.87%
Egypt              10.48%               20.18%
Syria              7.04%                5.35%
(label missing)    31.62%               30.78%
(label missing)    24.47%               36.26%
(label missing)    25.64%               43.87%
(label missing)    26.15%               46.15%
43
Shared Resource | Time | User
Alive | Missing
Second: Can we measure existence and disappearance as a function of time?
45
2015 Hany SalahEldeen Dissertation Defense45 Timeline of Events
46
2015 Hany SalahEldeen Dissertation Defense46 Timeline of Events
47
2015 Hany SalahEldeen Dissertation Defense47 Social Events Having a Bimodal Time Distribution
48
2015 Hany SalahEldeen Dissertation Defense48 Timeline of Events
49
2015 Hany SalahEldeen Dissertation Defense49 Social Events Having a Bimodal Time Distribution
50
2015 Hany SalahEldeen Dissertation Defense50 Existence as a function of time
51
2015 Hany SalahEldeen Dissertation Defense51 Existence as a function of time
52
Conclusion: Existence could be estimated as a function of time
Results:
Measured 21,625 resources from 6 data sets in the archives & live web.
A year after publishing, about 11% of content shared on social media will be gone.
After that we lose roughly 0.02% daily.
Publications and Articles:
1. H. M. SalahEldeen. Losing My Revolution: A year after the Egyptian Revolution, 10% of the social media documentation is gone. http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html, 2012.
2. H. M. SalahEldeen and M. L. Nelson. Losing my revolution: how many resources shared on social media have been lost? In Proceedings of the Second international conference on Theory and Practice of Digital Libraries, TPDL'12, 2012.
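The two figures on this slide (about 11% lost after one year, then roughly 0.02% more per day) can be read as a simple linear estimate. The sketch below is only that reading of the slide, not the fitted model from the publication; the function name is hypothetical, and the estimate is assumed to apply only from day 365 onward.

```python
def estimated_loss_percent(days_since_shared: int) -> float:
    """Rough share of lost resources (in percent), per the slide's two figures."""
    if days_since_shared < 365:
        # The slide gives no figures for the first year, so none are assumed here.
        raise ValueError("estimate only defined from one year after sharing")
    # ~11% lost after the first year, then ~0.02% per additional day.
    return 11.0 + 0.02 * (days_since_shared - 365)
```

Under this reading, a resource set shared two years ago (730 days) would have lost about 18.3% of its content.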
53
Revisiting Existence after a year
Collections listed on the slide: MJ, Iran, H1N1, Obama, Egypt, Syria (each row below has ten values; the column-to-collection alignment is not recoverable from this text)

Missing
Measured:  37.10%, 37.50%, 28.17%, 30.56%, 26.29%, 31.62%, 32.47%, 24.64%, 7.55%, 12.68%
Predicted: 31.72%, 31.42%, 31.96%, 30.98%, 30.16%, 29.68%, 29.60%, 28.36%, 19.80%, 11.54%
Error:     5.38%, 6.08%, 3.79%, 0.42%, 3.87%, 1.94%, 2.87%, 3.72%, 12.25%, 1.14%
Average Prediction Error = 4.15%; in all cases, our missing predictions were acceptable

Archived
Measured:  48.61%, 40.32%, 60.80%, 55.04%, 47.97%, 52.14%, 48.38%, 40.58%, 23.73%, 0.56%
Predicted: 61.78%, 61.18%, 62.26%, 60.30%, 58.66%, 57.70%, 57.54%, 55.06%, 37.94%, 21.42%
Error:     13.17%, 20.86%, 1.46%, 5.26%, 10.69%, 5.56%, 9.16%, 14.48%, 14.21%, 20.86%
Average Prediction Error = 11.57%; in all cases, our archival predictions were too optimistic
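The average prediction errors quoted on this slide can be re-derived from its measured and predicted percentages. The sketch below assumes the error is the mean absolute difference between the two rows; the variable and function names are illustrative.

```python
# Values copied from the slide, in the order they appear (percent).
missing_measured = [37.10, 37.50, 28.17, 30.56, 26.29, 31.62, 32.47, 24.64, 7.55, 12.68]
missing_predicted = [31.72, 31.42, 31.96, 30.98, 30.16, 29.68, 29.60, 28.36, 19.80, 11.54]
archived_measured = [48.61, 40.32, 60.80, 55.04, 47.97, 52.14, 48.38, 40.58, 23.73, 0.56]
archived_predicted = [61.78, 61.18, 62.26, 60.30, 58.66, 57.70, 57.54, 55.06, 37.94, 21.42]


def mean_absolute_error(measured, predicted):
    """Average of |measured - predicted| over paired values."""
    errors = [abs(m - p) for m, p in zip(measured, predicted)]
    return sum(errors) / len(errors)


print(round(mean_absolute_error(missing_measured, missing_predicted), 2))    # 4.15
print(round(mean_absolute_error(archived_measured, archived_predicted), 2))  # 11.57
```

Every archived prediction exceeds its measured value, which matches the slide's remark that the archival predictions were too optimistic.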
54
Shared Resource | Time | User
Alive | Missing | Replaced
Third: Can we use social context to find replacements of missing resources?
55
Context discovery and shared resource replacement
Problem: The 140-character limit restricts the description of the linked resource. If the resource goes missing, can we get the next best thing?
Solution: Shared links typically have several tweets, responses, and retweets. We can mine these traces for context and viable replacements.
56
2015 Hany SalahEldeen Dissertation Defense56 Context Discovery Linking to: http://beta.18daysinegypt.com/
57
2015 Hany SalahEldeen Dissertation Defense57 What if the resource disappeared? Linking to: http://beta.18daysinegypt.com/
58
2015 Hany SalahEldeen Dissertation Defense58 Use Topsy to discover tweets sharing the same link
59
Social Context Extraction
{
  "URI": "http://beta.18daysinegypt.com/",
  "Related Tweet Count": 500,
  "Related Hashtags": "#tran #citizensx #arabspring #visualstorytelling #collaborativerevolution #feb11http://t.co/qxusp70...",
  "Users who talked about this": "@petra_stienen: @waleedrashed: @omarsamra @ungormite: @dcisbusy @webdocumentario:...",
  "All associated unique links:": "http://t.co/63X1f3f1 http://t.co/reBh6c4V http://t.co/B3GuhQN4 http://t.co/X2sjf4Rf http://t.co/P9iR28fH http://t.co/1C4EPh8h...",
  "All other links associated:": "http://vimeo.com/35368376 http://mashable.com/2012/01/21/18daysinegypt-2/ ",
  "Most frequent link appearing:": "http://t.co/2ke0rEjP",
  "Number of times the Most frequent link appearing:": 49,
  "Most frequent tweet posted and reposted:": "Check out 18DaysInEgypt - A crowd sourced documentary project ================= via @18daysinegypt",
  "Number of times the Most frequent tweet appearing:": 46,
  "The longest common phrase appearing:": "RT 2ke0rEjP is an interactive documentary website that YOU can help create Get your Jan25 stories ready! Pl RT",
  "Number of times the Most common phrase appearing:": 18
}
60
Build a Tweet Document
A tweet document represents the concatenation of all extracted tweets:
“do you have a story to tell about your 18 days of revolution? share it or contact sara 18days brand new interactive storytelling project on egyptian revolution a very creative platform to tell your story daysinegypt marches heading to tahrir square now from all over cairo it's all over again use the website to document your revolutionary stories and share them with the world! check out awesome documentary project crowdsourcing a people's narrative of the egyptian revolution …”
61
Tweet Signature
Tweet Document:
“do you have a story to tell about your 18 days of revolution? share it or contact sara 18days brand new interactive storytelling project on egyptian revolution a very creative platform to tell your story daysinegypt marches heading to tahrir square now from all over cairo it's all over again use the website to document your revolutionary stories and share them with the world! check out awesome documentary project crowdsourcing a people's narrative of the egyptian revolution …”
Tweet Signature = top 5 most frequent terms from the Tweet Document:
documentary, project, daysinegypt, check, sourced
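Extracting a tweet signature as the top-k most frequent terms can be sketched as below. The slide does not say how terms were tokenized or which stopwords were dropped, so the regular expression, the tiny stopword list, and the function name are all assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; the slide does not give one).
STOPWORDS = {"a", "about", "all", "and", "from", "it", "of", "on", "or",
             "out", "the", "them", "to", "with", "you", "your"}


def tweet_signature(tweet_document: str, k: int = 5) -> list:
    """Return the k most frequent non-stopword terms in the tweet document."""
    terms = [t for t in re.findall(r"[a-z0-9']+", tweet_document.lower())
             if t not in STOPWORDS]
    return [term for term, _ in Counter(terms).most_common(k)]
```

The resulting terms are then joined into a search-engine query, as the following slides describe.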
62
2015 Hany SalahEldeen Dissertation Defense62 Query Google with the Tweet Signature
63
2015 Hany SalahEldeen Dissertation Defense63 Search Engine Results The original resource
64
2015 Hany SalahEldeen Dissertation Defense64 Search Engine Results The original resource The others are good replacement candidates
65
Recommendation Evaluation
We extract a dataset of resources that are currently available:
Pretend these resources no longer exist (for a baseline)
Each resource is text-based
Each resource has at least 30 retrievable tweets
Extracted 731 unique resources
We use a boilerplate removal library to remove the template from:
the linked resources
the top 10 retrieved results from Google
We use cosine similarity to compare the documents
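Comparing a candidate replacement with the original page by cosine similarity can be sketched as follows. This version uses raw term-frequency vectors; the slide does not say which term weighting was used, so that choice, the tokenizer, and the function names are assumptions.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Bag-of-words term-frequency vector (lowercased alphanumeric tokens)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between the two documents' term vectors."""
    va, vb = term_vector(doc_a), term_vector(doc_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    # An empty document has a zero vector; define its similarity as 0.
    return dot / norm if norm else 0.0
```

A candidate would then count as a viable replacement when its similarity to the original reaches the 70% threshold used in the evaluation.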
66
Similarity measures in resource replacement
(70% similarity threshold)
In 41% of the cases we found a replacement with >= 70% similarity
67
Conclusion: We can find viable replacements for missing shared resources
Results:
In 41% of the test cases we can find a replacement page with at least 70% similarity to the original missing resource
The search results provide a mean reciprocal rank of 0.43
Publications:
1. H. SalahEldeen and M. L. Nelson. Resurrecting my revolution: Using social link neighborhood in bringing context to the disappearing web. In Research and Advanced Technology for Digital Libraries - International Conference on Theory and Practice of Digital Libraries, TPDL 2013, 2013.
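Mean reciprocal rank averages 1/rank of the first correct result over all queries. The sketch below shows the standard definition; the function name and the convention of scoring a query with no correct result as 0 are assumptions, since the slide reports only the final value.

```python
def mean_reciprocal_rank(first_hit_ranks):
    """MRR over queries; each entry is the 1-based rank of the first
    correct result, or None when no correct result was returned."""
    if not first_hit_ranks:
        raise ValueError("need at least one query")
    reciprocal = [1.0 / rank if rank else 0.0 for rank in first_hit_ranks]
    return sum(reciprocal) / len(reciprocal)


# Four hypothetical queries: hits at rank 1, 2, none, and 4.
print(mean_reciprocal_rank([1, 2, None, 4]))  # 0.4375
```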
68
Now that we have finished analyzing the shared resource… what’s next?
69
Shared Resource | Time | User
Alive | Missing | Replaced
Footprints on the web
70
The tweet, the resource… and time
Timeline labels: Posted a tweet; Read the tweet; Another tweet posted; And another …
The relevancy of the resource to the tweet changes through time, and we need to measure that.
We need to measure tweet relevance through time
71
Shared Resource | Time | User
Alive | Missing | Replaced | Rate of Change
Longitudinal Study: Rate of change of shared content
72
2015 Hany SalahEldeen Dissertation Defense72 Pilot 1: Resource change in the first 80 hours after tweeting
73
2015 Hany SalahEldeen Dissertation Defense73 Pilot 2: Delta days from Bitly creation for just tweeted content Dataset size = 4,000
74
Pilot 3: Dataset of 1,000 freshly created Bitlys
http://www.cnn.com (depth = 0)
http://www.cnn.com/world (depth = 1)
http://www.cnn.com/2009/SHOWBIZ/Music/06/25/jackson (depth = 6)
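The depth values in these examples are consistent with counting the non-empty path segments of the URI. A minimal sketch under that assumption (the function name is illustrative, and query strings or fragments are ignored):

```python
from urllib.parse import urlparse


def uri_depth(uri: str) -> int:
    """Number of non-empty path segments in the URI."""
    path = urlparse(uri).path
    return len([segment for segment in path.split("/") if segment])


print(uri_depth("http://www.cnn.com"))                                   # 0
print(uri_depth("http://www.cnn.com/world"))                             # 1
print(uri_depth("http://www.cnn.com/2009/SHOWBIZ/Music/06/25/jackson"))  # 6
```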
75
2015 Hany SalahEldeen Dissertation Defense75 What domains do users link to?
76
2015 Hany SalahEldeen Dissertation Defense76 What categories* do users link to? * Extracted from Alexa.com
77
Summation of Intention in Social Content Through Time
Longitudinal study: We record the change over an extended period of time:
Content: we download a snapshot of the resource every 45 minutes
Metadata: we collect metadata about the resource:
Facebook likes, posts
Tweets in the last hour
Bitly clicklogs and shares
Average data size: ~1 TB per month
78
2015 Hany SalahEldeen Dissertation Defense78 Hourly analysis over an extended period of time
79
There is a difference between t_tweet and t_click
After just one hour, 4% of the resources have changed by 30%.
After six hours, the percentage doubles: 8% of the resources have changed by 40%.
After a day the change rate slows: 12% of the resources have changed by 40%.
After that it almost stabilizes at 17% of the resources changed by 40%.
80
Shared Resource | Time | User
Alive | Missing | Replaced | Rate of Change | Archive & Creation
First: Resource – Time – Public Archives
81
Revisited: Resources Missing and Archived (the table from slide 42)
82
2015 Hany SalahEldeen Dissertation Defense82 But on a more general notion we want to know…
83
How much of the web is archived?
Goal: Estimate how much of the public web is present in the public archives and how many copies are available.
Action: Get 4 different datasets from 4 different sources:
Search engine indices
Bit.ly
DMOZ
Delicious
85
Conclusion: It depends on the source
Results (2013 values shown on the slide): 95%, 92%, 23%, 26%
Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period; 15 public web archives
Publication: S. G. Ainsworth, A. Alsum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. How much of the web is archived? In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL '11, pages 133-136, New York, NY, USA, 2011. ACM.
86
Side Experiment: Analyzing the quality of the archives and the archived content
Goal: Assessing the quality of the web archives
Better discussed in Justin Brunelle’s work
Publications:
1. J. F. Brunelle, M. Kelly, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources. In Proceedings of the 2014 IEEE/ACM Joint Conference on Digital Libraries (JCDL), 2014 (Best student paper award)
87
2015 Hany SalahEldeen Dissertation Defense87 A question emerged: When did a certain resource first appear on the web?
88
Shared Resource | Time | User
Alive | Missing | Replaced | Rate of Change | Archive & Creation
Second: When was the resource created?
89
2015 Hany SalahEldeen Dissertation Defense89 Idea Web pages leave trails as well since the day they were created…
90
Web Resource: Web trails
A web page could leave a trail of any of the following, denoting its existence:
References
Links (anchors)
Social media likes and interactions
URL shortening
Backlinks
The creation date of any of the associated events/trails could be an estimate of the creation date.
91
2015 Hany SalahEldeen Dissertation Defense91 Resource’s timeline
92
Observations Recorded
1. Last modified date from the response header.
2. First appearance of a backlink.
3. First tweet published.
4. First Bitly shortened URL created.
5. Timestamp of the first memento in the archives.
6. Date of the last crawl by the search engine.
93
2015 Hany SalahEldeen Dissertation Defense93 Carbon Date service
94
Carbon Dating API
{
  "self": "http://cd.cs.odu.edu/cd?url=http://www.cnn.com",
  "URI": "http://www.cnn.com",
  "Estimated Creation Date": "1998-12-06T04:02:33",
  "Last Modified": "",
  "Bitly.com": "2008-06-08T12:00:00",
  "Topsy.com": "2015-01-25T23:31:42",
  "Backlinks": "2003-03-12T05:35:44",
  "Google.com": "2005-01-11T00:00:00",
  "Archives": [
    ["Earliest", "1998-12-06T04:02:33"],
    ["By_Archive", {
      "http://archive.today/20000815052826/http://www.cnn.com/": "2000-08-15T05:28:26",
      "http://arquivo.pt/wayback/wayback/20000815052826/http://www.cnn.com/": "2000-08-15T05:28:26",
      "http://wayback.vefsafn.is/wayback/20011106102722/http://www.cnn.com/": "1998-12-06T04:02:33",
      "http://web.archive.org/web/20131218180509/http://www.cnn.com/": "2013-12-18T18:05:09"
    }]
  ]
}
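In this response the "Estimated Creation Date" equals the earliest of the per-source timestamps. A minimal sketch of that selection step is below; it is not the Carbon Date service's code, and the function name and the flat input dictionary are assumptions made for illustration.

```python
from datetime import datetime
from typing import Dict, Optional

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"


def estimate_creation_date(observations: Dict[str, str]) -> Optional[str]:
    """Return the earliest timestamp among the sources, or None if every
    source came back empty."""
    parsed = [datetime.strptime(value, TIMESTAMP_FORMAT)
              for value in observations.values() if value]
    return min(parsed).isoformat() if parsed else None


# Per-source values copied from the response shown on this slide.
observations = {
    "Last Modified": "",
    "Bitly.com": "2008-06-08T12:00:00",
    "Topsy.com": "2015-01-25T23:31:42",
    "Backlinks": "2003-03-12T05:35:44",
    "Google.com": "2005-01-11T00:00:00",
    "Archives (Earliest)": "1998-12-06T04:02:33",
}
print(estimate_creation_date(observations))  # 1998-12-06T04:02:33
```

Sources that return nothing, such as the empty "Last Modified" value here, are simply skipped.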
95
2015 Hany SalahEldeen Dissertation Defense95 Evaluation Dataset From each we randomly selected 100 unique URLs to create our gold standard dataset
96
Evaluation
Applied our 6 methods on 1200 resources. Get the leftmost estimate.
                          Number of Resources   Percentage
An estimate found         910                   76%
Exact matching estimate   393                   33%
No estimate found         290                   24%
Total Resources           1200                  100%
97
2015 Hany SalahEldeen Dissertation Defense97 Actual Vs. Estimated Dates
98
Conclusion: We can estimate the creation date of resources correctly
Results: Succeeded in estimating the creation date accurately in 75.90% of the resources.
Publications:
1. H. M. SalahEldeen and M. L. Nelson. Carbon dating the web: Estimating the age of web resources. In Proceedings of the 22nd International Conference on World Wide Web Companion, TempWeb03, WWW '13, 2013
99
2015 Hany SalahEldeen Dissertation Defense99 Alexander Nwala did an awesome job releasing the second version of Carbon Date which is more reliable, multithreaded, faster, can handle multiple requests, has caching capabilities. http://cd.cs.odu.edu/
100
2015 Hany SalahEldeen Dissertation Defense100 Alexander Nwala did an awesome job releasing the second version of Carbon Date which is more reliable, multithreaded, faster, can handle multiple requests, has caching capabilities. Yes, it’s better than mine… I admit it
101
Shared Resource | Time | User
Alive | Missing | Replaced | Rate of Change | Archive & Creation
User’s Temporal Intention
102
Problem: There is an inconsistency between what the tweet’s author intended to share at time t_tweet and what the reader might actually read upon clicking on the link at time t_click.
103
Shared Resource | Time | User
Alive | Missing | Replaced | Rate of Change | Archive & Creation | Detecting
What is intention and how do we detect it?
104
Amazon’s Mechanical Turk
Crowdsourcing Internet marketplace
Co-ordinates the use of human intelligence to perform tasks that computers are currently unable to do.*
* http://en.wikipedia.org/wiki/Amazon_Mechanical_Turk
105
Goal: Understand and collect user intention data via MT
Diagram labels: Tweets dataset; Intention Classification Tasks; User Intention Data; Train; Classifier
106
Goal: Understand and collect user intention data via MT
Diagram labels: Tweets dataset; Intention Classification Tasks; User Intention Data; Train; Classifier
Problem: It is not as easy as it seems!
107
How NOT to classify temporal intention 101
The tweet is presented along with the two snapshots:
at t_tweet
at t_click
108
And compared MT results with Experts
Experts: Manually assigning a version to each tweet via a face-to-face meeting with WS-DL members.
For 9 MT assignments per tweet:
If we allowed 4-5 splits we had a 58% match with WS-DL.
If we allowed 3-6 splits or better we got a 31% match,
which is worse than flipping a coin!
109
2015 Hany SalahEldeen Dissertation Defense109 Idea: We need to transform the problem from intention to relevance.
110
Relevance tasks are simpler
MT workers are more accustomed to classification tasks, which require a minimal amount of explanation
Transform a hard problem into an easy one
Is that a cat? - Yes - No
111
Temporal Intention Relevancy Model (TIRM)
Between t_tweet and t_click:
The linked resource could have:
Changed
Not changed
The tweet and the linked resource could be:
Still relevant
No longer relevant
112
2015 Hany SalahEldeen Dissertation Defense112 Resource is changed but relevant The resource changed But it is still relevant Intention: need the current version of the resource at any time
113
2015 Hany SalahEldeen Dissertation Defense113 Relevancy and Intention mapping Current
114
2015 Hany SalahEldeen Dissertation Defense114 Resource is changed and not relevant Intention: need the past version of the resource at any time The resource changed But it is no longer relevant
115
2015 Hany SalahEldeen Dissertation Defense115 Relevancy and Intention mapping Past Current
116
2015 Hany SalahEldeen Dissertation Defense116 Resource is not changed and relevant Intention: need the past version of the resource at any time The resource is not changed And it is relevant
117
2015 Hany SalahEldeen Dissertation Defense117 Relevancy and Intention mapping Past Current Past
118
2015 Hany SalahEldeen Dissertation Defense118 Resource is not changed and not relevant Intention: I am not sure which version of the resource I need The resource is not changed But it is not relevant
119
Relevancy and Intention mapping
                        Relevant   Not relevant
Resource changed        Current    Past
Resource not changed    Past       Not Sure
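The four cases walked through on the preceding slides amount to a small lookup from (changed, relevant) to the intended version. A minimal sketch, with a hypothetical function name and string labels chosen for illustration:

```python
def temporal_intention(changed: bool, relevant: bool) -> str:
    """Map TIRM's two observations to the intended version of the resource."""
    if changed:
        # Changed but still relevant: the reader needs the current version.
        # Changed and no longer relevant: the reader needs the past version.
        return "current" if relevant else "past"
    # Unchanged and relevant: the past version (as stated on slide 116).
    # Unchanged and not relevant: the intention cannot be determined.
    return "past" if relevant else "not sure"
```

For example, a news front page that has rolled over to unrelated stories since the tweet is `changed=True, relevant=False`, which maps to `"past"`.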
120
2015 Hany SalahEldeen Dissertation Defense120 Validation: Update the MT experiment MT workers ≡ judgments of the experts (WS-DL members) ✓ Is the content still relevant to the tweet?
121
Mechanical Turk Workers Vs. Experts
For 100 tweets, % of agreement with WS-DL members:
Agreement in 3-2 split or more votes: 93%
Agreement in 4-1 split or more votes: 80%
Agreement with 5-0 votes: 60%
Cohen’s K = 0.854 (almost perfect agreement)
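Cohen's kappa corrects the raw agreement between two sets of labels for the agreement expected by chance. The sketch below shows the standard two-rater formula; which exact pair of label sets produced the slide's 0.854 is not recoverable from this text, so the inputs and function name here are purely illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and equally long")
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label]
                   for label in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance.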
122
Shared Resource | Time | User
Alive | Missing | Replaced | Rate of Change | Archive & Creation | Detecting | Modeling
Can we model this temporal intention?
123
Data Collection
From the SNAP dataset we extracted tweets in English, where:
Each has an embedded URI pointing to an external resource.
The embedded URI is shortened via Bit.ly.
The external resource:
Still persists.
Has at least 10 mementos.
Is unique.
We extracted 5,937 unique instances
124
Time delta between the tweet and the closest memento
Randomly selected 1,124 instances
Time delta range: 3.07 minutes to 56.04 hours
Average: 25.79 hours ~ 1 day
Figure labels: Before tweet time; Tweet time; After tweet time
125
Training Dataset
R_current: The state of the resource at current time.
R_click: The state of the resource at click time.

Relevant Assignments                            929   82.65%
Non-Relevant Assignments                        195   17.35%

5 MT workers agreeing (5-0 split)               589   52.40%
4 MT workers agreeing (4-1 split)               309   27.49%
3 MT workers agreeing (3-2 close call split)    226   20.11%
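Each instance in this table carries five worker votes that are reduced to one label plus a split such as 5-0, 4-1, or 3-2. A minimal sketch of that reduction, assuming two possible labels and an odd number of votes so that no tie can occur (the function name and label strings are illustrative):

```python
from collections import Counter


def majority_label(votes):
    """Return (winning label, split) for a list of worker votes,
    e.g. ("relevant", "4-1")."""
    if not votes:
        raise ValueError("need at least one vote")
    label, top = Counter(votes).most_common(1)[0]
    return label, f"{top}-{len(votes) - top}"


print(majority_label(["relevant"] * 4 + ["non-relevant"]))  # ('relevant', '4-1')
```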
127
Intention modeling: Feature extraction
For each tweet we perform:
Link analysis
Social media mining
Archival existence
Sentiment analysis
Content similarity
Entity identification
128
2015 Hany SalahEldeen Dissertation Defense128 Training the classifier From the feature extraction phase we extracted 39 different features to train the classifier. Using 10-fold cross validation, the Cost Sensitive Classifier Based on Random Forests gave the highest success rate = 90.32%
129
Most significant features sorted by information gain
Rank   Feature                               Gain Ratio
1      Existence of celebrities in tweets    0.149
2      Number of mementos                    0.090
3      Tweet similarity with current page    0.071
4      Similarity: Current & past page       0.053
5      Similarity: Tweet & past page         0.044
6      Original URI’s depth                  0.032
130
Testing the model
We tested against:
The remaining 4,813 of the original 5,937 instances after extracting the 1,124 used in training.
The tweet collections based on historic events (MJ, Obama, Iran, Syria, & H1N1).

Dataset                     Status 200   Status 404 or other   Relevant %   Non-Relevant %
Extended 4,813 instances    96.77%       3.23%                 96.74%       3.26%
MJ’s Death                  57.54%       42.46%                93.24%       6.76%
H1N1 Outbreak               8.96%        91.04%                97.48%       2.52%
Iran Elections              68.21%       31.79%                94.69%       5.31%
Obama’s Nobel Prize         62.86%       37.14%                93.89%       6.11%
Syrian Uprising             80.80%       19.20%                70.26%       29.75%
131
2015 Hany SalahEldeen Dissertation Defense131 Idea: We need to transform the problem from intention to relevance. Now we need to transform it back! Recap…
132
2015 Hany SalahEldeen Dissertation Defense132 Recap: Relevancy and Intention mapping Past Reading the wrong history
133
2015 Hany SalahEldeen Dissertation Defense133 Mapping TIRM We used 70% similarity as a threshold of relevancy. Reading the wrong history In up to 25% of the cases
134
Conclusion: We can model users’ temporal intention accurately and efficiently
Results:
We successfully transformed the complicated problem of intention to a simpler one of relevance.
We successfully collected a gold standard dataset of temporal user intention.
We found a temporal inconsistency in the shared resource in up to 25% of the cases according to the dataset.
Publications:
1. H. M. SalahEldeen and M. L. Nelson. Reading the correct history?: Modeling temporal intention in resource sharing. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, 2013.
135
2015 Hany SalahEldeen Dissertation Defense135 So we modeled intention… can we make it better?
136
2015 Hany SalahEldeen Dissertation Defense136 Most significant features sorted by information gain
Rank | Feature | Gain Ratio
1 | Existence of celebrities in tweets | 0.149
2 | Number of mementos | 0.090
3 | Tweet similarity with current page | 0.071
4 | Similarity: Current & past page | 0.0527
5 | Similarity: Tweet & past page | 0.04401
6 | Original URI’s depth | 0.0324
138
2015 Hany SalahEldeen Dissertation Defense138 Enhancing TIRM Extending and tuning the features: Linguistic feature analysis Semantic similarity analysis using latent topic modeling Dataset balancing Feature selection and minimization
139
2015 Hany SalahEldeen Dissertation Defense139 A whole lot of features! From 39 to 65 different features in the extended TIRM. Further details: refer to chapter 7
140
2015 Hany SalahEldeen Dissertation Defense140 TIRM enhancement and minimization results
141
2015 Hany SalahEldeen Dissertation Defense141 From binary to probabilistic strength [Figure labels: Point of Confusion: C; Point of Certainty: S; Strongest Current Intention] Further details: refer to chapter 7
142
2015 Hany SalahEldeen Dissertation Defense142 Intention strength formulation Intention strength magnitude of the new resource: Generalization with regard to class:
143
2015 Hany SalahEldeen Dissertation Defense143 Intention strength across instances in dataset
144
2015 Hany SalahEldeen Dissertation Defense144
145
2015 Hany SalahEldeen Dissertation Defense145 [Diagram labels: Shared Resource, Time, User, Alive, Missing, Replaced, Rate of Change, Archive & Creation, Detecting, Modeling, Predicting] Can we find a relation between the modeled intention and time… to predict it?
146
2015 Hany SalahEldeen Dissertation Defense146 Remember: Data Collection
From the SNAP dataset we extracted tweets that:
- Are in English.
- Have an embedded URI pointing to an external resource.
- Have the embedded URI shortened via Bit.ly.
The external resource:
- Still persists.
- Has at least 10 mementos.
- Is unique.
We extracted 5,937 unique instances.
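The selection criteria on this slide amount to a chain of filters. A rough sketch is shown below; the record fields (`language`, `short_uri`, `resolved_uri`, `status_code`, `memento_count`) are hypothetical names invented for the example, "still persists" is assumed to mean an HTTP 200 response, and uniqueness is assumed to be judged on the resolved URI. None of these details is stated on the slide.

```python
from urllib.parse import urlparse

MIN_MEMENTOS = 10  # "has at least 10 mementos"

def select_instances(candidates):
    """Keep candidates that pass every filter, one per resolved URI."""
    seen_uris = set()
    selected = []
    for tweet in candidates:
        if tweet["language"] != "en":
            continue  # tweets in English only
        if urlparse(tweet["short_uri"]).netloc.lower() != "bit.ly":
            continue  # embedded URI must be shortened via Bit.ly
        if tweet["status_code"] != 200:
            continue  # assumed meaning of "still persists"
        if tweet["memento_count"] < MIN_MEMENTOS:
            continue  # needs enough archived copies
        if tweet["resolved_uri"] in seen_uris:
            continue  # assumed meaning of "is unique"
        seen_uris.add(tweet["resolved_uri"])
        selected.append(tweet)
    return selected

# Toy records, made up for illustration.
sample = [
    {"language": "en", "short_uri": "http://bit.ly/a",
     "resolved_uri": "http://example.com/1", "status_code": 200,
     "memento_count": 12},
    {"language": "en", "short_uri": "http://bit.ly/b",
     "resolved_uri": "http://example.com/1", "status_code": 200,
     "memento_count": 12},
    {"language": "fr", "short_uri": "http://bit.ly/c",
     "resolved_uri": "http://example.com/2", "status_code": 200,
     "memento_count": 30},
    {"language": "en", "short_uri": "http://bit.ly/d",
     "resolved_uri": "http://example.com/3", "status_code": 404,
     "memento_count": 15},
]
print(len(select_instances(sample)))
```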
147
2015 Hany SalahEldeen Dissertation Defense147 Intention strength across time We have 10 mementos of the resource uniformly distributed… We can calculate intention strength at every point. [Figure: a time axis running from "Resource = closest memento" to "Resource = current version"]
148
2015 Hany SalahEldeen Dissertation Defense148 Intention strength across time Dataset collection and calculation framework
149
2015 Hany SalahEldeen Dissertation Defense149 Behavior of instances in different classes [Figure: intention strength plotted against time for three classes: Steady Current Intention, Steady Past Intention, Changing Intention]
150
2015 Hany SalahEldeen Dissertation Defense150 Behavior of instances in different classes
151
2015 Hany SalahEldeen Dissertation Defense151 Given the features we already collected can we classify tweets according to their behavioral class?
152
2015 Hany SalahEldeen Dissertation Defense152 Classifying intention behavior across time
153
2015 Hany SalahEldeen Dissertation Defense153 If we can limit the features to the ones that exist before tweet time can we perform a prediction?
154
2015 Hany SalahEldeen Dissertation Defense154 Classifying intention behavior across time We can perform a prediction!
155
2015 Hany SalahEldeen Dissertation Defense155 Intention behavior prediction classifier
156
2015 Hany SalahEldeen Dissertation Defense156 Conclusion: We can predict the author’s temporal intention
Results:
We can predict for the author, with 77% accuracy, whether the intention conveyed to the readers will remain consistent or change.
Publications:
1. H. M. SalahEldeen and M. L. Nelson. Predicting Temporal Intention in Resource Sharing. In Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '15, 2015.
157
2015 Hany SalahEldeen Dissertation Defense157 At this point, we have successfully detected, modeled, and predicted users’ temporal intention in shared content
158
2015 Hany SalahEldeen Dissertation Defense158 [Diagram labels: Shared Resource, Time, User, Alive, Missing, Replaced, Rate of Change, Archive & Creation, Detecting, Modeling, Predicting, User Temporal Intention, Temporal Intention Model]
159
2015 Hany SalahEldeen Dissertation Defense159 So we built an awesome prediction model for Temporal Intention… what next?
160
2015 Hany SalahEldeen Dissertation Defense160 A Framework of Temporal Intention Tools for authors: Enrich the archives with current content for posterity [Timeline figure: Posted a tweet, Read the tweet]
161
2015 Hany SalahEldeen Dissertation Defense161 Prediction API
162
2015 Hany SalahEldeen Dissertation Defense162 Tools for Authors
163
2015 Hany SalahEldeen Dissertation Defense163 Temporal Intention Implementation Tools for readers: Maintain the temporal consistency of content [Timeline figure: Posted a tweet, Read the tweet]
164
2015 Hany SalahEldeen Dissertation Defense164 Tools for readers
165
2015 Hany SalahEldeen Dissertation Defense165 Tools for readers
1. Temporal preservation of vulnerable content
2. Version recommendation based on temporal intention estimation
Target Publication: Utilizing Temporal Intention Prediction for Just-in-time Preservation and Recommendation of Vulnerable Social Media Content. WSDM 2016
166
2015 Hany SalahEldeen Dissertation Defense166 Motivation BackgroundRelated ResearchResearch QuestionUser-Time-Shared Resource Conclusions
167
2015 Hany SalahEldeen Dissertation Defense167 Accomplished Goals
Detect the temporal intention of: 1. The author upon sharing time 2. The reader upon dereferencing time. Further details: refer to chapter 6
Model this intention as a function of time, nature of the resource, and its context. Further details: refer to chapter 7
Predict how resources change with time and the intention behind sharing them to minimize inconsistency. Further details: refer to chapter 8
Implement the prediction model to automatically preserve vulnerable social content that is prone to change or loss and provide a smooth temporal navigation of the social web. Further details: refer to chapter 9
168
2015 Hany SalahEldeen Dissertation Defense168 Also, our work reached fame…
169
2015 Hany SalahEldeen Dissertation Defense169 The Virginian Pilot
170
2015 Hany SalahEldeen Dissertation Defense170 http://www.bbc.com/future/story/20120927-the-decaying-web BBC.com
171
2015 Hany SalahEldeen Dissertation Defense171 Popular Mechanics February 2014 issue, page 20
172
2015 Hany SalahEldeen Dissertation Defense172 3 x MIT Technology Review
http://www.technologyreview.com/view/513996/how-to-carbon-date-a-web-page/
http://www.technologyreview.com/view/519391/internet-archaeologists-reconstruct-lost-web-pages/
http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/
173
2015 Hany SalahEldeen Dissertation Defense173 Mashable
174
2015 Hany SalahEldeen Dissertation Defense174 Mashable Yes I am Indiana Jones of the internet
175
2015 Hany SalahEldeen Dissertation Defense175 Publications
Published, Submitted, In preparation, Planned
JCDL 2011, TPDL 2015, WWW 2016, IJDL 2016
TPDL 2012, SIGIR 2016, WSDM 2016
JCDL 2013
TPDL 2013
WWW 2013
DL 2014
AAAI 2015
IJDL 2015
JCDL 2015
176
2015 Hany SalahEldeen Dissertation Defense176 Remember Rémi Ochlik? Rémi Ochlik 16 October 1983 – 22 February 2012
177
2015 Hany SalahEldeen Dissertation Defense177 … and the missing content about him? Accessed in July 2014
178
2015 Hany SalahEldeen Dissertation Defense178 We can maintain the consistency of history Our Temporal Intention Relevancy Model