Presenters: Radha Krishnan, Sneha Mehta April 28, 2016

Slides:



Advertisements
Similar presentations
Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
Advertisements

TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets Chun Chen 1, Feng Li 2, Beng Chin Ooi 2, and Sai Wu 2 1 Zhejiang University, 2 National.
Integrated Digital Event Web Archive and Library (IDEAL) and Aid for Curators Archive-It Partner Meeting Montgomery, Alabama Mohamed Farag & Prashant Chandrasekar.
Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A.
Databases & Data Warehouses Chapter 3 Database Processing.
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
CS 5604 Spring 2015 Classification Xuewen Cui Rongrong Tao Ruide Zhang May 5th, 2015.
CONTI’2008, 5-6 June 2008, TIMISOARA 1 Towards a digital content management system Gheorghe Sebestyen-Pal, Tünde Bálint, Bogdan Moscaliuc, Agnes Sebestyen-Pal.
Web Archives, IDEAL, and PBL Overview Edward A. Fox Digital Library Research Laboratory Dept. of Computer Science Virginia Tech Blacksburg, VA, USA 21.
Mark Phillips Digital Projects Department University of North Texas Annexation of Texas Project.
Information Literacy. Information Literacy includes: The ability of a student to: 1.Identify the need for information Select a topic 2.Access information.
Dec 9-11, 2003ICADL Challenges in Building Federation Services over Harvested Metadata Hesham Anan, Jianfeng Tang, Kurt Maly, Michael Nelson, Mohammad.
Reducing Noise CS5604: Final Presentation Xiangwen Wang, Prashant Chandrasekar.
Solr Team CS5604: Cloudera Search in IDEAL Nikhil Komawar, Ananya Choudhury, Rich Gruss Tuesday May 5, 2015 Department of Computer Science Virginia Tech,
Tweets Metadata May 4, 2015 CS Multimedia, Hypertext and Information Access Department of Computer Science Virginia Polytechnic Institute and State.
Supporting Large-scale Social Media Data Analyses with Customizable Indexing Techniques on NoSQL Databases.
Mississippi State University Libraries’ EBSCO Discovery Service Experience.
21/11/20151Gianluca Demartini Ranking Clusters for Web Search Gianluca Demartini Paul–Alexandru Chirita Ingo Brunkhorst Wolfgang Nejdl L3S Info Lunch Hannover,
VIRGINIA TECH BLACKSBURG CS 4624 MUSTAFA ALY & GASPER GULOTTA CLIENT: MOHAMED MAGDY IDEAL Pages.
Information Systems Today: Managing in the Digital World TB3-1 3 Technology Briefing Database Management “Modern organizations are said to be drowning.
Oct 12-14, 2003NSDL Challenges in Building Federation Services over Harvested Metadata Kurt Maly, Michael Nelson, Mohammad Zubair Digital Library.
1 IBM Academic Initiative Introduction for Pamplin School of Business Virginia Tech – October 13, 2011 “IBM Academic Skills Cloud and Computing Education.
Crisis, Tragedy and Recovery Network (CTRnet) Slides by Kiran Chitturi, Edward A. Fox, and the CTRnet team
Problem Based Learning To Build And Search Tweet And Web Archives Richard Gruss Edward A. Fox Digital Library Research Laboratory Dept. of Computer Science.
GFURR seminar Can Collecting, Archiving, Analyzing, and Accessing Webpages and Tweets Enhance Resilience Research and Education? Edward A. Fox, Andrea.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
CTRnet Digital Library for Disaster Information Services Seungwon Yang 1, Andrea Kavanaugh 1, Nádia P. Kozievitch 4, Lin Tzy Li 1,4,5, Venkat Srinivasan.
Grant Writing for Digital Projects September 2012 IODE Project Office IODE Project Office Oostende, Belgium Oostende, Belgium Sustainability and.
Information Storage and Retrieval(CS 5604) Collaborative Filtering 4/28/2016 Tianyi Li, Pranav Nakate, Ziqian Song Department of Computer Science Blacksburg,
Big Data Processing of School Shooting Archives
Classification CS5604 Information Storage and Retrieval - Fall 2016
Sentiment and Topic Analysis
SharePoint 101 – An Overview of SharePoint 2010, 2013 and Office 365
Information Storage and Retrieval Fall Lecture 1: Introduction and History.
CS6604 Digital Libraries Global Events Team Final Presentation
LSTA Grant Workshop Jennifer Peacock, Administrative Services Bureau Director David Collins, Grant Programs Director Mississippi Library Commission October.
Traditional Collection and e-Rosources – Policy Dilemma
Collection Management (Tweets) Final Presentation
Collection Management Webpages
IDEALvr Team: Luciano Biondi, Omavi Walker, Dagmawi Yeshiwas
Collection Management
ArchiveSpark Andrej Galad 12/6/2016 CS-5974 – Independent Study
Zenodo Data Archive Irtiza Delwar, Michael Culhane, John Sizemore, Gil Turner Client: Dr. Seungwon Yang Instructor: Dr. Edward A. Fox CS 4624 Multimedia,
Vaccine Code Set Management Services Pilot
Surcharge For Smokers Libby Bledsoe BSN-SN NUR 4030 Introduction
Measuring Sustainability Reporting using Web Scraping and Natural Language Processing Alessandra Sozzi
CLA Team Final Presentation CS 5604 Information Storage and Retrieval
Text Classification CS5604 Information Retrieval and Storage – Spring 2016 Virginia Polytechnic Institute and State University Blacksburg, VA Professor:
Visualizations of School Shootings
Clustering and Topic Analysis
Virginia Tech Blacksburg CS 4624
Clustering tweets and webpages
CS 5604 Information Storage and Retrieval
The Team Ernesto Cortes Kipp Dunn Sar Gregorczyk Alex Schmidt
CMPT 733, SPRING 2016 Jiannan Wang
Social Interactome Recommender Team Final Presentation
Event Focused URL Extraction from Tweets
Team FE Final Presentation
Collection Management Webpages Final Presentation
Event Trend Detector Ryan Ward, Skylar Edwards, Jun Lee, Stuart Beard, Spencer Su CS 4624 Multimedia, Hypertext, and Information Access Instructor: Edward.
Topic Oriented Semi-supervised Document Clustering
Information Storage and Retrieval
Cryptocurrencies: A Brief Look & Sentiment Analysis
Redesigning the Archival Services’ Website with User Perspectives
Tweet URL Analysis Guoxin Sun, Kehan Lyu, Liyan Li
Adam Lech Joseph Pontani Matthew Bollinger
Student name Student ID Degree program Area of specialization
CS5984:Big Data Text Summarization
Personalized Celebrity Video Search Based on Cross-space Mining
CMPT 733, SPRING 2017 Jiannan Wang
Presentation transcript:

Presenters: Radha Krishnan, Sneha Mehta April 28, 2016 Topic Analysis Presenters: Radha Krishnan, Sneha Mehta April 28, 2016 CS5604, Information Retrieval and Storage, Spring 2016 Virginia Polytechnic Institute and State University Blacksburg VA Professor: Dr E. Fox

Acknowledgements Prof. Edward Fox and GRA’s for IDEAL (Sunshin & Mohamed) Digital Libraries Research Laboratory (DLRL) NSF for grant IIS - 1319578, Integrated Digital Event Archiving and Library (IDEAL) All the teams in CS5604

Outline Goals Introduction Design and implementation Experiments on tweets Experiments on webpages Evaluation Conclusions and future work

Goals Create meaningful topic models for tweet and webpage collections Provide intuitive labels for the topics to enable faceted search

Introduction LDA is a generative statistical model Tweets/Webpages can be thought of as mixtures of topics Each word in a tweet/webpage is attributable to one or more topics Output of LDA: Topic-word matrix – Word distribution for each topic Document-topic matrix - Topic distribution for each document

Design

Implementation LDA in Scala using Spark MLlib Workflow: Data preparation Topic extraction Topic labeling Loading results into HBase

Data Preparation Based on test runs, devised our own set of data pre-processing steps Removed stop words, most frequent words, words < 4 characters and non-alphabetic letters Also, created a custom stop words list to remove means, while, because, would etc.

Topic Extraction Five topics extracted for each tweet collection Ten words with highest probabilities for each topic extracted Collection Topics WDBJ shooting Journalists, Roanoke, victims, shooting, virginia Topic Words Journalists Journalists, shooting, newscast, victims, tribute, heartbroken, condolence, watch, everytown, moment

Topic Labeling Word with highest probability chosen as the topic label Topics are given unique labels Topics Words Labels Topic 1 journalists, shooting, newscast, victims… Journalists Topic 2 roanoke, prayers, families, friends… Roanoke Topic 3 journalists, shooting, reporter, police… Shooting

Loading data into HBase Topics are sorted based on probabilities Created a document - topic column family in HBase Created topic - word table in HBase Rowkey Tweet_Topics Webpage_Topics Tweet / Webpage ID Labels Probabilities Rowkey Collection_ID Words Probabilities Topic_Label

Experiments on Tweets Ran LDA on both Spark standalone and Yarn mode Standalone mode supports 100 to 200 iterations in LDA Standalone mode results: Collection Size Time (in seconds) Houston Flood 14813 860 Obamacare (web page) 22393 1615 4thofJuly 31743 2220 NAACP Bombing 33865 4315 Germanwings 90040 4000

Experiments on Tweets Increase in number of iterations increases quality of topics 20 iterations 100 iterations Topic A Topic B reporter heartbroken condolence watch survivor westandwithwdbj police community photographer terrorism condition suspect obama fatal westandwithsdbj control moment

Experiments on Tweets Ran LDA for 100 iterations Found 5 topics on each collection Decrease / Increase in number of topics leads to loss of word coherence Collection Name Topics WDBJ Shooting journalists, shooting, victims, Roanoke, virginia NAACP Bombing naacp, naacpbombing, terrorism, michellemalkin, deray 4thofJuly american, independenceday, fireworks, happy, america Germanwings germanwings, ripley, world, copilot, crash Houston Flood houstonflood, texas, texasflood, flood, billbishopkhou

Experiments on Webpages Topics for #Obamacare Identified 4 topics in webpage collection 5 topics in tweet collection Collection Name Webpage_Topics Tweet_Topics #Obamacare insurance, plans, companies, court obamacare, uninsured, health, arrived, pjnet

Experiments on Webpages Top words for each topic in #Obamacare collection Topics Tweet_Topic_Words Webpage_Topic_Words Topic 1 obamacare,repeal,scottwalker,prolife, little, supports, sisters… plans, fixed, indemnity, insurance, people, judge, health, administration… Topic 2 uninsured, barackobama, americans, thanks, realdonaldtrump, president … percent, increases, federal, health, insurers, affordable, state, claims .. Topic 3 health, insurance, people, healthcare, medicaid, congress… court, religious, mandate, supreme, schools, baptist, organizations, christian… Topic 4 arrived, abcnews, antichrist, doomsday, cometcoreinc, comet, republicans … companies, books, playlist, cancer, medical, health, medicine, problem… Topic 5 pjnet, cruzcrew, benghazi, teaparty, tedcruz, healthcare, uniteblue …

Evaluation – (Part A) Conducted a user study with 2 questions Do these words describe WDBJ shooting? Out of 13 participants, 8 gave more than 8 or above on a scale of 10, 2 gave a score of 7

Evaluation – (Part B) Word intrusion - measure to identify quality of topics Which of these words do not belong to the group?

Lessons Learnt Problems Solutions Repeating words in topics Increase iterations Noisy words in topics Add to stop word list Topic similarity / dissimilarity Decrease / increase topics

Conclusions and Future Work Extracted meaningful topics from each tweet and web page collections Devised an automated technique to generate intuitive labels for each topic Evaluation results confirm the correctness of topic model Future work Create topic labels using metadata like hashtags, titles etc. Using model based approach for labeling - Word2Vec Implement Hierarchical / Online LDA models

Questions ?

Thank you!!!