NRV Tweets Final Presentation VT CS4624, Blacksburg, VA

Slides:



Advertisements
Similar presentations
School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING An open discussion and exchange of ideas Introduced by Eric Atwell, Language.
Advertisements

Albert Gatt Corpora and Statistical Methods Lecture 13.
Integrated Digital Event Web Archive and Library (IDEAL) and Aid for Curators Archive-It Partner Meeting Montgomery, Alabama Mohamed Farag & Prashant Chandrasekar.
2015 SLA IT Webinar Using Analytics to Understand Social Media Activity Michelle Chen School of Information San José State University February 4 th, 2015.
ClearTK: A Framework for Statistical Biomedical Natural Language Processing Philip Ogren Philipp Wetzler Department of Computer Science University of Colorado.
Topic Extraction From Turkish News Articles Anıl Armağan Fuat Basık Fatih Çalışır Arif Usta.
Tweetool ( version) Final Report Yilei Qian Computer Science University of Southern California A Twitter Recommend System.
Shallow Processing: Summary Shallow Processing Techniques for NLP Ling570 December 7, 2011.
Natural Language ToolKit ( What is nltk? A tool which allows you to do NLP stuff such as Finding similar words in context, POS tagging etc.
Sentence Classifier for Helpdesk s Anthony 6 June 2006 Supervisors: Dr. Yuval Marom Dr. David Albrecht.
Forecasting with Twitter data Presented by : Thusitha Chandrapala MARTA ARIAS, ARGIMIRO ARRATIA, and RAMON XURIGUERA.
SI485i : NLP Set 12 Features and Prediction. What is NLP, really? Many of our tasks boil down to finding intelligent features of language. We do lots.
Big Data and Hadoop and DLRL Introduction to the DLRL Hadoop Cluster Sunshin Lee and Edward A. Fox DLRL, CS, Virginia Tech Feb. 18, 2015 presentation for.
Big Data and Hadoop and DLRL Introduction to the DLRL Hadoop Cluster Sunshin Lee and Edward A. Fox DLRL, CS, Virginia Tech 21 May 2015 presentation for.
CS 5604 Spring 2015 Classification Xuewen Cui Rongrong Tao Ruide Zhang May 5th, 2015.
Qatar Content Classification Presenter Mohamed Handosa VT, CS6604 March 6, 2014 Client Tarek Kanan 1.
TEXT CLASSIFICATION USING MACHINE LEARNING Student: Hung Vo Course: CP-SC 881 Instructor: Professor Luo Feng Clemson University 04/27/2011.
Solr Team CS5604: Cloudera Search in IDEAL Nikhil Komawar, Ananya Choudhury, Rich Gruss Tuesday May 5, 2015 Department of Computer Science Virginia Tech,
Tweets Metadata May 4, 2015 CS Multimedia, Hypertext and Information Access Department of Computer Science Virginia Polytechnic Institute and State.
Natural language processing tools Lê Đức Trọng 1.
Project Final Presentation – Dec. 6, 2012 CS 5604 : Information Storage and Retrieval Instructor: Prof. Edward Fox GTA : Tarek Kanan ProjArabic Team Ahmed.
TEXT ANALYTICS - LABS Maha Althobaiti Udo Kruschwitz Massimo Poesio.
VIRGINIA TECH BLACKSBURG CS 4624 MUSTAFA ALY & GASPER GULOTTA CLIENT: MOHAMED MAGDY IDEAL Pages.
807 - TEXT ANALYTICS Massimo Poesio Lab 2: (Quick intro to) SOLR Document clustering with MAHOUT.
CS4624S13P- Environment/VWRRC BEN KATZ ERIC HOTINGER BLACKSBURG, VIRGINIA. CLASS: CS VIRGINIA TECH, CLIENT: VWRRC,
Guided By Ms. Shikha Pachouly Assistant Professor Computer Engineering Department 2/29/2016.
GFURR seminar Can Collecting, Archiving, Analyzing, and Accessing Webpages and Tweets Enhance Resilience Research and Education? Edward A. Fox, Andrea.
Problem Solving with NLTK MSE 2400 EaLiCaRA Dr. Tom Way.
VT Web Archiving Anthony Rinaldi and Dev Mehta CS 4624 Clients: Mohamed Magdy and Tarek Kanan Blacksburg, VA 5/6/2014.
Big Data Processing of School Shooting Archives
Image taken from: slideshare
Collection Management (Tweets) Final Presentation
Collection Management Webpages
Collection Management
Floods Joe Acanfora, Myron Su, David Keimig and Marc Evangelista
CS6604 Project Ensemble Classification
Text Classification CS5604 Information Retrieval and Storage – Spring 2016 Virginia Polytechnic Institute and State University Blacksburg, VA Professor:
Virginia Tech Blacksburg CS 4624
Clustering tweets and webpages
CS 5604 Information Storage and Retrieval
The Team Ernesto Cortes Kipp Dunn Sar Gregorczyk Alex Schmidt
CS6604 Digital Libraries IDEAL Webpages Presented by
NRV Tweets Midterm Presentation VT CS4624, Blacksburg, VA
Event Focused URL Extraction from Tweets
Python on AWS Lambda: Practical Applications
Lesson 1: Introduction to Trifacta Wrangler
Collection Management Webpages Final Presentation
Twitter Equity Firm Value
CS6604 Digital Libraries IDEAL Webpages Presented by
Validation of Ebola LOD
Information Storage and Retrieval
Michael Shuffett Virginia Tech Blacksburg, VA
Text Transformation May 5th, 2015 CS Multimedia/Hypertext
Computational Linguistic Analysis of Earthquake Collections
Katrina Database SearchKat
Lesson 4: Advanced Transforms
Adam Lech Joseph Pontani Matthew Bollinger
Through the Fire and Flames
Lesson 6: Tools Chapter 6D – Lookup.
VT Web Archiving Anthony Rinaldi and Dev Mehta CS 4624
Lesson 6: Tools Chapter 6C – Join.
Team 7 → Final Presentation
CS224N Section 3: Corpora, etc.
Lesson 5: Wrangling Tools
Lesson 4: Advanced Transforms
Lesson 5: Wrangling Tools
CS224N Section 3: Project,Corpora
Logistic Regression [Many of the slides were originally created by Prof. Dan Jurafsky from Stanford.]
Lesson 2: Getting Started
Python4ML An open-source course for everyone
Presentation transcript:

NRV Tweets Final Presentation VT CS4624, Blacksburg, VA Sponsors: Mohamed Magdy, Dr. Andrea Kavanaugh, Ji Wang Ben Roble, Justin Cheng, Marwan Sbitani 5/06/14

NRVTweets Actually Tweets and RSS News stories Take 360,000 Tweets and 15,000 RSS News stories from Virtual Town Square DB Use Natural Language Processing to associate tweets and news stories with topics Upload all data into Solr to make it searchable Marwan

Selecting a Tool/API Dozens available Commercial licenses Free usage limits Marwan

Free Tools/API’s considered word2vec - inaccurate Weka - data import issues Stanford Topical Modeling Toolbox - only .csv files LingPipe - unclear Gensim - good PDLA C++ - lack of documentation MALLET - good Mahout - needs HADOOP environment Marwan

MALLET - Machine Learning for Language Toolkit java source - http://mallet.cs.umass.edu/ document classification Naive Bayes, Maximum Entropy, Decision Trees topical modeling LDA, Pachinko Allocation, Hierarchical LDA Marwan

Parsing the data Modify into JSON Large file size Stripping stop words/symbols Preparing for MALLET Train data in MALLET Infer topics from data Export back to JSON Ben

Topic Modeling Topics created from groups of tokens (words), each weighted differently NLP standard is 20 top tokens, but Tweets are on average 15 words [1] Needed to be specific The three most heavily weighted tokens from the most relevant topic was returned [1] http://blog.oup.com/2009/06/oxford-twitter/ Ben

Uploading to Solr Add schema.xml with our json fields Upload to Solr Now searchable by topics Justin

Results All tweets and RSS stories associated with topics virginia tech hokies county board supervisors police crash vehicle veterans war military food pantry program rotary club blacksburg Tweets associated with hashtags Marwan

<word count="2239" weight="0.1616256406554537">big</word> <topic titles="big ten, marshall wood fractures foot reducing, smokey classic, big ten challenge, basketball freshman marshall wood turning heads preseason practices, georgia, join, conference, " totalTokens="13853" alpha="0.2" id="6"> <word count="2239" weight="0.1616256406554537">big</word> <word count="857" weight="0.06186385620443225">georgia</word> <word count="785" weight="0.05666642604490002">ten</word> <word count="652" weight="0.04706561755576409">join</word> <word count="383" weight="0.02764744098751173">conference</word> <phrase count="98" weight="0.08376068376068375">big ten</phrase> <phrase count="42" weight="0.035897435897435895">marshall wood fractures foot reducing</phrase> <phrase count="35" weight="0.029914529914529916">smokey classic</phrase> <phrase count="31" weight="0.026495726495726495">big ten challenge</phrase> <phrase count="27" weight="0.023076923076923078">basketball freshman marshall wood turning heads preseason practices</phrase> </topic> Ben

Justin

Justin

Future work Modify NLP tool parameters Enhance topic association Use different NLP tool/algorithm Justin