Collection Management (Tweets) Final Presentation
CS5604: Information Retrieval, Fall 2016
Mitch Wagner, Shuangfei Fan, Faiz Abidi
December 1, 2016
Virginia Tech, Blacksburg, VA
Professor: Dr. Edward Fox
Additions Regarding Tweet Updates

Mode of transfer    Before        Now
MySQL to HDFS       Batch mode    Incremental update
HDFS to HBase       Batch mode    Incremental update
What Features Did We Improve?

Tweet parsing
  Before: Limited amount of tweet parsing.
  Now: We extract many more fields, as required by the different teams.

Social network
  Before: Users as the only kind of node, linked by mentions and retweets, with little emphasis on importance values.
  Now: Three kinds of nodes (users, tweets, and URLs). We use the Twitter API to calculate an importance value for users and tweets, and take the number of occurrences of a URL in a tweet collection as an indication of its importance within that collection.
Incremental Update From MySQL to HDFS
MySQL CollectDB (all new tweets) → pt-archiver → MySQL ArchiveDB (all raw tweets) + uncleaned text file

Tweets are stored in a MySQL server. We use pt-archiver to archive them to the ArchiveDB, and also save them to a text file.

Statistics (3.6 GHz, 16 GB memory machine):
No. of tweets: 155657 | Time: 1 min 35 sec | %CPU: 29 | Memory: 19.7 MB
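A minimal sketch of the pt-archiver invocation, assuming placeholder host, database, and table names; the real connection options live in the team's scripts:

```bash
# Sketch only: the h=, D=, and t= values are placeholders.
# pt-archiver removes rows from the source table as it inserts them into the
# destination table and appends them to a flat file.
pt-archiver \
  --source h=localhost,D=CollectDB,t=z_312 \
  --dest   h=localhost,D=ArchiveDB,t=z_312 \
  --file   '/data/z_312_tweets.txt' \
  --where  '1=1' --limit 1000 --commit-each --statistics
```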
The tweets text file is parsed and cleaned using bash scripts (e.g., removing incorrectly placed "\r" and "\r\n" characters, non-ASCII characters, etc.), producing a cleaned CSV file.

Statistics (3.6 GHz, 16 GB memory machine):
No. of tweets: 155657 | Time: 7.89 sec | %CPU: 57 | Memory: 169.9 MB
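A minimal sketch of this kind of cleaning pass, assuming the file names from the previous step; the team's actual scripts handle more cases:

```bash
# Sketch only (assumed file names). Stray carriage returns inside tweet text
# would otherwise break the one-tweet-per-line format; iconv -c then discards
# bytes that have no ASCII equivalent.
tr -d '\r' < z_312_tweets.txt \
  | iconv -f utf-8 -t ascii -c \
  > z_312_tweets.csv
```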
The cleaned CSV file is then converted to the Avro file format using an open-source tool called csv2avro.

Statistics (3.6 GHz, 16 GB memory machine):
No. of tweets: 155657 | Time: 13.64 sec | %CPU: 92 | Memory: 18.2 MB
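A sketch of the conversion step; the exact csv2avro flags are an assumption here, and tweets.avsc is a hypothetical Avro schema matching the cleaned CSV columns, so check the tool's README for the precise interface:

```bash
# Assumed invocation; consult the csv2avro README for the exact flags.
# tweets.avsc is a hypothetical schema describing the tweet fields.
csv2avro --schema tweets.avsc z_312_tweets.csv > z_312_tweets.avro
```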
The Avro file is put into a specific location on HDFS, depending on the table from which the tweets were extracted.
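For example, under a hypothetical one-directory-per-source-table layout on HDFS:

```bash
# Hypothetical layout: one HDFS directory per source MySQL table.
hdfs dfs -mkdir -p /collections/tweets/z_312
hdfs dfs -put z_312_tweets.avro /collections/tweets/z_312/
```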
When a new Avro file is added to HDFS, it is merged with the existing archive file into a single file using avro-tools.

Statistics (cluster machine, 3.3 GHz, 32 GB memory):
No. of tweets: 155657 | Time: 14.42 sec | %CPU: 45 | Memory: 439.5 MB
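avro-tools provides a concat command for this; a sketch under the assumed paths above (avro-tools here stands for java -jar avro-tools.jar):

```bash
# Sketch with assumed paths: fetch the current archive, concatenate the new
# file onto it, and replace the archive on HDFS.
hdfs dfs -get /collections/tweets/z_312/archive.avro .
avro-tools concat archive.avro z_312_tweets.avro merged.avro
hdfs dfs -put -f merged.avro /collections/tweets/z_312/archive.avro
```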
Full pipeline: MySQL CollectDB → (pt-archiver) → MySQL ArchiveDB + uncleaned text file → (bash scripts) → cleaned CSV file → (csv2avro) → Avro file → HDFS → (avro-tools) → merged Avro files on HDFS
Incremental Update from HDFS to HBase + Tweet Processing
Tweet Loading Pipeline

Components: MySQL server → cluster HDFS (temporary collection Avros and archive Avros) → processing pipeline → HBase table ideal-cs5604f16

1) New data copied over to cluster HDFS
2) New data processed and merged into HBase
3) Temporary files merged into archive files
Tweet Processing Pipeline

1. Initial Read: Pig scripts load basic tweet info from the Avro file into HBase and initialize various other columns to simplify later processing.
2. Stanford NLP: Java for Stanford Named Entity Recognition and lemmatization.
3. Final Cleaning: Pig + Python for the remaining "clean-tweet" column family (see the sketch below).
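A minimal sketch of the kind of Python cleaning a Pig-invoked script might apply in the final step; the rules shown are assumptions, and the actual ones live in the team's Pig and Python scripts:

```python
# Sketch of a clean-tweet text normalizer (assumed rules, not the team's exact ones).
import re

def clean_tweet(text):
    """Strip URLs, user mentions, and symbol noise from raw tweet text."""
    text = re.sub(r'https?://\S+', '', text)       # remove URLs
    text = re.sub(r'@\w+', '', text)               # remove user mentions
    text = re.sub(r'[^A-Za-z0-9#\s]', ' ', text)   # keep hashtags, drop other symbols
    return ' '.join(text.split()).lower()

print(clean_tweet("RT @vt_news: Water main break on campus! https://t.co/abc #VT"))
# -> rt water main break on campus #vt
```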
Running Time

Test collection: 312 (Water Main Break)
Number of tweets: 155657

Initial read: ~2 minutes
Lemmatization: ~33 minutes
Cleaning step: ~27 minutes
Total time: ~1 hour
Asynchronous Updates

Two clean-tweet columns are better suited for asynchronous updates:
- URL extraction (Twitter has the best information on URLs in tweets; rate-limited)
- Google geolocation (rate-limited)

Scan for rows whose API-dependent columns are not yet populated, make API calls to gather the data, and augment those rows (see the sketch below).
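A sketch of what such a scan-and-augment script could look like, assuming the happybase Python client and the project's table name; lookup_geolocation is a hypothetical stand-in for the rate-limited API call:

```python
# Sketch only: the table and column names follow the project's schema, but
# the scan logic and helper function are assumptions.
import happybase

connection = happybase.Connection('localhost')  # assumed HBase Thrift host
table = connection.table('ideal-cs5604f16')

COL = b'clean-tweet:geolocation'
for row_key, data in table.scan(columns=[COL]):
    if data.get(COL, b'') == b'':
        # Columns are initialized empty during the initial read, so an empty
        # value marks a row still awaiting its API call.
        location = lookup_geolocation(row_key)  # hypothetical, rate-limited
        table.put(row_key, {COL: location.encode()})
```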
Social Network
Build a social network based on the tweet collection Credit: http://www.touchgraph.com
Objective

Rank the nodes for social-network-based recommendations:
- Hot topics
- Popular people

Credit: http://thenextweb.com/twitter/2014/05/09/twitter-updates-web-layout-third-column-content-recommendation/
Pipeline
Previous Work

The Spring 2016 team built a social network G(V, E) where:
- Nodes (V): users
- Edges (E): edges between users based on retweets and mentions (@)
- Importance factor (IP): for edges (based on counts)
Nodes

(figure: the three node types - users, tweets, and URLs)
Edges

(figure: the edge types - user-user, user-tweet, tweet-tweet, and tweet-URL)
Importance Factor

statuses_count: the number of tweets (including retweets) issued by the user
listed_count: the number of public lists of which this user is a member
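For illustration, a sketch of how a user's importance value might combine these two Twitter API fields; the weighting shown is an assumption, not the team's exact formula:

```python
# Sketch: combine activity (statuses_count) and visibility (listed_count).
# The log weighting is an assumption for illustration only.
import math

def user_importance(user):
    statuses = user.get('statuses_count', 0)
    listed = user.get('listed_count', 0)
    # log-scale both so prolific accounts do not dominate outright
    return math.log1p(statuses) + math.log1p(listed)

print(user_importance({'statuses_count': 3200, 'listed_count': 45}))
```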
Visualization

Tools: Python (NetworkX)

Statistics (collection z_3, 300 tweets; the Twitter API imposes rate limits of 180 queries every 15 minutes):
- Nodes: 300 tweet nodes, 158 user nodes, 110 URL nodes
- Edges: 73 user-user edges, 54 tweet-tweet edges, 300 user-tweet edges, 140 tweet-URL edges

Work left: add the importance values to HBase.
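A minimal sketch of how the three-node-type graph might be assembled with NetworkX; the input record is a stand-in for the team's parsed tweet data:

```python
# Sketch: assemble a user/tweet/URL graph (input records are illustrative).
import networkx as nx

G = nx.Graph()

tweets = [
    {'id': 't1', 'user': 'alice', 'urls': ['http://vt.edu/news'], 'mentions': ['bob']},
]

for t in tweets:
    G.add_node(t['id'], kind='tweet')
    G.add_node(t['user'], kind='user')
    G.add_edge(t['user'], t['id'])     # user-tweet edge: authorship
    for url in t['urls']:
        G.add_node(url, kind='url')
        G.add_edge(t['id'], url)       # tweet-URL edge: containment
    for m in t['mentions']:
        G.add_node(m, kind='user')
        G.add_edge(t['user'], m)       # user-user edge: mention

print(G.number_of_nodes(), G.number_of_edges())  # -> 4 3
```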
Visualization

(figure: the collection graph; green nodes are tweets, red are users, blue are URLs)
Summary & Future Work

We have delivered a robust ETL pipeline for moving tweets:
- Flexible scripts accommodate large or small volumes of tweets
- Can store and process thousands of tweets quickly

In the future:
- Do not remove commas and double quotes from the tweet text file
- Develop asynchronous scripts to enhance tweets via API calls
- Run rigorous speed tests and optimize the processing pipeline (including the schema)
- Develop a more extensive plan for handling profanity
- Add hashtags to the social network
Challenges Faced

Incomplete documentation from the previous semester:
- What has been written? What hasn't?
- What is the architecture of the system? What is the data pipeline?
- Schema
- Unfamiliarity with HBase, Pig, Twitter, Stanford NER
- Large, pre-existing system to understand

Working in groups:
- Finding a meeting time that works for everyone
- Dividing work based on our varying expertise
- The dilemma of working together vs. individually on parts of the project
- Who is responsible for what?
As a Learning Experience

Exposure to different technologies:
- HBase + Hadoop framework
- Pig
- Stanford NLP
- Regex

Concepts:
- Extract, Transform, Load (ETL) pipelines
- NoSQL databases
- Text parsing
- Communication and synchronization between teams

Overall:
- Divide responsibilities
- Work iteratively
- Ask questions
Acknowledgement IDEAL: NSF IIS-1319578 GETAR: NSF IIS-1619028 Dr. Edward A. Fox GRA: Sunshin Lee
References

Percona, "Percona - The Database Performance Experts." https://www.percona.com/, 2016.
"csv2avro - Convert CSV files to Avro." https://github.com/sspinc/csv2avro, 2016.
A. A. Hagberg, D. A. Schult, and P. J. Swart, "Exploring network structure, dynamics, and function using NetworkX," in Proceedings of the 7th Python in Science Conference (SciPy 2008), Pasadena, CA, USA, pp. 11-15, Aug. 2008.
"CMT Team's Codebase on GitHub." https://github.com/mitchwagner/CMT, 2016.
"TouchGraph." http://www.touchgraph.com/news, 2016.
N. Garun, "Twitter updates its Web layout with a third column for content recommendation." http://thenextweb.com/twitter/2014/05/09/twitter-updates-web-layout-third-column-content-recommendation/, 2014.