Collection Management

Presentation transcript:

Collection Management
Presenters: Yufeng Ma & Dong Nan
May 3, 2016
CS5604, Information Storage and Retrieval, Spring 2016
Virginia Polytechnic Institute and State University, Blacksburg, VA
Professor: Dr. Edward A. Fox

Good afternoon, everyone. We're the collection management team. It's kind of funny that throughout the semester we were always the first to present results, but now we are the last to conclude. In the next 15 minutes or so, we will present our results on collection management.

Outline
- Goals & Data Flow
- Incremental Update
- Tweet Cleaning
- Webpage Cleaning

Here is the outline of our final presentation. First we will go over the top-level functions of our team and our connections to the other teams. Then we will look at the three main parts: incremental update, tweet cleaning, and webpage cleaning.

Goals
- Keep data in HBase current
- Provide "quality" data
- Identify and remove "noisy" data
- Process and clean "sound" data
- Extract and organize data

Now let's look at the main goals of our team. For incremental update, we are supposed to import only the new tweet records from MySQL into HDFS and finally into HBase. For the cleaning part, our responsibility is to extract useful metadata and features, and then store them in HBase.

Data Flow Diagram

Here is the diagram; the three ovals indicate where we are, and you can also see our collaborations and relationships with the other teams in this figure. All of the other teams' work builds on ours. Image credit: Sunshin Lee

Incremental Update

Now Dong will introduce the incremental update and tweet cleaning parts; after that, I will go over the whole process of webpage cleaning.

Incremental Update: MySQL → HDFS
- The previous bash script imported 700+ tables, with no incremental feature.
- Incrementally import new rows from the relational database (MySQL) to HDFS.
- Use Sqoop's incremental append mode to import the data, as sketched below.

Code credit: Kathleen Ting and Jarek Jarcec Cecho. Apache Sqoop Cookbook. Sebastopol: O'Reilly Media, Inc., 2013. Print.
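For concreteness, here is a minimal sketch of such an incremental import as a saved Sqoop job; the host, database, credentials, table, and check column are all hypothetical, and the command shape follows the Sqoop Cookbook cited above.

    # Create a saved Sqoop job that appends only rows whose id exceeds
    # the last imported value; Sqoop updates --last-value after each run.
    # Host, database, credentials, and column names are hypothetical.
    sqoop job --create tweet_import -- import \
      --connect jdbc:mysql://mysql.example.com/ideal \
      --username ideal --password-file /user/ideal/.password \
      --table tweets \
      --incremental append --check-column id --last-value 0 \
      --target-dir /user/ideal/tweets

    # Each later run picks up where the previous one stopped:
    sqoop job --exec tweet_import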

Incremental Update: HDFS → HBase
- Keep HBase in sync with the data imported onto Hadoop.
- Write a Pig script to load the new data from HDFS into HBase.
- Use the Linux job scheduler cron (via a crontab entry) to run the Pig script periodically, as sketched below.

Image credit: http://itekblog.com/wp-content/uploads/2013/03/crontab.png
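A minimal sketch of the two pieces, with hypothetical paths, HBase table name, and column names; in Pig's HBaseStorage, the first field of the relation becomes the HBase row key.

    -- hdfs_to_hbase.pig: load newly imported rows and write them to HBase.
    -- Paths, table name, and columns are hypothetical.
    raw = LOAD '/user/ideal/tweets' USING PigStorage('\t')
          AS (rowkey:chararray, created_at:chararray, text:chararray);
    STORE raw INTO 'hbase://ideal-cs5604s16'
          USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
              'tweet:created_at tweet:text');

A crontab entry then runs the script periodically, for example at the top of every hour:

    0 * * * * pig -f /home/ideal/hdfs_to_hbase.pig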

Tweet Cleaning

Tweet: Text Cleaning and Info Extraction
- Remove URLs, profanities, and non-characters from raw tweets.
- Extract short URLs from raw tweets, expand them, and map them to the corresponding webpages.
- Extract hashtags (#) and mentions (@) from raw tweets.
- Store the cleaned text and the extracted hashtags and mentions from the HDFS node into HBase.
- All of the cleaning, extraction, and storing is done in Pig Latin; see the sketch below.

Schema: rowkey = collection# - tweet id; column family clean_tweet: clean_text, urls, hashtags, mentions, mappings, collection, doctype.
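A condensed sketch of the kind of Pig Latin involved; the regexes are simplified and the paths, schema, and column names are assumptions, and the real script also handles hashtags, mentions, and the profanity list.

    -- clean_tweets.pig: strip URLs and non-ASCII characters from the text
    -- and pull out the first URL (simplified; names are hypothetical).
    tweets  = LOAD '/user/ideal/tweets' USING PigStorage('\t')
              AS (rowkey:chararray, text:chararray);
    cleaned = FOREACH tweets GENERATE
        rowkey,
        REPLACE(REPLACE(text, 'https?://\\S+', ''),
                '[^\\u0000-\\u007F]', '') AS clean_text,
        REGEX_EXTRACT(text, '(https?://\\S+)', 1) AS urls;
    STORE cleaned INTO 'hbase://ideal-cs5604s16'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'clean_tweet:clean_text clean_tweet:urls');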

Tweet: Text Cleaning and Info Extraction (Example)
[Figure: raw tweet on HDFS]
[Figure: cleaned tweet and extracted info in HBase]

Webpage Cleaning

Thanks to Dong for his detailed explanation of tweet cleaning.

Raw Data

Before we step into the technical details of webpage cleaning, let's look at the raw data we got from the two GRAs. All the webpage records are tab-separated value (TSV) files with a fixed format, as in this picture. The first field is the collection number, a dash, and the tweet ID; then comes the URL for the webpage. After that, a ":URL:" delimiter separates the URL from the corresponding webpage's HTML code.
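Assuming the delimiter is the literal string ":URL:" (an assumption about its exact spelling), splitting one record is straightforward; this Python sketch is for illustration only.

    SEPARATOR = ":URL:"  # assumed spelling of the "colon URL colon" delimiter

    def parse_record(line):
        """Split one raw record into (collection#-tweet_id, url, html)."""
        header, html = line.split(SEPARATOR, 1)
        doc_id, url = header.split("\t", 1)
        return doc_id.strip(), url.strip(), html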

Webpage Sample

We can view this webpage in any browser to get more direct intuition. The only features useful for our further processing, such as classification or LDA, are the title and the body text in the red boxes. Our main purpose, therefore, is to extract these features.

Webpage Cleaning Rules
- Remove non-ASCII characters
- Keep English text only
- Extract URLs
- Remove profane words

Now come the rules for webpage cleaning. One requirement is to remove non-ASCII characters. Only English text should be kept, since most NLP processing assumes an English corpus. On top of this, we extract the URLs and then replace the profane words with special strings.

Libraries/Packages
- BeautifulSoup4: parse text out of HTML and XML files
- Readability: pull out the title and main body text from a webpage
- Langdetect: detect the language of a text using a naive Bayesian filter
- Re: regular expression matching operations

With these rules clarified, we introduced several Python libraries and packages for our purposes. The first is BeautifulSoup, which is very powerful for parsing HTML and extracting tags. Next, the Readability package identifies the readable title and content of the webpage. Then langdetect detects the main language of the text. Finally, regular expressions extract URLs and remove profane words.
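In rough terms, the core calls look like this; this is a sketch with a hypothetical input file, not the exact script.

    from bs4 import BeautifulSoup        # pip install beautifulsoup4
    from readability import Document     # pip install readability-lxml
    from langdetect import detect        # pip install langdetect
    import re

    html = open("sample.html").read()    # hypothetical input file

    readable = Document(html)            # Readability: find the readable part
    title = readable.title()
    body_html = readable.summary()       # main body, still as HTML

    text = BeautifulSoup(body_html, "html.parser").get_text()  # strip tags

    language = detect(text)              # e.g. 'en'

    urls = re.findall(r"https?://\S+", text)  # extract URLs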

Webpage Cleaning Pipeline

[Figure: pipeline] Raw consolidated files & metadata (text format) → RE (identify tweet ID, URL, and original webpage code) → Readability (titles & "useful" HTML) → BeautifulSoup (content-only text) → Langdetect (English-only text) → RE (ASCII-only clean text: replace profanity, extract URLs) → clean webpages in TSV format with a preset schema.

Here is the whole pipeline for webpage cleaning. First we use a regular expression to identify the three key fields. Then BeautifulSoup and Readability together filter out the unnecessary tags and keep only the readable title and content. After this we detect the language, and use regular expressions again to extract URLs and remove profane words.
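Composed into a single function, the pipeline looks roughly like this; the profanity list, replacement token, and exact regexes are assumptions, not the script itself.

    from bs4 import BeautifulSoup
    from readability import Document
    from langdetect import detect
    import re

    PROFANITY = {"badword1", "badword2"}      # hypothetical profanity list

    def clean_webpage(html):
        """Return (title, clean_text, urls), or None for non-English pages."""
        readable = Document(html)
        title = readable.title()
        text = BeautifulSoup(readable.summary(), "html.parser").get_text()
        if not text.strip() or detect(text) != "en":   # keep English only
            return None
        urls = re.findall(r"https?://\S+", text)       # extract URLs first
        text = re.sub(r"https?://\S+", "", text)       # then drop them
        text = text.encode("ascii", "ignore").decode() # ASCII only
        for word in PROFANITY:                         # mask profane words
            text = re.sub(re.escape(word), "<PROFANE>", text, flags=re.I)
        return title, text, urls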

Cleaned Webpage Schema

Schema: rowkey = URL; column family clean_web: collection, lang, domain, doc_id, title, text_clean, text_clean_profanity, urls, mappings, doctype, web_original.

Finally, we upload everything into HBase using a Pig script. Here is the detailed schema for clean_web. We provide both the original HTML code and the useful information extracted from it.

Cleaned Webpages

Here is a sample cleaned webpage, showing its readable title and content.

Future Work
- Clean the big collections
- Clean documents in multiple languages
- Automate webpage crawling and cleanup

Right now we only process small collections, so we will focus next on cleaning the big collections. What's more, it is quite necessary for us to deal with documents in languages other than English. Finally, I will work with Mohamed on automating webpage crawling and cleanup: as soon as an incremental update brings in new tweets, we can extract the URLs in them and crawl the corresponding webpages.

Acknowledgements
- Integrated Digital Event Archiving and Library (IDEAL), NSF IIS-1319578
- Digital Library Research Laboratory (DLRL)
- Dr. Fox, the IDEAL GRAs (Sunshin & Mohamed)
- All of the teams in the class

Finally, we greatly appreciate the help of Dr. Fox, for his edits on our reports and his general suggestions for the project, and of the two GRAs, Sunshin and Mohamed, especially Sunshin, who sacrificed a lot of his research time to help us. Last but not least, we also thank all the other teams in the class for their collaboration and feedback.

Thank You! Thank you guys. Any questions?