Common Crawl Mining Team: Brian Clarke, Tommy Dean, Ali Pasha, Casey Butenhoff Manager: Don Sanderson (Eastman Chemical Company) Client: Ken Denmark.

Slides:



Advertisements
Similar presentations
EHarmony in Cloud Subtitle Brian Ko. eHarmony Online subscription-based matchmaking service Available in United States, Canada, Australia and United Kingdom.
Advertisements

1 Chapter 12 Working With Access 2000 on the Internet.
BankSystem By: Jon Lemming and Tim Jaffry. Overview System Selection System Selection System Analysis System Analysis System Design System Design Operating.
Large-Scale Content-Based Image Retrieval Project Presentation CMPT 880: Large Scale Multimedia Systems and Cloud Computing Under supervision of Dr. Mohamed.
Web Archives, IDEAL, and PBL Overview Edward A. Fox Digital Library Research Laboratory Dept. of Computer Science Virginia Tech Blacksburg, VA, USA 21.
Python Concurrency Threading, parallel and GIL adventures Chris McCafferty, SunGard Global Services.
Tweets Metadata May 4, 2015 CS Multimedia, Hypertext and Information Access Department of Computer Science Virginia Polytechnic Institute and State.
Notes & Tickler Elliott® Business Software V7.X Edward M. Kwang, President.
Copyright © 2006 Pilothouse Consulting Inc. All rights reserved. Search Overview Search Features: WSS and Office Search Architecture Content Sources and.
VIRGINIA TECH BLACKSBURG CS 4624 MUSTAFA ALY & GASPER GULOTTA CLIENT: MOHAMED MAGDY IDEAL Pages.
Introduction of Geoprocessing Lecture 9. Geoprocessing  Geoprocessing is any GIS operation used to manipulate data. A typical geoprocessing operation.
1 IBM Academic Initiative Introduction for Pamplin School of Business Virginia Tech – October 13, 2011 “IBM Academic Skills Cloud and Computing Education.
CS5604: Final Presentation ProjOpenDSA: Log Support Victoria Suwardiman Anand Swaminathan Shiyi Wei Department of Computer Science, Virginia Tech December.
1 Chapter The Impact of Database Customer centric approach - A highly personal approach Marketing databases are essential to the marketing process.
Introduction to ASP.NET development. Background ASP released in 1996 ASP supported for a minimum 10 years from Windows 8 release ASP.Net 1.0 released.
Csinparallel.org Workshop 307: CSinParallel: Using Map-Reduce to Teach Parallel Programming Concepts, Hands-On Dick Brown, St. Olaf College Libby Shoop,
Information Storage and Retrieval(CS 5604) Collaborative Filtering 4/28/2016 Tianyi Li, Pranav Nakate, Ziqian Song Department of Computer Science Blacksburg,
Large Scale Semantic Data Integration and Analytics through Cloud: A Case Study in Bioinformatics Tat Thang Parallel and Distributed Computing Centre,
Manage your projects efficiently and on a high level PROJECT MANAGEMENT SYSTEM Enovatio Projects Efficient project management Creating project plans Increasing.
How do Web Applications Work?
Introduction to Dynamic Web Programming
CS6604 Digital Libraries Global Events Team Final Presentation
ITCS-3190.
ECRG High-Performance Computing Seminar
ArchiveSpark Andrej Galad 12/6/2016 CS-5974 – Independent Study
Rdoc2vec Jake Clark, Austin Cooke, Steven Rolph, Stephen Sherrard
Background Check Website for R4 OpSec, LLC
SWITCHdrive Experience with running Owncloud on top of Openstack/Ceph
Zenodo Data Archive Irtiza Delwar, Michael Culhane, John Sizemore, Gil Turner Client: Dr. Seungwon Yang Instructor: Dr. Edward A. Fox CS 4624 Multimedia,
DevOps Deep Dive DevOps Deep Dive What you will learn
VR4GETAR CS4624: Multimedia, Hypertext and Information Access
5 Tips For Better And Quicker Web Research Services - Damco Solutions
Trail Study Kevin Cianfarini, Shane Davies, Marshall Hansen, Andrew Eason … CS4624: Multimedia, Hypertext, and Information Access Instructor: Dr. Edward.
Virginia Tech Blacksburg CS 4624
Tweet Collections Multimedia, Hypertext, and Information Access
Clustering tweets and webpages
CEED Phone App Madhur Mahajan, Zachary Hensley, Randy Liang, Sean Greynolds CS4624: Multimedia, Hypertext, and Information Access Edward A. Fox Virginia.
CS 5604 Information Storage and Retrieval
The Team Ernesto Cortes Kipp Dunn Sar Gregorczyk Alex Schmidt
CS6604 Digital Libraries IDEAL Webpages Presented by
Hey everyone, I’m Sunny …harsh caroline xavier
Graph Query Portal Amit Dayal David Brock
Multimedia Database Virginia Polytechnic Institute and State University Blacksburg, VA CS 4624 Multimedia, Hypertext and Information Access Client.
Event Focused URL Extraction from Tweets
Collection Management Webpages Final Presentation
Stream Field Final Project Presentation
Event Trend Detector Ryan Ward, Skylar Edwards, Jun Lee, Stuart Beard, Spencer Su CS 4624 Multimedia, Hypertext, and Information Access Instructor: Edward.
Tracking FEMA Kevin Kays, Emily Maier, Tyler Leskanic, Seth Cannon
Twitter Equity Firm Value
Wikipedia Hadoop Steven Stulga Spring 2016
CS6604 Digital Libraries IDEAL Webpages Presented by
Validation of Ebola LOD
LucidWorks: Vectorize Workflow Module
Information Storage and Retrieval
News Event Detection Website Joe Acanfora, Briana Crabb, Jeff Morris
Overview of big data tools
Paleontology Topic Trends
Tweet URL Analysis Guoxin Sun, Kehan Lyu, Liyan Li
Katrina Database SearchKat
Adam Lech Joseph Pontani Matthew Bollinger
Why use Web Standards?.
Get your ETL flow under statistical process control
End to End Monitoring Solution using Open Source Technology where webMethods 9.10 is used as ESB IBM Confidential.
CS5984:Big Data Text Summarization
Multithreaded Programming
Blacksburg to Guatemala Archive
Software Requirement and Specification
Query Interface using Django
Indexing with ElasticSearch
Presentation transcript:

Common Crawl Mining Team: Brian Clarke, Tommy Dean, Ali Pasha, Casey Butenhoff Manager: Don Sanderson (Eastman Chemical Company) Client: Ken Denmark (Eastman Chemical Company) Instructor: Edward Fox Multimedia, Hypertext, and Information Access (CS 4624) Virginia Tech, Blacksburg, VA 24061 Spring 2017

Project Goals Improve Eastman Chemical Company’s ability to inform key business decisions Assist product marketing and strategy development Generate information for trend analysis of keywords Publication date extraction Common Crawl keyword searches The main goal behind the Common Crawl Mining project was to improve Eastman Chemical Company’s ability to inform key business decisions, like whether or not to market a certain chemical. In order to do this, they wanted specific information, from the Common Crawl archive, to be used by their product marketing and strategy development divisions. This information would allow them to drill down to a particular time period and look at what people were writing about certain keywords and chemicals during that period. To perform such trend analysis, they simply needed assistance extracting publication dates and performing keyword searches on Common Crawl data; however the system we developed turned out to be much more than this.

History Established client contact Experimented with MapReduce searching Transitioned to Elasticsearch[1] Reduced dataset Configured prototype Tested Elasticsearch prototype Refined Elasticsearch prototype Parallelization Custom relevance scoring Deduplication Here is an overview of our project’s history. We started off trying to establish contact with our client, which took longer than we would have liked. So, we began experimenting with MapReduce to search Common Crawl WARC files for keywords & chemical names. From there, we moved to an Elasticsearch engine due to resource and time constraints. Once we had set up our Elasticsearch prototype, we tested it using keyword searches with chemical names. Lastly we worked on refining our prototype by parallelizing the processing of WARC records, implementing custom relevance scoring, and implementing deduplication during indexing. Now, Ali will go over some of the problems we ran into during this process and how we addressed them.

Problems EMR cluster crashing (node size and default timeout) Change configurations Webhose printing to stdout Dup2() JavaScript in HTTP responses Beautiful Soup Alternatives to Webhose date extraction (11,031/43,748) IBM AlchemyLanguage Scalability Multiprocessing One of the early problems we had during prototyping, was that the EMR cluster would run for a couple of hours and then crash. To limit the cost of production, we were initially using small and medium nodes for the master and core nodes and this was causing an issue. So in order for the cluster to initialize properly, we had to switch to larger nodes. We also ran into a problem with the default timeout for elasticsearch indexing which we had to increase. Another problem we had was when we would SSH into AWS, run the script and then exit, and one of the modules tried to write to STDOUT, webhose in this case, It would not have access to the terminal and so it would terminate. Simply changing STDOUT or running the script in the background wasn’t enough. We had to change it at the file descriptor level using the dup2 function. During the refinement stage, we wanted to get rid of the javascript code in the HTTP responses that were being stored in elasticsearch since they were consuming memory and making search more costly. So we used the decompose function provided by Beautiful soup to get rid of all the style and scripts text. The date extraction tool that we are using is Webhose, which is open source. The ratio of records for which date was extracted was about 25% which is not that great. We could use IBMs Alchemy to get the date but it would incur a large cost. For parallelization, we used multiprocessing instead of multithreading since the Cpythons global lock interpreter causes sequential execution, which isn’t much better.

Demo Querying[2] Visualization[2]

Consider all problems you may face and construct solutions ahead of time.

Acknowledgements Ken Denmark (Eastman Chemical Company) Dr. Edward Fox (Virginia Tech) Liuqing Li (Virginia Tech) Don Sanderson (Eastman Chemical Company)

References [1] "Elastic MapReduce Using Python and MRJob." Elastic MapReduce Using Python and MRJob - Cs4980s15. MoinMoin, 2015-02-12 Web. https://weblog.cs.uiowa.edu/cs4980s15/Elastic%20MapReduce%20using%20Python% 20and% 20MRJob 19 Feb. 2017. [2] “Kibana: Explore, Visualize, Discover Data”. Elastic. 10 Apr. 2017. Web. https://www.elastic.co/products/kibana. 10 Apr. 2017. [3]: Common Crawl. Common Crawl, 2017 Web. http://commoncrawl.org/the-data/get- started/ 19 Feb. 2017. http://commoncrawl.org/the-data/get-started/

Thanks! Questions? [3]