ArchiveSpark
Andrej Galad
12/6/2016
CS-5974 – Independent Study
Virginia Polytechnic Institute and State University, Blacksburg, VA 24061
Professor Edward A. Fox
AGENDA
- ArchiveSpark Overview/Recap
- Benchmarking
- Demo
ArchiveSpark
- Framework for efficient data access, extraction, and derivation on Web archive data
- Paper: "ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation" by Helge Holzmann, Vinay Goel, and Avishek Anand
  - Published at JCDL 2016; nominated for the Best Paper Award
- Open-source project: https://github.com/helgeho/ArchiveSpark
Two Techniques
- Pre-generated CDX metadata index
  - Smaller dataset: reduction based on Web archive metadata alone
- Incremental filtering workflow
  - "Extract only what you need": Augment -> Filter -> Repeat
Concept – Enrichments
- Extensions of an ArchiveSpark record
- Featured: StringContent.scala, Html.scala, Json.scala, Entities.scala, Prefix.scala, …
- Custom: mapEnrich[Source, Target](sourceField, targetField)(f: Source => Target) (see the sketch below)
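As a hedged illustration of a custom enrichment built with mapEnrich (the source field "title", the derived field "titleLength", and the records value are assumptions made for this sketch, not documented API):

    // Enrich records with the page <title>, then derive its length as a new field.
    val withTitle  = records.enrich(Html.first("title"))
    val withLength = withTitle.mapEnrich[String, Int]("title", "titleLength")(_.length)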
Workflow
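A minimal sketch of the load, filter, augment cycle in spark-shell, with placeholder paths; the import and spec names follow the ArchiveSpark 2.x README and should be verified against the repository:

    import de.l3s.archivespark._
    import de.l3s.archivespark.implicits._
    import de.l3s.archivespark.enrich.functions._
    import de.l3s.archivespark.specific.warc.specs.WarcCdxHdfsSpec

    // Load: records are instantiated from the lightweight CDX index alone.
    val records = ArchiveSpark.load(sc, WarcCdxHdfsSpec("/data/cdx/*.cdx", "/data/warc"))

    // Filter: cheap, metadata-only selection; no WARC payload is read yet.
    val pages = records.filter(r => r.mime == "text/html" && r.status == 200)

    // Augment: enrichments fetch and derive payload data only for the survivors.
    val enriched = pages.enrich(HtmlText)

    // Repeat as needed, then persist the derived corpus as JSON.
    enriched.saveAsJson("/out/corpus.json")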
Flexible Deployment
- Ultimately a Scala/Spark library
- Environments
  - Standalone, solitary Spark instance
  - Local HDFS-backed Spark cluster
  - Large-scale YARN/Mesos-orchestrated cluster running Cloudera/Hortonworks
- Quickstart: Docker (latest version)
- Versions: ArchiveSpark 2.1.0, Spark 2.0.2, Scala 2.11.7 (hence Java 8)
WARC Files
- Standard Web archiving format (ISO 28500)
- One record = a single capture of a Web resource at a particular time
- Header section: metadata (URL, timestamp, content length, …)
- Payload section: actual response body (HTML, JSON, binary data)
- HTTP response: HTTP headers (origin, status code)
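For a concrete picture, a simplified, hand-written WARC response record is shown below; all values are placeholders:

    WARC/1.0
    WARC-Type: response
    WARC-Target-URI: http://www.example.com/
    WARC-Date: 2011-02-25T00:13:37Z
    WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
    Content-Type: application/http; msgtype=response
    Content-Length: 1234

    HTTP/1.1 200 OK
    Content-Type: text/html

    <html>...</html>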
CDX Index
- A "reduced" WARC file: CDX = WARC metadata
- Pointers to WARC records: offsets into the WARC file
- Header: specifies the metadata fields contained in the index
- Body: typically 9–11 fields
  - Original URL, SURT, date, filename, MIME type, response code, checksum, redirect, meta tags, compressed offset
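A made-up example of an 11-field CDX file follows; the header letters follow the common CDX legend (N = SURT, b = date, a = original URL, m = MIME type, s = status, k = checksum, r = redirect, M = meta tags, S = compressed size, V = compressed offset, g = filename), and the record values are placeholders:

     CDX N b a m s k r M S V g
    com,example)/ 20110225001337 http://www.example.com/ text/html 200 SHA1PLACEHOLDER - - 2392 180243 WIDE-20110225-crawl.warc.gz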
Tools
- CDX Writer
  - Python script for CDX extraction
  - Alternative to the Internet Archive's Wayback Machine (http://archive.org)
- Jupyter Notebook
  - Web application for code sharing and results visualization
- Warcbase (benchmarking only)
  - "State-of-the-art" platform for managing and analyzing Web archives
  - Hadoop/HBase ecosystem (CDH)
  - Archive-specific Scala/Java objects for Apache Spark and HBase
  - HBase command-line utilities (IngestFiles)
Benchmarking
- Evaluation of 3 approaches
  - ArchiveSpark
  - Pure Spark using the Warcbase library
  - HBase using the Warcbase library
- Preprocessing
  - ArchiveSpark: CDX index file extraction
  - HBase: WARC ingestion
- ArchiveSpark Benchmark subproject
  - Requirement: Warcbase built and included
  - Build: sbt assemblyPackageDependency, then sbt assembly
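For contrast, the pure-Spark approach loads full WARC records through Warcbase up front; a minimal sketch following Warcbase's documented RecordLoader API (the path is a placeholder):

    import org.warcbase.spark.matchbox.RecordLoader
    import org.warcbase.spark.rdd.RecordRDD._

    // Every record's payload is materialized before any filtering can happen.
    val pages = RecordLoader.loadArchives("/data/warc/*.warc.gz", sc)
      .keepValidPages()   // keep 200 OK HTML pages
    println(pages.count())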
ENVIRONMENT
- Development
  - Cloudera Quickstart VM – CDH 5.8.2
- Benchmarking
  - Cloudera CDH 5.8.2 cluster hosted on AWS (courtesy of Dr. Zhiwu Xie)
  - 5-node cluster of m4.xlarge AWS EC2 instances
    - 4 vCPUs, 16 GiB RAM, 30 GB EBS storage, 750 Mbps network
Benchmark 1 – Small Scale
- Filtering & corpus extraction, 4 scenarios (see the sketch after this list):
  - Filtering the dataset for a specific URL (one URL benchmark)
  - Filtering the dataset for a specific domain (one domain benchmark)
  - Filtering the dataset for a date range of records (one month benchmark)
  - Filtering the dataset for a specific active (200 OK) domain (one active domain benchmark)
- Dataset: example.warc.gz
  - One capture of the archive.it domain
  - 261 records, 2.49 MB
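In ArchiveSpark these four scenarios reduce to metadata-only filters; a hedged sketch in which the record field names (originalUrl, surtUrl, timestamp, status) and all filter values are assumptions based on the CDX fields above:

    // One URL: exact match on the original URL.
    val oneUrl    = records.filter(_.originalUrl == "http://www.archive-it.org/")
    // One domain: prefix match on the SURT form of the URL.
    val oneDomain = records.filter(_.surtUrl.startsWith("org,archive-it)"))
    // One month: prefix match on the 14-digit CDX timestamp.
    val oneMonth  = records.filter(_.timestamp.startsWith("201102"))
    // One active domain: the domain filter restricted to 200 OK responses.
    val oneActive = oneDomain.filter(_.status == 200)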
Benchmark 1 – Results
Benchmark 2 – Medium Scale
- Filtering & corpus extraction, 4 scenarios (see the sketch after this list):
  - Filtering the dataset for a specific URL (one URL benchmark)
  - Filtering the dataset for a specific domain (one domain benchmark)
  - Filtering the dataset for a specific active (200 OK) domain (one active domain benchmark)
  - Filtering the dataset for pages containing scripts (pages with scripts benchmark)
- Dataset: WIDE collection
  - Internet Archive crawl data from the Webwide Crawl (02/25/2011)
  - 214,470 records, 9,064 MB
  - 9 files, approx. 1 GB each
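The new "pages with scripts" scenario needs payload access; a hedged sketch using the featured Html enrichment (Html.all("script") and filterExists are recalled from the ArchiveSpark docs and should be treated as assumptions):

    // Only records surviving the metadata filter ever have their payload fetched.
    val candidates  = records.filter(r => r.mime == "text/html" && r.status == 200)
    // Enrich with all <script> tags, then keep pages where at least one exists.
    val withScripts = candidates.enrich(Html.all("script"))
                                .filterExists(Html.all("script"))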
Benchmark 2 – Preprocessing
- CDX extraction: 4 minutes 41 seconds
- HDFS upload: 2 minutes 46 seconds
- HBase ingestion (x9)
  - 1 file: 1 minute 10 seconds to 1 minute 32 seconds
  - Sequential ingestion: approx. 13 minutes 54 seconds
Benchmark 2 – Results
Acknowledgement & Disclaimer
This material is based upon work supported by the following grants:
- IMLS LG-71-16-0037-16: Developing Library Cyberinfrastructure Strategy for Big Data Sharing and Reuse
- NSF IIS-1619028, III: Small: Collaborative Research: Global Event and Trend Archive Research (GETAR)
- NSF IIS-1319578, III: Small: Integrated Digital Event Archiving and Library (IDEAL)
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.
Thank you
Tutorial: http://tinyurl.com/zejgc9f