Extraction, aggregation and classification at Web Scale


Extraction, aggregation and classification at Web Scale
Philippe Rigaux, Internet Memory Research

Internet Memory in brief
- Internet Memory Foundation (formerly European Archive): a non-profit institution devoted to the preservation (archiving) of Web collections
- Internet Memory Research: a spin-off which provides services on big datasets (crawls, analysis, classification, annotations)

What are we doing, and why?
- We collect and store collections of Web resources/documents: a Big Data warehouse made from Web documents
- We chose the Hadoop suite for scalable storage (HDFS), data modeling (HBase) and distributed computation (MapReduce)
- We support navigation in archived collections
- We produce structured content, extracted (from single documents) and aggregated (from groups of documents)

Overview of Mignify

Talk Outline
- Crawler (brief); document repository (brief)
- Design of Mignify: how we build an ETL platform on top of Hadoop
- Open issues: what we could probably do better, and what we don't do at all (for the moment)

Our collections
- Periodic (monthly) crawls of the Web graph. Ex: we can collect 1 billion resources during a 3-week campaign with a 10-server cluster
- No complex operations at crawl time: a « collection » is a set of unfiltered Web docs = mostly garbage!
- We store and post-process collections
- Size? Hundreds of TBs.
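
Back-of-the-envelope, from the figures above: 10^9 resources over a 3-week campaign is roughly 10^9 / (21 days × 86,400 s/day) ≈ 550 resources per second across the cluster, or about 55 per second per server.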

The document repository
Built on the three main components of Hadoop:
- Hadoop Distributed File System (HDFS): a file system specialized in large, append-only files, with replication and fault tolerance
- HBase: a distributed document store (more to come soon)
- MapReduce: distributed computing; very basic, but designed for robustness
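
To make the « large, append-only files » point concrete, here is a minimal sketch using the standard HDFS FileSystem API; the cluster address and file path are invented for illustration.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical cluster address

        FileSystem fs = FileSystem.get(conf);
        Path archive = new Path("/collections/2014-05/part-00000.warc"); // invented path

        // HDFS files are written once and extended, never rewritten in place:
        // create the file on first use, append on subsequent runs.
        try (FSDataOutputStream out = fs.exists(archive)
                ? fs.append(archive)
                : fs.create(archive)) {
            out.write("WARC/1.0\r\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```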

Hadoop essentials
Hadoop brings:
- Linear scalability
- Data locality: all computations are « pushed » near the data
- Fault tolerance: the computation eventually finishes even if a component fails… and that happens very often

Extraction: Typical scenarios
- Full-text indexers: collect documents, prepare for indexing
- Wrapper extraction: get structured information from web sites
- Entity annotation: annotate documents with entity references (e.g., Yago)
- Classification: aggregate subsets (e.g., domains); find relevant topics

Extraction Platform: Main Principles
- A framework for specifying data extraction from very large datasets
- Easy integration and application of new extractors
- High level of genericity in terms of (i) data sources, (ii) extractors, and (iii) data sinks
- An extraction process « specification » combines these elements in so-called views and subscriptions
- [currently] A single extraction engine: based on the specification, data extraction is processed by a single, generic MapReduce job

What is an Extractor?
A software component which:
- Takes a resource as input
- Produces a set of features as output
- Does so very efficiently
Some examples:
- MIME type of a document
- Title and plain text from HTML or PDF
- Title, author, date from RSS feeds
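
A minimal sketch of what such a component could look like in Java; the Extractor interface, the Resource record and the MIME heuristic are all illustrative, not Mignify's actual API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// A resource as stored in the repository: a URL, raw content, and
// the features accumulated so far (model invented for this sketch).
record Resource(String url, byte[] content, Map<String, String> features) {}

// The extractor contract: read a resource, return named features.
interface Extractor {
    Map<String, String> extract(Resource resource);
}

// Example: a trivial MIME-type extractor based on content sniffing.
class MimeTypeExtractor implements Extractor {
    @Override
    public Map<String, String> extract(Resource r) {
        byte[] c = r.content();
        String mime = "application/octet-stream";
        if (c.length >= 4 && c[0] == '%' && c[1] == 'P' && c[2] == 'D' && c[3] == 'F') {
            mime = "application/pdf";
        } else {
            String head = new String(c, 0, Math.min(c.length, 256), StandardCharsets.ISO_8859_1);
            if (head.toLowerCase().contains("<html")) {
                mime = "text/html";
            }
        }
        return Map.of("mime.type", mime);
    }
}
```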

The Extraction Pipeline
- Given a resource as input, we apply a filter and an extractor
- Each extractor produces features which annotate the resource
- The annotated resource is given as input to the next (filter, extractor) pair
- And we iterate (see the sketch below)
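
The loop itself can be sketched in a few lines, reusing the Resource and Extractor types from the previous sketch; Filter and Stage are likewise hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A filter decides whether a stage applies to a resource, typically by
// testing features set by earlier extractors (e.g., the MIME type).
interface Filter {
    boolean accept(Resource resource);
}

// One (filter, extractor) pair in the pipeline.
record Stage(Filter filter, Extractor extractor) {}

class Pipeline {
    private final List<Stage> stages;

    Pipeline(List<Stage> stages) { this.stages = stages; }

    Resource run(Resource input) {
        Resource current = input;
        for (Stage stage : stages) {
            if (!stage.filter().accept(current)) continue; // stage does not apply
            // Merge the new features into the resource, so that later
            // stages can filter on what earlier extractors produced.
            Map<String, String> merged = new HashMap<>(current.features());
            merged.putAll(stage.extractor().extract(current));
            current = new Resource(current.url(), current.content(), merged);
        }
        return current;
    }
}
```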

What is an Aggregator?
A software component which:
- Takes a group of features (produced by extractors)
- Computes some aggregated value from this group
A classifier is a special kind of aggregator.
An example:
- Group resources by domain
- Compute features with an extractor
- For each group, apply a SPAM classifier
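
In the same hypothetical vocabulary, an aggregator can be sketched as follows; the spam heuristic is a deliberately naive placeholder.

```java
import java.util.List;
import java.util.Map;

// The aggregator contract: consume the features of a group of resources
// (e.g., all pages of one domain) and compute one aggregated value.
interface Aggregator<V> {
    V aggregate(String groupKey, List<Map<String, String>> memberFeatures);
}

// A classifier as a special kind of aggregator: flag a domain as spam
// when most of its pages carry a suspicious feature (toy heuristic).
class SpamClassifier implements Aggregator<Boolean> {
    @Override
    public Boolean aggregate(String domain, List<Map<String, String>> pages) {
        long flagged = pages.stream()
                .filter(f -> "true".equals(f.get("spam.keyword.present")))
                .count();
        return flagged > pages.size() / 2;
    }
}
```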

What do we call a « view »?
Extractions are specified as « views », which define:
- The input (HBase, files, data flows, other…) and the output (likewise)
- The list of extractors to apply at resource level
- The list of aggregators to apply at group level (optional)
We attach views to collections, and we store and maintain the view results (materialized views).
Initially, we compute the views with a generic MapReduce job.
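
The slides do not show the concrete specification format, so here is one self-contained way such a view could be modelled in code; every name below is invented.

```java
import java.util.List;

// A view: an input, an output, extractors to apply per resource,
// and aggregators to apply per group (all fields hypothetical).
record ViewSpec(
        String input,                // e.g., an HBase table or a file path
        String output,               // where the materialized view is stored
        List<String> extractors,     // resource-level extractors
        List<String> aggregators) {} // group-level aggregators (optional)

class ViewExample {
    public static void main(String[] args) {
        ViewSpec domainSpam = new ViewSpec(
                "hbase://crawl-2014-05",     // invented source
                "hbase://views/domain-spam", // invented sink
                List.of("MimeTypeExtractor", "SpamKeywordExtractor"),
                List.of("SpamClassifier"));
        System.out.println(domainSpam);
    }
}
```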

Evaluation: a generic MapReduce job
[Architecture diagram: a view configures filters, extractors and aggregators (classifiers); resources flow from the inputs (archive files, HBase) through the filters and extractors in the Map phase, then through the aggregators in the Reduce phase, and out to the outputs (HBase, data files).]
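
A minimal sketch of how such a generic job can look with the real Hadoop MapReduce API: the map phase stands in for the filters/extractors, the reduce phase for the aggregators. The grouping key (domain), the input record format and the placeholder feature strings are assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GenericExtractionJob {

    // Map phase: one call per resource; apply the view's filters and
    // extractors, emit (group key, features).
    public static class ExtractMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text resource, Context ctx)
                throws IOException, InterruptedException {
            // Assume one resource per line, URL first (invented format).
            String url = resource.toString().split("\t", 2)[0];
            String domain = url.replaceAll("^https?://([^/]+).*$", "$1");
            String features = "mime.type=text/html"; // placeholder for extractor output
            ctx.write(new Text(domain), new Text(features));
        }
    }

    // Reduce phase: one call per group; a real view would run its
    // aggregators/classifiers here instead of a simple count.
    public static class AggregateReducer
            extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text domain, Iterable<Text> features, Context ctx)
                throws IOException, InterruptedException {
            long count = 0;
            for (Text ignored : features) count++;
            ctx.write(domain, new Text("resources=" + count));
        }
    }
}
```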