CS6604 Digital Libraries: IDEAL Webpages. Presented by Ahmed Elbery and Mohammed Farghally. Project client: Mohammed Magdy. Virginia Tech, Blacksburg, 11/16/2018.

Agenda
- Project overview
- Solr and SolrCloud
- Solr for indexing the events
- Hadoop
- Indexing using Hadoop and SolrCloud
- Web Interface
- Overall Architecture
- Screen Shots

Overview
A tremendous amount of data (≈ 10 TB of .warc archive files) about a variety of events has been crawled from the web. The goal is to make this big data conveniently accessible and searchable through the web. Only the HTML files inside the archives are used. Services include browsing events by description, location, and date.

Big picture: Crawled Data → Hadoop → Index → Solr

Solr
Solr is an open-source enterprise search server based on the Lucene Java search library. Solr can be integrated with, among others, PHP, Java, and Python. Clients send requests (e.g., for JSON results) to the Solr server and receive replies.

SolrCloud
What will happen if the server becomes full, in terms of either storage capacity or processing capability? The index can be split into shards (Shard 1, Shard 2), with each shard hosted on its own Solr server, so that a request is served by multiple Solr servers together.

SolrCloud
What is SolrCloud? Shard & replicate: each shard has a leader and one or more replicas. A leader is a node that can accept write requests without consulting other nodes. A ZooKeeper server helps manage the overall structure so that both indexing and search requests can be routed properly. This design provides scalability, fault tolerance, and throughput.

Schema
schema.xml is usually the first file we configure when setting up a new Solr installation. The schema declares:
- what kinds of fields there are
- which field should be used as the unique/primary key
- which fields are required
- how to index and search each field

Schema (Cont.)
We use the following fields:
- category: the event category or type
- name: the event name
- title: the file name
- content: the file content
- URL: the file path on the HDFS system
- id: document ID
- version

SolrCloud control
solrctl instancedir --generate $HOME/solr_configs
solrctl instancedir --create collection1 $HOME/solr_configs
solrctl collection --create collection1 -s numOfShards
http://128.173.49.32:8983/solr/#/~cloud

Event Fields
We use the following fields:
- category: the event category or type
- name: the event name
- title: the file name
- content: the file content
- URL: the file path on the HDFS system
- id: document ID
- text: copy of the previous fields
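As a concrete illustration, the fields above might be declared in schema.xml roughly as follows. This is a hedged sketch: the field types and attributes are assumptions, not the project's actual schema; only the field names come from the slides.

```xml
<!-- Hypothetical sketch of schema.xml declarations for the event fields;
     types and attributes are assumptions. -->
<fields>
  <field name="id"       type="string"       indexed="true" stored="true" required="true"/>
  <field name="category" type="string"       indexed="true" stored="true"/>
  <field name="name"     type="string"       indexed="true" stored="true"/>
  <field name="title"    type="text_general" indexed="true" stored="true"/>
  <field name="content"  type="text_general" indexed="true" stored="true"/>
  <field name="URL"      type="string"       indexed="true" stored="true"/>
  <!-- "text" is the catch-all copy of the previous fields -->
  <field name="text"     type="text_general" indexed="true" stored="false" multiValued="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<copyField source="category" dest="text"/>
<copyField source="name"     dest="text"/>
<copyField source="title"    dest="text"/>
<copyField source="content"  dest="text"/>
<copyField source="URL"      dest="text"/>
```

The copyField directives implement the "text: copy of the previous fields" idea, so a single default search field covers all event metadata.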

Hadoop
What is Hadoop? Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. It uses two main services: HDFS and MapReduce.
Features:
- Scalable: it can reliably store and process petabytes.
- Economical: it distributes the data and processing across clusters of commonly available computers (in the thousands).
- Efficient: by distributing the data, it can process it in parallel on the nodes where the data is located.
- Reliable: it automatically maintains multiple copies of data and automatically redeploys computing tasks on failures.

HDFS Architecture
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients, and a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks, and these blocks are stored in DataNodes. DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the NameNode.
http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
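The block splitting and replication described above can be illustrated with a toy model in Python. This is purely illustrative: real HDFS placement is rack-aware and far more sophisticated, and the round-robin policy below is an invented stand-in.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    # HDFS splits a file into fixed-size blocks; the last block may be smaller
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, datanodes: list, replication: int = 3) -> dict:
    # Toy placement: each block goes to `replication` distinct DataNodes
    # (simple round-robin; real HDFS uses a rack-aware policy)
    return {b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
            for b in range(num_blocks)}

blocks = split_into_blocks(b"x" * 300, block_size=128)   # 3 blocks: 128, 128, 44 bytes
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

With the default replication factor of 3, losing any single DataNode still leaves two copies of every block, which is what makes the automatic re-replication on failure possible.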

Some Terminology
- Job: a "full program," an execution of a Mapper and Reducer across a data set.
- Task: an execution of a Mapper or a Reducer on a slice of data.
- Task Attempt: a particular instance of an attempt to execute a task on a machine.

MapReduce Overview
The user program forks a master and worker processes. The master assigns map and reduce tasks to workers. The input data is divided into splits (Split 0, Split 1, Split 2); map workers read their splits and write intermediate results to local disk; reduce workers perform remote reads, sort the intermediate data, and write the final output files (Output File 0, File 1).
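The split → map → shuffle/sort → reduce flow above can be shown in miniature with plain Python. This is an illustrative sketch of the paradigm (a word count), not the Hadoop API:

```python
from collections import defaultdict

def map_phase(split):
    # Mapper: emit (word, 1) for every word in the input split
    for line in split:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group all intermediate values by their key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word
    return {key: sum(values) for key, values in groups.items()}

splits = [["the quick brown fox"], ["the lazy dog"]]          # two input splits
pairs = [kv for split in splits for kv in map_phase(split)]   # map over each split
counts = reduce_phase(shuffle(pairs))
# counts["the"] == 2
```

In Hadoop each split would be processed by a separate map task on the node holding that block, and the shuffle would move data over the network to the reduce tasks.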

MapReduce in Hadoop (1)

MapReduce in Hadoop (2)

MapReduce in Hadoop (3)

Job Configuration Parameters
On Cloudera: /user/lib/hadoop-*-mapreduce/conf/mapred-site.xml

Map to Reduce
Combiners: Often a map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k (e.g., popular words in Word Count). We can save network time by pre-aggregating at the mapper: combine(k1, list(v1)) → v2, usually the same as the reduce function.
Partition function: For reduce, we need to ensure that records with the same intermediate key end up at the same worker. The system uses a default partition function, e.g., hash(key) mod R.
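Both ideas can be sketched in a few lines of Python (again a paradigm illustration, not Hadoop's actual Combiner/Partitioner classes):

```python
from collections import defaultdict

def combine(mapper_output):
    # Combiner: pre-aggregate (k, v) pairs on the mapper side using the
    # same logic as the reducer (here: sum) to save network traffic
    partial = defaultdict(int)
    for key, value in mapper_output:
        partial[key] += value
    return list(partial.items())

def partition(key, num_reducers):
    # Default partition function: hash(key) mod R, so every record with
    # the same intermediate key goes to the same reducer
    return hash(key) % num_reducers

pairs = [("the", 1), ("fox", 1), ("the", 1)]   # raw mapper output
combined = combine(pairs)                      # ("the", 2), ("fox", 1)
R = 4
assignments = {k: partition(k, R) for k, _ in combined}
```

Without the combiner, three pairs would cross the network; with it, only two do, and the reducer's sum is unchanged because addition is associative.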

IndexerDriver.java

IndexerMapper.java

Solr REST API
Solr is accessible through HTTP requests using Solr's REST API. Drawbacks: it is hard to create complex queries, and results are returned as strings, which requires some form of parsing. Example request:
http://preston.dlib.vt.edu:8983/solr/collection1/select?q=Sisters&wt=json&indent=true&hl=true&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E
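A request like the one above can be assembled programmatically instead of hand-encoding the URL. The sketch below only builds the URL (sending it would require the Solr server to be running); the host and collection are taken from the example request:

```python
from urllib.parse import urlencode

# Host and collection from the example request above
base = "http://preston.dlib.vt.edu:8983/solr/collection1/select"
params = {
    "q": "Sisters",            # query term
    "wt": "json",              # response format
    "indent": "true",
    "hl": "true",              # enable highlighting
    "hl.simple.pre": "<em>",   # urlencode escapes this to %3Cem%3E
    "hl.simple.post": "</em>",
}
url = base + "?" + urlencode(params)
```

Even with such helpers, every result still comes back as a JSON or XML string that must be parsed by hand, which is the motivation for the Solarium client described below.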

Solr REST API (Cont.)
The technologies used:
- Hadoop: required for distributed processing, speeding up the extraction of data from archive files (.warc) into HTML files.
- Solr: required for parsing and indexing the HTML files to make them available and searchable through the web.
- PHP: for server-side web development.
- Solarium: a PHP client for Solr that allows easy communication between PHP programs and the Solr server containing the data.

Solarium
Solarium is a PHP client for Solr that allows easy communication between PHP programs and the Solr server containing the indexed data. Solarium provides an object-oriented interface to Solr, which makes it easier for developers than Solr's REST API. Current version: 3.2.0.

Why Solarium
Solarium makes it easier to create queries: an object-oriented interface rather than a URL-based REST interface.

Why Solarium (Cont.)
Solarium makes it easier to get results: rather than parsing JSON or XML strings, results are returned as PHP associative arrays.

Interface Architecture
The web interface (HTML) sends search requests via AJAX to the server (PHP). The server uses Solarium to send queries to the Solr server, which holds the index, and receives responses (JSON or XML) that Solarium returns as PHP associative arrays. Events information is stored in a MySQL DB.

Overall Architecture
On the back end, the Uploader module places the WARC files on Hadoop; a Map/Reduce extraction/filtering module extracts .html files from them, and a Map/Reduce indexer module builds the Solr index. On the front end, the web interface sends search requests (AJAX) to the PHP module, which queries the Solr server through Solarium and receives responses (JSON or XML) returned as associative arrays. Events information is kept in a MySQL DB.

Screen Shots

Screen Shots (Cont.)

Screen Shots (Cont.)

Screen Shots (Cont.)

Screen Shots (Cont.)

Screen Shots (Cont.)

Mohammed Farghally & Ahmed Elbery