Hadoop-based Distributed Web Crawler


Hadoop-based Distributed Web Crawler
Zhen-Feng Shi
Search Engine Group, Research Center of Intelligent Internet of Things (IIoT)
Department of Electronic Engineering, Shanghai Jiao Tong University

Look Ahead
- Motivation
- Brief Introduction to Hadoop
- Distributed Web Crawler
- Current Work
- Future Work

Motivation
(Slide diagram: sources such as IEEE -> Fetch Data? -> Systematic Crawler -> Data -> Academic Search Engine)

Data plays an essential role in our academic search engine. Once we have multiple data sources stored locally, we can build a unified data store, also called a data warehouse, to guarantee accuracy and consistency, and then support upper-layer functions such as recommendation and analysis from the theory group. The problem is: how do we fetch this data from the web into our local servers? There are two strategies:
1. Aim at a specific website and design a dedicated crawler for it. Drawback: a wasteful duplication of effort, since there are so many websites.
2. Crawl papers from Google Scholar. Drawback: Google is itself the biggest crawler on earth; it knows how to crawl, and how to prevent being crawled, better than we do, so we are doomed to be blocked.

Why not Nutch?
- A well-matured, production-ready web crawler
- Great for batch processing
- But two thirds of the system is designed for the search-engine side
- Not helpful for accurate extraction: instead of the HTML title, we need paper titles, authors, abstracts, DOIs, etc.
- Secondary development == complete destruction

There are famous open-source web crawlers such as Nutch. Nutch is a well-matured, production-ready web crawler, and it is great for batch processing. So why don't we use it? The extraction problem is the key reason; see the sketch below.
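
The field-level extraction we need is small when the publisher page cooperates. Below is a minimal sketch using the jsoup HTML parser, assuming the target pages expose Highwire-style citation_* meta tags (IEEE and many publisher sites do); the class and field names are illustrative, not our production code:

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Sketch: pull paper-level metadata out of a publisher page.
// Assumes the page exposes Highwire-style <meta name="citation_*">
// tags (IEEE and many publishers do); names here are illustrative.
public class PaperMetadataExtractor {

    public static class PaperRecord {
        String title;
        String doi;
        String abstractText;
        List<String> authors = new ArrayList<>();
    }

    public static PaperRecord extract(String html, String baseUrl) {
        Document doc = Jsoup.parse(html, baseUrl);
        PaperRecord record = new PaperRecord();
        record.title = meta(doc, "citation_title");
        record.doi = meta(doc, "citation_doi");
        record.abstractText = meta(doc, "citation_abstract");
        // One citation_author tag per author, in page order.
        for (Element e : doc.select("meta[name=citation_author]")) {
            record.authors.add(e.attr("content"));
        }
        return record;
    }

    private static String meta(Document doc, String name) {
        Element e = doc.selectFirst("meta[name=" + name + "]");
        return e == null ? "" : e.attr("content");
    }
}
```

This is exactly the kind of per-site, per-field logic that Nutch's generic parsing pipeline does not give us, which is why secondary development would amount to rewriting it.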

Brief Introduction to Hadoop
- Open-source software for reliable, scalable, distributed computing
- Four modules:
  - Hadoop Common
  - Hadoop Distributed File System (HDFS)
  - Hadoop YARN
  - Hadoop MapReduce
    - Map(): filtering and sorting
    - Reduce(): summary operation
- Web crawler: not a common application of Hadoop
  - Zero input, infinite output
  - Most of the work is done in Map()

Hadoop Common provides the common utilities that support the other Hadoop modules. HDFS is a distributed file system that provides high-throughput access to application data. YARN is a framework for job scheduling and cluster resource management. MapReduce is built on YARN for parallel processing of large data sets. It has two phases, map and reduce: map performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and reduce performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). A web crawler is not a common Hadoop scenario: it has zero input but infinite output, so most of our work is done in the map() phase. Hadoop is still helpful, though, thanks to its powerful distribution management. A map-only job sketch follows.
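
To make "most of the work is done in Map()" concrete, here is a minimal sketch of a map-only Hadoop job, under two assumptions: the input is a plain text file of seed URLs (one per line), and a naive fetch is enough. The class names and counter are illustrative, not the presenter's actual crawler:

```java
import java.io.IOException;
import java.net.URL;
import java.util.Scanner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Map-only crawl sketch: each mapper reads seed URLs (one per line),
// fetches the page, and emits (url, html). With zero reduce tasks,
// mapper output goes straight to HDFS -- the crawl has no summary step.
public class CrawlJob {

    public static class FetchMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String url = line.toString().trim();
            if (url.isEmpty()) return;
            try (Scanner in = new Scanner(new URL(url).openStream(), "UTF-8")) {
                // The "\\A" delimiter slurps the whole response as one token.
                String html = in.useDelimiter("\\A").hasNext() ? in.next() : "";
                ctx.write(new Text(url), new Text(html));
            } catch (IOException e) {
                ctx.getCounter("crawl", "fetch_failures").increment(1);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hadoop-crawl");
        job.setJarByClass(CrawlJob.class);
        job.setMapperClass(FetchMapper.class);
        job.setNumReduceTasks(0); // map-only: no summary phase needed
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // seed URL list
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // fetched pages
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Setting the number of reduce tasks to zero is the standard Hadoop idiom for a job whose output needs no aggregation, which matches the "zero input, infinite output" shape of crawling.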

Distributed Web Crawler
(Architecture diagram: a Watch Dog (job scheduler) maintains a crawled/processing list and dispatches sources such as IEEE, ACM, MAG, and Arxiv to Processing Node 1 through Processing Node N on the Hadoop cluster; fetched data flows into the Data Warehouse. A scheduler sketch follows.)
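
The slide names the components but not their protocol, so the following is only an illustrative sketch of what the Watch Dog might do: keep a shared crawled/processing list and hand pending sources out to processing nodes. The state machine and all names are assumptions:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative Watch Dog sketch: hands pending source URLs to
// processing nodes and tracks them in a crawled/processing list.
// The slide only shows the components; this logic is an assumption.
public class WatchDog {
    enum State { PENDING, PROCESSING, CRAWLED }

    private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();
    private final Map<String, State> stateOf = new ConcurrentHashMap<>();

    public void addSource(String url) {
        // Only enqueue URLs we have never seen before.
        if (stateOf.putIfAbsent(url, State.PENDING) == null) {
            pending.add(url);
        }
    }

    // Called by a processing node asking for work.
    public Optional<String> checkout() {
        String url = pending.poll();
        if (url != null) stateOf.put(url, State.PROCESSING);
        return Optional.ofNullable(url);
    }

    // Called by a processing node when a source is fully crawled.
    public void markCrawled(String url) {
        stateOf.put(url, State.CRAWLED);
    }

    // Re-queue work checked out by a node that died mid-crawl.
    public void requeue(String url) {
        if (stateOf.replace(url, State.PROCESSING, State.PENDING)) {
            pending.add(url);
        }
    }
}
```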

Current Work
- A cluster running Hadoop: one master, two slaves
- A running crawler on Hadoop
- Efficiency comparison:
  - 2 cores + multithreading: 1.92 papers/sec
  - Hadoop: 6.5 papers/sec (roughly a 3.4x speedup)
- Production so far:
  - CiNii Articles: 1 million papers
  - ERIC: 1.3 million papers

Future Work
- More sources: SAO/NASA, Europe PMC, CiteSeerX, …
- Automation and surveillance:
  - Little manual intervention
  - Always up to date

Thanks Q&A