Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis


1 Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu

2 Today’s Talk Why use Hadoop? What is Hadoop? How does Hadoop work? How are we using Hadoop? Issues encountered A broader view – future directions

3 Background “…WorldMap will be extended to be capable of gathering interactive map information from hundreds of other servers around the world and making this map layer information searchable together with the WorldMap layer information.” http://worldmap.harvard.edu/

4 Orientation / Motivation gathering interactive map information from hundreds of other servers around the world KML Shapefiles

5 Overall Process Billions of webpages Hundreds of terabytes of compressed HTML text data

6 We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.

7

8 Process the Data Hundreds of terabytes of compressed HTML text data Thousands of CPU hours Months of processing!

9 Common Crawl Frequency
  [ARC] s3://aws-publicdatasets/common-crawl/crawl-001/ - Crawl #1 (2008/2009)
  [ARC] s3://aws-publicdatasets/common-crawl/crawl-002/ - Crawl #2 (2009/2010)
  [ARC] s3://aws-publicdatasets/common-crawl/parse-output/ - Crawl #3 (2012)
  [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/ - Summer 2013
  [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/ - Winter 2013
  [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-10/ - March 2014
  [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-15/ - April 2014
  [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-23/ - July 2014
  [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-35/ - August 2014
  [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-41/ - September 2014
  [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-42/ - October 2014
  [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-49/ - November 2014
  [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-52/ - December 2014
  [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-06/ - January 2015
  [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-11/ - February 2015
  [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-14/ - March 2015
  [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-18/ - April 2015
  [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-22/ - May 2015
  [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-27/ - June 2015
  [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/ - July 2015

10 Master Slaves

11 Master Slaves

12 Process the data Hours Thousands of CPU hours

13 Master Slaves Scalability Fault Tolerance Resource Sharing

14 Hadoop 1.0 Framework Hadoop Distributed File System - HDFS MapReduce - MR

15 MapReduce Implementation Key : Value or K : V K1 : V1 KO : VO

16 MapReduce Flow (K,V) → (K,[V]) → (K,V)
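
The flow on this slide, from (K,V) pairs to grouped (K,[V]) and back to (K,V), can be sketched in a single process without Hadoop. The sketch below is plain Python, not the Hadoop Java API and not the talk's code; the function names `map_phase`, `shuffle`, and `reduce_phase` and the HTML tag-counting example are assumptions chosen only to illustrate the three stages.

```python
import re
from collections import defaultdict

def map_phase(records):
    """Map: for each (key, html) record, emit one (tag, 1) pair per opening tag."""
    for _key, html in records:
        # Hypothetical, deliberately crude tag matcher; closing tags are skipped.
        for tag in re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)", html):
            yield tag.lower(), 1

def shuffle(pairs):
    """Shuffle: group values by key, turning (K, V) pairs into (K, [V])."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: collapse each (K, [V]) into (K, V) with a simple sum."""
    return {key: sum(values) for key, values in grouped.items()}

# Made-up input records standing in for crawled pages.
records = [
    ("page-1", "<html><body><p>one</p><p>two</p></body></html>"),
    ("page-2", "<html><body><a href='x.kml'>map</a></body></html>"),
]
counts = reduce_phase(shuffle(map_phase(records)))
```

On a real cluster the map and reduce steps run in parallel on different nodes and the framework performs the grouping; here all three run sequentially purely for illustration.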

17 Hadoop HDFS

18

19 Hadoop 1.0 Issues Scalability – Job Tracker does it Job Tracker – single point of failure Resource Utilization – Map & Reduce slots Designed for MapReduce Applications

20 Hadoop Evolution Yet Another Resource Negotiator - YARN

21 Hadoop 2.0 Framework

22 Hadoop Environments Cloud Local Cluster ‘Virtual’

23 A Commodity Server 2009 – 8 cores, 16GB of RAM, 4x1TB disk 2012 – 16+ cores, 48-96GB of RAM, 12x2TB or 12x3TB of disk. http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/

24 Amazon Web Services

25

26 Hadoop on AWS EMR Elastic Compute Cloud (EC2) Elastic MapReduce (EMR) Amazon Web Services (AWS)

27 Implementing Hadoop at CGA
  AWS account – FREE 750 hrs/month
  t1.micro (Hadoop 1.0)
  Smallest Amazon EC2 instance
  Good for learning basics
  Can’t execute Hadoop 2 – needed for libraries
  t1.micro → m1.medium
  Hadoop 2 clusters
  Develop on local machine
  Create specific test WARCs
  Process on cluster
  m1.medium → r3.xlarge

28 CommonCrawl Processing on AWS Local algorithm development Upload application (jar file) to S3 Ruby command-line-interface for EC2/EMR initialization

29 Implementing Hadoop at CGA
  WARCTagCounter.java
  Hadoop ‘configuration’
  Input data information
  Mapper selection
  Reducer selection – simple summer
  TagCounterMap.java
  Mapper functionality
  Extends the Mapper class
  Mapper K1 : V1 → K2 : V2

30
  WARC/1.0
  WARC-Type: response
  WARC-Date: 2014-08-02T09:52:13Z
  WARC-Record-ID:
  Content-Length: 43428
  Content-Type: application/http; msgtype=response
  WARC-Warcinfo-ID:
  WARC-Concurrent-To:
  WARC-IP-Address: 212.58.244.61
  WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
  WARC-Payload-Digest: sha1:M63W6MNGFDWXDSLTHF7GWUPCJUH4JK3J
  WARC-Block-Digest: sha1:YHKQUSBOS4CLYFEKQDVGJ457OAPD6IJO
  WARC-Truncated: length

  HTTP/1.1 200 OK
  Server: Apache
  Vary: X-CDN
  Cache-Control: max-age=0
  Content-Type: text/html
  Date: Sat, 02 Aug 2014 09:52:13 GMT
  Expires: Sat, 02 Aug 2014 09:52:13 GMT
  Connection: close
  Set-Cookie: BBC-UID=......

  BBC NEWS | Africa | Namibia braces for Nujoma exit
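
A WARC record like the one on this slide starts with a version line, followed by `Name: value` header lines, a blank line, and then the captured HTTP response. The sketch below shows one way to pull out those headers in plain Python. It is an illustration only, not the parser used in the talk; the helper name `parse_warc_headers` is hypothetical, and the sample is shortened from the slide.

```python
def parse_warc_headers(record_text):
    """Return (version, headers) from the header block of a single WARC record.

    Reads up to the first blank line and ignores everything after it.
    """
    lines = record_text.splitlines()
    version = lines[0].strip()
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the WARC header block
        # Split on the first colon only, so values such as URLs stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, headers

# Shortened sample built from the header values shown on the slide.
sample = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Date: 2014-08-02T09:52:13Z\r\n"
    "Content-Length: 43428\r\n"
    "WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm\r\n"
    "\r\n"
    "HTTP/1.1 200 OK\r\n"
)
version, headers = parse_warc_headers(sample)
```

The `WARC-Target-URI` value identifies the page a record was captured from, which a mapper would need in order to report where a match was found.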

31 Signature Detection …

32 Signatures
  “http(s)://…/arcgis/rest/services”
  “http(s)://…/arcgiscache”
  “http(s)://…?request=getcapabilities”
  “http(s)://….kml” or “http(s)://….kmz”
  (“shape” || “shp”) && “.zip”
  “http(s)://… "${z}/${x}/${y}" || "${z}/${y}/${x}" || "$[z]/$[x]/$[y]" || "$[z]/$[y]/$[x]" || "{z}/{x}/{y}" || "{z}/{y}/{x}" || "[z]/[x]/[y]" || "[z]/[y]/[x]"
  “http(s)://… request=getmap”
  “http(s)://….jp2”
  “http(s)://….ecw”
  “http(s)://….sid”
  “http(s)://….tfw”
  “http(s)://….gpx”
  “http(s)://….geojson”
  “http(s)://….gdb”
  “http(s)://…thredds…”
  “http(s)://…opendap…”
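
A few of the signatures above can be expressed as regular expressions. The sketch below covers only four of them and is an assumption about how such matching might look in plain Python. The pattern labels (`arcgis_rest`, `getcapabilities`, `kml_kmz`, `zipped_shapefile`), the function `match_signatures`, and the example URLs on `example.org` are all hypothetical and do not come from the talk's Java code.

```python
import re

# Hypothetical labels mapped to case-insensitive patterns for four signatures.
SIGNATURES = {
    "arcgis_rest": re.compile(r"^https?://.+/arcgis/rest/services", re.I),
    "getcapabilities": re.compile(r"^https?://.+\?request=getcapabilities", re.I),
    "kml_kmz": re.compile(r"^https?://.+\.km[lz]$", re.I),
    # ("shape" || "shp") && ".zip": the lookahead requires either word anywhere
    # after the scheme, and the rest of the pattern requires ".zip".
    "zipped_shapefile": re.compile(r"^https?://(?=.*(?:shape|shp)).*\.zip", re.I),
}

def match_signatures(url):
    """Return the labels of every signature pattern that matches the URL."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(url)]

kml_hits = match_signatures("http://cinematreasures.org/theaters/10911.kml")
arcgis_hits = match_signatures("http://example.org/arcgis/rest/services/Roads/MapServer")
caps_hits = match_signatures("http://example.org/wms?REQUEST=GetCapabilities")
shp_hits = match_signatures("https://example.org/data/roads_shp.zip")
no_hits = match_signatures("http://example.org/index.html")
```

In a MapReduce job, a mapper could emit one (URL, 1) pair per match and leave the counting to a summing reducer.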

33 Reducer Output
  http://cinematreasures.org/theaters/10911.kml 1
  http://cinematreasures.org/theaters/10911/map|||http://cinematreasures.org/theaters/10911.kml -1
  Signature match URI (base of URL)

34 Results It worked! Pre-built parsers vs. ‘homebrew’ Jsoup parser: inconsistent processing times RegEx parser: much more consistent results A wide array of geo services vis-à-vis signature choice

35 Issues Implementing Hadoop Hadoop learning curve Native Java application Tutorial information exists Hadoop on AWS: S3, EMR, terminology, billing / cluster size Optimizing cluster: Instance type, CPU, Memory, etc. A wide array of geo services vis-à-vis signature choice What’s out there and what’s its signature

36

37 Future Directions SpatialHadoop A MapReduce Framework for Spatial Data GIS Tools for Hadoop Processing GeoTweets

38 Backup

