
1 BigData NoSQL Hadoop Part I: What? How? What for?
Kacper Szkudlarek, Openlab fellow
CERN – European Organisation for Nuclear Research
EN-ICE-SCD Industrial Controls & Engineering, SCADA Systems
Email: kacper.szkudlarek@cern.ch
Supervised by: Piotr Golonka, Manuel Gonzalez Berges

2 What we are going to talk about
Today:
– BigData
– NoSQL – Not Only SQL
– Hadoop – what is it all about?
– HDFS/MapR – distributed file systems, the base of everything
Next ICETea:
– MapReduce – a new paradigm for data processing
– Hadoop ecosystem tools
– Other NoSQL systems

3 BigData
A combination of old and new technologies that makes it possible to:
– Manage huge volumes of data
– Process them at the right speed
– Stay within the right time frame, allowing real-time analysis and reaction
Designed for all types of data.

4 BigData

5 The BigData characteristics
The so-called 3 "V"s:
– Volume: petabytes and exabytes of data (in a limited number of files)
– Variety: any imaginable type of data
– Velocity: the speed at which data is collected

6 NoSQL Not only SQL

7 What is NoSQL?
Next-generation databases addressing new needs:
– Non-relational
– Distributed
– Open-source
– Horizontally scalable
Systems providing mechanisms for Big Data processing.
A new approach to storing huge amounts of data:
– Not necessarily structured
– Kept in many formats (e.g. key-value pairs, objects, trees …)
Fast processing focused on data analytics.

8 NoSQL examples, divided by data model

9 1: Key-value
A hash-map-like data layout, persisted to a distributed file system.
Examples: Project Voldemort, Riak.
Example key → value pairs:
12345 → Some data
ABCD → Other data
2014.06.19 → Yet another data
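At its core the model is just a map from keys to opaque values; a minimal Java sketch of the idea (an in-memory stand-in, whereas a real store such as Riak persists and distributes the pairs):

import java.util.HashMap;
import java.util.Map;

public class KeyValueDemo {
    public static void main(String[] args) {
        // A key-value store is conceptually a (distributed, persistent) hash map.
        Map<String, String> store = new HashMap<>();
        store.put("12345", "Some data");
        store.put("ABCD", "Other data");
        store.put("2014.06.19", "Yet another data");

        // Lookup is by key only; there is no query language over the values.
        System.out.println(store.get("ABCD")); // prints "Other data"
    }
}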

10 2: Document
The database acts as a store for masses of documents, and each document can have a different structure:
– No set schema
Examples: mongoDB, CouchDB.
{
  _id: 101,
  type: "fruit",
  item: "jkl",
  qty: 10,
  price: 4.25,
  memos: [
    { memo: "on time", by: "payment" },
    { memo: "delayed", by: "shipping" }
  ]
}
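For illustration, a hedged sketch of inserting and querying such a document through the modern MongoDB Java driver (the connection string, database and collection names are assumptions for the example):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import static com.mongodb.client.model.Filters.eq;

public class DocumentStoreDemo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // "shop" and "inventory" are hypothetical names for this sketch.
            MongoCollection<Document> coll =
                    client.getDatabase("shop").getCollection("inventory");

            // No set schema: the JSON document is stored as-is.
            coll.insertOne(Document.parse(
                    "{ _id: 101, type: 'fruit', item: 'jkl', qty: 10, price: 4.25 }"));

            // Query by field value rather than via a relational schema.
            for (Document d : coll.find(eq("type", "fruit"))) {
                System.out.println(d.toJson());
            }
        }
    }
}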

11 3: Column-family
Stores multiple aggregates:
– Identified by a row id and a column-family name
– A more complex data model
– Gains on data retrieval
Examples: Apache HBase, Cassandra.
Example row (id 12345, one column family):
Name: Kacper
Surname: Szkudlarek
City: Saint-Genis-Pouilly
Street: Rue du Bordeau
Postal code: 01630
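A hedged sketch of the same row through the HBase Java client API (the table name, column-family name and qualifiers are invented for the example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnFamilyDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("people"))) {

            // Cells are addressed by (row id, column family, qualifier).
            Put put = new Put(Bytes.toBytes("12345"));
            put.addColumn(Bytes.toBytes("address"), Bytes.toBytes("City"),
                          Bytes.toBytes("Saint-Genis-Pouilly"));
            put.addColumn(Bytes.toBytes("address"), Bytes.toBytes("Street"),
                          Bytes.toBytes("Rue du Bordeau"));
            table.put(put);

            // Retrieval by row id pulls the whole aggregate in one call.
            Result row = table.get(new Get(Bytes.toBytes("12345")));
            System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("address"), Bytes.toBytes("City"))));
        }
    }
}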

12 4: Graph
Models the relations between data:
– Data decomposition
Example: Neo4j.
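A hedged sketch of the idea against the Neo4j 3.x embedded Java API (the labels, property names, relationship type and store path are invented for the example):

import java.io.File;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class GraphDemo {
    public static void main(String[] args) {
        GraphDatabaseService db =
                new GraphDatabaseFactory().newEmbeddedDatabase(new File("data/graph.db"));
        try (Transaction tx = db.beginTx()) {
            // Relations are first-class: data is decomposed into nodes and edges.
            Node person = db.createNode(Label.label("Person"));
            person.setProperty("name", "Kacper");
            Node city = db.createNode(Label.label("City"));
            city.setProperty("name", "Saint-Genis-Pouilly");
            person.createRelationshipTo(city, RelationshipType.withName("LIVES_IN"));
            tx.success();
        }
        db.shutdown();
    }
}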

13 Relaxed data consistency
No ACID (atomicity, consistency, isolation, durability) in the sense known from relational databases:
– Exception: graph DBs, due to data decomposition
No real need for transactions:
– Data is kept aggregated
– An aggregate update is atomic

14 Want more information? https://www.youtube.com/watch?v=qI_g07C_Q5I

15

16 Hadoop = Distributed FS + clustering job scheduler + MapReduce

17 What is Hadoop?
Apache-licensed software.
A batch processing system for a cluster of nodes.
The underpinnings of Big Data processing systems:
– Storing huge amounts of data
– Fast local processing of data split into chunks
Any modern desktop PC can work as a node:
– Decent, automatic scalability
Core and main API written in Java (unfortunately).

18 Who uses Hadoop? (in one form or another)

19 The new Hadoop paradigms
Process data locally.
Reduce dependence on bandwidth.
Expect/accept failure:
– Handle failover elegantly
Duplicate finite blocks of data to small groups of nodes (rather than the entire database).
Reduce elapsed seek time.
Reduce the cost of data processing.

20 The Hadoop Approach
Distribute large amounts of data across thousands of commodity-hardware nodes:
– Process data in parallel
– Replicate data across the cluster for reliability
Move the analysis to the data:
– Avoids copying data
Scan through the data sequentially:
– Avoids random seeks
– The easiest way to process it
Source: http://bitquill.net/blog/?tag=hadoop

21 The ecosystem of projects associated with Hadoop
Data Management: HDFS (Hadoop Distributed File System), YARN (NextGen MapReduce)
Data Access: Batch – MapReduce; Script – Pig; SQL – Hive; NoSQL – HBase; Stream – Storm; others
Integration: Sqoop, Flume, NFS, WebHDFS
Operations: Monitoring – ZooKeeper; Scheduling – Oozie

22 Hadoop and Java
The core of Hadoop and its base projects are developed in Java.
All the APIs (Mapper, Reducer, HDFS and so on) are based on Java interfaces.
Other languages can be used to define certain jobs, or parts of jobs.
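As a flavour of that Java API, a minimal word-count Mapper written against the standard org.apache.hadoop.mapreduce interface (the classic introductory example, not code from the talk):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word in its input split; Hadoop runs one
// Mapper task per input split, locally on the node holding the data.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // key-value pair handed to the Reducer
            }
        }
    }
}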

23 HDFS and other distributed file systems

24 What is HDFS?
The standard Hadoop Distributed File System.
A logical file system, and the primary storage system for Hadoop.
Specialized for read access.
Can handle enormous files (> 100 TB).
Currently deployed only on Linux.

25 HDFS Characteristics
Persistent, replicated, linearly scalable.
Applications stream their reads sequentially:
– Often from very large files
Optimized for read performance:
– Avoids random disk seeks
Write once, read many times; files are append-only.
Data is stored in blocks:
– Distributed over many nodes
– Block sizes often range from 128 MB to 1 GB

26 HDFS Architecture
(Diagram: a NameNode holding the namespace and block map, plus the namespace metadata image (checkpoint) and edit journal log; a Secondary NameNode keeping a backup of the checkpoint image and edit journal log; DataNodes each holding replicated blocks such as BL1, BL7, BL8.)

27 Logical File System
A file's disk blocks are not physically contiguous:
– They are distributed around many DataNodes
– The data is only logically contiguous
The read/write mechanism is transparent to the user.
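What that transparency looks like from the Java client side, in a small hedged sketch (the file path is invented; the Configuration picks up the cluster settings from core-site.xml/hdfs-site.xml):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The client sees one logically contiguous stream; behind the scenes
        // the library asks the NameNode for block locations and streams each
        // block from a DataNode.
        try (FSDataInputStream in = fs.open(new Path("/user/demo/events.log"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            System.out.println(reader.readLine());
        }
    }
}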

28 Data Organization
Metadata:
– Organized into files and directories
– Linux-like permissions prevent accidental deletions
Files:
– Divided into uniformly sized blocks, 64 MB by default
– Distributed across the cluster
– Rack-aware placement (HA, minimization of out-of-rack data transfers)
Checksumming:
– Corruption detection
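Block size and replication can also be chosen per file at write time; a sketch using the FileSystem.create overload that takes them explicitly (the path and values are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Per-file override of the cluster defaults: 3 replicas, 128 MB blocks.
        short replication = 3;
        long blockSize = 128L * 1024 * 1024;
        try (FSDataOutputStream out = fs.create(
                new Path("/user/demo/big.dat"), true /* overwrite */,
                4096 /* buffer size */, replication, blockSize)) {
            out.writeUTF("HDFS files are append-only once written.");
        }
    }
}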

29 HDFS Cluster (I)
HDFS runs in Hadoop's distributed mode.
Three main components; the first is the NameNode (optionally backed by a Secondary NameNode):
– Manages the DataNodes
– Keeps the metadata for all nodes and blocks
– NO automatic failover, even with a Secondary NameNode
– Backups of the logs

30 HDFS Cluster (II)
DataNodes:
– Hold the data blocks
– Are the slaves in the hierarchy
– Manage blocks for HDFS
– If a heartbeat fails: the node is removed from the cluster, and its replicated blocks take over
Client:
– Talks to the NameNode first, then directly to the DataNodes
(Diagram: the NameNode with its fsimage and editlog; DataNode daemons sending heartbeats.)

31 File Access – RPC
(Diagram: client code such as Pig, Hive, HBase or fsshell uses a DistributedFileSystem/FSDataOutputStream object inside the JVM, which talks to the NameNode and the DataNodes.)
1. Request (create/open/delete), providing the name of a file or directory
2. Approval
3. Request for a block
4. Block ID and list of DataNodes
5. Operation on the DataNode (read, write, delete)
6. Return
Note: the NameNode is not in the data path; it only stores metadata.
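The same flow sits behind the fsshell command line; for example (the paths are illustrative):

hadoop fs -put events.log /user/demo/events.log   # write: blocks allocated by the NameNode, data streamed to DataNodes
hadoop fs -cat /user/demo/events.log              # read: block locations from the NameNode, bytes from the DataNodes
hadoop fs -rm /user/demo/events.log               # delete: a metadata operation on the NameNode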

32 Alternative to HDFS: MapR
Built for business-critical production applications:
– A commercial product
– A free-to-use version is available
A new container architecture, different from HDFS.
Implements normal files, visible in the operating system as soon as they are written; accessible via NFS.
Solves the synchronization problem with commodity hardware.
Reliable.

33 Container architecture
Chops the data of each node into thousands of pieces.
Replicates the containers across the cluster.
If a node dies, the others re-replicate the missing data at higher speed.

34 HDFS vs MapR
Disclaimer: the comparison comes from MapR itself. Source: http://www.mapr.com/why-hadoop/why-mapr/architecture-matters

35 MapR advantages
High-availability cluster.
Better performance than HDFS:
– Data from the HDFS NameNode is moved into the cluster
– No file-count limitation
– Lower costs: less hardware in the cluster
NFS interface for cluster access; behaves like a giant NFS server with full HA.
A replicated, ultra-reliable solution is available in the M7 option.
Holder of the TeraSort world record (sorting a 1 TB data set): 55 seconds (youtube link…)

36 Other distributed file systems
GFS – the Google File System, a proprietary file system developed by Google for its own use.
GridFS – a distributed file store used by MongoDB.

37 es-hadoop
A Hadoop extension for working with Elasticsearch data.
Near-real-time responses (think milliseconds).
Dedicated Input/Output classes for reading Elasticsearch data into Hadoop MapReduce.
Follows the Hadoop paradigm of local data processing:
– Each node works on the shards stored on it.
Integration with Hadoop tools (Pig, Hive, etc.).
Horizontal scaling of the cluster.
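A hedged sketch of wiring Elasticsearch into a MapReduce job via es-hadoop's EsInputFormat, in the style of the es-hadoop documentation for the classic mapred API (the node address, index/type and query are assumptions):

import org.apache.hadoop.mapred.JobConf;
import org.elasticsearch.hadoop.mr.EsInputFormat;

public class EsJobSetup {
    public static JobConf configure() {
        JobConf conf = new JobConf();
        conf.set("es.nodes", "localhost:9200");   // where the Elasticsearch cluster lives
        conf.set("es.resource", "logs/events");   // hypothetical index/type to read
        conf.set("es.query", "?q=error");         // the filter is pushed down to the shards
        // Each map task reads the shard co-located with it, keeping processing
        // local in the usual Hadoop fashion.
        conf.setInputFormat(EsInputFormat.class);
        return conf;
    }
}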

38 Distributions of Hadoop
Many different distributions are available:
– Cloudera (under testing @CERN): free VM images / online live service
– Hortonworks: free VM images
– MapR: many free and paid VM machines
– Spring for Apache Hadoop
Where to read about them?
– Online training by Hortonworks and Cloudera

39 To be continued…
MapReduce – a new paradigm for data processing
Hive – an SQL-like data access tool
Pig – a high-level scripting tool for data processing
HBase – a NoSQL system, a new way of thinking about databases

