Published by Chastity Price. Modified over 8 years ago.
1
Big Data
Introduction to Big Data, Hadoop and Spark
2
Agenda
About speaker and itversity
Categorization of enterprise applications
Typical data integration architecture
Challenges with conventional technologies
Big Data eco system
Data Storage - Distributed file system and databases
Data Processing - Distributed computing frameworks
Data Ingestion - Open source tools
Data Visualization - BI tools as well as custom
Role of Apache and other companies
Different Certifications in Big Data
Resources for deep dive into Big Data
Job roles in Big Data
3
About me
IT professional with 13+ years of experience in a vast array of technologies including Oracle, Java/J2EE, Data Warehousing, ETL, BI as well as Big Data
Website: http://www.itversity.com
YouTube: https://www.youtube.com/c/itversityin
LinkedIn Profile: https://in.linkedin.com/in/durga0gadiraju
Facebook: https://www.facebook.com/itversity
Twitter: https://twitter.com/itversity
Google Plus: https://plus.google.com/+itversityin
Github: https://github.com/dgadiraju
Meetup:
  – https://www.meetup.com/Hyderabad-Technology-Meetup/ (local Hyderabad meetup)
  – https://www.meetup.com/itversityin/ (typically online)
4
Categorization of Enterprise applications
Enterprise applications
  – Operational
  – Decision Support
  – Customer Analytics
Traditionally, enterprise applications can be broadly categorized into operational and decision support systems.
Lately, a new set of applications such as customer analytics is gaining momentum (e.g., a YouTube channel for different categories of users).
5
Enterprise Applications
Most traditional applications are considered monoliths (n-tier architecture)
  – Monoliths are typically built on RDBMS databases such as Oracle
Modern applications use microservices and are considered to be polyglot
* Now we have a choice of different types of databases (we will see later)
6
Enterprise applications
Take an example use case (eCommerce platform)
Operational
  – Transactional – check out
  – Not transactional – recommendation engine
Decision support – sales trends
Customer analytics – categories in which customers have spent money
* Data from transactional systems needs to be integrated into non-transactional, decision support, customer analytics systems etc
7
Data Integration
Data integration can be categorized into
  – Real time
  – Batch
Traditionally, when there were no customer analytics and recommendation engines, we used to have
  – ODS (compliance and single source of truth)
  – EDW (facts and dimensions to support reports)
  – ETL (to perform transformations and load data into EDW)
  – BI (to visualize and publish reports)
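The ETL step that loads facts and dimensions into an EDW can be sketched in a few lines. This is only a minimal illustration: the `orders` rows, the column names and the helper functions are all hypothetical, and real ETL tools such as Informatica do far more than this.

```python
from collections import defaultdict

# Hypothetical rows as they might arrive from an operational (OLTP) source.
orders = [
    {"order_id": 1, "customer": "alice", "category": "books", "amount": "12.50"},
    {"order_id": 2, "customer": "bob", "category": "games", "amount": "30.00"},
    {"order_id": 3, "customer": "alice", "category": "games", "amount": "7.50"},
]


def build_dimension(rows, column):
    """Assign a surrogate key to each distinct value of a column."""
    dimension = {}
    for row in rows:
        value = row[column]
        if value not in dimension:
            dimension[value] = len(dimension) + 1
    return dimension


def build_facts(rows, customer_dim, category_dim):
    """Transform source rows into fact rows that reference dimension keys."""
    return [
        {
            "order_id": row["order_id"],
            "customer_key": customer_dim[row["customer"]],
            "category_key": category_dim[row["category"]],
            "amount": float(row["amount"]),  # cast text to a numeric measure
        }
        for row in rows
    ]


def sales_by_category(facts, category_dim):
    """A report-style query: total amount per category."""
    names = {key: name for name, key in category_dim.items()}
    totals = defaultdict(float)
    for fact in facts:
        totals[names[fact["category_key"]]] += fact["amount"]
    return dict(totals)


customer_dim = build_dimension(orders, "customer")
category_dim = build_dimension(orders, "category")
facts = build_facts(orders, customer_dim, category_dim)
report = sales_by_category(facts, category_dim)
```

Here `report` plays the role of a BI report built on top of the fact table, while the dimension dictionaries stand in for dimension tables.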
8
Data Integration (Current Architecture)
Source(s): OLTP, Closed Main Frames, XML, External apps
Data Integration (ETL/Real Time)
Target system (e.g., EDW): ODS, EDW/ODS
Visualization/Reporting, Decision Support
9
Data Integration - Technologies
Batch ETL – Informatica, Data Stage etc
Real time data integration – Goldengate, Shareplex etc
ODS – Oracle, MySQL etc
EDW/MPP – Teradata, Greenplum etc
BI Tools – Cognos, Business Objects etc
10
Current Scenario - Challenges
Almost all operational systems use relational databases (RDBMS like Oracle).
  – RDBMS were originally designed for operational and transactional workloads. Not linearly scalable.
  – Transactions
  – Data integrity
Expensive
Predefined schema
Data processing does not happen where data is stored (storage layer) – no data locality
  – Some processing happens at database server level (SQL)
  – Some processing happens at application server level (Java/.NET)
  – Some processing happens at client/browser level (JavaScript)
Almost all data warehouse appliances are expensive and not very flexible for customer analytics and recommendation engines
11
Evolution of Databases
Now we have many choices of databases
Relational databases (Oracle, Informix, Sybase, MySQL etc)
Data warehouse and MPP appliances (Teradata, Greenplum etc)
NoSQL databases (Cassandra, HBase, MongoDB etc)
In-memory databases (Gemfire, Coherence etc)
Search-based databases (Elastic Search, Solr etc)
Batch processing frameworks (Map Reduce, Spark etc)
Graph databases (Neo4j)
* Modern applications need to be polyglot (different modules need different categories of databases)
12
Big Data eco system – History
Started with Google search engine
Google's use case is different from enterprises
  – Crawl web pages
  – Index based on key words
  – Return search results
As conventional database technologies do not scale, they implemented
  – GFS (Distributed file system)
  – Google Map Reduce (Distributed processing engine)
  – Google Big Table (Distributed indexed table)
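The crawl, index and search steps can be illustrated with a tiny inverted index. This is a single-machine sketch only: the `pages` dictionary and its URLs are made up, and it says nothing about how Google actually implemented indexing at scale.

```python
from collections import defaultdict

# Hypothetical crawled pages: URL -> page text.
pages = {
    "a.example/hadoop": "hadoop stores data in a distributed file system",
    "b.example/spark": "spark processes data in memory",
    "c.example/intro": "hadoop and spark are big data tools",
}


def build_index(pages):
    """Map each key word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return pages containing every word of the query, sorted by URL."""
    words = query.lower().split()
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)


index = build_index(pages)
```

For example, `search(index, "spark data")` returns only the pages that contain both words. The index grows with the number of pages, which is why it has to be distributed once the data no longer fits on one machine.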
13
Big Data eco system - myths
Big Data is Hadoop
Big Data eco system can only solve problems with very large data sets
Big Data is cheap
Big Data provides a variety of tools and can solve problems quickly
Big Data is a technology
Big Data is Data Science
  – Data Scientists need to have specialized mathematical skills
  – Domain knowledge
  – Minimal technology orientation
  – Data Science itself is a separate domain - if required, Big Data technologies can be used
* Often people have unrealistic expectations of Big Data technologies
14
Big Data eco system – Characteristics
Distributed storage
  – Fault tolerance (RAID is replaced by replication)
Distributed computing/processing
  – Data locality (code goes to data)
Scalability (almost linear)
Low cost hardware (commodity)
Low licensing costs
* Low cost hardware and software do not mean that Big Data is cheap for enterprises
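Fault tolerance through replication can be sketched as follows. The round-robin placement and the node names are assumptions made for illustration; real distributed file systems such as HDFS use their own placement policies.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Place each block on `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return [
        block
        for block, replicas in placement.items()
        if any(node not in failed for node in replicas)
    ]


# Hypothetical four-node cluster storing six blocks with three replicas each.
nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(num_blocks=6, nodes=nodes, replication=3)
```

With three replicas per block, any two nodes can fail and every block is still readable. Data is lost only when all nodes holding a block's replicas fail together.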
15
Big Data eco system
Big Data eco system of tools can be categorized into
Distributed file systems
  – HDFS
  – Cloud storage (S3/Azure Blob)
Distributed processing engines
  – Map Reduce
  – Spark
Distributed databases (operational)
  – NoSQL databases (HBase, Cassandra)
  – Search databases (Elastic Search)
16
Big Data eco system - Evolution
After successfully building the search engine with new technologies, Google published white papers
Distributed file system – GFS
Distributed processing engine – Map Reduce
Distributed database – Big Table
* Development of Big Data technologies such as Hadoop started with these white papers
17
Data Storage
Data storage options in Big Data eco systems
Distributed file systems (streaming and batch access)
  – HDFS
  – Cloud storage
Distributed databases (random access - distributed indexed tables)
  – Cassandra
  – HBase
  – MongoDB
  – Solr
18
Data Ingestion
Data ingestion strategies are defined by the sources from which data is pulled and the sinks where data is stored
Sources
  – Relational databases
  – Non-relational databases
  – Streaming web logs
  – Flat files
Sinks
  – HDFS
  – Relational or non-relational databases
  – Data processing frameworks
19
Data Ingestion
Sqoop is used to get data from relational databases
Flume and/or Kafka are used to read data from web logs
Spark Streaming, Storm, Flink etc are used to process data from Flume and/or Kafka before loading data into sinks
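The source, processing and sink stages can be sketched with plain Python generators. No Flume, Kafka or streaming engine is involved here: the log lines, field names and the in-memory list used as a sink are all hypothetical stand-ins.

```python
import json

# Hypothetical web log lines, standing in for events delivered by a
# collector such as Flume or Kafka.
raw_log_lines = [
    '{"user": "u1", "path": "/home", "status": 200}',
    'not valid json',
    '{"user": "u2", "path": "/cart", "status": 500}',
    '{"user": "u1", "path": "/checkout", "status": 200}',
]


def source(lines):
    """Source: yield raw events one at a time, like a stream."""
    for line in lines:
        yield line


def parse_events(lines):
    """Processing step: parse events and drop malformed ones."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def only_successful(events):
    """Processing step: keep only requests that returned HTTP 200."""
    for event in events:
        if event.get("status") == 200:
            yield event


def load(events, sink):
    """Sink: append processed events to a target (here, an in-memory list)."""
    for event in events:
        sink.append(event)
    return sink


sink = load(only_successful(parse_events(source(raw_log_lines))), [])
```

Each stage handles one event at a time, so nothing is buffered between source and sink. That is the general shape of processing data before it is loaded.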
20
Data processing - Batch
I/O based
  – Map Reduce
  – Hive and Pig are wrappers on top of Map Reduce
In memory
  – Spark
  – Spark Data Frames is a wrapper on top of core Spark
As part of data processing, we typically focus on transformations such as
  – Aggregations
  – Joins
  – Sorting
  – Ranking
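The four transformation types can be shown on a small in-memory data set. This sketch uses plain Python rather than Hive or Spark, and the `orders` and `customers` data are hypothetical. The same logic would be written as SQL or data frame operations in those tools.

```python
from collections import defaultdict

# Hypothetical input data sets.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 50.0},
    {"order_id": 2, "customer_id": 11, "amount": 20.0},
    {"order_id": 3, "customer_id": 10, "amount": 25.0},
    {"order_id": 4, "customer_id": 12, "amount": 75.0},
]
customers = {10: "alice", 11: "bob", 12: "carol"}

# Aggregation: total amount per customer_id.
totals = defaultdict(float)
for order in orders:
    totals[order["customer_id"]] += order["amount"]

# Join: attach the customer name to each total (inner join on customer_id).
joined = [
    {"name": customers[cid], "total": total}
    for cid, total in totals.items()
    if cid in customers
]

# Sorting: highest total first; ties broken by name for a deterministic order.
ordered = sorted(joined, key=lambda row: (-row["total"], row["name"]))

# Ranking: dense rank, so equal totals share the same rank.
ranked = []
rank = 0
previous_total = None
for row in ordered:
    if row["total"] != previous_total:
        rank += 1
        previous_total = row["total"]
    ranked.append({**row, "rank": rank})
```

In this data alice and carol both total 75.0, so they share rank 1 and bob gets rank 2.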
21
Data processing - Operational
Data is typically stored in distributed databases
Supports CRUD operations
Data is typically distributed
Data is typically sorted by key
Fast and scalable random reads
NoSQL
  – HBase
  – Cassandra
  – MongoDB
Search databases
  – Elastic Search
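Why sorting by key helps can be seen in a single-node sketch. The `SortedKeyStore` class is hypothetical and is not the API of HBase, Cassandra or any other database. It only illustrates CRUD by key plus range scans over sorted keys.

```python
import bisect


class SortedKeyStore:
    """A single-node stand-in for a table whose rows are sorted by key."""

    def __init__(self):
        self._keys = []    # kept in sorted order
        self._values = {}  # key -> value

    def put(self, key, value):
        """Create or update a row."""
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        """Random read of one row by key; None if absent."""
        return self._values.get(key)

    def delete(self, key):
        """Delete a row if it exists."""
        if key in self._values:
            del self._values[key]
            index = bisect.bisect_left(self._keys, key)
            del self._keys[index]

    def scan(self, start, stop):
        """Return (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(key, self._values[key]) for key in self._keys[lo:hi]]


store = SortedKeyStore()
for key, value in [("user:3", "carol"), ("user:1", "alice"), ("user:2", "bob")]:
    store.put(key, value)
```

Because keys are kept sorted, a range scan locates its start and end by binary search instead of reading every row. A distributed database can split the same sorted key space into ranges and assign each range to a different node.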
22
Data Analysis or Visualization
Processed data is analyzed or visualized using
BI Tools
Custom visualization frameworks (d3js)
Ad hoc query tools
23
Let us recollect how data integration is typically done in enterprises (eg: Data Warehousing)
24
Data Integration (Current Architecture)
Source(s): OLTP, Closed Main Frames, XML, External apps
Data Integration (ETL/Real Time)
Target system (e.g., EDW): ODS, EDW/ODS
Visualization/Reporting, Decision Support
25
Use Case – EDW (Current Architecture)
Enterprise Data Warehouse is built for enterprise reporting for a selected audience in executive management; hence the user base viewing the reports is typically in the tens or hundreds
Data Integration
  – ODS (Operational Data Store)
      Sources – Disparate
      Real time – Tools/custom (Goldengate, Shareplex etc)
      Batch – Tools/custom
      Uses – Compliance, data lineage, reports etc
  – Enterprise Datawarehouse
      Sources – ODS or other sources
      ETL – Tools/custom (Informatica, Ab Initio, Talend)
Reporting/Visualization
  – ODS (Compliance related reporting)
  – Enterprise Datawarehouse
  – Tools (Cognos, Business Objects, Microstrategy, Tableau etc)
26
Now we will see how data integration can be done using Big Data eco system
27
Data Ingestion
Apache Sqoop to get data from relational databases into Hadoop
Apache Flume or Kafka to get data from streaming logs
If some processing needs to be done before loading to databases or HDFS, data is processed through streaming technologies such as Flink, Storm etc
28
Data Processing
There are 2 engines to apply transformation rules at scale
  – Map Reduce (uses I/O)
      Hive is the most popular Map Reduce based tool
      Map Reduce works well to process huge amounts of data in a few hours
  – Spark (in memory)
* We will see the disadvantages of Map Reduce and why Spark, with newer programming languages such as Scala and Python, is gaining momentum
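The Map Reduce programming model can be illustrated with the classic word count, written here in plain Python on one machine. The function names are illustrative and this is not the Hadoop API. A real engine runs the map and reduce phases in parallel across nodes and persists intermediate data between them.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input lines.
lines = ["big data is not hadoop", "hadoop is one big data tool"]
counts = word_count(lines)
```

Each map call needs only its own line and each reduce call needs only the values for one key, which is what lets an engine spread the work across many nodes.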
29
Disadvantages of Map Reduce
Disadvantages of Map Reduce based solutions
  – Designed for batch, not meant for interactive and ad hoc reporting
  – I/O bound, and processing of micro batches can be an issue
  – Too many tools/technologies (Map Reduce, Hive, Pig, Sqoop, Flume etc.) to build applications
  – Not suitable for enterprise hardware where storage is typically network mounted
30
Apache Spark
Spark can work with any file system including HDFS
Processing is done in memory – hence I/O is minimized
Suitable for ad hoc or interactive querying or reporting
Streaming jobs can run much faster than with Map Reduce
Applications can be developed using Scala, Python, Java etc
Choose one programming language and perform
  – Data integration from RDBMS using JDBC (no need for Sqoop)
  – Stream data using Spark Streaming
  – Leverage data frames and SQL embedded in the programming language
  – As processing is done in memory, Spark works well with enterprise hardware with network file systems
31
Data Integration (Big Data eco system)
Source(s): OLTP, Closed Main Frames, XML, External apps
Real Time/Batch (No ETL)
Node
Big Data Cluster (EDW/ODS)
ETL
Reporting Database (optional)
Visualization/Reporting, Decision Support
32
Big Data eco system
Big Data Technologies
Hadoop eco system
  – Core Components – Distributed File System (HDFS), Map Reduce
  – Hive, Pig, Sqoop, Oozie, Mahout
Non Map Reduce
  – Ad-hoc querying tools – Impala, Presto
  – NoSQL – HBase, Cassandra
  – Data Ingestion – Kafka, Flume
  – Streaming Analytics – Storm, Flink
  – In Memory processing – Spark
33
Big Data eco system
Big Data Technologies
Hadoop Core Components – Distributed File System (HDFS), Map Reduce
Hadoop eco system
  – Hive – T and L, Batch Reporting
  – Sqoop – E and L
  – Oozie – Workflows
  – Custom Map Reduce – E, T and L
Non Map Reduce
  – Impala – Interactive/adhoc Reporting
  – HBase – Real Time data integration or Reporting
34
Role of Apache
Each of these is a separate project incubated under Apache
  – HDFS and MapReduce/YARN
  – Hive
  – Pig
  – Sqoop
  – HBase
  – Etc
35
Installation (plain vanilla)
In plain vanilla mode, depending upon the architecture, each tool/technology needs to be manually downloaded, installed and configured.
Typically people use Puppet or Chef to set up clusters using plain vanilla tools
Advantages
  – You can set up your cluster with the latest versions from Apache directly
Disadvantages
  – Installation is tedious and error prone
  – Need to integrate with monitoring tools
36
Hadoop Distributions
Different vendors pre-package the Apache suite of Big Data tools into their distributions to facilitate
  – Easier installation/upgrade using wizards
  – Better monitoring
  – Easier maintenance
  – and many more
Leading distributions include, but are not limited to
  – Cloudera
  – Hortonworks
  – MapR
  – AWS EMR
  – IBM Big Insights
  – and many more
37
Hadoop Distributions
Tools: HDFS/YARN/MR, Hive, Pig, Sqoop, Impala, Tez, Flume, Spark, Ganglia, HBase, Zookeeper
Providers: Apache Foundation, Cloudera, Hortonworks, MapR, AWS
38
Certifications
Why certify?
  – To promote skills
  – Demonstrate industry-recognized validation of your expertise
  – Meet global standards required to ensure compatibility between Spark and Hadoop
  – Stay up to date with the latest advances in Big Data technologies such as Spark and Hadoop
Take certifications only from vendors like
  – Cloudera
  – Hortonworks
  – MapR
  – Databricks (oreilly)
  – http://www.itversity.com/2016/07/05/hadoop-and-spark-developer-certifications-faqs/
  – http://www.itversity.com/2016/07/02/hadoop-certifications/
39
Resources
Resources to learn Big Data with hands-on practice
YouTube Channel: www.YouTube.com/itversityin (please subscribe)
  – 900+ videos
  – 100+ playlists
  – 6 certification courses
www.itversity.com - launched recently
  – Few courses added
  – Other courses will be added over time
  – Courses will be either role based or certification based
  – Will be working on a blogging platform for IT content
40
Job Roles
Go through this blog - http://www.itversity.com/2016/07/02/hadoop-certifications/
Job Role | Experience required | Desired Skills
Hadoop Developer | 0-7 Years | Hadoop, programming using Java, Spark, Hive, Pig, Sqoop etc
Hadoop Administrator | 0-10 Years | Linux, Hadoop administration using distributions
Big Data Engineer | 3-15 Years | Data Warehousing, ETL, Hadoop, Hive, Pig, Sqoop, Spark etc
Big Data Solutions Architect | 12-18 Years | Deep understanding of Big Data eco system such as Hadoop, NoSQL etc
Infrastructure Architect | 12-18 Years | Deep understanding of infrastructure as well as Big Data eco system