What we know or see What’s actually there Wikipedia : In information technology, big data is a collection of data sets so large and complex that it.

Slides:



Advertisements
Similar presentations
R and HDInsight in Microsoft Azure
Advertisements

Big Data and Predictive Analytics in Health Care Presented by: Mehadi Sayed President and CEO, Clinisys EMR Inc.
UNCLASSIFIED ENTERPRISE MANAGEMENT DECISION SUPPORT (EMDS) UNCLASSIFIED T HE 2013 C OMPUTERWORLD H ONORS L AUREATE I NNOVATION A WARD W INNER 2013 Federal.
© 2009 VMware Inc. All rights reserved Big Data’s Virtualization Journey Andrew Yu Sr. Director, Big Data R&D VMware.
Observation Pattern Theory Hypothesis What will happen? How can we make it happen? Predictive Analytics Prescriptive Analytics What happened? Why.
A NEW PLATFORM FOR A NEW ERA. 2 Pivotal Confidential–Internal Use Only 2 The Pivotal Big Data Suite.
Big Data. What is Big Data? Analog starage vs digital. The FOUR V’s of Big Data. Who’s Generating Big Data The importance of Big Data. Optimalization.
CLOUD COMPUTING. A general term for anything that involves delivering hosted services over the Internet. And Cloud is referred to the hardware and software.
This presentation was scheduled to be delivered by Brian Mitchell, Lead Architect, Microsoft Big Data COE Follow him Contact him.
U.S. Department of the Interior U.S. Geological Survey David V. Hill, Information Dynamics, Contractor to USGS/EROS 12/08/2011 Satellite Image Processing.
Data Mining on the Web via Cloud Computing COMS E6125 Web Enhanced Information Management Presented By Hemanth Murthy.
A Brief Overview by Aditya Dutt March 18 th ’ Aditya Inc.
© 2011 IBM Corporation Smarter Software for a Smarter Planet The Capabilities of IBM Software Borislav Borissov SWG Manager, IBM.
© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Apache and Hadoop are.
Big Data. What is Big Data? Big Data Analytics: 11 Case Histories and Success Stories
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
© Hortonworks Inc Hortonworks Page 1. © Hortonworks Inc Big Data Changes the Game Megabytes Gigabytes Terabytes Petabytes Purchase detail.
© 2012 IBM Corporation IBM Security Systems 1 © 2013 IBM Corporation 1 Ecommerce Antoine Harfouche.
W HAT IS H ADOOP ? Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
f ACT s  Data intensive applications with Petabytes of data  Web pages billion web pages x 20KB = 400+ terabytes  One computer can read
Contents HADOOP INTRODUCTION AND CONCEPTUAL OVERVIEW TERMINOLOGY QUICK TOUR OF CLOUDERA MANAGER.
Bleeding edge technology to transform Data into Knowledge HADOOP In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log,
The exponential growth of data –Challenges for Google,Yahoo,Amazon & Microsoft in web search and indexing The volume of data being made publicly available.
+ Big Data IST210 Class Lecture. + Big Data Summary by EMC Corporation ( More videos that.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Big Data Directions Greg.
Big Data – Big Opportunity Mohammad Khansari ITRC President Jan 2015 ITRC, Tehran, Iran.
© 2012 IBM Corporation Converting Big Data into Big Knowledge.
MidVision Enables Clients to Rent IBM WebSphere for Development, Test, and Peak Production Workloads in the Cloud on Microsoft Azure MICROSOFT AZURE ISV.
IoT Meets Big Data Standardization Considerations
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
Microsoft Azure and DataStax: Start Anywhere and Scale to Any Size in the Cloud, On- Premises, or Both with a Leading Distributed Database MICROSOFT AZURE.
{ Tanya Chaturvedi MBA(ISM) Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
Axis AI Solves Challenges of Complex Data Extraction and Document Classification through Advanced Natural Language Processing and Machine Learning MICROSOFT.
BUSINESS INTELLIGENCE & ADVANCED ANALYTICS DISCOVER | PLAN | EXECUTE JANUARY 14, 2016.
Next Generation of Apache Hadoop MapReduce Owen
Big Data Analytics with Excel Peter Myers Bitwise Solutions.
Smart Grid Big Data: Automating Analysis of Distribution Systems Steve Pascoe Manager Business Development E&O - NISC.
Beyond Hadoop The leading open source system for processing big data continues to evolve, but new approaches with added features are on the rise. Ibrahim.
INTRODUCTION TO GRID & CLOUD COMPUTING U. Jhashuva 1 Asst. Professor Dept. of CSE.
MarkLogic The Only Enterprise NoSQL Database Presented by: Aashi Rastogi ( ) Sanket Patel ( )
© 2016 Catalyze, Inc. Go-To-Market Services HIPAA Compliance in the Cloud: Catalyze Provides Microsoft Azure Customers with a HITRUST Certified Platform-as-a-Service.
An Introduction To Big Data For The SQL Server DBA.
BIG DATA. Big Data: A definition Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database.
BIG DATA/ Hadoop Interview Questions.
© 2007 IBM Corporation IBM Software Strategy Group IBM Google Announcement on Internet-Scale Computing (“Cloud Computing Model”) Oct 8, 2007 IBM Confidential.
Abstract MarkLogic Database – Only Enterprise NoSQL DB Aashi Rastogi, Sanket V. Patel Department of Computer Science University of Bridgeport, Bridgeport,
BIG DATA BIGDATA, collection of large and complex data sets difficult to process using on-hand database tools.
Microsoft Ignite /28/2017 6:07 PM
Big Data-An Analysis. Big Data: A definition Big data is a collection of data sets so large and complex that it becomes difficult.
Data Analytics (CS40003) Introduction to Data Lecture #1
CNIT131 Internet Basics & Beginning HTML
Big Data Analytics on Large Scale Shared Storage System
Organizations Are Embracing New Opportunities
SAS users meeting in Halifax
Big Data Enterprise Patterns
Meemim's Microsoft Azure-Hosted Knowledge Management Platform Simplifies the Sharing of Information with Colleagues, Clients or the Public MICROSOFT AZURE.
Hadoop Aakash Kag What Why How 1.
Introduction to Distributed Platforms
April 25, 2012 The Three R’s Are Old School – Now It Is All About Volume, Velocity & Variety Peter Guest Alberta Public Sector Client Technical Advisor.
April 25, 2012 The Three R’s Are Old School – Now It Is All About Volume, Velocity & Variety Peter Guest Alberta Public Sector Client Technical Advisor.
DeFacto Planning on the Powerful Microsoft Azure Platform Puts the Power of Intelligent and Timely Planning at Any Business Manager’s Fingertips Partner.
Accelerate Your Self-Service Data Analytics
Carl Data Solutions Collects Utility Sensor and Meter Data to Provide Advanced Reporting, Alarming, and Analytics with Microsoft Azure MICROSOFT AZURE.
Overview of big data tools
Zoie Barrett and Brian Lam
Big DATA.
Big Data.
Presentation transcript:

What we know or see What’s actually there

Wikipedia : In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools. The challenges include capture, curation, storage, search, sharing, analysis, and visualization. IDC : Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high velocity capture, discovery, and/or analysis. IBM says that ”three characteristics define big data:” –Volume (Terabytes -> Zettabytes) –Variety (Structured -> Semi-structured -> Unstructured) –Velocity (Batch -> Streaming Data)

What Big data should achieve: Volume: Turn 12 terabytes of Tweets created each day into improved product sentiment analysis Convert 350 billion annual meter readings to better predict power consumption Velocity: Sometimes 2 minutes is too late. Scrutinize 5 million trade events created each day to identify potential fraud Analyze 500 million daily call detail records in real-time to predict customer churn faster Variety: Big data is any type of data - structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. Monitor 100’s of live video feeds from surveillance cameras to target points of interest Exploit the 80% data growth in images, video and documents to improve customer satisfaction Veracity: Establishing trust in big data presents a huge challenge as the variety and number of sources grows.

Big data is more than simply a matter of size; it is an opportunity to find insights in new and emerging types of data and content, to make your business more agile, and to answer questions that were previously considered beyond your reach. Here are three key technologies that can help you get a handle on big data – and even more importantly, extract meaningful business value from it. Information management for big data. Manage data as a strategic, core asset, with ongoing process control for big data analytics. High-performance analytics for big data. Gain rapid insights from big data and the ability to solve increasingly complex problems using more data. Flexible deployment options for big data. Choose between options for on premises or hosted, software-as-a-service (SaaS) approaches for big data and big data analytics

Massively Parallel Processing (MPP) – This involves a coordinated processing of a program by multiple processors (200 or more in number). Each of the processors makes use of its own operating system and memory and works on different parts of the program. Each part communicates via messaging interface. An MPP system is also known as “loosely coupled” or “shared nothing” system. Distributed file system or network file system allows client nodes to access files through a computer network. This way a number of users working on multiple machines will be able to share files and storage resources. The client nodes will not be able to access the block storage but can interact through a network protocol. This enables a restricted access to the file system depending on the access lists or capabilities on both servers and clients which is again dependent on the protocol.

Apache Hadoop is key technology used to handle big data, its analytics and stream computing. Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It can be scaled up from a single server to thousands of machines and with a very high degree of fault tolerance. Instead of relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer. Data Intensive Computing is a class of parallel computing application which uses a data parallel approach to process big data. This works based on the principle of collocation of data and programs or algorithms used to perform computation. Parallel and distributed system of inter-connected stand alone computers that work together as a single integrated computing resource is used to process / analyze big data.

HDFS: - Large files are split into parts -Move file parts into a cluster -Fault-tolerant through replication across nodes while being rack-aware -Bookkeeping via NameNode MapReduce -Move algorithms close to the data by structuring them for parallel execution so that each task works on a part of the data. The power of Simplicity! noSQL: -Old technology – before RDBMS -Sequential files -Hierarchical DB -Network DB -Graph DB (Neo4j, AllegroGraph) -Document Stores (Lotus Domino, MongoDB, JackRabbit) -Memory Caches

Common Tools: -Infrastructure Management: -Chef & Puppet from DevOps movement -Grid Monitoring tools: -Operational insight with Nagios, Ganglia, Hyperic and ZenOSS -Development support is a need of the hour Operational Tools: –Chef & Puppet from DevOps movement + Ganglia Analytics Platforms (DW Solutions, BI solutions and analytics) –Netezza, GreenPlum, Vertica Universal Data (All data storage needs at enterprise level) –Lily & Spire

Source links and some more useful information… _cooper_mell.pdfhttp://csrc.nist.gov/groups/SMA/forum/documents/june2012presentations/fcsm_june20 12_cooper_mell.pdf