Defining Data-intensive computing


Defining Data-intensive computing B. Ramamurthy 11/30/2018

Data-intensive computing
- What is it? Volume, velocity, variety, veracity (uncertainty) (Gartner, IBM)
- How is it addressed? Why now?
- What do you expect to extract by processing this large data? Intelligence for decision making
- What is different now? Storage models, processing models
- Big Data, analytics, and cloud infrastructures
- Summary

Top Ten Largest Databases
Ref: http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world/

Top Ten Largest Databases in 2007 vs. Facebook's cluster in 2010
Facebook's cluster held 21 petabytes in 2010.
Ref: http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world

Big-data Problem Solving Approaches
- Algorithmic: after all, we have been working toward this forever: scalable/tractable algorithms
- High-performance computing (HPC: multi-core); CCR has 16-CPU, 32-core machines with 128 GB RAM: OpenMP, MPI, etc.
- GPGPU programming: general-purpose graphics processors (NVIDIA)
- Statistical packages like R running on parallel threads on powerful machines
- Machine learning algorithms on supercomputers
- Hadoop MapReduce-style parallel processing
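The approaches above share one core idea: partition the data and process the pieces in parallel. A minimal sketch of that pattern, using Python's standard multiprocessing module (the function and chunking scheme are invented for illustration, not taken from the slides):

```python
from multiprocessing import Pool

def partial_sum(chunk):
    """Process one chunk independently (embarrassingly parallel work)."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    # Split the data into roughly equal chunks, one per worker.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(workers) as pool:
        # Each worker processes its chunk; partial results are combined.
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1000))))
```

The same split-process-combine shape underlies OpenMP loops, MPI programs, and MapReduce jobs; only the scale and the machinery differ.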

Processing Granularity (data size: small to large)
- Single-core, single processor
- Single-core, multi-processor
- Multi-core, single processor
- Multi-core, multi-processor
- Cluster of processors (single- or multi-core) with shared memory
- Cluster of processors with distributed memory
- Grid of clusters
- Embarrassingly parallel processing; MapReduce with a distributed file system; cloud computing
Levels of parallelism: pipelined (instruction level), concurrent (thread level), service (object level), indexed (file level), mega (block level), virtual (system level). (Bina Ramamurthy, 2011)

A Golden Era in Computing
- Heavy societal involvement
- Powerful multi-core processors
- Superior software methodologies
- Virtualization leveraging the powerful hardware
- Wider bandwidth for communication
- Proliferation of devices
- Explosion of domain applications

Data Deluge: smallest to largest
- Bioinformatics data: from about 3.3 billion base pairs in a human genome to a huge number of protein sequences and the analysis of their behaviors
- The internet: web logs, Facebook, Twitter, maps, blogs, etc.: analytics
- Financial applications that analyze volumes of data for trends and other deeper knowledge
- Health care: huge amounts of patient data, drug and treatment data
- The universe: the Hubble Ultra Deep Field shows hundreds of galaxies, each with billions of stars; Sloan Digital Sky Survey: http://www.sdss.org/

Intelligence and Scale of Data
Intelligence is a set of discoveries made by federating and processing information collected from diverse sources. Information is a cleansed form of raw data. For statistically significant information we need a reasonable amount of data; for gathering good intelligence we need a large amount of information. As Jim Gray pointed out in the Fourth Paradigm book, an enormous amount of data is generated by millions of experiments and applications. Thus intelligence applications are invariably data-heavy, data-driven, and data-intensive. Data is gathered from the web (public or private, covert or overt) and generated by a large number of domain applications.

Intelligence (or origins of big-data computing?)
Search for Extra-Terrestrial Intelligence (the SETI@home project); the "Wow!" signal: http://www.bigear.org/wow.htm

Characteristics of intelligent applications
- Google search: how is it different from the regular search that existed before it? It took advantage of the fact that hyperlinks within web pages form an underlying structure that can be mined to determine the importance of various pages.
- Restaurant and menu suggestions: instead of "Where would you like to go?", "Would you like to go to CityGrille?" Learning capacity from previous data: habits, profiles, and other information gathered over time.
- Collaborative, interconnected, inference-capable systems: Facebook friend suggestions.
- Large-scale data requiring indexing. Did you know Amazon plans to ship things before you order them?
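The hyperlink-mining idea behind Google search can be illustrated with a toy PageRank power iteration. This is a simplified sketch of the underlying principle, not Google's production algorithm, and the three-page graph is made up:

```python
def pagerank(links, damping=0.85, iters=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with uniform rank
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for page, outs in links.items():
            for target in outs:
                # Each page passes its rank along its outgoing links.
                new[target] += damping * rank[page] / len(outs)
        rank = new
    return rank

# C is linked to by both A and B, so it comes out most "important".
graph = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

The key point from the slide: the importance of a page is computed purely from the link structure, with no human editor in the loop.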

Data-intensive application characteristics
- Models
- Algorithms (thinking)
- Data structures (infrastructure)
- Aggregated content (raw data)
- Reference structures (knowledge)

Basic Elements
- Aggregated content: a large amount of data pertinent to the specific application; each piece of information is typically connected to many other pieces. Ex: databases
- Reference structures: structures that provide one or more structural and semantic interpretations of the content. Reference structures about a specific domain of knowledge come in three flavors: dictionaries, knowledge bases, and ontologies.
- Algorithms: modules that allow the application to harness the information hidden in the data; applied to aggregated content, and sometimes requiring reference structures. Ex: MapReduce
- Data structures: newer data structures to leverage the scale and the WORM characteristics; ex: MS Azure, Apache Hadoop, Google BigTable
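To make the MapReduce example concrete, here is a toy single-machine simulation of the map, shuffle, and reduce phases counting words. The real framework runs these phases distributed across a cluster; the documents here are invented:

```python
from collections import defaultdict

def map_phase(document):
    # Emit a <word, 1> pair for every word (the "map" step).
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Group values by key (the framework's "shuffle and sort" step).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word (the "reduce" step).
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data drives intelligence"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
```

Because each map call and each reduce call is independent, the framework can scatter them across thousands of nodes without changing the program's logic.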

Examples of data-intensive applications
- Search engines
- Recommendation systems: CineMatch of Netflix Inc. for movie recommendations; Amazon.com book/product recommendations
- Biological systems: high-throughput sequencing (HTS) analysis: disease-gene matching; query/search for gene sequences
- Space exploration
- Financial analysis

More intelligent data-intensive applications
- Social networking sites
- Mashups: applications that draw upon content retrieved from external sources to create entirely new, innovative services
- Portals
- Wikis: content aggregators; linked data; excellent data and fertile ground for applying the concepts discussed in the text
- Media-sharing sites
- Online gaming
- Biological analysis
- Space exploration

Algorithms
- Statistical inference
- Machine learning: the capability of a software system to generalize based on past experience and to use these generalizations to provide answers to questions about old, new, and future data
- Data mining
- Soft computing
We also need algorithms that are specially designed for the emerging storage models and data characteristics.
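As a minimal illustration of generalizing from past experience, a one-nearest-neighbor rule answers a question about unseen data with the label of the closest past observation. The data and labels below are invented for the sketch:

```python
def nearest_neighbor_predict(history, query):
    """history: list of (feature, label) pairs observed in the past.
    Generalize to unseen data by returning the label of the past
    observation whose feature is closest to the query."""
    feature, label = min(history, key=lambda pair: abs(pair[0] - query))
    return label

# Past experience: small feature values were "small", large were "large".
past = [(1.0, "small"), (2.0, "small"), (9.0, "large"), (10.0, "large")]
```

Even this crude rule exhibits the defining trait from the slide: its answers are driven by data it has seen, not by rules coded by hand.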

Different Types of Storage
The internet introduced a new challenge in the form of web logs and web crawler data: large-scale ("petascale") data. But observe that this type of data has a uniquely different characteristic from your transactional "customer order" or "bank account" data: it is write once, read many (WORM). Examples: privacy-protected healthcare and patient information; historical financial data; other historical data. Relational file systems and tables are insufficient. Instead:
- Large <key, value> stores (files) and a storage management system
- Built-in features for fault tolerance, load balancing, data transfer, aggregation, ...
- Clusters of distributed nodes for storage and computing
- Computing that is inherently parallel
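The write-once-read-many character can be sketched with a toy in-memory <key, value> store that refuses updates. This is illustrative only; real systems such as HDFS enforce the property at the file level rather than per key:

```python
class WormStore:
    """A toy write-once-read-many <key, value> store: a key can be
    written exactly once and never updated or deleted, mirroring the
    append-only nature of logs and historical records."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        if key in self._data:
            # WORM: rewriting an existing key is forbidden.
            raise ValueError(f"key {key!r} is write-once; already set")
        self._data[key] = value

    def get(self, key):
        return self._data[key]

log = WormStore()
log.put("2018-11-30T10:00", "user login")
```

Giving up in-place updates is what lets such stores replicate and cache data aggressively: a value read from any replica can never be stale.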

Big-data Concepts
- The special <key, value> store originated with the Google File System (GFS); the Hadoop Distributed File System (HDFS) is its open-source version (currently an Apache project).
- Parallel processing of the data using the MapReduce (MR) programming model
- The Apache Spark ecosystem
- Challenges: formulation of algorithms; proper use of the features of the infrastructure (ex: sort); best practices in parallel processing
- An extensive ecosystem of other components such as column-based stores (HBase, BigTable), big-data warehousing (Hive), workflow languages, etc.

Data & Analytics
We have witnessed an explosion in algorithmic solutions. "In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers." (Grace Hopper) What you cannot achieve with an algorithm can be achieved with more data. Big data, analyzed right, gives you better answers: compare the Centers for Disease Control's flu predictions with predicting flu from "search" data two full weeks before the onset of flu season! http://www.google.org/flutrends/

Cloud Computing
- The cloud is a facilitator for big-data computing and is indispensable in this context.
- The cloud provides processors, software, operating systems, storage, monitoring, load balancing, clusters, and other requirements as a service.
- The cloud offers accessibility to big-data computing.
- Cloud computing models: platform (PaaS): Microsoft Azure; software (SaaS): Google App Engine (GAE); infrastructure (IaaS): Amazon Web Services (AWS)
- Services-based application programming interfaces (APIs)

Data
We are entering a watershed moment in the internet era. At its core and center, this involves big-data analytics and tools that provide intelligence in a timely manner to support decision making. Newer storage models, processing models, and approaches have emerged. UB does have a Data-intensive Computing Certificate Program: https://catalog.buffalo.edu/academicprograms/data-intensive_computing_cert.html

Data-Intensive Computing Certificate
Pre-reqs: CSE 115, CSE 116, CSE 250
- CSE 486: Distributed Systems
- CSE 487: Data-intensive Computing
- XYZ course in any major (including CSE)
- Capstone research: 1-3 credits
  - Example 1: BIO 419, CSE 495 (1 credit), research in genomics
  - Example 2: Math 337, MTH 495, research in math
  - Example 3: CSE 442, CSE 453
If you are interested, please fill out this short survey so that I can guide you: https://goo.gl/forms/0Kkx6c2wT43IF4g32