The Memory B. Ramamurthy C B. Ramamurthy.



Topics for discussion
- On-chip memory
- On-board memory
- System memory
- Off-system / online storage / secondary memory
- File system abstraction
- Offline / tertiary memory
- RAID: Redundant Array of Inexpensive Disks
- NAS: Network-Attached Storage
- SAN: Storage Area Network
- DB and DBMS: database and database management systems
- Distributed file system
- Google File System
- Hadoop file system

Data and Computation Continuum
- Compute intensive. Ex: computation of the digits of pi
- Data intensive. Ex: analyzing web logs

On-chip memory
- Registers
- Cache
- Buffers (instruction pipeline)
- Characteristics: volatile

On-board memory
- Cache
  - Instruction cache
  - Data cache
- Translation lookaside buffers (TLB)
- Characteristics: content addressable, set-associative organization
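To make "set-associative organization" concrete, here is a minimal sketch of how an address can be split into a tag, a set index, and a byte offset, and how a lookup then searches only the lines of one set. The geometry (64-byte lines, 128 sets, 4 ways) and the LRU replacement policy are assumptions chosen for illustration; the slides do not specify them.

```python
# Hypothetical cache geometry, chosen only for illustration.
LINE_SIZE = 64   # bytes per cache line
NUM_SETS = 128   # number of sets
WAYS = 4         # lines per set (4-way set associative)


def split_address(addr):
    """Split a byte address into (tag, set_index, offset)."""
    offset = addr % LINE_SIZE
    set_index = (addr // LINE_SIZE) % NUM_SETS
    tag = addr // (LINE_SIZE * NUM_SETS)
    return tag, set_index, offset


class SetAssociativeCache:
    """Tracks only tags, not data; enough to show hits and misses."""

    def __init__(self):
        # Each set holds up to WAYS tags, least recently used first.
        self.sets = [[] for _ in range(NUM_SETS)]

    def access(self, addr):
        """Return True on a hit, False on a miss (assumed LRU replacement)."""
        tag, set_index, _ = split_address(addr)
        lines = self.sets[set_index]
        if tag in lines:
            # Hit: move the tag to the most-recently-used position.
            lines.remove(tag)
            lines.append(tag)
            return True
        if len(lines) == WAYS:
            # Set is full: evict the least recently used tag.
            lines.pop(0)
        lines.append(tag)
        return False
```

The tag comparison within a set is the "content addressable" part: the hardware compares the tag against all ways of the selected set at once, which the `tag in lines` test only imitates sequentially.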

System memory
- RAM (random access memory): main memory; read and write possible; volatile
- ROM (read-only memory): boot programs for operating systems
- Flash memory: erasable/writable non-volatile memory
- SDRAM: synchronous dynamic RAM
- Others: EAROM

Off-system storage (earlier lectures covered these)
- Off-system / online storage / secondary memory
- File system abstraction
- Offline / tertiary memory
- RAID: Redundant Array of Inexpensive Disks
- NAS: Network-Attached Storage
- SAN: Storage Area Network

Database and Database Management System
- Data source
- Transactional
- Database server
- Relational DB or similar foundation
- Tables, rows, result set, SQL
- ODBC: Open Database Connectivity
- Very successful business model: Oracle, DB2, MySQL, and others
- Persistence models: EJB, DAO, ADO (I am not going to expand the abbreviations)

Distributed file system (DFS)
- A dedicated server manages the files for a compute environment.
- For example, nickelback.cse.buffalo.edu is your file server, and that is why we did not want you to run your user applications on this machine.
- DFS addresses various transparencies: location transparency, sharing, performance, etc.
- Examples: NFS, NFS+, AFS (Andrew FS)… (you will study these in the Distributed Systems course)

Issues with ultra-scale data
- How to store the large amount of data? On commodity hardware or on special hardware?
- Large storage implies a large number of devices to store the data.
- How to address the shortening MTTF (mean time to failure)?
- How to realize “fault tolerance”? Redundancy/replication is a solution.
- How to manage the replication and the health of the large number of devices?
- More importantly, how to partition the large-scale data to store it on these storage devices (nodes)?
- How to parallelize processing of the data stored at multiple “nodes”?
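The partitioning and replication questions above can be illustrated with a small sketch: cut the data into fixed-size blocks, then assign each block to several distinct nodes so that losing one node does not lose the block. The function names, the hash-based placement rule, and the replication factor of 3 are assumptions made for this example; they are not how GFS or Hadoop actually choose replica locations.

```python
import hashlib


def split_into_blocks(data, block_size):
    """Partition a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(block_id, nodes, replication=3):
    """Pick `replication` distinct nodes for a block using a stable hash.

    Illustrative placement rule only: hash the block id to a starting node,
    then take the next nodes in ring order.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    digest = hashlib.md5(block_id.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]


def surviving_replicas(replicas, failed_nodes):
    """Return the replicas that are still reachable after some nodes fail."""
    return [node for node in replicas if node not in failed_nodes]


if __name__ == "__main__":
    nodes = ["node%d" % i for i in range(5)]  # hypothetical node names
    blocks = split_into_blocks(b"x" * 1000, 256)
    for number, block in enumerate(blocks):
        block_id = "weblog.dat#%d" % number   # hypothetical block naming
        print(block_id, len(block), place_replicas(block_id, nodes))
```

With three replicas on distinct nodes, any two simultaneous node failures still leave at least one copy of every block, which is the sense in which replication answers the shortening-MTTF problem.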

On to Google File System
- The Internet introduced a new challenge in the form of web logs and web crawler data: large scale, “peta scale”.
- But observe that this type of data has a uniquely different characteristic from your transactional or “order” data on amazon.com: it is “write once”. So is HIPAA-protected healthcare and patient information.
- Google exploited this characteristic in its Google File System (S. Ghemawat).

Hadoop File System (HFS)
- The Hadoop file system is a reverse-engineered version of GFS (this is my first opinion on HFS).
- HFS is a distributed file system for large-scale data.
- Data throughput is more important than latency.
- Batch computing rather than interactive, time-shared computing.

MapReduce
[Figure: an input of words (Cat, Bat, Dog, other words; size: TByte) is divided into splits; each split passes through map and combine, and the reduce tasks write the output partitions part0, part1, part2.]

Exercise: Count the number of occurrences of each word in the text

This is a cat. Cat sits on a roof. The roof is a tin roof. There is a tin can on the roof. Cat kicks the can. It rolls on the roof and falls on the next roof. The cat rolls too. It sits on the can.
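
One way to approach the exercise is to run the split, map, combine, and reduce stages from the MapReduce slide in a single process. The sketch below is only an illustration under stated assumptions: one sentence per split, case-insensitive words, three output partitions, and a byte-sum partition function are all choices made here, and none of the function names come from Hadoop or any real MapReduce API.

```python
import re
from collections import Counter, defaultdict

TEXT = ("This is a cat. Cat sits on a roof. The roof is a tin roof. "
        "There is a tin can on the roof. Cat kicks the can. "
        "It rolls on the roof and falls on the next roof. "
        "The cat rolls too. It sits on the can.")


def split_input(text):
    """Split: one sentence per chunk (an assumption; real splits are by size)."""
    return [s for s in re.split(r"\.\s*", text) if s]


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word, lower-cased."""
    return [(word.lower(), 1) for word in re.findall(r"[A-Za-z]+", chunk)]


def combine(pairs):
    """Combine: pre-aggregate counts locally, per split, before the reduce."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def partition(word, num_parts):
    """Choose the output partition for a word (illustrative, stable rule)."""
    return sum(word.encode("utf-8")) % num_parts


def reduce_phase(combined_outputs, num_parts=3):
    """Reduce: sum the counts per word inside each partition (part0, part1, ...)."""
    parts = [defaultdict(int) for _ in range(num_parts)]
    for pairs in combined_outputs:
        for word, count in pairs:
            parts[partition(word, num_parts)][word] += count
    return [dict(part) for part in parts]


def word_count(text, num_parts=3):
    combined = [combine(map_phase(chunk)) for chunk in split_input(text)]
    return reduce_phase(combined, num_parts)


if __name__ == "__main__":
    for number, part in enumerate(word_count(TEXT)):
        print("part%d" % number, sorted(part.items()))
```

Each word lands in exactly one partition, so the union of the partitions is the complete word count; in a real cluster the map and combine calls would run in parallel on the nodes holding each split.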