Lecture 1: Big Data + Cloud Computing. Qi Zhang (张奇), Fudan University (复旦大学). COMP630030 Data Intensive Computing.

Presentation transcript:

Lecture 1: Big Data + Cloud Computing
Qi Zhang (张奇), Fudan University (复旦大学)
COMP630030 Data Intensive Computing

How much data?
  – Wayback Machine (Internet Archive) has 2 PB + 20 TB/month (2006)
  – Google processes 20 PB a day (2008)
  – “All words ever spoken by human beings” ~ 5 EB
  – NOAA has ~1 PB climate data (2007)
  – CERN’s LHC will generate 15 PB a year (2008)
“640K ought to be enough for anybody.”

Happening everywhere!
  – Molecular biology (cancer): microarray chips
  – Particle events (LHC): particle colliders
  – Simulations (Millennium): microprocessors
  – Network traffic (spam): fiber optics
[Figure labels: 300M/day, 1B]

[Image: Maximilien Brice, © CERN]

On May 17 [year not given], Alibaba Group’s last IBM minicomputer went offline at Alipay.

Example: Wikipedia Anthropology
Experiment
  – Download entire revision history of Wikipedia
  – 4.7 M pages, 58 M revisions, 800 GB
  – Analyze editing patterns & trends
Computation
  – Hadoop on 20-machine cluster
Increasing fraction of edits are for work indirectly related to articles
Kittur, Suh, Pendleton (UCLA, PARC), “He Says, She Says: Conflict and Coordination in Wikipedia,” CHI, 2007

Example: Scene Completion
Image Database Grouped by Semantic Content
  – 30 different Flickr.com groups
  – 2.3 M images total (396 GB)
Select Candidate Images Most Suitable for Filling Hole
  – Classify images with gist scene detector [Torralba]
  – Color similarity
  – Local context matching
Computation
  – Index images offline
  – 50 min. scene matching, 20 min. local matching, 4 min. compositing
  – Reduces to 5 minutes total by using 5 machines
Extension
  – Flickr.com has over 500 million images …
Hays, Efros (CMU), “Scene Completion Using Millions of Photographs,” SIGGRAPH, 2007

Example: Web Page Analysis
Experiment
  – Use web crawler to gather 151M HTML pages weekly, 11 times (generated 1.2 TB of log information)
  – Analyze page statistics and change frequencies
Systems Challenge
  “Moreover, we experienced a catastrophic disk failure during the third crawl, causing us to lose a quarter of the logs of that crawl.”
Fetterly, Manasse, Najork, Wiener (Microsoft, HP), “A Large-Scale Study of the Evolution of Web Pages,” Software-Practice & Experience,

[Figure: Subject genome (unknown, shown as “? ?”) → Sequencer → Reads]
Reads:
GATGCTTACTATGCGGGCCCC
CGGTCTAATGCTTACTATGC
GCTTACTATGCGGGCCCCTT
AATGCTTACTATGCGGGCCCCTT
TAATGCTTACTATGC
AATGCTTAGCTATGCGGGC
AATGCTTACTATGCGGGCCCCTT
CGGTCTAGATGCTTACTATGC
AATGCTTACTATGCGGGCCCCTT
CGGTCTAATGCTTAGCTATGC
ATGCTTACTATGCGGGCCCCTT

DNA Sequencing
Genome of an organism encodes genetic information in a long sequence of 4 DNA nucleotides: ATCG
  – Bacteria: ~5 million bp (base pairs)
  – Humans: ~3 billion bp
Current DNA sequencing machines can generate 1-2 Gbp of sequence per day, in millions of short reads (25-300 bp)
  – Shorter reads, but much higher throughput
  – Per-base error rate estimated at 1-2% (Simpson, et al., 2009)
Recent studies of entire human genomes have used 3.3 (Wang, et al., 2008) & 4.0 (Bentley, et al., 2008) billion 36 bp reads
  – ~144 GB of compressed sequence data
Sample sequence:
ATCTGATAAGTCCCAGGACTTCAGT
GCAAGGCAAACCCGAGCCCAGTTT
TCCAGTTCTAGAGTTTCACATGATC
GGAGTTAGTAAAAGTCCACATTGAG
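The read volumes quoted on the DNA sequencing slide can be checked with a little arithmetic. The sketch below only multiplies read count by read length; the function names are illustrative, and the coverage figure assumes the ~3 billion bp human genome size given on the slide. It shows that 4.0 billion 36 bp reads amount to 144 billion bases, which lines up with the ~144 GB figure if each base occupies roughly one byte (an assumption, since the slide does not state the encoding).

```python
# Back-of-the-envelope check of the read volumes quoted on the slide.
READ_LENGTH_BP = 36              # read length used in both studies
HUMAN_GENOME_BP = 3_000_000_000  # ~3 billion bp, as stated on the slide

studies = {
    "Wang et al., 2008": 3_300_000_000,     # 3.3 billion reads
    "Bentley et al., 2008": 4_000_000_000,  # 4.0 billion reads
}


def total_bases(num_reads, read_length=READ_LENGTH_BP):
    """Total number of sequenced bases across all reads."""
    return num_reads * read_length


def coverage(num_reads, read_length=READ_LENGTH_BP, genome=HUMAN_GENOME_BP):
    """Average number of times each genome position is sequenced."""
    return total_bases(num_reads, read_length) / genome


for name, reads in studies.items():
    gbp = total_bases(reads) / 1e9
    print(f"{name}: {gbp:.1f} Gbp sequenced, about {coverage(reads):.0f}x coverage")
```

At 1-2 Gbp per machine per day, data sets of this size take a single sequencer months to produce, which is part of why the analysis side is treated as a data-intensive problem.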

[Figure: Alignment of subject reads against a reference sequence]
Reference sequence: CGGTCTAGATGCTTAGCTATGCGGGCCCCTT
Subject reads: GCTTATCTAT, TTATCTATGC, ATCTATGCGG, GCTTATCTAT, TCTAGATGCT, TCTAGATGCT, CTATGCGGGC, CTAGATGCTT, ATCTATGCGG, CTATGCGGGC, ATCTATGCGG

[Figure: Subject reads laid out against a reference sequence]
Reference sequence: CGGTCTAGATGCTTATCTATGCGGGCCCCTT
Subject reads: GCTTATCTAT, TTATCTATGC, ATCTATGCGG, GCTTATCTAT, GGCCCCTT, GCCCCTT, CCTT, CGG, CGGTC, CGGTCT, CGGTCTAG, TCTAGATGCT, TCTAGATGCT, CTATGCGGGC, CTAGATGCTT, CTT, ATGCGGGCCC
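Read alignment, as pictured in the two figures above, means finding where each short read fits on the reference while tolerating a few mismatches. A minimal brute-force sketch follows, using the 31-base reference from the first alignment figure and two of its subject reads. The function names are hypothetical, and this is not the seed-and-extend algorithm CloudBurst uses; it simply tries every offset.

```python
# Reference sequence from the alignment figure (note the G at offset 15).
REFERENCE = "CGGTCTAGATGCTTAGCTATGCGGGCCCCTT"


def count_mismatches(read, window):
    """Number of positions where the read and the reference window differ."""
    return sum(1 for a, b in zip(read, window) if a != b)


def align(read, reference, max_mismatches):
    """Return (offset, mismatches) for every placement within the budget.

    Brute force: slide the read across the reference one base at a time.
    Cost is O(len(reference) * len(read)) per read.
    """
    hits = []
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = count_mismatches(read, window)
        if mismatches <= max_mismatches:
            hits.append((offset, mismatches))
    return hits


if __name__ == "__main__":
    for read in ["CTAGATGCTT", "GCTTATCTAT"]:
        print(read, align(read, REFERENCE, max_mismatches=1))
```

The second read only aligns when one mismatch is allowed: its T sits where the reference has a G. With billions of reads and a 3 billion bp reference, this per-read scan is far too slow on one machine, which motivates partitioning the reads across many workers.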

Example: Bioinformatics
Evaluate running time on local 24-core cluster
  – Running time increases linearly with the number of reads
Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009

Example: Data Mining
del.icio.us crawl -> a bipartite graph covering Web pages and tags
Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, Edward Y. Chang: PFP: Parallel FP-Growth for Query Recommendation. RecSys 2008

Example: Information Flow

Example: Opinion Leader

There’s nothing like more data!
s/inspiration/data/g;
(Banko and Brill, ACL 2001) (Brants et al., EMNLP 2007)

What is Cloud Computing?
1. First, write down your own opinion about “cloud computing”, whatever comes to mind.
2. Questions: What? Who? Why? How? Pros and cons?
3. The most important question: how does it relate to us?
From “Introduction to Cloud Computing and Technical Issues,” Eom, Hyeonsang

What Is Cloud Computing?
Internet computing
  – Computation done through the Internet
  – No concern about any maintenance or management of actual resources
Shared computing resources
  – As opposed to local servers and devices
Comparable to Grid Infrastructure
Web applications
Specialized raw computing services

Cloud Computing Resources
Large pool of easily usable and accessible virtualized resources
  – Dynamic reconfiguration for adjusting to different loads (scales)
  – Optimization of resource utilization
Pay-per-use model
  – Guarantees offered by the Infrastructure Provider by means of customized SLAs (Service Level Agreements)
Set of features
  – Scalability, pay-per-use utility model and virtualization

Technical Key Points
  – User interaction interface: how users of the cloud interface with the cloud
  – Services catalog: services a user can request
  – System management: management of available resources
  – Provisioning tool: carving out the systems from the cloud to deliver on the requested service
  – Monitoring and metering: tracking the usage of the cloud
  – Servers: virtual or physical servers managed by system administrators

Key Characteristics (1/2)
Cost savings for resources
  – Cost is greatly reduced, as initial and recurring expenses are much lower than in traditional computing
  – Maintenance cost is reduced, as a third party maintains everything from running the cloud to storing data
Platform, location and device independence
  – Suitable for businesses of all sizes, in particular small and mid-sized ones

Key Characteristics (2/2)
Scalable services and applications
  – Achieved through server virtualization technology
Redundancy and disaster recovery

Timeline

Cloud Computing Architecture
Front End
  – End user, client or any application (e.g., a web browser)
Back End (Cloud services)
  – Network of servers with any computer program and data storage system
It is usually assumed that clouds have infinite storage capacity for any software available on the market

Cloud Service Taxonomy
Layer
  – Software-as-a-Service (SaaS)
  – Platform-as-a-Service (PaaS)
  – Infrastructure-as-a-Service (IaaS)
  – Data Storage-as-a-Service (DaaS)
  – Communication-as-a-Service (CaaS)
  – Hardware-as-a-Service (HaaS)
Type
  – Public cloud
  – Private cloud
  – Inter-cloud

Infrastructure as a Service

What Is Infrastructure-as-a-Service (IaaS)?
Characteristics
  – Utility computing and billing model
  – Automation of administrative tasks
  – Dynamic scaling
  – Desktop virtualization
  – Policy-based services
  – Internet connectivity

Use Scenario for IaaS

Business Example: Amazon EC2

Amazon EC2 CLI (Command Line Interface)

Amazon EC2 Automated Management

Data Storage as a Service (DaaS)
Definition
  – Delivery of data storage as a service, including database-like services, often billed on a utility computing basis
Database (Amazon SimpleDB & Google App Engine's BigTable datastore)
Network attached storage (MobileMe iDisk & Nirvanix CloudNAS)
Synchronization (Live Mesh Live Desktop component & MobileMe push functions)
Web service (Amazon Simple Storage Service & Nirvanix SDN)

Amazon S3

Platform as a Service

PaaS Example: Google App Engine
Service that lets users deploy their Web applications on Google's very scalable architecture
  – Provides a sandbox for the user's Java and Python applications, which can be reached over the Internet
  – Provides Java and Python APIs for persistently storing and managing data (using the Google Query Language, or GQL)

Google App Engine

Google App Engine (Python Example)

Software-as-a-Service (SaaS)
Definition
  – Software deployed as a hosted service and accessed over the Internet
Features
  – Open, Flexible
  – Easy to Use
  – Easy to Upgrade
  – Easy to Deploy

Software as a Service

Human as a Service

Administration/Business Support

Cloud Architecture -> Cloud Players

Players

Players: Providers

Players: Cloud Intermediaries

Players: Application Providers

Web-Scale Problems?
Don’t hold your breath:
  – Biocomputing
  – Nanocomputing
  – Quantum computing
  – …
It all boils down to…
  – Divide-and-conquer
  – Throwing more hardware at the problem
Simple to understand… a lifetime to master…

Divide and Conquer
[Figure: “Work” is partitioned into w1, w2, w3; each piece goes to a “worker”, which produces r1, r2, r3; these are combined into the “Result”]
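The partition / worker / combine flow in the figure can be sketched in a few lines. The example below is illustrative only: the function names and the sum-of-squares task are assumptions, and it uses Python's standard `concurrent.futures` thread pool. In CPython, threads do not speed up CPU-bound work because of the global interpreter lock, so this shows the structure rather than a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(work, num_parts):
    """Split a list into num_parts contiguous chunks of nearly equal size."""
    size, extra = divmod(len(work), num_parts)
    chunks, start = [], 0
    for i in range(num_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(work[start:end])
        start = end
    return chunks


def worker(chunk):
    """Process one work unit w_i and return its partial result r_i."""
    return sum(x * x for x in chunk)


def combine(partials):
    """Merge the partial results r_1..r_n into the final result."""
    return sum(partials)


def divide_and_conquer(work, num_workers=3):
    chunks = partition(work, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker, chunks))
    return combine(partials)


if __name__ == "__main__":
    print(divide_and_conquer(list(range(10))))  # sum of squares of 0..9
```

This only works so cleanly because the work units are independent and the combine step is associative; most of the difficulties discussed later in the lecture appear when either condition fails.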

Different Workers
  – Different threads in the same core
  – Different cores in the same CPU
  – Different CPUs in a multi-processor system
  – Different machines in a distributed system

Choices, Choices, Choices
  – Commodity vs. “exotic” hardware
  – Number of machines vs. processors vs. cores
  – Bandwidth of memory vs. disk vs. network
  – Different programming models

Flynn’s Taxonomy

                      Single instruction (SI)          Multiple instruction (MI)
Single data (SD)      SISD: single-threaded process    MISD: pipeline architecture
Multiple data (MD)    SIMD: vector processing          MIMD: multi-threaded programming

SISD
[Figure: one processor, one instruction stream, one data stream (D D D D D D D)]

SIMD
[Figure: one processor, one instruction stream, multiple data streams, each holding elements D0, D1, D2, D3, D4, …, Dn]

MIMD
[Figure: two processors, each with its own instruction stream and its own data stream (D D D D D D D)]

Memory Typology: Shared
[Figure labels: Memory, Processor]

Memory Typology: Distributed
[Figure: four Memory/Processor pairs connected by a Network]

Memory Typology: Hybrid
[Figure labels: Memory, Processor, Network]

Parallelization Problems
  – How do we assign work units to workers?
  – What if we have more work units than workers?
  – What if workers need to share partial results?
  – How do we aggregate partial results?
  – How do we know all the workers have finished?
  – What if workers die?
What is the common theme of all of these problems?

General Theme?
Parallelization problems arise from:
  – Communication between workers
  – Access to shared resources (e.g., data)
Thus, we need a synchronization system!
This is tricky:
  – Finding bugs is hard
  – Solving bugs is even harder

Managing Multiple Workers
Difficult because
  – (Often) don’t know the order in which workers run
  – (Often) don’t know where the workers are running
  – (Often) don’t know when workers interrupt each other
Thus, we need:
  – Semaphores (lock, unlock)
  – Condition variables (wait, notify, broadcast)
  – Barriers
Still, lots of problems:
  – Deadlock, livelock, race conditions, ...
Moral of the story: be careful!
  – Even trickier if the workers are on different machines
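Two of the primitives listed above, a lock and a barrier, can be shown with Python's standard `threading` module. This is a minimal single-machine sketch under assumed names and sizes: the lock protects a shared counter from a race condition, and the barrier makes every worker wait until all increments are done before reading the total.

```python
import threading

NUM_WORKERS = 4
INCREMENTS = 10_000

counter = 0
counter_lock = threading.Lock()            # guards the shared counter
barrier = threading.Barrier(NUM_WORKERS)   # all workers must arrive to pass
observed = []                              # totals seen after the barrier


def worker():
    global counter
    for _ in range(INCREMENTS):
        # Without the lock, the read-modify-write below can interleave
        # between threads and lose updates (a race condition).
        with counter_lock:
            counter += 1
    # Nobody proceeds until every worker has finished incrementing.
    barrier.wait()
    with counter_lock:
        observed.append(counter)


def run():
    global counter
    counter = 0
    observed.clear()
    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter, list(observed)


if __name__ == "__main__":
    total, seen = run()
    print(total, seen)
```

Because of the barrier, every worker observes the same final total regardless of scheduling order. Across machines there is no shared memory to lock, so the same guarantees have to be built from messages, which is where much of the extra difficulty comes from.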

Patterns for Parallelism
Parallel computing has been around for decades
Here are some “design patterns” …

Master/Slaves
[Figure: one master dispatching to several slaves]

Producer/Consumer Flow
[Figure: producers (P) passing work to consumers (C)]

Work Queues
[Figure: producers (P) place work items (W) on a shared queue; consumers (C) take them off]
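The work-queue pattern in the figure maps directly onto Python's standard `queue.Queue`, which is thread-safe. The sketch below is illustrative: the function names, the squaring task and the use of `None` as a stop signal are all assumptions. Producers put work items on the shared queue, consumers pull them as they become free, and one sentinel per consumer signals that no more work is coming.

```python
import queue
import threading

SENTINEL = None  # assumed stop signal; real work items must not be None


def producer(work_queue, items):
    """Put each work item W on the shared queue."""
    for item in items:
        work_queue.put(item)


def consumer(work_queue, results, results_lock):
    """Take items off the shared queue until a sentinel arrives."""
    while True:
        item = work_queue.get()
        if item is SENTINEL:
            break
        with results_lock:
            results.append(item * item)


def run_work_queue(batches, num_consumers=3):
    work_queue = queue.Queue()
    results, results_lock = [], threading.Lock()
    producers = [threading.Thread(target=producer, args=(work_queue, batch))
                 for batch in batches]
    consumers = [threading.Thread(target=consumer,
                                  args=(work_queue, results, results_lock))
                 for _ in range(num_consumers)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    # All real work is queued; send one stop signal per consumer.
    for _ in consumers:
        work_queue.put(SENTINEL)
    for t in consumers:
        t.join()
    return sorted(results)


if __name__ == "__main__":
    print(run_work_queue([[1, 2], [3, 4], [5]]))
```

The queue handles the "more work units than workers" case naturally: idle consumers simply pull the next item, so load balances itself without an explicit assignment step.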

Rubber Meets Road
From patterns to implementation:
  – pthreads, OpenMP for multi-threaded programming
  – MPI for cluster computing
  – …
The reality:
  – Lots of one-off solutions, custom code
  – Write your own dedicated library, then program with it
  – Burden on the programmer to explicitly manage everything
MapReduce to the rescue!
  – (for next time)

Questions?