1
Lecture 1 Big Data + Cloud Computing
Zhang Qi, Fudan University. COMP: Data Intensive Computing
2
Big Data Cloud Computing Data Intensive Computing Grid Computing Parallel Computing
3
MPI, Hadoop, MapReduce, NoSQL, GFS, EC2, HDFS, S3
5
How much data?
Wayback Machine has 2 PB + 20 TB/month (2006)
Google processes 20 PB a day (2008)
"All words ever spoken by human beings" ~ 5 EB
NOAA has ~1 PB climate data (2007)
CERN's LHC will generate 15 PB a year (2008)
"640K ought to be enough for anybody."
6
Happening everywhere!
Molecular biology (cancer): microarray chips
Network traffic (spam): fiber optics, 300M/day
Simulations (Millennium): microprocessors
Particle events (LHC): particle colliders, 1B
7
Maximilien Brice, © CERN
8
Maximilien Brice, © CERN
9
Example: Wikipedia Anthropology
Kittur, Suh, Pendleton (UCLA, PARC), "He Says, She Says: Conflict and Coordination in Wikipedia," CHI 2007
Finding: an increasing fraction of edits are for work indirectly related to articles
Experiment: download the entire revision history of Wikipedia (4.7M pages, 58M revisions, 800 GB); analyze editing patterns and trends
Computation: Hadoop on a 20-machine cluster
3 slides from "Data Intensive Super Computing," Randal E. Bryant, Carnegie Mellon University
10
Example: Scene Completion
Hays, Efros (CMU), "Scene Completion Using Millions of Photographs," SIGGRAPH 2007
Image database grouped by semantic content: 30 different Flickr.com groups, 2.3M images total (396 GB)
Select the candidate images most suitable for filling the hole: classify images with the gist scene detector [Torralba]; color similarity; local context matching
Computation: index images offline; 50 min. scene matching, 20 min. local matching, 4 min. compositing; reduced to 5 minutes total by using 5 machines
Extension: Flickr.com has over 500 million images …
11
Example: Web Page Analysis
Fetterly, Manasse, Najork, Wiener (Microsoft, HP), "A Large-Scale Study of the Evolution of Web Pages," Software: Practice & Experience, 2004
Experiment: use a web crawler to gather 151M HTML pages weekly, 11 times; generated 1.2 TB of log information; analyze page statistics and change frequencies
Systems challenge: "Moreover, we experienced a catastrophic disk failure during the third crawl, causing us to lose a quarter of the logs of that crawl."
12
[Figure: a sequencer produces many short, overlapping reads (e.g., CGGTCTAATGCTTACTATGC, AATGCTTACTATGCGGGCCCCTT) sampled from an unknown subject genome]
13
DNA Sequencing
The genome of an organism encodes its genetic information in a long sequence of 4 DNA nucleotides: A, T, C, G
Bacteria: ~5 million bp (base pairs); humans: ~3 billion bp
Current DNA sequencing machines can generate 1-2 Gbp of sequence per day, in millions of short reads (25-300 bp): shorter reads, but much higher throughput
Per-base error rate estimated at 1-2% (Simpson et al., 2009)
Recent studies of entire human genomes have used 3.3 billion (Wang et al., 2008) and 4.0 billion (Bentley et al., 2008) 36 bp reads, ~144 GB of compressed sequence data
4 slides from Michael Schatz, "CloudBurst: Highly Sensitive Read Mapping with MapReduce," Bioinformatics, 2009, in press
[Figure: example reads, e.g. ATCTGATAAGTCCCAGGACTTCAGT, GCAAGGCAAACCCGAGCCCAGTTT]
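Those read counts imply a striking amount of raw sequence. A quick arithmetic sanity check (the ~3 Gbp genome size and the read counts are the slide's figures; "coverage" here just means total sequenced bases divided by genome size):

```python
# Back-of-the-envelope coverage for the whole-genome studies cited above.
GENOME_BP = 3_000_000_000   # human genome, ~3 billion bp (slide's figure)
READ_LEN = 36               # read length used in both studies (bp)

for label, n_reads in [("Wang et al. 2008", 3_300_000_000),
                       ("Bentley et al. 2008", 4_000_000_000)]:
    total_bp = n_reads * READ_LEN
    coverage = total_bp / GENOME_BP
    print(f"{label}: {total_bp / 1e9:.0f} Gbp sequenced, ~{coverage:.0f}x coverage")
```

So each study sequenced every base of the genome roughly 40-50 times over, which is what makes the 1-2% per-base error rate tolerable.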
14
[Figure: subject reads (e.g., CTATGCGGGC, CTAGATGCTT) aligned against the reference sequence CGGTCTAGATGCTTAGCTATGCGGGCCCCTT; alignment]
15
[Figure: subject reads tiled across the full reference sequence CGGTCTAGATGCTTATCTATGCGGGCCCCTT]
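The read-alignment idea in these figures can be illustrated with a deliberately naive sketch. This is not CloudBurst's seed-and-extend MapReduce algorithm; `map_read` is a hypothetical helper that simply scans every offset of the reference and tolerates a bounded number of substitutions:

```python
# Toy read-mapping sketch (not CloudBurst): align a short read to a reference
# by checking every offset and allowing up to `max_mismatches` substitutions.
def map_read(reference, read, max_mismatches=1):
    """Return (offset, mismatches) of the best alignment found, or None."""
    best = None
    for i in range(len(reference) - len(read) + 1):
        mm = sum(1 for a, b in zip(reference[i:i + len(read)], read) if a != b)
        if mm <= max_mismatches and (best is None or mm < best[1]):
            best = (i, mm)
    return best

reference = "CGGTCTAGATGCTTATCTATGCGGGCCCCTT"   # reference from the slide
print(map_read(reference, "CTAGATGCTT"))        # exact hit
print(map_read(reference, "CTAGATGGTT"))        # read with one substitution
```

The point of CloudBurst is that this scan is far too slow at billions of reads, so the work is distributed with MapReduce instead.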
16
Example: Bioinformatics
Michael Schatz, "CloudBurst: Highly Sensitive Read Mapping with MapReduce," Bioinformatics, 2009, in press. Evaluated running time on a local 24-core cluster; running time increases linearly with the number of reads.
17
Example: Data Mining
Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, Edward Y. Chang, "PFP: Parallel FP-Growth for Query Recommendation," RecSys 2008
A del.icio.us crawl is transformed into a bipartite graph covering web pages and tags.
18
There’s nothing like more data!
s/inspiration/data/g; (Banko and Brill, ACL 2001) (Brants et al., EMNLP 2007)
19
Big Data + Cloud Computing
20
What is Cloud Computing?
First, write down your own opinion about "cloud computing", whatever comes to mind.
Questions: What? Who? Why? How? Pros and cons?
The most important question is: what is its relation to me?
From "Introduction to Cloud Computing and Technical Issues," Hyeonsang Eom
21
What Is Cloud Computing?
Internet computing: computation done through the Internet, with no concern about maintenance or management of the actual resources
Shared computing resources: as opposed to local servers and devices; comparable to the Grid
Infrastructure, web applications, specialized raw computing services
22
Cloud Computing Resources
Large pool of easily usable and accessible virtualized resources
Dynamic reconfiguration for adjusting to different loads (scales): optimization of resource utilization
Pay-per-use model: guarantees offered by the infrastructure provider by means of customized SLAs (Service Level Agreements)
Set of features: scalability, pay-per-use utility model, and virtualization
23
Technical Key Points
User interaction interface: how users of the cloud interface with it
Services catalog: the services a user can request
System management: management of available resources
Provisioning tool: carving systems out of the cloud to deliver the requested service
Monitoring and metering: tracking the usage of the cloud
Servers: virtual or physical servers managed by system administrators
24
Key Characteristics (1/2)
Cost savings for resources: cost is greatly reduced, as initial and recurring expenses are much lower than in traditional computing; maintenance cost is reduced because a third party maintains everything, from running the cloud to storing data
Platform, location, and device independence
Adoptable by businesses of all sizes, in particular small and mid-sized ones
25
Key Characteristics (2/2)
Scalable services and applications: achieved through server virtualization technology
Redundancy and disaster recovery
26
Timeline
27
Cloud Computing Architecture
Front end: the end user, client, or any application (e.g., a web browser)
Back end (cloud services): a network of servers with any computer program and data storage system
It is usually assumed that clouds have infinite storage capacity and any software available on the market
29
Cloud Service Taxonomy
Layers: Software-as-a-Service (SaaS), Platform-as-a-Service (PaaS), Infrastructure-as-a-Service (IaaS), Data-Storage-as-a-Service (DaaS), Communication-as-a-Service (CaaS), Hardware-as-a-Service (HaaS)
Types: public cloud, private cloud, inter-cloud
30
Infrastructure as a Service
31
What Is Infrastructure-as-a-Service (IaaS)?
Characteristics:
Utility computing and billing model
Automation of administrative tasks
Dynamic scaling
Desktop virtualization
Policy-based services
Internet connectivity
32
Use Scenario for IaaS
33
Business Example: Amazon EC2
34
Amazon EC2 CLI (Command Line Interface)
35
Amazon EC2 Automated Management
36
Data Storage as a Service (DaaS)
Definition: delivery of data storage as a service, including database-like services, often billed on a utility computing basis
Database: Amazon SimpleDB, Google App Engine's BigTable datastore
Network-attached storage: MobileMe iDisk, Nirvanix CloudNAS
Synchronization: Live Mesh's Live Desktop component, MobileMe push functions
Web service: Amazon Simple Storage Service (S3), Nirvanix SDN
37
Amazon S3
38
Platform as a Service
39
PaaS Example: Google App Engine
A service that lets users deploy their web applications on Google's highly scalable architecture
Provides a sandbox for Java and Python applications that can be referenced over the Internet
Provides Java and Python APIs for persistently storing and managing data (using the Google Query Language, GQL)
40
Google App Engine
41
Google App Engine (Python Example)
42
Software-as-a-Service (SaaS)
Definition: software deployed as a hosted service and accessed over the Internet
Features: open and flexible; easy to use; easy to upgrade; easy to deploy
43
Software as a Service
44
Human as a Service
45
Administration/Business Support
46
Cloud Architecture -> Cloud Players
47
Players
48
Players: Providers
49
Players: Cloud Intermediaries
50
Players: Application Providers
51
Programming Model
52
Web-Scale Problems? Don't hold your breath:
Biocomputing
Nanocomputing
Quantum computing
…
It all boils down to:
Divide-and-conquer
Throwing more hardware at the problem
Simple to understand… a lifetime to master…
53
Divide and Conquer
[Figure: the "Work" is partitioned into units w1, w2, w3; each "worker" processes one unit, producing partial results r1, r2, r3, which are combined into the final "Result"]
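The partition/worker/combine pattern in the figure can be sketched in a few lines of Python. This is a minimal illustration using a thread pool; the chunking scheme and the sum-of-squares worker are just examples, not part of any particular framework:

```python
# Partition -> workers -> combine: split the "work" into chunks, process each
# chunk independently, then merge the partial results r1, r2, r3 ...
from concurrent.futures import ThreadPoolExecutor

def partition(work, n):
    """Split `work` into n roughly equal chunks (the w1, w2, w3 of the figure)."""
    k = (len(work) + n - 1) // n
    return [work[i:i + k] for i in range(0, len(work), k)]

def worker(chunk):
    return sum(x * x for x in chunk)   # each worker computes a partial result

def combine(partials):
    return sum(partials)               # merge the partial results

work = list(range(1, 101))
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(worker, partition(work, 3)))
print(combine(partials))               # sum of squares of 1..100
```

The hard parts, as the rest of this lecture argues, are everything this sketch glosses over: scheduling, load balance, failures, and sharing partial results.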
54
Different Workers
Different threads in the same core
Different cores in the same CPU
Different CPUs in a multi-processor system
Different machines in a distributed system
55
Choices, Choices, Choices
Commodity vs. "exotic" hardware
Number of machines vs. processors vs. cores
Bandwidth of memory vs. disk vs. network
Different programming models
56
Flynn's Taxonomy

                     Single Instruction (SI)         Multiple Instruction (MI)
Single Data (SD)     SISD: single-threaded process   MISD: pipeline architecture
Multiple Data (MD)   SIMD: vector processing         MIMD: multi-threaded programming
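The Multiple Data row of the taxonomy can be contrasted in plain Python. This is only a rough illustration, since Python threads are not real SIMD or MIMD hardware: the point is that SIMD applies one operation across a vector of data, while MIMD runs different instruction streams on different data:

```python
# Illustrative only: SIMD-style vs MIMD-style computation in plain Python.
import threading

data = [1, 2, 3, 4]

# SIMD-style: the *same* instruction (double) applied across many data elements
simd_result = [x * 2 for x in data]

# MIMD-style: each worker runs its *own* instruction stream on its own data
results = {}
def task(name, fn, x):
    results[name] = fn(x)

threads = [threading.Thread(target=task, args=("double", lambda x: x * 2, 10)),
           threading.Thread(target=task, args=("square", lambda x: x * x, 10))]
for t in threads: t.start()
for t in threads: t.join()
print(simd_result, results)
```

Real SIMD lives in vector units (and in libraries such as NumPy that dispatch one operation over a whole array); real MIMD is the multi-threaded and distributed programming the rest of this lecture is about.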
57
SISD
[Figure: a single processor executes one instruction stream over one data stream D]
58
SIMD
[Figure: a single processor applies one instruction stream in lockstep to many data streams D0, D1, D2, …, Dn]
59
MIMD
[Figure: multiple processors, each executing its own instruction stream on its own data stream]
60
Memory Typology: Shared
[Figure: several processors all connected to one shared memory]
61
Memory Typology: Distributed
[Figure: processor-memory pairs, each processor with its own local memory, connected by a network]
62
Memory Typology: Hybrid
[Figure: groups of processors sharing a local memory, with the groups connected by a network]
63
Parallelization Problems
How do we assign work units to workers? What if we have more work units than workers? What if workers need to share partial results? How do we aggregate partial results? How do we know all the workers have finished? What if workers die? What is the common theme of all of these problems?
64
General Theme? Parallelization problems arise from:
Communication between workers
Access to shared resources (e.g., data)
Thus, we need a synchronization system!
This is tricky: finding bugs is hard, and fixing bugs is even harder
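Why shared resources need synchronization can be shown with a classic sketch: several threads incrementing one counter. `counter += 1` is a read-modify-write, so without the lock its interleavings can lose updates; with the lock, every increment is effectively atomic:

```python
# Classic shared-counter example: the lock makes the read-modify-write atomic.
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:            # remove this, and the final count may fall short
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                # 4 threads x 100,000 increments
```

And this is exactly why the bugs are hard to find: the unlocked version often prints the right answer anyway, until one day it doesn't.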
65
Managing Multiple Workers
Difficult because we (often) don't know:
the order in which workers run
where the workers are running
when workers interrupt each other
Thus, we need:
semaphores (lock, unlock)
condition variables (wait, notify, broadcast)
barriers
Still, lots of problems: deadlock, livelock, race conditions, …
Moral of the story: be careful! It is even trickier if the workers are on different machines
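Python's standard library exposes these primitives directly (`threading.Semaphore`, `threading.Condition`, `threading.Barrier`). As one small sketch, a barrier forces every worker to wait until all of them have reached the same point:

```python
# Barrier sketch: no worker proceeds past barrier.wait() until all N arrive.
import threading

N = 3
barrier = threading.Barrier(N)
order = []                      # shared event log (list.append is thread-safe)

def worker(i):
    order.append(("before", i))
    barrier.wait()              # blocks until all N workers have arrived
    order.append(("after", i))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()

# Every "before" event necessarily precedes every "after" event.
print(order)
```

Note that the ordering *within* the before-group and within the after-group is still nondeterministic, which is precisely the "don't know the order in which workers run" problem above.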
66
Patterns for Parallelism
Parallel computing has been around for decades Here are some “design patterns” …
67
Master/Slaves
[Figure: one master coordinating a set of slave workers]
68
Producer/Consumer Flow
69
Work Queues
[Figure: producers (P) push work items onto a shared queue; workers (W) pull items off; consumers (C) collect the results]
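The work-queue pattern maps directly onto Python's thread-safe `queue.Queue`. A minimal sketch (the squared-item "work" and the `None` sentinel protocol are illustrative choices, not part of the pattern itself):

```python
# Work-queue sketch: a producer puts items on a shared queue, several workers
# pull and process them; one None sentinel per worker signals shutdown.
import queue
import threading

q = queue.Queue()
results = []
res_lock = threading.Lock()

def producer(items):
    for item in items:
        q.put(item)

def consumer():
    while True:
        item = q.get()
        if item is None:        # sentinel: no more work for this worker
            break
        with res_lock:
            results.append(item * item)

workers = [threading.Thread(target=consumer) for _ in range(3)]
for w in workers: w.start()
producer(range(10))
for _ in workers: q.put(None)   # one sentinel per worker
for w in workers: w.join()
print(sorted(results))
```

The queue handles all the locking for hand-off; the only extra synchronization needed is around the shared results list.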
70
Rubber Meets Road
From patterns to implementation:
pthreads, OpenMP for multi-threaded programming
MPI for cluster computing
…
The reality:
Lots of one-off solutions, custom code
Write your own dedicated library, then program with it
Burden on the programmer to explicitly manage everything
MapReduce to the rescue! (for next time)
71
Questions?