Hadoop & Neptune Feb 김형준
The Data 'Tsunami'
More CPU, Faster Disk, Program Tuning, More Memory
Uninstall
Where? Distributed File System. How? Distributed/Parallel Computing.
Hadoop DFS: Unlimited Storage; No Backup, Self-healing; Thousands of Nodes. But: No POSIX, No Random Write.
[Diagram: Hadoop cluster layout. Legend: machine, daemon process. Master daemons: NameNode (DFS Master), JobTracker (Job Master), Secondary NameNode. Each of three slave machines: DataNode (DFS Slave), TaskTracker (Task Mgmt.), Local Disk. The Client API exchanges control and data messages with the cluster.]
Hadoop MapReduce: 1 TB group-by -> 10 minutes. More machines -> 1 minute.
map (k1, v1) → list(k2, v2)
reduce (k2, list(v2)) → result value
Input: This is a book. That book is on the desk. I like that book.
map() output: (This, 1) (book, 1) (That, 1) (book, 1) … (I, 1) (that, 1) (book, 1) …
Grouped by key: (book, [1,1,1]) … (is, [1,1]) … (This, [1])
reduce() output: (book, 3) … (is, 2) … (This, 1)
Hadoop: distributed/parallel Map&Reduce execution platform (Split, Partition, Merge, Sort)
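The word-count flow on this slide can be sketched in a few lines of Python. This is a minimal single-process illustration, not Hadoop's API: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and the in-memory dictionary stands in for the split/partition/sort/merge steps that the platform performs across machines.

```python
import re
from collections import defaultdict


def map_fn(_key, line):
    """map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the line."""
    return [(word, 1) for word in re.findall(r"[A-Za-z]+", line)]


def reduce_fn(word, counts):
    """reduce(k2, list(v2)) -> result value: sum the counts for one word."""
    return word, sum(counts)


def run_mapreduce(records):
    """Single-process stand-in for the framework (assumption, not Hadoop code)."""
    grouped = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            grouped[k2].append(v2)  # group values by key, as the shuffle would
    # Sorting the keys mimics the sort/merge step before reduce runs.
    return dict(reduce_fn(k, vs) for k, vs in sorted(grouped.items()))


text = "This is a book. That book is on the desk. I like that book."
records = list(enumerate(text.split(". ")))  # (sentence number, sentence) pairs
word_counts = run_mapreduce(records)
```

As on the slide, the count is case-sensitive, so `That` and `that` are separate keys while `book` is counted three times.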
A piece of Cake
Neptune: Database running on a DFS (Hadoop); Unlimited Structured Data; No Backup. But: No JOIN, No SQL, No Multi-row Operations, No Aggregation Functions.
Operations: Create/Drop Table; put/get; like/between; scan/merge scan (join); MapReduce
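The slides do not show Neptune's client API, so the following is only a sketch of what "merge scan (join)" is assumed to mean: two scans that each return rows in row-key order are walked together, and rows with equal keys are paired. The function name `merge_scan` and the sample rows are hypothetical, and the sketch assumes row keys are unique within each scan.

```python
def merge_scan(left_rows, right_rows):
    """Join two scans sorted by row key, advancing whichever side is behind."""
    left, right = iter(left_rows), iter(right_rows)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if l[0] == r[0]:
            yield l[0], l[1], r[1]  # keys match: emit the joined row
            l, r = next(left, None), next(right, None)
        elif l[0] < r[0]:
            l = next(left, None)    # left side is behind, advance it
        else:
            r = next(right, None)   # right side is behind, advance it


# Hypothetical scan results, each already sorted by row key.
users = [("u1", "Kim"), ("u2", "Lee"), ("u4", "Park")]
orders = [("u2", "book"), ("u3", "desk"), ("u4", "pen")]
joined = list(merge_scan(users, orders))
```

Because both inputs are already ordered by row key, the join needs a single pass over each scan and no random lookups.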
Why Neptune? Distributed/parallel processing: Speed, Scalability
[Diagram: TableA is split into tablets (TabletA-1, Tablet A-2, Tablet A-3, …, Tablet A-N). The user makes Map&Reduce functions and runs them on the Map&Reduce framework. The JobTracker gets the tablet list from the META table and assigns Map Tasks to the TaskTracker on each node; Reduce Tasks, also run by TaskTrackers, write to TableB (Tablet B-1, Tablet B-2).]
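The "get tablet list, then assign tasks to each node" step can be illustrated with a small sketch. It assumes one map task per tablet and a simple round-robin assignment; the slides do not say how the JobTracker actually chooses nodes, and `plan_map_tasks`, the tablet names, and the node names are all hypothetical.

```python
from itertools import cycle


def plan_map_tasks(tablets, task_trackers):
    """Assign one map task per tablet to TaskTrackers in round-robin order."""
    assignment = {tracker: [] for tracker in task_trackers}
    for tablet, tracker in zip(tablets, cycle(task_trackers)):
        assignment[tracker].append(tablet)
    return assignment


# Hypothetical tablet list, as it might be read from the META table.
tablet_list = ["TabletA-1", "TabletA-2", "TabletA-3", "TabletA-4"]
trackers = ["node1", "node2", "node3"]
plan = plan_map_tasks(tablet_list, trackers)
```

Since every tablet becomes an independent map task, adding nodes spreads the same tablet list over more TaskTrackers, which is where the speed and scalability on the slide come from.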
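[Diagram: Neptune architecture. Top: user application. Middle: Neptune (large-scale distributed data store), consisting of the Neptune Master, TabletServer #1, TabletServer #2, …, TabletServer #n, and a Cluster Management System, alongside the distributed/parallel computing platform (Hadoop). Bottom: distributed file system (Hadoop or other). Labels: logical Table, physical storage.]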
When to use Neptune: Large Data; Online put/get and analysis; Less complex. Examples: Google Personalized Search, Google Analytics
Finding developer
Principles of Google Infra
Cheap Hardware and Smart Software
  Use cheap commodity hardware (frequent failure)
  Develop smart software for reducing the cost of failure
Easy Management
  High scalability by automatic discovery of new servers and racks
  High redundancy for failure of servers, racks, even data centers
Speed and Then More Speed
  High speed with low cost
  Rapid development and deployment of new products
Use Existing Technologies
  Use techniques from the leading edge of computer science
  Use open source code as a starting point
Google Infra
[Diagram: Google's software stack. Batch applications and Online Services sit on top of the Client API, Map & Reduce, Bigtable, GFS, Chubby, and Cluster Mgmt, all running on Google Linux. Hardware: low-end commodity servers, 40 or more pizza-box servers per rack. Labels: Google's core competency, Google's software stack.]
Q&A