Published by Ann Nicholson. Modified over 9 years ago.
1
The Evolution of Data Infrastructure at LinkedIn
LinkedIn Confidential ©2013 All Rights Reserved
2
Outline
1. Company and Mission
2. Products and Science
3. Data Infrastructure
4. Conclusion
3
The World’s Largest Professional Network
Connecting the world’s professionals to make them more productive and successful
  225M+ Members Worldwide
  2+ New Members Per Second
  132M+ Monthly Unique Visitors
  2.9M+ Company Pages
4
Member Profiles
  Large dataset
  Medium writes
  Very high reads
  Freshness <1s
5
People You May Know
  Large dataset
  Compute intensive
  High reads
  Freshness ~hrs
6
LinkedIn Today
  Moving dataset
  High writes
  High reads
  Freshness ~mins
7
LinkedIn Data Infrastructure: Three-Phase Abstraction
Diagram labels: Users, Application, Online Data Infra, Near-Line Infra, Offline Data Infra
Infrastructure | Latency & Freshness Requirements | Products
Online | Activity that should be reflected immediately | Member Profiles, Company Profiles, Connections, Messages, Endorsements, Skills
Near-Line | Activity that should be reflected soon | Activity Streams, Profile Standardization, News Recommendations, Search, Messages
Offline | Activity that can be reflected later | People You May Know, Connection Strength, News Recommendations, Next best idea…
8
The Big-Data Feedback Loop
Diagram labels: Value, Insights, Scale, Product, Science, Data, Member, Engagement, Virality, Signals, Refinement, Infrastructure, Analytics
9
LinkedIn Data Infrastructure: Sample Stack
  Infrastructure challenges in the 3-phase ecosystem are diverse, complex and specific
  Some components are off-the-shelf; significant investment in home-grown, deep and interesting platforms
10
The Original RDBMS Model
11
Streaming Transactions for Search/Connections
12
Databus: Timeline-Consistent Change Data Capture
LinkedIn Data Infrastructure Solutions
13
Streaming Transactions for Search/Connections (diagram label: RO)
14
Databus at LinkedIn
Diagram labels: DB, Relay, Event Win, Bootstrap, Capture Changes, On-line Changes, Compressed Delta Since T, Consistent Snapshot at U
  Transport independent of data source: Oracle, MySQL, …
  Transactional semantics
  In-order, at-least-once delivery
  Tens of relays
  Hundreds of sources
  Low latency: milliseconds
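The delivery guarantees on this slide (in order, at least once, with a bootstrap path for consumers that fall behind) can be illustrated with a small sketch. This is not Databus code: `ChangeEvent`, `Relay`, `Consumer`, and the `seq` field are hypothetical names, and the relay is reduced to a bounded in-memory window.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    seq: int    # commit sequence number assigned at the source (hypothetical field)
    key: str
    value: str


class Relay:
    """Keeps a bounded, in-order window of recent change events."""

    def __init__(self, capacity: int):
        self.window = deque(maxlen=capacity)

    def capture(self, event: ChangeEvent) -> None:
        self.window.append(event)

    def events_since(self, seq: int):
        """Return events after `seq`, or None if the window no longer covers them."""
        if self.window and self.window[0].seq > seq + 1:
            return None  # the caller would have to bootstrap from a snapshot
        return [e for e in self.window if e.seq > seq]


class Consumer:
    """Applies events idempotently so at-least-once redelivery is harmless."""

    def __init__(self):
        self.last_seq = 0
        self.state = {}

    def apply(self, event: ChangeEvent) -> bool:
        if event.seq <= self.last_seq:
            return False  # duplicate from a retry; already applied
        self.state[event.key] = event.value
        self.last_seq = event.seq
        return True


relay = Relay(capacity=3)
for i in range(1, 6):
    relay.capture(ChangeEvent(seq=i, key="member:1", value=f"v{i}"))

# A brand-new consumer is older than the relay window and needs a bootstrap.
fresh = Consumer()
needs_bootstrap = relay.events_since(fresh.last_seq) is None

# A consumer that checkpointed at seq 2 can catch up from the relay alone.
caught_up = Consumer()
caught_up.last_seq = 2
applied = [caught_up.apply(e) for e in relay.events_since(2)]
redelivered = caught_up.apply(ChangeEvent(seq=5, key="member:1", value="v5"))
```

Tracking the last applied sequence number on the consumer side is one common way to turn at-least-once delivery into effectively-once application; the slide does not say how Databus clients do this.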
15
Scaling Core Databases (diagram label: RO)
16
Voldemort: Highly-Available Distributed KV Store
LinkedIn Data Infrastructure Solutions
17
Scaling Core Databases
18
Voldemort: Architecture
  Pluggable components
  Tunable consistency / availability
  Highly scalable key/value store
  14 clusters, 400 nodes
  400K peak QPS
  100TB data
  2~3ms avg latency
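"Tunable consistency / availability" is commonly expressed in Dynamo-style stores through three numbers: replicas per key (N), replicas that must answer a read (R), and replicas that must acknowledge a write (W). The sketch below assumes that model; the function names are hypothetical and this is not Voldemort's configuration API.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when every read quorum must intersect every write quorum."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


def tolerated_failures(n: int, r: int, w: int) -> int:
    """Replicas that can be down while both reads and writes still succeed."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return n - max(r, w)


# Three illustrative settings for a replication factor of 3.
balanced = (quorums_overlap(3, 2, 2), tolerated_failures(3, 2, 2))
fast = (quorums_overlap(3, 1, 1), tolerated_failures(3, 1, 1))
write_all = (quorums_overlap(3, 1, 3), tolerated_failures(3, 1, 3))
```

Here `balanced` reads its own writes and survives one failed replica, `fast` favors availability and latency over read-your-writes, and `write_all` gives cheap consistent reads but stops accepting writes as soon as one replica is down.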
19
Scaling Core Databases (diagram label: Secondary Index)
20
Espresso: Indexed Timeline-Consistent Distributed Data Store
LinkedIn Data Infrastructure Solutions
21
Storage with Richer Data Model (diagram label: Espresso)
22
Application View
  Hierarchical data model
  Rich functionality on resources
    Conditional updates
    Partial updates
    Atomic counters
  Rich functionality within resource groups
    Transactions
    Secondary index
    Text search
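The three per-resource operations listed above (conditional updates, partial updates, atomic counters) can be sketched against an in-memory stand-in. `ResourceStore`, `VersionConflict`, the version counter, and the `/Profile/1` key are all hypothetical; Espresso's real interface is not shown on the slide.

```python
import threading


class VersionConflict(Exception):
    """Raised when a conditional update sees an unexpected version."""


class ResourceStore:
    """In-memory stand-in for a document store; all names are hypothetical."""

    def __init__(self):
        self._lock = threading.Lock()
        self._docs = {}  # resource key -> (version, fields)

    def put(self, key, fields, expected_version=None):
        """Replace a document; if expected_version is given, require a match."""
        with self._lock:
            version, _ = self._docs.get(key, (0, {}))
            if expected_version is not None and expected_version != version:
                raise VersionConflict(f"{key}: expected {expected_version}, found {version}")
            self._docs[key] = (version + 1, dict(fields))
            return version + 1

    def patch(self, key, fields):
        """Partial update: merge the given fields into the stored document."""
        with self._lock:
            version, doc = self._docs.get(key, (0, {}))
            self._docs[key] = (version + 1, {**doc, **fields})
            return version + 1

    def increment(self, key, field, delta=1):
        """Atomic counter: read-modify-write under the lock."""
        with self._lock:
            version, doc = self._docs.get(key, (0, {}))
            value = doc.get(field, 0) + delta
            self._docs[key] = (version + 1, {**doc, field: value})
            return value

    def get(self, key):
        with self._lock:
            version, doc = self._docs[key]
            return version, dict(doc)


store = ResourceStore()
v1 = store.put("/Profile/1", {"name": "Ann"})
v2 = store.patch("/Profile/1", {"headline": "Engineer"})
try:
    # Conditional update based on a stale version is rejected.
    store.put("/Profile/1", {"name": "Stale"}, expected_version=v1)
    conflict = False
except VersionConflict:
    conflict = True
views = [store.increment("/Profile/1", "views") for _ in range(3)]
```

A version check of this kind is one way to implement conditional updates; an ETag or timestamp comparison would serve the same purpose.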
23
Espresso: System Components
  Partitioning/replication
  Timeline consistency
  Change propagation
24
Generic Cluster Manager: Helix
  Generic Distributed State Model
  Config Management
  Automatic Load Balancing
  Fault tolerance
  Cluster expansion and rebalancing
  Used by Espresso, Databus and Search
  Open Source Apr 2012
  https://github.com/linkedin/helix
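A "generic distributed state model" plus "cluster expansion and rebalancing" can be pictured as a table of legal replica state transitions and a function that recomputes partition placement when nodes change. The sketch below is an illustration only, not the Helix API: the state names, `transition_path`, and `rebalance` are assumptions made for the example.

```python
from collections import deque

# Legal replica state transitions; the state names are illustrative assumptions.
TRANSITIONS = {
    "OFFLINE": ["SLAVE"],
    "SLAVE": ["MASTER", "OFFLINE"],
    "MASTER": ["SLAVE"],
}


def transition_path(current, target):
    """Breadth-first search for the shortest legal path between two states."""
    queue = deque([[current]])
    seen = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in TRANSITIONS[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def rebalance(partitions, nodes):
    """Spread partitions round-robin across nodes, e.g. after cluster expansion."""
    return {p: nodes[i % len(nodes)] for i, p in enumerate(partitions)}


promote = transition_path("OFFLINE", "MASTER")
before = rebalance(["p0", "p1", "p2", "p3"], ["n1", "n2"])
after = rebalance(["p0", "p1", "p2", "p3"], ["n1", "n2", "n3"])
```

Round-robin placement moves more partitions than strictly necessary when a node is added; a production rebalancer would typically try to minimize movement.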
25
Streaming Non-transactional Events (diagram labels: Hadoop/DW, Espresso)
26
Kafka: High-Volume Low-Latency Messaging System
LinkedIn Data Infrastructure Solutions
27
Ingress – Offline Data Analytics (diagram label: Secured Hadoop/DW)
28
Kafka Architecture
Diagram labels: Producer, Consumer, Zookeeper, Broker 1 to Broker 4, each broker hosting copies of partitions topic1-part1, topic1-part2, topic2-part1, topic2-part2
Key features
  Scale-out architecture
  High throughput
  Automatic load balancing
  Intra-cluster replication
Per day stats
  writes: 10+ billion messages
  reads: 50+ billion messages
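The diagram shows each topic partition hosted on several of the four brokers. A minimal sketch of how keys might map to partitions, and partitions to replica brokers, follows. It assumes a CRC32-modulo partitioner and round-robin replica placement with a replication factor of 3; both are illustrative choices, not Kafka's actual partitioner or assignment algorithm, and the function names are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition with a stable hash (illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_replicas(topic, num_partitions, brokers, replication_factor):
    """Place each partition on consecutive brokers, wrapping around the list."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    layout = {}
    for p in range(num_partitions):
        layout[f"{topic}-part{p + 1}"] = [
            brokers[(p + i) % len(brokers)] for i in range(replication_factor)
        ]
    return layout


brokers = ["Broker 1", "Broker 2", "Broker 3", "Broker 4"]
layout = assign_replicas("topic1", 2, brokers, replication_factor=3)

# The same key always lands in the same partition, which preserves per-key order.
first = partition_for("member:42", 2)
second = partition_for("member:42", 2)
```

Keeping all messages for one key in one partition is what lets a partitioned log offer per-key ordering while still scaling out across brokers.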
29
Egress – Analytics Results for Online Serving (diagram label: Secured Hadoop/DW)
30
WebHDFS + Faust
LinkedIn Data Infrastructure Solutions
31
Egress – Getting Data Out from Offline (diagram labels: Secured Hadoop/DW, WebHDFS, Kafka, Faust)
32
Batch Environment Data Flow
33
Workflow management: Azkaban
34
Read-only Data Generation and Transfer
  Map-reduce jobs generate RO files
  Entire index fits in memory for fast reads
  File system cache for data
  Data transferred in parallel via WebHDFS
  Authentication always required for each file transfer out of Hadoop
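The split between an in-memory index and data served through the file system cache can be sketched as a sorted key list with offsets into a data blob. This is an assumption-laden illustration, not the actual read-only file format: `build_read_only_store` and `lookup` are hypothetical names, and a `bytes` object stands in for the data file.

```python
import bisect
import io


def build_read_only_store(records):
    """Build (keys, entries, data) from a dict of str -> bytes.

    keys and entries form the small in-memory index; data stands in for
    the larger data file that the OS page cache would serve.
    """
    data = io.BytesIO()
    keys, entries = [], []
    for key in sorted(records):
        value = records[key]
        keys.append(key)
        entries.append((data.tell(), len(value)))  # (offset, length)
        data.write(value)
    return keys, entries, data.getvalue()


def lookup(keys, entries, data, key):
    """Binary-search the index, then slice the value out of the data blob."""
    i = bisect.bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return None
    offset, length = entries[i]
    return data[offset:offset + length]


keys, entries, data = build_read_only_store({"b": b"bob", "a": b"alice"})
hit = lookup(keys, entries, data, "a")
miss = lookup(keys, entries, data, "c")
```

Because the files never change after the batch job writes them, the index can be built once offline and loaded as-is, with no locking on the read path.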
35
Modifiable Data Generation and Transfer
  Map-reduce jobs generate records
    In Avro format
    Annotated key and value fields
  Records published from Hadoop to Kafka
  Faust consumes records from Kafka
  Faust streams records into Voldemort, Espresso, and other serving platforms
Diagram labels: Faust (Monitoring, Throttling, Scheduling), Plug-ins, V. Plug-in, E. Plug-in, Kafka Plug-in, Databus Plug-in, Hadoop, Teradata/DWH, Kafka, Voldemort, Espresso, Other Data Sources
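The flow on this slide (records with annotated key and value fields, delivered through per-store plug-ins, with throttling and monitoring) can be sketched as a small dispatcher. Nothing here reflects Faust's real design: `DictSink`, `Dispatcher`, the `target` field, and the per-tick limit are hypothetical, and a Python list stands in for the Kafka stream.

```python
from collections import defaultdict


class DictSink:
    """Stand-in for a serving-store plug-in; it just keeps records in a dict."""

    def __init__(self):
        self.rows = {}

    def write(self, key, value):
        self.rows[key] = value


class Dispatcher:
    """Routes records to sinks by name, with a simple per-tick throttle."""

    def __init__(self, sinks, max_per_tick):
        self.sinks = sinks
        self.max_per_tick = max_per_tick
        self.delivered = defaultdict(int)  # monitoring counter per sink

    def run_tick(self, queue):
        """Deliver at most max_per_tick records; return the remaining backlog."""
        batch, backlog = queue[:self.max_per_tick], queue[self.max_per_tick:]
        for record in batch:
            self.sinks[record["target"]].write(record["key"], record["value"])
            self.delivered[record["target"]] += 1
        return backlog


voldemort, espresso = DictSink(), DictSink()
dispatcher = Dispatcher({"voldemort": voldemort, "espresso": espresso}, max_per_tick=2)

records = [
    {"target": "voldemort", "key": "m1", "value": "pymk-v1"},
    {"target": "espresso", "key": "m1", "value": "profile-v1"},
    {"target": "voldemort", "key": "m2", "value": "pymk-v2"},
]
backlog = dispatcher.run_tick(records)   # first two records delivered
backlog = dispatcher.run_tick(backlog)   # remaining record delivered
```

Throttling on the egress path protects the online serving stores from being overwhelmed by a large batch output arriving all at once.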
36
Summary
Read more @ data.linkedin.com
1. E2E: The Big-Data feedback loop is essential for product design
2. Infrastructure
   1. Data Infra needs continuous innovation and iteration to scale out
   2. Fast moving, Big, Clean Data + Agile Metadata = Goodness
   3. Data-driven products need agile feedback infrastructure and measurement methodology.
3. Methodology
   1. Data-Driven experimentation enables insights and agile products
   2. Recommendation-driven products have big impact.
37
Help us. Come Have Fun with Us!
Info: data.linkedin.com
1. Science and Data Mining: Recommendation and Optimization Problems
2. Next-generation ad-hoc and OLAP query processing on Hadoop
3. Graph Computations: Off-line mining and On-line integration loops
4. nRT Data Streams in Near-line infrastructure
5. And much more…
38
In Closing
Thank You!
lgao@linkedin.com