The Evolution of Data Infrastructure at LinkedIn
LinkedIn Confidential ©2013 All Rights Reserved.


1 The Evolution of Data Infrastructure at LinkedIn

2 Outline
1. Company and Mission
2. Products and Science
3. Data Infrastructure
4. Conclusion

3 The World’s Largest Professional Network
225M+ Members Worldwide
2+ New Members Per Second
132M+ Monthly Unique Visitors
2.9M+ Company Pages
Connecting the world’s professionals to make them more productive and successful

4 Member Profiles
Large dataset
Medium writes
Very high reads
Freshness <1s

5 People You May Know
Large dataset
Compute intensive
High reads
Freshness ~hrs

6 LinkedIn Today
Moving dataset
High writes
High reads
Freshness ~mins

7 LinkedIn Data Infrastructure: Three-Phase Abstraction
[Diagram labels: Users, Application, Online Data Infra, Near-Line Infra, Offline Data Infra]
Columns: Infrastructure, Latency & Freshness Requirements, Products
Online: activity that should be reflected immediately
  Products: Member Profiles, Company Profiles, Connections, Messages, Endorsements, Skills
Near-Line: activity that should be reflected soon
  Products: Activity Streams, Profile Standardization, News, Recommendations, Search, Messages
Offline: activity that can be reflected later
  Products: People You May Know, Connection Strength, News, Recommendations, Next best idea…

8 The Big-Data Feedback Loop
[Diagram: a loop connecting Product, Science and Data; arrow labels: Value, Insights, Scale, Member Engagement, Virality, Signals, Refinement, Infrastructure, Analytics]

9 LinkedIn Data Infrastructure: Sample Stack
Infrastructure challenges in the three-phase ecosystem are diverse, complex and specific.
Some components are off-the-shelf, with significant investment in home-grown, deep and interesting platforms.

10 The Original RDBMS Model

11 Streaming Transactions for Search/Connections

12 Databus: Timeline-Consistent Change Data Capture
LinkedIn Data Infrastructure Solutions

13 Streaming Transactions for Search/Connections
[Diagram label: RO]

14 Databus at LinkedIn
[Diagram labels: DB, Capture Changes, Relay, Event Win, On-line Changes, Bootstrap, DB, Compressed Delta Since T, Consistent Snapshot at U]
Transport independent of data source: Oracle, MySQL, …
Transactional semantics
In-order, at-least-once delivery
Tens of relays
Hundreds of sources
Low latency: milliseconds
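
As an illustration of what "in-order, at-least-once delivery" asks of a consumer, here is a minimal Python sketch. It is not Databus code: the `ChangeEvent` and `IdempotentConsumer` names, and the assumption of exactly one event per sequence number (`scn`), are hypothetical simplifications. The point is that a consumer tracking its last applied sequence number can safely ignore redelivered events.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    scn: int    # hypothetical sequence number, assigned in commit order
    key: str
    value: str


@dataclass
class IdempotentConsumer:
    """Applies events in order and ignores redeliveries."""
    last_scn: int = 0
    state: dict = field(default_factory=dict)

    def on_event(self, event: ChangeEvent) -> bool:
        # At-least-once delivery means an event may arrive again;
        # anything at or below the checkpoint was already applied.
        if event.scn <= self.last_scn:
            return False
        self.state[event.key] = event.value
        self.last_scn = event.scn
        return True


consumer = IdempotentConsumer()
stream = [
    ChangeEvent(1, "member:7", "headline=Engineer"),
    ChangeEvent(2, "member:9", "headline=Designer"),
    ChangeEvent(2, "member:9", "headline=Designer"),  # redelivered
    ChangeEvent(3, "member:7", "headline=Manager"),
]
applied = [consumer.on_event(e) for e in stream]
```

Here the duplicate event with `scn` 2 is skipped, so the consumer's state matches what a single in-order pass over the source would produce.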

15 Scaling Core Databases
[Diagram label: RO]

16 Voldemort: Highly-Available Distributed KV Store
LinkedIn Data Infrastructure Solutions

17 Scaling Core Databases

18 Voldemort: Architecture
Pluggable components
Tunable consistency / availability
Highly scalable key/value store
14 clusters, 400 nodes
400K peak QPS
100TB data
2~3ms avg latency
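
The "tunable consistency / availability" point can be made concrete with the usual quorum arithmetic for replicated key/value stores. The sketch below assumes a replication factor `n`, required reads `r` and required writes `w`; the function names are hypothetical and this is not Voldemort's API. When `r + w > n`, every read quorum overlaps every write quorum; lowering `r` or `w` trades that overlap for availability.

```python
def quorum_overlaps(n: int, r: int, w: int) -> bool:
    """True when every read quorum intersects every write quorum."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


def tolerated_failures(n: int, r: int, w: int) -> dict:
    """Replica failures each operation type can survive."""
    return {"reads": n - r, "writes": n - w}


strict = quorum_overlaps(3, 2, 2)       # overlapping quorums
relaxed = quorum_overlaps(3, 1, 1)      # favors availability and latency
headroom = tolerated_failures(3, 2, 2)
```

With three replicas, `r = w = 2` keeps reads and writes overlapping while tolerating one failed replica for either operation; `r = w = 1` tolerates two failures but a read may miss the latest write.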

19 Scaling Core Databases
[Diagram label: Secondary Index]

20 Espresso: Indexed Timeline-Consistent Distributed Data Store
LinkedIn Data Infrastructure Solutions

21 Storage with Richer Data Model
[Diagram label: Espresso]

22 Application View
Hierarchical data model
Rich functionality on resources: conditional updates, partial updates, atomic counters
Rich functionality within resource groups: transactions, secondary index, text search
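
To show what conditional updates, partial updates and atomic counters mean from the application's side, here is a toy in-memory Python sketch. It is not Espresso's API: `ToyDocumentStore`, its method names, the integer version scheme and the `/Profile/7` key are all assumptions made for illustration.

```python
import threading


class ToyDocumentStore:
    """In-memory stand-in for a document store with versioned resources."""

    def __init__(self):
        self._lock = threading.Lock()
        self._docs = {}  # key -> (version, fields)

    def get(self, key):
        with self._lock:
            version, fields = self._docs.get(key, (0, {}))
            return version, dict(fields)

    def conditional_put(self, key, fields, expected_version):
        # Conditional update: write only if nobody changed the document
        # since the caller read it.
        with self._lock:
            version, _ = self._docs.get(key, (0, {}))
            if version != expected_version:
                return False
            self._docs[key] = (version + 1, dict(fields))
            return True

    def partial_update(self, key, changes):
        # Partial update: merge the given fields, leave the rest untouched.
        with self._lock:
            version, fields = self._docs.get(key, (0, {}))
            self._docs[key] = (version + 1, {**fields, **changes})
            return version + 1

    def increment(self, key, field_name, delta=1):
        # Atomic counter: read-modify-write under a single lock.
        with self._lock:
            version, fields = self._docs.get(key, (0, {}))
            new_value = fields.get(field_name, 0) + delta
            self._docs[key] = (version + 1, {**fields, field_name: new_value})
            return new_value


store = ToyDocumentStore()
first = store.conditional_put("/Profile/7", {"name": "Ada"}, expected_version=0)
stale = store.conditional_put("/Profile/7", {"name": "Bob"}, expected_version=0)
store.partial_update("/Profile/7", {"headline": "Engineer"})
views = store.increment("/Profile/7", "views")
```

The second `conditional_put` is rejected because it carries a stale version, which is the behavior that lets concurrent writers avoid overwriting each other.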

23 Espresso: System Components
Partitioning/replication
Timeline consistency
Change propagation

24 Generic Cluster Manager: Helix
Generic distributed state model
Config management
Automatic load balancing
Fault tolerance
Cluster expansion and rebalancing
Used by Espresso, Databus and Search
Open source since Apr 2012: https://github.com/linkedin/helix

25 Streaming Non-transactional Events
[Diagram labels: Hadoop/DW, Espresso]

26 Kafka: High-Volume Low-Latency Messaging System
LinkedIn Data Infrastructure Solutions

27 Ingress – Offline Data Analytics
[Diagram label: Secured Hadoop/DW]

28 Kafka Architecture
[Diagram: Producers, Consumers, Zookeeper, and Broker 1 through Broker 4, each broker holding topic partitions such as topic1-part1, topic1-part2, topic2-part1, topic2-part2]
Key features:
Scale-out architecture
High throughput
Automatic load balancing
Intra-cluster replication
Per-day stats:
Writes: 10+ billion messages
Reads: 50+ billion messages
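
The scale-out idea in this slide, topics split into partitions that are spread across brokers, can be sketched in a few lines of Python. This is an illustration only, not Kafka's partitioner or assignment logic: `partition_for` and `assign_partitions` are hypothetical helpers, CRC32 is an arbitrary choice of stable hash, and replication is left out.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(topics: dict, brokers: list) -> dict:
    """Spread every topic partition over the brokers round-robin."""
    assignment = {broker: [] for broker in brokers}
    slot = 0
    for topic, count in sorted(topics.items()):
        for part in range(1, count + 1):
            broker = brokers[slot % len(brokers)]
            assignment[broker].append(f"{topic}-part{part}")
            slot += 1
    return assignment


layout = assign_partitions({"topic1": 2, "topic2": 2}, ["broker1", "broker2"])
chosen = partition_for("member:7", 4)
```

Because the hash is stable, all messages with the same key land in the same partition, which preserves per-key ordering while the partitions themselves are served by different brokers.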

29 Egress – Analytics Results for Online Serving
[Diagram label: Secured Hadoop/DW]

30 WebHDFS + Faust
LinkedIn Data Infrastructure Solutions

31 Egress – Getting Data Out from Offline
[Diagram labels: Secured Hadoop/DW, WebHDFS, Kafka, Faust]

32 Batch Environment Data Flow

33 Workflow management: Azkaban

34 Read-only Data Generation and Transfer
Map-reduce jobs generate RO files
The entire index fits in memory for fast reads
File system cache for data
Data transferred in parallel via WebHDFS
Authentication always required for each file transfer out of Hadoop
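
One way to picture "RO files" with an in-memory index is a sorted index of key hashes pointing into a data blob, searched by binary search. The Python sketch below is an assumption-laden illustration, not the actual file format: the MD5 key hashing, the `(digest, offset, length)` index entries and both function names are hypothetical.

```python
import bisect
import hashlib


def build_read_only_store(records: dict):
    """Build a sorted index plus a data blob, as an offline job might."""
    data = bytearray()
    index = []
    for key, value in records.items():
        digest = hashlib.md5(key.encode("utf-8")).digest()
        payload = value.encode("utf-8")
        index.append((digest, len(data), len(payload)))
        data.extend(payload)
    index.sort()  # sorted once at build time, never modified afterwards
    return index, bytes(data)


def lookup(index, data, key):
    """Binary-search the in-memory index, then slice the data blob."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    pos = bisect.bisect_left(index, (digest,))
    if pos < len(index) and index[pos][0] == digest:
        _, offset, length = index[pos]
        return data[offset:offset + length].decode("utf-8")
    return None


index, data = build_read_only_store({"member:7": "pymk=[9,12]", "member:9": "pymk=[7]"})
hit = lookup(index, data, "member:9")
miss = lookup(index, data, "member:404")
```

Since the files never change after generation, the index needs no locking and the data blob can be left to the file system cache, which matches the read-heavy serving pattern described on the slide.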

35 Modifiable Data Generation and Transfer
Map-reduce jobs generate records in Avro format, with annotated key and value fields
Records published from Hadoop to Kafka
Faust consumes records from Kafka
Faust streams records into Voldemort, Espresso, and other serving platforms
[Diagram: Faust with Monitoring, Throttling and Scheduling; plug-ins for Kafka, Databus, Voldemort (V.), Espresso (E.) and other data sources; Hadoop, Teradata/DWH, Kafka]

36 Summary
1. E2E: The Big-Data feedback loop is essential for product design
2. Infrastructure
   1. Data infrastructure needs continuous innovation and iteration to scale out
   2. Fast-moving, big, clean data + agile metadata = goodness
   3. Data-driven products need agile feedback infrastructure and measurement methodology
3. Methodology
   1. Data-driven experimentation enables insights and agile products
   2. Recommendation-driven products have big impact
Read more @ data.linkedin.com

37 Help us. Come Have Fun with Us!
1. Science and Data Mining: Recommendation and Optimization Problems
2. Next-generation ad-hoc and OLAP query processing on Hadoop
3. Graph Computations: Off-line mining and On-line integration loops
4. nRT Data Streams in Near-line infrastructure
5. And much more…
Info: data.linkedin.com

38 In Closing
Thank You!
lgao@linkedin.com
