©2014 Experian Information Solutions, Inc. All rights reserved. Experian Confidential.

2 Data Hub
Enabling easy and safe access to Experian’s data
Greg Bonin, Principal Scientist | Experian DataLabs
©2014 Experian Information Solutions, Inc. All rights reserved. Experian and the marks used herein are service marks or registered trademarks of Experian Information Solutions, Inc. Other product and company names mentioned herein are the trademarks of their respective owners. No part of this copyrighted work may be reproduced, modified, or distributed in any form or manner without the prior written permission of Experian. Experian Confidential.

3 Introduction and overview
How do we cost-effectively and safely provide simple access to Experian’s internal data to clients and ourselves?
From Experian Analytical Sandbox™ to data hub
- Extending the Experian Analytical Sandbox™ to other parts of Experian
- Making the Experian Analytical Sandbox™ into a delivery platform
The Experian Analytical Sandbox™ – a case study
- What is it?
- How did we build it?

4 What is the Experian Analytical Sandbox™?
An ad hoc environment where clients and internal users can access something like MAD (Monthly Analytic Dataset) and perform statistical analysis.
Key design goals
- Dataset will be shared across many users (should be scalable)
- Underlying data will be anonymized (but real data)
- Dataset should contain all records (not a sample)
- Clients have their own environment, where they may bring in data
- Clients should not be able to pull data out of the system
- Clients must be able to access data through SAS

5 Experian Analytical Sandbox™ – Data requirements
What is the MAD data?
- Raw tradeline data (one record per trade per consumer)
- Various scores and attributes (one record per consumer)
- MAD data is a 10% sample of U.S. consumers and is typically produced monthly
How much storage do we need?
- We want to store 100% of the raw files
  ► One month of 100% file is approximately 10TB (uncompressed)
- Five years of monthly history needed for analytical use
- Our total storage needs are around 700TB!
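
The storage estimate on this slide can be checked with a quick back-of-the-envelope calculation. The sketch below uses only the two figures stated on the slide (10TB per month, five years of history); the function and variable names are illustrative. The simple product comes to 600TB, in the same range as the roughly 700TB quoted; the slide does not break down the difference.

```python
# Figures taken from the slide; names are illustrative.
TB_PER_MONTH_FULL_FILE = 10   # one month of the 100% file, uncompressed
YEARS_OF_HISTORY = 5          # monthly history kept for analytical use


def raw_storage_tb(tb_per_month, years):
    """Uncompressed size, in TB, of a monthly history spanning `years` years."""
    return tb_per_month * 12 * years


total_tb = raw_storage_tb(TB_PER_MONTH_FULL_FILE, YEARS_OF_HISTORY)
print(f"{YEARS_OF_HISTORY * 12} monthly files x "
      f"{TB_PER_MONTH_FULL_FILE} TB = {total_tb} TB uncompressed")
```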

6 Experian Analytical Sandbox™ – Design overview
- Utilize Hadoop as a cost-efficient, scalable data store
- Access data through Hive
- Strong authentication via Kerberos
- Leverage Citrix to ensure all data stays within Experian

7 What Do We Have Now
Cluster specs
- 30-node Hadoop cluster running CDH
- 128GB of memory and 16 cores per data node
- 700TB total disk (usable ~230TB)
Cost
- ~$700,000 for hardware
- Funded by CIS
Usage
- Currently have one client (AMEX); the current contract recovers most of the initial cost
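
The gap between 700TB of total disk and roughly 230TB usable is consistent with HDFS keeping three copies of every block. The slide does not state the replication factor, so the value of 3 below (the HDFS default) is an assumption, as are the names. The per-node and per-terabyte figures are derived from the slide's numbers only for illustration.

```python
# Figures from the slide.
RAW_DISK_TB = 700
NODES = 30
HARDWARE_COST_USD = 700_000

# Assumption: HDFS default replication of 3; the slide does not state it.
REPLICATION_FACTOR = 3


def usable_tb(raw_tb, replication):
    """Capacity left for unique data when every block is stored `replication` times."""
    return raw_tb / replication


usable = usable_tb(RAW_DISK_TB, REPLICATION_FACTOR)
raw_per_node = RAW_DISK_TB / NODES
cost_per_usable_tb = HARDWARE_COST_USD / usable

print(f"usable capacity: ~{usable:.0f} TB")
print(f"raw disk per node: ~{raw_per_node:.1f} TB")
print(f"hardware cost per usable TB: ~${cost_per_usable_tb:,.0f}")
```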

8 Shared data store – Why Hadoop?
We need to store and access large amounts of data in a cost-effective way
- Works well with off-the-shelf hardware
- Can meet performance needs by adding servers
- Limited licensing costs
We want to make the data access easy and flexible
- Hadoop supports several SQL-like languages (Hive, Impala, etc.)
- We needed to integrate with SAS, which works with Hive
Usage pattern fits well with Hadoop

9 Technical challenges
Hadoop does not have strong authentication by default
- We used Kerberos to handle the authentication … which was painful to set up
- Complicates client applications, as they need to support Kerberos
SAS and Hadoop are not ideal bedfellows
- Pulling large quantities of data down through SAS is slow
- It is hard to force SAS to utilize the cluster efficiently
- Managing DB permissions with SAS is annoying

10 Case study – Using the Experian Analytical Sandbox™ to answer questions
“What is the trend of VantageScore® for people who recently obtained an auto loan?”
[Chart series: Auto opened recently, Has auto, Overall]
- A simple SQL query was able to answer this question in 2.5 minutes
  ► Process involved joining a 2TB file with a 250GB file
- Similar analysis using SAS on a single server could take 50-100x longer (roughly 2 to 4 hours)
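
The slide does not show the query itself. As a rough illustration of the kind of join it describes, the sketch below runs a similar query against a tiny in-memory SQLite database standing in for Hive. The table names, column names, sample rows, and the definition of "recently" (auto trade opened in the same month as the score snapshot) are all assumptions made for this example, not Experian's schema or method.

```python
import sqlite3

# In-memory SQLite stands in for Hive; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scores (consumer_id INTEGER, month TEXT, vantage_score INTEGER);
CREATE TABLE trades (consumer_id INTEGER, trade_type TEXT, open_month TEXT);

INSERT INTO scores VALUES
  (1, '2014-01', 700), (1, '2014-02', 710),
  (2, '2014-01', 640), (2, '2014-02', 650),
  (3, '2014-01', 600), (3, '2014-02', 620),
  (4, '2014-01', 680), (4, '2014-02', 690);

INSERT INTO trades VALUES
  (1, 'AUTO', '2014-01'),
  (2, 'AUTO', '2014-02'),
  (3, 'MORTGAGE', '2013-06'),
  (4, 'AUTO', '2014-01');
""")

# Join the per-consumer score table (one row per consumer per month) to the
# per-trade table, keeping consumers whose auto trade opened in that month,
# then average the score by month to get a trend.
QUERY = """
SELECT s.month,
       ROUND(AVG(s.vantage_score), 1) AS avg_score,
       COUNT(*) AS consumers
FROM scores AS s
JOIN trades AS t ON t.consumer_id = s.consumer_id
WHERE t.trade_type = 'AUTO'
  AND t.open_month = s.month
GROUP BY s.month
ORDER BY s.month
"""

rows = conn.execute(QUERY).fetchall()
for month, avg_score, consumers in rows:
    print(month, avg_score, consumers)
```

On a cluster, the same shape of query would be distributed across the data nodes, which is where the speed-up over a single-server analysis reported on the slide would come from.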

11 Building Experian Analytical Sandboxes™ for other parts of Experian
Experian Analytical Sandbox™ name | Type of data | Potential use
Business Information Services | Raw trade line data | Similar to use case for Experian Analytical Sandbox™ except less regulatory sensitivity
Healthcare | Claims and eligibility checks from Experian Healthcare | Provide researchers or private parties a rich data set to analyze
Digital Advertising | IP impression information (Audience IQ℠); device IDs (41st Parameter®) | Allow third parties to use this data for model-building or reporting
ConsumerView℠ | Monthly-trended ConsumerView℠ data | Provide insight into changes in demographic data over time
- Opportunities exist to build more sandboxes
- Building more sandboxes across Experian’s data assets will allow broad, safe access to data
- We believe this would lead to increased opportunities for innovation

12 The “Cloud” landscape
[Diagram: two dimensions, Data and Tools, each labeled Client-driven and Proprietary]

13 From Experian Analytical Sandbox™ to Data Hub
- Extending our design will allow solutions developed in the Experian Analytical Sandbox™ to be deployed
- Using Experian tools will allow quick deployment of models
  ► Example: Model outputs written in PMML would allow quick deployment

14 Conclusion
- The Experian Analytical Sandbox™ is one way to make Experian’s internal data easier to access and use
  ► Making access to data easier reduces barriers to innovation
- Extending the functionality of the Experian Analytical Sandbox™ could lead to a new way of using Experian data
  ► Easy and safe access to raw data can allow clients to understand their customers better
  ► Streamlined deployment can make those insights actionable

15 #FOIC2014

16 Greg Bonin
Principal Scientist, Experian DataLabs
e: greg.bonin@experian.com
t: (858) 314-2613

