Analytics as a First-Class Concern
  June 3, 2016
  Calum Murray, Small Business Data Chief Architect, Intuit
Who we serve
  Small Businesses
  Accounting Professionals
  Consumers
Our mission: To improve our customers’ financial lives so profoundly… they can’t imagine going back to the old way
Transformation to a cloud ecosystem
  As Intuit evolved QuickBooks, QuickBooks Payroll, QuickBooks Payments, and other product offerings into a SaaS business and an open cloud platform, business analytics could no longer be treated as an afterthought; it had to be part of the platform architecture as a first-class concern.
  Stages shown on the slide: Desktop Business, SaaS Business, Portfolio of Products, Ecosystem
Intuit analytics problem space
  Solve for data lifecycle: all stages are needed
  Solve for internal users: data runs the business
  Solve for external users: data enables customer delight
Internal stakeholders
  Marketing: level of campaign success
  Product: first-time use, driving attach
  Care: 360 view of customer, understanding product usage
  Sales: success against sales targets
  Finance: financial reports, how the business is doing
Top platform analytics data concerns
  Key data sources
    Clickstream
    Transactional user-entered data
    Back office data and insights
  Key cross-cutting concerns
    Traceability: customer ID, transaction ID
    REACTive platform architecture
    Analytics infrastructure
    Model congruity
    Sources of truth
  [Architecture diagram. Labeled components: Applications, Micro services (Write, Read), OLTP product DBs, PUB/REACT, Kafka, Spark Streaming, Ingest, Ingest (batch), Consume, Analyst tools, Back office systems, Enterprise (Marketing, Care), EL, Data lake (Hadoop/Hive), Data warehouse (Vertica)]
Key data sources
  Slide labels: Entry points; Product usage; Product data; Clickstream; Transactional; Billing; Customer contacts; Campaign metrics; Life-time value; Propensity scores; Enterprise; Insights
Key cross-cutting concerns
  Analytics infrastructure
    Designed as part of SaaS platform
    Batch and near-real-time
  Congruent models
  Single sources of truth
  Reactive pattern
    Clickstream
    Transactional
  Traceability
    One ID(s) to bind them
    Customer ID
    Transaction ID
  Consume and feed back
    Enterprise
    Insights
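To make the traceability concern concrete, here is a minimal Python sketch of how a shared customer ID and transaction ID could bind clickstream events to the transactional record they led to. The record shapes, field names, and the `trace` helper are assumptions made for illustration; they are not taken from the talk or from any Intuit system.

```python
from collections import defaultdict

# Hypothetical records; all field names are assumptions for illustration.
clickstream = [
    {"customer_id": "c1", "transaction_id": "t1", "event": "open_invoice_form"},
    {"customer_id": "c1", "transaction_id": "t1", "event": "click_save"},
    {"customer_id": "c2", "transaction_id": "t2", "event": "open_invoice_form"},
]
transactions = [
    {"customer_id": "c1", "transaction_id": "t1", "type": "invoice", "amount": 120.0},
]


def trace(clicks, txns):
    """Attach to each transaction the clickstream events sharing its IDs."""
    events_by_key = defaultdict(list)
    for click in clicks:
        key = (click["customer_id"], click["transaction_id"])
        events_by_key[key].append(click["event"])
    return [
        {**txn, "events": events_by_key.get((txn["customer_id"], txn["transaction_id"]), [])}
        for txn in txns
    ]


traced = trace(clickstream, transactions)
```

Without the two shared IDs on both record types, this join has nothing to key on, which is why traceability is hard to retrofit once applications are already emitting data.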
Where we started (in the cloud)
  Monolithic, siloed applications; inconsistent clickstream collection
  Monolithic data stores with disparate models; multiple sources of truth
  Siloed enterprise data
  Fragmented IDs; no traceability across applications
  Batch transactional data ingestion; no real time
  Enterprise data and insights not going into the lake; enterprise systems pulling data into their own data warehouses
  [Diagram. Labeled components: Applications, Analyst tools, Ingest, Consume, Enterprise (Marketing, Care), EL, Data lake (Hadoop/Hive), Data warehouse (Netezza)]
Size of data
  Not big, but complex given the many sources and shapes
  OLTP customer data: ~70 TB across 10+ schemas
  Data warehouses: Analytics ~100 TB, Risk ~31 TB
  Clickstream: ~50 TB
The journey we are on ...
  1. Decomposition and re-decomposition of platform
    Break up monoliths and reassemble as decomposed services
    Define single sources of truth
    Data encapsulation and model alignment: data storage and APIs
  2. Asynchronous near real-time architecture
    Move platform to REACT pattern
    Make analytics part of the platform
  3. One data lake and analytics system
    Kill the clones and centralize
  4. Back office integration: virtuous cycle
    Kill the clones and centralize
  [Architecture diagram. Labeled components: Micro services (Write, Read), Single sources of truth, PUB/REACT, Kafka, Spark Streaming, Ingest, Ingest (batch), Consume, Analyst tools, Back office systems, Enterprise (Marketing, Care), EL, Data lake (Hadoop/Hive), Data warehouse (Vertica)]
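The publish-and-react idea in step 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform's implementation: `EventBus` is a hypothetical in-memory stand-in for a message broker such as Kafka, it delivers events synchronously rather than asynchronously, and the topic name, handlers, and event fields are invented for the example.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical in-memory stand-in for a broker topic (synchronous delivery)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Every subscriber reacts to the same published event.
        for handler in self._subscribers[topic]:
            handler(event)


bus = EventBus()
search_index = {}   # stand-in for another service's read model
analytics_log = []  # stand-in for the analytics ingest path


def update_search_index(event):
    search_index[event["invoice_id"]] = event["customer_id"]


def record_for_analytics(event):
    analytics_log.append(event)


bus.subscribe("invoice.created", update_search_index)
bus.subscribe("invoice.created", record_for_analytics)


def create_invoice(invoice_id, customer_id, amount):
    """Write path: persist to the source of truth (omitted), then publish once."""
    event = {"invoice_id": invoice_id, "customer_id": customer_id, "amount": amount}
    bus.publish("invoice.created", event)
    return event


create_invoice("inv-1", "c1", 120.0)
```

The point of the sketch is that the analytics handler is just another subscriber: the writing service publishes once and does not need to know that analytics is listening.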
The journey we are on: people
  Insufficient investment in people for data
    Concentration on application/services engineers
    Congruent horizontal data not viewed as a necessity
    Analytics was an afterthought
  Investment after the fact was even bigger
    Cleaning up the mess
    All layers are impacted to get to a good state
  Invest in data early or pay the price later
Key takeaways so far
  Analytics needs to be part of your platform, not an adjunct
  Data models in the application have a big impact on the ability to get insights
  Lack of traceability in the application will torpedo you; it is hard to add after the fact
  Analytics pipeline needs to be treated as first-class, deployable software
    You need engineers as well as data scientists
    You need CI/CD, unit testing, the right environments
  REACTive platform architecture makes it easier
    To decompose your models
    To do near real-time analytics
  Tooling is very important
    Dashboards
    Automated reporting
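As one way to read the "first-class, deployable software" takeaway, the sketch below keeps a pipeline step as a plain function with a unit test beside it, so it can run in CI like any other code. The metric (first-use events per day), the event fields, and the function names are assumptions chosen for illustration; the test follows the `test_` naming convention that runners such as pytest collect.

```python
def first_use_per_day(events):
    """Count 'first_use' events per calendar day (YYYY-MM-DD prefix of timestamp)."""
    counts = {}
    for event in events:
        if event.get("type") != "first_use":
            continue  # ignore every other event type
        day = event["timestamp"][:10]
        counts[day] = counts.get(day, 0) + 1
    return counts


def test_first_use_per_day():
    # Hypothetical events; timestamps are ISO 8601 strings.
    events = [
        {"type": "first_use", "timestamp": "2016-06-03T09:15:00Z"},
        {"type": "first_use", "timestamp": "2016-06-03T17:40:00Z"},
        {"type": "login", "timestamp": "2016-06-03T18:00:00Z"},
        {"type": "first_use", "timestamp": "2016-06-04T08:00:00Z"},
    ]
    assert first_use_per_day(events) == {"2016-06-03": 2, "2016-06-04": 1}
    assert first_use_per_day([]) == {}
```

Keeping the transform free of I/O is a design choice that makes it testable without a data lake or warehouse environment; the same function could then be called from a batch or streaming job.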
Q&A