[Figure: system architecture diagram. Labels: Business Discovery, Monitoring & Reporting; Data Flow; iCLM UI; Operator Systems; OCS; IN; CDR; PCC; CRM; Marketing Operations; CSR; Monitoring; Marketing Integration Layer; RT Complex Event Processing; Decisioning Engine; Visual Rules; Subscriber Profile; Channels; Subscriber Data Store; HBase; Big Data Analytics; Hive; DWH.]
Legacy Architecture
Legacy System analysis
In 2011 we reached the system’s glass ceiling: 10M subscribers and 120M events per day. We analyzed the architecture bottlenecks and identified the following issues:
- Real-time sub-system:
  - Queue management in Oracle: incoming/outgoing data streams, application logs.
  - Subscriber state BLOB update in Oracle (random access).
- Analytical sub-system:
  - Large joins between dozens of fact/dimension/entity tables.
  - ETL from the OLTP DB.
Architecture blueprint
To overcome the problems raised in the analysis phase, we made several architectural decisions:
- Queue management: in a distributed file system supporting 10s of millions of small files (<10s of MB).
- Subscriber OLTP state management: in a NoSQL key-value store.
- Analytical workflows: should work over files holding subscriber aggregates in a BLOB, thus avoiding large joins.
We examined several solution technologies and concluded that Big Data would provide the best TCO but was lacking in enterprise readiness. We identified extra-functional requirements to support the system quality attributes and conducted an RFP to select the Big Data vendor.
We conducted an RFP to select the most telco-grade platform. The RFP focused on non-functional capabilities such as sustainable performance, high availability, and manageability.
The approach
- Each step should increase scalability and reduce TCO.
- Runtime (OLTP) processing:
  - We replace the underlying plumbing, with minimal changes to business logic.
  - All changes can be turned on/off by GUI configuration: a modular hybrid architecture.
  - Ability to work in dual mode: good for QA, but also for production (legacy).
  - The upgrade path from legacy is kept in all phases.
- Analytics processing:
  - Calculate the Profile in M/R (Java): scalable, and we have the best Java developers.
  - Wrap it with a DSL (domain-specific language): that is how we have worked for years (ModelTalk paper), and non-Java programmers can do the job.
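The dual-mode idea above can be illustrated with a minimal sketch. Every name here (`QueueBackend`, `InMemoryBackend`, `DualModeQueue`, the `mode` values) is hypothetical and not taken from the product. The sketch only assumes that a configuration value selects which plumbing carries the events, while the business logic calls a single interface.

```python
from typing import List, Protocol


class QueueBackend(Protocol):
    """Hypothetical plumbing interface; the business logic sees only this."""

    def put(self, event: str) -> None: ...


class InMemoryBackend:
    """Stand-in for a real backend (legacy database queue or new file queue)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.events: List[str] = []

    def put(self, event: str) -> None:
        self.events.append(event)


class DualModeQueue:
    """Routes events according to a configuration value (assumed values below).

    mode = "legacy": only the legacy backend receives events
    mode = "new":    only the new backend receives events
    mode = "dual":   both receive events, so their outputs can be compared
    """

    def __init__(self, legacy: QueueBackend, new: QueueBackend, mode: str) -> None:
        if mode not in ("legacy", "new", "dual"):
            raise ValueError(f"unknown mode: {mode}")
        self.legacy = legacy
        self.new = new
        self.mode = mode

    def put(self, event: str) -> None:
        if self.mode in ("legacy", "dual"):
            self.legacy.put(event)
        if self.mode in ("new", "dual"):
            self.new.put(event)


legacy = InMemoryBackend("legacy")
new = InMemoryBackend("new")
queue = DualModeQueue(legacy, new, mode="dual")
queue.put("event-1")
```

In dual mode both backends hold the same event, which is what makes side-by-side comparison in QA possible; switching `mode` back to `"legacy"` keeps the old path intact.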
Phase 1
Phase | # Customers | # Events
Legacy | 10M | 120M
Phase 1 | 200M
Phase 1 – File queues in NFS
Resulting context:
- Pure plumbing change: no changes to business logic code.
- Offloading Oracle: 2x performance boost.
- No Big Data technology.
- Windows NFS client performance is a bottleneck.
Phase | # Customers | # Events
Legacy | 10M | 120M
Phase 1 | 200M
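A file-based queue of the kind described on this slide can be sketched as follows. This is an illustration under assumptions, not the product's implementation: the function names, the `.msg`/`.tmp` suffixes, and the write-then-rename publishing step are all hypothetical. Whether a rename is atomic on a particular NFS client is something to verify for the target environment.

```python
import os
import tempfile
import uuid
from typing import List


def enqueue(queue_dir: str, payload: bytes) -> str:
    """Write one message as a small file, then publish it with a rename."""
    name = f"{uuid.uuid4().hex}.msg"
    tmp_path = os.path.join(queue_dir, name + ".tmp")
    final_path = os.path.join(queue_dir, name)
    with open(tmp_path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    # Rename after the write completes so readers skip half-written files.
    os.replace(tmp_path, final_path)
    return final_path


def drain(queue_dir: str) -> List[bytes]:
    """Read and delete every published message; in-progress *.tmp files are ignored."""
    messages = []
    for name in sorted(os.listdir(queue_dir)):
        if not name.endswith(".msg"):
            continue
        path = os.path.join(queue_dir, name)
        with open(path, "rb") as f:
            messages.append(f.read())
        os.remove(path)
    return messages


# A temporary local directory stands in for the NFS mount point.
queue_dir = tempfile.mkdtemp(prefix="file-queue-")
enqueue(queue_dir, b"event-1")
enqueue(queue_dir, b"event-2")
drained = drain(queue_dir)
```

Because file names are random, this sketch does not preserve arrival order; a real queue would need a sequence number or timestamp in the file name if ordering matters.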
Reverse engineering of the SQL code
Phase 2
Phase | # Customers | # Events
Legacy | 10M | 120M
Phase 1 | 200M
Phase 2 | unlimited
Phase 2 – Introducing MapR Hadoop Cluster
Resulting context, MapR FS + NFS:
- Horizontally scalable.
- Cheap compared to high-end NFS solutions.
- Fast and highly available (using VIPs).
- Avoids another hop to HDFS (Flume, Kafka).
- Many small files are stored in HDFS (100s of millions); no need to merge files.
Phase | # Customers | # Events
Legacy | 10M | 120M
Phase 1 | 200M
Phase 2 | unlimited
Phase 2 – Introducing MapR Hadoop Cluster
Resulting context:
- Avro files:
  - Complex object graph.
  - Troubleshooting with Pig.
  - Out-of-the-box upgrade (e.g. adding a field).
- Map/Reduce is incremental: the Avro record captures the subscriber state.
- Map/Reduce efficiency: avoids huge joins.
- Subscriber Profile calculation:
  - Performance: 2-3 hours.
  - Linear scalability: no limitation on the number of subscribers or the amount of raw data (buy more nodes).
  - A fast run over history data allows for an early launch.
- Sqoop: very fast insertions to MS-SQL (10s of millions of records in minutes).
- Data analysts started working in the Hive environment.
- No HA for Oozie yet.
- Hue is not yet mature.
- MS-SQL and ODBC over Hive is slow and limited.
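The incremental Map/Reduce point can be made concrete with a small sketch of the reduce step: the previous profile record for a subscriber is folded together with only the new events, so no join over the full history is needed. The slides describe a Java Map/Reduce job over Avro records; the plain Python below only shows the shape of the step, and the aggregate fields (`event_count`, `total_amount`) and the function name are hypothetical.

```python
from typing import Dict, Iterable, Optional

Profile = Dict[str, int]


def reduce_subscriber(previous: Optional[Profile], events: Iterable[dict]) -> Profile:
    """Fold a batch of new events into the subscriber's previous profile record."""
    if previous is None:
        # First run for this subscriber: start from empty aggregates.
        profile: Profile = {"event_count": 0, "total_amount": 0}
    else:
        # Copy so the previous record stays unchanged.
        profile = dict(previous)
    for event in events:
        profile["event_count"] += 1
        profile["total_amount"] += event.get("amount", 0)
    return profile


# Day 1 has no previous state; day 2 starts from the day-1 record.
day1 = reduce_subscriber(None, [{"amount": 5}, {"amount": 7}])
day2 = reduce_subscriber(day1, [{"amount": 3}])
```

Each run reads yesterday's record plus today's events, so the cost of a run depends on the size of the new batch rather than on the full event history.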
Phase 3
Phase | # Customers | # Events
Legacy | 10M | 120M
Phase 1 | 200M
Phase 2 | unlimited
Phase 3 | 300M
Phase 3 – Introducing MapR M7 Table
- Extensive YCSB load tests to find the best table structure and read/update granularity. Main conclusions:
  - M7 knows how to handle a very big heap: 90GB.
  - Update granularity: small updates (using columns) = fast reads. (*) In other KV stores, the entire BLOB needs to be updated.
- CSR tables migrated from Oracle to M7 Table:
  - 10s of billions of records.
  - Need sub-second random access per subscriber.
  - 99.9% writes: by Runtime machines (almost every event-processing operation produces an update).
  - 0.1% reads: by the customer’s CSR representative.
  - Rows: per subscriber key, 10s of millions.
  - 2 CFs: TTL 365 days, 1 version.
  - Qualifier: key: [date_class_event_id], value: record. Up to thousands per row.
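The row layout described above (one row per subscriber, one column per event record, qualifier built from date, class, and event id) can be sketched with a plain dictionary standing in for a table row. This is a sketch under assumptions: the `YYYYMMDD` date encoding, the `_` separator, and the function names are hypothetical, and expiry is emulated from the date in the qualifier, whereas a real table would expire cells by their write timestamp.

```python
from datetime import date, timedelta
from typing import Dict

TTL_DAYS = 365  # from the slide: the column families use a 365-day TTL


def make_qualifier(event_date: date, event_class: str, event_id: str) -> str:
    """Build a column qualifier following the date_class_event_id pattern."""
    return f"{event_date.strftime('%Y%m%d')}_{event_class}_{event_id}"


def put_record(row: Dict[str, bytes], qualifier: str, record: bytes) -> None:
    """Add one column to the subscriber row; existing columns are untouched."""
    row[qualifier] = record


def live_columns(row: Dict[str, bytes], today: date) -> Dict[str, bytes]:
    """Emulate TTL expiry by dropping columns dated before the cutoff."""
    cutoff = (today - timedelta(days=TTL_DAYS)).strftime("%Y%m%d")
    return {q: v for q, v in row.items() if q.split("_", 1)[0] >= cutoff}


row: Dict[str, bytes] = {}
put_record(row, make_qualifier(date(2013, 1, 1), "sms", "42"), b"old")
put_record(row, make_qualifier(date(2014, 1, 1), "call", "43"), b"new")
visible = live_columns(row, today=date(2014, 6, 1))
```

Writing one small column per event is what keeps each update cheap: nothing already stored in the row has to be read back or rewritten.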
Phase 3 – Introducing MapR M7 Table
Resulting context:
- Choosing the right features: not too demanding performance-wise.
- Easy to create and manage tables, although some tweaking is still needed.
- No cross-table ACID: need to develop a solution for keeping consistency across M7 Table, Oracle, and the file system.
- Hard for QA compared to an RDBMS: no easy way to query, so tools need to be developed.
Phase | # Customers | # Events
Legacy | 10M | 120M
Phase 1 | 200M
Phase 2 | unlimited
Phase 3 | 300M
Phase 4
Phase | # Customers | # Events
Legacy | 10M | 120M
Phase 1 | 200M
Phase 2 | unlimited
Phase 3 | 300M
Phase 4
Phase 4 – Migrating OLTP features to M7 tables
- Subscriber State table migrated from Oracle to M7 Table:
  - 25% writes: by Runtime machines updating the state.
  - 100% reads: by Runtime.
  - Rows: per subscriber key, 10s of millions.
  - 1 CF: TTL -1, 1 version.
  - YCSB to validate the solution.
  - Sizing model.
  - Qualifier: key: state_name, value: state value. Dozens per row, but only 10% are updated per event.
- Subscriber Profile table migrated from MS-SQL to M7 Table: bulk insert once a day.
- Outbound Queue table migrated from MS-SQL to M7 Table.
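Since each state lives in its own column and only about 10% of them change per event, a writer can send just the changed columns. The sketch below shows one way to compute that delta; the function name and the `state_N` column names are hypothetical, and it does not handle states that are removed.

```python
from typing import Dict


def changed_columns(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, str]:
    """Return only the state columns whose value differs from the stored one."""
    return {name: value for name, value in new.items() if old.get(name) != value}


# 20 stored states; processing one event changes 2 of them (10%).
stored = {f"state_{i}": "0" for i in range(20)}
updated = dict(stored, state_3="1", state_7="5")
delta = changed_columns(stored, updated)
```

Only the entries in `delta` would be written back, which keeps the write volume proportional to what actually changed rather than to the full subscriber state.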
Phase 4 – Migrating OLTP features to M7 tables
Resulting context:
- No longer dependent on Oracle for OLTP.
- Real-time processing can handle billions of events per day.
- Sizing is linear and easy to calculate: number of subscribers * state size * 80% should reside in cache.
- HW spec: 128GB RAM, 12 SAS drives.
- Consistency management is very complicated.
Phase | # Customers | # Events
Legacy | 10M | 120M
Phase 1 | 200M
Phase 2 | unlimited
Phase 3 | 300M
Phase 4
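The sizing rule above is simple arithmetic, shown here as a worked sketch. Only the formula (subscribers * state size * 80%) comes from the slide. The state size of 10,000 bytes and the 64 GB of cache per node are made-up inputs for illustration, and the node-count step is an added assumption about how the rule might be applied.

```python
def required_cache_bytes(subscribers: int, state_size_bytes: int,
                         cached_percent: int = 80) -> int:
    """Bytes of state that should reside in cache, rounded up."""
    total = subscribers * state_size_bytes * cached_percent
    return -(-total // 100)  # integer ceiling division


def nodes_needed(cache_bytes: int, cache_per_node_bytes: int) -> int:
    """Nodes required if each node contributes a fixed amount of cache."""
    return -(-cache_bytes // cache_per_node_bytes)


# Hypothetical inputs: 10M subscribers, 10,000 bytes of state each,
# and 64 GB of usable cache per node.
cache = required_cache_bytes(subscribers=10_000_000, state_size_bytes=10_000)
nodes = nodes_needed(cache, cache_per_node_bytes=64 * 10**9)
```

With these inputs the rule gives 80 GB of cached state, which needs two such nodes. Doubling the subscriber count doubles the cache requirement, which is the linear behavior the slide refers to.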
Phase 5
Summary
- We migrated to Big Data over 5 product versions spanning 1.5 years.
- The software architects were dominant in defining the product roadmap.
- The software architect has a paramount role in Big Data architecture.
- Having a well-defined architecture allows for controlled, well-planned architecture changes with minimal to no rework.
Atzmon Hen-Tov, Lior Schachter