Building a Threat-Analytics Multi-Region Data Lake on AWS
Ori Nakar, 2018
About me
- Researcher at Imperva
- Web application and database security
- Software development methodology and architecture
- Cloud computing, AWS, Docker and Big Data
Agenda
- What is a Data Lake?
- Data Lake structure and flow example
- Threat-Analytics Data Lake architecture
- Multi-region queries
- Demo
Our Story
- Data was almost in our hands – we saw it coming and going
- We needed a solution for storing it in a way we could use it
- We did not know how much data we were going to keep, or for how long
- New business use-cases were on the way
- Because the data is large and the requirements are unknown, we decided to go with a Data Lake
Data Lake
- A collection of files stored in a distributed file system
- Information is stored in its native form, with little or no processing
- Flexible, and allows a great amount of data to be stored, queried and analyzed
The Data
Data Lake's Data:
- All data, even unused
- Structured, Semi-Structured, or Unstructured
- Transformed when ready to be used
Database Data:
- Structured and Transformed
- Added per use case
The Users
Operational:
- Want to get their reports and slice their data
Data Experts (advanced):
- Go back to the data source
- Deep analysis
Answer new business questions faster
Database:
- Create a schema
- Plan your queries
- Add indices
Data Lake:
- Store what you get
Query Engine
[Diagram: a Database is tied to its single built-in query engine, while a Data Lake can be served by multiple query engines (Query Engine 1, Query Engine 2) over the same data]
Data Structure Example

raw data/events/
  day=2018-1-1/
    file1.csv
    file2.csv
    file3.csv
  day=2018-1-2/

tables/events/
  day=2018-1-1/
    type=1/
      file1.parquet
      file2.parquet
    type=2/
      file3.parquet
      file4.parquet
    type=3/
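To make a layout like this queryable, it is typically registered as a partitioned external table. A minimal boto3 sketch, assuming hypothetical bucket and database names (my-data-lake, my-athena-results, threat_analytics):

```python
# Minimal sketch: register the layout above as a partitioned Athena table.
# Bucket, database and result-location names are assumptions.
import boto3

athena = boto3.client("athena", region_name="us-east-1")
result_cfg = {"OutputLocation": "s3://my-athena-results/"}  # assumed bucket

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS events_raw (
    time     timestamp,
    type     int,
    message  string,
    severity string
)
PARTITIONED BY (day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-data-lake/raw-data/events/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "threat_analytics"},
    ResultConfiguration=result_cfg,
)

# Discover the day=... folders as table partitions
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE events_raw",
    QueryExecutionContext={"Database": "threat_analytics"},
    ResultConfiguration=result_cfg,
)
```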
CSV to Parquet Example

Parquet file metadata:
- Count: 4
- Column metadata – place in file, min, max, compression info

Column-oriented storage:
- Time: 1/1/2018 1:00, 1/1/2018 1:01, 1/1/2018 1:05, 1/1/2018 1:11
- Type: 1, 1, 2, 3
- Message: Text 1, Text 2, Text 3, Text 4
- Severity: Low, High, Medium, Low

Row-oriented source (CSV):
Time          | Type | Message | Severity
1/1/2018 1:00 | 1    | Text 1  | Low
1/1/2018 1:01 | 1    | Text 2  | High
1/1/2018 1:05 | 2    | Text 3  | Medium
1/1/2018 1:11 | 3    | Text 4  | Low
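A standalone sketch of the format change, assuming pyarrow and the example file names above; in this talk's pipeline the conversion is actually done by Athena (see the flow example below):

```python
# Standalone sketch of the CSV -> Parquet conversion, assuming pyarrow
# and the example file names above.
import pyarrow.csv as pv
import pyarrow.parquet as pq

table = pv.read_csv("file1.csv")         # row-oriented source
pq.write_table(table, "file1.parquet")   # columnar output; Parquet stores
                                         # per-column metadata (min/max,
                                         # compression info) automatically

# Columnar payoff: read only the columns you need, skip the rest of the file
severities = pq.read_table("file1.parquet", columns=["severity"])
print(severities)
```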
Architecture and Flow
[Diagram: clients – the AWS Management Console, a DBeaver SQL client, and Python boto3 – all query through Amazon Athena; each region (us-east-1, eu-west-1, ap-northeast-1) holds an S3 raw data bucket and an S3 aggregated data bucket, served by Amazon Athena in that region]
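A sketch of the "Python boto3" client path from the diagram: submit a query to Athena and poll until it finishes. The API calls are the real boto3 ones (start_query_execution, get_query_execution, get_query_results); the database and result-bucket names are assumptions:

```python
# Submit a query to Athena and poll for the result via boto3.
# Database and result-bucket names are assumptions.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query_id = athena.start_query_execution(
    QueryString="SELECT severity, count(*) AS events FROM events GROUP BY severity",
    QueryExecutionContext={"Database": "threat_analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Athena runs queries asynchronously -- poll for a terminal state
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```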
Flow Example
Input: day=2018-1-1/ (file1.csv, file2.csv)
Transform query:
  SELECT time, type, message, severity
  FROM events
  WHERE day = '2018-1-1'
  ORDER BY type, severity
Output: type=1/ (data_file1.parquet, data_file2.parquet), type=2/ …
AWS Athena is used to process the data hourly / daily – filter, join, partition, sort and aggregate it into Parquet files (one way to do this is sketched below)
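One way to implement this step is an Athena CTAS (CREATE TABLE AS SELECT) query, which writes partitioned Parquet directly. Table, database, and bucket names below are assumptions; in Athena CTAS the partition column must come last in the SELECT, and external_location must be an empty S3 prefix:

```python
# Hourly/daily step as an Athena CTAS that writes type-partitioned Parquet.
# Table/database/bucket names are assumptions.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ctas = """
CREATE TABLE events_day_2018_1_1
WITH (
    format = 'PARQUET',
    external_location = 's3://my-data-lake/tables/events/day=2018-1-1/',
    partitioned_by = ARRAY['type']
) AS
SELECT time, message, severity, type
FROM events_raw
WHERE day = '2018-1-1'
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "threat_analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```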
Multi-region queries
- Data is saved in multiple regions
- Two options are available with Athena:
  - A single query engine in one of the regions – like in the good old days
  - A query engine per region – better performance, but more work (see the fan-out sketch below)
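A sketch of the second option: run the same query through an Athena client in each region, then merge the rows on the client. The region names match the architecture slide; the other names are assumptions:

```python
# Query engine per region: fan the same query out to each region's Athena
# and merge client-side. Database/bucket names are assumptions.
import boto3

REGIONS = ["us-east-1", "eu-west-1", "ap-northeast-1"]
QUERY = "SELECT severity, count(*) AS events FROM events GROUP BY severity"

query_ids = {}
for region in REGIONS:
    athena = boto3.client("athena", region_name=region)
    query_ids[region] = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "threat_analytics"},
        ResultConfiguration={"OutputLocation": f"s3://my-athena-results-{region}/"},
    )["QueryExecutionId"]

# All three queries now run in parallel, each engine close to its own data;
# poll and fetch each region's rows (as in the earlier sketch) and merge.
```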
Demo Query
Summary
We got to better analytics by using:
- More data
- SQL and other engines' capabilities
- Queries across multiple regions
Improvements:
- Cost reduction in storage and compute
- No need to maintain servers
Will it work for you too? Tip: do a POC with real data
orin@imperva.com