Building a Threat-Analytics Multi-Region Data Lake on AWS
Ori Nakar 2018
About me
Researcher at Imperva: web application and database security. Background in software development methodology and architecture, cloud computing, AWS, Docker, and Big Data.
Agenda
- What is a Data Lake?
- Data Lake structure and flow example
- Threat-Analytics Data Lake architecture
- Multi-region queries
- Demo
Our Story
The data was almost in our hands: we saw it coming and going, and we needed a solution for storing it in a way we could use. We did not know how much data we were going to keep, or for how long, and new business use cases were on the way. Because the data is large and the requirements were unknown, we decided to go with a Data Lake.
Data Lake
A collection of files stored in a distributed file system. Information is stored in its native form, with little or no processing. It is flexible and allows great amounts of data to be stored, queried, and analyzed.
The Data
Data Lake data: all data, even unused; structured, semi-structured, or unstructured; transformed when ready to be used.
Database data: structured and transformed; added per use case.
The Users
Operational users: want to get their reports and slice their data.
Advanced users: go back to the data source.
Data experts: deep analysis.
Answer new business questions faster
Database: create a schema, add indices, plan your queries.
Data Lake: store what you get.
Query Engines
A database is tied to a single built-in query engine; a Data Lake can be queried by multiple query engines (Query Engine 1, Query Engine 2) over the same stored files.
Data Structure Example
raw data/events:
  day= : file1.csv, file2.csv, file3.csv
  day= : ...
tables/events:
  day= :
    type=1: file1.parquet, file2.parquet
    type=2: file3.parquet, file4.parquet
    type=3: ...
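The partition layout above can be sketched as small path-building helpers. This is a minimal illustration, not the talk's code: the `raw-data` prefix, the concrete dates, and the file names are made up for the example.

```python
from datetime import date

# Hypothetical helpers mirroring the layout above: raw CSV events
# partitioned by day, and Parquet tables partitioned by day and type.
def raw_key(day: date, filename: str) -> str:
    return f"raw-data/events/day={day.isoformat()}/{filename}"

def table_key(day: date, event_type: int, filename: str) -> str:
    return f"tables/events/day={day.isoformat()}/type={event_type}/{filename}"

print(raw_key(date(2018, 1, 1), "file1.csv"))
# raw-data/events/day=2018-01-01/file1.csv
print(table_key(date(2018, 1, 1), 1, "file1.parquet"))
# tables/events/day=2018-01-01/type=1/file1.parquet
```

Because the partition values (`day=`, `type=`) are encoded in the key, query engines like Athena can prune whole prefixes instead of scanning every file.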
CSV to Parquet Example
File metadata: row count (4), plus per-column metadata: place in file, min, max, compression info.
Columnar layout:
  Time: 1/1/2018 1:00, 1/1/2018 1:01, 1/1/2018 1:05, 1/1/2018 1:11
  Type: 1, 1, 2, 3
  Message: Text 1, Text 2, Text 3, Text 4
  Severity: Low, High, Medium, Low
Original rows:
  Time          | Type | Message | Severity
  1/1/2018 1:00 | 1    | Text 1  | Low
  1/1/2018 1:01 | 1    | Text 2  | High
  1/1/2018 1:05 | 2    | Text 3  | Medium
  1/1/2018 1:11 | 3    | Text 4  | Low
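The point of the per-column metadata (count, min, max) is that a query engine can skip whole files or column chunks without reading them. A minimal, stdlib-only sketch of computing such statistics over the slide's four rows; it illustrates the idea and is not the Parquet format itself:

```python
import csv, io

# The four example rows from the slide, as CSV.
raw = """time,type,message,severity
2018-01-01 01:00,1,Text 1,Low
2018-01-01 01:01,1,Text 2,High
2018-01-01 01:05,2,Text 3,Medium
2018-01-01 01:11,3,Text 4,Low
"""

rows = list(csv.DictReader(io.StringIO(raw)))

def column_stats(rows, column):
    """Count/min/max, as a Parquet writer records per column chunk."""
    values = [r[column] for r in rows]
    return {"count": len(values), "min": min(values), "max": max(values)}

print(column_stats(rows, "type"))  # {'count': 4, 'min': '1', 'max': '3'}
```

Given these stats, a query such as `WHERE type = 5` can skip this file entirely, since 5 falls outside the recorded min/max range.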
Architecture and Flow
Clients: Amazon Athena accessed via Python boto3, the DBeaver SQL client, and the AWS Management Console.
Regions: us-east-1, eu-west-1, ap-northeast-1. Each region holds S3 raw data, S3 aggregated data, and Amazon Athena.
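A hedged sketch of how a query could be submitted to Athena in one region from Python with boto3. The database name and output location are placeholders; the request-building part is kept pure so it can be inspected without AWS credentials.

```python
def athena_request(sql: str, database: str, output_s3: str) -> dict:
    """Build the arguments for Athena's start_query_execution (pure)."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

def run_athena_query(sql: str, region: str, database: str, output_s3: str) -> str:
    """Submit the query in the given region; returns the execution id.
    Requires boto3 and AWS credentials, so the import is kept local."""
    import boto3  # AWS SDK for Python, as shown in the architecture slide
    athena = boto3.client("athena", region_name=region)
    response = athena.start_query_execution(**athena_request(sql, database, output_s3))
    return response["QueryExecutionId"]
```

Usage would look like `run_athena_query("SELECT count(*) FROM events", "eu-west-1", "events_db", "s3://my-athena-results/")`, followed by polling `get_query_execution` until the query completes.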
Flow Example
Raw: day= with file1.csv, file2.csv. Processed: type=1 (data_file1.parquet, data_file2.parquet), type=2 (...).
Query: SELECT time, message, severity FROM events WHERE day=' ' GROUP BY type ORDER BY severity
AWS Athena is used to process the data hourly / daily: filter, join, partition, sort, and aggregate it into Parquet files.
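One way such an hourly/daily job can be implemented is with an Athena CTAS (CREATE TABLE AS SELECT) statement, which writes its result set back to S3 as Parquet. The sketch below only builds the SQL string; the table name, bucket, and everything beyond the slide's column names are assumptions.

```python
# Hypothetical CTAS builder for one day of raw events. Athena's CTAS WITH
# properties used here: format, external_location, partitioned_by (the
# partition column must come last in the SELECT list).
def ctas_for_day(day: str) -> str:
    return f"""
CREATE TABLE events_parquet_{day.replace('-', '_')}
WITH (format = 'PARQUET',
      external_location = 's3://aggregated-data/tables/events/day={day}/',
      partitioned_by = ARRAY['type'])
AS
SELECT time, message, severity, type
FROM raw_events
WHERE day = '{day}'
ORDER BY severity
""".strip()

print(ctas_for_day("2018-01-01").splitlines()[0])
# CREATE TABLE events_parquet_2018_01_01
```

The generated statement would then be submitted through the same Athena client flow as any other query, turning the day's CSV files into partitioned, sorted Parquet.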
Multi-region queries
Data is saved in multiple regions. Two options are available with Athena:
- A single query engine in one of the regions, like in the good old days.
- A query engine per region: better performance, but more work.
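The "query engine per region" option can be sketched as a fan-out: submit the same SQL to each region's engine in parallel, then merge the row sets. `query_region` below is a placeholder; a real version would run the query via a boto3 Athena client created for that region and fetch its results.

```python
from concurrent.futures import ThreadPoolExecutor

REGIONS = ["us-east-1", "eu-west-1", "ap-northeast-1"]

def query_region(region: str, sql: str) -> list:
    # Placeholder standing in for a real per-region Athena call;
    # it returns one fake row so the fan-out/merge logic is runnable.
    return [{"region": region, "severity": "High", "count": 1}]

def query_all_regions(sql: str) -> list:
    """Run the same query in every region in parallel and merge the rows."""
    with ThreadPoolExecutor(max_workers=len(REGIONS)) as pool:
        parts = pool.map(lambda r: query_region(r, sql), REGIONS)
    return [row for part in parts for row in part]

rows = query_all_regions("SELECT severity, count(*) FROM events GROUP BY severity")
print(len(rows))  # 3 (one placeholder row per region)
```

Note that aggregations (counts, sums) returned per region still need a final combine step client-side, which is part of the "more work" this option implies.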
Demo Query
Summary
We got to better analytics by using more data, by leveraging SQL and the capabilities of other engines, and by running queries across multiple regions. The improvements: cost reduction in storage and compute, and no servers to maintain.
Will it work for you too? Tip: do a POC with real data.