Building a Threat-Analytics Multi-Region Data Lake on AWS

1 Building a Threat-Analytics Multi-Region Data Lake on AWS
Ori Nakar 2018

2 About me
Researcher at Imperva
Web application and database security
Software development methodology and architecture
Cloud computing, AWS, Docker and Big Data

3 Agenda
What is a Data Lake?
Data Lake structure and flow example
Threat-Analytics Data Lake architecture
Multi-region queries
Demo

4 Our Story
Data was almost in our hands – we saw it coming and going
We needed a solution for storing it in a way we could use it
We did not know how much data we were going to keep, or for how long
New business use-cases are on the way
Because the data is large and the requirements are unknown, we decided to go with a Data Lake

5 Data Lake
A collection of files stored in a distributed file system
Information is stored in its native form, with little or no processing
Flexible, and allows a great amount of data to be stored, queried and analyzed

6 The Data
Data Lake's Data:
  All data, even unused
  Structured, Semi-Structured or Unstructured
  Transformed when ready to be used
Database Data:
  Structured and Transformed
  Added per use case

7 The Users
Operational – want to get their reports and slice their data
Advanced – go back to the data source
Data Experts – deep analysis

8 Answer new business questions faster
Database: create a schema, add indices, plan your queries
Data Lake: store what you get

9 Query Engine
Diagram: a Database vs. a Data Lake, with separate query engines (Query Engine 1, Query Engine 2) sitting on top of the Data Lake

10 Data Structure Example
raw data/events/
  day=.../
    file1.csv
    file2.csv
    file3.csv
  day=.../
tables/events/
  day=.../
    type=1/
      file1.parquet
      file2.parquet
    type=2/
      file3.parquet
      file4.parquet
    type=3/
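
A layout like this can be registered in Athena as a partitioned external table. The sketch below is not from the talk; the bucket, database and column names (my-data-lake, threat_analytics, the athena results bucket) are assumed for illustration, and boto3 is used to submit the DDL:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Register the aggregated Parquet layout (tables/events/day=.../type=.../*.parquet)
# as a partitioned external table. Names and paths are placeholders.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  time      timestamp,
  message   string,
  severity  string
)
PARTITIONED BY (day string, type int)
STORED AS PARQUET
LOCATION 's3://my-data-lake/tables/events/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "threat_analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

New day/type folders only need their partitions registered (for example with MSCK REPAIR TABLE or ALTER TABLE ... ADD PARTITION) before they become queryable.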

11 CSV to Parquet Example
CSV (rows):
  Time           Type  Message  Severity
  1/1/2018 1:00  1     Text 1   Low
  1/1/2018 1:01  1     Text 2   High
  1/1/2018 1:05  2     Text 3   Medium
  1/1/2018 1:11  3     Text 4   Low
Parquet (columns):
  Metadata – Count: 4; column metadata – place in file, min, max, compression info
  Time: 1/1/2018 1:00, 1/1/2018 1:01, 1/1/2018 1:05, 1/1/2018 1:11
  Type: 1, 1, 2, 3
  Message: Text 1, Text 2, Text 3, Text 4
  Severity: Low, High, Medium, Low
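
The column layout and metadata shown above are what make Parquet efficient to query: the engine can skip whole files or column chunks using the per-column min/max, and read only the columns it needs. As an illustration (not part of the talk), the snippet below writes the same four rows with pyarrow and reads the metadata back:

import pyarrow as pa
import pyarrow.parquet as pq

# The four example rows from the slide, written as a columnar Parquet file.
table = pa.table({
    "time":     ["1/1/2018 1:00", "1/1/2018 1:01", "1/1/2018 1:05", "1/1/2018 1:11"],
    "type":     [1, 1, 2, 3],
    "message":  ["Text 1", "Text 2", "Text 3", "Text 4"],
    "severity": ["Low", "High", "Medium", "Low"],
})
pq.write_table(table, "events.parquet")

meta = pq.ParquetFile("events.parquet").metadata
print(meta.num_rows)                           # count: 4
type_col = meta.row_group(0).column(1)         # the 'type' column chunk
print(type_col.statistics.min,                 # per-column min / max
      type_col.statistics.max)
print(type_col.compression,                    # compression info
      type_col.file_offset)                    # place in file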

12 Architecture and Flow
Diagram: clients – the AWS Management Console, Python boto3 and the DBeaver SQL Client – query the data through Amazon Athena; the data is stored across three regions (us-east-1, eu-west-1, ap-northeast-1), each with S3 raw data and S3 aggregated data buckets.
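
Of the clients in the diagram, the Python boto3 path is the easiest to show end to end. A minimal sketch (database, result bucket and the example query are assumptions, not taken from the talk) that submits a query to Athena in one region and reads the result:

import time
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

# Submit the query; Athena writes the result set to the given S3 location.
qid = athena.start_query_execution(
    QueryString="SELECT severity, count(*) AS events FROM events GROUP BY severity",
    QueryExecutionContext={"Database": "threat_analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/eu-west-1/"},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])

The AWS Management Console and DBeaver (over the Athena JDBC driver) run the same SQL against the same S3 data; only the client changes.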

13 Flow Example
Raw data: day=.../file1.csv, file2.csv
Aggregated data: type=1/data_file1.parquet, data_file2.parquet; type=2/...
Query: SELECT time, message, severity FROM events WHERE day=' ' GROUP BY type ORDER BY severity
AWS Athena is used to process the data hourly / daily – filter, join, partition, sort and aggregate it into Parquet files (a sketch of that step follows below)
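
One way to express that hourly / daily step is an Athena CTAS statement, which reads the raw CSVs and writes partitioned Parquet. This is a sketch under assumptions – the table names, S3 paths and example day value are placeholders, and the talk does not spell out the exact mechanism:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Aggregate one day of raw CSV events into Parquet, partitioned by type.
ctas = """
CREATE TABLE events_day_20180101
WITH (
  format = 'PARQUET',
  external_location = 's3://my-data-lake/tables/events/day=2018-01-01/',
  partitioned_by = ARRAY['type']
) AS
SELECT time, message, severity, type   -- partition column must come last
FROM raw_events
WHERE day = '2018-01-01'
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "threat_analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)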

14 Multi-region queries
Data is saved in multiple regions
Two available options with Athena:
Single query engine in one of the regions – like in the good old days
Query engine per region – better performance, but more work (see the sketch below)
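
A sketch of the second option – a query engine per region – using boto3: the same SQL runs in every region against that region's data, and the small per-region aggregates are merged on the client. The region list, database and bucket names are assumptions for illustration:

import boto3

REGIONS = ["us-east-1", "eu-west-1", "ap-northeast-1"]
SQL = "SELECT severity, count(*) AS events FROM events GROUP BY severity"

# Fan the query out: each region's Athena scans only its local S3 data.
query_ids = {}
for region in REGIONS:
    athena = boto3.client("athena", region_name=region)
    query_ids[region] = athena.start_query_execution(
        QueryString=SQL,
        QueryExecutionContext={"Database": "threat_analytics"},
        ResultConfiguration={"OutputLocation": f"s3://my-athena-results-{region}/"},
    )["QueryExecutionId"]

# Poll each query as in the earlier sketch, then combine the per-region rows;
# only small aggregates leave a region, so cross-region data transfer stays low.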

15 Demo Query

16 Summary
We got to better analytics by:
  Using more data
  Using SQL and other engines' capabilities
  Running queries across multiple regions
Improvements:
  Cost reduction in storage and compute
  No need to maintain servers
Will it work for you too? Tip: do a POC with real data

