Building a Threat-Analytics Multi-Region Data Lake on AWS
Ori Nakar 2018
About me
Researcher at Imperva: web application and database security. Background in software development methodology and architecture, cloud computing, AWS, Docker, and Big Data.
Agenda
- What is a Data Lake?
- Data Lake structure and flow example
- Threat-Analytics Data Lake architecture
- Multi-region queries
- Demo
Our Story
The data was almost in our hands: we saw it coming and going, and we needed a solution for storing it in a way we could use. We did not know how much data we were going to keep, or for how long, and new business use cases were on the way. Because the data is large and the requirements were unknown, we decided to go with a Data Lake.
Data Lake
A collection of files stored in a distributed file system. Information is stored in its native form, with little or no processing. It is flexible and allows great amounts of data to be stored, queried, and analyzed.
The Data
Data Lake data: all data, even unused; structured, semi-structured, or unstructured; transformed when ready to be used.
Database data: structured and transformed; added per use case.
The Users
Operational users: want to get their reports and slice their data.
Advanced users: go back to the data source.
Data experts: deep analysis.
Answer new business questions faster
Database: create a schema, add indices, plan your queries.
Data Lake: store what you get.
Query Engines
A database is tied to a single built-in query engine; a Data Lake can be queried by multiple query engines (Query Engine 1, Query Engine 2) over the same stored files.
Data Structure Example
raw data/events:
  day= : file1.csv, file2.csv, file3.csv
  day= : ...
tables/events:
  day= :
    type=1: file1.parquet, file2.parquet
    type=2: file3.parquet, file4.parquet
    type=3: ...
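The partition layout above can be sketched as small path-building helpers. This is a minimal illustration, not the talk's code: the `raw-data` prefix, the concrete dates, and the file names are made up for the example.

```python
from datetime import date

# Hypothetical helpers mirroring the layout above: raw CSV events
# partitioned by day, and Parquet tables partitioned by day and type.
def raw_key(day: date, filename: str) -> str:
    return f"raw-data/events/day={day.isoformat()}/{filename}"

def table_key(day: date, event_type: int, filename: str) -> str:
    return f"tables/events/day={day.isoformat()}/type={event_type}/{filename}"

print(raw_key(date(2018, 1, 1), "file1.csv"))
# raw-data/events/day=2018-01-01/file1.csv
print(table_key(date(2018, 1, 1), 1, "file1.parquet"))
# tables/events/day=2018-01-01/type=1/file1.parquet
```

Because the partition values (`day=`, `type=`) are encoded in the key, query engines like Athena can prune whole prefixes instead of scanning every file.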
CSV to Parquet Example
File metadata: row count (4), plus per-column metadata: place in file, min, max, compression info.
Columnar layout:
  Time: 1/1/2018 1:00, 1/1/2018 1:01, 1/1/2018 1:05, 1/1/2018 1:11
  Type: 1, 1, 2, 3
  Message: Text 1, Text 2, Text 3, Text 4
  Severity: Low, High, Medium, Low
Original rows:
  Time          | Type | Message | Severity
  1/1/2018 1:00 | 1    | Text 1  | Low
  1/1/2018 1:01 | 1    | Text 2  | High
  1/1/2018 1:05 | 2    | Text 3  | Medium
  1/1/2018 1:11 | 3    | Text 4  | Low
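The point of the per-column metadata (count, min, max) is that a query engine can skip whole files or column chunks without reading them. A minimal, stdlib-only sketch of computing such statistics over the slide's four rows; it illustrates the idea and is not the Parquet format itself:

```python
import csv, io

# The four example rows from the slide, as CSV.
raw = """time,type,message,severity
2018-01-01 01:00,1,Text 1,Low
2018-01-01 01:01,1,Text 2,High
2018-01-01 01:05,2,Text 3,Medium
2018-01-01 01:11,3,Text 4,Low
"""

rows = list(csv.DictReader(io.StringIO(raw)))

def column_stats(rows, column):
    """Count/min/max, as a Parquet writer records per column chunk."""
    values = [r[column] for r in rows]
    return {"count": len(values), "min": min(values), "max": max(values)}

print(column_stats(rows, "type"))  # {'count': 4, 'min': '1', 'max': '3'}
```

Given these stats, a query such as `WHERE type = 5` can skip this file entirely, since 5 falls outside the recorded min/max range.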
Architecture and Flow
Clients: Amazon Athena accessed via Python boto3, the DBeaver SQL client, and the AWS Management Console.
Regions: us-east-1, eu-west-1, ap-northeast-1. Each region holds S3 raw data, S3 aggregated data, and Amazon Athena.
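A hedged sketch of how a query could be submitted to Athena in one region from Python with boto3. The database name and output location are placeholders; the request-building part is kept pure so it can be inspected without AWS credentials.

```python
def athena_request(sql: str, database: str, output_s3: str) -> dict:
    """Build the arguments for Athena's start_query_execution (pure)."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

def run_athena_query(sql: str, region: str, database: str, output_s3: str) -> str:
    """Submit the query in the given region; returns the execution id.
    Requires boto3 and AWS credentials, so the import is kept local."""
    import boto3  # AWS SDK for Python, as shown in the architecture slide
    athena = boto3.client("athena", region_name=region)
    response = athena.start_query_execution(**athena_request(sql, database, output_s3))
    return response["QueryExecutionId"]
```

Usage would look like `run_athena_query("SELECT count(*) FROM events", "eu-west-1", "events_db", "s3://my-athena-results/")`, followed by polling `get_query_execution` until the query completes.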
Flow Example
Raw: day= with file1.csv, file2.csv. Processed: type=1 (data_file1.parquet, data_file2.parquet), type=2 (...).
Query: SELECT time, message, severity FROM events WHERE day=' ' GROUP BY type ORDER BY severity
AWS Athena is used to process the data hourly / daily: filter, join, partition, sort, and aggregate it into Parquet files.
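One way such an hourly/daily job can be implemented is with an Athena CTAS (CREATE TABLE AS SELECT) statement, which writes its result set back to S3 as Parquet. The sketch below only builds the SQL string; the table name, bucket, and everything beyond the slide's column names are assumptions.

```python
# Hypothetical CTAS builder for one day of raw events. Athena's CTAS WITH
# properties used here: format, external_location, partitioned_by (the
# partition column must come last in the SELECT list).
def ctas_for_day(day: str) -> str:
    return f"""
CREATE TABLE events_parquet_{day.replace('-', '_')}
WITH (format = 'PARQUET',
      external_location = 's3://aggregated-data/tables/events/day={day}/',
      partitioned_by = ARRAY['type'])
AS
SELECT time, message, severity, type
FROM raw_events
WHERE day = '{day}'
ORDER BY severity
""".strip()

print(ctas_for_day("2018-01-01").splitlines()[0])
# CREATE TABLE events_parquet_2018_01_01
```

The generated statement would then be submitted through the same Athena client flow as any other query, turning the day's CSV files into partitioned, sorted Parquet.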
Multi-region queries
Data is saved in multiple regions. Two options are available with Athena:
- A single query engine in one of the regions, like in the good old days.
- A query engine per region: better performance, but more work.
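The "query engine per region" option can be sketched as a fan-out: submit the same SQL to each region's engine in parallel, then merge the row sets. `query_region` below is a placeholder; a real version would run the query via a boto3 Athena client created for that region and fetch its results.

```python
from concurrent.futures import ThreadPoolExecutor

REGIONS = ["us-east-1", "eu-west-1", "ap-northeast-1"]

def query_region(region: str, sql: str) -> list:
    # Placeholder standing in for a real per-region Athena call;
    # it returns one fake row so the fan-out/merge logic is runnable.
    return [{"region": region, "severity": "High", "count": 1}]

def query_all_regions(sql: str) -> list:
    """Run the same query in every region in parallel and merge the rows."""
    with ThreadPoolExecutor(max_workers=len(REGIONS)) as pool:
        parts = pool.map(lambda r: query_region(r, sql), REGIONS)
    return [row for part in parts for row in part]

rows = query_all_regions("SELECT severity, count(*) FROM events GROUP BY severity")
print(len(rows))  # 3 (one placeholder row per region)
```

Note that aggregations (counts, sums) returned per region still need a final combine step client-side, which is part of the "more work" this option implies.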
Demo Query
Summary
We got to better analytics by using more data, by leveraging SQL and the capabilities of other engines, and by running queries across multiple regions. The improvements: cost reduction in storage and compute, and no servers to maintain.
Will it work for you too? Tip: do a POC with real data.