Data Driven Future Denis Reznik Data Architect, Intapp, Inc. Microsoft Data Platform MVP
About Me Denis Reznik Kyiv, Ukraine Data Architect at Intapp, Inc. Microsoft Data Platform MVP PASS Regional Mentor, CEE Ukrainian Data Community Kyiv Co-Founder Co-author of “SQL Server MVP Deep Dives vol. 2” Organizer of SQLSaturday Kyiv Conference
Agenda Data is a new Oil (c) Data and Science Data in Big Companies Data and Application Development Data-Driven Future
Data is a New Oil “Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.” (c) Clive Humby, UK Mathematician
Data and Science Thousands of years: Empirical. Few hundreds of years: Theoretical. Last fifty years: Computational, “Query the world”. Last twenty years: eScience (Data Science), “Download the world”.
Data Science is a new term “Data Science is a new term. But in the same sense as Columbus discovered a NEW continent 1000 years ago.” (c) Hector Garcia-Molina, Professor in the Departments of Computer Science and Electrical Engineering at Stanford University
Machine Learning Supervised Learning: Classification, Regression. Unsupervised Learning.
Linear Regression Training Data -> Learning Algorithm -> h (h - Hypothesis). Inputs: Ocean Temperature, Oil Derricks in Area, Distance from the Continent -> h -> Output: Whales Population
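The flow on this slide (training data goes into a learning algorithm, which produces a hypothesis h) can be sketched in a few lines. This is only an illustration under stated assumptions, not the demo from the talk: it uses a single input (ocean temperature) instead of all three, ordinary least squares as the learning algorithm, and made-up numbers. The names `fit_linear` and `make_hypothesis` are hypothetical.

```python
def fit_linear(xs, ys):
    """Learning algorithm: ordinary least squares for one input.

    Returns (slope, intercept) of the line that minimizes squared error.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def make_hypothesis(slope, intercept):
    """Build h, the function that maps an input to a prediction."""
    def h(x):
        return slope * x + intercept
    return h


# Made-up training data (assumption): ocean temperature vs. whale population.
ocean_temperature = [10.0, 12.0, 14.0, 16.0, 18.0]
whale_population = [200.0, 180.0, 160.0, 140.0, 120.0]

slope, intercept = fit_linear(ocean_temperature, whale_population)
h = make_hypothesis(slope, intercept)

# Predict the whale population for a temperature not in the training data.
print(h(15.0))
```

With several inputs the same idea applies, but h becomes a weighted sum of all inputs and the fit is usually done with a numerical library rather than by hand.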
DEMO Linear Regression
Data in Big Companies
source: http://www. visualcapitalist
Parallel Processing Q: How many times was the temperature above the norm during the last week? Temperature Sensor Datasets (n Items) A: 8 Time: 2 hours Algorithmic Complexity: O(n)
Parallel Processing Q: How many times was the temperature above the norm during the last week? Temperature Sensor Datasets (k datasets, n/k Items in each one) A: 1 A: 0 A: 3 A: 4 Time: 0.5 hours Algorithmic Complexity: O(n/k)
Map-Reduce Map -> COUNT(*) WHERE Value > 40 A: 1 A: 0 A: 3 A: 4 Reduce -> SUM of the partial counts A: 8
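The map and reduce steps on this slide can be sketched as follows. This is a minimal, single-machine illustration and not the demo from the talk: the readings are made up so that the four partitions produce partial counts of 1, 0, 3, and 4, the threshold of 40 comes from the slide's WHERE clause, and the function names are hypothetical. A thread pool stands in for the separate workers that a real map-reduce framework would use.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

THRESHOLD = 40  # the "norm", taken from WHERE Value > 40


def map_count_above(readings, threshold=THRESHOLD):
    """Map step: count readings above the threshold in one partition."""
    return sum(1 for value in readings if value > threshold)


def reduce_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


# Made-up sensor readings (assumption), split into four partitions.
partitions = [
    [38, 41, 39],
    [35, 36, 40],
    [42, 45, 39, 41],
    [44, 43, 47, 50],
]

# Run the map step on every partition concurrently.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partial_counts = list(pool.map(map_count_above, partitions))

# Fold the partial counts into the final answer.
total = reduce(reduce_counts, partial_counts, 0)
print(partial_counts, total)
```

The reduce step adds the partial counts rather than counting them, which is why the map step can run on each partition independently.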
DEMO Map-Reduce
Database History 1960s: IMS, CODASYL. 1970s: E.F. Codd’s Paper, System R, Ingres, SQL. 1980s: RDBMS Commercial Success. 1990s: Object Databases. 2000s: Google BigTable Paper, Amazon Dynamo Paper, NoSQL (Johan Oskarsson). Nowadays: NewSQL (?)
NoSQL SQL
Databases Key-Value Relational Column-Family Graph Document
Index (B-Tree) - Seek SELECT * FROM Users WHERE Id = 523 [Figure: B-tree levels. Root: 1 .. 1M. Intermediate: 1 .. 2K, 2K+1 .. 4K, …, 1M-2K .. 1M. Leaf: 1 .. 300, 301 .. 800, 801 .. 1.5K, 1.5K+1 .. 2K, …]
Index (B-Tree) - Scan SELECT * FROM Users [Figure: B-tree levels. Root: 1 .. 1M. Intermediate: 1 .. 2K, 2K+1 .. 4K, …, 1M-2K .. 1M. Leaf: 1 .. 300, 301 .. 800, 801 .. 1.5K, 1.5K+1 .. 2K, …]
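The difference between a seek and a scan can be illustrated with a sorted list of ids. This is a sketch under simplifying assumptions, not a B-tree: binary search over a sorted list stands in for walking the tree from root to leaf, and a full pass over the list stands in for reading every leaf. The table size of 1M rows and the id 523 come from the slides; the function names are hypothetical.

```python
def index_seek(sorted_ids, target):
    """Seek: binary search for one id.

    Returns (position or None, number of comparison steps).
    """
    low, high, steps = 0, len(sorted_ids), 0
    while low < high:
        mid = (low + high) // 2
        steps += 1
        if sorted_ids[mid] < target:
            low = mid + 1
        else:
            high = mid
    found = low < len(sorted_ids) and sorted_ids[low] == target
    return (low if found else None), steps


def index_scan(sorted_ids):
    """Scan: visit every entry, as SELECT * without a WHERE clause must."""
    visited = 0
    for _ in sorted_ids:
        visited += 1
    return visited


# Stand-in for the Users table: ids 1 .. 1M, already sorted.
user_ids = list(range(1, 1_000_001))

position, seek_steps = index_seek(user_ids, 523)
scan_steps = index_scan(user_ids)
print(position, seek_steps, scan_steps)
```

The seek needs at most about 20 steps for a million ids because each step halves the search range, while the scan touches all one million entries. A real B-tree gets the same logarithmic behavior with far fewer levels, since each node holds many keys.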
Hashtable [Figure: the keys John Snow, Jim Beam, and Peter Parker pass through the Hash Function, which maps each key to one of the buckets 1 to 4]
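A hash table along the lines of this slide can be sketched as below. It is a toy illustration under stated assumptions: the hash function (sum of character codes modulo the bucket count) is chosen only because it is deterministic and easy to follow, collisions are handled by keeping a small list per bucket, and the stored values are made up. The class name `HashTable` and its methods are hypothetical. Buckets are numbered from 0 here, while the slide numbers them from 1.

```python
class HashTable:
    """Toy hash table with separate chaining."""

    def __init__(self, bucket_count=4):
        # One list per bucket; each list holds (key, value) pairs.
        self.buckets = [[] for _ in range(bucket_count)]

    def _bucket_index(self, key):
        # Toy hash function (assumption): sum of character codes.
        return sum(ord(ch) for ch in key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._bucket_index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        # Only one bucket is searched, not the whole table.
        for existing_key, value in self.buckets[self._bucket_index(key)]:
            if existing_key == key:
                return value
        return default


table = HashTable(bucket_count=4)
# Keys from the slide; the values are made up for the example.
table.put("John Snow", "user-1")
table.put("Jim Beam", "user-2")
table.put("Peter Parker", "user-3")

print(table.get("Jim Beam"))
print(table.get("Unknown Person", "not found"))
```

A lookup hashes the key once and then inspects a single bucket, which is why key-value access stays fast as the table grows, as long as the keys spread evenly across buckets.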
Q&A Web Site (StackOverflow)
Domain Model Questions Answers Users Comments Votes
StackOverflow Architecture source: https://www.youtube.com/watch?v=t6kM2EM6so4
DEMO Relational vs. NoSQL
Data-Driven Future The amount of data is growing, and this is cool. More and more decisions are based on data. More and more applications are being developed. It is exciting to be a Software Engineer now!
Thank you! Denis Reznik Twitter: @denisreznik Email: denisreznik@live.ru Blog: http://reznik.uneta.com.ua (rus) Facebook: https://www.facebook.com/denis.reznik.5 LinkedIn: http://ua.linkedin.com/pub/denis-reznik/3/502/234