Big Data Management – Fall 2016 5/27/2018 Introduction Big Data Management – Fall 2016 © Philippe Bonnet, 2014
Outline Big Data Defined Trends Underlying Big Data Relational vs. Big Data Management Lambda Architecture Course Outline Examples Trends Underlying Big Data Application Pull Business Push Technology Push: Modern IT Infrastructure © Philippe Bonnet, 2015
Relational Database Management Narrow scope A database is created to serve a well defined purpose Structured data Conceptual/Logical/Physical schema Relational model dominates since 80s Entity Relationship defines conceptual schema Close world assumption Data as an instance of the schema The data which is not part of an instance does not exist Any query on the database returns a value based on the current instance Data at rest Data is loaded and stored in the database, on disk. Fully interactive architecture 3 tier architecture: Web server; App Server; Database Server.
Not a product, but a collection of processes. Big Data Not a product, but a collection of processes. Big Data Resource Data Collection Data Cleaning Extraction, Transfer, Load Federation DBs Docs Feeds Analog Data analysis Data mining Long-term Archival Data maintenance Data Preparation Data Integration © Philippe Bonnet, 2014
What is a Big Data Resource? A big data resource is a collection of data which is made available for analysis What is data analysis? Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making. For example: The databases that underly the learnit blog or FM’s Building Management System are not big data resources The data made available by Eurostat on unemployment in Europe can be considered a big data resource. More examples of this kind at Google Public Data Explorer © Philippe Bonnet, 2014
What is Big Data Analysis? Need for insight based on data not currently available for analysis How can we set goals for energy management at the IT University? We do not know how electricity is consumed. We do not know where electricity is consumed. We do not know how to link electricity consumption and people’s practices. How long does it take you to answer your mails? © Philippe Bonnet, 2014
Who is involved? Data Analyst Data Manager DBs Big Data Resource Docs Feeds Analog Long-term Archival Data Curator Management © Philippe Bonnet, 2014
Big Data Management Wide scope Data Variety Open world assumption Data is made available for yet-to-be-defined analysis Data Variety Time series are highly structured; Text is not Open world assumption Data sources might be added or removed So any analysis is only valid based on the current state of the big data resource Data in movement, and at rest Data streams complements stored data Some data streams are stored, others are not Lambda Architecture (or variant thereof) Batch layer; serving layer; speed layer; Analytics © Philippe Bonnet, 2014
The 3 Vs Volume, Velocity, Variety, Veracity, Validity, Viability, Value, ... At best, the Vs are dimensions to structure: Non-functional requirements Capacity sizing Performance evaluation © Philippe Bonnet, 2014
Lambda Architecture © Philippe Bonnet, 2015 http://www.drdobbs.com/database/applying-the-big-data-lambda-architectur/240162604 © Philippe Bonnet, 2015
Course Outline How is data represented? Batch processes Beyond Relations (lecture 3) Dealing with High-Volume (lecture 4) Data Integration (lecture 10) Primary vs. Derived Data Batch processes Data derivation processes; Map-reduce (lecture 5) Systems underlying Lambda Architecture Modern IT infrastructure (lecture 1) Hadoop ecosystem and Spark (exercises and lecture 6) NoSQL and NewSQL (lecture 7) Data Streaming Platforms (lecture 8) Data Pipeline Management (lecture 9) Analytics 101 OLAP (lecture11) Big Data Mining (lecture 12) © Philippe Bonnet, 2015
Big Data at City Scale End-users App developers Technology providers, Solution providers Cyber-Physical System Building instrumentation, wireless infrastructure, … Data Sets City map, bus routes, meter data, … Data Streams Live position of all buses, el-distribution network status,, … Data Marketplace Batch Transformations Real-Time Analytics Apps Data cleaning, machine learning inference, aggregate views, … Motion inference, pollution alerts, … Mobility app, Energy tracker app, … Storage (raw and derived data), computation, security, … Digital Infrastructure Cloud-based IoT-based Digital Infrastructure Data Providers Data Providers Public Administration © Philippe Bonnet, 2015 Hitachi City Data Market
Big Data at Building Scale Light on/off events not logged Big Data at Building Scale Wireless Router Monitoring IT Dept Building Management System Facility Management Learnit Log RL Teacher Students? © Philippe Bonnet, 2014
Big Data at Building Scale Light on/off events Big Data at Building Scale EF FM Mgt Wireless Router Monitoring Building Management System ITU Big Data Resources available for Analytics within and outside the IT University Learnit Log © Philippe Bonnet, 2014
Is Pokemon Go Big Data? © Philippe Bonnet, 2015
Search and Rescue Example Map by David Strip based on Google maps and OurAirports.com See Jame Fallows’article at the Atlantic. © Philippe Bonnet, 2014
Outline Big Data Defined Trends Underlying Big Data Relational vs. Big Data Management Lambda Architecture Course Outline Examples Trends Underlying Big Data Application Pull Business Push Technology Push: Modern IT Infrastructure © Philippe Bonnet, 2015
Application Pull: Sense making Build conceptual model Build a physical model Answer the questions Questions to answer Build a logical model Collect the data Load the data (Tune) t Time to Insight: Weeks to Months OLD SCHOOL @ Dennis Shasha and Philippe Bonnet, 2013 Source - http://www.youtube.com/watch?v=2YWvmBtEymE
@ Dennis Shasha and Philippe Bonnet, 2013 Available Data Scope of Analysis Model Available Data Model Traditional System Model Traditional System Model Model New System Model @ Dennis Shasha and Philippe Bonnet, 2013 Source – http://www.vldb.org/2011/files/slides/keynotes/campbell_keynote.pptx
@ Dennis Shasha and Philippe Bonnet, 2013 Monitor, Mine, Manage Knowledge Application Knowledge Application Knowledge Knowledge Model Generation Structure / Value Information Information BIG DATA Data Data Transform & Load Information Production Signal Digital Shoebox t Time to Insight Effort / Latency @ Dennis Shasha and Philippe Bonnet, 2013 Source: http://www.vldb.org/pvldb/vol4/p694-campbell.pdf
Business Push: Data Growth @ Dennis Shasha and Philippe Bonnet, 2013 Source: http://permabit.com/media-center/blogs/
@ Dennis Shasha and Philippe Bonnet, 2013 Source: https://practicalanalytics.wordpress.com/2011/11/06/explaining-hadoop-to-management-whats-the-big-data-deal/ @ Dennis Shasha and Philippe Bonnet, 2013
Source: http://blogs. technet © Philippe Bonnet, 2015
Technology Push: Warehouse-Scale Computer LOOK UP: Werner Voegels on virtualization. @ Dennis Shasha and Philippe Bonnet, 2013 Source: http://www.morganclaypool.com/doi/abs/10.2200/S00193ED1V01Y200905CAC006
Technology Push: Storage source: Virtual Geek’s take on storage tree of life M.Wei et al. I/O speculation in the microsecond era. Usenix ATC’14. SSD Architecture © Philippe Bonnet, 2014
Storage Architectures source: Virtual Geek’s take on storage tree of life – A MUST READ!! Storage RAM Interconnect © Philippe Bonnet, 2014
Data-Intensive Applications: Server-side Architectures Look up Fabric Computing on Wikipedia. source: Virtual Geek’s take on storage tree of life – A MUST READ!! © Philippe Bonnet, 2014
© Philippe Bonnet, 2016
Take Away Points Big data is not a product but a collection of processes centered around big data resources Collections of data made available for analysis Primary focus on data manager (less on data analyst) Not a data mining class Lambda Architecture is a good way to organise Big Data management Evolution of storage helps structure application architectures and database landscape © Philippe Bonnet, 2014