CPT-S 483-05 Topics in Computer Science Big Data 1 Yinghui Wu EME 49.

Slides:

Advertisements

Similar presentations

Oracle Labs Graph Analytics Research Hassan Chafi Sr. Research Manager Oracle Labs Graph-TA 2/21/2014.

Advertisements

1 Chapter 5 : Query Processing and Optimization Group 4: Nipun Garg, Surabhi Mithal

BigData Tools Seyyed mohammad Razavi. Outline  Introduction  Hbase  Cassandra  Spark  Acumulo  Blur  MongoDB  Hive  Giraph  Pig.

Jennifer Widom NoSQL Systems Overview (as of November 2011 )

MS DB Proposal Scott Canaan B. Thomas Golisano College of Computing & Information Sciences.

Module 2b: Modeling Information Objects and Relationships IMT530: Organization of Information Resources Winter, 2007 Michael Crandall.

Overview of Search Engines

CS 405G: Introduction to Database Systems 24 NoSQL Reuse some slides of Jennifer Widom Chen Qian University of Kentucky.

SQL on Hadoop. Todays agenda Introduction Hive – the first SQL approach Data ingestion and data formats Impala – MPP SQL.

Business Intelligence: The Next Big Thing (Really!) John Bair CTO, Ajilitee Sep 14, 2012 Presented to TDWI St. Louis Chapter.

Chapter 1 Overview of Databases and Transaction Processing.

Data Mining on the Web via Cloud Computing COMS E6125 Web Enhanced Information Management Presented By Hemanth Murthy.

SC32 WG2 Metadata Standards Tutorial Metadata Registries and Big Data WG2 N1945 June 9, 2014 Beijing, China.

Last Words COSC Big Data (frameworks and environments to analyze big datasets) has become a hot topic; it is a mixture of data analysis, data mining,

CS492: Special Topics on Distributed Algorithms and Systems Fall 2008 Lab 3: Final Term Project.

Ohio State University Department of Computer Science and Engineering Automatic Data Virtualization - Supporting XML based abstractions on HDF5 Datasets.

CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.

Panagiotis Antonopoulos Microsoft Corp Ioannis Konstantinou National Technical University of Athens Dimitrios Tsoumakos.

A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.

Penwell Debug Intel Confidential BRIEF OVERVIEW OF HIVE Jonathan Brauer ESE 380L Feb

Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.

NoSQL Databases NoSQL Concepts SoftUni Team Technical Trainers Software University

XML & Mediators Thitima Sirikangwalkul Wai Sum Mong April 10, 2003.

NoSQL Databases Oracle - Berkeley DB. Content A brief intro to NoSQL About Berkeley Db About our application.

RESOURCES, TRADE-OFFS, AND LIMITATIONS Group 5 8/27/2014.

Information System Development Courses Figure: ISD Course Structure.

+ Hbase: Hadoop Database B. Ramamurthy. + Motivation-0 Think about the goal of a typical application today and the data characteristics Application trend:

When bet365 met Riak and discovered a true, “always on” database.

VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui SWEN 432 Advanced Database Design and Implementation Exam and Lecture Overview.

SLIDE 1IS 257 – Fall 2014 NewSQL and VoltDB University of California, Berkeley School of Information IS 257: Database Management.

Database System Concepts, 5th Ed. ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com Database System Concept.

CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.

 Frequent Word Combinations Mining and Indexing on HBase Hemanth Gokavarapu Santhosh Kumar Saminathan.

Nov 2006 Google released the paper on BigTable.

NoSQL Systems Motivation. NoSQL: The Name  “SQL” = Traditional relational DBMS  Recognition over past decade or so: Not every data management/analysis.

Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies

HEMANTH GOKAVARAPU SANTHOSH KUMAR SAMINATHAN Frequent Word Combinations Mining and Indexing on HBase.

Lecture 15: Query Optimization. Very Big Picture Usually, there are many possible query execution plans. The optimizer is trying to chose a good one.

Data and Information Systems Laboratory University of Illinois Urbana-Champaign Data Mining Meeting Mar, From SQL to NoSQL Xiao Yu Mar 2012.

Big Data Yuan Xue CS 292 Special topics on.

Copyright © 2016 Pearson Education, Inc. Modern Database Management 12 th Edition Jeff Hoffer, Ramesh Venkataraman, Heikki Topi CHAPTER 11: BIG DATA AND.

Chapter 1 Overview of Databases and Transaction Processing.

MarkLogic The Only Enterprise NoSQL Database Presented by: Aashi Rastogi ( ) Sanket Patel ( )

BIG DATA. Big Data: A definition Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database.

BIG DATA/ Hadoop Interview Questions.

B ig D ata Analysis for Page Ranking using Map/Reduce R.Renuka, R.Vidhya Priya, III B.Sc., IT, The S.F.R.College for Women, Sivakasi.

IS6146 Databases for Management Information Systems Lecture 12: Exam Revision Rob Gleasure robgleasure.com.

Abstract MarkLogic Database – Only Enterprise NoSQL DB Aashi Rastogi, Sanket V. Patel Department of Computer Science University of Bridgeport, Bridgeport,

The KDD Process for Extracting Useful Knowledge from Volumes of Data Fayyad, Piatetsky-Shapiro, and Smyth Ian Kim SWHIG Seminar.

1 Gaurav Kohli Xebia Breaking with DBMS and Dating with Relational Hbase.

Data Analytics 1 - THE HISTORY AND CONCEPTS OF DATA ANALYTICS

CS 405G: Introduction to Database Systems

Big Data is a Big Deal!.

Big Data Enterprise Patterns

Sushant Ahuja, Cassio Cristovao, Sameep Mohta

Data Driven Resource Allocation for Distributed Learning

CPT-S 415 Topics in Computer Science Big Data

An Open Source Project Commonly Used for Processing Big Data Sets

Chapter 14 Big Data Analytics and NoSQL

CLOUDERA TRAINING For Apache HBase

NOSQL databases and Big Data Storage Systems

Relational Algebra Chapter 4, Part A

Ministry of Higher Education

External Sorting The slides for this text are organized into chapters. This lecture covers Chapter 11. Chapter 1: Introduction to Database Systems Chapter.

Relational Algebra Chapter 4, Sections 4.1 – 4.2

Database Systems Summary and Overview

Query Processing CSD305 Advanced Databases.

Charles Tappert Seidenberg School of CSIS, Pace University

Query Optimization.

Copyright © JanBask Training. All rights reserved Get Started with Hadoop Hive HiveQL Languages.

Presentation transcript:

CPT-S Topics in Computer Science Big Data 1 Yinghui Wu EME 49

2 Course summary Course project and presentation CPT_S Big Data Big data: conclusion

3 Big Data model Big V’s Relational model XMLs and RDFs Big Data systems Relational DBMS noSQL DBMS Parallel system architectures Search Big Data query models Sequential search strategies Parallel and distributed query processing Parallel and distributed systems Special topics In Big Data Cloud computing Selected topics in Data mining Visualization Data security and privacy Big picture what’s Big Data how to present and store Big Data how to search Big Data critical topics related with Big Data

4 Big data models and systems

Big data: the 4 big V’s 5 Big data models: the 4 big V’s, data types, applications., research trend /topics What is big data? a large, complex data set; a challenge; a trend; an approach of data analytics What is the volume of big data? Variety? Velocity? Veracity? Why do we care about big data? Is there any fundamental challenge introduced by querying big data? Why study Big Data?

Relational data models and DBMS 6 Relational DBMS Relational Model Concepts Relational Model Constraints and Schemas Update Operations and Dealing with Constraint Violations The relational model has rigorously defined query languages that are simple and powerful. Relational algebra is more operational; useful as internal representation for query evaluation plans. Several ways of expressing a given query; a query optimizer should choose the most efficient version.

DBMS architectures & design 7 DBMS: architecture Centralized Client-server Parallel Distributed RDBMS design the normal forms 1NF, 2NF, 3NF, BCNF normal forms transformation

Beyond Relational Data 8 Introduction to XML XML basics DTD XML Schema XML Constraints Introduction to RDF RDF data model and syntax RDF schemas RDF inferencing XML is a prime data exchange format. DTD provides useful syntactic constraints on documents. XML Schema extends DTD by supporting a rich type system Integrity constraints are important for XML, yet are nontrivial RDF provides a foundation for representing and processing metadata RDF has a graph-based data model RDF has an XML-based syntax to support syntactic interoperability RDF has a decentralized philosophy and allows incremental building of knowledge, and its sharing and reuse

Beyond RDBMS: noSQL 9 noSQL: concept and theory CAP theory ACID vs EASE noSQL vs RDBMS noSQL databases Key-value stores Document DBs Column family Graph databases Cheap, easy to implement (open source) Data are replicated to multiple nodes (fault-tolerant) Easy to distribute Don't require a schema Can scale up and down Relax the data consistency requirement (CAP) Joins, ACID transactions SQL as a sometimes frustrating but still powerful query language easy integration with other applications that support SQL

10 Big data search

Query Processing 11 Querying framework overview Measures of Query Cost Basic of Database operations Basics of Graph Queries Basics of Graph Algorithms Graph search (traversal) PageRank Nearest neighbors Keyword search Graph pattern matching

Big data search: Make big data small 12 Exact Query Exact Answer “Big Data” Long Response Times! “Small Data” Approximate Query KB/MB Approximate Answer FAST!! Approximate query evaluation query driven: approximate query models data driven: synopses, histogram, sampling, sketches, spanners… View-based query evaluation Make big data small: indexing, sketch, sampling, spanners Cope with data streams: incremental query evaluation

Parallel data management 13 parallel DBMS Architectures 4 Parallelism: Intraquery, Interquery Intraoperation,Interoperation MapReduce mapper reducer Q( ) D D D1D1 D1D1 DnDn DnDn D2D2 D2D2 …

Hadoop 14 Hadoop: history, features and design Hadoop ecosystem HDFS Hive & Pig Hbase Zookeeper

Querying Big Graphs: Beyond MapReduce Parallel Graph programming models –MapReduce for BFS for distance queries, PageRank.. –Vertex Centric Programming: GraphLab and Pregel –Graph Centric Programming: Giraph ++ –GRAPE: Hybrid models 15 Virtual Processors

Cloud Computing 16 Cloud computing concept Service models and architecture Features and characteristics Pros and cons

17 Big data: Special topics

Data Mining and Graph Mining Basic 18 Data mining: from data to knowledge Graph Mining: pattern mining Classification: decision tree Clustering: k-means, DBSCAN age ? student?credit rating? <=30 >40 noyes no fairexcellent yesno

Veracity of Big Data 19 Discover rulesReasoningDetect errorsRepair Data quality management: An overview Central aspects of data quality Data consistency Entity resolution Information completeness Data currency Data accuracy Deducing the true values of objects in data fusion

Data Visualization, security and privacy 20 Information Visualization Graph visualization Graph drawing and graph visualization Graph layout Information Security: basic concepts Privacy: basic concepts and comparison with security

The future of Big Data 21

Machine BI Business Intelligence 22

Smart Living 23

Big data & Healthcare 24

Future of Big Data techs (NSF National Priorities) 25

Project presentation Presentation (15 minutes minutes Q&A) –Background and motivation why the problem is important application of the solutions Challenges and difference with related work –Problem formulation: Input and output Object function, if any –Algorithm description Correctness analysis Complexity analysis Properties/features/optimization techniques –Experimental study Data sets/generation of dataset Algorithm implemented/baseline algorithms/platforms/test settings Figures/trend/explain Summary of experimental result –Conclusion and Future work How your current work can be improved 26

General tips 27 Every talk motivates a problem Talk is about idea Simple Slides are better A picture is worth a thousand words Keep logic flow Prepare for Questions Practice makes perfect

Course project report You have milestones 1-5. Combine your report and enrich with experimental results, references and future work. Do not simply put them together. Make title concise and right to the point Abstract: describe your problem, solution and experimental result. Section 1: Introduction: –Background –challenges (why your problem is hard) –Related work Section 2: Problem statement –Input/output –Object function –Hardness of the problem Section 3: Solution –Algorithm description –correctness and time complexity analysis (try big O notation) –Optimization techs (e.g., applying Big data search strategy) 28

Course project report Section 4: Experimental study Data sets/generation of dataset Algorithm implemented/baseline algorithms/platforms/test settings Figures/trend/explain Summary of experimental result Section 5: Conclusion & Future work –What have you observed in your project –What problem remains to be unresolved? What are possible extension of your problem? What’s your plan to solve it in future? 29

CPT_S Big Data 30 I hope you enjoyed this course And found it useful! Thank you