Data Management Part 1.1 DBMS.



Outline: Types of Databases and Database Applications; Basic Definitions; Typical DBMS Functionality; Example of a Database (UNIVERSITY); Main Characteristics of the Database Approach; Database Users; Advantages of Using the Database Approach; When Not to Use Databases

Types of Databases and Database Applications: Traditional applications: numeric and textual databases. More recent applications: multimedia databases, geographic information systems (GIS), data warehouses, real-time and active databases, and many other applications. The first part of the book focuses on traditional applications. A number of recent applications are described later in the book (for example, Chapters 24, 26, 28, 29, 30).

Basic Definitions Database: A collection of related data. Data: Known facts that can be recorded and have an implicit meaning. Mini-world: Some part of the real world about which data is stored in a database. For example, student grades and transcripts at a university. Database Management System (DBMS): A software package/ system to facilitate the creation and maintenance of a computerized database. Database System: The DBMS software together with the data itself. Sometimes, the applications are also included.

Simplified database system environment

Typical DBMS Functionality: Define a particular database in terms of its data types, structures, and constraints. Construct or load the initial database contents on a secondary storage medium. Manipulate the database: retrieval (querying, generating reports), modification (insertions, deletions, and updates to its content), and access through Web applications. Process and share the database among a set of concurrent users and application programs while keeping all data valid and consistent.

Typical DBMS Functionality (other features): Protection or security measures to prevent unauthorized access. “Active” processing to take internal actions on data. Presentation and visualization of data. Maintenance of the database and associated programs over the lifetime of the database application, called database, software, and system maintenance.
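
The define / construct / manipulate functions listed above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the STUDENT table, its columns, and the sample rows are assumptions made for this example, not content from the slides.

```python
import sqlite3

# In-memory database; table, column names and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")

# Define: data types, structure, and constraints.
conn.execute(
    """
    CREATE TABLE student (
        student_number INTEGER PRIMARY KEY,
        name           TEXT NOT NULL,
        class          INTEGER CHECK (class BETWEEN 1 AND 5),
        major          TEXT
    )
    """
)

# Construct: load the initial contents.
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?, ?)",
    [(17, "Smith", 1, "CS"), (8, "Brown", 2, "CS")],
)

# Manipulate: a modification followed by a retrieval.
conn.execute("UPDATE student SET class = 2 WHERE name = 'Smith'")
rows = conn.execute(
    "SELECT name, class FROM student ORDER BY student_number"
).fetchall()

# The declared constraint is checked by the DBMS, not by application code.
try:
    conn.execute("INSERT INTO student VALUES (99, 'Bad', 9, 'CS')")
    constraint_enforced = False
except sqlite3.IntegrityError:
    constraint_enforced = True
conn.commit()
```

Note that the CHECK constraint is declared once in the schema, so every program that uses the database gets the same validation.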

Example of a Database (with a Conceptual Data Model): Mini-world for the example: part of a UNIVERSITY environment. Some mini-world entities: STUDENTs, COURSEs, SECTIONs (of COURSEs), (academic) DEPARTMENTs, INSTRUCTORs.

Example of a Database (with a Conceptual Data Model): Some mini-world relationships: SECTIONs are of specific COURSEs; STUDENTs take SECTIONs; COURSEs have prerequisite COURSEs; INSTRUCTORs teach SECTIONs; COURSEs are offered by DEPARTMENTs; STUDENTs major in DEPARTMENTs. Note: The above entities and relationships are typically expressed in a conceptual data model, such as the ENTITY-RELATIONSHIP data model (see Chapters 3, 4).

Example of a simple database

Main Characteristics of the Database Approach: Self-describing nature of a database system: A DBMS catalog stores the description of a particular database (e.g., data structures, types, and constraints). The description is called meta-data. This allows the DBMS software to work with different database applications. Insulation between programs and data: Called program-data independence. Allows changing data structures and storage organization without having to change the DBMS access programs.

Example of a simplified database catalog
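
As a small illustration of a self-describing database, the sketch below reads SQLite's own catalog through Python's sqlite3 module. SQLite is used here only as a convenient stand-in for a DBMS, and the COURSE table and its columns are assumed for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Illustrative table; the column names are assumptions for this sketch.
conn.execute(
    "CREATE TABLE course ("
    "course_number TEXT PRIMARY KEY, course_name TEXT, credit_hours INTEGER)"
)

# The catalog is itself queryable: sqlite_master lists the schema objects.
tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(course)")]
```

A generic tool can use these two queries to work with any database file without knowing its structure in advance, which is the point of storing meta-data in the catalog.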

Main Characteristics of the Database Approach (continued): Data abstraction: A data model is used to hide storage details and present the users with a conceptual view of the database. Programs refer to the data model constructs rather than data storage details. Support of multiple views of the data: Each user may see a different view of the database, which describes only the data of interest to that user.
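
One way to picture multiple views is an SQL view that exposes only part of a stored table. The sketch below uses Python's sqlite3 module; the GRADE_REPORT table, the ENROLLMENT view, and the rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Illustrative base table and rows (assumed, not from the slides).
conn.execute(
    "CREATE TABLE grade_report (student_number INTEGER, section_id INTEGER, grade TEXT)"
)
conn.executemany(
    "INSERT INTO grade_report VALUES (?, ?, ?)",
    [(17, 112, "B"), (17, 119, "C"), (8, 85, "A")],
)

# A view for users who need enrollment data but should not see grades.
conn.execute(
    "CREATE VIEW enrollment AS SELECT student_number, section_id FROM grade_report"
)
view_rows = conn.execute("SELECT * FROM enrollment ORDER BY section_id").fetchall()
```

The view is queried like a table, yet it stores nothing itself; it describes only the data of interest to one class of user.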

Main Characteristics of the Database Approach (continued): Sharing of data and multi-user transaction processing: Allowing a set of concurrent users to retrieve from and to update the database. Concurrency control within the DBMS guarantees that each transaction is correctly executed or aborted. The recovery subsystem ensures that each completed transaction has its effect permanently recorded in the database. OLTP (Online Transaction Processing) is a major part of database applications. This allows hundreds of concurrent transactions to execute per second.
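
The "correctly executed or aborted" guarantee can be sketched with a single-user example in Python's sqlite3 module. It shows only the all-or-nothing behaviour of one transaction, not concurrency control; the ACCOUNT table and the transfer function are hypothetical names introduced for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Illustrative schema: balances may never go negative.
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, 1, 2, 30)        # both updates are committed
failed = transfer(conn, 1, 2, 500)   # debit violates the CHECK, so the credit is rolled back
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

After the failed transfer the credit to account 2 is undone as well, so the database never shows a half-finished transaction.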

Database Users: Users may be divided into (1) those who actually use and control the database content, and those who design, develop and maintain database applications (called “Actors on the Scene”), and (2) those who design and develop the DBMS software and related tools, and the computer systems operators (called “Workers Behind the Scene”).

Database Users Actors on the scene Database administrators: Responsible for authorizing access to the database, for coordinating and monitoring its use, acquiring software and hardware resources, controlling its use and monitoring efficiency of operations. Database Designers: Responsible to define the content, the structure, the constraints, and functions or transactions against the database. They must communicate with the end-users and understand their needs.

Categories of End-users: Actors on the scene (continued). End-users: They use the data for queries and reports, and some of them update the database content. End-users can be categorized into: Casual: access the database occasionally when needed. Naïve or Parametric: they make up a large section of the end-user population. They use previously well-defined functions in the form of “canned transactions” against the database. Examples are bank-tellers or reservation clerks who do this activity for an entire shift of operations.

Categories of End-users (continued) Sophisticated: These include business analysts, scientists, engineers, others thoroughly familiar with the system capabilities. Many use tools in the form of software packages that work closely with the stored database. Stand-alone: Mostly maintain personal databases using ready-to-use packaged applications. An example is a tax program user that creates its own internal database. Another example is a user that maintains an address book

Advantages of Using the Database Approach: Controlling redundancy in data storage and in development and maintenance efforts. Sharing of data among multiple users. Restricting unauthorized access to data. Providing persistent storage for program objects in object-oriented DBMSs (see Chapters 20-22). Providing storage structures (e.g., indexes) for efficient query processing.

Advantages of Using the Database Approach (continued) Providing backup and recovery services. Providing multiple interfaces to different classes of users. Representing complex relationships among data. Enforcing integrity constraints on the database. Drawing inferences and actions from the stored data using deductive and active rules

Additional Implications of Using the Database Approach Potential for enforcing standards: This is very crucial for the success of database applications in large organizations. Standards refer to data item names, display formats, screens, report structures, meta-data (description of data), Web page layouts, etc. Reduced application development time: Incremental time to add each new application is reduced.

Additional Implications of Using the Database Approach (continued) Flexibility to change data structures: Database structure may evolve as new requirements are defined. Availability of current information: Extremely important for on-line transaction systems such as airline, hotel, car reservations. Economies of scale: Wasteful overlap of resources and personnel can be avoided by consolidating data and applications across departments.

Historical Development of Database Technology Early Database Applications: The Hierarchical and Network Models were introduced in mid 1960s and dominated during the seventies. A bulk of the worldwide database processing still occurs using these models, particularly, the hierarchical model. Relational Model based Systems: Relational model was originally introduced in 1970, was heavily researched and experimented within IBM Research and several universities. Relational DBMS Products emerged in the early 1980s.

Historical Development of Database Technology (continued): Object-oriented and emerging applications: Object-Oriented Database Management Systems (OODBMSs) were introduced in the late 1980s and early 1990s to cater to the need for complex data processing in CAD and other applications. Their use has not taken off much. Many relational DBMSs have incorporated object database concepts, leading to a new category called object-relational DBMSs (ORDBMSs). Extended relational systems add further capabilities (e.g., for multimedia data, XML, and other data types).

Historical Development of Database Technology (continued): Data on the Web and E-commerce applications: The Web contains data in HTML (Hypertext Markup Language) with links among pages. This has given rise to a new set of applications, and E-commerce is using new standards like XML (eXtensible Markup Language) (see Ch. 27). Script programming languages such as PHP and JavaScript allow generation of dynamic Web pages that are partially generated from a database (see Ch. 26). They also allow database updates through Web pages.

Extending Database Capabilities: New functionality is being added to DBMSs in the following areas: scientific applications; XML (eXtensible Markup Language); image storage and management; audio and video data management; data warehousing and data mining; spatial data management; time series and historical data management. The above gives rise to new research and development in incorporating new data types, complex data structures, new operations, and storage and indexing schemes in database systems.

When not to use a DBMS Main inhibitors (costs) of using a DBMS: High initial investment and possible need for additional hardware. Overhead for providing generality, security, concurrency control, recovery, and integrity functions. When a DBMS may be unnecessary: If the database and applications are simple, well defined, and not expected to change. If there are stringent real-time requirements that may not be met because of DBMS overhead. If access to data by multiple users is not required.

When not to use a DBMS (continued): When no DBMS may suffice: If the database system is not able to handle the complexity of data because of modeling limitations. If the database users need special operations not supported by the DBMS.

Summary: Types of Databases and Database Applications; Basic Definitions; Typical DBMS Functionality; Example of a Database (UNIVERSITY); Main Characteristics of the Database Approach; Database Users; Advantages of Using the Database Approach; When Not to Use Databases

Data Management Part 1.2 Data Mining/Warehousing

Outline: Data Mining; Data Warehousing; Knowledge Discovery in Databases (KDD); Goals of Data Mining and Knowledge Discovery; Association Rules; Additional Data Mining Algorithms (sequential pattern analysis, time series analysis, regression, neural networks, genetic algorithms)

Definitions of Data Mining The discovery of new information in terms of patterns or rules from vast amounts of data. The process of finding interesting structure in data. The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.

Data Warehousing The data warehouse is a historical database designed for decision support. Data mining can be applied to the data in a warehouse to help with certain types of decisions. Proper construction of a data warehouse is fundamental to the successful use of data mining.

Knowledge Discovery in Databases (KDD): Data mining is actually one step of a larger process known as knowledge discovery in databases (KDD). The KDD process model comprises six phases: (1) data selection, (2) data cleansing, (3) enrichment, (4) data transformation or encoding, (5) data mining, (6) reporting and displaying discovered knowledge.

Goals of Data Mining and Knowledge Discovery (PICO) Prediction: Determine how certain attributes will behave in the future. Identification: Identify the existence of an item, event, or activity. Classification: Partition data into classes or categories. Optimization: Optimize the use of limited resources.

Types of Discovered Knowledge: association rules; classification hierarchies; sequential patterns; patterns within time series; clustering

Association Rules: Association rules are frequently used to generate rules from market-basket data. A market basket corresponds to the set of items a consumer purchases during one visit to a supermarket. The set of items purchased by customers is known as an itemset. An association rule is of the form X => Y, where X = {x1, x2, ..., xn} and Y = {y1, y2, ..., ym} are sets of items, with xi and yj being distinct items for all i and all j. For an association rule to be of interest, it must satisfy a minimum support and confidence.

Association Rules, Confidence and Support: Support: The minimum percentage of instances in the database that contain all items listed in a given association rule. Support is the percentage of transactions that contain all of the items in the itemset LHS U RHS. Confidence: Given a rule of the form A => B, rule confidence is the conditional probability that B is true when A is known to be true. Confidence can be computed as support(LHS U RHS) / support(LHS).
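
The two measures can be computed directly from their definitions. The following is a minimal sketch in Python; the four market baskets are made-up sample data, and the function names support and confidence are chosen for this example.

```python
# Made-up market baskets, one set of items per transaction.
transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """support(LHS U RHS) / support(LHS) for the rule lhs => rhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


s = support({"milk", "juice"}, transactions)       # 2 of 4 baskets
c = confidence({"milk"}, {"juice"}, transactions)  # 2 of the 3 milk baskets
```

Here milk => juice has support 0.5 and confidence 2/3: half of all baskets contain both items, and two of the three baskets with milk also contain juice.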

Generating Association Rules: The general algorithm for generating association rules is a two-step process. (1) Generate all itemsets that have a support exceeding the given threshold. Itemsets with this property are called large or frequent itemsets. (2) Generate rules for each itemset as follows: for itemset X and Y a subset of X, let Z = X - Y; if support(X)/support(Z) > minimum confidence, the rule Z => Y is a valid rule.
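
Step 2 of the process above can be sketched as follows, assuming the frequent itemset X has already been found. The sample baskets and the function name rules_from_itemset are assumptions for this illustration.

```python
from itertools import combinations

# Made-up market baskets for illustration.
transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)


def rules_from_itemset(x, transactions, min_conf):
    """For every nonempty proper subset Y of X, test the rule Z => Y with Z = X - Y."""
    x = frozenset(x)
    rules = []
    for size in range(1, len(x)):
        for y in combinations(sorted(x), size):
            y = frozenset(y)
            z = x - y
            conf = support(x, transactions) / support(z, transactions)
            if conf > min_conf:
                rules.append((z, y, conf))
    return rules


rules = rules_from_itemset({"milk", "juice"}, transactions, min_conf=0.7)
```

With a minimum confidence of 0.7 only juice => milk survives (confidence 1.0); milk => juice reaches only 2/3.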

Reducing Association Rule Complexity: Two properties are used to reduce the search space for association rule generation. Downward closure: A subset of a large itemset must also be large. Anti-monotonicity: A superset of a small itemset is also small. This implies that the itemset does not have sufficient support to be considered for rule generation.

Generating Association Rules: The Apriori Algorithm: The Apriori algorithm was the first algorithm used to generate association rules. The Apriori algorithm uses the general algorithm for creating association rules together with downward closure and anti-monotonicity.
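
A compact, level-wise sketch of the frequent-itemset part of Apriori is given below. It is an illustration under assumptions rather than the textbook's exact procedure: the sample baskets are made up, and an itemset counts as frequent when its support is at least the threshold.

```python
from itertools import combinations

# Made-up market baskets for illustration.
transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(transactions)

    def frequent(candidates):
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    items = {item for t in transactions for item in t}
    level = frequent({frozenset([i]) for i in items})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (downward closure): every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result


freq = apriori(transactions, min_support=0.5)
```

The prune step is where downward closure pays off: a candidate is counted against the database only if all of its smaller subsets were already found to be frequent.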

Generating Association Rules: The Sampling Algorithm The sampling algorithm selects samples from the database of transactions that individually fit into memory. Frequent itemsets are then formed for each sample. If the frequent itemsets form a superset of the frequent itemsets for the entire database, then the real frequent itemsets can be obtained by scanning the remainder of the database. In some rare cases, a second scan of the database is required to find all frequent itemsets.

Generating Association Rules: Frequent-Pattern Tree Algorithm The Frequent-Pattern Tree Algorithm reduces the total number of candidate itemsets by producing a compressed version of the database in terms of an FP-tree. The FP-tree stores relevant information and allows for the efficient discovery of frequent itemsets. The algorithm consists of two steps: Step 1 builds the FP-tree. Step 2 uses the tree to find frequent itemsets.

Step 1: Building the FP-Tree: First, the frequent 1-itemsets, along with the count of transactions containing each item, are computed. The 1-itemsets are sorted in non-increasing order of their counts. The root of the FP-tree is created with a “null” label. For each transaction T in the database, place the frequent 1-itemsets in T in sorted order, and designate T as consisting of a head and the remaining items, the tail. Insert itemset information recursively into the FP-tree as follows: if the current node, N, of the FP-tree has a child with an item name = head, increment the count associated with that child by 1; else create a new child node with a count of 1, link it to its parent N, and link it with the item header table. If the tail is nonempty, repeat the above step from that child using only the tail, i.e., the old head is removed, the new head is the first item from the tail, and the remaining items become the new tail.
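
The construction can be sketched in Python as below. The sketch walks each sorted transaction with a loop instead of the recursive head/tail formulation, which produces the same tree. The Node class, the alphabetical tie-break, and the sample baskets are assumptions made for this example.

```python
from collections import Counter

# Made-up market baskets for illustration.
transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]


class Node:
    """One FP-tree node: an item, its count, a parent link and child links."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Non-increasing count order; ties broken alphabetically (an assumption)
    # so that the resulting tree is deterministic.
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}
    root = Node(None, None)                  # the "null" root
    header = {item: [] for item in order}    # item header table
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:                # no matching child: create and link it
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1                 # count the transaction on this path
            node = child
    return root, header


root, header = build_fp_tree(transactions, min_count=2)
```

Transactions that share a prefix share nodes, which is how the tree compresses the database: all three baskets containing milk pass through the single milk node under the root.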

Step 2: The FP-growth Algorithm For Finding Frequent Itemsets
Input: FP-tree and minimum support, mins
Output: frequent patterns (itemsets)
procedure FP-growth (tree, alpha);
Begin
  if tree contains a single path P then
    for each combination, beta, of the nodes in the path
      generate pattern (beta U alpha) with support = minimum support of nodes in beta
  else
    for each item, i, in the header of the tree do
    begin
      generate pattern beta = (i U alpha) with support = i.support;
      construct beta’s conditional pattern base;
      construct beta’s conditional FP-tree, beta_tree;
      if beta_tree is not empty then FP-growth(beta_tree, beta);
    end;
End;
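
A runnable sketch of the recursive part of FP-growth follows. It is a simplification under stated assumptions: the single-path shortcut is omitted (the general branch handles that case too, only less efficiently), transactions carry an integer weight so that conditional pattern bases can be fed back into the same tree builder, and all names and sample data are chosen for this example.

```python
from collections import Counter, defaultdict


class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_tree(weighted, min_count):
    """Build an FP-tree from (items, weight) pairs; return (header, item counts)."""
    counts = Counter()
    for items, w in weighted:
        for i in set(items):
            counts[i] += w
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root, header = Node(None, None), defaultdict(list)
    for items, w in weighted:
        node = root
        kept = (i for i in set(items) if i in frequent)
        for item in sorted(kept, key=lambda i: (-frequent[i], i)):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += w
            node = child
    return header, frequent


def fp_growth(weighted, min_count, alpha=frozenset()):
    """Return {pattern: count} for every frequent pattern that extends alpha."""
    header, frequent = build_tree(weighted, min_count)
    patterns = {}
    for item, nodes in header.items():
        beta = alpha | {item}
        patterns[beta] = frequent[item]
        # Conditional pattern base: the prefix path of each node for this item,
        # weighted by that node's count.
        base = []
        for node in nodes:
            path, p = [], node.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append((path, node.count))
        # Recursing on the base builds beta's conditional FP-tree.
        patterns.update(fp_growth(base, min_count, beta))
    return patterns


# Made-up market baskets, each with weight 1.
baskets = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]
patterns = fp_growth([(b, 1) for b in baskets], min_count=2)
```

No candidate itemsets are generated and tested here; each frequent pattern is grown by one item at a time from counts already stored in the tree.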

Generating Association Rules: The Partition Algorithm Divide the database into non-overlapping subsets. Treat each subset as a separate database where each subset fits entirely into main memory. Apply the Apriori algorithm to each partition. Take the union of all frequent itemsets from each partition. These itemsets form the global candidate frequent itemsets for the entire database. Verify the global set of itemsets by having their actual support measured for the entire database.
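
The partition idea can be sketched as below. To keep the example short, a brute-force enumeration stands in for Apriori inside each partition; that substitution, the function names, and the sample baskets are all assumptions for this illustration.

```python
from itertools import combinations

# Made-up market baskets for illustration.
transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]


def local_frequent(part, min_support):
    """Brute-force stand-in for running Apriori on one in-memory partition."""
    items = sorted({i for t in part for i in t})
    found = set()
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            c = frozenset(c)
            if sum(c <= t for t in part) / len(part) >= min_support:
                found.add(c)
    return found


def partition_mine(transactions, n_parts, min_support):
    size = -(-len(transactions) // n_parts)  # ceiling division
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Union of the locally frequent itemsets = global candidate set.
    candidates = set().union(*(local_frequent(p, min_support) for p in parts))
    # Verification pass: measure actual support over the entire database.
    n = len(transactions)
    result = {}
    for c in candidates:
        s = sum(c <= t for t in transactions) / n
        if s >= min_support:
            result[c] = s
    return result


global_frequent = partition_mine(transactions, n_parts=2, min_support=0.5)
```

The union is a safe candidate set because an itemset that is frequent in the whole database must be frequent in at least one partition; the final pass then discards the candidates that were frequent only locally.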

Complications seen with Association Rules The cardinality of itemsets in most situations is extremely large. Association rule mining is more difficult when transactions show variability in factors such as geographic location and seasons. Item classifications exist along multiple dimensions. Data quality is variable; data may be missing, erroneous, conflicting, as well as redundant.

Classification: Classification is the process of learning a model that is able to describe different classes of data. Learning is supervised as the classes to be learned are predetermined. Learning is accomplished by using a training set of pre-classified data. The model produced is usually in the form of a decision tree or a set of rules.

An Example Rule Here is one of the rules extracted from the decision tree of Figure 28.7. IF 50K > salary >= 20K AND age >=25 THEN class is “yes”

Clustering Unsupervised learning or clustering builds models from data without predefined classes. The goal is to place records into groups where the records in a group are highly similar to each other and dissimilar to records in other groups. The k-Means algorithm is a simple yet effective clustering technique.
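
A minimal sketch of k-Means in plain Python is shown below. Taking the first k points as the initial centroids is an assumption made to keep the example deterministic (implementations commonly pick them at random), and the four sample points are made up.

```python
def k_means(points, k, iterations=20):
    """Cluster tuples of numbers; return (centroids, clusters)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [points[i] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new == centroids:  # converged
            break
        centroids = new
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, clusters = k_means(points, k=2)
```

On this data the two centroids settle at (1.0, 1.5) and (8.5, 8.0), one per visually obvious group, even though both starting centroids lie in the same group.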

Additional Data Mining Methods: sequential pattern analysis; time series analysis; regression; neural networks; genetic algorithms

Sequential Pattern Analysis Transactions ordered by time of purchase form a sequence of itemsets. The problem is to find all subsequences from a given set of sequences that have a minimum support. The sequence S1, S2, S3, .. is a predictor of the fact that a customer purchasing itemset S1 is likely to buy S2 , and then S3, and so on.
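
The support of a candidate subsequence can be measured as sketched below, where each customer's history is a time-ordered list of itemsets. The function names and the three sample histories are assumptions for this illustration.

```python
def contains(sequence, pattern):
    """True if the itemsets of pattern occur, in order, inside itemsets of sequence."""
    it = iter(sequence)
    # Each pattern itemset must be a subset of some later itemset in the sequence;
    # the shared iterator ensures matches are found in time order.
    return all(any(p <= s for s in it) for p in pattern)


def sequence_support(pattern, sequences):
    """Fraction of customer sequences that contain the pattern."""
    return sum(contains(seq, pattern) for seq in sequences) / len(sequences)


# Made-up purchase histories, one list of itemsets per customer.
sequences = [
    [{"milk", "bread"}, {"juice"}, {"coffee"}],
    [{"milk"}, {"coffee"}],
    [{"juice"}, {"milk"}],
]
s1 = sequence_support([{"milk"}, {"coffee"}], sequences)
s2 = sequence_support([{"milk"}, {"juice"}], sequences)
```

The third customer bought juice before milk, so that history supports neither pattern; order matters here, unlike in plain itemset support.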

Time Series Analysis Time series are sequences of events. For example, the closing price of a stock is an event that occurs each day of the week. Time series analysis can be used to identify the price trends of a stock or mutual fund. Time series analysis is an extended functionality of temporal data management.

Regression Analysis: A regression equation estimates a dependent variable using a set of independent variables and a set of constants. The independent variables as well as the dependent variable are numeric. A regression equation can be written in the form Y = f(x1, x2, ..., xn), where Y is the dependent variable. If f is linear in the domain variables xi, the equation is called a linear regression equation.
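
For the simplest case of one independent variable, the constants of a linear regression equation Y = a*x + b can be estimated by ordinary least squares. The sketch below assumes that fitting criterion (the slide does not name one) and uses made-up data points.

```python
def fit_line(xs, ys):
    """Least-squares estimates (slope, intercept) for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted = slope * 5 + intercept  # estimate of Y for a new x
```

Once the constants are fitted, the equation estimates the dependent variable for values of x that were not in the training data.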

Neural Networks A neural network is a set of interconnected nodes designed to imitate the functioning of the brain. Node connections have weights which are modified during the learning process. Neural networks can be used for supervised learning and unsupervised clustering. The output of a neural network is quantitative and not easily understood.

Genetic Learning Genetic learning is based on the theory of evolution. An initial population of several candidate solutions is provided to the learning model. A fitness function defines which solutions survive from one generation to the next. Crossover, mutation and selection are used to create new population elements.
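
The loop of selection, crossover and mutation can be sketched on a toy problem. Everything specific here is an assumption for illustration: candidate solutions are bit lists, the fitness function is the number of ones, the top half of each generation survives, and the mutation rate is 0.1.

```python
import random


def evolve(fitness, length=12, pop_size=20, generations=40, seed=0):
    """Return (best individual found, best fitness in the initial population)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    initial_best = max(map(fitness, pop))
    for _ in range(generations):
        # Selection: the fitter half survives into the next generation.
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, length)
            child = a[:cut] + b[cut:]          # crossover at a random cut point
            if rng.random() < 0.1:             # occasional mutation: flip one bit
                child[rng.randrange(length)] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness), initial_best


best, initial_best = evolve(sum)  # fitness = number of ones in the bit list
```

Because the fittest individuals are always carried over unchanged, the best fitness in the population can never decrease from one generation to the next.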

Data Mining Applications: Marketing: marketing strategies and consumer behavior. Finance: fraud detection, creditworthiness, and investment analysis. Manufacturing: resource optimization. Health: image analysis, side effects of drugs, and treatment effectiveness.

Recap: Data Mining; Data Warehousing; Knowledge Discovery in Databases (KDD); Goals of Data Mining and Knowledge Discovery; Association Rules; Additional Data Mining Algorithms (sequential pattern analysis, time series analysis, regression, neural networks, genetic algorithms)

Reference: Fundamentals of Database Systems, 5/E. Ramez Elmasri, University of Texas at Arlington; Shamkant B. Navathe, Georgia Institute of Technology. Publisher: Addison-Wesley. Copyright: 2007.