Data, Information, and Databases BDIS 6.1

BSAD 141 Dave Novak

Topics Covered Information types: transactional vs. analytical
Five characteristics of information quality Database versus a DBMS RDBMS: advantages and terminology Multi-user issues

The Need for High-Quality Information
Data are everywhere Which data are important? Which data should the organization store? Which data need to be further manipulated? Which data are required to make different types of decisions? How does the organization convert raw data into the information that is needed?

Recall difference between data and information

The need to obtain and analyze the many different levels, formats, and granularities of organizational information to make decisions Granularity refers to the extent of detail within the information (fine and detailed or “coarse” and abstract information)

Decisions are only as good as the quality of the data and information that are used to make the decisions Garbage in, garbage out Using “poor” quality data doesn’t help

Example of Poor Quality Data
Data Quality Problems
Issue 1: Without a first name it would be impossible to correlate this customer with customers in other databases (Sales, Marketing, Billing, Customer Service) to gain a complete customer view (CRM)
Issue 2: Without a complete street address there is no possible way to communicate with this customer via mail or deliveries. An order might be sitting in a warehouse waiting for the complete address before shipping. The company has spent time and money processing an order that might never be completed
Issue 3: If this is the same customer, the company will waste money sending out two sets of promotions and advertisements to the same customer. It might also send two identical orders and have to incur the expense of one order being returned
Issue 4: This is a good example of where cleaning data is difficult because this may or may not be an error. There are many times when a phone and a fax have the same number. Since the phone number is also in the address field, chances are that the number is inaccurate
Issue 5: The business would have no way of communicating with this customer via e-mail
Issue 6: The company could determine the area code based on the customer’s address. This takes time, which costs the company money. This is a good reason to ensure that information is entered correctly the first time. All incorrect information needs to be fixed, which costs time and money

Characteristics of High Quality Data
1) Accurate 2) Complete 3) Consistent 4) Unique 5) Timely Characteristics of High Quality Information

1) Accurate Are the data (is the information) correct, precise, and exact? For example: Are the data factual? Are data error-free? Have data been verified? Correct spelling Precise numbers Accuracy Are all the values correct? For example, is the name spelled correctly? Is the dollar amount recorded properly?

2) Complete Are the data whole (complete) and do they have all the necessary parts? For example Are there missing values or pieces of data? Full street address Area code along with phone number Empty fields Full Names Completeness Are any of the values missing? For example, is the address complete including street, city, state, and zip code?

3) Consistent Are the data in agreement with themselves and with known facts? For example Does summary information agree with detailed information? Can you reconcile the data? Do mathematical manipulations yield correct results? Are data manipulations performed consistently for the entire data set? Consistency Is aggregate or summary information in agreement with detailed information? For example, do all total fields equal the true total of the individual fields?

4) Unique Are the data unique (one of a kind) or are there redundant, repetitious or unnecessary data stored in the same database? For example: Are there duplicate records for the same “event”? Are there different versions of “the same” file or event (which is the latest or most accurate?) Uniqueness Is each transaction, entity, and event represented only once in the information? For example, are there any duplicate customers?

5) Timely Are the data current with respect to decision-making needs?
Timeliness depends on the situation Real-time information – Immediate, up-to-date information Real-time system – Provides real-time information in response to requests “Real-time” is a relative description that depends on the use or need Timeliness Is the information current with respect to the business requirements? For example, is information updated weekly, daily, or hourly?
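Several of the five characteristics can be checked mechanically. The following is a minimal sketch, assuming a small set of customer records and one order held in memory; the field names, sample values, and the 90-day freshness threshold used in the usage note are all invented for illustration. Accuracy is left out because it usually requires verifying values against an outside source.

```python
from datetime import date

# Hypothetical customer records; field names and values are invented for illustration.
customers = [
    {"id": 1, "first": "Ann", "last": "Lee", "street": "12 Main St", "zip": "05401",
     "updated": date(2024, 5, 1)},
    {"id": 2, "first": "", "last": "Smith", "street": "", "zip": "05401",
     "updated": date(2024, 5, 2)},
    {"id": 3, "first": "Ann", "last": "Lee", "street": "12 Main St", "zip": "05401",
     "updated": date(2023, 1, 15)},
]

# Hypothetical order with amounts in cents, so the arithmetic is exact.
order = {"line_items_cents": [1999, 501, 1000], "total_cents": 3500}

REQUIRED = ("first", "last", "street", "zip")


def incomplete(records):
    """Completeness: ids of records with any empty required field."""
    return [r["id"] for r in records if any(not r[f] for f in REQUIRED)]


def duplicates(records):
    """Uniqueness: ids of records whose name and address repeat an earlier record."""
    seen, dupes = set(), []
    for r in records:
        key = (r["first"].lower(), r["last"].lower(), r["street"].lower(), r["zip"])
        if key in seen:
            dupes.append(r["id"])
        seen.add(key)
    return dupes


def totals_agree(o):
    """Consistency: the summary total equals the sum of the detail lines."""
    return sum(o["line_items_cents"]) == o["total_cents"]


def stale(records, as_of, max_age_days):
    """Timeliness: ids of records not updated within max_age_days of as_of."""
    return [r["id"] for r in records if (as_of - r["updated"]).days > max_age_days]
```

For example, `stale(customers, date(2024, 6, 1), 90)` flags only the record last updated in January 2023. Whether 90 days counts as "current" depends on the business need, which is the point made above about timeliness.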

Examples of how data can be of “poor” quality
Customers intentionally enter inaccurate information to protect their privacy or because they are irritated Different data entry standards and formats are used Operators enter abbreviated or erroneous information by accident or to save time Third party and external information contains inconsistencies, inaccuracies, and errors Addressing the above sources of information inaccuracies will significantly improve the quality of organizational information Determine a few additional sources of low quality information A customer service representative could accidentally transpose a number in an address or misspell a last name

What is a Database? Database – a collection of information organized in a way that provides efficient retrieval There are electronic and physical databases (paper/print) A database can be a very simple collection of data such as alphabetically arranging names in an address book

What Is a Database? Self-describing collection of integrated records
includes Meta Data about the fields/attributes Governs data acceptable formats for consistency Hierarchy of data elements Columns/Fields Rows/Records Tables/Relations A location to store and retrieve well structured and well governed data

What is a Database Management System (DBMS)?
Database management systems (DBMS) – A set of computer programs / software that allow users to store, modify, query, and retrieve data in an organized, systematic, and controlled manner

Database Management System (DBMS)
A database (the physical collection of data) is typically not portable across different DBMS Like application software, different DBMS are generally designed to work with specific system software and specific database schema

What is a database schema? The way in which the objects in the database are logically grouped / organized What are the tables and how are they linked? What are the different user views? What types of procedures and queries are stored?

A database is typically something inside the DBMS, although in the case of a MS Excel workbook the database is a standalone object

Single File Data Management
MS Excel is a database, but it is not a DBMS! There is NO DB management component - each worksheet is a single large two-dimensional matrix A DBMS is software that is used to manage the database and provides a set of tools used to manipulate and query data A database is simply an organized collection of data that can be accessed

Why go beyond a Spreadsheet?
Need to Store Multiple Themes of Data Spreadsheets Lack Structure and are prone to error To reduce redundantly stored data Optimized Query/Reporting Databases ENFORCE Consistency of the Data Spreadsheets are Clumsy & Time consuming to Update, Append or Expand Multiple User Access

Why Redundancy and Duplication of data are Important to Avoid
Update, Insertion and Deletion Anomalies Poorly normalized tables that require duplicate entries…how do we ensure that when you change a value for one record that the duplicated value is changed? If an employee leaves or if you stop selling a specific product, should your system permit those records to be deleted? Would you have this level of control over a spreadsheet? Redundancy is great for backups but terribly inefficient for Data Structures Increase manual time required for development and data entry Increase required disk space Decrease processing speeds & response time Lead to data anomalies and inconsistencies
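An update anomaly can be illustrated with a small sketch. It assumes a spreadsheet-style layout in which a supplier's phone number is repeated on every product row; the product, supplier, and phone values are invented for illustration.

```python
# Hypothetical spreadsheet-style rows: the supplier phone is repeated on every product row.
flat_rows = [
    {"product": "Widget", "supplier": "Acme", "supplier_phone": "555-0100"},
    {"product": "Gadget", "supplier": "Acme", "supplier_phone": "555-0100"},
]

# Update anomaly: the phone is changed on one row, and the duplicate is left stale.
flat_rows[0]["supplier_phone"] = "555-0199"
phones_for_acme = {r["supplier_phone"] for r in flat_rows if r["supplier"] == "Acme"}

# Normalized layout: the phone is stored once and products refer to the supplier by name.
suppliers = {"Acme": {"phone": "555-0100"}}
products = [
    {"product": "Widget", "supplier": "Acme"},
    {"product": "Gadget", "supplier": "Acme"},
]

# One change now reaches every product that refers to the supplier.
suppliers["Acme"]["phone"] = "555-0199"
normalized_phones = {suppliers[p["supplier"]]["phone"] for p in products}
```

After the first update, `phones_for_acme` holds two different numbers for the same supplier, which is exactly the inconsistency described above. In the normalized layout there is only one place to change, so `normalized_phones` holds a single value.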

Types of Database Architectures
Hierarchical Model Parent/Child Tree Like Structure. Parents can have many children but children only one parent Network Model Permitted children to have many parents Offers more direct relationships between entities Mostly Replaced by Relational Model Object Model Ideal when demand for massive amounts of information about single items is frequent (high energy physics, molecular biology, spatial databases, telecommunications..) Relational Model Most Common and what we will study in this class By far the most dominant enterprise data structure

NoSQL database technologies RDBMS are not well suited to handle unstructured data NoSQL technologies offer increased flexibility and scalability NoSQL technologies are designed with “big data” needs as opposed to transaction processing needs in mind

RDBMS Most popular and common DBMS is the relational DBMS (RDBMS)
A standard program and user interface in the RDBMS is the Structured Query Language (SQL) A programming language used to create, modify, and retrieve information from a database Different databases use different (proprietary) variations of SQL
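To make "create, modify, and retrieve" concrete, here is a minimal sketch using SQLite through Python's built-in `sqlite3` module. SQLite is chosen only because it needs no installation; the `student` table, its columns, and the sample rows are assumptions made for the example, and other RDBMS products use slightly different SQL dialects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Create: define a table (hypothetical columns).
conn.execute(
    "CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL, major TEXT)"
)

# Create: add rows.
conn.executemany(
    "INSERT INTO student (student_id, name, major) VALUES (?, ?, ?)",
    [(1, "Ana", "BSAD"), (2, "Ben", "CS")],
)

# Modify: change one student's major.
conn.execute("UPDATE student SET major = ? WHERE student_id = ?", ("BSAD", 2))

# Retrieve: ask a question of the data.
rows = conn.execute(
    "SELECT name FROM student WHERE major = ? ORDER BY name", ("BSAD",)
).fetchall()
```

After the update, the query returns both students, since each now has the major `BSAD`.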

RDBMS RDBMS are still best for “most” business needs
Oracle: Oracle Database and MySQL IBM: DB2 and Informix Microsoft: SQL Server SAP: Sybase Enterprise and Sybase IQ Teradata

RDBMS Data are organized as a set of formal tables
Data can be accessed and combined in different ways without altering the data within the tables RDBMS can be easily extended / scaled – new data and new categories of data can be added without changing existing data

RDBMS Terminology Data model – A picture of logical structures that detail the relationships among data elements Metadata – Formal description of data structures (like tables and fields) and any constraints of the table or values within the table Data about the containers of data

RDBMS Terminology Data dictionary – Compiles all of the metadata about the elements in the data model
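As a rough illustration of metadata, the sketch below asks SQLite to describe a table and collects the answer into a small data dictionary. `PRAGMA table_info` is specific to SQLite; other DBMS products expose metadata through their own catalogs. The `student` table and its columns are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL, gpa REAL)"
)

# Each row returned by PRAGMA table_info is:
# (column position, name, declared type, NOT NULL flag, default value, primary-key flag)
data_dictionary = [
    {"column": name, "type": col_type, "not_null": bool(notnull), "primary_key": bool(pk)}
    for _cid, name, col_type, notnull, _default, pk in conn.execute(
        "PRAGMA table_info(student)"
    )
]
```

Nothing in `data_dictionary` is student data. It describes the container: which fields exist, what type each holds, and which constraints apply.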

Entity Sets (Tables) Relational table or entity set – Each table consists of columns (fields/attributes) and rows (records/entities) The table has a name that describes the group of related entities within the table For example, a table labeled “Student” would contain a group of student entities

Entity / Record / Row A person, place, thing, transaction, or event about which data are being collected and stored The individual rows in a table contain entities Each row is also referred to as a record Example?

Attributes / Field / Column
The data elements that describe the characteristics of a specific entity The columns in each table contain the attributes Example?

What is a Relationship? When designing a relational DB, data are grouped into tables Each table contains all related data elements For example we would store data related to customer (name, address, phone, etc.) and data related to the customer’s particular order (orderID, date, shipping method, etc.) in different tables (Customer and Order)

What is a Relationship? All information specific to a customer would go into a “Customer” table All information specific to the orders would go into an “Order” table We would then create a relationship between the tables that allows us to match a particular customer with a particular order
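The Customer and Order example can be sketched in SQLite as follows. The column names and sample rows are assumptions, and the order table is named `orders` because `ORDER` is a reserved word in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Customer-specific data lives in one table...
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")

# ...and order-specific data in another, which carries the customer's id.
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER REFERENCES customer(customer_id),"
    " order_date TEXT,"
    " shipping_method TEXT)"
)

conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ana", "555-0100"), (2, "Ben", "555-0101")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(100, 1, "2024-03-01", "ground"),
     (101, 1, "2024-03-09", "air"),
     (102, 2, "2024-03-10", "ground")],
)

# The relationship lets a query match each order to its customer.
matched = conn.execute(
    "SELECT c.name, o.order_id"
    " FROM customer AS c JOIN orders AS o ON o.customer_id = c.customer_id"
    " ORDER BY o.order_id"
).fetchall()
```

One customer (Ana) appears with two orders, which makes this a one-to-many relationship. Her name and phone are stored only once.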

What is a Relationship? A relationship in an RDBMS is an association between the entities within the different tables There are THREE (3) types of relationships: One-to-One (1:1) One-to-Many (1:M) Many-to-Many (M:M)

Creating Relationships Through Keys
KEYS are used to create relationships between the entities in different tables in the RDB Primary key – A field (or group of fields) that uniquely identifies a given entity in a table Foreign key – A primary key of one table that appears as an attribute in another table and acts to provide a logical relationship between the two tables

Creating Relationships Through Keys
For our purposes: Every table in a RDBMS MUST have a primary key The foreign key is not required in every table and will only appear on the “many” side of the relationship
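A primary key made of a group of fields can be sketched as follows. The `enrollment` table, the course codes, and the grades are hypothetical; the point is only that the DBMS refuses a second row with the same key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Neither field is unique on its own, but the pair (student_id, course_id) is.
conn.execute(
    "CREATE TABLE enrollment ("
    " student_id INTEGER,"
    " course_id TEXT,"
    " grade TEXT,"
    " PRIMARY KEY (student_id, course_id))"
)


def enroll(student_id, course_id, grade):
    """Try to add an enrollment; report whether the DBMS accepted it."""
    try:
        conn.execute("INSERT INTO enrollment VALUES (?, ?, ?)", (student_id, course_id, grade))
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"  # same primary key already exists


outcomes = [
    enroll(1, "BSAD141", "A"),
    enroll(1, "BSAD142", "B"),  # same student, different course: a new key
    enroll(1, "BSAD141", "C"),  # same student and course: duplicate key
]
row_count = conn.execute("SELECT COUNT(*) FROM enrollment").fetchone()[0]
```

The third insert is rejected, so the table still holds exactly one row per student and course.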

Advantages of RDBMS RDBMS advantages from a business perspective include 1) Flexibility 2) Scalability and performance 3) Improved information integrity (quality) Reduced information redundancy 4) Information security A good way to explain databases is to compare them to spreadsheets What are the limitations when using a spreadsheet? Limited number of rows and columns (Excel 2003 and earlier: 65,536 rows by 256 columns; Excel 2007 and later: 1,048,576 rows by 16,384 columns) Once you need more rows than the limit you have outgrown your spreadsheet Only one user can access the spreadsheet Users can view all information in the spreadsheet Users can change all information in the spreadsheet All of the disadvantages associated with a spreadsheet are fixed when using a database

1) Flexibility Handle changes quickly and easily
Provide users with different views of the data Arranging data items in different ways depending on the specific user need Showing a particular user only some of the available fields while not showing them other fields
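One common way an RDBMS offers different views of the same data is a SQL view. The sketch below assumes a hypothetical `employee` table and defines a directory view that shows names and departments but hides salaries. SQLite has no user accounts, so restricting who may query the view is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(1, "Ana", "Sales", 70000), (2, "Ben", "IT", 82000)],
)

# A view is a stored query: it exposes selected fields without copying the data.
conn.execute("CREATE VIEW employee_directory AS SELECT name, dept FROM employee")

cursor = conn.execute("SELECT * FROM employee_directory ORDER BY name")
visible_columns = [col[0] for col in cursor.description]
directory = cursor.fetchall()
```

A user who works only with `employee_directory` sees two of the four fields, while the underlying table is unchanged.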

1) Flexibility: Schema Different database schema can be “owned” by or associated with different users The schema is a user personalized set of tables, views, and indexes

2) Scalability and Performance
A DBMS must expand to meet increased demand, while maintaining acceptable performance levels Scalability – Refers to how well a system can adapt to increased demands Performance – Measures how quickly a system performs a certain process or transaction What happens to a business if it suddenly experiences a 60 percent growth in sales and its IT systems fail with all of the increased activity?

3) Information Integrity
Information integrity – a measure of information quality Know that data have not been entered incorrectly or altered in an unauthorized manner Integrity constraint – rules that help ensure the quality of information We will discuss entity integrity and referential integrity (there is also domain integrity) Can you define two relational integrity constraints for an ordering system? Users cannot create an order for a nonexistent customer An order cannot be shipped without an address Can you define two business-critical integrity constraints for an ordering system? Product returns are not accepted for fresh product 15 days after purchase A discount maximum of 20 percent
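Two of the constraints listed above can be expressed directly in the table definitions. The sketch below uses SQLite with hypothetical table and column names. A foreign key prevents an order for a nonexistent customer (a relational integrity constraint), and a `CHECK` clause caps the discount at 20 percent (a business-critical constraint).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customer(customer_id),"
    " discount_pct INTEGER NOT NULL CHECK (discount_pct BETWEEN 0 AND 20))"
)
conn.execute("INSERT INTO customer VALUES (1, 'Ana')")


def place_order(order_id, customer_id, discount_pct):
    """Try to insert an order; report whether the constraints allowed it."""
    try:
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?)", (order_id, customer_id, discount_pct)
        )
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


results = [
    place_order(100, 1, 10),   # existing customer, discount within the limit
    place_order(101, 99, 10),  # customer 99 does not exist
    place_order(102, 1, 35),   # discount above 20 percent
]
```

The rules live in the database itself, so every application that writes to these tables is held to them.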

3) Information Integrity: Controlling Redundancy
Redundant data are ok if they serve a specific purpose such as being used as backup directly linked to the source Backup systems promote fault tolerance Unintentional redundancy is not good Wasted storage Difficult to modify Possible inconsistencies

4) Information Security
Information is an organizational asset and must be protected RDBMS offer several security features Access level – Determines the level of access each individual user has Who can access the DBMS Access control – Determines the types of things each group can do Types of access, such as power to create, modify, delete, and/or read Which types of SQL statements can be executed Why would you want to define access level security? Access levels will typically mimic the hierarchical structure of the organization and protect organizational information from being viewed and manipulated by individuals who should not have access to the sensitive or confidential information Low level employees typically have the lowest levels of access High level employees typically have access to all types of database information For example: You would not want analysts viewing all salary information for the entire company - in general: Analysts can usually only view their own salary Managers have higher access and can view the salaries of all their team members, but cannot view other managers’ salaries Directors can view all of their managers’ and analysts’ salaries, but not other directors’ salaries The CFO and CEO can view every employee’s salary
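The salary example can be written as a small rule. This is only a sketch of the access logic in plain Python, not how any particular RDBMS implements permissions; the org chart, the employee names, and the choice to give the CEO and CFO full access by role are assumptions drawn from the example above.

```python
# Hypothetical org chart: employee -> (direct manager, role). Names are invented.
ORG = {
    "ceo": (None, "CEO"),
    "cfo": ("ceo", "CFO"),
    "dir_a": ("ceo", "Director"),
    "mgr_a": ("dir_a", "Manager"),
    "mgr_b": ("dir_a", "Manager"),
    "ana": ("mgr_a", "Analyst"),
    "ben": ("mgr_b", "Analyst"),
}

# Roles assumed to see every salary, per the example.
FULL_ACCESS_ROLES = {"CEO", "CFO"}


def reports_to(employee, boss):
    """True if boss appears anywhere above employee in the reporting chain."""
    manager = ORG[employee][0]
    while manager is not None:
        if manager == boss:
            return True
        manager = ORG[manager][0]
    return False


def can_view_salary(viewer, target):
    """Viewers see their own salary and those of people below them in their chain."""
    if ORG[viewer][1] in FULL_ACCESS_ROLES:
        return True
    return viewer == target or reports_to(target, viewer)
```

With this rule, `can_view_salary("mgr_a", "ana")` is true because Ana is on that manager's team. `can_view_salary("mgr_a", "mgr_b")` is false because one manager may not see another manager's salary.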

Multiuser Issues DBMS serve many different users with different needs
Many users may require concurrent access to the same data Must preserve integrity of data and the performance of the system

Multiuser Issues Problem: if multiple users (say tens or even hundreds of users) access the same data concurrently, how does the DBMS let one user change data without that change overwriting, or being overwritten by, another user’s change? This is typically referred to as the Lost-update problem
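The lost-update problem can be reproduced in a few lines. This sketch assumes two users who each read an inventory count, work out a new value from what they read, and write it back; the product and quantities are invented for illustration.

```python
inventory = {"widget": 10}

# Both users read the same starting quantity before either one writes.
user_a_read = inventory["widget"]
user_b_read = inventory["widget"]

# User A sells 3 and writes back 7.
inventory["widget"] = user_a_read - 3

# User B sells 2 and writes back 8, computed from the stale value B read earlier.
inventory["widget"] = user_b_read - 2

stored_quantity = inventory["widget"]  # what the data now says
expected_quantity = 10 - 3 - 2         # what it should say after both sales
```

User A's sale has vanished: the stored quantity is 8 when it should be 5. Neither user did anything wrong individually; the error comes from the interleaving of their reads and writes.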

Enterprise DBMS

Multiuser Issues Concurrent access is addressed through the use of transactions and locks Transaction – a single indivisible action that affects some data Once a transaction is committed, it is permanent and changes are visible to all users If a transaction is not committed, changes are “rolled back” or reversed
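Commit and rollback can be demonstrated with SQLite. The sketch below assumes a hypothetical `account` table whose balance may not go negative, and treats a transfer between two accounts as one indivisible action. In Python's `sqlite3` module, using the connection as a context manager commits if the block succeeds and rolls back if it raises an exception.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("checking", 100), ("savings", 50)])
conn.commit()


def transfer(amount):
    """Move amount from checking to savings as one all-or-nothing transaction."""
    try:
        with conn:  # commit on success, roll back on any exception
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = 'savings'", (amount,)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = 'checking'", (amount,)
            )
        return True
    except sqlite3.IntegrityError:
        return False


first_ok = transfer(30)    # both updates succeed and are committed
second_ok = transfer(500)  # the second update would overdraw checking, so both are undone
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

In the failed transfer the deposit into savings had already run, but the rollback reverses it, so no money is created.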

Multiuser Issues Locks – literally “locks” the data so that changes cannot be made on the data while another transaction is in process
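The idea of a lock can be sketched with Python threads standing in for concurrent users. This is an analogy only: a DBMS manages its own locks internally, typically on rows, pages, or tables. The inventory figures and the number of threads are invented for illustration.

```python
import threading

inventory = {"widget": 1000}
lock = threading.Lock()


def sell(quantity, times):
    """Sell `quantity` units, `times` times over, holding the lock for each change."""
    for _ in range(times):
        with lock:  # only one thread at a time may read, compute, and write
            current = inventory["widget"]
            inventory["widget"] = current - quantity


# Three concurrent "users", each selling 1 unit 300 times.
threads = [threading.Thread(target=sell, args=(1, 300)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

remaining = inventory["widget"]
```

Because each read-and-write pair happens while the lock is held, no sale is overwritten and all 900 units are accounted for.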

Summary Five characteristics of quality information
Define database, DBMS, RDBMS, and supporting components and terminology Advantages of RDBMS What is SQL? Describe the lost-update problem and how it is addressed
