HowMessy: How Messy is Your Database?




Similar presentations
Disk Storage, Basic File Structures, and Hashing

Methods of Access Serial Sequential Indexed Sequential Random Access
 Definition of B+ tree  How to create B+ tree  How to search for record  How to delete and insert a data.
Hashing and Indexing John Ortiz.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.
File Processing - Indirect Address Translation MVNC1 Hashing Indirect Address Translation Chapter 11.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 18 Indexing Structures for Files.
File Management Chapter 12. File Management A file is a named entity used to save results from a program or provide data to a program. Access control.
Searching Kruse and Ryba Ch and 9.6. Problem: Search We are given a list of records. Each record has an associated key. Give efficient algorithm.
23/05/20151 Data Structures Random Access Files. 223/05/2015 Learning Objectives Explain Random Access Searches. Explain the purpose and operation of.
1 Lecture 8: Data structures for databases II Jose M. Peña
Chapter 11: File System Implementation
1 Hash Tables Gordon College CS Hash Tables Recall order of magnitude of searches –Linear search O(n) –Binary search O(log 2 n) –Balanced binary.
© 2006 Pearson Addison-Wesley. All rights reserved13 A-1 Chapter 13 Hash Tables.
Database Implementation Issues CPSC 315 – Programming Studio Spring 2008 Project 1, Lecture 5 Slides adapted from those used by Jennifer Welch.
2010/3/81 Lecture 8 on Physical Database DBMS has a view of the database as a collection of stored records, and that view is supported by the file manager.
METU Department of Computer Eng Ceng 302 Introduction to DBMS Disk Storage, Basic File Structures, and Hashing by Pinar Senkul resources: mostly froom.
COMP 171 Data Structures and Algorithms Tutorial 10 Hash Tables.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Chapter 13 Disk Storage, Basic File Structures, and Hashing.
File Organizations and Indexing Lecture 4 R&G Chapter 8 "If you don't find it in the index, look very carefully through the entire catalogue." -- Sears,
CSCI 5708: Query Processing I Pusheng Zhang University of Minnesota Feb 3, 2004.
1.1 CAS CS 460/660 Introduction to Database Systems File Organization Slides from UC Berkeley.
© 2006 Pearson Addison-Wesley. All rights reserved13 B-1 Chapter 13 (excerpts) Advanced Implementation of Tables CS102 Sections 51 and 52 Marc Smith and.
File Structures Dale-Marie Wilson, Ph.D.. Basic Concepts Primary storage Main memory Inappropriate for storing database Volatile Secondary storage Physical.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes Chapter 9.
Indexing structures for files D ƯƠ NG ANH KHOA-QLU13082.
MA/CSSE 473 Day 28 Hashing review B-tree overview Dynamic Programming.
File Systems (1). Readings r Silbershatz et al: 10.1,10.2,
Database Management Systems, R. Ramakrishnan and J. Gehrke1 File Organizations and Indexing Chapter 5, 6 of Elmasri “ How index-learning turns no student.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 17 Disk Storage, Basic File Structures, and Hashing.
1 Physical Data Organization and Indexing Lecture 14.
File Systems Long-term Information Storage Store large amounts of information Information must survive the termination of the process using it Multiple.
1 Index Structures. 2 Chapter : Objectives Types of Single-level Ordered Indexes Primary Indexes Clustering Indexes Secondary Indexes Multilevel Indexes.
© 2006 Pearson Addison-Wesley. All rights reserved13 B-1 Chapter 13 (continued) Advanced Implementation of Tables.
Comp 335 File Structures Hashing.
1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003.
Chapter- 14- Index structures for files
Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.
Indexing and hashing Azita Keshmiri CS 157B. Basic concept An index for a file in a database system works the same way as the index in text book. For.
Hashing Hashing is another method for sorting and searching data.
Chapter 12 Query Processing (1) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
March 23 & 28, Csci 2111: Data and File Structures Week 10, Lectures 1 & 2 Hashing.
March 23 & 28, Hashing. 2 What is Hashing? A Hash function is a function h(K) which transforms a key K into an address. Hashing is like indexing.
File Structures. 2 Chapter - Objectives Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files Hashed Files Dynamic and.
Chapter 11 Hash Tables © John Urrutia 2014, All Rights Reserved1.
Database Management 7. course. Reminder Disk and RAM RAID Levels Disk space management Buffering Heap files Page formats Record formats.
Hash Tables. 2 Exercise 2 /* Exercise 1 */ void mystery(int n) { int i, j, k; for (i = 1; i
COSC 2007 Data Structures II Chapter 13 Advanced Implementation of Tables IV.
Appendix C File Organization & Storage Structure.
Hashtables. An Abstract data type that supports the following operations: –Insert –Find –Remove Search trees can be used for the same operations but require.
CPSC 252 Hashing Page 1 Hashing We have already seen that we can search for a key item in an array using either linear or binary search. It would be better.
Lec 5 part2 Disk Storage, Basic File Structures, and Hashing.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 18 Indexing Structures for Files.
Indexing Structures Database System Implementation CSE 507 Some slides adapted from R. Elmasri and S. Navathe, Fundamentals of Database Systems, Sixth.
1 Tree-Structured Indexes Chapter Introduction  As for any index, 3 alternatives for data entries k* :  Data record with key value k   Choice.
Chapter 5 Record Storage and Primary File Organizations
Appendix C File Organization & Storage Structure.
TOPIC 5 ASSIGNMENT SORTING, HASH TABLES & LINKED LISTS Yerusha Nuh & Ivan Yu.
Database Management 7. course. Reminder Disk and RAM RAID Levels Disk space management Buffering Heap files Page formats Record formats.
Azita Keshmiri CS 157B Ch 12 indexing and hashing
Database System Implementation CSE 507
Subject Name: File Structures
Disk Storage, Basic File Structures, and Hashing
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
Exercise Open the Store database and copy the m-customer dataset into a file called Custfile Then look at the contents of Custfile >base store >get m-customer.
Inside Module 3 Working with Eloquence Page
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
CS202 - Fundamental Structures of Computer Science II
Chapter 12 Query Processing (1)
File Organization.
Presentation transcript:

1  HowMessy: How Messy is Your Database?

                                           Page
- How messy is your database?                 2
- Hashing algorithm                           5
- Interpreting master dataset lines          12
- Master dataset solutions                   15
- HowMessy sample report (detail dataset)    17
- Repacking a detail dataset                 22
- Detail dataset solutions                   26
- Estimating response time                   29
- Automating HowMessy analysis               30
- Summary                                    33

2  How messy is your database?

- A database is messy if it takes more I/O than it should
- Unnecessary I/O is still a major limiting factor, even on MPE/iX machines
- Databases are messy by nature
- Run HowMessy or DBLOADNG against your database
- HowMessy is a bonus program for Robelle customers
- DBLOADNG is a contributed library program

3  Blocks

- TurboIMAGE does all I/O operations in blocks
- A block may contain many user records
- More entries per block means fewer I/Os
- Fewer I/Os means better performance

(Diagram: a dataset as a series of blocks of user data, with its blocking factor and capacity)
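A minimal sketch of the arithmetic behind this slide (the capacity and blocking factors below are made-up numbers, not figures from the presentation): each block read is one I/O, so more entries per block means fewer blocks and fewer I/Os.

    import math

    def blocks_needed(capacity, blocking_factor):
        # One block holds 'blocking_factor' entries, so 'capacity' record
        # locations occupy this many blocks.
        return math.ceil(capacity / blocking_factor)

    # Made-up capacity: doubling the blocking factor halves the blocks,
    # and therefore the I/Os, needed to read the whole dataset serially.
    for bf in (4, 8, 16):
        print(bf, "entries/block ->", blocks_needed(10000, bf), "blocks")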

4  Record location in masters

- Search item values must be unique
- Location of entries is determined by a hashing algorithm or a primary address calculation
- Calculation is done on the search item value to transform it into a record number between one and the capacity
- Different calculation depending on the search item type
- X, U, Z, and P give random results
- I, J, K, R, and E give predictable results
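The exact primary address calculation used by TurboIMAGE is not shown in the slides; the sketch below is only an illustrative stand-in that captures the idea for X/U/Z/P keys: the bytes of the search item value are folded into a record number between one and the capacity, so results look random.

    def illustrative_record_number(search_value: str, capacity: int) -> int:
        # NOT the real TurboIMAGE hashing algorithm -- just a stand-in that
        # folds the key's bytes into a number in the range 1..capacity.
        n = 0
        for byte in search_value.encode("ascii"):
            n = (n * 256 + byte) % capacity
        return n + 1

    # Example: two customer numbers usually land on different record numbers.
    print(illustrative_record_number("AA1000", 9973))
    print(illustrative_record_number("BD2134", 9973))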

5  Hashing algorithm

- Customer number AA1000 is transformed into a record number

(Diagram: customer number AA1000 hashing to a record location in block 3162; blocking factor = 8)

6  Hashing algorithm (no collision)

- Customer number BD2134 gives a different record number in a different block

(Diagram: BD2134 hashing to a record location in block 7759, away from AA1000 in block 3162; blocking factor = 8)

7  Hashing algorithm (collision - same block)

- Customer number CL1717 hashes to the same record number as AA1000's location
- TurboIMAGE tries to find an empty location in the same block. If it finds one, no additional I/O is required.
- CL1717 becomes a secondary entry. Primary and secondary entries are linked using pointers that form a chain.

(Diagram: CL1717 placed next to AA1000 in block 3162)

8  Hashing algorithm (collision - different block)

- Customer number MD4884 collides with AA1000
- No more room in this block. TurboIMAGE reads the following blocks until it finds a free record location.
- In this case, MD4884 is placed two blocks away, which requires two additional I/Os.

(Diagram: block 3163 is full; MD4884 lands in block 3164, two blocks past AA1000 and CL1717 in block 3162)
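A sketch of the placement behaviour described on slides 5 through 8, under simplifying assumptions (an in-memory set of occupied locations, a fixed blocking factor of 8, and made-up record numbers and capacity); it is not TurboIMAGE code, only an illustration of when a collision costs extra I/Os.

    BLOCKING_FACTOR = 8

    def place(occupied, capacity, record_number):
        """Toy model of the placement rules on slides 5-8 (not TurboIMAGE itself).
        occupied: set of record locations already in use.
        Returns (location, is_secondary, extra_ios).
        Assumes the dataset is not completely full."""
        if record_number not in occupied:
            return record_number, False, 0            # primary entry, no collision
        total_blocks = (capacity + BLOCKING_FACTOR - 1) // BLOCKING_FACTOR
        block = (record_number - 1) // BLOCKING_FACTOR
        extra_ios = 0
        while True:
            start = block * BLOCKING_FACTOR + 1
            for loc in range(start, min(start + BLOCKING_FACTOR, capacity + 1)):
                if loc not in occupied:
                    return loc, True, extra_ios       # secondary entry
            block = (block + 1) % total_blocks        # block full: read the next one
            extra_ios += 1

    # CL1717 collides with AA1000 but still fits in the same block (no extra I/O);
    # the record number and capacity here are made up.
    occupied = {25289}
    print(place(occupied, 30000, 25289))              # -> (25290, True, 0)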

9  An example TurboIMAGE database

(Diagram: master datasets M-CUSTOMER and A-ORDER-NO linked to detail datasets D-ORDERS and D-ORD-ITEMS through the CUSTOMER-NO and ORDER-NO search items)

10  HowMessy sample report

HowMessy/XL (Version 2.2.1)      Data Base: STORE.DATA.INVENT    Run on: MON, JAN 9, 1995, 11:48 AM
TurboIMAGE/3000 databases        By Robelle Consulting Ltd.                                  Page: 1

                                             Load    Secon-   Max     Blk
Data Set        Type   Capacity   Entries    Factor  daries   Blks    Fact
                                                       (Highwater)
M-Customer      Man                              %   30.5%
A-Order-No      Ato                              %   25.7%    1        70
D-Orders        Det                              %   (     )           32
D-Ord-Items     Det                              %   (     )           23

                  Max    Ave    Std   Expd    Avg     Ineff   Elong-
Search Field      Chain  Chain  Dev   Blocks  Blocks  Ptrs    ation
  Customer-No                                             %    1.90
  Order-No                                                %    1.00
 !Order-No                                                %    1.00
S Customer-No                                             %    5.25
S !Order-No                                               %    8.34

11  HowMessy sample report (master dataset)

HowMessy/XL (Version 2.2.1)      Data Base: STORE.DATA.INVENT    Run on: MON, JAN 9, 1995, 11:48 AM
TurboIMAGE/3000 databases        By Robelle Consulting Ltd.                                  Page: 1

                                             Load    Secon-   Max     Blk
Data Set        Type   Capacity   Entries    Factor  daries   Blks    Fact
                                                       (Highwater)
M-Customer      Man                              %   30.5%
A-Order-No      Ato                              %   25.7%    1        70
D-Orders        Det                              %   (     )           32
D-Ord-Items     Det                              %   (     )           23

                  Max    Ave    Std   Expd    Avg     Ineff   Elong-
Search Field      Chain  Chain  Dev   Blocks  Blocks  Ptrs    ation
  Customer-No                                             %    1.90
  Order-No                                                %    1.00
 !Order-No                                                %    1.00
S Customer-No                                             %    5.25
S !Order-No                                               %    8.34

12  Interpreting master dataset lines

Pay attention to the following statistics:
- High percentage of Secondaries (inefficient hashing)
- High Maximum Blocks (clustering)
- High Maximum and Average Chains (inefficient hashing)
- High Inefficient Pointers (when secondaries exist)
- High Elongation (when secondaries exist)
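These rules of thumb can be turned into a simple check; the sketch below does so in Python, with threshold values that are illustrative assumptions rather than limits defined by HowMessy.

    def master_warnings(secondaries_pct, max_blks, max_chain, avg_chain,
                        ineff_ptrs_pct, elongation):
        """Flag the conditions listed on this slide.
        All thresholds are assumptions for illustration."""
        w = []
        if secondaries_pct > 30:
            w.append("high percentage of secondaries (inefficient hashing)")
        if max_blks > 1:
            w.append("high maximum blocks (clustering)")
        if max_chain > 10 or avg_chain > 1.5:
            w.append("long synonym chains (inefficient hashing)")
        if secondaries_pct > 0 and ineff_ptrs_pct > 50:
            w.append("high inefficient pointers")
        if secondaries_pct > 0 and elongation > 1.5:
            w.append("high elongation")
        return w

    # Illustrative figures for a master dataset line.
    print(master_warnings(30.5, 4, 22, 1.4, 57.0, 1.90))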

13  Report on m-customer

- The number of Secondaries is not unusually high
- However, there may be problems
- Records are clustering (high Max Blks)
- Long synonym chain
- High percentage of Inefficient Pointers

                                             Load    Secon-   Max     Blk
Data Set        Type   Capacity   Entries    Factor  daries   Blks    Fact
                                                       (Highwater)
M-CUSTOMER      Man                              %   30.5%

                  Max    Ave    Std   Expd    Avg     Ineff   Elong-
Search Field      Chain  Chain  Dev   Blocks  Blocks  Ptrs    ation
  CUSTOMER-NO                                             %    1.90

14  Report on a-order-no

- Very tidy dataset
- Number of Secondaries is acceptable
- Max Blks, Ineff Ptrs and Elongation are at the minimum values, even if the maximum chain length is a bit high

                                             Load    Secon-   Max     Blk
Data Set        Type   Capacity   Entries    Factor  daries   Blks    Fact
                                                       (Highwater)
A-ORDER-NO      Ato                              %   25.7%    1        70

                  Max    Ave    Std   Expd    Avg     Ineff   Elong-
Search Field      Chain  Chain  Dev   Blocks  Blocks  Ptrs    ation
  ORDER-NO                                                %    1.00

15  Master dataset solutions

- Increase capacity to a higher odd number
- Increase the Blocking Factor
- Increase block size
- Reduce record size
- Change binary keys to type X, U, Z, or P
- Check your database early in the design
- Use HowMessy on test databases
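A small helper for the first point, as a sketch; the 70% target load factor is an assumed parameter, not a figure from the presentation.

    import math

    def next_odd_capacity(entries, target_load=0.70):
        # Smallest odd capacity that keeps the dataset at or below target_load.
        cap = math.ceil(entries / target_load)
        return cap if cap % 2 == 1 else cap + 1

    print(next_odd_capacity(60000))   # 85715 for an assumed 70% load factor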

16  HowMessy Exercise 1

                                             Load    Secon-   Max     Blk
Data Set        Type   Capacity   Entries    Factor  daries   Blks    Fact
                                                       (Highwater)
A-MASTER        Ato                              %   36.8%

                  Max    Ave    Std   Expd    Avg     Ineff   Elong-
Search Field      Chain  Chain  Dev   Blocks  Blocks  Ptrs    ation
  MASTER-KEY                                              %    1.88

17  HowMessy sample report (detail dataset)

HowMessy/XL (Version 2.2.1)      Data Base: STORE.DATA.INVENT    Run on: MON, JAN 9, 1995, 11:48 AM
for TurboIMAGE/3000 databases    By Robelle Consulting Ltd.                                  Page: 1

                                             Load    Secon-   Max     Blk
Data Set        Type   Capacity   Entries    Factor  daries   Blks    Fact
                                                       (Highwater)
M-CUSTOMER      Man                              %   30.5%
A-ORDER-NO      Ato                              %   25.7%    1        70
D-ORDERS        Det                              %   (     )           12
D-ORD-ITEMS     Det                              %   (     )           23

                  Max    Ave    Std   Expd    Avg     Ineff   Elong-
Search Field      Chain  Chain  Dev   Blocks  Blocks  Ptrs    ation
  Customer-No                                             %    1.90
  Order-No                                                %    1.00
 !Order-No                                                %    1.00
S Customer-No                                             %    5.25
S !Order-No                                               %    8.34

18  Empty detail dataset

- Records are stored in the order they are created, starting from record 1
- Records for the same customer are linked together using pointers to form a chain
- Chains are linked to the corresponding master entry

(Diagram: D-ORD-HEADER with customer/order records for AA1000, BD2134, CL1717 and MD4884 stored in arrival order; blocking factor = 8)

19  Detail chains get scattered

- Over time, records for the same customer are scattered over multiple blocks

(Diagram: AA1000's order records spread across several blocks, including blocks 1 and 23)

20  Delete chain

- Deleted records are linked together
- TurboIMAGE reuses the records in the Delete chain, if there are any

(Diagram: deleted record locations linked together into a delete chain)

21  Highwater mark

- Indicates the highest record location used so far
- Serial reads scan the dataset up to the highwater mark

(Diagram: D-ORD-HEADER blocks 1 through 8000; used blocks below the highwater mark are some empty, some partially used, some full)
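A sketch of why this matters: the cost of a serial read follows the highwater mark, not the number of entries that remain (the numbers below are made up).

    import math

    def serial_read_ios(highwater, blocking_factor):
        # A serial read touches every block up to the highwater mark.
        return math.ceil(highwater / blocking_factor)

    # Made-up example: after heavy deleting only 20,000 entries remain,
    # but the highwater mark still forces the scan over 80,000 locations.
    print(serial_read_ios(80000, 8))   # 10000 I/Os
    print(serial_read_ios(20000, 8))   # 2500 I/Os once a repack resets the mark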

22  Repacking a detail dataset

- Groups records along the primary path
- Removes the Delete chain (no holes)
- Resets the highwater mark

(Diagram: before and after repacking; AA1000's records, once scattered, become adjacent in block 1, and the highwater mark drops)
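A toy model of what a repack accomplishes, under obvious simplifications (records held in a Python list rather than in TurboIMAGE blocks): drop the delete chain, group the survivors by primary-path value, and let the highwater mark fall back to the number of live entries.

    def repack(records, primary_path):
        """records: list of dicts, or None for a deleted record location.
        Returns the repacked list; its length is the new highwater mark."""
        live = [r for r in records if r is not None]     # remove the delete chain
        live.sort(key=lambda r: r[primary_path])         # group along the primary path
        return live

    records = [
        {"customer": "AA1000", "order": "O1"}, None,
        {"customer": "MD4884", "order": "O7"},
        {"customer": "AA1000", "order": "O9"}, None,
    ]
    packed = repack(records, "customer")
    print("highwater mark resets from", len(records), "to", len(packed))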

23  Interpreting detail dataset lines

Pay attention to the following statistics:
- Load Factor approaching 100% (dataset full)
- Primary path (large Average Chain and often accessed)
- High Average Chain and low Standard Deviation, especially with a sorted path (is the path really needed?)
- High Inefficient Pointers (entries in chain not consecutive)
- High Elongation (entries in chain not consecutive)
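The same idea for a detail-dataset path, sketched with assumed thresholds rather than HowMessy-defined limits.

    def detail_warnings(load_factor_pct, avg_chain, std_dev, is_sorted_path,
                        ineff_ptrs_pct, elongation):
        """Illustrative checks for one detail-dataset path; thresholds are assumed."""
        w = []
        if load_factor_pct > 90:
            w.append("dataset close to full")
        if is_sorted_path and avg_chain > 1 and std_dev < 0.5:
            w.append("uniform chains on a sorted path: is the path really needed?")
        if ineff_ptrs_pct > 50:
            w.append("chain entries not consecutive (high inefficient pointers)")
        if elongation > 1.5:
            w.append("chain entries not consecutive (high elongation)")
        return w

    # Illustrative figures for a sorted detail path.
    print(detail_warnings(81.5, 2.3, 0.2, True, 60.0, 8.34))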

24  Report on d-orders

- Primary path should be on customer-no, not on order-no
- Highwater mark is high
- Repack along the new primary path regularly

                                             Load    Secon-   Max     Blk
Data Set        Type   Capacity   Entries    Factor  daries   Blks    Fact
                                                       (Highwater)
D-ORDERS        Det                              %   (     )           12

                  Max    Ave    Std   Expd    Avg     Ineff   Elong-
Search Field      Chain  Chain  Dev   Blocks  Blocks  Ptrs    ation
 !ORDER-NO                                                %    1.00
S CUSTOMER-NO                                             %    5.25

25  Report on d-ord-items

- Inefficient Pointers and Elongation are high
- Highwater mark is fairly high
- Repack the dataset regularly
- Is the sorted path really needed?

                                             Load    Secon-   Max     Blk
Data Set        Type   Capacity   Entries    Factor  daries   Blks    Fact
                                                       (Highwater)
D-ORD-ITEMS     Det                              %   (     )           23

                  Max    Ave    Std   Expd    Avg     Ineff   Elong-
Search Field      Chain  Chain  Dev   Blocks  Blocks  Ptrs    ation
S !ORDER-NO

26  Detail dataset solutions

- Assign the primary path correctly: the search item with an Average Chain length > 1 that is accessed most often
- Repack datasets along the primary path regularly
- Increase the Blocking Factor
- Increase block size
- Reduce record size
- Understand sorted paths
- Check your databases early in the design; use HowMessy on test databases

27  HowMessy Exercise 2

                                             Load    Secon-   Max     Blk
Data Set        Type   Capacity   Entries    Factor  daries   Blks    Fact
                                                       (Highwater)
D-ITEMS         Det                              %   (     )            7

                  Max    Ave    Std   Expd    Avg     Ineff   Elong-
Search Field      Chain  Chain  Dev   Blocks  Blocks  Ptrs    ation
S !ITEM-NO                                                %    1.00
S SUPPLIER-NO                                             %    1.86
  LOCATION                                                %    1.13
  BO-STATUS                                               %    1.00
  DISCOUNT                                                %   10.55

28  Minimum number of disc I/Os

Intrinsic      Disc I/Os
DBGET          1
DBFIND         1
DBBEGIN        1
DBEND          1
DBUPDATE       1    (non-critical item)
DBUPDATE       13   (critical item)
DBPUT          3  [+ (4 x #paths, if detail)]
DBDELETE       2  [+ (4 x #paths, if detail)]

Serial reads:
Master         Capacity / Blocking factor
Detail         # entries / Blocking factor
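The DBPUT/DBDELETE rows and the serial-read formulas from this table, restated as small Python helpers (nothing beyond the slide's own figures):

    import math

    def dbput_ios(paths=0, detail=False):
        # DBPUT: 3 I/Os, plus 4 per path when the target is a detail dataset.
        return 3 + (4 * paths if detail else 0)

    def dbdelete_ios(paths=0, detail=False):
        # DBDELETE: 2 I/Os, plus 4 per path when the target is a detail dataset.
        return 2 + (4 * paths if detail else 0)

    def serial_read_ios(capacity_or_entries, blocking_factor):
        # Masters: capacity / blocking factor; details: entries / blocking factor.
        return math.ceil(capacity_or_entries / blocking_factor)

    print(dbput_ios(paths=2, detail=True))      # 11 I/Os for a DBPUT to a 2-path detail
    print(dbdelete_ios(paths=2, detail=True))   # 10 I/Os for the matching DBDELETE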

29  Estimating response time

- Deleting 100,000 records from a detail dataset with two paths would take:
- 2 + (4 x 2 paths) = 10 I/Os per record
- 100,000 records x 10 I/Os per record = 1,000,000 I/Os
- Classic: around 25 I/Os per second
- 1,000,000 I/Os / 25 = 40,000 seconds
- 40,000 seconds / 3600 = 11.1 hours
- iX: around 40 I/Os per second
- 1,000,000 I/Os / 40 = 25,000 seconds
- 25,000 seconds / 3600 = 6.9 hours
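The same estimate carried out in a few lines of Python (the 25 and 40 I/Os-per-second rates are the slide's rough figures for Classic and MPE/iX machines):

    records = 100_000
    ios_per_delete = 2 + 4 * 2            # DBDELETE from a detail with two paths
    total_ios = records * ios_per_delete  # 1,000,000 I/Os

    for machine, ios_per_second in (("Classic", 25), ("MPE/iX", 40)):
        hours = total_ios / ios_per_second / 3600
        print(f"{machine}: {hours:.1f} hours")   # 11.1 and 6.9 hours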

30  Automating HowMessy analysis

- Recent versions of HowMessy create a self-describing file with these statistics
- Process the file with generic tools (Suprtool, AskPlus) or custom programs (COBOL, 4GL), and produce custom reports
- Send messages to database administrators
- Write a “smart” job to fix databases without user intervention

31  Processing Loadfile with Suprtool

- Datasets more than 80% full

    >input loadfile
    >if loadfactor > 80
    >ext database, dataset, datasettype, loadfactor
    >list standard

- Only one address per customer

    >input loadfile
    >if dataset = "D-ADDRESSES" and &
        maxchain > 1
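The same two checks written in Python, assuming the self-describing loadfile has first been exported to a CSV file; the file name is hypothetical, while the column names mirror the fields used in the Suprtool example above.

    import csv

    # Hypothetical CSV export of the HowMessy loadfile.
    with open("loadfile.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Datasets more than 80% full
    for r in rows:
        if float(r["loadfactor"]) > 80:
            print(r["database"], r["dataset"], r["datasettype"], r["loadfactor"])

    # Only one address per customer
    for r in rows:
        if r["dataset"] == "D-ADDRESSES" and float(r["maxchain"]) > 1:
            print("some customers have more than one address")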

32  References

- The TurboIMAGE/3000 Handbook (Chapter 23)
- Available for $ from:
      WORDWARE
      P.O. Box
      Seattle, WA 98114

33  Summary

- TurboIMAGE databases become messy over time, especially if they are active
- HowMessy and DBLOADNG let you analyze the database’s efficiency
- You should have some knowledge of the internal workings of TurboIMAGE
- Monitor your databases regularly

34