DATABASE PHYSICAL DESIGN
Chandra S. Amaravadi
INTRODUCTION
PHYSICAL DATABASE DESIGN
Physical database design is concerned with the issues surrounding database implementation:
Implementation design
Database storage, access & location
File organization & constraints
THE THREE FORMS OF DATA
External
Conceptual / Base table
Internal / Hardware level
These three levels provide logical and physical data independence.
Cust#  Name    Address         Balance
100    Gordon  110 Oak Street  $...
...    Prasad  22 Birch place  $...
...    ...     ...             ...
THE THREE TYPES OF MODELS
Models      Schemas                         Facilities
Conceptual  Conceptual                      Create table, Alter table
Internal    Internal (File Organizations)   Create index, Drop index
External    External (Views)                Create view, Drop view
DATABASE PHYSICAL DESIGN
Inputs?
COMPONENTS OF PHYSICAL DESIGN
1. Implementation design
2. Storage, access & distribution strategies
3. File organizations
4. Specifications for integrity constraints (later)
IMPLEMENTATION DESIGN
Concerned with taking the results of normalization and designing tables, attributes, and data types for implementation.
Decide on tables (de-normalization)
Decide on primary and cross reference keys (not discussed further)
Decide on attribute data types (not discussed further), e.g. fixed vs variable length fields, integer vs double integer
Design reports and forms (not discussed further)
Field Name  Data type  Description             Length  Decimals
Prod#       Numeric    Unique prod code        6       0
Descr       Text       Short prod description  25      0
Price       Currency   Product price           6       2
DECIDING ON TABLES
Denormalization means stepping back to a lower normal form, typically by merging tables, to reduce schema overhead.
Denormalization example (for 1:1)
Before:
Parts(Part#, PartName, ..)
Container(ContainerID, #fin, #needed, Part#)
After:
Parts(Part#, PartName, ContainerID, #fin, #needed)
DECIDING ON TABLES..
Denormalization example (for M:N)
ORDERS (Ord#, Ord_dt) -- Are for (Qty) -- PRODUCTS (Prod#, Descr.)
What tables does normalization result in?
DENORMALIZATION
Normalized tables:
Orders(ord#, ord_dt,..)
Product(prod.#, descr,..)
Orders for prod(prod.#, ord#, qty)
Denormalized tables:
Orders(ord#, ord_dt,..)
Product(prod.#, ord#, descr., qty..)
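A minimal Python sketch of the M:N denormalization on this slide, assuming the tables are small in-memory lists of dicts. The sample rows and the helper name `denormalize` are hypothetical; only the column names come from the slide.

```python
# Normalized tables; the sample rows are made up for illustration.
product = [
    {"prod#": 1, "descr": "Gears"},
    {"prod#": 2, "descr": "Rotors"},
]
orders_for_prod = [
    {"prod#": 1, "ord#": 500, "qty": 10},
    {"prod#": 1, "ord#": 501, "qty": 4},
    {"prod#": 2, "ord#": 500, "qty": 7},
]


def denormalize(product, orders_for_prod):
    """Fold the associative table into Product: one row per (prod#, ord#)."""
    descr_by_prod = {p["prod#"]: p["descr"] for p in product}
    return [
        {
            "prod#": link["prod#"],
            "ord#": link["ord#"],
            "descr": descr_by_prod[link["prod#"]],
            "qty": link["qty"],
        }
        for link in orders_for_prod
    ]


denormalized_product = denormalize(product, orders_for_prod)
```

In the result the description "Gears" is stored once per order line, which is the redundancy that denormalization accepts in exchange for one fewer table.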
COMPONENTS OF PHYSICAL DESIGN..
1. Implementation design
2. Storage and access strategies
3. Distribution strategies
4. File organizations
5. Specifications for integrity constraints (later)
STORAGE & ACCESS STRATEGIES
ALSO CALLED VOLUME & USAGE ANALYSIS
OBJECTIVES
Estimate storage requirements (volume analysis)
Determine media to be used (not discussed)
Study how data is being accessed (usage analysis)
Use these to develop file organization (later)
Volume and usage analysis is carried out with a composite usage map.
COMPOSITE USAGE MAP
A composite usage map is simply an ER chart (without attributes) that shows the number of records and the frequency/pattern with which they are accessed.
Used for volume & usage analysis (file org.)
Superimposed on ER chart
Attributes are not shown
Shows estimated number of records (volume)
Shows type of access (dotted lines)
VOLUME & USAGE ANALYSIS
Equipment, Parts and PE tables
Equipment: 100; Parts: 12,000; PE: 10,000
20 inquiries per hour to Equipment
300 inquiries per hour on Parts table
70% of these inquiries also need to know Equipment info.
Draw a composite usage map, estimate storage requirements and develop a suitable file organization.
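The usage figures on this slide imply a derived access path from Parts to Equipment. A small Python sketch of that arithmetic, where the helper name `derived_accesses` is hypothetical and only the 300 inquiries per hour and the 70% share come from the slide:

```python
def derived_accesses(inquiries_per_hour, percent_following):
    """Number of inquiries per hour that continue on to a related table."""
    return inquiries_per_hour * percent_following // 100


parts_inquiries = 300          # inquiries per hour on the Parts table
share_needing_equipment = 70   # percent that also need Equipment info

parts_to_equipment = derived_accesses(parts_inquiries, share_needing_equipment)
```

Under these figures, 210 of the 300 hourly Parts inquiries also reach Equipment, and that is the number that would label the Parts-to-Equipment path on the composite usage map.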
COMPOSITE USAGE MAP
EQUIPMENT (100) -- ARE FOR / PE (10,000) -- PARTS (12,000)
Access frequencies on the map: 20, ????, ???
FOR DISCUSSION
How can one estimate the size of a database?
ESTIMATING STORAGE REQMTS. FOR PARTS AND EQUIPMENT
EQUIPMENT(Model#, Descr, Mfr., Price, HP, WT)
PARTS(Part#, Descr, Mfr, Price)
PE(Model#, Part#, Qty)
Equipment table: 33 bytes/record
Parts table: ??
PE table: ??
Total storage requirements = ??
STORAGE REQUIREMENTS
EQUIPMENT TABLE:
RECORD SIZE: 33 Bytes
# OF RECORDS: 100
FILE SIZE: 33 * 100 = 3,300 Bytes
PARTS TABLE:
RECORD SIZE: 25 Bytes
# OF RECORDS: 12,000
FILE SIZE: 25 * 12,000 = 300,000 Bytes
PE TABLE:
RECORD SIZE: 10 Bytes (approx)
# OF RECORDS: 10,000
FILE SIZE: 10 * 10,000 = 100,000 Bytes
TOTAL STORAGE: ??????
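The volume estimate can be sketched in Python as record size times record count, summed over the tables. The dictionary layout and variable names are assumptions made for this sketch; the record sizes and counts are the ones on the slide, and the estimate covers raw data only (no index or overflow space).

```python
# Record size (bytes) and record count per table, as given on the slide.
tables = {
    "Equipment": {"record_bytes": 33, "records": 100},
    "Parts": {"record_bytes": 25, "records": 12_000},
    "PE": {"record_bytes": 10, "records": 10_000},
}

# File size = record size * number of records.
file_sizes = {
    name: spec["record_bytes"] * spec["records"] for name, spec in tables.items()
}

# Total storage = sum of the file sizes.
total_bytes = sum(file_sizes.values())
```

With these inputs the three files come to 3,300, 300,000 and 100,000 bytes, for a total of 403,300 bytes.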
A MORE ELABORATE EXAMPLE
Parts are manufactured parts and purchased parts
Parts: 1,000; Suppliers: 50; Quotations: 2,500
Subtype shares shown on the slide: 40%, 70%
Total of 200 parts inquiries
60 direct inquiries to purchased parts
Of the purchased parts inquiries, 80 are also to quotation
Of these 80, 70 are to supplier as well.
75 direct queries to supplier
Of these 40 are for quotation
All of these are also for parts
ANOTHER EXAMPLE..
A COMPOSITE USAGE MAP
PART (1000)
is-a: MANUFACTURED (400), PURCHASED (700); shares 40%, 70%
QUOTATION (2500)
SUPPLIER (50)
Note: # of records are in red; the # of accesses are in blue
COMPONENTS OF PHYSICAL DESIGN..
1. Implementation design
2. Storage & access strategies
3. Distribution strategies
4. File organizations
5. Specifications for integrity constraints (later)
DISTRIBUTION STRATEGIES
Distribution strategies are concerned with where the files are physically located.
1. Centralized
2. Distributed
   Replicated (not discussed)
   Partitioned
DISTRIBUTION STRATEGIES
Centralized -- All the data is stored in one physical location.
Distributed -- The data is stored in multiple physical locations.
   Replicated -- The database is duplicated in multiple locations.
   Partitioned -- The database is divided into “fragments” and each fragment is stored in a different location.
CENTRALIZED VS DISTRIBUTED
Which is a bottleneck?
Which causes security problems?
Which method may be required for business reasons?
In which setup is data more accessible?
Which provides better performance?
CENTRALIZED STRATEGY
General principle: maximize local access, minimize remote access
Sites: S1, S2, S3
WHERE SHOULD WE LOCATE THE DATABASE? S1, S2 or S3?
DISTRIBUTED DATABASE
EID   Name       City
2356  Armstrong  LA
3286  Nickerson  SF
3356  Forrester  MPLS
Partitioning across sites: LA, SF, MPLS
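A short Python sketch of partitioning the table on this slide into one fragment per site. It assumes that the City column is the partitioning attribute, which the slide suggests but does not state; the helper name `partition_by` is hypothetical.

```python
from collections import defaultdict

# Rows from the slide.
employees = [
    {"EID": 2356, "Name": "Armstrong", "City": "LA"},
    {"EID": 3286, "Name": "Nickerson", "City": "SF"},
    {"EID": 3356, "Name": "Forrester", "City": "MPLS"},
]


def partition_by(rows, attribute):
    """Split rows into fragments keyed by the value of one attribute."""
    fragments = defaultdict(list)
    for row in rows:
        fragments[row[attribute]].append(row)
    return dict(fragments)


# One fragment per site; each fragment would be stored at that site.
fragments = partition_by(employees, "City")
```

Each site then holds only its own rows, which follows the principle of maximizing local access and minimizing remote access.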
COMPONENTS OF PHYSICAL DESIGN..
1. Implementation design
2. Storage & access strategies
3. Distribution strategies
4. File organizations
5. Specifications for integrity constraints (later)
FILE ORGANIZATION
How records are arranged on and retrieved from secondary storage, or the mapping between ____ and ______?
(Diagram: disk with tracks and sectors; File 1 holds Rec. 1, 2, ..)
DATA ACCESS (FYI)
(Diagram labels: User, DBMS, O/S, IOP, RAM, hard drive, partition, database storage, FAT/NTFS)
Requests
Consults directory tables
Generates instructions to IOP
FILE ORGANIZATION
Selection Criteria
Retrieval time (disk access)
Access type (direct, sequential)
Storage space
Maintenance effort
OVERVIEW OF FILE ORGANIZATIONS
Sequential
Hashed
Indexed
   ISAM
   VSAM
OVERVIEW OF FILE ORGANIZATIONS..
Sequential -- Records are stored one after another in pkey sequence.
Hashed -- Record address is determined by subjecting the pkey to a hashing algorithm.
Indexed -- Same as sequential, except that the keys are also placed in a separate index file for ease of searching.
THE SEQUENTIAL ORGANIZATION
Records in pkey sequence
Access only sequential
Insertions/deletions in sequential order
Simple organization, good for batch updates
Part#  Descr.
100    Aux. motors
120    Scrapers
124    Rotors
THE HASHING ORGANIZATION
A type of file organization where record addresses are generated by subjecting primary keys to a hashing routine, usually by dividing by a prime#
Pkey -> Hashing Algorithm -> Hash Address
Hash Address = REM[(Pkey)/(Prime#)] + Address of Starting Block
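The division-remainder routine described on this slide can be sketched in Python as follows. The function names, the sample key 43, the prime 7 and the starting block address 1000 are all illustrative assumptions, not values prescribed by the slide.

```python
def hash_address(pkey, prime):
    """Remainder left after dividing the primary key by the chosen prime."""
    return pkey % prime


def record_address(pkey, prime, start_block):
    """Offset the remainder by the address of the starting block."""
    return hash_address(pkey, prime) + start_block


START_BLOCK = 1000  # hypothetical address of the first block of the file

example_hash = hash_address(43, 7)                    # 43 = 6 * 7 + 1
example_address = record_address(43, 7, START_BLOCK)
```

Choosing a prime divisor close to the number of buckets tends to spread keys more evenly than a divisor that shares factors with common key patterns.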
HASHING CONCEPTS
Following are important concepts in hashing:
Hashing algorithm
Hash address
Buckets & bucket size
Slots
Collisions/overflows
Load factor
Search length
Record address = hash address + physical addr
Example (file space diagram, labeled 3432):
Pkey = 43
Hash address = (43 remainder 7) = 1
Record address =
HASHING CONCEPTS..
Hashing algorithm -- the formula used to calculate a record address
Hash address -- an address (within block) where a hashed record is stored
Buckets -- storage area for a group of records; bucket size is the # of slots in a bucket
Slots -- storage area for an individual record
Collision -- when two records hash to the same address
Load factor -- the ratio of the # of records to the total space allocated
Average search length -- the average time it takes to retrieve a record (usually expressed in terms of disk accesses)
Disk access -- each time the disk is accessed to get a record (one access if the record is stored at its hardware address; otherwise it depends on the record's location)
HASHING ALGORITHM
1. Choose load factor
2. Identify # of buckets to be allocated
3. Select a prime# close to this number
4. Divide each pkey by prime#
5. Remainder = record address
6. Sequentially number the buckets
7. Place each record at its address
8. If there are overflows, use open overflow
HASHING CONCEPTS
OVERFLOWS
Collision: when two keys hash to the same address
Open overflow (store in unallocated slots)
Chained overflow (a separate area)
HASHING EXAMPLE
Given Part#s:
Part#  Descr.          Hash address
100    Gears           100 Mod 7 = 2
120    Scrapers        120 Mod 7 = 1
130    Aux motors      130 Mod 7 = 4
140    Crankshafts     140 Mod 7 = 0
145    Cylinder heads  145 Mod 7 = 5
150    Pistons         150 Mod 7 = 3
assume 8 buckets (0..7)
assume 1 slot per bucket
assume disk access time of 20 ms
HASHING EXAMPLE
FILE LOADINGS
Bucket  Record
0       140 Crankshaft
1       120 Scrapers
2       100 Gears
3       150 Pistons
4       130 Aux. motor
5       145 Cylinders
6
7
Insert: 135 Shovel? 135 Mod 7 = 2
Average search length?
6 records -> 1 access
1 record -> 2 accesses
Load factor: ?
Bucket size = ?
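A Python sketch of this loading exercise, under stated assumptions: the hash function is pkey Mod 7, there are 8 one-slot buckets, an overflow record goes to the next unallocated bucket found by scanning forward (one possible reading of open overflow), and search cost follows the slide's counting of 1 access for a record in its home bucket and 2 for an overflow record. The function and variable names are hypothetical.

```python
PRIME = 7
NUM_BUCKETS = 8        # buckets 0..7, one slot per bucket
DISK_ACCESS_MS = 20


def load(records):
    """Place each (pkey, descr) in its home bucket, or in the next
    unallocated bucket when the home bucket is already taken."""
    if len(records) > NUM_BUCKETS:
        raise ValueError("more records than slots")
    buckets = [None] * NUM_BUCKETS
    accesses = {}
    for pkey, descr in records:
        home = pkey % PRIME
        slot = home
        while buckets[slot] is not None:      # collision: look for free space
            slot = (slot + 1) % NUM_BUCKETS
        buckets[slot] = (pkey, descr)
        # Slide's counting: 1 access at the home address, 2 for an overflow.
        accesses[pkey] = 1 if slot == home else 2
    return buckets, accesses


records = [
    (100, "Gears"), (120, "Scrapers"), (130, "Aux. motor"),
    (140, "Crankshaft"), (145, "Cylinders"), (150, "Pistons"),
    (135, "Shovel"),   # inserted last; collides with 100 in bucket 2
]
buckets, accesses = load(records)

avg_search_length = sum(accesses.values()) / len(accesses)   # in disk accesses
avg_retrieval_ms = avg_search_length * DISK_ACCESS_MS
load_factor = len(records) / NUM_BUCKETS
```

Under these assumptions record 135 lands in bucket 6, the average search length is 8/7 (about 1.14) accesses, and the load factor is 7/8. A strict linear-probing cost model would charge the overflow record one access per bucket examined, which gives a higher figure.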
THE HASHING ORGANIZATION
EVALUATION
H(pkey) --> record address
Records in hash sequence
Need to allocate extra space
Load factor between 60-80%
Good for low activity (FAR) files
Real-time and OO applications
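The extra space implied by a 60-80% load factor can be estimated with a short Python sketch. The helper name `slots_needed` and the sample record count of 60,000 are assumptions for illustration; the load factor range is the one on the slide.

```python
def slots_needed(num_records, load_factor_pct):
    """Smallest number of record slots that keeps the load factor at or
    below the target percentage (ceiling division on integers)."""
    return -(-num_records * 100 // load_factor_pct)


sample_records = 60_000                            # hypothetical file size
slots_at_80 = slots_needed(sample_records, 80)     # tightest allocation
slots_at_60 = slots_needed(sample_records, 60)     # most generous allocation
```

For this sample file the allocation ranges from 75,000 to 100,000 record slots; multiplying by the record size gives the space in bytes.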
DISCUSSION
A parts file with Part# as the pkey includes records with the following part# values: 23, 37, 46, 48, 56, 18, 10, 71, 16, 24, 39, 47 and 69.
The file uses 8 buckets numbered 0 to 7. Each bucket holds two records.
Load these records into the file in the given order using the hash function h(K) = K mod 8.
Calculate the average search length in terms of # of disk accesses. Assume 20 ms disk access.
INDEXED ORGANIZATION
A method of file organization where a subset of key values is stored in an index. Types are:
Primary key
Secondary key
Clustered
THE INDEXED ORGANIZATION (ISAM)
Records are in pkey sequence (master file)
But are organized into groups
Grouping information is stored in index file
Records can be inserted at random
Records can be accessed in sequence or at random
THE INDEXED ORGANIZATION
Index file (index set): Emp ID ...
Master file (sequence set):
name    ID#
Jacob   101
Becky   103
Scott   104
Angela  108
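A Python sketch of how an index file can direct a lookup into the master file. The four records come from the slide; splitting them into groups of two and indexing each group by its highest pkey are assumptions made for this sketch, as is the function name `lookup`.

```python
import bisect

# Master file (sequence set): records in pkey order, split into groups.
# The grouping into pairs is an assumption made for illustration.
groups = [
    [(101, "Jacob"), (103, "Becky")],
    [(104, "Scott"), (108, "Angela")],
]

# Index file (index set): the highest pkey stored in each group.
index = [group[-1][0] for group in groups]


def lookup(pkey):
    """Use the index to pick a group, then scan that group in sequence."""
    pos = bisect.bisect_left(index, pkey)
    if pos == len(groups):          # key is beyond the last group
        return None
    for key, name in groups[pos]:
        if key == pkey:
            return name
    return None
```

A call such as `lookup(104)` searches the two-entry index first and then reads only the second group, rather than scanning the whole master file.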
THE INDEXED ORGANIZATION
(Diagram: tracks grouped into CYLINDER1 and CYLINDER2)
THE ISAM ORGANIZATION
Index set: cylinder index, track index
Sequence set: CYLINDER 1 .. CYLINDER N, with overflow tracks
Note: Assume that the corresponding HW addresses are stored along with the pkeys
INSERTIONS IN ISAM
Identify track where record needs to be inserted
If the track is full, insert in overflow area
If the track has room, insert pkey in sequence
Update track index and cylinder index if necessary
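The insertion steps on this slide can be sketched in Python as follows. This is a simplified model, not a full ISAM implementation: it keeps a single track index (the cylinder index is omitted), uses a hypothetical track capacity of 3 records, and the names `make_file` and `insert` are made up for the sketch.

```python
import bisect

TRACK_CAPACITY = 3  # hypothetical number of records per track


def make_file(groups):
    """Build tracks plus a track index holding the highest pkey per track."""
    tracks = [{"records": sorted(g), "overflow": []} for g in groups]
    track_index = [t["records"][-1] for t in tracks]
    return tracks, track_index


def insert(tracks, track_index, pkey):
    # Identify the track: the first one whose highest key is >= pkey;
    # keys beyond the last index entry go to the last track.
    pos = min(bisect.bisect_left(track_index, pkey), len(tracks) - 1)
    track = tracks[pos]
    if len(track["records"]) >= TRACK_CAPACITY:
        # The track is full, so the record goes to the overflow area.
        track["overflow"].append(pkey)
        where = "overflow"
    else:
        # The track has room, so the pkey is inserted in sequence.
        bisect.insort(track["records"], pkey)
        where = "track"
    # Update the track index if the new key is now the highest on the track.
    if pkey > track_index[pos]:
        track_index[pos] = pkey
    return pos, where


tracks, track_index = make_file([[100, 120, 124], [130, 140]])
first = insert(tracks, track_index, 135)    # track 1 has room
second = insert(tracks, track_index, 110)   # track 0 is full
third = insert(tracks, track_index, 150)    # track 1 is now full
```

As overflow areas fill up, retrieval slows down because a lookup may need to search the overflow records as well as the track itself.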
ISAM: ADVANTAGES AND DISADVANTAGES
Access is direct or sequential?
Access time dependent on?
Rewrite sequentially
Retrieval time uniform
Suitable for volatile files?
Workhorse organization used in most apps.
SECONDARY KEY INDEX
EMPLOYEE
REC#  E_SSN  E_NAME     E_TITLE    E_SALARY
1     ...    Smith      Developer  $35,...
2     ...    Johnson    Analyst    $27,...
3     ...    Weintraub  Developer  $60,...
4     ...    Dickson    Manager    $64,...
5     ...    Holland    Analyst    $47,...
6     ...    Rao        Analyst    $71,...
7     ...    McDonald   Manager    $85,000
Index on E_TITLE:
E_TITLE    REC#
Analyst    2,5,6
Manager    4,7
Developer  1,3
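A Python sketch of building the secondary key index on this slide. The record numbers 1 to 7 follow the row order and agree with the index shown on the slide; the tuple layout and the name `build_secondary_index` are assumptions made for the sketch.

```python
from collections import defaultdict

# (REC#, E_NAME, E_TITLE) in stored order, as on the slide.
employee = [
    (1, "Smith", "Developer"),
    (2, "Johnson", "Analyst"),
    (3, "Weintraub", "Developer"),
    (4, "Dickson", "Manager"),
    (5, "Holland", "Analyst"),
    (6, "Rao", "Analyst"),
    (7, "McDonald", "Manager"),
]


def build_secondary_index(rows):
    """Map each secondary key value to the list of record numbers holding it."""
    index = defaultdict(list)
    for rec_no, _name, title in rows:
        index[title].append(rec_no)
    return dict(index)


title_index = build_secondary_index(employee)
```

Because the file itself is not ordered by E_TITLE, the index has to list every matching record number for each title.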
CLUSTERED INDEX
Also known as Inverted file organization
EMPLOYEE
Address  e_ssn  e_name     e_title     e_salary
1        ...    Johnson    Analyst     $27,...
2        ...    Holland    Analyst     $47,...
3        ...    Rao        Analyst     $71,...
4        ...    Dickson    Manager     $64,...
5        ...    McDonald   Manager     $85,...
6        ...    Weintraub  Programmer  $60,...
7        ...    Smith      Programmer  $35,000
Index on E_title:
E_title     Address
Analyst     1
Manager     4
Programmer  6
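A Python sketch of a clustered index over the table on this slide, using the titles as they appear in the table rows. It assumes addresses run from 1 in stored order; the name `build_clustered_index` is hypothetical.

```python
# (e_name, e_title) in stored order; rows with the same title are adjacent.
employee = [
    ("Johnson", "Analyst"),
    ("Holland", "Analyst"),
    ("Rao", "Analyst"),
    ("Dickson", "Manager"),
    ("McDonald", "Manager"),
    ("Weintraub", "Programmer"),
    ("Smith", "Programmer"),
]


def build_clustered_index(rows):
    """Record only the address of the first row for each key value."""
    index = {}
    for address, (_name, title) in enumerate(rows, start=1):
        index.setdefault(title, address)
    return index


clustered_index = build_clustered_index(employee)
```

Since the rows are physically grouped by title, one starting address per title is enough; the remaining rows for that title follow it in sequence.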
INDEXING STRATEGIES
Index if you must
Index on pkey
Index on foreign keys
Index on secondary key (depending on query frequency)
DISCUSSION
What activities are part of identifying storage strategies?
How is denormalization carried out for M:N relationships?
How many indexes can you have per table? How many clustered indexes?
Can we sequentially update all records in a) a hashing organization? b) an indexed organization?
Is indexing suitable for volatile files?
If an index consists of 3 levels of indexes with the main index in RAM, and the disk access time is 20 ms, how long on average does it take to retrieve a record?
What problems do overflow records cause in hashing?
A file is required to store 60,000 records; how much space is required in order to store the records using a hashing organization?
THE END!