Designing for Performance

Announcement: The 3rd class test is coming up soon. Open book. It will cover the chapter on Design Theory of Relational Databases. (Probably Friday next week, so watch your email towards the end of this week.)

This week's lectures:
- Physical design of databases, including:
  - Some implementation details
  - The evaluation of queries
  - Optimisation techniques

Some implementation details

Two goals in any software system:
(1) Correctness
(2) Efficiency

So far we have studied (1). Now we will consider (2).

Some implementation details

* DB files are typically very large: they do not fit into main memory.
* They are stored on disk in pages, and the DBMS loads pages as they are needed.
* Fetching a page takes roughly 1000 times longer than the processing time required for accessing the data on one page. (And processors keep getting faster!)
* In main memory the DBMS maintains a buffer pool of frames, each holding one page. The buffer pool is shared across a number of users.

Hence:
a. The number of pages needed to store a table should be minimised. This can be achieved either by:
   - having small records (so that many of them fit onto one page), or
   - having fewer records overall.
b. An operation that requires the DBMS to look at every record of a large table is expensive.
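As a rough illustration of point (a), the sketch below counts the pages a table occupies for two record sizes. The 8192-byte page size, the record sizes and the function name `pages_needed` are assumptions chosen for the example; page headers and free space are ignored.

```python
import math


def pages_needed(num_records: int, record_bytes: int, page_bytes: int = 8192) -> int:
    """Pages required when whole records are packed onto fixed-size pages.

    Assumption: no page header and no free space, so this is a lower bound.
    """
    records_per_page = page_bytes // record_bytes
    if records_per_page == 0:
        raise ValueError("a record does not fit on one page")
    return math.ceil(num_records / records_per_page)


# One million records, first with wide records, then with narrow ones.
wide = pages_needed(1_000_000, 400)    # 20 records fit on a page
narrow = pages_needed(1_000_000, 100)  # 81 records fit on a page
print(wide, narrow)
```

With these made-up numbers the narrower records need about a quarter of the pages, so a full scan reads about a quarter as much from disk.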

The evaluation of queries

* Selection operation:
  - selects certain rows.
  - in general this needs a linear scan.
  - in simple cases the system may know on which page to find the relevant row without looking at each row. This is achieved by indexing (to be discussed later).
* Natural join:
  - it is the main time-consuming operation.
  - evaluated naively, it requires comparison of each row of T1 with each row of T2 (to check agreement on a common attribute).
  - optimisation seeks to reduce the number of comparisons, but works only if the resulting tables are reasonably small.
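To make the cost of the join concrete, here is a minimal sketch of the row-by-row comparison described above, next to one common way of reducing the number of comparisons (grouping one table by the join value first). The tables `staff` and `offices`, their attributes, and both function names are hypothetical; the slides do not say which optimisation a particular DBMS applies.

```python
def nested_loop_join(t1, t2, attr):
    """Join two lists of dict rows on a shared attribute and count comparisons."""
    result, comparisons = [], 0
    for r1 in t1:
        for r2 in t2:
            comparisons += 1
            if r1[attr] == r2[attr]:
                result.append({**r1, **r2})
    return result, comparisons


def hash_join(t1, t2, attr):
    """Same result, but t2 is first grouped by the join value in a dict."""
    buckets = {}
    for r2 in t2:
        buckets.setdefault(r2[attr], []).append(r2)
    result = []
    for r1 in t1:
        # Only rows of t2 that already agree on attr are visited here.
        for r2 in buckets.get(r1[attr], []):
            result.append({**r1, **r2})
    return result


# Hypothetical data: 100 staff rows spread over 10 offices.
staff = [{"sid": i, "office": i % 10} for i in range(100)]
offices = [{"office": o, "building": f"B{o}"} for o in range(10)]

rows, n = nested_loop_join(staff, offices, "office")
print(len(rows), n)
```

The nested loop performs |T1| x |T2| comparisons regardless of how many rows match, which is why the join dominates query cost on large tables.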

Optimisation techniques

Technique 1: Pre-computing to avoid queries

Technique 2: Horizontal splitting (splitting by rows)
- that is, separating out subsets of rows into different tables
- based on semantic criteria:
  e.g. time: keep last year's sales separate
  e.g. status: move completed sales to an archive
  e.g. location: different tables for customers from different countries
- good if common queries require access to only one of the tables (otherwise it is counterproductive)
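Splitting a table by rows can be sketched with an in-memory SQLite database as below. The `sales` and `sales_archive` tables, their columns and the cut-off year are made up for illustration; the point is only that the common query touches one small table, while a query over everything has to glue the pieces back together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (id INTEGER PRIMARY KEY, year INTEGER, amount REAL);
    INSERT INTO sales (year, amount)
        VALUES (2022, 10.0), (2022, 20.0), (2023, 5.0), (2023, 7.5);

    -- Move the rows for past years into their own table with the same columns.
    CREATE TABLE sales_archive AS SELECT * FROM sales WHERE year < 2023;
    DELETE FROM sales WHERE year < 2023;
""")

# Common query: only the current rows are scanned.
current_total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

# Rare query: both tables are needed, so they are recombined with UNION ALL.
overall_total = conn.execute("""
    SELECT SUM(amount) FROM (
        SELECT amount FROM sales
        UNION ALL
        SELECT amount FROM sales_archive
    )
""").fetchone()[0]

print(current_total, overall_total)
```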

Optimisation techniques

Technique 2b: Vertical splitting (splitting by attributes)
- the table is made narrower by separating out some attributes into a different table, duplicating the key attribute in both tables.
- e.g. the Uni student table has ~50 attributes, but only a few are used frequently (e.g. UCAS code and entry qualifications are very rarely needed once the student has arrived).
- here too, for the split to be effective it must be based on a use-case analysis, so that queries that require the tables to be joined back together are rare.
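A minimal sketch of splitting a table by attributes, again using in-memory SQLite. The table and column names (`student_core`, `student_admission`, `ucas_code`, `entry_quals`) and the sample rows are assumptions for the example, with only five attributes instead of ~50.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (sid INTEGER PRIMARY KEY, name TEXT, programme TEXT,
                          ucas_code TEXT, entry_quals TEXT);
    INSERT INTO student VALUES
        (1, 'Ada', 'CS', 'U123', 'AAB'),
        (2, 'Bob', 'Maths', 'U456', 'ABB');

    -- The key sid is duplicated into both narrower tables.
    -- (CREATE TABLE ... AS copies the data only, not the key constraint.)
    CREATE TABLE student_core AS SELECT sid, name, programme FROM student;
    CREATE TABLE student_admission AS SELECT sid, ucas_code, entry_quals FROM student;
""")

# Frequent query: served by the narrow table alone.
frequent = conn.execute(
    "SELECT name, programme FROM student_core WHERE sid = 1"
).fetchone()

# Rare query: the two halves must be joined back together on the key.
rare = conn.execute("""
    SELECT c.name, a.ucas_code
    FROM student_core c JOIN student_admission a ON a.sid = c.sid
    WHERE c.sid = 2
""").fetchone()

print(frequent, rare)
```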

Optimisation

Technique 3: Indexes
- aim to avoid linear search.
- the principle is the same as for searching in a sorted array: O(log n) steps [vs O(n) steps in an unsorted array].

But a list of DB records can only be kept in order according to one attribute, not several! (Analogy: a phone book is sorted by name, so it cannot also be sorted by number.)

Solution: create an index into the data that allows fast access according to another attribute (or combination of attributes).
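The phone book analogy can be sketched as follows: the records stay sorted by name, and a separate sorted list of (number, position) pairs plays the role of the index, so a lookup by number can use binary search. The sample data and the function name `lookup_by_phone` are invented for the example.

```python
import bisect

# The records themselves are kept in name order, like a phone book.
records = sorted([
    ("Smith", "0121-111"),
    ("Adams", "0121-222"),
    ("Jones", "0121-333"),
    ("Brown", "0121-444"),
])

# Index on the phone number: sorted (phone, position) pairs. The index holds
# only the search value and a pointer (here: a list position), not the record.
index = sorted((phone, pos) for pos, (_, phone) in enumerate(records))
keys = [phone for phone, _ in index]


def lookup_by_phone(phone):
    """Binary search in the index, then follow the pointer to the record."""
    i = bisect.bisect_left(keys, phone)
    if i < len(keys) and keys[i] == phone:
        return records[index[i][1]]
    return None


print(lookup_by_phone("0121-333"))
```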

Indexing

The index does not repeat the full information but has pointers instead. Indexes are kept in separate files. They contain:
- the attribute values according to which we want fast access, and
- pointers to the actual database records.

There are 2 types of indexes:
- Tree-based [a version of the binary search idea]
  - The most commonly used data structure is the B+ tree
  - This makes speed-up possible for range searches too
- Hash-based

Tree-based indexes

Syntax
- there is no standardised syntax
- in PostgreSQL the syntax is:
  CREATE INDEX some_name ON staff USING BTREE (office);

You will not need the name of the index unless you want to get rid of it again:
  DROP INDEX some_name;

By default PostgreSQL builds and maintains a B+ tree index on the primary key of each table.

* dynamic index structure
* high fan-out (F) (fan-out means the number of child nodes) ==> depth rarely exceeds 3-4.
* each node has m entries (m is called 'the order of the tree')
* insert and delete are efficient [at log_F(N), where N is the number of leaf pages]
* allows range-based search too!
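The claim that the depth rarely exceeds 3-4 follows from the log_F(N) bound. The sketch below computes the smallest number of levels h with F^h >= N for a few values; the fan-out of 100 and the page counts are illustrative assumptions, and `btree_levels` is a hypothetical helper name.

```python
def btree_levels(leaf_pages: int, fan_out: int) -> int:
    """Smallest h with fan_out**h >= leaf_pages, i.e. ceil(log_F(N)).

    Computed with integers to avoid floating-point rounding in log().
    """
    levels, reach = 0, 1
    while reach < leaf_pages:
        reach *= fan_out
        levels += 1
    return levels


# With fan-out 100, even 100 million leaf pages need only 4 levels.
print(btree_levels(1_000_000, 100))    # 3
print(btree_levels(100_000_000, 100))  # 4

# For comparison: a binary tree (fan-out 2) over a million leaves.
print(btree_levels(1_000_000, 2))      # 20
```

Since each level costs roughly one page fetch, the high fan-out is what keeps index lookups to a handful of disk accesses.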

Hash-based indexes

Hash function: takes some input data (the search key value) and returns a number that identifies the location where the matching record can be found.

This computation is very fast: for an exact-match lookup it is more efficient than a B+ tree.

But: similar inputs do NOT lead to similar hash values. ==> not good for range-based search
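A Python dict is itself a hash table, so it can serve as a small stand-in for a hash-based index. In the sketch below an equality lookup goes straight to the matching entry, while a range query gets no help from the hashing and has to inspect every key. The rows and the attribute values are made up; `cid` and `bc` are borrowed from the lecture's example only as names.

```python
from collections import defaultdict

rows = [
    {"cid": 101, "bc": "A"},
    {"cid": 205, "bc": "B"},
    {"cid": 150, "bc": "A"},
    {"cid": 310, "bc": "C"},
]

# Hash index on cid: search value -> positions of the matching rows.
hash_index = defaultdict(list)
for pos, row in enumerate(rows):
    hash_index[row["cid"]].append(pos)


def find_equal(cid):
    """Exact match: one hash computation, then follow the pointers."""
    return [rows[p] for p in hash_index.get(cid, [])]


def find_range(lo, hi):
    """Range query: hashing does not preserve order, so every key is checked."""
    return [rows[p]
            for key, positions in hash_index.items() if lo <= key <= hi
            for p in positions]


print(find_equal(150))
print(sorted(r["cid"] for r in find_range(100, 200)))
```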

* A hash-based index is typically used when a table has alternative search keys (e.g. cid, bc).
  Note: the term “search key” is a different concept from that of primary key or secondary key. Do not confuse them. A search key is an attribute with respect to which we want fast access.
* Only one search key can be primary: the one according to which the original table is physically ordered. Usually this contains the primary key, and its index is created automatically by the DBMS when you create the table.
* Other search keys are called “secondary”. To create these we need to tell the system to build index files for them.

Before you jump to have lots of indexes

Creating and maintaining indexes has a cost. The index files need to be adjusted:
* when a new record is entered
* when a record is deleted

Hence, for a table with lots of “traffic” (i.e. a table that is likely to be modified a lot), having many indexes is less useful.

To decide, some experimentation is needed. You can create the indexes and then drop them later if you find they aren't that useful.
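The maintenance cost can be illustrated with a toy table that counts how many index adjustments each modification triggers. This is only a sketch: the class `IndexedTable`, its dict-based indexes and the sample rows are hypothetical, and a real DBMS does considerably more work per adjustment.

```python
class IndexedTable:
    """Toy table that keeps one dict-based index per indexed attribute."""

    def __init__(self, indexed_attrs):
        self.rows = {}                                  # record id -> row
        self.indexes = {a: {} for a in indexed_attrs}   # attr -> value -> ids
        self.index_updates = 0
        self._next_id = 0

    def insert(self, row):
        rid = self._next_id
        self._next_id += 1
        self.rows[rid] = row
        for attr, idx in self.indexes.items():          # every index is touched
            idx.setdefault(row[attr], set()).add(rid)
            self.index_updates += 1
        return rid

    def delete(self, rid):
        row = self.rows.pop(rid)
        for attr, idx in self.indexes.items():          # and again on delete
            idx[row[attr]].discard(rid)
            if not idx[row[attr]]:
                del idx[row[attr]]
            self.index_updates += 1


table = IndexedTable(["office", "name"])
first = table.insert({"name": "Ann", "office": 101})
table.insert({"name": "Ben", "office": 102})
table.insert({"name": "Cat", "office": 101})
table.delete(first)
print(table.index_updates)  # 3 inserts and 1 delete, each touching 2 indexes
```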

Technique: De-normalisation

Remember Achim's example tables

So how do we deal with the FD cid,year → numbers?

Applying what we learned, we decompose the lecturing(cid,sid,year,numbers) schema into:
  course_instances(cid,year,numbers)
  taught_by(cid,year,sid)

course_instances is now a weak entity of courses. (courses has additional info: name, level, semester, bc)

Could we consider putting the additional fields into course_instances?
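The decomposition can be checked on a small example: projecting lecturing onto the two new schemas and joining them back on (cid, year) should reproduce the original rows. The sample tuples below are invented and chosen to satisfy the FD cid,year → numbers; relations are modelled as Python sets of tuples.

```python
# lecturing(cid, sid, year, numbers); hypothetical rows that satisfy the FD.
lecturing = {
    ("db", 1, 2023, 120),
    ("db", 2, 2023, 120),   # second lecturer on the same course instance
    ("ai", 3, 2023, 80),
    ("db", 1, 2024, 135),
}

# Projections onto the two decomposed schemas.
course_instances = {(cid, year, numbers) for cid, sid, year, numbers in lecturing}
taught_by = {(cid, year, sid) for cid, sid, year, numbers in lecturing}

# Natural join of the two projections on the shared attributes (cid, year).
rejoined = {
    (cid, sid, year, numbers)
    for cid, year, numbers in course_instances
    for cid2, year2, sid in taught_by
    if (cid, year) == (cid2, year2)
}

print(rejoined == lecturing)   # the decomposition loses no information
print(len(course_instances))   # numbers is now stored once per (cid, year)
```

The price of the cleaner design is the join needed to reassemble the rows, which is the trade-off that de-normalisation revisits.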
