Download presentation
Presentation is loading. Please wait.
Published byBartholomew Byrd Modified over 9 years ago
1
Relational Databases and Statistical Processing Andrew Westlake Survey & Statistical Computing +44 20 8374 4723 AJW@SaSC.co.uk www.SaSC.co.uk
2
15-Mar-03Statistical Databases and RDBMS © S&SC2 Introduction Good Statistical Analysis needs Good Data –Many things contribute to data quality Quality of structures and processes for data collection related to quality of data Good Ideas from Computer Science –Process design methodologies –Database methodologies –Data and process modelling systems Statisticians can benefit from CS concepts –Particularly for design of complex structures and continuous processes
3
15-Mar-03Statistical Databases and RDBMS © S&SC3 Data Structure Statistics generally based on a single rectangular dataset –Much statistical data is conceived and stored in this form Database professionals often use Relational Database systems –Formal model, based on a set of related rectangular datasets –Allows the representation of more complex structures, and easier maintenance and manipulation of the data. RDB approach valuable for some statistical situations –Review RDB concepts and functionality, plus SQL –Statistical Examples Pakistan Fertility Survey Processing and analysis system for HIV and AIDS reporting for England and Wales
4
15-Mar-03Statistical Databases and RDBMS © S&SC4 The Relational Model A logical specification of the content and behaviour of a database management system, including The types of structure that can be present in a database The properties of elements that can be stored in these structures The operations that can be performed on these structures and their behaviour Facilities that must be present in the database management system The general nature of the interactions between the database and its users and administrators –Codd’s Rules specify properties that a Relational DataBase Management System (RDBMS) must possess SQL –A standard language with which to interact with a RDBMS.
5
15-Mar-03Statistical Databases and RDBMS © S&SC5 Relational Database System Structure
6
15-Mar-03Statistical Databases and RDBMS © S&SC6 RDBMS Strengths Relational Model –Precise, formal mathematical specification of structure and behaviour –Appropriate for information about Entities, and Relationships between them Data Modelling –Useful tools for understanding data structures and flows SQL –International Standard (SQL2), widely implemented Current Implementations –Widely available, well supported, good implementations, integration with other products, add-on market for tools
7
15-Mar-03Statistical Databases and RDBMS © S&SC7 Components of the Relational Model Domain a set of possible (atomic) values Attribute defined over a domain, has a name, cf. variable Tuple a set of values, one associated with each attribute, cf. cases or records. NULL values supported Relation defined over a set of attributes, has a name, consists of a set of tuples Relational DataBase a set of relations
8
15-Mar-03Statistical Databases and RDBMS © S&SC8 Relations Rectangular Data tables, in which columns are variables, rows are cases. Tables can be linked by the values of common variables.
9
15-Mar-03Statistical Databases and RDBMS © S&SC9 SQL –International Standard, actively revised –Widely available in good RDBMS software –Text language, used by programs and people Stored or constructed at run-time –Designed to support tools which are independent of the DBMS cf. Client-Server architecture –User and Programmer skills portable across products and sites –Has sections for defining database structure (DDL), Manipulating database content (DML) Ensuring database integrity, and Managing database security
10
15-Mar-03Statistical Databases and RDBMS © S&SC10 SQL Data Manipulation SELECT {DISTINCT} FROM WHERE GROUP BY HAVING ORDER BY All manipulation (retrieval) of data uses the single Select statement, which has various components corresponding to different relational operations. The result of a SELECT statement is a relational table, which is displayed (by default) or can be stored or processed in another statement. Retrieval is usually done through a Query interface, which generates the SQL.
11
15-Mar-03Statistical Databases and RDBMS © S&SC11 Data Manipulation - Join The Join operation selects the combination of rows from two or more tables that meet a matching condition SELECT R1.*, R2.* FROM R1, R2 WHERE A = X All matching combinations are retained –Eg can relate household information to all members via household number Join A=X ABCR1.DXR2.DE 1b1c1d11d3e1 2b2c2d22d5e2 3b3c3d33 e3
12
15-Mar-03Statistical Databases and RDBMS © S&SC12 Data Manipulation - Aggregation Aggregation functions can be used in the variable list SELECTCOUNT(DIAGNOSIS), COUNT(DISTINCT DIAGNOSIS), AVE(WEIGHT) FROMCANCER_REGISTRATION –This produces one record for the whole table containing three values, the number of records with non-null values for Diagnosis, the number of distinct Diagnosis values, and the average Weight. –The available functions are COUNT, SUM, AVE, MIN, MAX. –Aggregation functions can be applied to expressions.
13
15-Mar-03Statistical Databases and RDBMS © S&SC13 Data Manipulation - Grouping The GROUP BY clause is used to perform aggregation within subgroups, producing a record for each group. SELECTDIAGNOSIS, COUNT(DIAGNOSIS), AVE(WEIGHT) FROM CANCER_REGISTRATION GROUP BY DIAGNOSIS ORDER BY DIAGNOSIS –This produces a record for each diagnosis containing the number of registrations and their average weight as well as the diagnosis code. –It is sorted into diagnosis code order, using the ORDER BY clause. –The Group By clause can contain multiple fields (for cross- tabulation). –The expression list can contain variables used in the Group By clause, and aggregate functions for any variable. –A WHERE clause selects the records to be aggregated. –A HAVING clause selects the groups to be returned.
14
15-Mar-03Statistical Databases and RDBMS © S&SC14 Views Stored query definition –Can be evaluated on demand –Used to present data in the form needed by the user (a person or a process) Dynamic evaluation –Ensures that the viewed information is up to date –May be inefficient if the information does not change
15
15-Mar-03Statistical Databases and RDBMS © S&SC15 Database Design Choose physical tables and fields –Capture essential characteristics of real-world system –Support intended uses of the information (directly or through views) –Operate efficiently Identify important links between tables –Provide appropriate Keys –Choose Indexes and other implementation details Retain flexibility for future needs Various methodologies for design process
16
15-Mar-03Statistical Databases and RDBMS © S&SC16 RDB Summary Relational databases are ubiquitous, and are useful for large- scale data collections Some manipulation and aggregation operations can be done more easily than in statistical packages Relational model is a useful way of thinking about data structures –May suggest different data structure No replacement for statistical packages for statistical analysis Flexible manipulation of more complex data structures Useful development tools for processes Interface to Views for Statistical Analysis –ODBC or similar –Extract micro- or macro-data for statistical analysis
17
15-Mar-03Statistical Databases and RDBMS © S&SC17 Databases for Statistics Simple situations –Use statistical or survey packages Studies with Complex Structure –Use RDB for flexible manipulation –May store static result tables Ongoing Data Capture Processes –Use RDB for dynamic data entry, updating, matching and checking –Choose data structure to optimise data processes and quality, not ease of statistical processing –May take static snapshots for regular statistical analysis Statistical Dissemination – another presentation Statistical Metadata – yet another presentation
18
15-Mar-03Statistical Databases and RDBMS © S&SC18 PFFPS – Main structure
19
15-Mar-03Statistical Databases and RDBMS © S&SC19 Data Tables for PFFPS
20
15-Mar-03Statistical Databases and RDBMS © S&SC20 PFFPS Meta Data Structure
21
15-Mar-03Statistical Databases and RDBMS © S&SC21 HIV and AIDS reporting section, PHLS Redesign of processing and reporting system Old system (1994) based on files processed with Epi-Info and Aspect database Requirements –Simplify management of data –Improve consistency –Simplify regular reporting –Enable ad hoc reporting
22
15-Mar-03Statistical Databases and RDBMS © S&SC22 Old System Input Notifications –HIV Tests (standard form) –AIDS Diagnosis (standard) –ONS Death Reports Some notifications on disc, most on paper Data Storage –Separate HIV and AIDS files –Deaths Added Reporting –Quarterly standard reports –Ad Hoc inquiries (e.g. PQs) –Occasional in-depth analysis Processes –Continuous Data Entry from paper –Regular Linking for reporting extractions –Occasional Matching within and between files Problems –Duplicates and omissions –Name matching –Repeated linking –Inflexible for reporting (eg progression from HIV to AIDS)
23
15-Mar-03Statistical Databases and RDBMS © S&SC23 Old Structure HIV Notification File AIDS Notification File HIV Forms AIDS Forms Death Reports on disc Quarterly Extracts
24
15-Mar-03Statistical Databases and RDBMS © S&SC24 Initial Thoughts Notification forms appropriate for reporting, but not for primary information store Information is about Patients –Notifications are sources of information –Record information about events and states Do matching once, at initial entry Unified database structure Additional components for metadata, auditing, contact information Extracts to separate statistical reporting system
25
15-Mar-03Statistical Databases and RDBMS © S&SC25 Full system
26
15-Mar-03Statistical Databases and RDBMS © S&SC26 Main Patient Tables
27
15-Mar-03Statistical Databases and RDBMS © S&SC27 Patients and Notifications
28
15-Mar-03Statistical Databases and RDBMS © S&SC28 Full HAP Data Structure
29
15-Mar-03Statistical Databases and RDBMS © S&SC29 Data Entry Get Patient details from source Look for matching existing patients –May need expert to decide –Create new Patient if needed Open appropriate form for source to complete transfer of Patient and details Forms know how to handle duplicates (generally take earliest)
30
15-Mar-03Statistical Databases and RDBMS © S&SC30 Meta Data Structure
31
15-Mar-03Statistical Databases and RDBMS © S&SC31 Extraction Separate Extraction Database template –Linked to main database –Set of associated Excel spreadsheet templates New copy each quarter –Populate by Extraction of information through Views –Gives a Snapshot of Patient status Set of aggregation queries –Import results into spreadsheets as source Reorganise with Pivot tables –Create publication versions
32
15-Mar-03Statistical Databases and RDBMS © S&SC32 Conclusions Good analysis needs good data –Need right structure for efficient processing RDB provides flexible structure and manipulation –Useful for complex situations –Can produce the views needed for statistical analysis –Quick to implement at a basic structural level Good framework for implementing production processes –Important to get the process right for good quality data –Not cheap to implement production processes Can easily include associated information –Metadata, other supporting information
33
15-Mar-03Statistical Databases and RDBMS © S&SC33 The End Extended versions of presentations available from –www.sasc.co.uk –Follow links to MSc in Statistical Computing
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.