Relational Databases and Statistical Processing Andrew Westlake Survey & Statistical Computing +44 20 8374 4723

Relational Databases and Statistical Processing Andrew Westlake Survey & Statistical Computing +44 20 8374 4723 AJW@SaSC.co.uk www.SaSC.co.uk

15-Mar-03Statistical Databases and RDBMS © S&SC2 Introduction Good Statistical Analysis needs Good Data –Many things contribute to data quality Quality of structures and processes for data collection related to quality of data Good Ideas from Computer Science –Process design methodologies –Database methodologies –Data and process modelling systems Statisticians can benefit from CS concepts –Particularly for design of complex structures and continuous processes

15-Mar-03Statistical Databases and RDBMS © S&SC3 Data Structure Statistics generally based on a single rectangular dataset –Much statistical data is conceived and stored in this form Database professionals often use Relational Database systems –Formal model, based on a set of related rectangular datasets –Allows the representation of more complex structures, and easier maintenance and manipulation of the data. RDB approach valuable for some statistical situations –Review RDB concepts and functionality, plus SQL –Statistical Examples Pakistan Fertility Survey Processing and analysis system for HIV and AIDS reporting for England and Wales

15-Mar-03Statistical Databases and RDBMS © S&SC4 The Relational Model A logical specification of the content and behaviour of a database management system, including The types of structure that can be present in a database The properties of elements that can be stored in these structures The operations that can be performed on these structures and their behaviour Facilities that must be present in the database management system The general nature of the interactions between the database and its users and administrators –Codd’s Rules specify properties that a Relational DataBase Management System (RDBMS) must possess SQL –A standard language with which to interact with a RDBMS.

15-Mar-03Statistical Databases and RDBMS © S&SC5 Relational Database System Structure

15-Mar-03Statistical Databases and RDBMS © S&SC6 RDBMS Strengths Relational Model –Precise, formal mathematical specification of structure and behaviour –Appropriate for information about Entities, and Relationships between them Data Modelling –Useful tools for understanding data structures and flows SQL –International Standard (SQL2), widely implemented Current Implementations –Widely available, well supported, good implementations, integration with other products, add-on market for tools

15-Mar-03Statistical Databases and RDBMS © S&SC7 Components of the Relational Model Domain a set of possible (atomic) values Attribute defined over a domain, has a name, cf. variable Tuple a set of values, one associated with each attribute, cf. cases or records. NULL values supported Relation defined over a set of attributes, has a name, consists of a set of tuples Relational DataBase a set of relations

15-Mar-03Statistical Databases and RDBMS © S&SC8 Relations Rectangular Data tables, in which columns are variables, rows are cases. Tables can be linked by the values of common variables.

15-Mar-03Statistical Databases and RDBMS © S&SC9 SQL –International Standard, actively revised –Widely available in good RDBMS software –Text language, used by programs and people Stored or constructed at run-time –Designed to support tools which are independent of the DBMS cf. Client-Server architecture –User and Programmer skills portable across products and sites –Has sections for defining database structure (DDL), Manipulating database content (DML) Ensuring database integrity, and Managing database security

15-Mar-03Statistical Databases and RDBMS © S&SC10 SQL Data Manipulation SELECT {DISTINCT} FROM WHERE GROUP BY HAVING ORDER BY All manipulation (retrieval) of data uses the single Select statement, which has various components corresponding to different relational operations. The result of a SELECT statement is a relational table, which is displayed (by default) or can be stored or processed in another statement. Retrieval is usually done through a Query interface, which generates the SQL.

15-Mar-03Statistical Databases and RDBMS © S&SC11 Data Manipulation - Join The Join operation selects the combination of rows from two or more tables that meet a matching condition SELECT R1.*, R2.* FROM R1, R2 WHERE A = X All matching combinations are retained –Eg can relate household information to all members via household number Join A=X ABCR1.DXR2.DE 1b1c1d11d3e1 2b2c2d22d5e2 3b3c3d33 e3

15-Mar-03Statistical Databases and RDBMS © S&SC12 Data Manipulation - Aggregation Aggregation functions can be used in the variable list SELECTCOUNT(DIAGNOSIS), COUNT(DISTINCT DIAGNOSIS), AVE(WEIGHT) FROMCANCER_REGISTRATION –This produces one record for the whole table containing three values, the number of records with non-null values for Diagnosis, the number of distinct Diagnosis values, and the average Weight. –The available functions are COUNT, SUM, AVE, MIN, MAX. –Aggregation functions can be applied to expressions.

15-Mar-03Statistical Databases and RDBMS © S&SC13 Data Manipulation - Grouping The GROUP BY clause is used to perform aggregation within subgroups, producing a record for each group. SELECTDIAGNOSIS, COUNT(DIAGNOSIS), AVE(WEIGHT) FROM CANCER_REGISTRATION GROUP BY DIAGNOSIS ORDER BY DIAGNOSIS –This produces a record for each diagnosis containing the number of registrations and their average weight as well as the diagnosis code. –It is sorted into diagnosis code order, using the ORDER BY clause. –The Group By clause can contain multiple fields (for cross- tabulation). –The expression list can contain variables used in the Group By clause, and aggregate functions for any variable. –A WHERE clause selects the records to be aggregated. –A HAVING clause selects the groups to be returned.

15-Mar-03Statistical Databases and RDBMS © S&SC14 Views Stored query definition –Can be evaluated on demand –Used to present data in the form needed by the user (a person or a process) Dynamic evaluation –Ensures that the viewed information is up to date –May be inefficient if the information does not change

15-Mar-03Statistical Databases and RDBMS © S&SC15 Database Design Choose physical tables and fields –Capture essential characteristics of real-world system –Support intended uses of the information (directly or through views) –Operate efficiently Identify important links between tables –Provide appropriate Keys –Choose Indexes and other implementation details Retain flexibility for future needs Various methodologies for design process

15-Mar-03Statistical Databases and RDBMS © S&SC16 RDB Summary Relational databases are ubiquitous, and are useful for large- scale data collections Some manipulation and aggregation operations can be done more easily than in statistical packages Relational model is a useful way of thinking about data structures –May suggest different data structure No replacement for statistical packages for statistical analysis Flexible manipulation of more complex data structures Useful development tools for processes Interface to Views for Statistical Analysis –ODBC or similar –Extract micro- or macro-data for statistical analysis

15-Mar-03Statistical Databases and RDBMS © S&SC17 Databases for Statistics Simple situations –Use statistical or survey packages Studies with Complex Structure –Use RDB for flexible manipulation –May store static result tables Ongoing Data Capture Processes –Use RDB for dynamic data entry, updating, matching and checking –Choose data structure to optimise data processes and quality, not ease of statistical processing –May take static snapshots for regular statistical analysis Statistical Dissemination – another presentation Statistical Metadata – yet another presentation

15-Mar-03Statistical Databases and RDBMS © S&SC18 PFFPS – Main structure

15-Mar-03Statistical Databases and RDBMS © S&SC19 Data Tables for PFFPS

15-Mar-03Statistical Databases and RDBMS © S&SC20 PFFPS Meta Data Structure

15-Mar-03Statistical Databases and RDBMS © S&SC21 HIV and AIDS reporting section, PHLS Redesign of processing and reporting system Old system (1994) based on files processed with Epi-Info and Aspect database Requirements –Simplify management of data –Improve consistency –Simplify regular reporting –Enable ad hoc reporting

15-Mar-03Statistical Databases and RDBMS © S&SC22 Old System Input Notifications –HIV Tests (standard form) –AIDS Diagnosis (standard) –ONS Death Reports Some notifications on disc, most on paper Data Storage –Separate HIV and AIDS files –Deaths Added Reporting –Quarterly standard reports –Ad Hoc inquiries (e.g. PQs) –Occasional in-depth analysis Processes –Continuous Data Entry from paper –Regular Linking for reporting extractions –Occasional Matching within and between files Problems –Duplicates and omissions –Name matching –Repeated linking –Inflexible for reporting (eg progression from HIV to AIDS)

15-Mar-03Statistical Databases and RDBMS © S&SC23 Old Structure HIV Notification File AIDS Notification File HIV Forms AIDS Forms Death Reports on disc Quarterly Extracts

15-Mar-03Statistical Databases and RDBMS © S&SC24 Initial Thoughts Notification forms appropriate for reporting, but not for primary information store Information is about Patients –Notifications are sources of information –Record information about events and states Do matching once, at initial entry Unified database structure Additional components for metadata, auditing, contact information Extracts to separate statistical reporting system

15-Mar-03Statistical Databases and RDBMS © S&SC29 Data Entry Get Patient details from source Look for matching existing patients –May need expert to decide –Create new Patient if needed Open appropriate form for source to complete transfer of Patient and details Forms know how to handle duplicates (generally take earliest)

15-Mar-03Statistical Databases and RDBMS © S&SC31 Extraction Separate Extraction Database template –Linked to main database –Set of associated Excel spreadsheet templates New copy each quarter –Populate by Extraction of information through Views –Gives a Snapshot of Patient status Set of aggregation queries –Import results into spreadsheets as source Reorganise with Pivot tables –Create publication versions

15-Mar-03Statistical Databases and RDBMS © S&SC32 Conclusions Good analysis needs good data –Need right structure for efficient processing RDB provides flexible structure and manipulation –Useful for complex situations –Can produce the views needed for statistical analysis –Quick to implement at a basic structural level Good framework for implementing production processes –Important to get the process right for good quality data –Not cheap to implement production processes Can easily include associated information –Metadata, other supporting information

15-Mar-03Statistical Databases and RDBMS © S&SC33 The End Extended versions of presentations available from –www.sasc.co.uk –Follow links to MSc in Statistical Computing

Relational Databases and Statistical Processing Andrew Westlake Survey & Statistical Computing +44 20 8374 4723

Similar presentations

Presentation on theme: "Relational Databases and Statistical Processing Andrew Westlake Survey & Statistical Computing +44 20 8374 4723"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Relational Databases and Statistical Processing Andrew Westlake Survey & Statistical Computing +44 20 8374 4723

Similar presentations

Presentation on theme: "Relational Databases and Statistical Processing Andrew Westlake Survey & Statistical Computing +44 20 8374 4723"— Presentation transcript:

Similar presentations

About project

Feedback