Lecture 5: Data Model Design
Jeffery S. Horsburgh
Hydroinformatics, Fall 2012
This work was funded by National Science Foundation Grant EPS
Objectives
  Identify and describe important entities and relationships to model data
  Develop data models to represent, organize, and store data
  Design and use relational databases to organize, store, and manipulate data
Data Model Requirements
  What is the information/data domain that you are modeling?
  What are the 20 queries that you want to do?
    – e.g., “Give me simultaneous observations of turbidity and TSS collected during the spring snowmelt period so I can develop a regression in R.”
  What software do you want (have) to use?
  How do you want to share the data?
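The example query above can be sketched in SQL to show how a target query shapes the data model. This is only an illustration: the `Observations` table, its columns, the sample values, and the April to June window standing in for "spring snowmelt" are all assumptions, not part of the lecture.

```python
import sqlite3

# Hypothetical single-table layout; names and values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Observations ("
    " SiteID INTEGER, VariableName TEXT, DateTime TEXT, Value REAL)"
)
rows = [
    (1, "Turbidity", "2012-04-15 12:00", 35.0),
    (1, "TSS",       "2012-04-15 12:00", 48.0),
    (1, "Turbidity", "2012-05-01 12:00", 60.0),
    (1, "TSS",       "2012-05-01 12:00", 81.0),
    (1, "Turbidity", "2012-08-01 12:00", 4.0),   # outside the assumed window
    (1, "TSS",       "2012-05-02 12:00", 70.0),  # no simultaneous turbidity value
]
conn.executemany("INSERT INTO Observations VALUES (?, ?, ?, ?)", rows)

# Self-join: pair turbidity and TSS rows that share a site and a timestamp.
query = """
SELECT t.DateTime, t.Value AS Turbidity, s.Value AS TSS
FROM Observations AS t
JOIN Observations AS s
  ON s.SiteID = t.SiteID AND s.DateTime = t.DateTime
WHERE t.VariableName = 'Turbidity'
  AND s.VariableName = 'TSS'
  AND t.DateTime >= '2012-04-01' AND t.DateTime < '2012-07-01'
ORDER BY t.DateTime
"""
paired = conn.execute(query).fetchall()
for date_time, turbidity, tss in paired:
    print(date_time, turbidity, tss)
```

The paired rows could then be exported for the regression. The point is that the query needs site, variable, and timestamp to be stored as separate, comparable fields.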
Hydrologic Time Series
  An organization operates a network of monitoring sites. At each site, they collect data for a number of data series, and each series contains a time series of observed values.
Data Model Design
  Our focus – relational data model design
  Three stages:
    – Conceptual data model
    – Logical data model
    – Physical data model
Conceptual Data Model
  High-level description of the data domain
  Does not constrain how that description is mapped to an actual implementation in software
  There may be many mappings
    – Relational database
    – Object model
    – XML schema, etc.
Conceptual Data Model
  Technology independent
  Defines scope of the domain
  Defines and organizes data requirements
  Defines entities and relationships among them
  Diagram: Site 1..* TimeSeries 1..* DataValues
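The conceptual model itself is technology independent, but an object model is one of the possible mappings. A minimal sketch of that mapping follows. The attribute names (`name`, `variable`, `date_time`, `value`) and the sample values are assumptions added for illustration, because the conceptual model names only the entities and their 1..* relationships.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataValue:
    date_time: str   # assumed attribute
    value: float     # assumed attribute


@dataclass
class TimeSeries:
    variable: str                                           # assumed attribute
    values: List[DataValue] = field(default_factory=list)   # 1..* DataValues


@dataclass
class Site:
    name: str                                               # assumed attribute
    series: List[TimeSeries] = field(default_factory=list)  # 1..* TimeSeries


# One site holding one time series with two observed values (invented numbers).
site = Site(name="Logan River")
temperature = TimeSeries(variable="Temperature")
temperature.values.append(DataValue("2012-01-01", 1.5))
temperature.values.append(DataValue("2012-01-02", 1.7))
site.series.append(temperature)
```

Each list field carries the "many" side of a relationship. The same conceptual model could equally be mapped to relational tables or an XML schema.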
Logical Data Model
  Technology independent
  Contains more detail than the Conceptual Data Model
  Considered by many to be just an expanded conceptual data model
  Defines
    – Entities AND their attributes
    – Relationships AND cardinality
    – Constraints
  Generally completed as a documented Entity Relationship (ER) diagram
Example: ODM Logical Data Model
Physical Data Model
  The physical means to implement the data model
    – Choice of relational database management system
    – Implementation of tables, relationships, constraints, triggers, indices, data types
    – Database access
    – Performance
    – Storage
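As a rough sketch of what a physical implementation might look like, the block below builds the Site, TimeSeries, and DataValues entities in SQLite. The choice of SQLite, every column name, the data types, and the index are assumptions made for illustration. This is not the ODM schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

# Hypothetical tables: data types, constraints, and an index are physical-model decisions.
conn.executescript("""
CREATE TABLE Sites (
    SiteID   INTEGER PRIMARY KEY,
    SiteName TEXT NOT NULL UNIQUE
);
CREATE TABLE TimeSeries (
    SeriesID     INTEGER PRIMARY KEY,
    SiteID       INTEGER NOT NULL REFERENCES Sites(SiteID),
    VariableName TEXT NOT NULL
);
CREATE TABLE DataValues (
    ValueID  INTEGER PRIMARY KEY,
    SeriesID INTEGER NOT NULL REFERENCES TimeSeries(SeriesID),
    DateTime TEXT NOT NULL,
    Value    REAL NOT NULL,
    UNIQUE (SeriesID, DateTime)   -- at most one value per series per timestamp
);
CREATE INDEX idx_datavalues_datetime ON DataValues (DateTime);
""")

# List the tables that were created.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
```

A different database management system would change the data types and the way constraints are enforced, but the conceptual and logical models would stay the same.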
Steps in Data Model Design
  1. Identify entities
  2. Identify relationships among entities
  3. Determine the directionality and cardinality of relationships
  4. List attributes of entities
  5. Designate keys / identifiers for entities
  6. Identify constraints and business rules
  7. Map 1-6 to a physical implementation
Entity Relationship Diagram
  Entities effectively become tables
  Attributes describe entities and become fields (columns) in tables
  Relationships link tables and become formal constraints
Relationships and Cardinality
  Relationships link one entity / table to another on a common attribute or “key”
  Cardinality defines how relationships link one table to another
    – 1..1 One-to-one
    – 1..* One-to-many
    – *..* Many-to-many
Relationship Examples
  A site has 1 or more time series (Site 1..* TimeSeries)
  A variable has 1 or more time series (Variable 1..* TimeSeries)
  A time series has 1 or more data values (TimeSeries 1..* DataValues)
  A data value has 0 or more qualifiers, and a qualifier may apply to 0 or more data values (DataValues *..* Qualifiers)
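The many-to-many relationship between data values and qualifiers is commonly implemented in a relational database with a third "junction" table. The sketch below assumes that approach. The table `DataValueQualifiers`, the column names, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DataValues (ValueID INTEGER PRIMARY KEY, Value REAL);
CREATE TABLE Qualifiers (QualifierID INTEGER PRIMARY KEY, Code TEXT);

-- Junction table: one row per (data value, qualifier) pairing.
CREATE TABLE DataValueQualifiers (
    ValueID     INTEGER REFERENCES DataValues(ValueID),
    QualifierID INTEGER REFERENCES Qualifiers(QualifierID),
    PRIMARY KEY (ValueID, QualifierID)
);

INSERT INTO DataValues VALUES (1, 7.9), (2, 8.1), (3, 8.0);
INSERT INTO Qualifiers VALUES (1, 'estimated'), (2, 'ice affected');
INSERT INTO DataValueQualifiers VALUES (1, 1), (1, 2), (2, 1);
""")

# Count qualifiers per data value; the LEFT JOIN keeps values with zero qualifiers.
counts = conn.execute("""
SELECT v.ValueID, COUNT(q.QualifierID)
FROM DataValues AS v
LEFT JOIN DataValueQualifiers AS q ON q.ValueID = v.ValueID
GROUP BY v.ValueID
ORDER BY v.ValueID
""").fetchall()
print(counts)
```

Here value 1 carries two qualifiers, value 2 carries one, and value 3 carries none, which matches the "0 or more" cardinality on both sides.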
Primary and Foreign Keys
  Each row in a table should have an attribute that is a persistent, unique identifier
  Primary key in “parent” table (the 1 side of the relationship)
  Foreign key in “child” table (the * side of the relationship)
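A short sketch of how a database enforces these keys, using SQLite as an assumed engine. The table and column names are hypothetical. In SQLite, foreign key enforcement must be switched on per connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite

# Parent table with a primary key; child table with a foreign key to the parent.
conn.execute("CREATE TABLE Sites (SiteID INTEGER PRIMARY KEY, SiteName TEXT)")
conn.execute(
    "CREATE TABLE TimeSeries ("
    " SeriesID INTEGER PRIMARY KEY,"
    " SiteID INTEGER NOT NULL REFERENCES Sites(SiteID))"
)
conn.execute("INSERT INTO Sites VALUES (1, 'Logan River')")
conn.execute("INSERT INTO TimeSeries VALUES (10, 1)")  # parent row exists, accepted


def try_insert(sql):
    """Run an INSERT and report whether a key constraint rejected it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


# Child row pointing at a site that does not exist violates the foreign key.
orphan = try_insert("INSERT INTO TimeSeries VALUES (11, 99)")
# Reusing an existing SiteID violates the primary key's uniqueness.
duplicate = try_insert("INSERT INTO Sites VALUES (1, 'Spring Creek')")
print(orphan)
print(duplicate)
```

Both inserts are rejected, so the database guarantees that every child row refers to exactly one existing parent row.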
Normalization
  Organizing the fields and tables in a relational database to minimize redundancy and dependency
    – Dividing large tables into smaller tables (with relationships)
  Isolate data so that additions, deletions, and modifications of a field or record can be made in one place
  Reduce the need for restructuring the database as new types of data are introduced
Normalization Example
  SiteID  SiteName      VariableID  VariableName  DateTime  Value
  1       Logan River   1           Temperature   1/1/
  1       Logan River   1           Temperature   1/2/
  1       Logan River   2           pH            1/1/
  1       Logan River   2           pH            1/2/
  2       Spring Creek  1           Temperature   1/1/
  2       Spring Creek  1           Temperature   1/2/
  2       Spring Creek  2           pH            1/1/
  2       Spring Creek  2           pH            1/2/
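Normalization Example
  SiteID  VariableID  DateTime  Value
  1       1           1/1/
  1       1           1/2/
  1       2           1/1/
  1       2           1/2/
  2       1           1/1/
  2       1           1/2/
  2       2           1/1/
  2       2           1/2/

  SiteID  SiteName
  1       Logan River
  2       Spring Creek

  VariableID  VariableName
  1           Temperature
  2           pH

  Each SiteID and each VariableID (1) links to many rows (*) in the first table.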
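The split from one flat table into three can be sketched in a few lines of Python. The site and variable names come from the example. The full dates and the observed values are invented placeholders, since the slide text does not preserve them.

```python
# Flat rows: (SiteID, SiteName, VariableID, VariableName, DateTime, Value).
# Dates and values below are invented placeholders for illustration.
flat = [
    (1, "Logan River", 1, "Temperature", "2012-01-01", 1.5),
    (1, "Logan River", 1, "Temperature", "2012-01-02", 1.7),
    (1, "Logan River", 2, "pH", "2012-01-01", 8.1),
    (1, "Logan River", 2, "pH", "2012-01-02", 8.0),
    (2, "Spring Creek", 1, "Temperature", "2012-01-01", 2.3),
    (2, "Spring Creek", 1, "Temperature", "2012-01-02", 2.1),
    (2, "Spring Creek", 2, "pH", "2012-01-01", 7.9),
    (2, "Spring Creek", 2, "pH", "2012-01-02", 7.8),
]

sites = {}        # SiteID -> SiteName, stored once per site
variables = {}    # VariableID -> VariableName, stored once per variable
data_values = []  # (SiteID, VariableID, DateTime, Value), names replaced by keys

for site_id, site_name, var_id, var_name, date_time, value in flat:
    sites[site_id] = site_name
    variables[var_id] = var_name
    data_values.append((site_id, var_id, date_time, value))

# Joining on the keys rebuilds the flat table, so no information was lost.
rebuilt = [
    (s, sites[s], v, variables[v], date_time, value)
    for s, v, date_time, value in data_values
]
print(sites)
print(variables)
print(rebuilt == flat)
```

After the split, renaming a site means changing one entry in `sites` instead of every row that mentions it.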
ILO-2 Data
  Multiple Sites
  One file per site
  Multiple Sensors at Each Site
  1 or More Time Series Per Sensor