Scientific Data Annotation and Analysis Lecture 7.

Data Annotation, Processing, and Analysis
Data are expensive to gather and confounded by noise, but they are the primary means of validation in the sciences. Data annotation helps scientists effectively share their data and maximize its use in knowledge discovery. Processing steps help control the quality of the data by reducing irrelevant variation and handling missing values. Data analysis helps scientists form conjectures about their data and identify hidden relationships. Informatics tools can support each of these activities, although tools for analysis receive the most attention.

Data Annotation
Data annotation includes several activities, such as labeling measurements, adding structure to data, describing the collection environment, and recording provenance. This information enhances the use of scientific data in collaborative environments and enables data integration. Shared, controlled vocabularies let scientists communicate how and why data were collected to reduce data misuse. In some cases the annotations supplant the original observations to become a new form of scientific data.

Ontologies
Controlled vocabularies are collections of established terms used for annotation. Ontologies go further by structuring terms into classes, their instances, attributes with allowed values, and relations. The is-a and part-of relations often have special status and impose hierarchical structures on the classes. For example, a neutron is-a subatomic particle (relation), is part-of the nucleus (relation), and has-charge 0e (attribute). In this manner, classes are defined by their attributes and relations in a way that supports automated reasoning.
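
The neutron example above can be sketched as a small data structure. This is a minimal illustration in Python, not OWL or any real ontology format: the OntologyClass type, the ancestors helper, and the extra classes "particle" and "atom" are assumptions added only so the hierarchy has more than one level.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    """One class in a toy ontology (hypothetical structure, not OWL)."""
    name: str
    is_a: list = field(default_factory=list)        # parent classes
    part_of: list = field(default_factory=list)     # containing classes
    attributes: dict = field(default_factory=dict)  # attribute -> value


def build_ontology():
    """Build the neutron example; 'particle' and 'atom' are illustrative extras."""
    classes = [
        OntologyClass("particle"),
        OntologyClass("subatomic particle", is_a=["particle"]),
        OntologyClass("atom"),
        OntologyClass("nucleus", part_of=["atom"]),
        OntologyClass(
            "neutron",
            is_a=["subatomic particle"],
            part_of=["nucleus"],
            attributes={"has-charge": "0e"},
        ),
    ]
    return {c.name: c for c in classes}


def ancestors(ontology, name, relation="is_a"):
    """Follow one relation transitively and return every class reached."""
    found = []
    stack = list(getattr(ontology[name], relation))
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(getattr(ontology[current], relation))
    return found


ontology = build_ontology()
print(ancestors(ontology, "neutron"))             # ['subatomic particle', 'particle']
print(ancestors(ontology, "neutron", "part_of"))  # ['nucleus', 'atom']
```

Following is-a or part-of transitively is the simplest form of the automated reasoning mentioned above: the fact that a neutron is part of an atom is never stated directly, yet it can be derived.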

Ontology Creation
There are several ontology formalisms, including CycL, frame languages, and OWL (the Web Ontology Language). Informatics tools like Protégé enable ontology design and development without familiarity with a specific language. Collaborative tools such as BioPortal let scientists search available ontologies, visualize their structure, comment on their contents, and map concepts between ontologies. These tools initiate the larger scientific populace into the means and ends of knowledge representation.

Data Sources That Use Ontologies
Several data sources use ontologies to facilitate information retrieval and data sharing on the web: Protein Data Bank, Mouse Genome Informatics, FlyBase (Drosophila), VectorBase (disease carriers), ZFIN (zebrafish), and many others. Note that biology and biomedicine are the informal testing grounds for ontologies in scientific practice.

Editing an Ontology with Protégé

Visualizing an Ontology in BioPortal

Using Annotations
Annotated data serve several purposes, such as enhancing traditional information retrieval approaches with shared knowledge of concepts and relationships; tracking the source and original use of scientific data to facilitate proper interpretation and use by third parties; and creating a new, structured representation of the data that scientists can reason about. The Video Annotation and Reference System (VARS) enables these capabilities and more. Using an ontology, researchers describe video of observed entities, their location, and other properties. VARS was designed for marine biologists, but the use of an explicit ontology simplifies customization to other fields.
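
To make concept-aware retrieval concrete, here is a minimal sketch in Python. It does not reproduce the VARS data model or API: the Annotation fields, the IS_A table, and the search function are all hypothetical names chosen for illustration, and the sample records are invented.

```python
from dataclasses import dataclass

# Hypothetical is-a table: each concept maps to its single parent concept.
IS_A = {
    "squid": "cephalopod",
    "octopus": "cephalopod",
    "cephalopod": "mollusc",
    "rockfish": "fish",
}


@dataclass(frozen=True)
class Annotation:
    """One annotated observation in a video (illustrative fields only)."""
    concept: str    # ontology term for the observed entity
    video_id: str   # which recording the observation came from (provenance)
    timecode: str   # where in the recording it appears
    depth_m: float  # a property of the observation
    observer: str   # who made the annotation (provenance)


def is_kind_of(concept, target):
    """Walk up the is-a chain to test whether concept falls under target."""
    while concept is not None:
        if concept == target:
            return True
        concept = IS_A.get(concept)
    return False


def search(annotations, target):
    """Return annotations whose concept is the target or one of its subclasses."""
    return [a for a in annotations if is_kind_of(a.concept, target)]


# Invented sample records, for demonstration only.
records = [
    Annotation("squid", "dive-001", "00:12:31", 412.0, "observer-a"),
    Annotation("rockfish", "dive-001", "00:15:02", 388.5, "observer-a"),
    Annotation("octopus", "dive-002", "01:03:47", 920.0, "observer-b"),
]

print([a.concept for a in search(records, "cephalopod")])  # ['squid', 'octopus']
```

A plain keyword search for "cephalopod" would miss both matching records, because neither is labeled with that word; the shared ontology is what connects the query to them.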

Annotating Video Records with VARS

Data Preparation
Observations often require processing before serving as scientific data. Even then, data may require further preparation before analysis, such as normalizing the data to enable the comparison of results across experiments; filtering the data to enhance the signal; and estimating the values of missing observations. When correctly applied, these steps help ensure the reliability of scientific results.

Data Normalization and Filtering
Normalization counters systematic and uninformative variation in measurement tools and measured entities. Normalization of fMRI data maps individual results to an “average” brain to enable comparison across people. Normalization of microarray data combats incidental variation across experimental settings. Normalizations may also transform data to fit a normal distribution to support the use of statistical analyses. Filters remove unreliable data and irrelevant noise by scanning for outliers, smoothing trajectories, etc. Informatics tools for filtering and normalization are often problem specific (caGEDA for microarrays, FIASCO for fMRI, ProMAX for seismic data).
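
The domain tools named above use specialized methods, but the basic ideas can be sketched generically. The Python below is an illustration under stated assumptions, not what caGEDA, FIASCO, or ProMAX do: it uses z-score scaling for normalization, a median-based modified z-score (with the commonly used 0.6745 constant and 3.5 cutoff) for outlier scanning, and a centered moving average for smoothing. The function names and sample readings are hypothetical.

```python
from statistics import mean, median, pstdev


def z_score_normalize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def drop_outliers(values, threshold=3.5):
    """Drop points whose modified z-score, based on the median and the
    median absolute deviation (MAD), exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return list(values)
    return [v for v in values if 0.6745 * abs(v - med) / mad <= threshold]


def moving_average(values, window=3):
    """Smooth a trajectory with a centered moving average.
    Windows are truncated at the two ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Invented readings with one obviously bad measurement (55.0).
readings = [10.0, 10.2, 9.9, 10.1, 55.0, 10.0]
cleaned = drop_outliers(readings)
print(cleaned)
print(z_score_normalize(cleaned))
print(moving_average([1, 2, 3, 4, 5]))  # [1.5, 2.0, 3.0, 4.0, 4.5]
```

The outlier filter uses the median rather than the mean because a single extreme value can inflate the mean and standard deviation enough to hide itself.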

Handling Missing Data
Missing data can skew the distribution of a sample: some values may be more difficult to detect than others; removing observations with missing values may result in a biased sample. Imputation involves estimating the missing values: substituting the mean is no longer encouraged; for series data, interpolation fits a (localized) curve to the data set and estimates the missing values from it; maximum likelihood estimation and multiple imputation are the most common approaches. Imputation builds a (typically shallow) underlying model of the available data that provides the missing values. SPSS, SAS, and R include imputation routines.
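
For series data, the simplest localized curve is a straight line between the nearest known neighbors. The sketch below shows that case only; it is not an implementation of maximum likelihood estimation or multiple imputation, and the function name and the use of None to mark missing values are assumptions made for illustration.

```python
def interpolate_missing(series):
    """Fill None entries by linear interpolation between the nearest
    known values on each side. Gaps at the start or end of the series
    have only one neighbor, so they are left as None."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / gap
            filled[i] = series[left] + fraction * (series[right] - series[left])
    return filled


# Invented series with three interior gaps and one trailing gap.
observed = [1.0, None, 3.0, None, None, 6.0, None]
print(interpolate_missing(observed))
```

Leaving the trailing gap unfilled is a deliberate choice here: extrapolating beyond the last observation requires a stronger model of the data than interpolation provides.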

Data Analysis
Analysis tools can reveal the patterns and relationships hidden within a scientific data set. Abstract views of these relationships are gathered through a combination of descriptive statistics, correlation tables, and exploratory data analysis. These analyses describe the key characteristics of data sets, helping scientists form conjectures. Informatics tools supporting these analyses include Excel, SPSS, Minitab, and R.

Descriptive Statistics and Correlations
Descriptive statistics include quantitative measures of central tendency (e.g., mean, median), variability (e.g., range, standard deviation), and skewness (whether a distribution leans to one direction). Correlation tables identify linear relationships between variables in a multivariate data set. The correlation coefficient ranges between -1.0 and 1.0 and provides heuristic evidence for interesting interactions.
[Figure: example distributions and their correlation coefficients.]
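
The measures listed above can be computed with a few lines of Python. This is a minimal sketch using only the standard library; it assumes population (not sample) formulas for standard deviation and skewness, uses the Pearson coefficient for correlation, and the function names and sample data are hypothetical.

```python
from math import sqrt
from statistics import mean, median, pstdev


def skewness(values):
    """Population skewness: the mean cubed z-score.
    Positive values indicate a longer right tail, negative a longer left tail."""
    mu, sigma = mean(values), pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


def describe(values):
    """Summarize central tendency, variability, and skewness."""
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "std_dev": pstdev(values),
        "skewness": skewness(values),
    }


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = sqrt(sum((y - my) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def correlation_table(columns):
    """Pairwise correlations for a dict mapping variable name -> values."""
    return {
        a: {b: pearson_r(columns[a], columns[b]) for b in columns}
        for a in columns
    }


# Invented multivariate data set with three variables.
data = {
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "doubled": [2.0, 4.0, 6.0, 8.0, 10.0],
    "reversed": [5.0, 4.0, 3.0, 2.0, 1.0],
}
print(describe([1, 2, 3, 4, 10]))
for name, row in correlation_table(data).items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

Note that the coefficient captures only linear association: two variables can be strongly related in a nonlinear way and still have a coefficient near zero, which is why it serves as heuristic evidence rather than proof.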

Exploratory Data Analysis
Exploratory data analysis includes a collection of techniques designed to identify potential causal factors in a data set; locate outliers for analysis or removal; and produce other general intuitions about the data. These techniques complement statistical approaches to testing hypotheses and providing quantitative summaries. Informatics support for exploratory data analysis includes Data Desk, SOCR, and JMP.

Exploratory Data Analysis
Exploratory data analysis favors graphical techniques that reveal trends in the data. Autocorrelation plots reveal interactions between measurements in time series. Boxplots reveal the effect of alternative conditions on sample distributions. Histograms illustrate the distribution of a single variable and reveal the number of modes, its skewness, and its spread.
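
The numbers behind two of these plots are simple to compute. The sketch below, in standard-library Python, produces the bin counts a histogram would draw and the lag autocorrelation values an autocorrelation plot would show; it does no plotting. The function names, the equal-width binning rule, and the sample data are assumptions made for illustration.

```python
from statistics import mean


def histogram_counts(values, bins):
    """Count how many values fall in each of `bins` equal-width intervals
    spanning [min, max]; the maximum value goes in the last bin."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1.0  # avoid zero width if all values match
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts


def autocorrelation(series, lag):
    """Correlation of a series with itself shifted by `lag` steps,
    normalized by the total variation of the series."""
    mu = mean(series)
    denominator = sum((x - mu) ** 2 for x in series)
    numerator = sum(
        (series[i] - mu) * (series[i + lag] - mu)
        for i in range(len(series) - lag)
    )
    return numerator / denominator


# Invented sample: counts rise toward the upper bins.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5]
print(histogram_counts(sample, bins=4))  # [1, 2, 3, 3]

# Invented alternating series: adjacent points are opposed, so lag 1 is
# strongly negative and lag 2 is strongly positive.
alternating = [1, -1, 1, -1, 1, -1]
print([round(autocorrelation(alternating, lag), 3) for lag in range(3)])
```

Reading the bin counts gives the same information the slide attributes to a histogram: the number of peaks, the direction in which the distribution leans, and how widely it spreads.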

Data Annotation and Analysis: Summary
Data annotation assists primarily in information retrieval, but it has potential for data and knowledge integration. Ontology-based annotation is moving from basic research to routine practice, especially in biology. However, we need rich informatics tools that use well-established knowledge bases such as the Gene Ontology. Software for data processing is becoming more common, but different types of data have different needs. General informatics tools that are readily specialized to particular sciences could address this situation. Data analysis tools are ubiquitous and valuable for scientists, but proper application remains a problem.