Estimating Entropy and Diversity Profiles Based on

Presentation transcript:

Estimating Entropy and Diversity Profiles Based on Turing’s Statistical Work. A Symposium on Complex Data Analysis, May 26, 2017. Anne Chao 趙蓮菊, National Tsing Hua University, Institute of Statistics, Hsin-Chu, Taiwan 30043.

Outline: Diversity and entropy profiles. Link diversity/entropy measures to the species accumulation curve (SAC) and formulate the measures as functions of the slopes of the SAC. Estimate profiles based on Alan Turing’s insight: the concept of sample coverage and the Good-Turing frequency formula. One or two examples for illustration.

Three dimensions of biodiversity: Species (or taxonomic) diversity considers species abundances and evenness. Phylogenetic diversity also considers evolutionary history or phylogenetic distance among species. Functional (or ecosystem) diversity also considers functional distance between species based on species traits.

Quantifying species diversity: How to quantify species diversity was once the most controversial issue in diversity analysis. A 2010 Ecology Forum reached a consensus: Hill numbers (the effective number of species) should be the measure of choice.

Hill’s (1973) family of diversity indices of order q ≥ 0: ${}^{q}D = \left(\sum_{i=1}^{S} p_i^{q}\right)^{1/(1-q)}$ for q ≠ 1, where S is the number of species and p_i is the relative frequency of the i-th species. q = 0: 0D = species richness. q = 1: 1D = exponential of Shannon entropy (the limit as q → 1). q = 2: 2D = inverse of the Simpson index.
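As a minimal sketch of the Hill number family described on this slide, the Python function below evaluates the diversity of order q from a vector of relative abundances. The function name `hill_number` and the two example assemblages are illustrative assumptions, not part of the talk.

```python
import math

def hill_number(p, q):
    """Hill number (effective number of species) of order q.

    p: relative abundances summing to 1; zero entries are ignored.
    """
    p = [pi for pi in p if pi > 0]
    if abs(q - 1.0) < 1e-12:
        # The limit q -> 1 is the exponential of Shannon entropy.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))

# Hypothetical assemblages with four species each.
even = [0.25, 0.25, 0.25, 0.25]
uneven = [0.7, 0.1, 0.1, 0.1]

for q in (0, 1, 2):
    print(q, hill_number(even, q), hill_number(uneven, q))
```

For the perfectly even assemblage every order gives 4, while for the uneven one the value drops as q increases, which is the pattern a diversity profile displays.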

Quantifying species diversity by a profile (curve): plot Hill numbers with respect to q. (Figure: profiles for four assemblages: equally common, slightly uneven, moderately uneven, highly uneven.)

Equivalent classes of measures: Rényi entropy of order q; Tsallis entropy of order q.
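The slide's formulas did not survive in the transcript. As a hedged reconstruction from the standard definitions, both entropies are monotone transformations of the Hill number of the same order (for q ≠ 1), which is why the three classes carry the same information. The symbols ${}^{q}H_{R}$ and ${}^{q}H_{T}$ are notation assumed here.

```latex
{}^{q}H_{R} = \frac{1}{1-q}\,\log \sum_{i=1}^{S} p_i^{q} = \log\left({}^{q}D\right),
\qquad
{}^{q}H_{T} = \frac{1 - \sum_{i=1}^{S} p_i^{q}}{q-1}
            = \frac{1 - \left({}^{q}D\right)^{1-q}}{q-1}.
```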

Statistical estimation of continuous profiles: Diversity/entropy profiles must be estimated from samples because the true values of species richness and abundances are unknown. Profiles are usually generated by substituting species sample proportions into the complexity measures. However, this empirical approach typically underestimates the true profile, especially for low values of q (0 ≤ q ≤ 2), because samples usually miss some of the assemblage’s species due to under-sampling.

Chao & Jost (2015) approach: An unbiased estimator for the diversity profile doesn’t exist (e.g., for Shannon entropy). Estimation of seemingly simple functions is surprisingly difficult when there are undetected species. Our approach: link diversity/entropy measures to the species accumulation curve (SAC), formulate the measures as functions of the slopes of the SAC, and then substitute slope estimators.

Statistical model based on a random sample of n individuals. S: number of species, indexed by 1, 2, …, S (unknown). Sobs: observed number of species. Xi (species frequency): number of times (individuals) the i-th species is observed in the sample. Only species with frequency X > 0 in the sample are detected. fk: the number of species that are observed exactly k times in the sample.

Species frequency counts fk. f0: the zero-frequency count, i.e. the number of undetected/unseen species (unobservable, unknown). f1: number of species that are observed once (singletons). f2: number of species that are observed twice (doubletons).
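The frequency counts defined above can be tabulated directly from the species frequencies Xi. The sketch below uses a hypothetical sample with n = 10 individuals; the data and the function name `frequency_counts` are assumptions for illustration only.

```python
from collections import Counter

def frequency_counts(abundances):
    """Return {k: f_k}: the number of species observed exactly k times (k >= 1).

    f_0 (undetected species) cannot be tabulated from the sample.
    """
    return dict(sorted(Counter(x for x in abundances if x > 0).items()))

# Hypothetical species frequencies X_i; zeros stand for undetected species.
sample = [4, 2, 1, 1, 1, 1, 0, 0]

f = frequency_counts(sample)
n = sum(sample)            # sample size
s_obs = sum(f.values())    # observed number of species

print(f)       # {1: 4, 2: 1, 4: 1}
print(n)       # 10
print(s_obs)   # 6
```

Two identities are useful as sanity checks: the f_k sum to Sobs, and the sum of k times f_k equals n.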

An Example: n = 10

Species Accumulation Curve (SAC): plot S(k) with respect to sample size k. Model: p_i is the relative abundance (probability) of species i. The expected number of species in a sample of size k is (Good 1953) $S(k) = \sum_{i=1}^{S} \left[1 - (1 - p_i)^{k}\right]$.

Expected SAC

Slope of SAC: the slope of the line connecting (k, S(k)) and (k+1, S(k+1)), that is, $S(k+1) - S(k) = \sum_{i=1}^{S} p_i (1 - p_i)^{k}$. The slope is the probability that the (k+1)-th individual represents a species that was missed in the previous sample of size k (the successive discovery rates of previously unsampled species).
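A short numerical sketch of the expected SAC and its successive slopes, assuming individuals are drawn independently with species probabilities p_i. The probability vector and function names are hypothetical.

```python
def expected_sac(p, k):
    """Expected number of species in a sample of k individuals."""
    return sum(1.0 - (1.0 - pi) ** k for pi in p)

def expected_slope(p, k):
    """S(k+1) - S(k): probability that individual k+1 is a new species."""
    return sum(pi * (1.0 - pi) ** k for pi in p)

# Hypothetical relative abundances for five species.
p = [0.5, 0.3, 0.1, 0.05, 0.05]

for k in (0, 1, 5, 20):
    print(k, expected_sac(p, k), expected_slope(p, k))
```

The slope at k = 0 is 1 (the first individual is always a new species) and the slopes decrease toward 0 as the curve approaches S.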

Link measures to SAC

Diversity/entropy as an infinite sum of the successive slopes via the SAC: diversity (Hill numbers), Rényi entropy of order q, Tsallis entropy of order q.
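The formulas for this slide are missing from the transcript. One way to see the link, offered as a reconstruction rather than the slide's exact wording, is to expand $p_i^{q-1} = [1 - (1 - p_i)]^{q-1}$ as a binomial series. Here $\Delta(k)$ is assumed notation for the slope $S(k+1) - S(k)$.

```latex
\sum_{i=1}^{S} p_i^{q}
  = \sum_{i=1}^{S} p_i \left[1 - (1 - p_i)\right]^{q-1}
  = \sum_{k=0}^{\infty} \binom{q-1}{k} (-1)^{k} \sum_{i=1}^{S} p_i (1 - p_i)^{k}
  = \sum_{k=0}^{\infty} \binom{q-1}{k} (-1)^{k} \, \Delta(k),
\qquad
{}^{q}D = \left[\sum_{k=0}^{\infty} \binom{q-1}{k} (-1)^{k} \, \Delta(k)\right]^{1/(1-q)} .
```

Rényi and Tsallis entropies then follow because both are functions of the same sum $\sum_i p_i^{q}$.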

Expected SAC

Shannon entropy as an infinite sum of the successive slopes via the SAC.

Wisdom of Alan M. Turing (1912-1954), who is often considered to be the founder of modern computer science.
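For Shannon entropy the series takes a simple form: expanding $-\log p_i = \sum_{k \ge 1} (1 - p_i)^k / k$ gives $H = \sum_{k \ge 1} \Delta(k)/k$, where $\Delta(k) = \sum_i p_i (1 - p_i)^k$ is the slope of the expected SAC at size k. The sketch below checks this numerically on a hypothetical assemblage by truncating the infinite sum; the names and the truncation point are assumptions.

```python
import math

def shannon_entropy(p):
    """Shannon entropy computed directly from relative abundances."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def entropy_from_slopes(p, k_max):
    """Truncated sum over k = 1..k_max of slope(k) / k."""
    total = 0.0
    for k in range(1, k_max + 1):
        slope_k = sum(pi * (1.0 - pi) ** k for pi in p)
        total += slope_k / k
    return total

# Hypothetical relative abundances for five species.
p = [0.5, 0.3, 0.1, 0.05, 0.05]

print(shannon_entropy(p))
print(entropy_from_slopes(p, 2000))
```

The truncated sum converges slowly when some species are very rare, which illustrates why the slopes at large sample sizes matter for entropy.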

Turing memorial statue in Sackville Park, Manchester, UK

Turing and me on Dec. 11, 2016, after traveling 24 hours from Taiwan. I thought Turing would like an apple instead of flowers.

Turing memorial statue plaque in Sackville Park, Manchester, UK

Coverage of an observed sample of size n: the total probability (relative abundance) of the species discovered in the sample. It is an objective measure of sample completeness: the fraction of the individuals in an assemblage (including all undetected individuals) that belong to the species observed in the sample.

Estimating sample coverage: Contrary to most people’s intuition, Turing (Good 1953) showed that sample coverage can be very accurately and efficiently estimated based on the sample itself. 1 – Cn (the coverage deficit) is the conditional (on data) probability of discovering a new species if an additional observation (individual) were to be taken.

An Example: n = 10, with one additional (11th) individual (figure omitted).

Estimation of coverage of an observed sample of size n. Turing’s estimator: the proportion of singletons in an enlarged sample of size n + 1 is approximately the proportion of singletons in a sample of size n. Hence the sample coverage for the observed sample is approximately 1 – (proportion of singletons), that is, 1 – f1/n.
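Turing's coverage estimator is a one-line computation. The sketch below applies it to the singleton counts and sample sizes reported for the two habitats in Example 2 of this talk; the function name is an assumption, and the coverage values quoted in the talk may come from a refined version of the estimator that also uses doubletons.

```python
def turing_coverage(f1, n):
    """Turing's estimate of sample coverage: 1 - f1 / n."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return 1.0 - f1 / n

# Singleton counts and sample sizes from Example 2 (Edge and Interior habitats).
edge = turing_coverage(f1=110, n=1794)
interior = turing_coverage(f1=123, n=2074)

print(f"Edge: {edge:.1%}, Interior: {interior:.1%}")
```

Rounded to one decimal place these agree with the 93.9% and 94.1% quoted for the two habitats.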

Expected coverage for any sample size k: for any k, the expected coverage in a sample of size k is $\sum_{i=1}^{S} p_i \left[1 - (1 - p_i)^{k}\right]$. Slope at size k = complement of the expected coverage for a sample of size k.

Separation into two sums: separate the infinite sum into two sums, the first with k < n and the second with k ≥ n. For k < n, a minimum-variance unbiased estimator (MVUE) of the slope exists.

Second sum via slope estimation: The second sum involves the expected slopes for sample sizes ≥ n, and no unbiased estimator exists. This part is usually dominated by rare undetected species, whose effect on diversity/entropy cannot be ignored. The burden of profile estimation is shifted onto this second sum. The second sum is estimated via successive slope estimators based on the wisdom of Turing and Good: singletons and doubletons carry much information about the number of undetected rare species.
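The MVUE formula itself is missing from the transcript. A sketch under the assumption that it has the form given in Chao and Jost (2015), namely the sum over detected species of (Xi/n) · C(n − Xi, k) / C(n − 1, k), is shown below. Under multinomial sampling each term has expectation p_i (1 − p_i)^k, so the sum is unbiased for the slope at size k < n. The function name and sample data are hypothetical.

```python
from math import comb

def slope_mvue(counts, k):
    """Unbiased estimate of the expected SAC slope at size k, for 0 <= k <= n - 1.

    counts: species frequencies X_i; zeros are ignored.
    Assumed form: sum_i (X_i / n) * C(n - X_i, k) / C(n - 1, k).
    """
    counts = [x for x in counts if x > 0]
    n = sum(counts)
    if not 0 <= k <= n - 1:
        raise ValueError("k must satisfy 0 <= k <= n - 1")
    denom = comb(n - 1, k)
    # comb(n - x, k) is 0 when k > n - x, so abundant species drop out for large k.
    return sum((x / n) * comb(n - x, k) / denom for x in counts)

# Hypothetical sample with n = 10, S_obs = 6, f1 = 4.
counts = [4, 2, 1, 1, 1, 1]

print(slope_mvue(counts, 0))   # slope at k = 0
print(slope_mvue(counts, 9))   # slope at k = n - 1
```

Two checks tie this back to the earlier slides: at k = n − 1 the estimate equals f1/n, Turing's coverage deficit, and the estimates for k = 0, …, n − 1 add up to Sobs.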

Second part via slope estimation: an estimator for the slope at sample size n + m, m ≥ 0, is built from the singleton count and the estimated mean relative frequency of singletons.

The Good-Turing Frequency Formula: Given data, for those species that appeared r times (r = 0, 1, …) in an incomplete sample of n individuals, what is their mean relative abundance αr? It is defined using the indicator function I(A), where I(A) = 1 if the event A occurs, and 0 otherwise.

The Good-Turing Frequency Formula: Contrary to most people’s intuition, the Good-Turing frequency formula states that αr, r = 0, 1, 2, …, is not estimated by its sample frequency r/n, but rather by r*/n, where r* = (r + 1) f_{r+1} / f_r.

Modified Good-Turing frequency formula: the original formula for singletons and its modified form (Chao and Jost 2012).
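The estimator's formula is missing from the transcript. The sketch below assumes the form published in Chao and Jost (2015): the slope at size n + m is estimated by (f1/n)(1 − A)^(m+1), where A = 2 f2 / [(n − 1) f1 + 2 f2] is the modified Good-Turing estimate of the mean relative frequency of singletons. The fallback used when f2 = 0 and all function names are assumptions of this sketch.

```python
def singleton_mean_abundance(f1, f2, n):
    """Estimated mean relative frequency A of singletons (assumed form)."""
    if f2 > 0:
        return 2 * f2 / ((n - 1) * f1 + 2 * f2)
    if f1 > 0:
        # Assumed fallback when no doubletons are observed.
        return 2 / ((n - 1) * (f1 - 1) + 2)
    return 1.0

def slope_beyond_sample(f1, f2, n, m):
    """Estimated SAC slope at sample size n + m, for m >= 0."""
    if m < 0:
        raise ValueError("m must be non-negative")
    a = singleton_mean_abundance(f1, f2, n)
    return (f1 / n) * (1 - a) ** (m + 1)

# Hypothetical counts: n = 10 individuals, 4 singletons, 1 doubleton.
for m in (0, 1, 10, 100):
    print(m, slope_beyond_sample(4, 1, 10, m))
```

The estimated slopes decay geometrically in m, so the infinite tail of the series can be summed in closed form.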
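A small sketch of the Good-Turing adjustment, assuming the classical form r* = (r + 1) f_{r+1} / f_r from Good (1953). The frequency counts are hypothetical and the function names are illustrative.

```python
def good_turing_adjusted_count(f, r):
    """Good-Turing adjusted count r* = (r + 1) * f_{r+1} / f_r.

    f: dict mapping k -> f_k for observed k >= 1. Requires f_r > 0, so this
    per-species form is only usable for r >= 1 (f_0 is unknown).
    """
    if f.get(r, 0) == 0:
        raise ValueError("f_r must be positive")
    return (r + 1) * f.get(r + 1, 0) / f[r]

def mean_relative_abundance(f, r, n):
    """Estimate of alpha_r, the mean relative abundance of species seen r times."""
    return good_turing_adjusted_count(f, r) / n

# Hypothetical frequency counts from a sample of n = 10 individuals.
f = {1: 4, 2: 1, 4: 1}
n = 10

print(mean_relative_abundance(f, 1, n))  # Good-Turing estimate for singletons
print(1 / n)                             # naive sample frequency r / n
```

In this example the singletons receive an estimated mean relative abundance of 0.05 rather than the naive 0.1; the difference is probability mass set aside for species that were not detected.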

A continuous profile estimator. Variance estimator: by a bootstrap method.
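The estimator's formula is not preserved in the transcript. The sketch below assembles one plausible version from the pieces described on the preceding slides: unbiased slope estimates for k < n, plus a closed-form geometric tail built from f1, f2 and A for k ≥ n. The specific forms of the slope estimators are assumed to follow Chao and Jost (2015); the function names, the handling of f2 = 0, and the sample data are assumptions, and the q = 1 case (entropy) is deliberately left out.

```python
from math import comb

def diversity_profile_estimate(counts, q):
    """Sketch of a slope-based estimate of the Hill number of order q (q != 1)."""
    if abs(q - 1.0) < 1e-12:
        raise ValueError("q = 1 needs the separate entropy estimator")
    counts = [x for x in counts if x > 0]
    n = sum(counts)
    f1 = sum(1 for x in counts if x == 1)
    f2 = sum(1 for x in counts if x == 2)
    # Estimated mean relative frequency of singletons (assumed form).
    if f2 > 0:
        a = 2 * f2 / ((n - 1) * f1 + 2 * f2)
    elif f1 > 0:
        a = 2 / ((n - 1) * (f1 - 1) + 2)
    else:
        a = 1.0
    coef = 1.0      # generalized binomial coefficient C(q - 1, k), starting at k = 0
    first = 0.0     # sum over k < n using unbiased slope estimates
    partial = 0.0   # sum over k < n of C(q - 1, k) * (a - 1)^k
    for k in range(n):
        slope = sum((x / n) * comb(n - x, k) / comb(n - 1, k) for x in counts)
        first += coef * (-1) ** k * slope
        partial += coef * (a - 1) ** k
        coef *= (q - 1 - k) / (k + 1)
    if f1 == 0 or a == 1.0:
        second = 0.0
    else:
        # Closed form of the tail: sum over k >= n of C(q-1, k) (-1)^k (f1/n)(1-a)^(k-n+1).
        second = (f1 / n) * (1 - a) ** (1 - n) * (a ** (q - 1) - partial)
    return (first + second) ** (1.0 / (1.0 - q))

# Hypothetical sample: n = 15, S_obs = 7, f1 = 3, f2 = 2.
counts = [5, 3, 2, 2, 1, 1, 1]
for q in (0, 0.5, 2):
    print(q, diversity_profile_estimate(counts, q))
```

For this sample the q = 0 value is 9.1, which equals Sobs + ((n − 1)/n) · f1² / (2 f2), and the q = 2 value is 7.0, the inverse of the unbiased estimate of the Simpson index. The power (1 − a)^(1 − n) can overflow for very large n, so a production implementation would need more careful numerics.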

Special cases: for q = 0, it reduces to the Chao1 estimator (Chao 1984); for q = 1, it reduces to the “entropy pearl” (Chao, Wang and Jost 2013); q ≥ 2 integer.
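For the q = 0 special case, a minimal sketch of the Chao1 lower-bound estimator of species richness follows. The classical form is Sobs + f1² / (2 f2); the optional (n − 1)/n factor and the f2 = 0 variant are the commonly used refinements and are stated here as assumptions about which version the talk intends.

```python
def chao1(s_obs, f1, f2, n=None):
    """Chao1 estimate of species richness.

    s_obs: observed species count; f1, f2: singleton and doubleton counts.
    If n is given, the small-sample factor (n - 1) / n is applied.
    """
    factor = 1.0 if n is None else (n - 1) / n
    if f2 > 0:
        return s_obs + factor * f1 ** 2 / (2 * f2)
    # Variant commonly used when no doubletons are observed.
    return s_obs + factor * f1 * (f1 - 1) / 2

# Hypothetical sample: S_obs = 7, f1 = 3, f2 = 2, n = 15.
print(chao1(7, 3, 2))        # classical form
print(chao1(7, 3, 2, n=15))  # with the (n - 1) / n factor
```

The second term is the estimated number of undetected species f0, so a sample with many singletons relative to doubletons yields a large correction.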

Methods in Ecology and Evolution, 2016: a feature introducing my paper in celebration of International Women’s Day (“Ten years to forge a sword; 35 years to cultivate a pearl”). Scholarship is like carving jade: a focus on perfection that borders on the exacting.

Simulation study and extension: Extensive simulations from theoretical models and real surveys show that the proposed profiles greatly reduce under-sampling bias, and have substantially lower bias and mean squared error than the empirical profile, especially for 0 ≤ q ≤ 1. The method is also extended to deal with incidence data, and to a phylogenetic version (Hsieh and Chao 2017).

Example 1 for Illustration: Insect species data collected at two sites in Costa Rica (Janzen 1973). Osa second-growth site (Sobs = 140 species, n = 976). Osa old-growth site (Sobs = 112 species, n = 237). Which site is more diverse?

Species Frequency Counts. fk: the number of species that are observed exactly k times in the sample. There were undetected species. f1: singletons. A large fraction of singletons signifies that there are undetected species in the sample.

Example 2: Comparing diversity of two rain forest habitats. Data consist of tree species abundances from two habitats (Edge and Interior) in Brazilian rain forest (Magnago et al. 2014). The Edge Habitat: Sobs = 319, n = 1794, sample coverage = 93.9%, 110 singletons. The Interior Habitat: Sobs = 356, n = 2074, sample coverage = 94.1%, 123 singletons.

Species Frequency Counts. fi: the number of species that are observed exactly i times in the sample.

Species diversity

Phylogenetic tree of the pooled assemblage (425 species) using PHYLOMATIC

100th~140th species in the list

Phylogenetic diversity

Main References & Software
Good’s papers on Turing’s statistical work:
Good, I. J. (1953) The population frequencies of species and the estimation of population parameters. Biometrika, 40, 237-264.
Good, I. J. and Toulmin, G. H. (1956) The number of new species, and the increase in population coverage when a sample is increased. Biometrika, 43, 45-63.
Chao, A., Wang, Y. T., and Jost, L. (2013) Entropy and the species accumulation curve: a novel estimator of entropy via discovery rates of new species. Methods in Ecology and Evolution, 4, 1091-1110.
Chao, A. and Jost, L. (2015) Estimating diversity and entropy profiles via discovery rates of new species. Methods in Ecology and Evolution, 6, 873-882.
Software: R code and online freeware application SPADE (Species-richness Prediction And Diversity Estimation) at website (http://chao.stat.nthu.edu.tw/).

THANK YOU FOR LISTENING. 所謂天堂竟在人間 (The so-called heaven turns out to be here on earth). “Heaven is under our feet as well as over our heads.” Henry David Thoreau, writer and naturalist (1817-1862). Photos: Sobralia luerorum; Crested goshawk (Accipiter trivirgatus formosae); Fairy pitta (Pitta nympha).