Published by Dennis Gallagher. Modified over 6 years ago.
1
Estimating Entropy and Diversity Profiles Based on Turing’s Statistical Work
A Symposium on Complex Data Analysis, May 26, 2017
Anne Chao 趙蓮菊
National Tsing Hua Univ., Institute of Statistics, Hsin-Chu, Taiwan 30043
2
Outline: Diversity and entropy profiles
Link diversity/entropy measures to the species accumulation curve (SAC) and formulate the measures as functions of the slopes of the SAC
Estimate profiles based on Alan Turing’s insight
  Concept of sample coverage
  Good-Turing frequency formula
One or two examples for illustration
3
Three dimensions of biodiversity
Species (or taxonomic) diversity: considers species abundances and evenness
Phylogenetic diversity: also considers evolutionary history or phylogenetic distance among species
Functional (or ecosystem) diversity: also considers functional distance between species based on species traits
4
Quantifying species diversity
How to quantify species diversity was once the most controversial issue in diversity analysis.
A 2010 Ecology Forum reached a complete consensus: Hill numbers (effective number of species) should be the measure of choice.
5
Hill’s (1973) family of diversity indices of order q ≥ 0
$^qD = \left( \sum_{i=1}^{S} p_i^q \right)^{1/(1-q)}$, q ≥ 0, q ≠ 1
q = 0, 0D = species richness
q = 1, 1D = exponential of Shannon entropy (defined as the limit q → 1)
q = 2, 2D = inverse of Simpson index
S is the number of species; pi is the relative frequency of the i-th species
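A minimal Python sketch of the Hill number of order q, written from the standard definition (the power sum of relative abundances raised to 1/(1 - q), with q = 1 taken as the exponential of Shannon entropy). The function name `hill_number` and the two toy assemblages are illustrative assumptions, not taken from the slides.

```python
import math

def hill_number(p, q):
    """Hill number of order q for relative abundances p (assumed to sum to 1)."""
    p = [x for x in p if x > 0]          # absent species contribute nothing
    if abs(q - 1.0) < 1e-12:
        # q = 1 is defined by the limit: exponential of Shannon entropy
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))

# Hypothetical assemblages with four species each
even = [0.25, 0.25, 0.25, 0.25]
uneven = [0.7, 0.1, 0.1, 0.1]

for q in (0, 1, 2):
    print(q, hill_number(even, q), hill_number(uneven, q))
```

For the equally common assemblage every order gives 4 effective species; for the uneven one the value drops as q increases, which is why the whole profile is plotted rather than a single index.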
6
Quantifying species diversity by a profile (curve): plot Hill numbers w.r.t q
[Figure: diversity profiles for four assemblages: equally common, slightly uneven, moderately uneven, highly uneven]
7
Equivalent classes of measures
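Rényi entropy of order q: $^qH_R = \frac{1}{1-q} \log \sum_{i=1}^{S} p_i^q = \log({}^qD)$
Tsallis entropy of order q: $^qH_T = \frac{1 - \sum_{i=1}^{S} p_i^q}{q - 1}$
Both are one-to-one functions of the Hill number of the same order, so the three families carry the same information.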
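The equivalence can be checked numerically. The sketch below uses the standard definitions of Rényi and Tsallis entropy and converts each back to a Hill number; all function names and the toy abundance vector are assumptions made for illustration.

```python
import math

def power_sum(p, q):
    """Sum of p_i^q over the species that are present."""
    return sum(x ** q for x in p if x > 0)

def shannon(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def renyi_entropy(p, q):
    if abs(q - 1.0) < 1e-12:
        return shannon(p)                      # limit q -> 1
    return math.log(power_sum(p, q)) / (1.0 - q)

def tsallis_entropy(p, q):
    if abs(q - 1.0) < 1e-12:
        return shannon(p)                      # limit q -> 1
    return (1.0 - power_sum(p, q)) / (q - 1.0)

def hill_from_renyi(h):
    return math.exp(h)

def hill_from_tsallis(h, q):
    # Inverts the Tsallis definition for q != 1
    return (1.0 - (q - 1.0) * h) ** (1.0 / (1.0 - q))

p = [0.7, 0.1, 0.1, 0.1]   # hypothetical relative abundances
for q in (0.5, 2.0):
    print(q, hill_from_renyi(renyi_entropy(p, q)),
          hill_from_tsallis(tsallis_entropy(p, q), q))
```

At q = 2 the Tsallis entropy is the Gini-Simpson index, 1 minus the sum of squared abundances.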
8
Statistical estimation of continuous profiles
Diversity/entropy profiles must be estimated from samples because the true values of species richness and abundances are unknown.
Profiles are usually generated by substituting species sample proportions into the complexity measures.
However, this empirical approach typically underestimates the true profile, especially for low orders 0 ≤ q ≤ 2, because samples usually miss some of the assemblage’s species due to under-sampling.
9
Chao & Jost (2015) approach
An unbiased estimator of the diversity profile does not exist (e.g., for Shannon entropy).
Estimation of seemingly simple functions is surprisingly difficult when there are undetected species.
Our approach: link diversity/entropy measures to the species accumulation curve (SAC), formulate the measures as functions of the slopes of the SAC, and then substitute slope estimators.
10
Statistical model based on a random sample of n individuals
S: number of species in the assemblage, indexed by 1, 2, …, S (unknown)
Sobs: observed number of species
Xi: species frequency, i.e. the number of times (individuals) the i-th species is observed in the sample. Only species with frequency Xi > 0 in the sample are detected.
fk: the number of species that are observed exactly k times in the sample.
11
Species frequency counts: fk
f0: number of undetected (unseen) species; this zero-frequency count is unobservable and therefore unknown
f1: number of species that are observed once (singletons)
f2: number of species that are observed twice (doubletons)
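As a small illustration, the frequency counts can be tabulated from a vector of species abundances. The data below are hypothetical (chosen so that n = 10), and `frequency_counts` is an illustrative helper name.

```python
from collections import Counter

def frequency_counts(abundances):
    """Return {k: f_k} for k >= 1, i.e. the number of species seen exactly k times."""
    return dict(Counter(x for x in abundances if x > 0))

# Hypothetical sample: five detected species, two species with zero count.
# In practice the zeros are never seen, so f0 cannot be tabulated from data.
counts = [4, 2, 2, 1, 1, 0, 0]

f = frequency_counts(counts)
n = sum(counts)              # sample size
s_obs = sum(f.values())      # observed number of species
print(f, n, s_obs)
```

Here f1 = 2 (two singletons), f2 = 2 (two doubletons), and the frequency counts satisfy the bookkeeping identity that the sum of k times fk equals n.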
12
An Example: n =10
13
Species Accumulation Curve (SAC): plot S(k) w.r.t. sample size k
Model: pi is the relative abundance (probability) of species i.
The expected number of species in a sample of size k is (Good 1953)
$S(k) = \sum_{i=1}^{S} \left[ 1 - (1 - p_i)^k \right]$
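A short sketch of the expected species accumulation curve, assuming individuals are sampled independently with species probabilities pi, so that species i is detected in k draws with probability 1 - (1 - pi)^k. The function name and the abundance vector are illustrative assumptions.

```python
def expected_sac(p, k):
    """Expected number of species detected in a sample of k individuals."""
    return sum(1.0 - (1.0 - x) ** k for x in p)

p = [0.5, 0.3, 0.15, 0.05]   # hypothetical assemblage with S = 4 species
for k in (0, 1, 10, 100):
    print(k, expected_sac(p, k))
```

The curve starts at S(0) = 0 and S(1) = 1, increases with k, and approaches the true richness S only as k grows large, with rare species being the last to appear.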
14
Expected SAC
15
Slope of SAC
The slope of the line connecting (k, S(k)) and (k+1, S(k+1)).
The slope is the probability that the (k+1)-th individual represents a species that was missed in the previous sample of size k (the successive discovery rates of previously unsampled species).
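A worked version of this statement, assuming the expected curve S(k) is the sum over species of 1 - (1 - p_i)^k and writing Δ(k) for the slope (the symbol Δ is chosen here for convenience):

```latex
\begin{align*}
\Delta(k) = S(k+1) - S(k)
  &= \sum_{i=1}^{S} \left[ (1-p_i)^{k} - (1-p_i)^{k+1} \right] \\
  &= \sum_{i=1}^{S} p_i (1-p_i)^{k}, \qquad k = 0, 1, 2, \ldots
\end{align*}
```

Each term p_i (1 - p_i)^k is the probability that individual k + 1 belongs to species i while species i was absent from the first k individuals, which gives the discovery-rate interpretation. In particular Δ(0) = 1, and Δ(k) decreases toward 0 as k grows.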
16
Link measures to SAC
17
Diversity/entropy: infinite sum of the successive slopes via SAC
Diversity (Hill numbers)
Rényi entropy of order q
Tsallis entropy of order q
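One way to write the identity named in the slide title, assuming Δ(k) denotes the SAC slope, i.e. the sum over species of p_i (1 - p_i)^k (the symbol is an assumption). It follows from the binomial series for (1 - x)^(q-1) with x = 1 - p_i:

```latex
\begin{align*}
\sum_{i=1}^{S} p_i^{q}
  &= \sum_{i=1}^{S} p_i \left[ 1 - (1-p_i) \right]^{q-1}
   = \sum_{k=0}^{\infty} \binom{q-1}{k} (-1)^{k} \Delta(k), \\
{}^{q}D
  &= \left( \sum_{k=0}^{\infty} \binom{q-1}{k} (-1)^{k} \Delta(k) \right)^{1/(1-q)},
  \qquad q \neq 1 .
\end{align*}
```

Here the binomial coefficient is the generalized one for real q - 1. The Rényi and Tsallis entropies inherit the same representation because both are functions of the power sum on the left.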
18
Expected SAC
19
Shannon entropy: infinite sum of the successive slopes via SAC
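A numerical check of this representation, assuming the slope at size k is the sum over species of p_i (1 - p_i)^k. Since -log(p) equals the series of (1 - p)^k / k over k ≥ 1, Shannon entropy equals the sum of slope(k)/k. Function names and the abundance vector are illustrative assumptions.

```python
import math

def sac_slope(p, k):
    """Expected SAC slope at sample size k: sum of p_i * (1 - p_i)^k."""
    return sum(x * (1.0 - x) ** k for x in p)

def shannon(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def shannon_via_slopes(p, k_max):
    """Partial sum of slope(k) / k for k = 1 .. k_max."""
    return sum(sac_slope(p, k) / k for k in range(1, k_max + 1))

p = [0.5, 0.3, 0.15, 0.05]   # hypothetical assemblage
print(shannon(p), shannon_via_slopes(p, 10), shannon_via_slopes(p, 2000))
```

The partial sums approach the entropy from below, and convergence is slowest when some species are very rare, which is where the slopes at large k still matter.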
20
Wisdom of Alan M. Turing (1912-1954), who is often considered to be the founder of modern computer science
21
Turing memorial statue in Sackville Park, Manchester, UK
22
Turing and me on Dec. 11, 2016, after traveling 24 hours from Taiwan
I thought Turing would like an apple instead of flowers.
23
Turing memorial statue plaque in Sackville Park, Manchester, UK
24
Coverage of an observed sample of size n
The total probability (relative abundance) of the species discovered in the sample.
An objective measure of sample completeness: the fraction of the individuals in an assemblage (including all undetected individuals) that belong to the species observed in the sample.
25
Estimating sample coverage: Contrary to most people’s intuition
Turing (Good 1953) showed that sample coverage can be very accurately and efficiently estimated based on the sample itself.
Coverage deficit, 1 - Cn: the conditional (on data) probability of discovering a new species if an additional observation (individual) were to be taken.
26
An Example: n = 10
[Figure: the 10 sampled individuals, numbered 1 to 10, and one additional individual, numbered 11]
27
Estimation of coverage of an observed sample of size n
Turing’s estimator: proportion of singletons in an enlarged sample of size n + 1 ≈ proportion of singletons in a sample of size n, i.e. f1/n
Sample coverage for the observed sample ≈ 1 - proportion of singletons = 1 - f1/n
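A minimal sketch of Turing's coverage estimate, 1 - f1/n, applied to the singleton counts and sample sizes reported for the two rain forest habitats in Example 2 of this talk. The function name is an illustrative assumption, and the talk's own figures may have been computed with a refined estimator that also uses doubletons.

```python
def turing_coverage(f1, n):
    """Turing's estimate of sample coverage: 1 - (number of singletons) / n."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return 1.0 - f1 / n

# Values quoted in Example 2: Edge (110 singletons, n = 1794),
# Interior (123 singletons, n = 2074)
edge = turing_coverage(110, 1794)
interior = turing_coverage(123, 2074)
print(round(100 * edge, 1), round(100 * interior, 1))
```

To one decimal place these simple estimates agree with the coverages quoted for the two habitats (93.9% and 94.1%).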
28
Expected coverage for any sample size k
For any k, the expected coverage in a sample of size k is
$E(C_k) = \sum_{i=1}^{S} p_i \left[ 1 - (1 - p_i)^k \right] = 1 - \sum_{i=1}^{S} p_i (1 - p_i)^k$
Slope at size k = complement of expected coverage for a sample of size k
29
Separation into two sums
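Separate the infinite sum into two sums: the first sum with k < n, and the second sum with k ≥ n.
For k < n, the minimum variance unbiased estimator (MVUE) of the slope Δ(k) = S(k+1) - S(k) exists:
$\hat{\Delta}(k) = \sum_{X_i \ge 1} \frac{X_i}{n} \binom{n - X_i}{k} \Big/ \binom{n-1}{k}$, k < n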
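A sketch of an unbiased slope estimator for k < n, assuming multinomial sampling so that each Xi is binomial(n, pi). The form used here, (Xi/n) · C(n - Xi, k) / C(n - 1, k) summed over detected species, has expectation equal to the sum of p_i (1 - p_i)^k; the function name and data are illustrative assumptions.

```python
from math import comb

def slope_mvue(counts, k):
    """Unbiased estimate of the expected SAC slope at size k, for 0 <= k <= n - 1."""
    n = sum(counts)
    if not 0 <= k <= n - 1:
        raise ValueError("k must satisfy 0 <= k <= n - 1")
    # comb(n - x, k) is 0 when k > n - x, so abundant species drop out early
    return sum((x / n) * comb(n - x, k) / comb(n - 1, k)
               for x in counts if x > 0)

counts = [4, 2, 2, 1, 1]     # hypothetical sample, n = 10
slopes = [slope_mvue(counts, k) for k in range(sum(counts))]
print(slopes)
```

Two sanity checks on this form: the estimated slopes for k = 0, …, n - 1 add up to the observed number of species, and the last one (k = n - 1) equals f1/n, the proportion of singletons.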
30
Second sum via slope estimation
The second sum involves the expected slopes for sample sizes > n, and no unbiased estimator exists.
This part is usually dominated by rare undetected species whose effect on diversity/entropy cannot be ignored.
The burden of profile estimation is shifted onto this second sum.
Estimating the second sum via successive slope estimators is based on the wisdom of Turing and Good: singletons and doubletons carry much information about the number of undetected rare species.
31
Second part via slope estimation
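An estimator for the slope at sample size n + m is
$\hat{\Delta}(n+m) = \frac{f_1}{n} (1 - \hat{\alpha}_1)^{m+1}$, m ≥ 0
$\hat{\alpha}_1 = \frac{2 f_2}{(n-1) f_1 + 2 f_2}$ = estimated mean relative frequency of singletons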
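A sketch of the tail-slope estimator, assuming the geometric form (f1/n)(1 - a)^(m+1), where a = 2 f2 / ((n - 1) f1 + 2 f2) is the modified Good-Turing estimate of the mean relative frequency of singletons. This form is reconstructed from the published estimators rather than copied from the slide, and the case f2 = 0 is deliberately left out. Function names and numbers are illustrative assumptions.

```python
def alpha1_hat(f1, f2, n):
    """Estimated mean relative frequency of singletons (needs f2 > 0)."""
    if f2 <= 0:
        raise ValueError("this sketch requires at least one doubleton")
    return 2 * f2 / ((n - 1) * f1 + 2 * f2)

def tail_slope(f1, f2, n, m):
    """Estimated SAC slope at sample size n + m, for m >= 0."""
    a = alpha1_hat(f1, f2, n)
    return (f1 / n) * (1 - a) ** (m + 1)

f1, f2, n = 2, 2, 10         # hypothetical counts
print([tail_slope(f1, f2, n, m) for m in range(5)])
print(sum(tail_slope(f1, f2, n, m) for m in range(2000)))
```

Summing the estimated slopes over all m ≥ 0 gives (n - 1)/n · f1² / (2 f2), the estimated number of undetected species in the Chao1 estimator, which is how the q = 0 case of the profile connects to richness estimation.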
32
The Good-Turing Frequency Formula
Given data, for those species that appeared r times (r = 0, 1, …) in an incomplete sample of n individuals, what is their mean relative abundance?
$\alpha_r = \sum_{i=1}^{S} p_i I(X_i = r) / f_r$
where I(A) is the indicator function, i.e., I(A) = 1 if the event A occurs, and 0 otherwise.
33
The Good-Turing Frequency Formula: Contrary to most people’s intuition
The Good-Turing frequency formula states that αr, r = 0, 1, 2, …, is not estimated by its sample frequency r/n, but rather by
$\hat{\alpha}_r = \frac{(r+1) f_{r+1}}{n f_r}$
In other words, αr should be estimated by r*/n, where
$r^* = \frac{(r+1) f_{r+1}}{f_r}$
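A small sketch of the classical Good-Turing adjusted count, r* = (r + 1) f(r+1) / f(r), assuming the textbook form of the formula. The frequency counts below are hypothetical, and `good_turing_rstar` is an illustrative name.

```python
def good_turing_rstar(f, r):
    """Adjusted count r* = (r + 1) * f_{r+1} / f_r for r >= 1 with f_r > 0."""
    if r < 1 or f.get(r, 0) == 0:
        raise ValueError("need r >= 1 and f_r > 0")
    return (r + 1) * f.get(r + 1, 0) / f[r]

# Hypothetical frequency counts {k: f_k}
f = {1: 50, 2: 20, 3: 10, 4: 5, 10: 3}
n = sum(k * v for k, v in f.items())   # sample size implied by the counts

for r in (1, 2, 3):
    print(r, r / n, good_turing_rstar(f, r) / n)

# For r = 0, f0 is unknown, but the total probability of unseen species
# is estimated by f1 / n (the coverage deficit).
print(f[1] / n)
```

With these counts a singleton's raw frequency 1/n is discounted to 0.8/n, and the probability mass removed from the observed species is what gets assigned to the unseen ones.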
34
Modified Good-Turing frequency formula
Original formula for singletons Modified form (Chao and Jost 2012)
35
A continuous profile estimator
Variance estimator: by a bootstrap method
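A compact sketch of how the two sums can be assembled into a Hill-number profile estimate. It assumes the unbiased slope estimates for k < n, a geometric tail (f1/n)(1 - a)^(k-n+1) for k ≥ n with a = 2 f2 / ((n - 1) f1 + 2 f2), and the closed form of the resulting tail series. This is reconstructed from the structure described in the talk, not copied from it; consult Chao & Jost (2015) for the authoritative formula. All names are illustrative, and the closed form is only numerically safe for small n.

```python
from math import comb, exp, log

def gen_binom(a, k):
    """Generalized binomial coefficient C(a, k) for real a, integer k >= 0."""
    out = 1.0
    for j in range(k):
        out *= (a - j) / (j + 1)
    return out

def slope_mvue(counts, n, k):
    return sum((x / n) * comb(n - x, k) / comb(n - 1, k) for x in counts if x > 0)

def hill_profile_estimate(counts, q):
    n = sum(counts)
    f1 = sum(1 for x in counts if x == 1)
    f2 = sum(1 for x in counts if x == 2)
    if f1 > 0 and f2 == 0:
        raise ValueError("this sketch needs f2 > 0 when singletons are present")
    a = 2 * f2 / ((n - 1) * f1 + 2 * f2) if f1 > 0 else 1.0
    scale = (f1 / n) * (1 - a) ** (1 - n) if f1 > 0 else 0.0
    if abs(q - 1.0) < 1e-12:
        # q = 1: entropy as the sum of slope(k) / k, then exponentiate
        first = sum(slope_mvue(counts, n, k) / k for k in range(1, n))
        tail = scale * (-log(a) - sum((1 - a) ** r / r for r in range(1, n)))
        return exp(first + tail)
    first = sum(gen_binom(q - 1, k) * (-1) ** k * slope_mvue(counts, n, k)
                for k in range(n))
    tail = scale * (a ** (q - 1)
                    - sum(gen_binom(q - 1, r) * (a - 1) ** r for r in range(n)))
    return (first + tail) ** (1.0 / (1.0 - q))

counts = [4, 2, 2, 1, 1]     # hypothetical sample, n = 10
for q in (0, 0.5, 1, 1.5, 2):
    print(q, hill_profile_estimate(counts, q))
```

For this toy sample the estimate at q = 0 equals the observed richness plus (n - 1)/n · f1² / (2 f2), and at q = 2 the tail vanishes and the estimate is the inverse of the unbiased estimator of the sum of squared abundances.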
36
Special cases
q = 0: it reduces to the Chao1 estimator (Chao 1984)
q = 1: it reduces to the “entropy pearl” (Chao, Wang and Jost 2013)
q ≥ 2 integer
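For reference, a sketch of the Chao1 richness estimator in the form with the (n - 1)/n factor, using the commonly used bias-corrected fallback when there are no doubletons. The function name and the numbers are illustrative assumptions.

```python
def chao1(s_obs, f1, f2, n):
    """Chao1 lower-bound estimate of species richness."""
    if f2 > 0:
        f0_hat = (n - 1) / n * f1 ** 2 / (2 * f2)
    else:
        # Fallback when no doubletons are observed (bias-corrected form)
        f0_hat = (n - 1) / n * f1 * (f1 - 1) / 2
    return s_obs + f0_hat

# Hypothetical sample: 5 observed species, 2 singletons, 2 doubletons, n = 10
print(chao1(5, 2, 2, 10))
```

The added term is the estimated number of undetected species f0, driven entirely by the singleton and doubleton counts.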
37
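Methods in Ecology and Evolution, 2016: a feature celebrating International Women's Day introduced my paper ("Ten years to sharpen a sword, 35 years to cultivate a pearl")
Scholarship is like carving jade: devoted to perfection, exacting almost to a fault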
38
Simulation study and extension
Extensive simulations from theoretical models and real surveys show that the proposed profiles greatly reduce under-sampling bias, and have substantially lower bias and mean squared error than the empirical profile, especially for 0 ≤ q ≤ 1.
The method is also extended to deal with incidence data.
It is also extended to a phylogenetic version (Hsieh and Chao 2017).
39
Example 1 for Illustration: Insect species data collected at two sites in Costa Rica (Janzen 1973)
Osa second-growth site (Sobs = 140 species, n = 976)
Osa old-growth site (Sobs = 112 species, n = 237)
Which site is more diverse?
40
Species Frequency Counts
fk: the number of species that are observed exactly k times in the sample
f1: singletons
There were undetected species: a large fraction of singletons signifies that some species were not detected in the sample.
42
Example 2: Comparing diversity of two rain forest habitats
Data consist of tree species abundances from two habitats (Edge and Interior) in a Brazilian rain forest (Magnago et al. 2014).
The Edge Habitat: Sobs = 319, n = 1794, sample coverage = 93.9%, 110 singletons
The Interior Habitat: Sobs = 356, n = 2074, sample coverage = 94.1%, 123 singletons
43
Species Frequency Counts
fi: the number of species that are observed exactly i times in the sample
44
Species diversity
45
Phylogenetic tree of the pooled assemblage (425 species) using PHYLOMATIC
46
100th~140th species in the list
47
Phylogenetic diversity
48
Main References & Software
Good’s papers on Turing’s statistical work
Good, I. J. (1953) The population frequencies of species and the estimation of population parameters. Biometrika, 40, 237–264.
Good, I. J. and Toulmin, G. H. (1956) The number of new species, and the increase in population coverage when a sample is increased. Biometrika, 43.
Chao, A., Wang, Y. T., and Jost, L. (2013) Entropy and the species accumulation curve: a novel estimator of entropy via discovery rates of new species. Methods in Ecology and Evolution, 4.
Chao, A. and Jost, L. (2015) Estimating diversity and entropy profiles via discovery rates of new species. Methods in Ecology and Evolution, 6.
Software
R code and the online freeware application SPADE (Species-richness Prediction And Diversity Estimation), available online.
49
THANK YOU FOR LISTENING
The so-called heaven turns out to be here on earth (所謂天堂竟在人間)
"Heaven is under our feet as well as over our heads." Henry David Thoreau, Writer and Naturalist
[Photos: Sobralia luerorum; Crested goshawk (Accipiter trivirgatus formosae); Fairy pitta (Pitta nympha)]