
1 The Phylogeny of a Dataset Andrea K Thomer & Nicholas M. Weber Center for Informatics Research in Science and Scholarship Graduate School of Library and Information Science University of Illinois at Urbana-Champaign

2 Time

3 How do we understand the evolution of digital objects? Time

4 How do we understand the evolution of digital objects when they are complexly interrelated? c/o Steve Worley, NCAR

5 Evolution as a tree From http://tolweb.org/tree/home.pages/aboutoverview.html

6 tl;dr

7 1) Biologists construct evolutionary trees by comparing animals’ traits and inferring how they may have evolved

8 tl;dr
1) Biologists construct evolutionary trees by comparing animals’ traits and inferring how they may have evolved
2) And there’s lots of free, open source software available for this work.

9 Non-biological evolution
Cornets (Tëmkin & Eldredge, 2007)
“Little Red Riding Hood” (Tehrani, 2013)
Why not datasets? (which, like organisms, also often lack explicit documentation…)

10 A phylogenetic approach helps us:
Study evolution of digital objects more rigorously
Model how digital objects are reworked into new “species”
Understand what properties of a digital object must be preserved or expressed to facilitate modeling
We ask: In a digital object, what properties lead to evolutionary fitness?

11

12 Dataset of datasets: COADS, ICOADS and its derivatives
(I)COADS = (International) Comprehensive Ocean-Atmosphere Data Set
Community project bringing together 1000s of marine surface measurements from buoys, ships’ logs, and more
– First release: 1987
– New releases as new datasets are added; now at 2.5
Enormously modified and reused by others in climate science

13 Towards a more rigorous view of the evolutionary process: anagenesis and phylogenesis
ICOADS documentation largely describes anagenesis (versioning)
GCMD* (Global Change Master Directory) = 1 of many potential sources of data on phylogenesis (branching)
– Found 99 metadata records for versions/derivatives of ICOADS (“specimens”) through keyword search
– Metadata includes scientific parameters, geographic scope, instruments used, and more
*Known problems in metadata quality, but the value of GCMD is breadth rather than depth

14 Workflow Download records Create character matrix Create a NEXUS file Assess the tree!

15 Workflow Download records Create character matrix Create a NEXUS file Assess the tree!
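The "Create a NEXUS file" step can be sketched in a few lines of Python. This is only an illustration, not the authors' script: the function name `to_nexus`, the taxon labels, and the symbol set are assumptions, and the output is a generic NEXUS DATA block of the kind phylogenetics software reads.

```python
def to_nexus(matrix, symbols="01"):
    """Serialize {taxon_name: state_string} as a NEXUS DATA block.

    Taxon names are assumed to contain no spaces; every row must have
    the same number of characters.
    """
    if not matrix:
        raise ValueError("matrix is empty")
    nchar = len(next(iter(matrix.values())))
    if any(len(row) != nchar for row in matrix.values()):
        raise ValueError("all rows must have the same number of characters")

    width = max(len(name) for name in matrix)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(matrix)} NCHAR={nchar};",
        f'  FORMAT DATATYPE=STANDARD SYMBOLS="{symbols}" MISSING=? GAP=-;',
        "  MATRIX",
    ]
    for name, row in matrix.items():
        # Pad names so the state strings line up in a column.
        lines.append(f"    {name.ljust(width)}  {row}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


# Hypothetical two-character matrix for three made-up records.
example = {"record_A": "01", "record_B": "11", "record_C": "10"}
nexus_text = to_nexus(example)
print(nexus_text)
```

Writing `nexus_text` to a `.nex` file would give a file that a tool such as PAUP* can load; any analysis commands would still have to be supplied separately.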

16 Identifying “characters”
In phylogenetics: characters are morphological features, DNA, other measurable qualities
In ICOADS datasets: we treated each metadata field as a character, and each term as a character state

17 Dates, times, and resolution are “binned” into categories
Parameters are split into individual categories, and presence/absence is noted in binary
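A minimal sketch of this coding scheme, assuming made-up records and bin edges (the field names `start_year` and `parameters`, the record names, and the year thresholds are all hypothetical, not taken from the GCMD records used in the talk): one multistate character for a binned date, followed by one binary character per parameter term.

```python
# Hypothetical metadata records; names and values are illustrative only.
records = {
    "dataset_A": {"start_year": 1854,
                  "parameters": ["SEA SURFACE TEMPERATURE", "AIR TEMPERATURE"]},
    "dataset_B": {"start_year": 1950,
                  "parameters": ["SEA SURFACE TEMPERATURE"]},
    "dataset_C": {"start_year": 1985,
                  "parameters": ["WIND SPEED", "AIR TEMPERATURE"]},
}


def bin_year(year):
    """Bin a start year into a coarse category (bin edges are assumptions)."""
    if year < 1900:
        return "0"
    if year < 1980:
        return "1"
    return "2"


def build_matrix(records):
    """Return (parameter_terms, {name: state_string}) for the records."""
    # One binary character per distinct parameter term, in sorted order.
    all_params = sorted({p for rec in records.values() for p in rec["parameters"]})
    matrix = {}
    for name, rec in records.items():
        states = [bin_year(rec["start_year"])]
        # Presence/absence of each parameter term, coded 1/0.
        states += ["1" if p in rec["parameters"] else "0" for p in all_params]
        matrix[name] = "".join(states)
    return all_params, matrix


params, matrix = build_matrix(records)
for name, row in matrix.items():
    print(name, row)
```

Each row has one column for the binned year plus one column per parameter term, so every record ends up with a state string of the same length.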

18 https://github.com/akthom/phylomemetics

19 Method:
– Software: PAUP* (Phylogenetic Analysis Using Parsimony *and Other Methods)
– Maximum likelihood algorithm (we can talk about that more if people are interested)
Result:

20

21 Phylogeny of ICOADS datasets
Each fork = a “speciation event”
Each group joined at a node = a “clade”
– We annotated primary clades
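To make the fork/clade vocabulary concrete, here is a small sketch that lists every clade in a toy tree. The tree shape and the leaf names are invented for illustration and do not correspond to the ICOADS result; a tree is represented as nested tuples, where each tuple is a fork and each string is a leaf.

```python
def clades(tree):
    """Return (leaves_of_tree, clade_list).

    A leaf is a string; an internal node (fork) is a tuple of subtrees.
    Each internal node contributes one clade: the set of leaves below it.
    """
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)  # the clade joined at this node
    return leaves, found


# Hypothetical tree with four forks and five leaves.
toy_tree = (("ds_A", "ds_B"), ("ds_C", ("ds_D", "ds_E")))
all_leaves, found = clades(toy_tree)
for clade in found:
    print(sorted(clade))
```

The last clade listed is the root, which contains every leaf; the smaller sets are the nested groups that one would annotate on the tree.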

22 Related datasets cluster; some clades show up as derived from “ancestral” forms
– Clade 1: original COADS datasets
– Clade 2: ICOADS input datasets
– Clade 3: Sea surface flux calculations
– Clade 4: later COADS data products
– Clade 5: COADS derivatives

23

24 Why does it matter that digital objects evolve? Or how?
Digital preservation implications
– A way to understand the history and contents of a collection
– Could be used to browse repositories?
– Could be used to complement citation analysis?
Offers a lens into cooperative processes that create objects
– A way to “read” interplay of different scientific cultures

25 Challenges and areas for future work
What existing statistical models of evolution are most appropriate for this? Or do we need to develop a new one?
How can existing software be modified for this work?
How do we show reticulating relationships?

26 Future work: Phylogenies showing hybridization & ‘spontaneous generation’

27 Future work: what makes a dataset “fit”?
Part of ICOADS’s success and proliferation is surely due to low levels of “competition”
– But is some of it due to its open availability?
– How do we test the effects of openness on a dataset’s fitness-for-purpose?

28 Acknowledgements Thanks to Julie Allen, Peter Fox and Steve Worley for feedback, and our reviewers for excellent comments. Thanks to CIRSS and the DCERC program for funding

29 References & Additional Reading
Datasets mentioned in this talk: https://github.com/akthom/phylomemetics
Howe, C. J., & Windram, H. F. (2011). Phylomemetics: evolutionary analysis beyond the gene. PLoS Biology, 9(5), e1001069. doi:10.1371/journal.pbio.1001069
O’Brien, M. J., Darwent, J., & Lyman, R. L. (2001). Cladistics Is Useful for Reconstructing Archaeological Phylogenies: Palaeoindian Points from the Southeastern United States. Journal of Archaeological Science, 28(10), 1115–1136. doi:10.1006/jasc.2001.0681
Tehrani, J. J. (2013). The Phylogeny of Little Red Riding Hood. PLoS ONE, 8(11), e78871. doi:10.1371/journal.pone.0078871
Tëmkin, I., & Eldredge, N. (2007). Phylogenetics and Material Cultural Evolution. Current Anthropology, 48(1), 146–154.

30 Homology

31

32

33 Future work: Phylogenies showing hybridization & ‘spontaneous generation’

