HPC in linguistic research
Andrew Meade, University of Reading

HPC use in linguistic research
Linguistic and biological models
Phylogenies
Linguistic data
Models of evolution
Parallelism
Scaling
Results
Ongoing work
Key challenges

Linguistic and biological systems
Attribute | Genetics | Linguistics
Discrete units | nucleotides, codons, genes, individuals | words, grammar, syntax
Replication | transcription | teaching, learning, imitation
Dominant mode(s) of inheritance | parent-offspring, clonal | parent-offspring, peer groups, teaching
Horizontal transmission | many mechanisms | borrowing
Mutation | many mechanisms: SNPs, mobile DNA | mistakes, vowel shifts, innovation
Selection | fitness differences among alleles | ?

Inferring evolutionary histories from linguistic data
Evolutionary histories (phylogenies):
Tools for understanding evolution
Depict relationships between languages
Identify groups which share a common ancestor
Calculate the timing of events
Account for lack of independence in the data
Inferred from data taken from different languages
Using an explicit statistical model of evolution
The problem is NP-hard; the number of possible trees grows as a double factorial
Search methods: Markov chain Monte Carlo, heuristic search, hill climbing
Product of data + model
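To make the double-factorial growth concrete, here is a minimal sketch. It assumes the slide refers to the standard count of rooted, bifurcating, labelled trees, (2n - 3)!! for n taxa; the function names are illustrative and not taken from the talk.

```python
def double_factorial(k: int) -> int:
    """Return k!! = k * (k - 2) * (k - 4) * ..., stopping at 1 or 2."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_tree_count(n_taxa: int) -> int:
    """Number of rooted, bifurcating, labelled trees on n taxa: (2n - 3)!!."""
    if n_taxa < 2:
        return 1
    return double_factorial(2 * n_taxa - 3)


# Tree space quickly becomes too large to enumerate.
for n in (4, 10, 16, 87):
    print(n, rooted_tree_count(n))
```

With 10 languages there are already more than 34 million candidate trees, which is why sampling and heuristic search replace exhaustive enumeration.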

[Figure: language phylogeny with groups labelled Greek, Indo-Iranian, Slavic, Germanic, Celtic, Romance]

The data
Swadesh list: Morris Swadesh, 1940 onwards
200 meanings, present in (almost) all languages
Chosen to be stable, slowly evolving and resistant to borrowing
Somewhat of a language “gene”

Cognate classes
Words with a common evolutionary ancestry and meaning
Class “Fish”: English Fish, Danish Fisk, Dutch Visch
Class “Ryba”: Czech Ryba, Russian Ryba, Bulgarian Riba
23 other languages; 34 other languages

Data coding, cognates
Cognates: words and meanings that are derived from a common ancestor
Languages evolve by a process of descent with modification
Language | “When” | “Water”
English | when | water
German | wann | wasser
French | quand | eau
Italian | quando | acqua
Greek | qote | nero
Hittite | kuwapi | watar
“Water”: 3 cognates
“When”: 1 cognate
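The coding step above can be sketched as turning cognate classes into presence/absence columns, one column per cognate class. The class numbers below are an assumption made for illustration: the slide gives only the counts (3 classes for "water", 1 for "when"), and the grouping water/wasser/watar, eau/acqua, nero is a plausible reading rather than something the transcript states.

```python
# Cognate class per language for each meaning.
# ASSUMPTION: the class labels for "water" are illustrative; the slide
# only states that "water" has 3 cognate classes and "when" has 1.
cognate_classes = {
    "when": {"English": 1, "German": 1, "French": 1,
             "Italian": 1, "Greek": 1, "Hittite": 1},
    "water": {"English": 1, "German": 1, "French": 2,
              "Italian": 2, "Greek": 3, "Hittite": 1},
}


def binary_code(classes):
    """Convert cognate classes into binary presence/absence sites."""
    languages = sorted({lang for m in classes.values() for lang in m})
    # One column (site) per (meaning, cognate class) pair.
    columns = [(meaning, cls)
               for meaning, assignment in classes.items()
               for cls in sorted(set(assignment.values()))]
    matrix = {lang: [1 if classes[m].get(lang) == c else 0
                     for m, c in columns]
              for lang in languages}
    return columns, matrix


columns, matrix = binary_code(cognate_classes)
print(columns)
for language, row in matrix.items():
    print(f"{language:8s}", row)
```

Each row is then a language and each column a binary site, which is the form a two-state model of cognate gain and loss works on.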

Continuous-time Markov model
Two states: 0 (non-cognate) and 1 (cognate)
Q01: rate at which cognates are gained (0 to 1)
Q10: rate at which cognates are lost (1 to 0)
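For a two-state chain like this, the transition probabilities over a branch of length t have a well-known closed form, P(t) = exp(Qt). The sketch below uses that standard result; the function name and the example rates are illustrative assumptions, not values from the talk.

```python
import math


def transition_probs(q01: float, q10: float, t: float):
    """Return the 2x2 matrix P(t) for a two-state continuous-time Markov chain.

    q01 is the gain rate (0 -> 1) and q10 the loss rate (1 -> 0).
    Row i, column j holds P(state j at time t | state i at time 0).
    """
    total = q01 + q10
    pi0 = q10 / total          # stationary frequency of state 0
    pi1 = q01 / total          # stationary frequency of state 1
    decay = math.exp(-total * t)
    return [
        [pi0 + pi1 * decay, pi1 * (1.0 - decay)],
        [pi0 * (1.0 - decay), pi1 + pi0 * decay],
    ]


# Example rates chosen only for illustration.
for t in (0.0, 0.5, 5.0):
    print(t, transition_probs(0.2, 0.6, t))
```

At t = 0 the matrix is the identity, and for long branches each row approaches the stationary frequencies (q10, q01) / (q01 + q10).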

The likelihood model
Product over the model: 1 to 12 categories
Product over the data: 200 to 100,000 sites

Levels of parallelism
Data: analysis of multiple datasets (3-5)
Model: test a range of models (10-20)
Run: stochastic process, multiple runs (5-10)
Code: an individual run can still take years
Trivially parallel

The problem
2003: 16 taxa, 125 sites, 1 x model
2005: 87 taxa, 2,450 sites, 4 x model
2007: 400 taxa, 34,440 sites, 100 x model
Complexity increase of 700,000x, 5-6 orders of magnitude
4.8 years per run; typically 5 publication-quality runs + 10 model tests
4.8 years > attention span of academics; results are required in days
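The 700,000x figure can be reproduced with simple arithmetic, under the assumption (mine, not stated on the slide) that cost scales as taxa x sites x model factor:

```python
import math

# (taxa, sites, model factor) per year, as listed on the slide.
analyses = {
    2003: (16, 125, 1),
    2005: (87, 2450, 4),
    2007: (400, 34440, 100),
}


def relative_cost(year: int, base_year: int = 2003) -> float:
    """Cost of an analysis relative to the base year.

    ASSUMPTION: cost is proportional to taxa * sites * model factor.
    """
    taxa, sites, model = analyses[year]
    base_taxa, base_sites, base_model = analyses[base_year]
    return (taxa * sites * model) / (base_taxa * base_sites * base_model)


for year in sorted(analyses):
    ratio = relative_cost(year)
    print(year, ratio, round(math.log10(ratio), 2))
```

Under that assumption the 2007 analysis is 25 x 275.52 x 100 = 688,800 times the 2003 one, which rounds to the quoted 700,000x and sits between 5 and 6 orders of magnitude.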

Parallel method 1: distribute the data (MPI)
[Figure: data matrix of languages by cognates, divided among Core 1, Core 2, Core ...]
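A minimal sketch of the data-distribution idea, assuming sites are independent so that the total log-likelihood is a sum of per-site terms. It runs in a single process and only emulates the split; in an MPI program each rank would hold one block and a reduction would combine the partial sums. The function names are hypothetical.

```python
def split_sites(n_sites: int, n_workers: int):
    """Return contiguous [start, stop) site ranges, sizes differing by at most one."""
    base, extra = divmod(n_sites, n_workers)
    ranges = []
    start = 0
    for rank in range(n_workers):
        stop = start + base + (1 if rank < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def total_log_likelihood(site_log_likes, n_workers: int) -> float:
    """Sum per-site log-likelihoods block by block, then combine the partial sums."""
    partials = [sum(site_log_likes[start:stop])
                for start, stop in split_sites(len(site_log_likes), n_workers)]
    # This final sum plays the role of the reduction step across workers.
    return sum(partials)


print(split_sites(10, 3))
print(total_log_likelihood([-1.5] * 10, 3))
```

Only one number per worker has to be communicated per likelihood evaluation, which is why splitting by site tends to scale well while each block stays large.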

Parallel method 2: distribute the model (OpenMP)
[Figure: each core makes one pass over the full data: Core 1 Pass 1, Core 2 Pass 2, Core 3 Pass 3, Core 4 Pass 4]

Distribute the data and the model (MPI + OpenMP)
[Figure: each pass splits the data across two cores: Pass 1 on Cores 1-2, Pass 2 on Cores 3-4, Pass 3 on Cores 5-6, Pass 4 on Cores 7-8]

[Figure: runtime in seconds (log10 scale) against number of cores]

[Figure: parallel efficiency against number of cores]

Results
Runtime reduced from 4.8 years to [value missing from transcript]
Good scaling, but not sustainable
HPC has allowed for the accurate analysis of large, complex data sets with statistically justifiable models.
[Table: Cores, Days; values missing from transcript]

Current work
Phoneme data
Modelling sound utterances
Better resolution than cognacy data
Relevant linguistic patterns are emerging
120 phonemes, 2 cognacy judgments
Another 3 orders of magnitude of complexity
Accelerator implementation: CUDA / OpenCL
Language | Word | Cognacy | Phoneme
English | Fish | 1 |
Danish | Fisk | 1 |

Scalable computing
Last 10 years: 5-6 orders of magnitude increase in complexity
Reasonably scalable code, but a redesign is needed
Need to change the how, not the what
What: statistical framework, realistic models
How: algorithm, language, parallelisation method, hardware
Scalable algorithms

[Figure: labels Burn-in, Serial, Convergence, Parallel]

Parallel sampling using multiple chains
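A minimal sketch of sampling with several independent chains, assuming a toy one-dimensional target (a standard normal) in place of a phylogenetic posterior. The sampler, step size and seeds are illustrative assumptions; the chains run one after another here, but since they share no state each could run on its own core.

```python
import math
import random


def log_target(x: float) -> float:
    """Log density (up to a constant) of the toy target, a standard normal."""
    return -0.5 * x * x


def run_chain(seed: int, n_steps: int, burn_in: int, step_size: float = 1.0):
    """Run one Metropolis chain and return the samples kept after burn-in."""
    rng = random.Random(seed)   # independent random stream per chain
    x = 0.0
    samples = []
    for step in range(n_steps):
        proposal = x + rng.uniform(-step_size, step_size)
        log_ratio = log_target(proposal) - log_target(x)
        # Accept uphill moves always, downhill moves with probability exp(log_ratio).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        if step >= burn_in:
            samples.append(x)
    return samples


def sample_with_chains(n_chains: int, n_steps: int, burn_in: int):
    """Pool post-burn-in samples from independent chains."""
    pooled = []
    for seed in range(n_chains):
        pooled.extend(run_chain(seed, n_steps, burn_in))
    return pooled


pooled = sample_with_chains(n_chains=4, n_steps=2000, burn_in=100)
print(len(pooled), sum(pooled) / len(pooled))
```

Every chain pays its own burn-in, so adding chains speeds up the sampling phase but does not shorten the burn-in itself.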

Key challenges
Computing is a rate-limiting step
Treading water / drowning
Widening gap between computing power and data and model complexity
Data set size and model complexity are restricted
[...] year old methods, which are less accurate and non-statistical, are returning
Connecting researchers with results, not HPC
HPC is a nuisance in science
Steep learning curve
High cost: hardware, running costs and personnel
Access and flexibility
Not a one-off activity: thousands of data sets are produced each year, [...] published in 2011

Acknowledgments
Mark Pagel