Utilizing Fuzzy Logic for Gene Sequence Construction from Sub Sequences and Characteristic Genome Derivation and Assembly.

Presentation transcript:


Contents Team Bioinformatics  Genome Sequencing Research Problem Fuzzy Logic Ongoing Work Future Work

Team Advisors:  PI: Dr. Gregory Vert (Dept. of Computer Science, University of Nevada Reno)  Co-PI: Dr. Alison Murray (Desert Research Institute, Reno)  Co-PI: Dr. Monica Nicolescu (Dept. of Computer Science, University of Nevada Reno) Student :  Sara Nasser (Dept. of Computer Science, University of Nevada Reno)

Bioinformatics- Genome Sequencing Genome sequencing is figuring out the order of DNA nucleotides, or bases, in a genome—the order of As, Cs, Gs, and Ts that make up an organism's DNA. Sequencing the genome is an important step towards understanding it. The whole genome can't be sequenced all at once because available methods of DNA sequencing can only handle short stretches of DNA at a time.

Genome Sequencing Much of the work involved in sequencing lies in putting together this giant biological jigsaw puzzle. Various problems occur such as:  Errors in reading  Flips

Shot-Gun Sequencing The "whole-genome shotgun" method involves breaking the genome up into small pieces, sequencing the pieces, and reassembling the pieces into the full genome sequence.

Environmental Genomics Multiple sequence alignment is an important first step in many bioinformatics applications such as structure prediction, phylogenetic analysis and detection of key functional residues. The accuracy of these methods relies heavily on the quality of the underlying alignment. [1]

Multiple Sequence Alignment The traditional multiple sequence alignment problem is NP-hard, which means that an exact solution is computationally impractical for more than a few sequences [1]. In order to align a large number of sequences, many different approaches have been developed.

Tools and Techniques MUMMER Phrap, Phred, Consed TIGR The Smith-Waterman Algorithm Tree-Based Algorithms

Meta-genomics Meta-genomics is the application of modern genomics techniques to the study of communities of microbial organisms directly in their natural environments, bypassing the need for isolation and lab cultivation of individual species. [2]

Meta-genomics Bacteria can often have minor variations in their DNA that can result in different metabolic characteristics. The differences can make it difficult to classify bacteria taxonomically. What has been needed is a method of creating a characteristic representation (characteristic genome) from the subsequences of DNA found in several subvariants of bacteria of the same species. Such a genome could be used for more efficient classification at a molecular level through the process of controlled generalization.

Research Goals Given a collection of nucleotide sequences from multiple organisms, develop techniques based on fuzzy set theory and other methods for assembly of the sequences into the original full genome for each organism. Use the above techniques to develop a generalized approach for creating a characteristic genome that represents a generalization of the original organisms that donated sequence data.

The Data SYM (Original Raw Data):  Contains 302K Sequences  Average length of 450 base pairs (bp) It was obtained from a community of bacteria. There are an estimated 100 organisms. If, for example, 75% of the data is repeated, we still need to reassemble a sequence of ~33 million bp.
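The size estimate on this slide can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the variable names are illustrative, and the 75% repeat rate is the slide's own "for example" assumption, not a measured value.

```python
# Figures taken from the slide; the repeat rate is the slide's hypothetical.
num_sequences = 302_000        # 302K sequences
avg_length_bp = 450            # average read length in base pairs
repeated_percent = 75          # assumed share of redundant data

total_bp = num_sequences * avg_length_bp
unique_bp = total_bp * (100 - repeated_percent) // 100

print(total_bp)    # 135900000
print(unique_bp)   # 33975000
```

The result is roughly 34 million bp, in line with the slide's figure of ~33 million bp.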

Motivation Current tools could not solve the problem:  Complexity of the dataset, since the sequences are from the same species.  Sequencing environmental genomes, not a single organism.  Few tools sequence environmental genomes. Algorithm:  The underlying algorithm determines the accuracy of a match.  Performance can be greatly improved.  Interfaces could be improved.

Problem Genome assembly is an O(2^k) problem. Using dynamic programming, this cost can be reduced. Example in seconds:  Assembly that takes around seconds to can be reduced to seconds!

A Start We divide the problem into two steps:  Acquiring subsets such that each subset represents an organism  Assembling these into a characteristic genome sequence These two steps are carried out by:  Clustering  Assembly

Steps Clustering, Assembly: Raw Data, Assembled Sequences, Characteristic Genome

The Data CAVEEG (Cleaned Dataset):  Contains 128K Sequences  Assembled + Singletons  Length ranges from 200bp-1000bp

D2 Cluster D2 Cluster is software for clustering genome sequences. The technique is based on distance.

Clustering with D2 Cluster Clustering was performed on the 128K CAVEEG dataset  One dataset with 100K sequences was obtained  The majority of the data falls into one cluster  This makes the process of separating organisms hard  The clustering/assembly failed to assemble the sequences (the number of organisms was estimated manually and compared)

Problem with D2 clustering It does not look for contigs. Ex: The same cluster may contain: AATGCGTATTCGATGCGC CATACTTAGTCGATC – AG When we assemble, we want sequences that overlap: AATGCGTATTCGATGCGC TGCGCATCGTATCG
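The desired pair in this example can be joined because the end of the first sequence matches the start of the second. A minimal sketch of that suffix-prefix check follows; the function names and the `min_overlap` parameter are hypothetical and are not taken from D2 Cluster or from the authors' code.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    longest_possible = min(len(left), len(right))
    for k in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0  # no overlap of at least min_overlap bases


def merge_into_contig(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads on their exact overlap, or return None if they do not overlap."""
    k = overlap_length(left, right, min_overlap)
    return left + right[k:] if k else None


first = "AATGCGTATTCGATGCGC"
second = "TGCGCATCGTATCG"
print(overlap_length(first, second))     # 5  (the shared "TGCGC")
print(merge_into_contig(first, second))  # AATGCGTATTCGATGCGCATCGTATCG
```

This sketch only accepts exact overlaps; real reads contain errors, so a practical assembler would tolerate mismatches.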

Problems Because the data are closely related, the clustering technique assigns the sequences to the same cluster. Existing tools are unable to assemble the data correctly. The clustering software can only perform one round of clustering.

Ongoing Work Genome assembly using dynamic programming Uses Longest Common Sub-Sequence  LCS is commonly used (ex: Mummer)  We added restrictions  Enforce strict matches Encoding of data
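For reference, the textbook dynamic-programming form of Longest Common Sub-Sequence is sketched below. It is the standard algorithm only: the slide does not specify the added restrictions, the strict-match rule, or the data encoding, so none of those are modelled here, and the function name is illustrative.

```python
def longest_common_subsequence(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (O(len(a) * len(b)))."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the characters.
    chars = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            chars.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(chars))


print(longest_common_subsequence("AGGTAB", "GXTXAYB"))  # GTAB
```

The table has one cell per pair of positions, which is where the polynomial cost of the dynamic-programming approach comes from.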

Results

Then… We added clustering. Instead of comparing every sequence with every other sequence, we can compare each sequence with a group. This is faster because it needs fewer comparisons.
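The saving in comparisons can be illustrated with a small sketch in which each new sequence is compared only with one representative per group. Everything here is an assumption made for illustration: the k-mer Jaccard similarity, the 0.5 threshold, the choice of the first member as representative, and the function names are not the authors' method.

```python
def kmers(seq: str, k: int = 3) -> set:
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Shared k-mers divided by all k-mers seen in either sequence."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def cluster_by_representative(seqs, threshold: float = 0.5):
    """Greedy clustering: compare each sequence with group representatives only."""
    clusters = []      # each cluster is a list; its first member is the representative
    comparisons = 0
    for seq in seqs:
        for cluster in clusters:
            comparisons += 1
            if jaccard(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])  # no group was similar enough
    return clusters, comparisons


reads = ["AAAAAAAA", "AAAAAAAT", "CCCCCCCC", "CCCCCCCG", "AAAAAAAA"]
clusters, comparisons = cluster_by_representative(reads)
all_pairs = len(reads) * (len(reads) - 1) // 2
print(len(clusters), comparisons, all_pairs)  # 2 5 10
```

On this toy input the group-based approach makes 5 comparisons where comparing every pair would take 10.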

Clustering [3]

Comparison of Clustering

Performance Comparison

How much does it matter? Obtaining an exact full-length sequence is not essential. A sequence that is very close to the original is desired.

Snapshot 1

Snapshot 2

Fuzzy Logic Fuzzy logic has been used extensively in approximate string matching using distance measures, etc. However, very little work has been done on applying it to building genomes from subsequences of nucleotides. The concept of similarity and the application of fuzzy logic will be defined; this is a relatively new area in nucleotide sequencing.
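As an illustration of approximate string matching with a distance measure, the sketch below turns edit distance into a degree of similarity between 0 and 1, which can be read as a fuzzy membership value. This is an assumed, generic formulation; the presentation leaves its own definition of similarity to future work, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def fuzzy_similarity(a: str, b: str) -> float:
    """Degree of similarity in [0, 1]: 1.0 for identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / longest


print(fuzzy_similarity("ACGT", "ACGT"))  # 1.0
print(fuzzy_similarity("ACGT", "ACGA"))  # 0.75
print(fuzzy_similarity("AAAA", "CCCC"))  # 0.0
```

A graded value like this lets two reads count as "mostly the same" instead of forcing a yes/no match, which suits the goal of a sequence that is very close to the original.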

In Future.. Compare the technique with alignment software (Phrap, Mummer) Improve clustering  Define Similarity using Fuzzy Logic  Define Dissimilarity Parallelize the process

References [1] accessed May, [2] DeLong EF (2002) Microbial population genomics and ecology. Curr Opin Microbiol 5: 520–524. [3]

Questions