Sequence File Formats.



Sequence File Formats
Different formats exist for different uses.
Competing formats were developed in parallel.
Some are easy to read; some are easy to write programs for.
You don't have to stick to these formats, but parsers for them are already written!
Most formats are plain text (.bam files are a binary exception).

IDs versus accessions
When people first started, they used gene names as IDs.
But there are too few gene names, and databases require unique IDs.
Databases now use a variety of accession numbers.
The simplest ID is a number that you increment, as you can (almost) never run out of IDs.

Standard nucleotide codes
Symbol  Meaning            Origin
G       Guanine
A       Adenine
C       Cytosine
T       Thymine
R       G or A             puRine
Y       T or C             pYrimidine
M       A or C             aMino
K       G or T             Keto
N       G or A or T or C   aNy
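The codes in the table above can be held in a small lookup table. The following is a minimal Python sketch, not part of the original slides; the names `IUPAC` and `matches` are made up for illustration, and only the symbols listed on the slide are included.

```python
# Nucleotide symbols from the slide, mapped to the bases each one allows.
# Illustrative subset only: the full IUPAC alphabet has more symbols.
IUPAC = {
    "G": {"G"},
    "A": {"A"},
    "C": {"C"},
    "T": {"T"},
    "R": {"G", "A"},            # puRine
    "Y": {"T", "C"},            # pYrimidine
    "M": {"A", "C"},            # aMino
    "K": {"G", "T"},            # Keto
    "N": {"G", "A", "T", "C"},  # aNy
}


def matches(symbol: str, base: str) -> bool:
    """Return True if the concrete base is allowed by the (possibly ambiguous) symbol."""
    return base.upper() in IUPAC[symbol.upper()]
```

For example, `matches("R", "a")` is true because R stands for G or A, while `matches("Y", "G")` is false.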

Standard protein codes
One  Three  Amino acid       One  Three  Amino acid
A    Ala    Alanine          M    Met    Methionine
C    Cys    Cysteine         N    Asn    Asparagine
D    Asp    Aspartic acid    P    Pro    Proline
E    Glu    Glutamate        Q    Gln    Glutamine
F    Phe    Phenylalanine    R    Arg    Arginine
G    Gly    Glycine          S    Ser    Serine
H    His    Histidine        T    Thr    Threonine
I    Ile    Isoleucine       V    Val    Valine
K    Lys    Lysine           W    Trp    Tryptophan
L    Leu    Leucine          Y    Tyr    Tyrosine
                             X    Xaa    Unknown

Fasta
Simplest file format. Easy to parse, easy to use.

>identifier [optional information]
ATGACTAGCATGCATCGATCGATCGACTAGCATG
ACTGCACTACGACGACAGCAAC
>identifier2 [optional information]
ACTAGCTCAGCTAGAGAGCTACGATCAGCACTAC
atccgatagcatgacttactACGCTAGCATCAG
TCATACAT
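To show how little is needed to parse fasta, here is a minimal Python sketch. It is an illustration rather than a replacement for an existing parser; the function name `read_fasta` and the short example records are assumptions made for this example.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect sequences keyed by identifier (the first word after '>')."""
    parts: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            parts[current] = []
        elif current is not None:
            parts[current].append(line)  # sequence may span several lines
    return {name: "".join(chunks) for name, chunks in parts.items()}


# Hypothetical two-record input in the layout shown on the slide.
example = [
    ">identifier [optional information]",
    "ATGACT",
    "AGC",
    ">identifier2",
    "actg",
]
seqs = read_fasta(example)
```

Sequence lines are joined, so `seqs["identifier"]` holds `ATGACTAGC`. Case is preserved, since lower-case letters in a fasta file often carry meaning.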

GenBank
More complex; includes detailed information on genes, CDS, annotation, etc.
Human readable.
Difficult to parse.
Use standard parsers (BioPerl, BioJava, etc.).

LOCUS       NC_001418               5833 bp ss-DNA     circular PHG 17-APR-2009
DEFINITION  Pseudomonas phage Pf3, complete genome.
ACCESSION   NC_001418
VERSION     NC_001418.1  GI:9626316
DBLINK      Project:14061
KEYWORDS    .
SOURCE      Pseudomonas phage Pf3
  ORGANISM  Pseudomonas phage Pf3
            Viruses; ssDNA viruses; Inoviridae; Inovirus.
FEATURES             Location/Qualifiers
     source          1..5833
                     /organism="Pseudomonas phage Pf3"
                     /mol_type="genomic DNA"
                     /host="Pseudomonas aeruginosa"
                     /db_xref="taxon:10872"
                     /note="Pf3 bacteriophage DNA from P.aeruginosa infected
                     with plasmid RP1."
     gene            join(5763..5833,1..106)
                     /locus_tag="Pf3_1"
                     /db_xref="GeneID:1260905"
     CDS             join(5763..5833,1..106)
                     /note="orf 58, part 2"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="NP_040651.1"
                     /db_xref="GI:9626317"
                     /translation="MSYYVCVQLVNDVCHEWAERSDLLSLPEGSGLQIGGMLLLLSAT
                     AWGIQQIARLLLNR"

3241 aggtcctgtt ggccttaaga tcacccaagg gcatcttgcc agatggtacc gtcattactt
3301 atgagaaaat atcctcaatg ggtaatggct ataccttcga gcttgagtcg cttatatttg
3361 cggctcttgc tcggtcttta tgcgaattac tgggcttacg accgtcagat gttacggtct
3421 atggcgatga cataatattg ccatcagacg cgtgcagtcc tctagttgaa gttttctcct
3481 atgttggttt tcgtaccaac aagaagaaaa cgttttctag tggaccgttc cgagagtcgt
3541 gcggaaagca ctactttttg ggcgttgacg tcacaccttt ctacatacgt cgccgtatag
3601 tgagtccctc cgatctcata ctggttttga accagatgta tcgttgggcc acaattgacg
3661 gcgtatggga tcctagggta tatcctgtat acaccaagta tagacgttac cttccggaaa
3721 ttctccggag gaatgtcgtg cctgatggat acggtgatgg tgccctcgtc ggatctgtct
3781 taatcagtcc tttcgcagaa aatcgcggtt gggttcggcg tgtgccgatg attatagaca
3841 agaggaaaga ccgagttcgt gacgaatatg gttcgtatct ctacgagcta tggtcgttgc
3901 agcaactcga atgtgacagt gagttcccct ttaacgggtc gctggtcgtt ggttccactg
3961 atggcactct cgcttacgca caccgagaac ggttacctac cgttatcagt gatgccgtaa
4021 gtgcgtttga catcatgtgg ataccgtgca gtagtcgtgt cctggctccc tacggggatt
4081 tccggaggca cgaaggctct atcctaaaaa tggggtagcg cctgggaggg gtgcattatg
4141 caccctaggt tagcaatact taaactaacc ttctcaaaag agagagtgaa ggctctgctt
4201 tgccctcact cctccca
//
LOCUS       NC_003301               3192 bp ds-RNA     linear   PHG 23-AUG-2008
DEFINITION  Pseudomonas phage phi8 segment S, complete sequence.
ACCESSION   NC_003301
VERSION     NC_003301.1  GI:17736965
DBLINK      Project:14731
KEYWORDS    .
SOURCE      Pseudomonas phage phi8
  ORGANISM  Pseudomonas phage phi8
            Viruses; dsRNA viruses; Cystoviridae; Cystovirus.
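The LOCUS line that opens each GenBank record packs several fields into one line. The sketch below splits the two LOCUS lines shown above with plain Python. It assumes exactly the eight whitespace-separated fields seen in these examples; real records vary, which is one reason the standard parsers are the safer choice. The function name `parse_locus` and the dictionary keys are made up for illustration.

```python
def parse_locus(line: str) -> dict:
    """Split a LOCUS line laid out like the examples in the slides."""
    fields = line.split()
    # Assumption: LOCUS, name, length, "bp", molecule, topology, division, date.
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError("LOCUS line does not match the assumed layout")
    return {
        "name": fields[1],
        "length": int(fields[2]),
        "molecule": fields[4],
        "topology": fields[5],
        "division": fields[6],
        "date": fields[7],
    }


pf3 = parse_locus("LOCUS NC_001418 5833 bp ss-DNA circular PHG 17-APR-2009")
phi8 = parse_locus("LOCUS NC_003301 3192 bp ds-RNA linear PHG 23-AUG-2008")
```

Here `pf3["length"]` is the integer 5833 and `phi8["topology"]` is `linear`.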

GFF3
Tab-separated format. Easy to parse.
Columns:
1. Contig
2. Source database
3. Feature type
4. Start
5. Stop
6. Score
7. Strand
8. Phase
9. Attributes
Attributes are tag=value pairs separated by ";".
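The nine columns above map directly onto a split on tabs. The following Python sketch is an illustration under stated assumptions: the column keys follow the slide's wording, `parse_gff3_line` is a made-up name, and the example feature line is hypothetical (it is not taken from a real GFF3 file). Comment lines and escaped characters are not handled.

```python
GFF3_COLUMNS = [
    "contig", "source", "type", "start", "stop",
    "score", "strand", "phase", "attributes",
]


def parse_gff3_line(line: str) -> dict:
    """Parse one tab-separated GFF3 feature line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError("expected 9 tab-separated columns")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["stop"] = int(record["stop"])
    # Attributes are tag=value pairs separated by ";".
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            attributes[tag] = value
    record["attributes"] = attributes
    return record


# Hypothetical feature line, built with explicit tabs for clarity.
example_line = "\t".join(
    ["NC_001418", "RefSeq", "gene", "1", "106", ".", "+", ".",
     "ID=gene1;locus_tag=Pf3_1"]
)
feature = parse_gff3_line(example_line)
```

After parsing, `feature["attributes"]["locus_tag"]` is `Pf3_1` and the start and stop columns are integers.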

ASN.1
Developed as a computer-readable form of GenBank.
Not widely used.

ASN.1
seq {
  id { local id 1 },
  descr { title "" },
  inst {
    repr raw,
    mol aa,
    length 131,
    topology linear,
    { seq-data iupacaa "TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQA
TGGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAGSRPNRFA
PTLMSSCITSTTGPPAWAGDRSHE" } } ,
  iupacaa "TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT
GGWKTCSGTCTTSTSTRHRGRSGW----------RASRKSMRAACSRSAGSRPNRFAPTL
MSSCITSTTGPPAWAGDRSHE" } }

Base calling
You need to be sure which base you have identified.
Accuracy depends on the sequencing technology.
Each machine includes its own base-calling software.
Phred is a historical package developed at U. Washington.
Phred scores encode the probability that the base call is correct.

Quality values
Phred 10: a 1 in 10^1 (1 in 10) chance that the base is wrong.
Phred 99: the base is (almost certainly) correct!
Fastq scores are the Phred score + 33, then converted to ASCII text.
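The relationship between a Phred score, its error probability, and its fastq character can be written out in a few lines. This Python sketch uses the standard definition Q = -10 log10(P), where P is the probability that the base is wrong, together with the + 33 offset described above. The function names are made up for illustration.

```python
def error_probability(phred: float) -> float:
    """Probability that the base call is wrong for a given Phred score."""
    return 10 ** (-phred / 10)


def phred_to_char(phred: int) -> str:
    """Encode a Phred score as a fastq character (score + 33, then ASCII)."""
    return chr(phred + 33)


def char_to_phred(char: str) -> int:
    """Decode a fastq quality character back to its Phred score."""
    return ord(char) - 33
```

A Phred score of 10 gives an error probability of about 0.1, and a score of 40 is written as the character `I` (ASCII 73).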

FastQ
Based on the fasta format.
Contains information about the quality of the sequence.
Quality comes from the sequencing machines!
Four lines per sequence:
1. Line starting with @ = identifier line before the sequence
2. DNA sequence
3. Line starting with + = identifier line before the quality scores
4. String = quality scores, encoded as ASCII characters (score + 33)

ASCII character codes
ASCII Char   ASCII Char   ASCII Char   ASCII Char   ASCII Char
33    !      50    2      70    F      90    Z      110   n
34    "      51    3      71    G      91    [      111   o
35    #      52    4      72    H      92    \      112   p
36    $      53    5      73    I      93    ]      113   q
37    %      54    6      74    J      94    ^      114   r
38    &      55    7      75    K      95    _      115   s
39    '      56    8      76    L      96    `      116   t
40    (      57    9      77    M      97    a      117   u
41    )      58    :      78    N      98    b      118   v
42    *      59    ;      79    O      99    c      119   w
43    +      60    <      80    P      100   d      120   x
44    ,      61    =      81    Q      101   e      121   y
45    -      62    >      82    R      102   f      122   z
46    .      63    ?      83    S      103   g      123   {
47    /      64    @      84    T      104   h      124   |
48    0      65    A      85    U      105   i      125   }
49    1      66    B      86    V      106   j      126   ~

fastq

@SRR014849.1 EIXKN4201CFU84 length=93
GGGGGGGGGGGGGGGGCTTTTTTTGTTTGGAACCGAAAGGGTTTTGAATTTCAAACCCTTTTCGGTTTCCAACCTTCCAAAGCAATGCCAATA
+SRR014849.1 EIXKN4201CFU84 length=93
3+&$#"""""""""""7F@71,'";C?,B;?6B;:EA1EA1EA5'9B:?:#9EA0D@2EA5':>5?:%A;A8A;?9B;D@/=<?7=9<2A8==

The second line is the DNA sequence; the fourth line holds the quality scores, one character per base.
Note: Illumina has used a variant of fastq (older pipeline versions add 64 to the score rather than 33) that is not compatible with everyone else's format!
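Reading fastq follows directly from the four-lines-per-sequence layout. The Python sketch below is a minimal illustration, assuming the common + 33 offset and records that are not wrapped across extra lines; `read_fastq` is a made-up name, and the short record is hypothetical (its quality string reuses the first five characters of the example above).

```python
from typing import Iterable, Iterator, List, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (identifier, sequence, Phred scores) for each four-line record."""
    stripped = (line.rstrip("\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        plus = next(stripped, None)
        quality = next(stripped, None)
        if quality is None:
            raise ValueError("truncated fastq record")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line fastq record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Assumes the + 33 offset; older Illumina files would need 64.
        yield header[1:], sequence, [ord(char) - 33 for char in quality]


# Hypothetical short record for illustration.
records = list(read_fastq(["@read1", "GGGGC", "+", "3+&$#"]))
```

The decoded scores for this record are 18, 10, 5, 3 and 2, so quality drops quickly along the read.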

How to convert fastq to fasta
prinseq-lite.pl -fastq input.fastq -out_format 2
https://edwards.sdsu.edu/research/fastq-to-fasta/
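The same conversion can be sketched in a few lines of Python: keep the identifier and sequence lines, drop the "+" and quality lines. This is an illustrative alternative to the prinseq command above, not a description of how prinseq works; `fastq_to_fasta` is a made-up name, and the code assumes unwrapped four-line records.

```python
import io
from typing import Iterable, TextIO


def fastq_to_fasta(fastq_lines: Iterable[str], out: TextIO) -> int:
    """Write fasta records to out; return the number of records converted."""
    records = 0
    position = 0  # 0 = @ line, 1 = sequence, 2 = + line, 3 = quality
    for line in fastq_lines:
        line = line.rstrip("\n")
        if position == 0:
            if not line:
                continue  # skip blank lines between records
            if not line.startswith("@"):
                raise ValueError("expected a line starting with @")
            out.write(">" + line[1:] + "\n")
            records += 1
        elif position == 1:
            out.write(line + "\n")
        position = (position + 1) % 4
    return records


# Hypothetical one-record input, converted into an in-memory buffer.
demo_out = io.StringIO()
converted = fastq_to_fasta(["@read1\n", "GGGGC\n", "+\n", "3+&$#\n"], demo_out)
```

To convert real files, open the fastq file for reading and the fasta file for writing, then pass both file objects to the function.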