Distribution of Introns among Full Length cDNA

Bioinformatics Capstone: Distribution of Introns among Full Length cDNA
By Xin Hong
Advisors: Dr. Michael Lynch and Dr. Sun Kim

3 Motivation
Genomic sequences
Full-length cDNA project
Gene prediction programs do not include UTR regions.
UTR structure and function, and the NMD theory.

4 Definition of UTRs and Introns
5’UTR sequences were defined as the mRNA region spanning from the cap site to the start codon (excluded). 3’UTR sequences were defined as the mRNA region spanning from the stop codon (excluded) to the start of the poly(A) tail. The coding region begins with the initiation codon, which is normally ATG, and ends with one of three termination codons: TAA, TAG or TGA.
[Figure: genomic sequence, pre-mRNA (segments 1, 2, 3) and mRNA, showing the 5’UTR, the CDS from AUG to UAA, and the 3’UTR.]
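The definitions above can be sketched as a small splitting routine. This is a minimal illustration, not the code used in the project: the function name `split_mrna`, the 0-based half-open coordinates, and the assumption that the poly(A) tail has already been trimmed are all choices made here for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_mrna(seq, cds_start, cds_end):
    """Split a cDNA string into (5'UTR, CDS, 3'UTR).

    cds_start: 0-based index of the first base of the start codon.
    cds_end:   0-based index one past the last base of the stop codon.
    Assumes the poly(A) tail has already been removed from `seq`.
    """
    if not (0 <= cds_start < cds_end <= len(seq)):
        raise ValueError("CDS coordinates fall outside the sequence")
    if (cds_end - cds_start) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    cds = seq[cds_start:cds_end]
    if cds[-3:].upper() not in STOP_CODONS:
        raise ValueError("CDS does not end with TAA, TAG or TGA")
    # The start and stop codons belong to the CDS, so both UTRs exclude them.
    return seq[:cds_start], cds, seq[cds_end:]


# Toy sequence invented for the example.
utr5, cds, utr3 = split_mrna("GGCACC" + "ATGGCTTAA" + "CCTTGG", 6, 15)
```

The start codon is deliberately not checked against ATG, since the slide only says the initiation codon is "normally" ATG.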

5 Function of UTRs
Translational control
mRNA subcellular localization
mRNA stability
(Pesole, 2001)

6 Nonsense-Mediated Decay (NMD)
NMD is an mRNA surveillance mechanism that leads to selective degradation of transcripts containing a premature termination codon. An mRNA is immune to NMD if translation terminates less than 50–55 nucleotides upstream of the 3′-most exon–exon junction, or anywhere downstream of it. The 3′-most exon–exon junction marks the position of the last intron in the cDNA.
[Figure: genomic sequence transcribed to pre-mRNA and processed to mRNA (5’UTR, CDS from AUG to UAA, 3’UTR), marking the exon–exon junctions (EEJ), the 3′-most EEJ, and the 50–55 nt NMD boundary.]
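The 50–55 nt rule can be written as a short check on mRNA coordinates. This is a sketch under stated assumptions: the function name `is_nmd_candidate`, the default threshold of 55 nt, and the convention of measuring from the end of the termination codon to the junction are illustrative choices, not details given on the slide.

```python
def is_nmd_candidate(stop_end, junctions, threshold=55):
    """Return True if the transcript is predicted to be NMD-sensitive.

    stop_end:  0-based mRNA index one past the last base of the stop codon.
    junctions: mRNA indices of exon-exon junctions, each given as the
               index of the first base of the downstream exon.
    threshold: assumed cutoff in nt (the slide gives a range of 50-55).
    """
    if not junctions:
        # No intron was removed, so there is no junction to trigger NMD.
        return False
    last_junction = max(junctions)  # the 3'-most exon-exon junction
    # Positive distance: stop codon lies upstream of the junction.
    # Negative distance: stop codon lies downstream of it (immune).
    distance = last_junction - stop_end
    return distance >= threshold


# Toy coordinates invented for the example.
far_upstream = is_nmd_candidate(300, [100, 400])   # stop 100 nt upstream
close_upstream = is_nmd_candidate(300, [100, 330])  # stop 30 nt upstream
downstream = is_nmd_candidate(300, [100, 200])      # stop after last junction
```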

7 Objectives
To explore introns in the UTR regions.
To find the rules governing intron distribution within the UTR regions.
To compare intron distribution between the UTRs and the CDS.
To compare the intron distribution rules among different species.

8 Data source
Full-length cDNA sequences:
MGC (Mammalian Gene Collection): mammalian
BDGP: fruit fly
KOME: plant
Genomic sequences: GenBank, Ensembl
CDS prediction (Furuno et al. 2003): ProCrest, rsCDS, NCBI predictor, DECODER
Experiment
Human (hs)                     15504   15458
Mouse (mm)                     12828   12803
Rat (rn)                         641     634
Drosophila melanogaster (dm)    9152    9096
Arabidopsis thaliana (at)      18415   18414

9 Method
Align the cDNA sequences to the genomic sequence.
How should gaps, overlaps and even polymorphisms be handled?
Candidate tools: BLAST, MegaBLAST, sim4, gap2, Spidey, BLAT and GeneSeqer.
Jim Kent - the Blat Rap

10 Steps
1. Clean the full-length cDNA and genomic sequences.
2. Parse each cDNA into three parts: 5’UTR, CDS and 3’UTR.
3. Align the cDNA to the genomic sequence with BLAT.
4. Parse the BLAT result to get the locations of exons and introns.
5. Get the sequences of the exons and introns.
6. Check that the sum of the exon lengths equals the cDNA length, and remove suspect candidates.
7. Calculate the average length of the cDNA, the average number of introns per cDNA, etc.
8. Compare the intron distribution of the 5’UTR, CDS and 3’UTR regions.
9. Compare the intron distribution rules among different species.
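The parsing and length-check steps can be sketched against BLAT's tab-separated PSL output, whose 21 columns include `qSize`, `blockSizes` and `tStarts`. The function name `parse_psl_line`, the `min_intron` cutoff of 20 nt used to tell introns from small alignment gaps, and the example line are assumptions made for illustration. The sketch reports genomic coordinates only and does not reorient minus-strand alignments.

```python
def parse_psl_line(line, min_intron=20):
    """Extract exon and intron coordinates from one PSL alignment line.

    min_intron is a hypothetical cutoff: target gaps shorter than this
    are treated as alignment gaps and merged into the flanking exon.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 21:
        raise ValueError("expected 21 tab-separated PSL columns")
    q_size = int(fields[10])                      # cDNA length
    block_sizes = [int(x) for x in fields[18].rstrip(",").split(",")]
    t_starts = [int(x) for x in fields[20].rstrip(",").split(",")]

    # Each aligned block covers [start, start + size) on the genome.
    blocks = [(s, s + n) for s, n in zip(t_starts, block_sizes)]

    exons = [blocks[0]]
    for start, end in blocks[1:]:
        if start - exons[-1][1] < min_intron:
            exons[-1] = (exons[-1][0], end)       # small gap: same exon
        else:
            exons.append((start, end))            # large gap: new exon

    # Introns are the genomic intervals between consecutive exons.
    introns = [(a[1], b[0]) for a, b in zip(exons, exons[1:])]
    return {
        "strand": fields[8],
        "exons": exons,
        "introns": introns,
        # Step 6: aligned bases must account for the whole cDNA.
        "complete": sum(block_sizes) == q_size,
    }


# Invented three-block alignment of a 300 nt cDNA.
example_line = "\t".join([
    "300", "0", "0", "0", "0", "0", "2", "780", "+",
    "cDNA1", "300", "0", "300", "chr1", "100000", "1000", "2080",
    "3", "100,120,80,", "0,100,220,", "1000,1500,2000,",
])
result = parse_psl_line(example_line)
```

Candidates for which `complete` is false would be the "suspect" alignments dropped in step 6.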

11 Flow Chart

13 Introns Do Exist in UTRs
However, taking Arabidopsis as an example, 80% of 5’UTR sequences have no introns, and 90% of 3’UTR sequences have no introns.

14 Introns in CDS
80% of CDS sequences have introns.

15 Introns number: UTRs vs. CDS
Most CDS sequences have introns, but most UTR sequences do not.
[Figure: number of sequences versus number of introns.]
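A comparison of this kind reduces to tallying, for each region, how many sequences carry 0, 1, 2, ... introns. The sketch below assumes the per-sequence intron counts have already been computed; the function names and the toy counts are invented for the example and are not project data.

```python
from collections import Counter


def intron_count_histogram(intron_counts):
    """Map each intron number to the number of sequences that have it."""
    return Counter(intron_counts)


def fraction_with_introns(intron_counts):
    """Fraction of sequences with at least one intron."""
    if not intron_counts:
        raise ValueError("no sequences supplied")
    with_introns = sum(1 for n in intron_counts if n > 0)
    return with_introns / len(intron_counts)


# Toy intron counts for five hypothetical sequences per region.
toy_utr5 = [0, 0, 0, 0, 1]
toy_cds = [3, 0, 5, 2, 1]
utr5_hist = intron_count_histogram(toy_utr5)
utr5_fraction = fraction_with_introns(toy_utr5)
cds_fraction = fraction_with_introns(toy_cds)
```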

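17 Introns in UTR
Introns in the 5’UTR and 3’UTR are spread throughout the sequence, but not evenly or uniformly. If introns were evenly distributed, the k-th of n introns would be expected at relative location k/(n + 1), that is, at a spacing of 1/(number of introns + 1).
[Figure axes: intron number; number of introns.]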
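The even-spacing expectation can be computed directly and compared with observed intron locations. This is a minimal sketch: the function names are hypothetical, and locations are expressed as fractions of the sequence length measured from the 5’ end, which is an assumption about how the comparison was set up.

```python
def expected_relative_positions(n_introns):
    """Relative locations of n evenly spaced introns: k / (n + 1)."""
    return [k / (n_introns + 1) for k in range(1, n_introns + 1)]


def relative_positions(intron_offsets, seq_length):
    """Convert intron offsets (nt from the 5' end) to fractions of length."""
    if seq_length <= 0:
        raise ValueError("sequence length must be positive")
    return [offset / seq_length for offset in intron_offsets]


# Toy 200 nt UTR with two introns, invented for the example.
expected = expected_relative_positions(2)
observed = relative_positions([50, 150], 200)
# Signed deviation of each intron from its evenly spaced location.
deviation = [o - e for o, e in zip(observed, expected)]
```

A consistently negative deviation would indicate a shift toward the 5’ end, and a positive one a shift toward the 3’ end.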

18 Introns in UTR
The number of introns increases as the length of the sequence increases. For human 5’UTRs, on average one intron is present per 100 nt. Introns in 3’UTRs tend to concentrate toward the center of the 3’UTR.
[Figure panels: location of introns; length of sequences. Axis: number of introns.]

20 Introns in CDS
Introns in the CDS are spread throughout the sequence but are shifted toward the 5’ end. For human, if there is more than one intron, the interval between two introns is about 140 nt (in other words, the average exon in the CDS is about 140 nt).
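The interval between introns is the distance between consecutive intron positions on the spliced sequence, which equals the length of the exon between them. The sketch below shows one way to pool these intervals over many sequences; the function names and the toy positions are assumptions for illustration.

```python
def intron_intervals(positions):
    """Distances between consecutive intron positions in one sequence."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def mean_interval(positions_per_sequence):
    """Mean interval pooled over sequences; None if no sequence has 2+ introns."""
    pooled = [d for positions in positions_per_sequence
              for d in intron_intervals(positions)]
    if not pooled:
        return None
    return sum(pooled) / len(pooled)


# Toy intron positions (nt from the CDS start), invented for the example.
single = intron_intervals([300, 150, 450])
pooled_mean = mean_interval([[100, 240], [50]])
```

Sequences with a single intron contribute no interval, matching the slide's restriction to cases with more than one intron.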

21 Intron distribution: UTRs vs. CDS
Taking human as an example: the frequency of introns in the 5’UTR is higher than in the CDS, and the frequency of introns in the CDS is higher than in the 3’UTR.
[Figure: two panels, each with axis "number of introns".]

22 Intron distribution: UTRs vs. CDS
                             5’UTR             CDS                         3’UTR
Interval between 2 introns   100nt             140nt                       uncertain
Intron frequency             Higher than CDS   Higher than 3’UTR           Lowest
Distribution                 Evenly            Shifted toward 5’ of CDS    Concentrated toward the center of 3’UTR

24 Different species: UTRs vs. CDS
The number of introns increases with the length of the sequence in both UTRs and CDS. 5’UTR sequences shorter than 100 nt have no introns in human, mouse, rat, Arabidopsis and fruit fly. CDS sequences shorter than 800 nt have no introns in human, mouse, Arabidopsis and fruit fly; for rat this boundary is 500 nt. Fruit fly sequence length increases faster than that of the other species in both UTRs and CDS.
[Figure: two panels, each with axis "number of introns".]

25 Different species: UTRs vs. CDS
For all 5 species, most UTRs have no introns. For all 5 species, most CDS have introns. The intron distribution rule holds for human, mouse, rat, Arabidopsis and fruit fly.
[Figure: two panels, number of sequences versus number of introns.]

26 Summary
Introns do exist in UTRs.
The intron distributions in the 5’UTR, CDS and 3’UTR differ within the same organism.
The intron distribution rules are shared by human, mouse, rat, Arabidopsis and fruit fly.
5’UTR sequences shorter than 100 nt have no introns in human, mouse, rat, Arabidopsis and fruit fly.
CDS sequences shorter than 800 nt have no introns in human, mouse, Arabidopsis and fruit fly; for rat the boundary is 500 nt.
Fruit fly fl-cDNA sequence length increases faster than that of the other species in both UTRs and CDS.

                                      5’UTR             CDS                         3’UTR
Percentage (sequences with introns)   20%               80%                         10%
Interval between 2 introns            100nt             140nt                       uncertain
Intron frequency                      Higher than CDS   Higher than 3’UTR           Lowest
Distribution                          Evenly            Shifted toward 5’ of CDS    Concentrated toward the center of 3’UTR

27 Future work
NMD exists widely among different species.
Why most UTRs have no introns.
Why intron frequency decreases from 5’ to 3’ along the full-length cDNA.

28 Reference
Lynch M, Kewalramani A. (2003) Messenger RNA surveillance and the evolutionary proliferation of introns. Mol Biol Evol. 20(4).
Mignone F, Gissi C, Liuni S, Pesole G. (2002) Untranslated regions of mRNAs. Genome Biology. 3(3).
Pesole G, Grillo G, Larizza A, Liuni S. (2000) The untranslated regions of eukaryotic mRNAs: structure, function, evolution and bioinformatics tools for their analysis. Briefings in Bioinformatics. 1(3).
Kent WJ. (2002) BLAT: the BLAST-like alignment tool. Genome Res. 12(4).
Furuno M, Kasukawa T, Saito R, Adachi J, Suzuki H, Baldarelli R, Hayashizaki Y, Okazaki Y. (2003) CDS annotation in full-length cDNA sequence. Genome Res. 13(6B).
Strausberg RL et al. (2002) Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences. Proc Natl Acad Sci U S A. 99(26).

29 Acknowledgement
Dr. Michael Lynch
Dr. Sun Kim
Dr. Douglas G. Scofield


