Toward a Better Understanding of Cereal Genome Evolution Through Ensembl Compara

Apurva Narechania 1, Joshua Stein 1, William Spooner 1, Sharon Wei 1, Ben Faga 1, Shiran Pasternak 1, and Doreen Ware 1, 2

1 Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY 11724, USA
2 USDA-ARS NAA Plant, Soil & Nutrition Laboratory Research Unit, USA

Summary

The maize genome has been largely shaped by its history of tetraploidization, subsequent rearrangement, and duplicate gene loss. Synteny has also been disrupted by apparent gene movement in both maize and sorghum relative to rice. Many questions remain concerning the evolution of cereals, including the extent of lineage-specific rearrangements, the selective forces that dictated the retention of duplicate genes, and the extent of conserved non-coding regions. The availability of three nearly complete cereal genomes (maize, rice, and sorghum) provides an unprecedented opportunity to use comparative genomics to answer these and other questions in the evolution of plant genomes. As part of the Maize Genome Sequencing Project, we describe the use of the Ensembl Compara whole genome alignment pipeline to construct sequence-based syntenies. The pipeline automates pairwise whole genome analysis by parallelizing the construction of blastz alignments, their subsequent consolidation into chains and nets, and their coalescence into syntenic regions. The algorithms employed identify highly similar regions between two large sequences while allowing for segments without similarity, thus highlighting gene movement or genomic rearrangement within syntenic blocks. The tetraploid nature of maize and its history of whole genome duplications suggest that much of its genome should have at least two blocks that align to the same region of rice. Preliminary analysis shows that a pilot 22-megabase maize assembly on maize chromosome 4 is syntenic with a comparably sized region on rice chromosome 2.
In agreement with marker-based synteny studies, we show that this rice chromosome has a duplicate homeologue on maize chromosome 5. We address the challenges of applying this pipeline to the maize genome in its partially assembled state.

[Panel titles: Blastz-CHAIN-NET and the Ensembl Hive; Blastz-NET Alignment Stats (Maize Accelerated Region); Syntenic Blocks Between Maize, Rice, and Sorghum; Distribution of blastz-NET sizes for Rice and Sorghum Alignments]

[Tables: Region Statistics; Region Alignment Statistics. Headers: Total length, Alignable Sequence, Rice Aligned Coverage, Sorghum Aligned Coverage; Total Alignments, Chain or Net Alignments, Chains, Nets; Rice, Sorghum. Values not preserved.]

[Figures: Blastz-NET coverage by NET Level; Blastz-NET coverage by Rice Chromosome; Blastz-NET coverage by Sorghum Chromosome]

Alignable Sequence refers to the portion of the maize accelerated region that is of high quality and has not been RepeatMasked. Sorghum blastz-NETs align 66% of the alignable maize sequence, while rice blastz-NETs align 35% of the available accelerated region. The majority of blastz-NETs cluster on rice chromosome 2 and sorghum chromosome 4, in agreement with known marker-based synteny. Proc Natl Acad Sci U S A Sep 13;102(37):

The maize accelerated region contains syntenic blocks to rice chr2 and sorghum chr4

Maize: max gap between NETs 100,000 residues; min NET size 5,000 residues.
Rice and sorghum: max gap between NETs 50,000 residues; min NET size 2,000 residues.

Syntenic blocks are defined in two steps. First, NETs are grouped if the distance between them is smaller than twice the max gap parameter and no NETs break the synteny. Second, these groups are arranged into syntenic blocks of up to 30 times the max gap parameter, with two synteny-breaking groups allowed. The rice assembly (version 5) is courtesy of TIGR, and early access to the sorghum assemblies is courtesy of JGI.
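The first grouping step described above can be illustrated with a minimal Python sketch. This is not the Ensembl Compara implementation: the function name `group_nets`, the `(start, end)` tuple representation, and the choice to measure NET size by span are assumptions made for illustration. The sketch works along a single genome only and omits both the check for synteny-breaking NETs and the second step that assembles groups into blocks.

```python
from typing import List, Tuple

# Hypothetical representation: a NET as (start, end) on one genome,
# 1-based inclusive coordinates.
Net = Tuple[int, int]


def group_nets(nets: List[Net], max_gap: int, min_size: int) -> List[List[Net]]:
    """Group NETs separated by less than twice max_gap (step one only).

    NETs whose span is below min_size are discarded first. Treating
    "min NET size" as the span is an assumption of this sketch.
    """
    kept = sorted(n for n in nets if n[1] - n[0] + 1 >= min_size)
    groups: List[List[Net]] = []
    group_end = 0  # rightmost coordinate covered by the current group
    for net in kept:
        if groups and net[0] - group_end < 2 * max_gap:
            groups[-1].append(net)
            group_end = max(group_end, net[1])
        else:
            groups.append([net])
            group_end = net[1]
    return groups


# Toy coordinates (invented), using the maize parameters quoted above.
nets = [(1, 6_000), (50_000, 58_000), (400_000, 410_000), (412_000, 413_000)]
groups = group_nets(nets, max_gap=100_000, min_size=5_000)
```

With these toy coordinates, the last NET is dropped for falling below the minimum size, the first two NETs fall into one group, and the third starts a new group because it lies more than 200,000 residues from the previous one.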
[Tables: Rice Stats and Sorghum Stats, each with Aligned Stats (Class, Avg Len, Median Len, Max Len, Min Len, Count) and Span Stats (Class, Avg span, Median span, Max span, Min span, Count) by NET level. Values not preserved.]

[Figure: Rice and Sorghum Level 1/2 Distributions]

Blastz-NET lengths are defined as the number of aligning bases in a NET, excluding gaps, while blastz-NET spans are the distances from the first to the last base in the NET, including gaps. Level 1 NETs consistently show the longest length and span across species. Sorghum NETs are considerably longer than those found in rice. Despite large differences in lengths and spans across levels and species, the overall distributions are similar, highlighting the influence of biologically significant outliers.

Maize BAC-contigs versus Rice at MaizeSequence.org

Maize Accelerated Region Duplication

Rice Chr2 from positions 29 Mb to 36 Mb aligns to maize chromosomes 4 and 5 in equal measure, indicating a duplication event. Alignments were made to maize BAC contigs and mapped to chromosomes 4 and 5 using the FPC map. The majority of Chr4 hits were on FPC ctg182, corresponding to the accelerated region. The majority of NETs on Chr5 were on contigs 250, 251, 253, and 254, in agreement with marker-based studies. PLoS Genet Jul 20;3(7):e123

[Pipeline diagram, analysis steps as labeled: SubmitGenome, ChunkAndGroupDNA, CreatePairAlignerJobs, Blastz, UpdateMaxAlignmentLength, FilterDuplicates, CreateAlignmentChainsJobs, AlignmentChains, UpdateMaxAlignmentLength, CreateAlignmentNetsJobs, AlignmentNets; Blastz, AlignmentChains, UpdateMaxAlignmentLength]

The Blastz-CHAIN-NET pipeline creates long-range, gapped, pairwise blastz chains and nets from raw blastz alignments, thereby allowing for genomic rearrangements in syntenic regions.
Proc Natl Acad Sci U S A Sep 30;100(20):

The Ensembl Hive pipeline parallelizes the generation of blastz alignments and their consolidation into chains and nets. Its hive system creates specific jobs and spawns anonymous, general workers to complete them. Nucleic Acids Res Jan;36(Database issue):D

While the maize genome remains partially assembled, the longest contiguous regions at maizesequence.org are the BAC contigs. Whole genome alignments to rice are available for all BAC contigs and correspond well to FgenesH predictions with similarity to known proteins and maize ESTs.

Gene Predictions Associated with Blastz-NETs

39% of maize genes within syntenic blocks are non-syntenic, suggesting substantial gene movement within maize. Almost 50% of rice genes are non-syntenic, possibly due to loss of duplicate genes within maize homeologous regions.

Methods: Syntenic blocks were defined from BLASTZ-Chain-Net data using the parameters MaxDist and MinDist, as described in the synteny views above. Genes (excluding TEs) were counted as syntenic if they overlapped a chain HSP that contributed to the synteny.
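The overlap rule in the Methods can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used for the poster: the names `overlaps` and `syntenic_fraction`, the `(start, end)` interval representation, and the toy coordinates are all assumptions. It also assumes that every interval lies on the same chromosome, that TEs have already been removed from the gene list, and that the HSP list contains only HSPs contributing to a syntenic block.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end), 1-based inclusive,
# all on the same chromosome.
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two closed intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def syntenic_fraction(genes: List[Interval], hsps: List[Interval]) -> float:
    """Fraction of genes overlapping at least one synteny-contributing HSP."""
    if not genes:
        return 0.0
    syntenic = sum(1 for g in genes if any(overlaps(g, h) for h in hsps))
    return syntenic / len(genes)


# Toy data (invented): three of the four genes overlap an HSP.
genes = [(100, 500), (900, 1_200), (2_000, 2_400), (5_000, 5_600)]
hsps = [(450, 950), (2_300, 2_350)]
fraction = syntenic_fraction(genes, hsps)
```

The non-syntenic percentage reported above corresponds to one minus this fraction. The pairwise comparison is quadratic and is kept for clarity; a genome-scale analysis would more likely sort the intervals or use an interval tree.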