Proteomics and annotation
Definition of proteomics
The study of all the proteins in an organism
Derived from genomics, the study of all the DNA in an organism
On some levels it is a catalog of all the functional proteins, but in many contexts it is also the study of the interactions among those proteins
Central Dogma
DNA --> RNA --> AA --> function
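The DNA --> RNA --> AA flow can be sketched in a few lines of Python. This is only an illustration: it assumes the input is the coding strand read 5' to 3', uses the standard genetic code, and stops at the first stop codon. The function names `transcribe` and `translate` are made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate mRNA codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3].replace("U", "T")  # table is keyed on DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("ATGGCCAAATAA")
print(rna, translate(rna))
```

The last arrow, AA --> function, is the part that has no such simple lookup table, which is why the rest of the deck deals with measuring and annotating proteins.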
Proteomics techniques
Protein identification/quantification
  – High throughput remains elusive
Now typically
  – Separate
  – Isolate
  – Identify
Enumerating protein interactions
  – Protein-protein
  – Protein-DNA/RNA
How to separate proteins
Proteins are made up of 20 amino acids (AA), not 4 nucleotides (NT)
  – DNA: size (migration through an electric field)
  – Protein
    Size
    Charge
    Hydrophobicity
    Solubility
    Fraction of the cell
    Much more structure …
2D gels
[Figure: 2D gel. One axis runs from big to little proteins, the other from pH 3 to pH 10.]
Limitations of 2D
Very large and very small proteins don't work well
Membrane-bound proteins
  – Solubility of the protein
  – Disulfide bonds
Rare proteins
  – Can stain with silver stain
    » Non-linear
    » 100X
Mass spectrometry
Simple principle
  – Explode the charged peptides off the sample
    Electrospray: charged cone
    Laser -> vapor -> charged grid
  – See how big they are
    Detect the number of ions at each mass
  – Ion trap: kind of like a TV
  – TOF (time of flight): how long did it take to reach the detector
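The TOF idea can be made concrete with a short calculation. The sketch below assumes an idealized linear instrument: every ion is accelerated through the same voltage, gains kinetic energy z·e·U, and then drifts down a field-free tube, so heavier ions arrive later. The function name, the 20 kV voltage, and the 1 m tube length are illustrative assumptions, not values from the slides.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms per dalton


def flight_time(mass_da, charge, accel_voltage, tube_length):
    """Drift time in seconds for an ion in an idealized linear TOF tube.

    mass_da       ion mass in daltons
    charge        number of elementary charges (z)
    accel_voltage accelerating potential in volts (assumed value)
    tube_length   field-free drift length in meters (assumed value)
    """
    mass_kg = mass_da * DALTON
    kinetic_energy = charge * ELEMENTARY_CHARGE * accel_voltage  # joules
    velocity = math.sqrt(2 * kinetic_energy / mass_kg)           # m/s
    return tube_length / velocity


for mass in (1000, 2000, 4000):
    t = flight_time(mass, charge=1, accel_voltage=20000, tube_length=1.0)
    print(f"{mass} Da: {t * 1e6:.1f} microseconds")
```

Because the time scales with the square root of mass over charge, quadrupling the mass only doubles the flight time, which is one reason resolution drops for large ions.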
Mass of AA
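A table of amino acid masses is what turns a measured peak into a candidate peptide. As a minimal sketch, the code below sums standard monoisotopic residue masses, adds one water for the free termini, and converts to m/z for a given number of added protons. The function names are invented for this example, and modifications are ignored.

```python
# Monoisotopic residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O added for the N- and C-termini
PROTON = 1.007276  # mass of the charge carrier (+H)


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER


def mz(sequence, charge=1):
    """m/z of the peptide carrying `charge` extra protons."""
    return (peptide_mass(sequence) + charge * PROTON) / charge


print(round(peptide_mass("PEPTIDE"), 4), round(mz("PEPTIDE", 2), 4))
```

Note that leucine and isoleucine have identical masses, so mass alone cannot tell them apart.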
Mass spectrum
[Figure: mass spectrum. Labels: actual mass, major ion (+H), C13.]
Post-translational modification
Cleavage
  – Removing portions of the protein by enzymatic action
  – Can change location, function, activity
Additions
  – Adding a chemical group
    Regulated activity
    Can change protein function/activity
Modifications
Modification         Effect
Phosphorylation      Activate/inactivate
Acetylation          Stability (histones)
Acylation            Membrane assoc.
Glycosylation        Signaling
GPI anchor           Membrane assoc.
Hydroxyproline       Stability
Sulfation            P-P interaction
Disulfide            Stability
Deamination          P-P interaction
Pyroglutamic acid    Stability
Ubiquitination       Destruction signal
Limitations of mass spec
Most frequently sequenced protein: keratin (a common contaminant)
  – Ionization is not strictly quantitative
Can cleave the protein into peptides
  – Complicated by mixtures
  – Issues with searching the database
QCAT
A way to quantitatively analyze multiple proteins (Nature Methods 2, (2005)).
Depends on concatemers assembled from segments of the proteins of interest.
Each protein is represented by one segment, a peptide that a tryptic digest of that protein would produce; the concatemer of these segments is the QCAT.
QCAT, cont.
Grow the concatemer in heavy and light isotopes to get a standard curve.
Spike your sample with heavy QCAT.
This produces an internal standard for each protein of interest.
This allows quantitation of many (~100) proteins in one experiment.
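The arithmetic behind the internal standard is a simple ratio. The sketch below assumes the heavy QCAT is fully digested, so every standard peptide is present at the same known amount, and that light and heavy forms of a peptide ionize equally. The peptide sequences, intensities, and the 50 fmol spike are invented for illustration.

```python
def quantify(light_intensity, heavy_intensity, heavy_spike_fmol):
    """Estimate the sample (light) peptide amount from the heavy standard.

    amount_light = (light signal / heavy signal) * known heavy amount
    """
    if heavy_intensity <= 0:
        raise ValueError("heavy standard peptide was not detected")
    return light_intensity / heavy_intensity * heavy_spike_fmol


# Hypothetical peak intensities: peptide -> (light, heavy).
peaks = {
    "LVNELTEFAK": (2.0e6, 1.0e6),
    "GFYAEGSR": (4.0e5, 8.0e5),
}
HEAVY_SPIKE_FMOL = 50.0  # assumed amount of QCAT added to the sample

for peptide, (light, heavy) in peaks.items():
    amount = quantify(light, heavy, HEAVY_SPIKE_FMOL)
    print(f"{peptide}: {amount:.1f} fmol")
```

Because each protein is compared only against its own heavy peptide, the fact that ionization is not strictly quantitative across different peptides matters much less.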
Protein-protein interactions
Types of interactions
  – Stable
    Multimers, complexes
  – Association forms complete unit
  – Quaternary structure
  – Unstable
    Pathways
    Signaling events
    Transient interactions
Yeast two-hybrid
How accurate is the Y2H data?
False negatives: proteins that have very transient or sporadic interactions, or that may be located in the membrane
  – Non-physiological test conditions
False positives
  – Self-activators
  – Weak non-specific interactions
  – Non-physiological test conditions
How to assess
Remove proteins with an above-average number of interactions
Intersection of a number of experiments (Y2H, Co-IP, and co-expression)
Network properties
Other documented signals of interaction
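The first two checks are easy to express as set operations. The sketch below takes "above average" literally as a cutoff on the mean number of partners, which is a crude assumption, and then keeps only the pairs that a second experiment also reports. All protein names and pair lists are hypothetical.

```python
from collections import Counter
from statistics import mean


def normalize(pairs):
    """Store each interaction as an unordered pair so (A, B) equals (B, A)."""
    return {frozenset(pair) for pair in pairs}


def drop_promiscuous(pairs):
    """Remove every pair that involves a protein with an above-average degree."""
    degree = Counter(protein for pair in pairs for protein in pair)
    cutoff = mean(degree.values())
    return {pair for pair in pairs if all(degree[p] <= cutoff for p in pair)}


# Hypothetical data: X behaves like a sticky self-activator in the Y2H screen.
y2h = normalize([("A", "B"), ("C", "D"), ("E", "F"),
                 ("X", "G"), ("X", "H"), ("X", "I")])
co_ip = normalize([("B", "A"), ("C", "D"), ("X", "G")])

filtered = drop_promiscuous(y2h)
supported = filtered & co_ip  # seen in both experiments
print(sorted(sorted(pair) for pair in supported))
```

Requiring agreement between experiments lowers false positives at the cost of more false negatives, since each method misses a different set of true interactions.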
Network comparison Genome Biology 2006, Volume 7, Issue 11, Article 120
How to find protein/DNA interactions
Take a typical TRANSFAC binding site, 10 bp long with 2 bases somewhat ambiguous. How often does it appear by chance in the genome?
How can you determine if genes are co-expressed?
  – DNA footprinting
  – Deletion experiments
High throughput?
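The "by chance" question has a back-of-the-envelope answer. The sketch below assumes all four bases are equally likely and independent, that each ambiguous position accepts 2 of the 4 bases, that the genome is about 3 x 10^9 bp, and that the site can occur on either strand. Each of those numbers is an assumption for illustration, not a property of any particular TRANSFAC entry.

```python
def expected_random_hits(site_length, ambiguous_positions, genome_size,
                         bases_allowed_at_ambiguous=2, both_strands=True):
    """Expected number of chance matches to a binding site in random sequence."""
    fixed_positions = site_length - ambiguous_positions
    p_match = (0.25 ** fixed_positions
               * (bases_allowed_at_ambiguous / 4) ** ambiguous_positions)
    windows = genome_size - site_length + 1  # possible start positions
    strands = 2 if both_strands else 1
    return p_match * windows * strands


hits = expected_random_hits(site_length=10, ambiguous_positions=2,
                            genome_size=3_000_000_000)
print(f"expected chance occurrences: {hits:,.0f}")
```

Under these assumptions the site is expected tens of thousands of times by chance alone, far more than the number of genes a single factor regulates, so sequence matching by itself cannot identify real binding sites.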
ChIP on chip
Design
Need a very specific antibody for each transcription factor that you wish to study
cDNA will not work with large introns
  – Whole genome chips
  – Human 21, 22
  – 3 x 10^6 spots
SAGE
Look for enriched vs non-enriched
  – Looking for a population rather than one sequence
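One simple way to read "enriched vs non-enriched" is a per-probe log2 ratio of immunoprecipitated signal over control signal, followed by a search for runs of neighboring probes that are all enriched. The sketch below is only one possible approach: the threshold of 1.0 (2-fold), the minimum run of 3 probes, the pseudocount, and the intensities are all invented for illustration.

```python
import math


def log2_ratios(ip, control, pseudocount=1.0):
    """log2(IP / control) per probe; the pseudocount avoids division by zero."""
    return [math.log2((i + pseudocount) / (c + pseudocount))
            for i, c in zip(ip, control)]


def enriched_runs(ratios, threshold=1.0, min_probes=3):
    """Return (first, last) probe indices of runs at or above the threshold."""
    runs = []
    start = None
    for index, value in enumerate(ratios):
        if value >= threshold:
            if start is None:
                start = index
        else:
            if start is not None and index - start >= min_probes:
                runs.append((start, index - 1))
            start = None
    # Close a run that extends to the last probe.
    if start is not None and len(ratios) - start >= min_probes:
        runs.append((start, len(ratios) - 1))
    return runs


# Hypothetical intensities for 8 adjacent probes along the chromosome.
ip_signal = [100, 110, 900, 950, 1000, 120, 800, 100]
control_signal = [100] * 8
print(enriched_runs(log2_ratios(ip_signal, control_signal)))
```

Requiring several adjacent probes reflects the point about a population: sheared chromatin fragments overlap, so a real binding site should light up a neighborhood of probes rather than a single spot.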
Results
Annotation
Systematically adding knowledge
  – Human vs computer
    Throughput
    Accuracy
    Repeatability
Typical course
  – Found in one organism
    Mapped to all other homologous segments
  – Function as a consequence of sequence
Prosite
PROSITE is a method of determining the function of uncharacterized proteins translated from genomic or cDNA sequences. It consists of a database of biologically significant sites and patterns, formulated in such a way that, with appropriate computational tools, it can rapidly and reliably identify to which known family of proteins (if any) a new sequence belongs.
Take a smaller segment of the protein and build up annotation for the whole protein.
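PROSITE patterns map almost directly onto regular expressions, which shows how a short segment can flag a site in a whole protein. The converter below is a minimal sketch covering the common pattern elements (x, [..], {..}, repeat counts, and the < and > anchors); it is not the official PROSITE scanning tool. The example uses the N-glycosylation pattern N-{P}-[ST]-{P}, and the protein sequence is made up.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # anchored to the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # anchored to the C-terminus
            suffix, element = "$", element[:-1]
        # Split an element such as x(2,4) into its core and repeat counts.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":               # any amino acid
            body = "."
        elif core.startswith("{"):    # any amino acid except those listed
            body = "[^" + core[1:-1] + "]"
        else:                         # a single residue or a [..] set
            body = core
        if low is not None:
            body += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + body + suffix)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
sequence = "MKNGSAANPTQ"  # hypothetical protein fragment
hits = [(m.start(), m.group()) for m in re.finditer(regex, sequence)]
print(regex, hits)
```

Note that `re.finditer` reports non-overlapping matches only, so a production scanner would need extra handling for overlapping sites.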
Structured languages
The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases. The project began as a collaboration between three model organism databases: FlyBase (Drosophila), the Saccharomyces Genome Database (SGD), and the Mouse Genome Database (MGD). Since then, the GO Consortium has grown to include many databases, including several of the world's major repositories for plant, animal and microbial genomes. See the GO Consortium page for a full list of member organizations.
Other Types Systems biology Protein structure Enzymatic pathways
KEGG API example cpan/non-root/
BioPerl annotation examples
Get info from GenBank
Graphical annotation