High throughput urine biomarker discovery and integrative analysis for translational medicine Bruce Ling, Ph.D.
Biomarker: a molecular indicator of a specific biological property; a biochemical feature or facet that can be used to measure the progress of disease or the effects of treatment (NIH, 2002)
Biomarker examples Small molecules: glucose (diabetes), serum cholesterol (cardiovascular disease) Proteins: PSA (prostate cancer), HER2 by IHC (breast cancer, Herceptin therapy), hCG (pregnancy test) RNA/DNA: HER2 by FISH (breast cancer), Oncotype DX (Genomic Health, breast cancer)
Pediatric Diseases Kidney transplant Acute Rejection Kawasaki Disease Systemic Juvenile Idiopathic Arthritis Necrotizing Enterocolitis Inflammatory Bowel Disease Glioblastoma multiforme Preterm Labor
Where to look for biomarkers –Disease tissue –Proximal/distal fluids Plasma/serum, urine, amniotic, synovial fluid, CSF, saliva, tears, etc.
Why Urine? Patient consenting Non-invasive Easy to collect for time course analysis Abundant and stable
Urine is a rich resource for biomarker discovery Filtration of plasma 900 liters daily Urine proteome > 1500 proteins, ~30 mg/day 30% from circulation 70% from urogenital tract Urine peptidome > 100,000 naturally occurring peptides, ~20 mg/day
Urine Peptidome: a fertile ground for biomarker discovery 1) Equal masses of protein and peptide in urine translate into at least a ten-fold greater molar abundance of peptides than proteins 2) Urine peptide analysis is not hampered by the high-abundance protein problem 3) A one-hour, one-dimensional HPLC separation is sufficient for the analysis of more than 100,000 urine peptides, allowing high-throughput biomarker discovery
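The first point is a mass-to-mole conversion, which a short calculation makes concrete. The sketch below uses the daily masses quoted on the previous slide (~30 mg protein, ~20 mg peptide); the average molecular weights (30 kDa per protein, 2 kDa per peptide) are illustrative assumptions, not values from the slides.

```python
def molar_ratio(peptide_mg, protein_mg, peptide_da, protein_da):
    """Return moles of peptide per mole of protein.

    Moles are proportional to mass / molecular weight, so the unit
    conversion factors cancel in the ratio.
    """
    peptide_mol = peptide_mg / peptide_da
    protein_mol = protein_mg / protein_da
    return peptide_mol / protein_mol


# Daily masses from the slides; molecular weights are ASSUMED averages.
ratio = molar_ratio(peptide_mg=20.0, protein_mg=30.0,
                    peptide_da=2_000.0, protein_da=30_000.0)
print(f"peptides are about {ratio:.0f}x more abundant by mole")
```

Under these assumptions the ratio comes out at about ten; a heavier average protein or a lighter average peptide would push it higher.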
Challenges of Urine Analysis Dilution factor causing concentration variations –Solution: content normalization (creatinine; housekeeping urine-abundant peptides; equal peptide mass) Peptide content can be complicated by –Diet, exercise, circadian rhythm, circulatory levels of hormones –Solution: careful experimental design to avoid these confounding issues, e.g., cohorts of patients of similar demographics; multi-center sample collection and validation
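Creatinine normalization, the first option listed, can be sketched as a per-sample division. The function name, the dictionary layout and the example numbers are hypothetical; the slides do not say how the normalization is implemented.

```python
def normalize_to_creatinine(peptide_signals, creatinine_mg_dl):
    """Divide every peptide signal in one urine sample by that sample's
    creatinine concentration, so dilute and concentrated samples compare."""
    if creatinine_mg_dl <= 0:
        raise ValueError("creatinine concentration must be positive")
    return {peptide: signal / creatinine_mg_dl
            for peptide, signal in peptide_signals.items()}


# Illustrative only: the same urine, measured concentrated and 2x diluted.
concentrated = normalize_to_creatinine({"pepA": 800.0, "pepB": 200.0}, 100.0)
diluted = normalize_to_creatinine({"pepA": 400.0, "pepB": 100.0}, 50.0)
print(concentrated == diluted)  # dilution effect removed
```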
Urine Peptidome Profiling by Mass Spectrometry
Biomarker HTS Flows Sample peptides: -Class 1:1,2,3… -Class 2:1,2,3… -Class 3:1,2,3… RP-HPLC Collect 120 fractions on MALDI plates MALDI-TOF MS on each fraction MASS-Conductor ® Machine learning feature discovery and classification Candidate Biomarkers etc.
Biomarker Confirmation/Validation Identify Differentiating Markers New sample Sets Validation New Center sample sets Higher throughput Quantitative methods Quantitative MS Immunoassay Testing New Longitudinal sample sets Exploration Protein ID MS/MS
Data Challenges in Urine Peptide Biomarker Discovery Data tracking and storage –Patient demographics –Peptide profiles in various fractions/samples Dimension reduction and data reduction –Multi-dimensional data sets –Huge data sets and lots of noise A project of 40 samples produced GB raw data in MYSQL database HPLC fraction Peptide mass Patient ID Patient demographics Peptide signal
Decode the Urine Peptidome [Data matrix: rows are peptide 1 to peptide 100,000; columns are Patient 1, Patient 2, Patient 3, Patient 4, …; each cell holds the peptide signal] ???
Decode the Urine Peptidome Peak finding in each fraction for each sample Align the peaks across the samples Create common peak index
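The three steps on this slide can be sketched in a few functions. This is a minimal illustration, not the MASS-Conductor implementation: the local-maximum peak picker, the greedy m/z binning, and the names `find_peaks`, `build_peak_index` and `signal_matrix` are all assumptions made for the example.

```python
def find_peaks(spectrum, min_intensity):
    """Peak finding: keep local maxima above an intensity threshold.
    `spectrum` is a list of (mz, intensity) pairs sorted by m/z."""
    peaks = []
    for i in range(1, len(spectrum) - 1):
        mz, inten = spectrum[i]
        if (inten >= min_intensity
                and inten > spectrum[i - 1][1]
                and inten >= spectrum[i + 1][1]):
            peaks.append((mz, inten))
    return peaks


def build_peak_index(peaks_by_sample, tol):
    """Peak alignment: pool peak m/z values from all samples, sort them,
    and start a new bin whenever the gap to the previous peak exceeds tol.
    Returns the bin centers, which serve as the common peak index."""
    all_mz = sorted(mz for peaks in peaks_by_sample.values() for mz, _ in peaks)
    bins = []
    for mz in all_mz:
        if bins and mz - bins[-1][-1] <= tol:
            bins[-1].append(mz)
        else:
            bins.append([mz])
    return [sum(b) / len(b) for b in bins]


def signal_matrix(peaks_by_sample, centers, tol):
    """Peak indexing: one row per sample, one column per indexed peak."""
    matrix = {}
    for sample, peaks in peaks_by_sample.items():
        row = [0.0] * len(centers)
        for mz, inten in peaks:
            j = min(range(len(centers)), key=lambda k: abs(centers[k] - mz))
            if abs(centers[j] - mz) <= tol:
                row[j] = max(row[j], inten)
        matrix[sample] = row
    return matrix


# Illustrative spectra, not real data.
spectra = {
    "patient_1": [(1000.0, 1.0), (1000.5, 50.0), (1001.0, 2.0),
                  (1500.0, 1.0), (1500.5, 30.0), (1501.0, 1.0)],
    "patient_2": [(1000.1, 2.0), (1000.6, 40.0), (1001.1, 1.0),
                  (2000.0, 1.0), (2000.5, 20.0), (2001.0, 1.0)],
}
peaks_by_sample = {s: find_peaks(sp, min_intensity=10.0) for s, sp in spectra.items()}
centers = build_peak_index(peaks_by_sample, tol=0.2)
matrix = signal_matrix(peaks_by_sample, centers, tol=0.2)
print(centers)
print(matrix)
```

The resulting matrix is the peptide-by-patient table of the previous slide, with zeros where a sample has no peak at an indexed m/z.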
Data mining issues in Biomarker Discovery Peak number >> sample number False discovery in multiple hypothesis testing Multi-class classification and validation Discovery of biomarker signature
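With far more peaks than samples, testing every peak at p < 0.05 yields many false positives. One standard correction is the Benjamini-Hochberg step-up procedure, sketched below. It is offered only as an illustration of false discovery control; the slides do not say this procedure is the one used, and the p-values are invented.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of hypotheses rejected at the given FDR level.

    Sort p-values ascending, find the largest rank r with
    p_(r) <= fdr * r / m, and reject the r smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= fdr * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Illustrative p-values for ten peaks.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
naive = [i for i, p in enumerate(p_values) if p < 0.05]
selected = benjamini_hochberg(p_values, fdr=0.05)
print(len(naive), "peaks pass p < 0.05;", len(selected), "survive FDR control")
```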
MASS-Conductor® Platform Supports Urine Peptide Biomarker Discovery Robust loading and tracking of high volume proteomic data Robust reduction of raw data sets, enabling efficient and accurate peak finding, alignment and indexing Robust and automatic high throughput computing for expensive algorithms Integration of FDR analysis and multi-class classification algorithms to obtain statistically differentiating feature panels Automatic generation of data reports with graphics
MASS-Conductor® Platform High Throughput Computing
Urine Biomarker Discovery: Case Study
Kidney Transplant Rejection Most effective treatment for end stage renal disease 16,000 per year in US Grafts monitored by biopsy Unmet needs: –Less invasive and more frequent monitoring –Acute rejection vs. stable graft –Acute rejection vs. BK virus
Allograft Acute Rejection Urine Biomarker Discovery [Workflow figure, four numbered stages: LCMS data reduction (peak finding, peak alignment, peak indexing); supervised data mining (feature selection, training, testing); unsupervised data mining (2D clustering); quantitative LCMS validation]
Biomarker Panel: Supervised Analysis
Biomarker Panel: Unsupervised Analysis
Urine THP Peptide Biomarkers Fall into a Tight Cluster in C-Terminus [Domain diagram labels: NH2, ZP-domain, EGF-like Domain I, EGF-like Domain II, EGF-like Domain III, COOH] 1. R.VLNLGPITR.K 2. G.SVIDQSRVLNLGPI.T 3. I.DQSRVLNLGPITR.K 4. R.SGSVIDQSRVLNLGPI.T 5. S.VIDQSRVLNLGPITR.K 6. R.SGSVIDQSRVLNLGPIT.R 7. G.SVIDQSRVLNLGPITR.K 8. R.SGSVIDQSRVLNLGPITR.K
MRM: Multiplexed Quantitative Biomarker Validation
SAMPLE: URINE PEPTIDES THP VIDQSRVLNLGPITR THP SGSVIDQSRVLNLGPITR THP VIDQSRVLNLGPITR THP SGSVIDQSRVLNLGPITR AR versus STA AR versus BK Sensitivity 1- Specificity AUC: 0.83 AUC: 0.74 AUC: 0.92 AUC: 0.83 ROC Analysis of THP Peptide Biomarkers Quantified by MRM
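The AUC values on this slide summarize how well a single peptide's MRM signal separates two groups. A minimal sketch of the calculation uses the rank (Mann-Whitney) formulation of AUC. The numbers are invented, and the code assumes a higher value points to the case class; for a marker that decreases in the case class, swap the two arguments.

```python
def roc_auc(case_values, control_values):
    """AUC as the probability that a randomly chosen case scores higher
    than a randomly chosen control (ties count one half)."""
    wins = 0.0
    for case in case_values:
        for control in control_values:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))


# Illustrative signals only, not study data.
auc = roc_auc(case_values=[3.0, 4.0, 5.0], control_values=[1.0, 2.0, 3.5])
print(round(auc, 3))
```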
1. COL1A APGDRGEPGPPGP 2. COL1A APGDRGEPGPPGP 3. COL1A APGDRGEPGPPGPA 4. COL1A DAGPVGPPGPPGPPG 5. COL1A GPPGPPGPPGPPGPPS 6. COL1A NGDDGEAGKPGRPGERGPPGP 7. COL1A NGDDGEAGKPGRPGERGPPGP 8. COL1A NGDDGEAGKPGRPGERGPPGPQ 9. COL1A GKNGDDGEAGKPGRPGERGPPGPQ 10. COL1A GKNGDDGEAGKPGRPGERGPPGPQ 11. COL1A GPPGKNGDDGEAGKPGRPGERGPPGPQ 12. COL1A PPGEAGKPGEQGVPGDLG 13. COL1A PPGEAGKPGEQGVPGDLGAPGP 14. COL1A ADGQPGAKGEPGDAGAKGDAGPPGP 15. COL1A ADGQPGAKGEPGDAGAKGDAGPPGP 16. COL1A ADGQPGAKGEPGDAGAKGDAGPPGPA 17. COL1A ADGQPGAKGEPGDAGAKGDAGPPGPA 18. COL1A GPPGADGQPGAKGEPGDAGAKGDAGPPGPA 19. COL1A EGSPGRDGSPGAKGDRGETGPA 20. COL1A AEGSPGRDGSPGAKGDRGETGPA 21. COL1A ESGREGAPGAEGSPGRDGSPGAKGDRGETGPA 22. COL1A SPGPDGKTGPPGPA 23. COL1A DGKTGPPGPAGQDGRPGPPGPPG 24. COL1A GRPGEVGPPGPPGPAGEKGSPG 25. COL1A DGPPGRDGQPGHKGERGYPG 26. COL1A NDGPPGRDGQPGHKGERGYPG 27. COL2A SNGNPGPPGPPGPSGKDGPK 28. COL3A NDGAPGKNGERGGPGGPGP 29. COL3A DGESGRPGRPGERGLPGPPG 30. COL3A DAGAPGAPGGKGDAGAPGERGPPG 31. COL3A GAPGQNGEPGGKGERGAPGEKGEGGPPG 32. COL3A KNGETGPQGPPGPTGPGGDKGDTGPPGPQG 33. COL4A PGQQGNPGAQGLPGP 34. COL4A GLPGLPGPKGFA 35. COL4A GEPGPPGPPGNLG 36. COL4A GLPGPPGPKGPRG 37. COL4A GPPGPPGPLGPLG 38. COL4A PGLDGMKGDPGLP 39. COL4A GIKGEKGNPGQPGLPGLP 40. COL4A GLPGPPGPPGPPS 41. COL5A KGPQGKPGLAGMPGANGPP 42. COL7A PGLPGQVGETGKPGAPGR 43. COL9A KRPDSGATGLPGRPGPPG 44. COL11A GPPGPPGLPGPQGPKG 45. COL11A DGPPGPPGERGPQGPQGPV 46. COL17A LPGPPGPPGSFLSN 47. COL18A GPPGPPGPPGPPS 1. THP VLNLGPITR 2. THP SGSVIDQSRV 3. THP DQSRVLNLGPI 4. THP SRVLNLGPITR 5. THP IDQSRVLNLGPI 6. THP VIDQSRVLNLGPI 7. THP DQSRVLNLGPITR 8. THP SVIDQSRVLNLGPI 9. THP GSVIDQSRVLNLGPI 10. THP IDQSRVLNLGPITR 11. THP SGSVIDQSRVLNLGPI 12. THP VIDQSRVLNLGPITR 13. THP SGSVIDQSRVLNLGPIT 14. THP SVIDQSRVLNLGPITR 15. THP SGSVIDQSRVLNLGPITR 16. THP SGSVIDQSRVLNLGPITRK AB AR Urine Biomarkers are Collagen and THP Peptides Collagen peptide biomarkers THP peptide biomarkers
Hypothesis 1 Gene expression alteration in AR Hypothesis 2 Protease expression alteration in AR Hypothesis 3 Protease inhibitor expression alteration in AR Hypothesis of Molecular Mechanisms for AR Urine Biomarkers
Exploration data set 6 (TGCG) 1 Affymetrix HG-U95Av2 (AR: PBL, n=6; BX, n=7) (STA: PBL, n=9; BX, n=10) (NR: PBL, n=8; BX, n=5) (HC: PBL, n=8; BX, n=9) Exploration Analysis Confirmation 2 Affymetrix HU-133 (AR: BX, n=37) (HC: BX, n=23) Confirmation Analysis Validation 3 Quantitative RT-PCR (AR: BX, n=14) (STA: BX, n=10) (HC: BX, n=10) Validation Analysis Expression analysis of peptide biomarkers’ corresponding precursor genes Expression analysis of metzincin superfamily genes Expression analysis of protease inhibitor genes Discovery mechanism biomarkers Confirmation data set (Stanford) Validation data set (Stanford) Transcriptome Analysis of Allograft Biopsies
Parental Protein Expression Analysis of Allograft Biopsies Contrasting Urine Peptide Biomarker Changes
Genome-wide Protease and Protease Inhibitor Expression Analysis of Allograft Biopsies Revealed Up Regulation of MMP7, SERPING1, TIMP1
Allograft Biopsies Expression Biomarkers Effectively Classified AR [Figure: signal intensity of TIMP1, COL1A2, UMOD, SERPING1, MMP7 and COL3A in AR, STA and HC; ROC curve (sensitivity vs. specificity), mean AUC: 0.98]
Proposed Underlying Mechanisms for the AR Urine Peptide Biomarkers
Hypothesis: Collagen Breakdown and Deposition in AR Decreased Collagen Peptides In AR Increased TIMP1 (Collagenase Inhibitor) in AR Increased Collagen Deposition in AR More Graft Fibrosis After an AR episode? Biopsy Gene Expression GSE Increased MMP7 in AR Decreased Collagen Breakdown in AR Decreased Collagenase Activity In AR tissue Increased Collagen Expression in AR Integrated Analysis Urine Peptidomics Urine Renal Biopsy Urine Peptide Analysis by MS
Urine Biomarker Discovery: Case Study
Unmet Medical Needs in Necrotizing Enterocolitis Necrotizing enterocolitis (NEC) is a medical condition primarily seen in premature infants, where portions of the bowel undergo necrosis (tissue death). Despite decades of research, the pathogenesis of NEC remains obscure, the diagnostic parameters are unclear, and both treatment and prevention strategies remain inadequate and dated. There is a real need for better molecular identification of NEC in order to assist in altering its onset and progression.
Clinical parameters do not adequately predict outcome in Necrotizing Enterocolitis
Low Risk Group Intermediate Risk Group High Risk Group Rate of NEC-S occurrence (% patients) NEC score M: n = 2 S: n = 15 M: n = 16 S: n = 10 M: n = 26 S: n = 0 MS NEC Clinical Parameters Based Model stratifies Necrotizing Enterocolitis Patients
NEC Urine Naturally Occurring Peptide Biomarker Discovery [Workflow figure, three numbered stages: LCMS data reduction (peak finding, peak alignment, peak indexing); supervised data mining (feature selection, training, testing); unsupervised data mining (2D clustering)]
Biomarker Panel: Supervised Analysis (Training and Testing)
Biomarker Panel: Unsupervised Analysis
Biomarker Panel: Combined data set and ROC analysis
Permutation based FDR analysis of the biomarker signature
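A permutation-based FDR estimate can be sketched as follows: count how many features exceed a threshold with the true class labels, then compare that with the average count obtained after relabeling the samples. The statistic (absolute difference of class means), the exhaustive enumeration of relabelings, and all names here are assumptions for illustration; the slide does not give the actual procedure. With realistic cohort sizes one would sample random permutations instead of enumerating them.

```python
from itertools import combinations


def mean_diff(values, group_a):
    """Absolute difference between the mean of group A and the rest."""
    a = [v for i, v in enumerate(values) if i in group_a]
    b = [v for i, v in enumerate(values) if i not in group_a]
    return abs(sum(a) / len(a) - sum(b) / len(b))


def permutation_fdr(features, group_a, threshold):
    """Estimated FDR = expected number of features passing the threshold
    under relabeling / number passing with the true labels."""
    n_samples = len(features[0])
    true_labels = set(group_a)
    observed = sum(mean_diff(f, true_labels) >= threshold for f in features)
    if observed == 0:
        return 0.0
    splits = [set(c) for c in combinations(range(n_samples), len(true_labels))]
    null_total = sum(mean_diff(f, s) >= threshold
                     for s in splits for f in features)
    expected_false = null_total / len(splits)
    return min(1.0, expected_false / observed)


# Illustrative data: 2 features x 8 samples, first four samples are class A.
features = [
    [10.0, 11.0, 12.0, 13.0, 1.0, 2.0, 3.0, 4.0],  # strongly differential
    [5.0, 6.0, 5.0, 6.0, 6.0, 5.0, 6.0, 5.0],      # no class difference
]
fdr = permutation_fdr(features, group_a=[0, 1, 2, 3], threshold=9.0)
print(round(fdr, 4))
```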
Proposed Ensemble Approach to Diagnose Necrotizing Enterocolitis Patients [Flow diagram: NEC patients → clinical model → NEC risk groups (MS low, n=7; MS intermediate; MS high; clinical diagnosis N/A, n=3) → urine biomarkers (urine peptide based classification: classified as M or classified as S) → NEC diagnosis. Table: percent agreement with clinical diagnosis (diagnosed as M, diagnosed as S) for medical NEC scoring and for urine peptide based classification in the discovery set; P = 0.01. Surviving values: 70, 96, 83.3%]
TABLE 2 Overlapping Urine Peptide Biomarkers for NEC (columns: Cluster, Protein, Location, MH+, Sequence, Relative Abundance, U test P value, MS)
Cluster  Protein  Sequence                          U test P value
1        COL1A    RGppGPPGKNGDDGEAGKPGRPGERGPpGp    E-03
1        COL1A    RGPPGppGKNGDDGEAGKpGRpGERGpPGP    E-03
2        COL1A    ARGEPGNIGFPGPKGPTGDPGKNGDKGHAG    E-05
3        COL1A    GRDGNpGNDGpPGRDGQpGHKGERGYpG      E-03
3        COL1A    DGpPGRDGQpGHKGERGYpG              E-03
4        COL1A    AGpPGKAGEDGHpGKPGRpGERG           E-02
4        COL1A    ARGpAGpPGKAGEDGHpGKPGRpGERG       E-02
4        COL1A    ARGpAGpPGKAGEDGHpGKpGRpGERG       E-02
4        COL1A    GpPGKAGEDGHPGKPGRpGERG            E-02
4        COL1A    GPpGKAGEDGHpGKPGRpGERG            E-02
5        COL3A    GApGQNGEPGGKGERGApGEKGEGGpPG      E-03
6        COL3A    NRGERGSEGSPGHPGQpGppGppGAPGP      E-02
6        COL3A    NRGERGSEGSpGHpGQpGPPGPpGApGp      E-02
Proposed Underlying Mechanisms of Urine Naturally Occurring Peptide Biomarkers
Prediction of drug response in SJIA [Figure, panels A, B, C: PR and CR groups under Enbrel and Anakinra]
Urine peptide biomarkers: the discovery process Sample peptides: -Class 1:1,2,3… -Class 2:1,2,3… -Class 3:1,2,3… SCX/RP-HPLC Collect 100 fractions on MALDI plates MALDI-TOF MS for each sample LC fraction -- m/z -- abundance MASS-Conductor® Machine learning feature discovery and classification Biomarker panels MSMS protein ID Prospective validation with quantitative mass spec (MRM)
Interdisciplinary Skills for Biomarker Discovery Biology Analytic biochemistry Biostatistics Computer Science Medicine
Q & A
Genome vs. Proteome
The Isotope Envelope
Allograft Acute Rejection Urine Biomarker Discovery: MASS-Conductor urine biomarker discovery and testing
1 LCMS raw spectra: peak finding, peak alignment, feature extraction, unique features
2 Predictor discovery in training set: Training set (10 AR, 10 STA, 6 BK); classifier training; six-fold cross-validation; classify AR, STA, BK
3 Predictor confirmation in testing set: Testing set (10 AR, 10 STA, 4 BK); predictor sets; linear discriminant analysis (LDA); calculate estimates of predicted class probabilities; analysis of goodness of class separation
4 Pattern analysis in all set: cluster analysis; All set (20 AR, 20 STA, 10 BK, 10 NS, 10 HC); predictors of 40 peptides; 2D hierarchical clustering; heatmap plotting; remove background signals; normalization
5 Platform validation: correlation analysis; 2 peptide biomarkers; MRM assay development; MRM assay on AR, STA, BK, NS, HC training + testing samples; LC-MALDI vs. MRM
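The six-fold cross-validation used for predictor discovery can be sketched with a very simple classifier. To stay dependency-free, the example swaps in a nearest-centroid classifier as a stand-in; it is not the LDA named in the workflow, and the data and function names are invented for illustration.

```python
def fit_centroids(X, y):
    """Compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(centroids, row):
    """Assign the class whose centroid is closest (squared Euclidean)."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda lab: dist2(centroids[lab]))


def cross_validate(X, y, k=6):
    """k-fold cross-validation accuracy; sample i goes to fold i % k."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        centroids = fit_centroids([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(centroids, X[i]) == y[i] for i in test)
    return correct / len(X)


# Illustrative, well-separated two-peptide signals for three classes.
X = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [1.0, 0.0], [11.0, 0.0], [0.0, 11.0],
     [0.0, 1.0], [10.0, 1.0], [1.0, 10.0], [1.0, 1.0], [11.0, 1.0], [1.0, 11.0]]
y = ["AR", "STA", "BK"] * 4
accuracy = cross_validate(X, y, k=6)
print(accuracy)
```

Each sample is classified exactly once by a model that never saw it, which is the point of cross-validation when samples are scarce.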
Correlation Studies Between LCMS and MRM Platforms
Analytical Challenges High complexity and wide dynamic range
Tirumalai, R. S. (2003) Mol. Cell. Proteomics 2: Plasma Proteins Big Trees
Tirumalai, R. S. (2003) Mol. Cell. Proteomics 2: Plasma Proteins Big Trees Bushes
Tirumalai, R. S. (2003) Mol. Cell. Proteomics 2: Plasma Proteins Big Trees Bushes Grass + Bugs
Analytical Challenges Detect low abundance proteins Big Trees = HAP Bushes = MAP Grass + Bugs = LAP
Bottom up LCMS Biomarker Discovery [Workflow: protein mixture → sample preparation (digestion, peptide purification) → digested peptides → multi-dimensional chromatography (SCX, RP) → mass-spec spectra → data analysis; MS/MS protein ID]
Mass Spectrometry In A Nutshell [Diagram: ion source (hν) → mass analyzer (F = ma, separation over time) → detector → MS spectrum (m/z)]
MS/MS Peptide Sequencing [Diagram: source (hν) → 1st mass analyzer → gate → collision cell (fragment ions) → 2nd mass analyzer → detector → MS/MS spectrum]
Differential Expression Analysis in Quantitative LCMS MS based: Peptide 1: M/Z Peptide 2: M/Z’ Peptide 3: M/Z’’ (MASS-Conductor® exhaustive MS comparison) MS/MS based: Peptide 1: protein ID Peptide 2: protein ID’ Peptide 3: protein ID’’ (spectrum counting; labeling, e.g. iTRAQ)
Qualitative Comparative Analysis – Spectrum Counting PROTEIN X Sample A Sample B MS/MS Number of Detected Peptides Number of Detected Peptides [PROTEIN X] IF THEN PROTEIN IDENTIFICATION
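Spectrum counting treats the number of peptides identified for a protein as a rough proxy for its abundance. A minimal sketch, with invented identifications and a pseudocount (an assumption of this example) to avoid division by zero:

```python
from collections import Counter


def spectral_counts(identifications):
    """Count MS/MS peptide identifications per protein.
    `identifications` is a list of (protein, peptide) pairs."""
    return Counter(protein for protein, _peptide in identifications)


def count_ratio(counts_a, counts_b, protein, pseudocount=1.0):
    """Ratio of detected-peptide counts, sample A over sample B."""
    return (counts_a[protein] + pseudocount) / (counts_b[protein] + pseudocount)


# Illustrative identifications only.
sample_a = [("PROTEIN X", "pep1"), ("PROTEIN X", "pep2"),
            ("PROTEIN X", "pep3"), ("PROTEIN Y", "pep9")]
sample_b = [("PROTEIN X", "pep1"), ("PROTEIN Y", "pep9")]

counts_a, counts_b = spectral_counts(sample_a), spectral_counts(sample_b)
ratio_x = count_ratio(counts_a, counts_b, "PROTEIN X")
ratio_y = count_ratio(counts_a, counts_b, "PROTEIN Y")
print(ratio_x, ratio_y)
```

A ratio above one suggests that the protein is more abundant in sample A, which is only a semi-quantitative statement.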
- Peptide fragments EQUAL MS/MS b y b y b y b y MS Mix -N H N H N H N H PRG PRG PRG PRG S1 S2 S3 S4 Parallel Denature & Digest -Reporter-Balance-Peptide INTACT - 4 samples identical m/z Reporter ions DIFFERENT -Chemically identical -Migrate together in HPLC MSMS Based Comparative Analysis – iTRAQ (isobaric tag) Reporter Ions 114, 115, 116, 117
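Because the four labeled samples share one precursor m/z, relative quantitation comes from the reporter ions in the MS/MS spectrum. The sketch below sums the intensity near each nominal reporter mass (114 to 117, as on the slide) and converts it to a per-sample fraction. The 0.2 Da matching window and the example spectrum are assumptions.

```python
REPORTER_MZ = (114, 115, 116, 117)  # nominal reporter ion masses from the slide


def reporter_fractions(msms_peaks, tol=0.2):
    """Return each reporter channel's share of the total reporter signal.
    `msms_peaks` is a list of (mz, intensity) pairs from one MS/MS spectrum."""
    intensity = {mz: 0.0 for mz in REPORTER_MZ}
    for mz, inten in msms_peaks:
        for reporter in REPORTER_MZ:
            if abs(mz - reporter) <= tol:
                intensity[reporter] += inten
    total = sum(intensity.values())
    if total == 0:
        raise ValueError("no reporter ions found in this spectrum")
    return {reporter: value / total for reporter, value in intensity.items()}


# Illustrative spectrum: four reporter ions plus one unrelated fragment ion.
spectrum = [(114.1, 100.0), (115.1, 200.0), (116.1, 100.0),
            (117.1, 400.0), (175.1, 999.0)]
fractions = reporter_fractions(spectrum)
print(fractions)
```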
Issues in MS/MS Based Analysis More abundant proteins tend to get more sequence coverage in MS/MS, masking the MS/MS opportunities for peptides from low-abundance proteins Spectrum counting is semi-quantitative iTRAQ is not scalable for moderate-throughput biomarker discovery: iTRAQ cost, iTRAQ tag number
MS Based Comparative Analysis – Targeted MASS-Conductor® Approach 1. ALL peptide MS signals will be exhaustively compared leading to the discovery of statistically differential signals 2. ONLY peptides of interest, usually a very small number, will be tried with full attention for the MS/MS ID. If necessary, MS/MS signals can be enhanced by more loading or fraction enrichment before MS
MASS-Conductor® Platform Data Mining Requirements Robust handling of high volume proteomic data –e.g. One SCX fraction and 120 RP fractions, 40 sample project, MySQL data storage –raw data is GB –Peak data is 4.4 GB Robust and automatic high throughput computing Robust reduction of raw data sets, enabling efficient and accurate feature discovery Sophisticated data mining approaches to obtain statistically differentiating features Graphic data analysis
“MASS-Conductor ®” An in house software platform, including JAVA, PERL, R, RUBY and MYSQL implementations Interface with AB and Thermo mass specs –Convert LC-MALDI T2D files in a batch manner to text files Extract mono-isotopic LC-MALDI peaks Track multiple scans of the same MALDI plate and HPLC SCX/RP fractions where each peak resides Cluster mono-isotopic peaks across categorical samples for comparative analysis Interface and integrate SAM, PAM, 1d classifiers, 2d classifiers, margin tree, CART algorithm packages for differential feature selection and classification
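Mono-isotopic peak extraction can be illustrated with a simple de-isotoping pass: peaks spaced by about 1.00335 Da (the 13C-12C mass difference, for singly charged ions) are treated as one isotope envelope, and only the lowest-m/z peak of each envelope is kept. This is a sketch under the assumption that envelopes do not overlap; it is not the platform's actual algorithm, and the tolerance and peak list are invented.

```python
ISOTOPE_SPACING = 1.00335  # mass difference between 13C and 12C, in Da


def monoisotopic_peaks(peaks, tol=0.02, charge=1):
    """Keep the first (lowest m/z) peak of each isotope envelope.

    `peaks` is a list of (mz, intensity) pairs. A peak continues the
    current envelope when it lies one isotope spacing above the previous
    peak, within `tol`. Assumes envelopes do not interleave.
    """
    mono = []
    previous_mz = None
    for mz, inten in sorted(peaks):
        step = ISOTOPE_SPACING / charge
        if previous_mz is not None and abs(mz - previous_mz - step) <= tol:
            previous_mz = mz          # same envelope, skip this isotope peak
        else:
            mono.append((mz, inten))  # start of a new envelope
            previous_mz = mz
    return mono


# Illustrative peak list containing two isotope envelopes.
peak_list = [(1500.80, 100.0), (1501.80, 80.0), (1502.81, 35.0),
             (1720.90, 60.0), (1721.90, 50.0)]
mono = monoisotopic_peaks(peak_list)
print(mono)
```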
Common Feature Alignment/Extraction Spectrum Raw datasets Peak datasets Feature datasets Indexed datasets Mass-Conductor Database Binary/Multi-class Classification False Discovery Rate Analysis Biomarker Discovery Potential Biomarkers Web-Service Collaboration Peak Extraction Feature indexing Patient datasets “MASS-Conductor ®”
DATA REDUCTION in “MASS-Conductor ®” Peak Extraction from Spectra Raw Data Patient sample LC-MALDI Spot/fraction 13. m/z 900 – 4000: raw data points 1690 peak data points 62 peaks 2530 data points m/z 1200 – 1250
DATA REDUCTION – One Peptide Example Peak Extraction from Spectra Raw Data [Figure: MS signal across fractions for Class A, Class B and Class C, before data reduction and after data reduction]
Urine THP Peptide Biomarkers Fall into Tight Clusters in C-Terminus
Human THP precursor, Swiss-Prot: P07911
SEQUENCE 640 AA; MW
001 MGQPSLTWML MVVVASWFIT TAATDTSEAR WCSECHSNAT CTEDEAVTTC TCQEGFTGDG
061 LTCVDLDECA IPGAHNCSAN SSCVNTPGSF SCVCPEGFRL SPGLGCTDVD ECAEPGLSHC
121 HALATCVNVV GSYLCVCPAG YRGDGWHCEC SPGSCGPGLD CVPEGDALVC ADPCQAHRTL
181 DEYWRSTEYG EGYACDTDLR GWYRFVGQGG ARMAETCVPV LRCNTAAPMW LNGTHPSSDE
241 GIVSRKACAH WSGHCCLWDA SVQVKACAGG YYVYNLTAPP ECHLAYCTDP SSVEGTCEEC
301 SIDEDCKSNN GRWHCQCKQD FNITDISLLE HRLECGANDM KVSLGKCQLK SLGFDKVFMY
361 LSDSRCSGFN DRDNRDWVSV VTPARDGPCG TVLTRNETHA TYSNTLYLAD EIIIRDLNIK
421 INFACSYPLD MKVSLKTALQ PMVSALNIRV GGTGMFTVRM ALFQTPSYTQ PYQGSSVTLS
481 TEAFLYVGTM LDGGDLSRFA LLMTNCYATP SSNATDPLKY FIIQDRCPHT RDSTIQVVEN
541 GESSQGRFSV QMFRFAGNYD LVYLHCEVYL CDTMNEKCKP TCSGTRFRSG SVIDQSRVLN
601 LGPITRKGVQ ATVSRAFSSL GLLKVWLPLL LSATLTLTFQ
Highlighted biomarker region: SGSVIDQSRVLNLGPITRK (residues 589-607)