TCGA
The Cancer Genome Atlas Project
January 24, 2008

TCGA Program Goal
Find genomic alterations that cause cancer (mutations, CNA, methylation, …)
Pilot project
– $100M (NCI/NHGRI)
– 3 years
– 3 diseases
   brain (glioblastoma multiforme)
   lung (squamous)
   ovarian (serous cystadenocarcinoma)

TCGA Organization
Biospecimen Core Resource (BCR)
Genome Sequencing Centers (GSCs) (3)
Cancer Genome Characterization Centers (CGCCs) (7)
Data Coordinating Center (DCC)
Project Team (NCI/NHGRI)
Steering Committee (NCI/NHGRI & PIs)
External Scientific Committee
Working Groups

TCGA PIs
BCR   IGC/TGEN     Robert Penny
GSC   Baylor       Richard Gibbs
      Broad        Eric Lander
      WashU        Rick Wilson
CGCC  Broad/DFCI   Matthew Meyerson
      Harvard/B&W  Raju Kucherlapati
      JHU          Steve Baylin
      LBL          Joe Gray
      MSKCC        Marc Ladanyi
      Stanford     Rick Myers
      UNC          Chuck Perou
DCC   SRA          Ari Kahn

TCGA URLs
project site:
gforge: (search for TCGA)
data:
portal: [coming]

TCGA Data Types
Institution   Analysis                       Platform
Broad/DFCI    Transcription and Copy Number  Affymetrix U133 Plus 2.0 & SNP Array 6.0
Harvard/B&W   Transcription and Copy Number  Agilent 244K Array
LBL           Transcription                  Affymetrix Exon 1.0 ST Array
MSKCC         Copy Number                    Agilent 244K Array
JHU           Methylation                    Illumina GoldenGate
UNC           Transcription                  Agilent 44K Array
Stanford      Copy Number                    Illumina Infinium 550K BeadChip Array
Broad         Somatic Mutations              DNA sequencing
Baylor        Somatic Mutations              DNA sequencing
WashU         Somatic Mutations              DNA sequencing

TCGA Data Levels
raw
– low-level data for a single sample, not normalized (e.g., trace file, .cel file)
processed
– single-sample, normalized & interpreted (e.g., mutation call, amplification call for a locus, .snp, .chp)
segmented (n/a for mutation & expression)
– single-sample, aggregation of loci into regions (e.g., amplification call for a region of a sample)
summary finding (aka “region of interest”)
– cross-sample findings (e.g., minimal common region of amplification across a sample set)

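To make the step from processed to segmented to summary-finding data concrete, here is a minimal Python sketch. It is an illustration only, not the method used by any TCGA center: the per-locus call labels ("amp", "neutral"), the function names, and the positions are all hypothetical, and the "common region" is simplified to a plain intersection of amplified regions across every sample.

```python
from typing import List, Tuple

# Hypothetical processed-level input: (position, call) pairs for one sample,
# sorted by position. The call labels are illustrative only.
Locus = Tuple[int, str]
Region = Tuple[int, int]  # inclusive (start, end)


def segment(loci: List[Locus], call: str = "amp") -> List[Region]:
    """Segmented level: merge runs of consecutive loci sharing `call`."""
    regions: List[Region] = []
    start = end = None
    for pos, locus_call in loci:
        if locus_call == call:
            if start is None:
                start = pos
            end = pos
        elif start is not None:
            regions.append((start, end))
            start = end = None
    if start is not None:
        regions.append((start, end))
    return regions


def common_regions(per_sample: List[List[Region]]) -> List[Region]:
    """Summary level (simplified): regions present in every sample."""
    if not per_sample:
        return []
    common = per_sample[0]
    for regions in per_sample[1:]:
        overlap: List[Region] = []
        for a_start, a_end in common:
            for b_start, b_end in regions:
                lo, hi = max(a_start, b_start), min(a_end, b_end)
                if lo <= hi:
                    overlap.append((lo, hi))
        common = overlap
    return common


sample_a = [(100, "neutral"), (200, "amp"), (300, "amp"), (400, "amp"), (500, "neutral")]
sample_b = [(100, "neutral"), (200, "neutral"), (300, "amp"), (400, "amp"), (500, "amp")]

segments = [segment(sample_a), segment(sample_b)]
print(segments)                  # per-sample amplified regions
print(common_regions(segments))  # region amplified in both samples
```

A real summary finding would typically tolerate samples that lack the alteration and would weigh amplitude and frequency; the strict intersection here only shows how the levels build on one another.
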
TCGA Flow
[Flow diagram; nodes and labels as recovered from the slide]
Tissue Source (MD Anderson, Henry Ford, …)
BCR
 1. check pathology, quality/quantity
 2. extract analytes
 3. prepare data file
Centers: GSC, CGCC
Analyte labels: WGA; DNA, mRNA; DNA
NCBI Trace Archive
DCC
 sample data
 Bulk Download
 caTissue Core
 caArray
 caIntegrator
 “tracking database”

TCGA Data Formats
BCR
– XML (tags are CDEs)
– images
GSC
– Called mutations (Genboree LFF format)
– Linking table sample-trace-target
CGCC
– MAGE-TAB
   IDF: Investigation Description Format
   SDRF: Sample and Data Relationship Format

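As an illustration of the CGCC format, an SDRF is a tab-delimited table whose first row holds the column headings, so it can be read with standard tools. The sketch below is a minimal example under stated assumptions: the helper names, sample names, and file names are hypothetical, and only a few standard MAGE-TAB column headings are used. It is not the DCC's loader.

```python
import csv
import io
from typing import Dict, List


def read_sdrf(text: str) -> List[Dict[str, str]]:
    """Parse SDRF text: tab-delimited, first row holds the column headings."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


def files_by_source(
    rows: List[Dict[str, str]],
    source_col: str = "Source Name",
    file_col: str = "Array Data File",
) -> Dict[str, List[str]]:
    """Group data file names by the sample (source) they derive from."""
    mapping: Dict[str, List[str]] = {}
    for row in rows:
        mapping.setdefault(row[source_col], []).append(row[file_col])
    return mapping


# Hypothetical SDRF content; sample and file names are made up.
example = (
    "Source Name\tHybridization Name\tArray Data File\n"
    "sample-1\thyb-1\tsample-1.cel\n"
    "sample-2\thyb-2\tsample-2.cel\n"
)

rows = read_sdrf(example)
print(files_by_source(rows))
```

Real SDRF files may repeat a column heading (for example several "Protocol REF" columns); `csv.DictReader` keeps only the last value for a repeated heading, so a production reader would need to handle columns by position instead.
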
TCGA Where Does/Will the Data Go?
ftp site (now with a simple web wrapper: “portal #1”)
“tracking database”
repositories with caBIG APIs
– caArray
– caTissue CORE
– caIntegrator
– NCIA
NCBI trace archive
a richer “portal #2”
– more convenient download capability
– filtering datasets by clinical information
– summary level data
– genome browser view
– gene info page
– visualization on pathways
– etc.