Converting Large NCBI Databases into SAS
Rosa SJ Lin
Division of Statistical Genomics
Washington University in Saint Louis
June 30, 2008
NCBI
Contains a large number of databases. The most important are:
- GenBank
- PubMed
- RefSeq
- Online Mendelian Inheritance in Man (OMIM)
- dbSNP
dbSNP Database
NCBI dbSNP
- Contains information about SNPs
- Submitted data are given an ss number
- If the data meet the criteria, a reference SNP is created, which has an rs number (e.g. rs530)
dbSNP Data (1)
- Each record spans a varying number of lines, and the lines vary in length
dbSNP Data (2)
dbSNP Data (3)
Various uses of the SCAN, INDEX functions to assist in reading data (1)
    data ncbisnp;
      length rs $12;
      infile din firstobs=1 missover pad;
      input snpline $132.;
      if index(snpline,"updated")>0 then do;
        rs=compress(scan(snpline,1,"|"));
        output;
      end;
    run;
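The same INDEX/SCAN/COMPRESS pattern can be tried on in-stream data before pointing it at a real file. The following is a minimal sketch only: the two sample lines are hypothetical stand-ins for dbSNP flat-file lines (not copied from the database), and `demo` is an arbitrary dataset name.

```sas
/* Sketch: keep only lines containing "updated" and pull the rs number */
/* from the first "|"-delimited field. Sample lines are hypothetical.  */
data demo;
  length rs $12;
  infile datalines truncover;            /* short lines must not flow onto the next */
  input snpline $132.;
  if index(snpline, "updated") > 0 then do;
    rs = compress(scan(snpline, 1, "|")); /* first field, blanks removed */
    output;
  end;
  datalines;
rs530 | human | updated 2008-01-01
ss0000 | HYPOTHETICAL | orient=+
;
run;

proc print data=demo;
  var rs;
run;
```

Only the first sample line contains "updated", so `demo` should hold a single observation. INDEX returns 0 when the substring is absent, which is why the test is written as `> 0`.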
Various uses of the SCAN, INDEX functions to assist in reading data (2)
    /* allele line: second field looks like alleles=..., value starts at column 9 */
    if index(snpline,"alleles=")>0 then do;
      alleles=substr(compress(scan(snpline,2,"|")),9);
      output;
    end;
    /* reference-assembly line: third field is chr=..., fourth is the position */
    if index(snpline,"assembly=reference")>0 then do;
      chrom=input(substr(compress(scan(snpline,3,"|")),5),8.);
      posc=compress(scan(snpline,4,"|"));
      output;
    end;
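The SUBSTR start positions above (9 and 5) are simply the length of the key plus one. A small sketch that makes this explicit is shown below; the field values are hypothetical, and the `??` modifier on INPUT is an addition not used in the slides. It suppresses the invalid-data note when the chromosome is not numeric (for example X or Y), in which case `chrom` is set to missing.

```sas
/* Sketch: derive the SUBSTR offset from the key instead of hard-coding it. */
/* Field values are hypothetical examples.                                  */
data _null_;
  length field1 field2 $40 alleles $20;
  field1 = compress(" alleles='A/G' ");
  field2 = compress(" chr=1 ");
  alleles = substr(field1, length("alleles=") + 1);        /* same as position 9 */
  chrom   = input(substr(field2, length("chr=") + 1), ?? 8.); /* same as position 5 */
  put alleles= chrom=;
run;
```

Keeping the chromosome as a character variable would be an alternative if the sex chromosomes need to be preserved rather than set to missing.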
Use the RETAIN statement
- Causes a variable to keep its value from one iteration of the DATA step to the next.
    retain markname rs alleles;
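To illustrate why RETAIN matters here: each INPUT reads one line, but one SNP record spans several lines, so values found on earlier lines must survive until the record is complete. The sketch below is one possible arrangement, not necessarily the one used for the slides. It writes a single observation per record when the reference-assembly line is reached, and the three sample lines are hypothetical.

```sas
/* Sketch: combine values from several lines of one record into one row. */
/* Sample lines are hypothetical, not real dbSNP content.                */
data snps (keep=rs alleles chrom);
  length rs $12 alleles $20;
  retain rs alleles;                     /* carry values across iterations */
  infile datalines truncover;
  input snpline $132.;
  if index(snpline, "updated") > 0 then do;
    rs = compress(scan(snpline, 1, "|"));
    alleles = "";                        /* new record: clear the stale value */
  end;
  if index(snpline, "alleles=") > 0 then
    alleles = substr(compress(scan(snpline, 2, "|")), 9);
  if index(snpline, "assembly=reference") > 0 then do;
    chrom = input(substr(compress(scan(snpline, 3, "|")), 5), ?? 8.);
    output;                              /* one row per completed record */
  end;
  datalines;
rs530 | human | updated 2008-01-01
SNP | alleles='A/G' | het=0
CTG | assembly=reference | chr=1 | chr-pos=12345
;
run;
```

Without the RETAIN statement, `rs` and `alleles` would be reset to missing at the start of each iteration and the output row would carry only `chrom`. Clearing `alleles` at the start of each record guards against a value leaking from the previous SNP when a record has no allele line.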
dbSNP Data (4)
Output SAS Dataset
Readings:
- Kim L Kolbe etc., SUGI 22: "Advanced Techniques for Reading Difficult and Unusual Flat Files".
- Clinton S Rickards, SUGI 24: "Reading External Files Using SAS® Software".