Prepare data for Microdeletion
Jianfang Chen
1. Original Data Set.
2. Objective Data Sets.
(1) snp_homozygosity_data.
(2) snp_location_data.
(3) Parameter file.
snp_homozygosity_data -- The first row is the title line. The first column is affection_status (0 for controls, 1 for cases). The remaining columns are homozygosity data at each site (0 for missing, 1 for homozygotes, 2 for heterozygotes).
Example of "snp_homozygosity_data" (with two controls, two cases, 6 SNPs):
indicator v1 v2 v3 v4 v5 v6
0 0 1 1 2 1 2
0 1 2 2 1 0 1
1 1 1 1 0 2 1
1 2 1 2 1 1 1
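The layout just described can be checked with a short Python sketch. It assumes the file is whitespace-delimited, as in the example; the function name parse_homozygosity is hypothetical and is not part of the original tools.

```python
EXAMPLE = """\
indicator v1 v2 v3 v4 v5 v6
0 0 1 1 2 1 2
0 1 2 2 1 0 1
1 1 1 1 0 2 1
1 2 1 2 1 1 1
"""


def parse_homozygosity(text):
    """Parse whitespace-delimited snp_homozygosity_data text.

    Returns (site_names, controls, cases); each row is a list of site codes.
    """
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]  # the first row is the title line
    controls, cases = [], []
    for row in rows:
        if len(row) != len(header):
            raise ValueError("row length does not match the title line")
        status, *sites = (int(v) for v in row)
        # affection_status must be 0 or 1; site codes must be 0, 1 or 2
        if status not in (0, 1) or any(s not in (0, 1, 2) for s in sites):
            raise ValueError("unexpected code in row: %r" % (row,))
        (cases if status == 1 else controls).append(sites)
    return header[1:], controls, cases


site_names, controls, cases = parse_homozygosity(EXAMPLE)
print(len(controls), "controls,", len(cases), "cases,", len(site_names), "SNPs")
```

The counts it reports can be compared against num_cont, num_case and num_site in the parameter file.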
snp_location_data -- The first row is the title line. The first column is the SNP index number. The second column is the SNP location. The locations are sorted in increasing order.
Example of "snp_location_data" (with 6 SNPs):
order position
1 50530104
2 50531804
3 50550165
4 50571683
5 50574584
6 50574983
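A minimal Python sketch for reading this file and confirming the ordering requirement follows. It assumes whitespace-delimited columns and treats "increasing" as strictly increasing (no repeated positions), which is an assumption; parse_locations is a hypothetical name.

```python
LOCATIONS = """\
order position
1 50530104
2 50531804
3 50550165
4 50571683
5 50574584
6 50574983
"""


def parse_locations(text):
    """Return the list of SNP positions, checking that they increase."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    rows = lines[1:]  # skip the title line
    positions = [int(position) for _order, position in rows]
    # Assumption: positions must be strictly increasing (no duplicates).
    for previous, current in zip(positions, positions[1:]):
        if current <= previous:
            raise ValueError("positions are not sorted in increasing order")
    return positions


positions = parse_locations(LOCATIONS)
print(len(positions), "SNPs spanning", positions[-1] - positions[0], "bases")
```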
Parameter file -- It needs the following inputs (one input per line):
snp_homozygosity_data_name
snp_location_data_name
output_file_name
num_cont
num_case
num_site
maximum_window_size
num_rep1
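As an illustration, a parameter file in this order could be read with the Python sketch below. The example values (file names, window size, replicate count) are made up for the sketch, and treating the last five inputs as integers is an assumption; Params and parse_params are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Params:
    snp_homozygosity_data_name: str
    snp_location_data_name: str
    output_file_name: str
    num_cont: int
    num_case: int
    num_site: int
    maximum_window_size: int
    num_rep1: int


# Hypothetical parameter file contents, one input per line.
EXAMPLE_PARAMS = """\
snp_homozygosity_data
snp_location_data
output.txt
2
2
6
3
1000
"""


def parse_params(text):
    """Read the eight inputs in the listed order (assumes integer counts)."""
    values = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(values) != 8:
        raise ValueError("expected 8 inputs, found %d" % len(values))
    return Params(*values[:3], *(int(v) for v in values[3:]))


params = parse_params(EXAMPLE_PARAMS)
print(params)
```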
3. Algorithm
Sort the original data by FamilyID, Position and Marker_name.
Where two markers share the same position, remove one of them.
For each family within a marker (3 individuals):
leave the child as the case;
combine father and mother into one line as control, based on the following algorithm:
suppose father (a,b), mother (c,d) and child (e,f)
if e=a and f=c then control will be (b,d)
else if e=a and f=d then control will be (b,c)
else if e=b and f=c then control will be (a,d)
else if e=b and f=d then control will be (a,c)
else if e=c and f=a then control will be (d,b)
else if e=c and f=b then control will be (d,a)
else if e=d and f=a then control will be (c,b)
else if e=d and f=b then control will be (c,a)
else if a=1 and b=1 and c=1 and d=1 and e=2 and f=2 then control will be (1,1)
else if a=2 and b=2 and c=2 and d=2 and e=1 and f=1 then control will be (2,2)
else if a=1 and b=1 and c=2 and d=2 and e=1 and f=1 then control will be (1,2)
else if a=2 and b=2 and c=1 and d=1 and e=1 and f=1 then control will be (1,2)
else if a=1 and b=1 and c=2 and d=2 and e=2 and f=2 then control will be (1,2)
else if a=2 and b=2 and c=1 and d=1 and e=2 and f=2 then control will be (1,2)
else if a=2 and b=2 and c=2 and d=2 and e=1 and f=2 then control will be (2,2)
else if a=1 and b=1 and c=1 and d=1 and e=1 and f=2 then control will be (1,1)
else control will be (0,0)
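The rule chain above can be written compactly in Python. This is a sketch under the assumption that each genotype is passed as a pair of integer allele codes; derive_control is a hypothetical name, and the branches are tested in the same order as listed.

```python
def derive_control(father, mother, child):
    """Return the control pair for one family at one marker."""
    a, b = father
    c, d = mother
    e, f = child
    # First eight branches: (value e must equal, value f must equal, control).
    ordered = [
        (a, c, (b, d)), (a, d, (b, c)),
        (b, c, (a, d)), (b, d, (a, c)),
        (c, a, (d, b)), (c, b, (d, a)),
        (d, a, (c, b)), (d, b, (c, a)),
    ]
    for first, second, control in ordered:
        if e == first and f == second:
            return control
    # Remaining explicit branches, keyed by (father, mother, child).
    special = {
        ((1, 1), (1, 1), (2, 2)): (1, 1),
        ((2, 2), (2, 2), (1, 1)): (2, 2),
        ((1, 1), (2, 2), (1, 1)): (1, 2),
        ((2, 2), (1, 1), (1, 1)): (1, 2),
        ((1, 1), (2, 2), (2, 2)): (1, 2),
        ((2, 2), (1, 1), (2, 2)): (1, 2),
        ((2, 2), (2, 2), (1, 2)): (2, 2),
        ((1, 1), (1, 1), (1, 2)): (1, 1),
    }
    # Final branch: anything unmatched becomes (0,0).
    return special.get(((a, b), (c, d), (e, f)), (0, 0))


print(derive_control((1, 2), (1, 2), (1, 1)))
print(derive_control((1, 1), (2, 2), (1, 1)))
```

In the first call the child matches e=a and f=c, so the control is (b,d); in the second no allele-matching branch applies and the explicit branch for a=1, b=1, c=2, d=2, e=1, f=1 is used.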
Recode any combination of a, b, c, d taken as a pair (x,y) as follows:
if x*y=0 then output 0
else if x*y=2 then output 1
else output 2
Dump out the Middle Step Output as I posted on the website. For each family:
"0" + the line-up of all parents' recode_number values obtained in step 4.
"1" + the line-up of all children's recode_number values obtained in step 4.
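A short Python sketch of the recoding rule and of the two output rows per family follows. It implements the product rule exactly as written; the names recode and family_rows, and the representation of each family as lists of pairs (one pair per marker), are assumptions made for illustration.

```python
def recode(pair):
    """Apply the x*y rule to one pair of allele codes."""
    x, y = pair
    product = x * y
    if product == 0:
        return 0
    if product == 2:
        return 1
    return 2


def family_rows(control_pairs, child_pairs):
    """Build the control row ("0" + codes) and case row ("1" + codes)."""
    control_row = [0] + [recode(p) for p in control_pairs]
    case_row = [1] + [recode(p) for p in child_pairs]
    return control_row, case_row


# Hypothetical family with three markers.
control_row, case_row = family_rows(
    [(1, 2), (0, 0), (2, 2)],
    [(1, 1), (1, 2), (0, 2)],
)
print(" ".join(map(str, control_row)))
print(" ".join(map(str, case_row)))
```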
4. Data sets.
data_all.txt
data_clean.txt