Alistair Chalk, Elisabet Andersson
Stem Cell Biology and Bioinformatic Tools, DBRM, Karolinska Institutet, September

ChIP-chip and handling large datasets: exercises
You don't need to complete all exercises for this section. The aim is to appreciate what formulas can do for you in your current analysis.

Section 1: Excel

Excel referencing
=A3 : the value in cell A3
=A3/A4 : the value in A3 divided by the value in A4
(Illustration: a small spreadsheet grid with columns A, B, C.)

Excel formulas
Common Excel formulas
=IF(condition, value_if_true, value_if_false)
=MID(text, start, number_of_characters)
=LEFT(text, number_of_characters)
=RIGHT(text, number_of_characters)
Database and lookup
=VLOOKUP(key, table, result_column, range_lookup) : set range_lookup to FALSE for an exact match
=FIND(text_to_find, text_to_search)
Formula names are different in Swedish versions of Excel!
There are hundreds of formulas!

Excel formula copy
$ = fixed position. Put $ before the row and/or column that should stay constant when a formula is copied.
$A1 : column A stays fixed, the row changes
A$1 : row 1 stays fixed, the column changes
$A$1 : both column A and row 1 stay fixed
Example: $A$2 will not change when the formula is copied to other cells.
(Illustration: a small spreadsheet grid with columns A, B, C.)

Data lookup in Excel
VLOOKUP command: find row information by matching the identifier. Used to combine datasets.
(Diagram: the identifiers in the "Subset of interest" are searched for in the "Data table", giving the "Resulting data for subset".)
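
For readers who find code easier to follow than a diagram, the lookup idea can be sketched in Python (a swapped-in language, not part of the course material). This is only an illustration of an exact-match lookup: the identifiers, towns and population figures are invented, and `vlookup` is a hypothetical helper, not an Excel or library function.

```python
# Hypothetical "data table": identifier -> (town, population). All values invented.
data_table = {
    "id1": ("TownA", 1000),
    "id2": ("TownB", 2500),
    "id3": ("TownC", 400),
}

# Identifiers we want information for (the "subset of interest").
subset_of_interest = ["id3", "id1", "id9"]


def vlookup(key, table, column):
    """Return one column of the row whose identifier equals key, else None.

    column is a 0-based index into the stored row (Excel counts from 1
    and includes the key column). None plays the role of Excel's #N/A.
    """
    row = table.get(key)
    return None if row is None else row[column]


# One lookup per identifier, like filling a VLOOKUP formula down a column.
result = [(key, vlookup(key, data_table, 1)) for key in subset_of_interest]
print(result)
```

The identifier "id9" is deliberately missing from the table to show what happens when there is no exact match.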

Excel exercise 1 – Excel
Open the "shoes.txt" example (from day one) in Excel.
a) Calculate the mean, lower quartile and upper quartile for height. (You must put the calculations in column B.)
b) Do the same for shoe size by copying the cells from column A.
c) In column F, use a formula to write "boden flicka" if the person is from Boden and is a girl.
d) Modify c) to write a description of people who are not "boden flicka".
e) Open "cities.txt" and use VLOOKUP to add population size information to the table in "shoes".

Complex formulas
Formulas can be combined.
Note that FIND takes the text to look for first and the cell to search second, and returns an error (not FALSE) when nothing is found, so it is wrapped in ISNUMBER here.
Give me the first 4 letters of the sequence only if it contains a GGGG motif:
=IF(ISNUMBER(FIND("GGGG",D1)),LEFT(D1,4),FALSE)
Give me D2 only for sequences that contain a GGGG motif:
=IF(ISNUMBER(FIND("GGGG",D1)),D2,FALSE)
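
The same "test for a motif, then return something" logic can be written as a short Python sketch, which may help when checking spreadsheet results. The function name and its parameters are made up for illustration.

```python
def first_letters_if_motif(sequence, motif="GGGG", n=4):
    """Return the first n letters if motif occurs in sequence, else False."""
    # str.find returns -1 when the motif is absent, so the "not found"
    # case has to be tested explicitly, much like in the spreadsheet.
    if sequence.find(motif) != -1:
        return sequence[:n]
    return False


print(first_letters_if_motif("ACGTGGGGTT"))
print(first_letters_if_motif("ACGTACGT"))
```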

Excel exercise 2 – working with sequences using formulas
Download a sequence from RefSeq in FASTA format (e.g. NM_001024).
Find all 25-mers for that sequence. Hint: use the MID() command.
How many contain the motif AAGCG (exact match)? Hint: use the FIND() command.
How many start with the motif AAGCG?
What is the length of your sequence? (Use Excel formulas only.)
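
One way to cross-check the spreadsheet answers is a small Python sketch of the same sliding-window idea. It runs on an invented toy record with a window of 6 rather than on NM_001024 with 25, and the function names are hypothetical.

```python
def read_fasta_sequence(fasta_text):
    """Join all non-header lines of a single-record FASTA string."""
    return "".join(
        line.strip() for line in fasta_text.splitlines() if not line.startswith(">")
    )


def kmers(sequence, k):
    """All overlapping windows of length k, like MID(seq, i, k) filled down."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Invented toy record, not a real RefSeq entry.
fasta_text = ">toy_sequence\nAAGCGTT\nAAGCGA\n"
sequence = read_fasta_sequence(fasta_text)
windows = kmers(sequence, 6)  # use k=25 for the exercise

motif = "AAGCG"
n_containing = sum(motif in w for w in windows)          # like FIND()
n_starting = sum(w.startswith(motif) for w in windows)   # like LEFT(w, 5) = motif
print(len(sequence), len(windows), n_containing, n_starting)
```

A sequence of length L has L - k + 1 windows of length k, which is a useful sanity check on the number of rows in the spreadsheet.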

Excel exercise 3 – VLOOKUP for genomic data
"Where are the genes with specific GO terms in the human genome?"
Download the known genes table for human from Ensembl BioMart:
1) 1 table with all genes and chromosome positions
2) 1 table with genes with a GO term you are interested in
Load the data into Excel.
Use VLOOKUP() to find genomic coordinates for the genes from 2).
Use IF statements to find all genes in chromosome 1 within positions and

Excel exercise 4 – VLOOKUP for miRNA data
miRNA analysis: "What are the properties of the targets of my expressed miRNA?"
Download the known targets for miRNA from
Load the data into Excel.
Now download information about the targets from BioMart (EnsMart).
How many targets are on chromosome 1?

What to do when many-many relationships exist?
Excel is not well suited to many-many relationships (without jumping through some hoops).
Solution: use Unix tools or SQL databases.
Unix solutions
grep: looks for lines in a file that contain a specific pattern
grep -e "NM_001024" filename
looks for lines containing NM_001024
While we won't teach command line tools, we recommend them for handling large data files: sorting, manipulating and filtering data down to more meaningful datasets that can then be handled in Excel.
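
To make the many-many problem concrete, here is a minimal Python sketch of the kind of join that a single lookup per row cannot express: one miRNA has several targets, and one target has several annotations. All names and term labels are invented placeholders.

```python
from collections import defaultdict

# Invented pairs: a miRNA can have several targets, a target several miRNAs.
mirna_targets = [("miR-a", "gene1"), ("miR-a", "gene2"), ("miR-b", "gene2")]

# Invented pairs: a gene can carry several annotation terms.
gene_terms = [("gene1", "term1"), ("gene2", "term1"), ("gene2", "term2")]

# Index the second table by gene, keeping every matching row, not just the first.
terms_by_gene = defaultdict(list)
for gene, term in gene_terms:
    terms_by_gene[gene].append(term)

# Join: one output row per (miRNA, target, term) combination.
joined = [
    (mirna, gene, term)
    for mirna, gene in mirna_targets
    for term in terms_by_gene.get(gene, [])
]
for row in joined:
    print(row)
```

Three input pairs on each side expand to five joined rows here, which is the behaviour a SQL join gives directly.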

Section 2: Cytoscape

PluriNet exercises
Go to www.stemcellmatrix.org and load the Cytoscape visualisation of the PluriNet.
Explore the PluriNet in Cytoscape
–How many nodes are in the network?
–Export the complete node list, including all node information
–Colour nodes based on cellular location

Cytoscape exercises cont.
Add additional data to the network from external sources
–Download some stem cell gene expression data from ArrayExpress and integrate it into the network (do it with your own data if you have it)
Colour nodes based on this external data
–Gene expression (up/down in a study)

Today's exercises
Add additional data to the network from
–"Characterizing the mouse ES cell transcriptome with Illumina sequencing", Ruben Rosenkranz, Tatiana Borodina, Hans Lehrach and Heinz Himmelbauer

Table 2:
MGI symbol       RefSeq ID    Transcription specific for    No. of reads Reads/kb
Pou5f1 (Oct4)    NM_013633    Pluripotent stem cell
Nanog            NM_028016    Pluripotent stem cell
Sox2             NM_011443    Pluripotent stem cell
Sox1             NM_009233    Ectoderm                      21.65
Sox17            NM_011441    Endoderm                      72.24
T (Brachyury)    NM_009309    Mesoderm                      146.84

Today's exercises
Run the nodes in the PluriNet through a DAVID and a GSEA enrichment analysis
–What pathways do you find?
–What is the difference in pathways between DAVID and GSEA (if any)?
–Find all the genes in the top pathways and add this information back into the Cytoscape network
–Colour the network based on the pathways from the previous question
–What other pathways would be informative here?

Section 3: Unix
For reference only, no exercises in this section!

gawk programming I
Delimiter: what separates the columns, e.g. tab ("\t") or comma (",")
Set the delimiter to tab with FS="\t"
Column naming
$1 = column 1
$2 = column 2
...

gawk structure
BEGIN {} : process these commands before the file is read
END {} : process these commands after the file has been read
/PATTERN/ {COMMANDS} : for each line matching this pattern, run COMMANDS

gawk programming II
gawk 'BEGIN {FS = "\t";}//{print $2"\t"$1;}' filename > filename.new
Print column 2 followed by column 1 (i.e. swapped) for all lines. The empty pattern // matches every line.
/^PREFIX/ {print $0}
Print all lines starting with PREFIX
/^PREFIX/ {print ($1+$2)}
Add column 1 and column 2 for all lines starting with PREFIX
(gawk treats text that does not start with a number as 0 in arithmetic, so the columns used must hold numbers.)

gawk programming III
gawk 'BEGIN {FS = "\t";}/^PREFIX/{sum = sum + $2*$1;}END {print sum;}' filename > filename.new
Print the sum of column 1 * column 2 over all lines starting with PREFIX
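
For comparison, the same BEGIN / pattern / END flow can be sketched in Python. This is an illustration rather than a translation: it multiplies columns 2 and 3 instead of 1 and 2 (an assumption made here so that column 1 can hold a text label that the prefix test matches), and the data lines are invented.

```python
def sum_of_products(lines, prefix):
    """Sum column2 * column3 over tab-separated lines starting with prefix."""
    total = 0.0                                  # BEGIN: set up before reading
    for line in lines:
        if not line.startswith(prefix):          # /^PREFIX/ : skip other lines
            continue
        fields = line.rstrip("\n").split("\t")   # FS = "\t"
        total += float(fields[1]) * float(fields[2])
    return total                                 # END: report after the last line


# Invented tab-separated lines: label, value, value.
lines = ["chr1\t2\t3\n", "chr2\t5\t5\n", "chr1\t4\t0.5\n"]
total = sum_of_products(lines, "chr1")
print(total)
```

In practice `lines` could be an open file object, since iterating over a file yields one line at a time.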

sed
Replace AF00245 with NM_001024 in a file:
sed 's/AF00245/NM_001024/g;' filename > filename.new
s = substitute. g = global (all occurrences on a line); omitting the g replaces only the first occurrence on each line.
You can also match patterns (example: s/[0-9]//; removes the first digit on each line).
Special characters such as . * [ \ and the / delimiter must be preceded by a \ to be matched literally.

Combining statements with | "pipe"
Compound statements
gawk '//{print $2"\t"$1;}' filename | sed -E 's/Acc1/Acc2/g;s/\.[0-9]+//g;' > filename.new
| = pipe (pass the output of one program to the next program on the command line).
[0-9]+ : matches a number (a run of one or more characters between 0 and 9). The -E option is needed for + to mean "one or more"; without it sed treats + as a literal plus sign.
[a-z] and [A-Z] are other examples.
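
The two stages of this pipeline (swap the columns, then substitute) can also be sketched as one Python function, which may be easier to test on a few lines before running the real command. The function name and the example lines are invented; the accession-like strings are placeholders.

```python
import re


def swap_and_clean(line):
    """Swap the first two columns, rename Acc1 to Acc2, drop '.digits' suffixes."""
    fields = line.split()                        # gawk default: split on whitespace
    swapped = fields[1] + "\t" + fields[0]       # print $2 "\t" $1
    renamed = swapped.replace("Acc1", "Acc2")    # s/Acc1/Acc2/g
    return re.sub(r"\.[0-9]+", "", renamed)      # s/\.[0-9]+//g


# Invented input lines: an accession with a version suffix, then a label.
lines = ["NM_001024.2 Acc1", "XX_000001.10 Acc1"]
cleaned = [swap_and_clean(line) for line in lines]
for row in cleaned:
    print(row)
```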

The bottom line
For the analysis of small to mid-sized datasets it is worthwhile to learn Excel.
For extensive data analysis it is worthwhile to learn Unix and SQL.
Where to find more information?
Excel online tutorials
Programming in gawk/perl/sed/python/ruby
Bioinformatics links / primers