Alistair Chalk, Elisabet Andersson Stem Cell Biology and Bioinformatic Tools, DBRM, Karolinska Institutet, 18-24 September 2007. 4-3 Chip-chip and handling.


1 Alistair Chalk, Elisabet Andersson. Stem Cell Biology and Bioinformatic Tools, DBRM, Karolinska Institutet, 18-24 September 2007. 4-3 Chip-chip and handling large datasets: exercises
You don't need to complete all exercises for this section. Appreciate what formulas can do for you in your current analysis.

2 Section 1: Excel

3 Excel referencing
- =A3 : the value in A3
- =A3/A4 : A3 divided by A4
[Figure: example spreadsheet with columns A, B, C]

4 Excel formulas
Common Excel formulas:
- =IF(condition, value_if_true, value_if_false)
- =MID(text, start, length)
- =LEFT(text, number_of_characters)
- =RIGHT(text, number_of_characters)
Database and lookup:
- =VLOOKUP(key, table, result_column, range_lookup) (set range_lookup to FALSE for an exact match)
- =FIND(find_text, within_text)
Formula names are different in Swedish versions of Excel! There are hundreds of formulas!

5 Excel formula copy
$ = fixed position. $A$2 will not change when formulas are copied. Put $ before the row or column to keep it constant:
- $A1 : column A stays fixed
- $A$1 : column A and row 1 stay fixed
- A$1 : row 1 stays fixed
[Figure: example spreadsheet with columns A, B, C]

6 Data lookup in Excel
The VLOOKUP command finds row information by matching the identifier. It is used to combine datasets.
[Figure: the data table is searched with the subset of interest, giving the resulting data for the subset]
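For readers who later move beyond Excel, the same lookup idea can be sketched in Python. This is only an illustration, not part of the course material: the function name `vlookup`, the city names and the population figures are made-up placeholders.

```python
def vlookup(key, table, result_column, default=None):
    """Return the value in `result_column` (1-based, as in Excel) of the
    first row whose first column equals `key` exactly."""
    for row in table:
        if row[0] == key:
            return row[result_column - 1]
    return default  # Excel would show #N/A here


# Hypothetical data table: (city, population). Values are illustrative only.
cities = [
    ("CityA", 1000),
    ("CityB", 2500),
]

# Hypothetical subset of interest: (name, city).
people = [("Anna", "CityB"), ("Erik", "CityA"), ("Sara", "CityC")]

# Add the looked-up population to every row of the subset.
combined = [(name, city, vlookup(city, cities, 2)) for name, city in people]
```

As with an exact-match VLOOKUP, only the first matching row is returned, and a key that is missing from the data table yields a placeholder instead of a value.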

7 Excel exercise 1: Excel
Open the "shoes.txt" example (from day one) in Excel.
a) Calculate the mean, lower quartile and upper quartile for height. (You must put the calculations in column B.)
b) Do the same for shoe size by copying the cells from column A.
c) In column F, use a formula to write "boden flicka" if the person is from Boden and is a girl.
d) Modify c) to write a description of people who are not "boden flicka".
Open "cities.txt".
- Use VLOOKUP to add population size information to the table in "shoes".

8 Complex formulas
Formulas can be combined. FIND(find_text, within_text) returns an error when the text is not found, so wrap it in ISNUMBER to get a TRUE/FALSE condition.
- Give me the first 4 letters of the sequence only if it contains a GGGG motif: =IF(ISNUMBER(FIND("GGGG",D1)),LEFT(D1,4),FALSE)
- Give me D2 only for sequences that contain a GGGG motif: =IF(ISNUMBER(FIND("GGGG",D1)),D2,FALSE)
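The logic of the two combined formulas (test for a motif, then return either part of the sequence or another cell) can be written out in Python as a rough sketch. The function names are hypothetical and chosen only for this illustration.

```python
def first4_if_motif(sequence, motif="GGGG"):
    """Return the first 4 letters of `sequence` if it contains `motif`,
    otherwise False (the 'value if false' branch of the IF)."""
    # Like Excel's FIND, the `in` test is case-sensitive.
    return sequence[:4] if motif in sequence else False


def other_cell_if_motif(sequence, other_value, motif="GGGG"):
    """Return `other_value` (the D2 cell in the slide) only when
    `sequence` (the D1 cell) contains `motif`."""
    return other_value if motif in sequence else False
```

For example, `first4_if_motif("ACGGGGTT")` returns `"ACGG"`, while a sequence without the motif returns `False`.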

9 Excel exercise 2: working with sequences using formulas
- Download a sequence from RefSeq in FASTA format (e.g. NM_001024).
- Find all 25-mers for that sequence. Hint: use the MID() command.
- How many contain the motif AAGCG (exact match)? Hint: use the FIND() command.
- How many start with the motif AAGCG?
- What is the length of your sequence? (Use Excel formulas only.)
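To check the Excel answers for this exercise, the same steps can be sketched in Python. This is an assumption-laden illustration rather than the intended solution: the helper names are invented, and the short toy sequence with k = 6 stands in for a real RefSeq sequence with k = 25.

```python
def kmers(sequence, k=25):
    """All overlapping windows of length k (what MID does row by row)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_containing(mers, motif):
    """Number of k-mers that contain the motif anywhere (FIND succeeds)."""
    return sum(motif in mer for mer in mers)


def count_starting_with(mers, motif):
    """Number of k-mers that begin with the motif (FIND returns 1)."""
    return sum(mer.startswith(motif) for mer in mers)


# Toy example with a made-up sequence and a short window length.
toy_sequence = "AAGCGTAAGCG"
toy_mers = kmers(toy_sequence, k=6)
toy_length = len(toy_sequence)  # the LEN() step of the exercise
```

A sequence of length n has n - k + 1 overlapping k-mers, which is a quick sanity check on the number of rows filled in Excel.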

10 Excel exercise 3: VLOOKUP for genomic data
"Where are the genes with specific GO terms in the human genome?"
- Download the known genes table for human from Ensembl BioMart:
  1) one table with all genes and chromosome positions
  2) one table with genes with a GO term you are interested in
- Load the data into Excel.
- Use VLOOKUP() to find genomic coordinates for the genes from 2).
- Use IF statements to find all genes on chromosome 1 between positions 27230447 and 34432039.
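The region test in the last step can be sketched in Python as follows. The gene names and coordinates are invented for illustration, and the sketch assumes that "within" means the whole gene lies inside the region; a different reading (any overlap) would need a different comparison.

```python
REGION_CHROMOSOME = "1"
REGION_START = 27230447
REGION_END = 34432039


def in_region(chromosome, start, end):
    """True if the gene lies entirely inside the region of interest."""
    return (
        chromosome == REGION_CHROMOSOME
        and start >= REGION_START
        and end <= REGION_END
    )


# Hypothetical rows: (gene, chromosome, start, end). Not real annotations.
genes = [
    ("GENE_A", "1", 28000000, 28010000),
    ("GENE_B", "1", 40000000, 40005000),
    ("GENE_C", "2", 28000000, 28010000),
]

selected = [gene for gene, chrom, start, end in genes
            if in_region(chrom, start, end)]
```

In Excel the same condition is an IF wrapped around an AND of the three comparisons.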

11 Excel exercise 4: VLOOKUP for miRNA data
miRNA analysis: "What are the properties of the targets of my expressed miRNA?"
- Download the known targets for miRNA from http://microrna.sanger.ac.uk/sequences/
- Load the data into Excel.
- Now download information about the targets from BioMart (EnsMart).
- How many targets are on chromosome 1?

12 What to do when many-many relationships exist?
Excel is not well suited to many-many relationships (without jumping through some hoops).
Solution: use Unix or SQL databases.
Unix solutions:
- grep: looks for lines in a file that contain a specific pattern
  grep -e "NM_001024" filename
  looks for lines containing NM_001024
While we won't teach command line tools, we recommend them for handling large data files: sorting, manipulating and filtering data down to more meaningful datasets that can then be handled in Excel.
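To make the many-many problem concrete, here is a small Python sketch of why a single-result lookup is not enough: one miRNA can have several targets, and one target can be hit by several miRNAs. The identifiers are invented placeholders, and the helper name `group_pairs` is hypothetical.

```python
from collections import defaultdict


def group_pairs(pairs):
    """Collect every value seen for each key, keeping all matches
    instead of only the first one."""
    index = defaultdict(list)
    for key, value in pairs:
        index[key].append(value)
    return dict(index)


# Hypothetical (miRNA, target gene) pairs.
pairs = [
    ("miR-A", "gene1"),
    ("miR-A", "gene2"),
    ("miR-B", "gene1"),
]

targets_by_mirna = group_pairs(pairs)
mirnas_by_target = group_pairs((gene, mirna) for mirna, gene in pairs)
```

A lookup that returns only the first matching row would report one target for miR-A and one miRNA for gene1, silently dropping the rest.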

13 Section 2: Cytoscape

14 PluriNet exercises
Go to www.stemcellmatrix.org and load the Cytoscape visualisation of the PluriNet.
Explore the PluriNet in Cytoscape:
- How many nodes are in the network?
- Export the complete node list, including all node information.
- Colour nodes based on cellular location.

15 Cytoscape exercises cont.
Add additional data to the network from external sources:
- Download some stem cell gene expression data from ArrayExpress and integrate it into the network (do it with your own data if you have it).
Colour nodes based on this external data:
- Gene expression (up/down in a study)

16 Today's exercises
Add additional data to the network from:
- "Characterizing the mouse ES cell transcriptome with Illumina sequencing", Ruben Rosenkranz, Tatiana Borodina, Hans Lehrach and Heinz Himmelbauer
Table 2:
MGI symbol | RefSeq ID | Transcription specific for | No. of reads | Reads/kb
Pou5f1 (Oct4) | NM_013633 | Pluripotent stem cell | 1686 | 1252.60
Nanog | NM_028016 | Pluripotent stem cell | 93 | 68.58
Sox2 | NM_011443 | Pluripotent stem cell | 593 | 241.35
Sox1 | NM_009233 | Ectoderm | 2 | 1.65
Sox17 | NM_011441 | Endoderm | 7 | 2.24
T (Brachyury) | NM_009309 | Mesoderm | 14 | 6.84

17 Today's exercises
Run the nodes in the PluriNet through a DAVID and a GSEA enrichment analysis:
- What pathways do you find?
- What is the difference in pathways between DAVID and GSEA (if any)?
- Find all the genes in the top pathways and add this information back into the Cytoscape network.
- Colour the network based on the pathways from the previous question.
- What other pathways would be informative here?

18 Section 3: Unix
For reference only, no exercises in this section!

19 gawk programming I
Delimiter: what separates columns.
- tab ("\t")
- comma (",")
Set the delimiter to tab with FS="\t"
Column naming:
- $1 = column 1
- $2 = column 2
- ...

20 gawk structure
- BEGIN {}: process these commands before the file
- END {}: process these commands after the file
- /PATTERN/ {COMMANDS}: for each line containing this pattern, do the following COMMANDS

21 gawk programming II
- gawk 'BEGIN {FS = "\t";}//{print $2"\t"$1;}' filename > filename.new
  Swap column 1 and column 2 for all lines (only these two columns are printed)
- /^PREFIX/ {print $0}
  Prints all lines starting with PREFIX
- /^PREFIX/ {print ($1+$2)}
  Add column 1 and column 2 for all lines starting with PREFIX
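For comparison, the first two gawk commands on this slide can be sketched in Python. The function names are hypothetical, and the sketch works on a list of lines in memory rather than on a file.

```python
def swap_first_two_columns(lines, delimiter="\t"):
    """Print column 2 then column 1 for every line, like
    {print $2"\t"$1} with FS set to tab."""
    swapped = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        first = fields[0]
        # A missing second column is empty, as $2 would be in gawk.
        second = fields[1] if len(fields) > 1 else ""
        swapped.append(second + delimiter + first)
    return swapped


def lines_with_prefix(lines, prefix):
    """Keep only the lines that start with `prefix`, like /^PREFIX/."""
    return [line for line in lines if line.startswith(prefix)]
```

For example, `swap_first_two_columns(["a\tb\tc"])` gives `["b\ta"]`: the third column is dropped, just as in the gawk one-liner.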

22 gawk programming III
gawk 'BEGIN {FS = "\t";}/^PREFIX/{sum = sum + $2*$1;}END {print sum;}' filename > filename.new
Print the sum of column 1 * column 2 over all lines starting with PREFIX
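The BEGIN / per-line / END structure of this command maps onto an ordinary loop, sketched here in Python. The function names are hypothetical. One simplification is made: fields that are not numbers count as 0, whereas gawk would use any leading numeric part of the field.

```python
def to_number(text):
    """Convert a field to a number; non-numeric fields count as 0
    (a simplification of gawk's conversion rules)."""
    try:
        return float(text)
    except ValueError:
        return 0.0


def sum_of_products(lines, prefix, delimiter="\t"):
    total = 0.0  # BEGIN: set up before reading any line
    for line in lines:
        # /^PREFIX/: only lines that start with the prefix are processed
        if not line.startswith(prefix):
            continue
        fields = line.rstrip("\n").split(delimiter)
        first = to_number(fields[0])
        second = to_number(fields[1]) if len(fields) > 1 else 0.0
        total += second * first
    return total  # END: report once after the last line
```

With the lines `"2\t3"`, `"2.5\t4"` and `"7\t1"` and the prefix `"2"`, only the first two lines are used and the result is 2*3 + 2.5*4 = 16.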

23 sed
Replace AF00245 with NM_001024 in a file:
sed 's/AF00245/NM_001024/g;' filename > filename.new
- s = substitute. g = global (all matches on a line); omitting the g replaces only the first occurrence on each line.
- You can use pattern matching (example: s/[0-9]//;).
- Special characters such as . * [ and / must be preceded by a \ to be matched literally.
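The effect of the g flag and of escaping can be tried out with Python's `re` module, which uses similar pattern syntax. This is a sketch for experimentation, not a replacement for sed; the wrapper name `substitute` and its `replace_all` parameter are invented here.

```python
import re


def substitute(line, pattern, replacement, replace_all=True):
    """Mimic sed's s/pattern/replacement/ on one line.
    replace_all=True behaves like the g flag; False replaces
    only the first match on the line."""
    count = 0 if replace_all else 1  # in re.sub, count=0 means "all"
    return re.sub(pattern, replacement, line, count=count)


line = "AF00245 AF00245"
all_replaced = substitute(line, "AF00245", "NM_001024")
first_only = substitute(line, "AF00245", "NM_001024", replace_all=False)

# An unescaped . matches any character; \. matches only a literal dot.
versioned = "NM_001024.2 NM_00102432"
unescaped = substitute(versioned, r"NM_001024.2", "X")
escaped = substitute(versioned, r"NM_001024\.2", "X")
```

Comparing `unescaped` with `escaped` shows why the dot needs a backslash: without it, the second identifier is replaced as well.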

24 Combining statements with | ("pipe")
Compound statements:
gawk '//{print $2"\t"$1;}' filename | sed -E 's/Acc1/Acc2/g;s/\.[0-9]+//g;' > filename.new
- | = pipe (pass the result to the next program on the command line).
- [0-9]+ matches numbers (a run of one or more characters between 0 and 9). The + needs sed's extended syntax (-E); in basic syntax GNU sed writes it as \+. [a-z] and [A-Z] are other examples of character ranges.
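The idea of a pipe, where each program consumes the output of the previous one, can be sketched in Python with generators. The function names are hypothetical, and the sketch assumes every input line has at least two whitespace-separated columns.

```python
import re


def swap_columns(lines):
    """Stage 1: print column 2, a tab, then column 1. Columns are split
    on whitespace, which is gawk's default when FS is not set."""
    for line in lines:
        fields = line.split()
        yield fields[1] + "\t" + fields[0]


def clean_accessions(lines):
    """Stage 2: rename Acc1 to Acc2, then strip version suffixes
    such as .12 (a dot followed by one or more digits)."""
    for line in lines:
        line = re.sub("Acc1", "Acc2", line)
        yield re.sub(r"\.[0-9]+", "", line)


def pipeline(lines):
    """Feed the output of stage 1 straight into stage 2, like |."""
    return clean_accessions(swap_columns(lines))


result = list(pipeline(["Acc1.12 5", "Acc3.2 7"]))
```

Each stage sees one line at a time, so nothing has to be held in memory in full; this is the same property that makes shell pipes practical for large files.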

25 The bottom line
For analysis of small to mid-range datasets it is worthwhile to learn:
- Excel
For extensive data analysis it is worthwhile to learn:
- Unix
- SQL
Where to find more information?
- Excel online tutorials
- Programming in gawk/perl/sed/python/ruby
- Bioinformatics links / primers

