PaLS: Pathways and Literature Strainer
Filtering common literature, ontology terms and pathway information
Andrés Cañada Pallarés
Instituto Nacional de Bioinformática
-Studies of differential expression and, especially, gene selection in the context of classification and prediction with microarray data usually output lists of “interesting genes”.
-Do some members of those lists have a function in common, or do they belong to the same metabolic pathway?
-PaLS takes a list, or a set of lists, of gene or protein identifiers and shows which ones share certain descriptors.
-Variable selection with microarray data (where number of variables >> number of samples) can lead to many solutions. Different runs of the same algorithm often return different lists of “interesting genes”, which is a problem for the interpretability of the results.
-PaLS lets us try to discover the major biological themes shared among different solutions, even if the identity of the genes in each solution is different.
Main input file (plain text):
#Run.1.component.1 NM_ NM_ NM_ NM_ NM_ NM_ NM_ NM_
#Run.2.component.1 NM_ NM_ NM_ NM_ NM_ NM_
-A list, or several lists, of genes/proteins
-Each list can have its own name
-Types of identifiers accepted:
  -Ensembl Gene IDs
  -UniGene Cluster IDs
  -Gene names (HUGO)
  -GenBank accessions
  -Clone IDs
  -Affymetrix IDs
  -EntrezGene IDs
  -RefSeq_RNAs
  -RefSeq_peptides
  -SwissProt Names
-Organisms accepted:
  -Human
  -Mouse
  -Rat
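As an illustration of the input format, here is a minimal Python sketch of a reader for such a file. It is not PaLS code. It assumes that a token starting with "#" opens a new named list and that every other whitespace-separated token is an identifier belonging to the most recent list. The function name `parse_gene_lists`, the `default_name` fallback and the identifiers in the usage note are hypothetical.

```python
def parse_gene_lists(text, default_name="unnamed"):
    """Split input text into named lists of identifiers.

    Assumption (not confirmed by the slides): a token starting with '#'
    names a new list; all following tokens are identifiers of that list.
    Duplicate identifiers within a list are kept only once.
    """
    lists = {}
    current = None
    for token in text.split():
        if token.startswith("#"):
            # Header token: start (or resume) the list with this name.
            current = token[1:] or default_name
            lists.setdefault(current, [])
        else:
            if current is None:
                # Identifiers appearing before any header go to a default list.
                current = default_name
                lists.setdefault(current, [])
            if token not in lists[current]:
                lists[current].append(token)
    return lists
```

For example, `parse_gene_lists("#Run.1\nID_A\nID_B\n#Run.2\nID_B ID_C")` yields one entry per list name, each holding its identifiers in input order.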
-PaLS has three different methods of filtering annotations:
  1.- Filter the descriptors referenced by more than a given percentage of identifiers, giving results for each list separately. Intended to discern which lists have common published information showing that their genes/proteins share a similar function.
  2.- Group all lists into one list (removing duplicates) and display the descriptors most referenced in the global list. Intended to reveal commonalities even if they are not seen within each list.
  3.- Look for the descriptors that are referenced by more than a given threshold of identifiers in more than a given percentage of lists. Intended to find commonalities present within and among sets of lists.
-Threshold values are part of the required input. They default to 50%.
-Lower values are suggested.
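The three filtering methods can be sketched in Python as follows. This is a reading of the descriptions above, not the PaLS implementation. It assumes that annotations are available as a mapping from identifier to a set of descriptors, that percentages are computed over all identifiers in a list (annotated or not), and that "more than" means a strict comparison. All function and variable names are hypothetical.

```python
def descriptor_fraction(genes, annotations):
    """Fraction of the (de-duplicated) genes that reference each descriptor."""
    genes = list(dict.fromkeys(genes))  # remove duplicates, keep order
    counts = {}
    for gene in genes:
        for descriptor in set(annotations.get(gene, ())):
            counts[descriptor] = counts.get(descriptor, 0) + 1
    return {d: c / len(genes) for d, c in counts.items()} if genes else {}


def filter_within_lists(lists, annotations, threshold):
    """Method 1: filter each list separately."""
    return {
        name: sorted(d for d, f in descriptor_fraction(genes, annotations).items()
                     if f > threshold)
        for name, genes in lists.items()
    }


def filter_global(lists, annotations, threshold):
    """Method 2: merge all lists into one, then filter the merged list."""
    merged = [gene for genes in lists.values() for gene in genes]
    fractions = descriptor_fraction(merged, annotations)
    return sorted(d for d, f in fractions.items() if f > threshold)


def filter_within_and_among(lists, annotations, gene_threshold, list_threshold):
    """Method 3: descriptors passing gene_threshold in more than
    list_threshold of the lists."""
    per_list = filter_within_lists(lists, annotations, gene_threshold)
    counts = {}
    for descriptors in per_list.values():
        for descriptor in descriptors:
            counts[descriptor] = counts.get(descriptor, 0) + 1
    n_lists = len(lists)
    return sorted(d for d, c in counts.items() if c / n_lists > list_threshold)
```

Thresholds are given as fractions, so the 50% default corresponds to `threshold=0.5`. The per-list result of the first method is reused by the third, which is why changing only the list-level threshold is cheap in this sketch.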
-For lists of fewer than 100 nodes, graph plots describing the data structure of the lists are created. These plots show the genes/proteins that share at least one descriptor; the more descriptors they share, the closer they appear.
-The output consists of lists of the descriptors that fulfil the threshold criteria selected by the user. Every input identifier related to each descriptor is linked to IDClight, to present the user with as much information as possible.
-The most time-consuming step is the first search. After that, the user can change the thresholds for each type of descriptor and filtering method and obtain an answer in a short time (Redo Analysis button, see figure later).
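The structure behind such a plot can be sketched as a weighted edge list: two genes are connected when they share at least one descriptor, and the weight is the number of shared descriptors. The Python sketch below only builds that edge list. How PaLS actually lays out and draws the graph is not described here, and the function name, the `max_nodes` parameter and the annotation mapping are hypothetical.

```python
from itertools import combinations


def shared_descriptor_edges(annotations, max_nodes=100):
    """Return {(gene_a, gene_b): number_of_shared_descriptors}.

    Returns None when the list has max_nodes or more genes, mirroring the
    'fewer than 100 nodes' condition. A layout step could then place genes
    with higher weights closer together.
    """
    genes = sorted(annotations)
    if len(genes) >= max_nodes:
        return None
    edges = {}
    for a, b in combinations(genes, 2):
        shared = set(annotations[a]) & set(annotations[b])
        if shared:  # keep only pairs sharing at least one descriptor
            edges[(a, b)] = len(shared)
    return edges
```

Genes that share no descriptor with any other gene simply do not appear in the edge list.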
-Data set from van’t Veer et al. (Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871))
-Lists of genes obtained using our CNIO application SignS (Díaz-Uriarte, R.)
-At the 50% threshold, GO terms in most lists refer to “nucleus”.
-At the 40% threshold, the term “cell cycle” appears in several of the lists. As reported in the original van’t Veer et al. paper, genes involved in cell cycle are upregulated in the poor prognosis signature.
-At the 20% threshold, the term “mitosis” appears in most of the lists.
-If we examine the PaLS results from Reactome at the 20% threshold, we see “Cell Cycle, Mitotic” in most of the lists.
-The list “6th cross-validation run” shows “E2F mediated regulation of DNA replication”.
-Ramón Díaz-Uriarte. Structural Biology and Biocomputing. CNIO
-Andreu Alibés. EMBL-CRG Systems Biology Unit
-Edward R. Morrissey. Systems Biology DTC. University of Warwick