Outline
–Quick review of GS
–Current problems with GS
–Our solutions
–Future work
–Discussion …
A quick review of the Gene Summarizer (1)
Six pre-defined generic aspects for summarizing genes:
–GP (Gene Product): describing the product (protein, rRNA, etc.) of the target gene;
–EL (Expression Location): describing where the target gene is mainly expressed;
–SI (Sequence Information): describing the sequence information of the target gene and its product;
–WFPI (Wild-Type Function & Phenotypic Information): describing the wild-type functions and the phenotypic information about the target gene and its product;
–MP (Mutant Phenotype): describing the mutant phenotypes of the target gene;
–GI (Genetic Interaction): describing the genetic interactions of the target gene with other molecules.
A quick review of the Gene Summarizer (2)
Two-stage approach for generating a semi-structured gene summary
–Retrieving sentences about a gene (keyword match/NER: precision, recall)
–Extracting sentences for each specified semantic aspect (categorization: training sentences, classification algorithm)
A quick review of the Gene Summarizer (3)
Potentially an Entity Summarizer? Yes.
Ex. Pathway Summarizer
–Define categories: (1) pathway type (Metabolism, Genetic Information Processing, Environmental Information Processing, Human Diseases, …); (2) molecules involved; (3) molecular interactions/reactions/relations; …
–Collect example sentences for each category
[Diagram: Gene Summarizer; Special entities + Pre-defined categories; Entity Summarizer]
Current Problems (1)
Keyword match: precision is not high
–Ex. Screening a cDNA library prepared from silk-producing glands of the black widow spider,…
NER component: recall is not high
–Ex. …beta-alanine biosynthesis is regulated by black.
–Take advantage of known synonyms
–A simpler problem than NER, as we already know the gene name and its synonyms
Current Problems (2)
Categorization component: no high-quality example sentences
–Training sentences are mostly downloaded automatically from model organism databases (FlyBase, SGD) and are not really sentences from the literature.
–Classification algorithms can be further improved.
Irrelevant gene mentions
–Some popular gene names are frequently mentioned in abstracts for comparison/reference purposes.
Our Solutions (1)
Strategies used for V1's keyword match
–Recall: use dictionary-based keyword retrieval to retrieve all documents containing any synonym of the target gene
–Precision: require retrieved abstracts to contain at least one long synonym.
Ex. …SmZF1 binds both ds and ss DNA oligonucleotides,…
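The two V1 rules can be illustrated with a minimal sketch. The slides do not say how long a "long" synonym is, so the `min_long_len` threshold below is an assumption, as are the function names and the whole-token matching rule.

```python
import re

def contains_term(text, term):
    """Case-sensitive match of `term` that is not embedded in a longer token."""
    pattern = r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])"
    return re.search(pattern, text) is not None

def retrieve(abstracts, synonyms, min_long_len=5):
    """Return abstracts that pass both the recall and the precision rule.

    Recall rule: the abstract mentions any synonym of the target gene.
    Precision rule: it also mentions at least one synonym with
    min_long_len or more characters (threshold is an assumption).
    """
    long_synonyms = [s for s in synonyms if len(s) >= min_long_len]
    hits = []
    for abstract in abstracts:
        # Recall step: dictionary-based keyword retrieval.
        if not any(contains_term(abstract, s) for s in synonyms):
            continue
        # Precision step: keep only abstracts with a long synonym.
        if any(contains_term(abstract, s) for s in long_synonyms):
            hits.append(abstract)
    return hits
```

Short synonyms still contribute sentences once an abstract is accepted; the long synonym only serves as evidence that the abstract is about the target gene.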
Our Solutions (2)
NER component for V2.
–Recall: require an exact match, but allow the match to cross tag boundaries
–Ex. Query: ABC-a
xxxxxxxxxxxxxx ABC a gene xxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxx ABC a xxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxx ABC a encodes xxxxxxxxxxxx xxx gene ABC a xxxxxxxxxxxxxxx
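One way to read "exact match crossing the tag boundary" is sketched below. It assumes the text has already been split into (token, tag) pairs by some tagger and that the query is tokenized on non-alphanumeric characters, so "ABC-a" becomes the token sequence ABC, a. Both assumptions and all names are illustrative; the slides do not describe the actual matching code.

```python
import re

def tokenize(text):
    """Split on anything that is not a letter or a digit."""
    return re.findall(r"[A-Za-z0-9]+", text)

def find_mentions(tagged_tokens, query):
    """Return start indices where the query's tokens occur contiguously.

    Tags are deliberately ignored, so a mention is found even when the
    tagger put a boundary in the middle of the gene name.
    """
    query_tokens = tokenize(query)
    if not query_tokens:
        return []
    words = [token for token, _tag in tagged_tokens]
    n = len(query_tokens)
    return [i for i in range(len(words) - n + 1)
            if words[i:i + n] == query_tokens]
```

For example, with "ABC" tagged as a gene and "a" tagged as an ordinary word, `find_mentions(tagged, "ABC-a")` still reports the mention, because only the token sequence is compared.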
Our Solutions (3)
NER component for V3.
–Recall: take advantage of known synonyms, i.e., expand the query with them to improve recall.
–Ex. Coexpression of Ss and Tgo in Drosophila SL2 cells…
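Query expansion with known synonyms might look like the following sketch. The synonym dictionary, its entries, and the function names are illustrative assumptions, not the dictionary the GS actually uses.

```python
import re

def expand_query(gene, synonym_dict):
    """Return the gene name plus its known synonyms, without duplicates."""
    terms = []
    for term in [gene] + synonym_dict.get(gene, []):
        if term not in terms:
            terms.append(term)
    return terms

def build_pattern(terms):
    """Compile one regex matching any term as a stand-alone token."""
    # Longest terms first, so a long name wins over a prefix of itself.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"(?<![A-Za-z0-9])(?:" + alternatives + r")(?![A-Za-z0-9])")

# Illustrative dictionary entry (an assumption for this sketch).
synonym_dict = {"spineless": ["ss", "Ss"]}
sentence = "Coexpression of Ss and Tgo in Drosophila SL2 cells"

unexpanded = build_pattern(["spineless"])
expanded = build_pattern(expand_query("spineless", synonym_dict))
```

Here `unexpanded.search(sentence)` finds nothing, while `expanded.search(sentence)` matches the token "Ss"; the "ss" inside "Coexpression" is not matched because it is embedded in a longer word.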
Future Work (1)
Further improvement of the gene name recognition component
–A simpler problem than NER, as we already know the gene name and its synonyms
–Build a classifier that focuses on contextual features to identify false gene mentions
Ex. The purpose of this study was to investigate the black gene, and protein,…
Ex. Screening a cDNA library prepared from silk-producing glands of the black widow spider,…
–Only use contextual features, because the term/phrase already matches a gene name
–Need some “good” negative examples (ambiguous gene names) in the training data
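As a rough illustration of what "contextual features" could mean here, the sketch below collects the words around a candidate mention and leaves the mention itself out. The window size, feature names, and padding symbols are assumptions; the slides do not specify a feature set or a classifier.

```python
def context_features(tokens, index, window=2):
    """Describe the words around tokens[index], excluding the mention itself.

    Features are keyed L1, L2, ... (left) and R1, R2, ... (right);
    positions outside the sentence are padded with <s> and </s>.
    """
    features = {}
    for offset in range(1, window + 1):
        left = index - offset
        right = index + offset
        features[f"L{offset}"] = tokens[left].lower() if left >= 0 else "<s>"
        features[f"R{offset}"] = (
            tokens[right].lower() if right < len(tokens) else "</s>"
        )
    return features

true_mention = "to investigate the black gene".split()
false_mention = "glands of the black widow spider".split()

print(context_features(true_mention, true_mention.index("black")))
print(context_features(false_mention, false_mention.index("black")))
```

The two examples from the slide differ only in their right context ("gene" versus "widow spider"), which is the kind of signal a classifier trained on such features could pick up.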
Future Work (2)
Enable synonym selection by the user
–Use synonyms for potential query expansion, to increase recall
–Allow the user to choose which synonyms, for which species, are searched, to control precision.
–Ex. “black” is also used as a synonym for the D. ananassae “ebony” gene, so GS will extract all the sentences mentioning “ebony” when summarizing “black”.
Future Work (3)
Integrate gene mention relevance into the sentence ranking.
–Some popular gene names are frequently mentioned in abstracts for comparison/reference purposes.
–Ex. It encodes a protein predicted to contain 688 amino acid residues, including 11 zinc finger motifs of the C-2H-2 type in the C-terminal region, that are Kruppel-like in the conservation of the H/C link sequence connecting them.
–The number of occurrences of the query gene in a retrieved abstract gives some indication of how relevant that abstract is to the query gene.
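A minimal sketch of how mention counts could feed into ranking follows. The slides only say the count "somehow" indicates relevance, so the logarithmic weighting and the multiplication with the aspect score are assumptions chosen for illustration, as are the function names.

```python
import math
import re

def mention_count(abstract, terms):
    """Count stand-alone occurrences of any query term in the abstract."""
    if not terms:
        return 0
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    pattern = re.compile(
        r"(?<![A-Za-z0-9])(?:" + alternatives + r")(?![A-Za-z0-9])"
    )
    return len(pattern.findall(abstract))

def weighted_score(aspect_score, count):
    """Scale a sentence's aspect score by its abstract's mention count.

    log1p dampens the effect, so ten mentions do not count ten times
    as much as one (an assumed design choice, not from the slides).
    """
    return aspect_score * math.log1p(count)
```

Under this weighting, a sentence from an abstract that names the query gene only once, for comparison, ranks below an equally scored sentence from an abstract that discusses the gene throughout.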
Future Work (4)
Speed up the GS for huge collections, like the entire Medline (needs ~3 mins)!
–Three major steps for the GS:
a) Get sentence IDs for the query gene
b) Compute relevance scores for all six aspects
c) Retrieve original sentences for result display
–Step (a) can be precomputed for all genes in our dictionary.
–Step (b) can be precomputed for all sentences in our collection.
–Only step (c) has to be done on-the-fly.
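The split between precomputed and on-the-fly work can be sketched as follows. The in-memory dictionaries, the `mentions` and `score` callables, and the `top_k` parameter are assumptions for illustration; a Medline-scale system would presumably use persistent indexes instead.

```python
from collections import defaultdict

ASPECTS = ("GP", "EL", "SI", "WFPI", "MP", "GI")

def build_gene_index(sentences, gene_terms, mentions):
    """Step (a), offline: map each gene to the IDs of sentences mentioning it."""
    index = defaultdict(list)
    for sid, text in sentences.items():
        for gene, terms in gene_terms.items():
            if mentions(text, terms):
                index[gene].append(sid)
    return dict(index)

def build_score_table(sentences, score):
    """Step (b), offline: score every sentence for all six aspects."""
    return {sid: {aspect: score(text, aspect) for aspect in ASPECTS}
            for sid, text in sentences.items()}

def summarize(gene, sentences, gene_index, score_table, top_k=1):
    """Step (c), on-the-fly: look up, rank, and fetch sentences for display."""
    sids = gene_index.get(gene, [])
    summary = {}
    for aspect in ASPECTS:
        ranked = sorted(sids, key=lambda sid: score_table[sid][aspect],
                        reverse=True)
        summary[aspect] = [sentences[sid] for sid in ranked[:top_k]]
    return summary
```

At query time only dictionary lookups and one sort per aspect remain, which is where the expected speed-up would come from.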
Future Work (5)
Enable summarization over any dynamically generated collection
–Users may only be interested in seeing summarized information about a honeybee gene in a behavior-related collection. E.g., what has been discovered about this gene in relation to mating behavior?
–Add a filter between steps (a) and (b): only return sentences within the current collection for summarization.
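The proposed filter amounts to intersecting the gene's sentence IDs with the IDs that belong to the user's collection. A minimal sketch, with hypothetical names and assuming both sides use the same sentence ID scheme:

```python
def filter_to_collection(sentence_ids, collection_ids):
    """Keep only the sentence IDs that belong to the current collection.

    The original order of sentence_ids is preserved, so any ranking or
    document order computed upstream survives the filter.
    """
    allowed = set(collection_ids)
    return [sid for sid in sentence_ids if sid in allowed]

# Sentence IDs from step (a) for the query gene (toy values).
gene_sentence_ids = [1, 2, 3, 5]
# IDs of sentences in the user's behavior-related collection (toy values).
behavior_collection = {2, 5, 9}

print(filter_to_collection(gene_sentence_ids, behavior_collection))
```

Because the filter only touches ID lists, it can sit between a precomputed step (a) and a precomputed step (b) without invalidating either.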
Discussions …
Thoughts? Questions?