BeeSpace Informatics Research

Presentation transcript:

BeeSpace Informatics Research
ChengXiang (“Cheng”) Zhai
Department of Computer Science, Institute for Genomic Biology, Statistics, Graduate School of Library & Information Science
University of Illinois at Urbana-Champaign
BeeSpace Workshop, May 22, 2009

Overview of BeeSpace Technology
[Architecture diagram, top to bottom]
Users …
Task Support: Gene Summarizer, Function Annotator
Space Navigation: Space/Region Manager, Navigation Support
Search Engine, Text Miner, Relational Database
Words/Phrases, Entities
Content Analysis: Natural Language Understanding
Meta Data, Literature Text

Part 1: Content Analysis

Natural Language Understanding
“…We have cloned and sequenced a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and examined its responses to …”
[Figure: the sentence annotated with NP and VP phrase labels and two Gene tags]

Sample Technique 1: Automatic Gene Recognition
Syntactic clues:
  Capitalization (especially acronyms)
  Numbers (gene families)
  Punctuation: -, /, :, etc.
Contextual clues:
  Local: surrounding words such as “gene”, “encoding”, “regulation”, “expressed”, etc.
  Global: the same noun phrase occurs several times in the same article
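The clues listed on this slide can be turned into binary features quite directly. The sketch below is only an illustration: the function name `clue_features`, the window size, and the use of a repeated token as a stand-in for a repeated noun phrase are assumptions, not part of the BeeSpace system.

```python
import re

# Local context words taken from the slide; the set itself is illustrative.
CONTEXT_WORDS = {"gene", "encoding", "regulation", "expressed"}

def clue_features(tokens, i, window=2):
    """Return binary clue features for the candidate token at position i."""
    tok = tokens[i]
    feats = {
        # Syntactic clues
        "starts_with_capital": tok[:1].isupper(),
        "is_acronym": len(tok) > 1 and tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "has_punct": bool(re.search(r"[-/:]", tok)),
    }
    # Local contextual clue: a trigger word within `window` tokens
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    neighbors = {t.lower() for j, t in enumerate(tokens[lo:hi], lo) if j != i}
    feats["context_word_nearby"] = bool(neighbors & CONTEXT_WORDS)
    # Global contextual clue (approximation): the token recurs in the text
    feats["repeated_in_article"] = tokens.count(tok) > 1
    return feats

tokens = "that CD38 is expressed by both neurons".split()
print(clue_features(tokens, 1))
```

Features of this kind would then serve as the f_i(x, y) inputs of a classifier such as the maximum entropy model.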

Maximum Entropy Model for Gene Tagging
Given an observation (a token or a noun phrase) together with its context, denoted as x
Predict y ∈ {gene, non-gene}
Maximum entropy model: $P(y \mid x) = K \exp\left(\sum_i \lambda_i f_i(x, y)\right)$
Typical f:
  y = gene & candidate phrase starts with a capital letter
  y = gene & candidate phrase contains digits
Estimate $\lambda_i$ with training data
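As a minimal sketch of the formula on this slide, the code below evaluates P(y|x) for the two example features. The weights and helper names (`maxent_prob`, `f_capital`, `f_digit`) are made up for illustration; in practice the weights are estimated from training data.

```python
import math

def f_capital(x, y):
    """Fires when y = gene and the candidate starts with a capital letter."""
    return 1.0 if y == "gene" and x[:1].isupper() else 0.0

def f_digit(x, y):
    """Fires when y = gene and the candidate contains a digit."""
    return 1.0 if y == "gene" and any(c.isdigit() for c in x) else 0.0

def maxent_prob(weights, features, x, labels=("gene", "non-gene")):
    """Compute P(y|x) = K * exp(sum_i lambda_i * f_i(x, y)) for each label."""
    unnorm = {
        y: math.exp(sum(w * f(x, y) for w, f in zip(weights, features)))
        for y in labels
    }
    z = sum(unnorm.values())  # K in the slide's formula equals 1 / z
    return {y: s / z for y, s in unnorm.items()}

# Hypothetical weights, not estimated from any data.
print(maxent_prob([1.0, 0.5], [f_capital, f_digit], "CD38"))
```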

Domain overfitting problem
When a learning-based gene tagger is applied to a domain different from its training domain(s), performance tends to drop significantly.
The same problem occurs in other types of text, e.g., named entities in news articles.
Training domain / Test domain / F1: mouse 0.541; fly 0.281; Reuters 0.908; WSJ 0.643

Observation I
Overemphasis on domain-specific features in the trained model
Fly examples: wingless, daughterless, eyeless, apexless, …
The feature “suffix -less” is weighted high in the model trained on fly data

Observation II
Generalizable features: generalize well in all domains
  “…decapentaplegic and wingless are expressed in analogous patterns in each primordium of…” (fly)
  “…that CD38 is expressed by both neurons and glial cells…that PABPC5 is expressed in fetal brain and in a range of adult tissues.” (mouse)
The feature “$w_{i+2}$ = expressed” is generalizable

Generalizability-based feature ranking
[Figure: the training data is split by domain (fly, mouse, D3, …, Dm); features such as “-less” and “expressed” are ranked separately within each domain (ranks 1 to 8 shown), and the per-domain ranks are combined into a single generalizability-based ranking; example scores shown: 0.125, 0.167]
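One way to combine per-domain rankings into a generalizability score is sketched below. The scoring rule used here (the minimum over domains of 1/rank, so a feature scores well only if it ranks high everywhere) is an assumption chosen for illustration, and the rankings are toy data; the slide does not state the exact formula.

```python
def generalizability_scores(domain_rankings):
    """Rank features by how well they rank in *every* domain.

    domain_rankings: list of feature lists, best feature first, one per domain.
    Assumed score: min over domains of 1/rank (0 if absent from a domain).
    """
    features = set().union(*domain_rankings)
    scores = {}
    for f in features:
        reciprocal_ranks = [
            1.0 / (ranking.index(f) + 1) if f in ranking else 0.0
            for ranking in domain_rankings
        ]
        scores[f] = min(reciprocal_ranks)
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda f: (-scores[f], f))
    return ranked, scores

# Toy per-domain rankings (hypothetical, not from the BioCreAtIvE data).
fly = ["-less", "expressed", "a"]
mouse = ["expressed", "a", "-less"]
ranked, scores = generalizability_scores([fly, mouse])
print(ranked)
```

Under this rule a domain-specific feature such as “-less” is pushed down because it ranks poorly outside the fly domain, while “expressed” stays near the top.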

Adapting Biological Named Entity Recognizer
[Figure: pipeline from training data (domains D1, …, Dm) and test data (T1, …, Tm) to an entity recognizer]
Individual domain feature ranking (O1, …, Om) identifies domain-specific features
Feature re-ranking (O’) identifies generalizable features
Feature selection: top d0 features for D0, top d1 features for D1, …, top dm features for Dm
$d = \lambda_0 d_0 + (1 - \lambda_0)(\lambda_1 d_1 + \dots + \lambda_m d_m)$
The d selected features and the weights $\lambda_0, \lambda_1, \dots, \lambda_m$ are used for learning the entity recognizer, which is then applied in testing

Effectiveness of Domain Adaptation
Exp      Method     Precision  Recall   F1
F+M→Y    Baseline   0.557      0.466    0.508
         Domain     0.575      0.516    0.544
         % Imprv.   +3.2%      +10.7%   +7.1%
F+Y→M    Baseline   0.571      0.335    0.422
         Domain     0.582      0.381    0.461
         % Imprv.   +1.9%      +13.7%   +9.2%
M+Y→F    Baseline   0.583      0.097    0.166
         Domain     0.591      0.139    0.225
         % Imprv.   +1.4%      +43.3%   +35.5%
Text data from BioCreAtIvE (Medline), 3 organisms (Fly, Mouse, Yeast)
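The “% Imprv.” figures are consistent with relative gains over the baseline, (adapted - baseline) / baseline. A short check using the F1 values from the slide (the function name is illustrative):

```python
def relative_improvement(baseline, adapted):
    """Relative gain of the adapted score over the baseline, in percent."""
    return 100.0 * (adapted - baseline) / baseline

# (baseline F1, domain-adapted F1) per experiment, copied from the slide.
f1_scores = {
    "F+M→Y": (0.508, 0.544),
    "F+Y→M": (0.422, 0.461),
    "M+Y→F": (0.166, 0.225),
}
for exp, (base, adapted) in f1_scores.items():
    print(f"{exp}: {relative_improvement(base, adapted):+.1f}%")
```

The largest relative gain appears on the fly test set, where the baseline recall is lowest.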

Gene Recognition in V3
A variation of the basic maximum entropy model
Classes: {Begin, Inside, Outside}
Features: syntactic features, POS tags, class labels of the previous two tokens
Post-processing to exploit global features
Leverage existing toolkit: BMR
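With {Begin, Inside, Outside} classes, gene mentions are recovered by grouping a Begin token with the Inside tokens that follow it. The decoder below is a generic sketch of that step; `decode_bio` and the example tags are hypothetical, not output of the V3 tagger.

```python
def decode_bio(tokens, tags):
    """Group tokens tagged Begin/Inside into entity mentions."""
    spans, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "Begin":
            if current:                      # close the previous mention
                spans.append(" ".join(current))
            current = [tok]
        elif tag == "Inside" and current:
            current.append(tok)              # extend the open mention
        else:
            # Outside, or a stray Inside with no open mention
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = ["encoding", "Apis", "mellifera", "ultraspiracle", "(", "AMUSP", ")"]
tags = ["Outside", "Begin", "Inside", "Inside", "Outside", "Begin", "Outside"]
print(decode_bio(tokens, tags))
```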

Part 2: Navigation Support

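Space-Region Navigation
[Figure: literature spaces (“My Spaces”: Bee, Bird, Fly, Behavior, …) and topic regions (“My Regions/Topics”: Bee Forager, Fly Rover, Bird Singing, …). EXTRACT arrows lead from spaces to topic regions, MAP arrows lead from topic regions to spaces, SWITCHING moves between spaces, and both spaces and regions can be combined with Intersection, Union, …]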

MAP: Topic/Region → Space
MAP: Use the topic/region description as a query to search a given space
Retrieval algorithm:
  Query word distribution: p(w|Q)
  Document word distribution: p(w|D)
  Score a document based on the similarity of Q and D
Leverage existing retrieval toolkits: Lemur/Indri
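Scoring a document by the similarity of p(w|Q) and p(w|D) is commonly done with the KL divergence between the two distributions. The sketch below assumes that choice, along with Dirichlet-prior smoothing of the document model against the whole space; the function name, the value of `mu`, and the toy documents are illustrative, and this is not the Lemur/Indri API.

```python
import math
from collections import Counter

def score_documents(query_model, docs, mu=2000.0):
    """Rank documents by sum_w p(w|Q) * log p(w|D).

    This is rank-equivalent to the negative KL divergence KL(Q || D).
    p(w|D) is Dirichlet-smoothed with the collection model (an assumption).
    """
    doc_counts = [Counter(doc) for doc in docs]
    collection = Counter()
    for counts in doc_counts:
        collection.update(counts)
    collection_len = sum(collection.values())

    results = []
    for idx, counts in enumerate(doc_counts):
        doc_len = sum(counts.values())
        score = 0.0
        for w, p_q in query_model.items():
            p_c = collection.get(w, 0) / collection_len
            if p_c == 0.0:
                continue  # word absent from the whole space: same for all docs
            p_d = (counts.get(w, 0) + mu * p_c) / (doc_len + mu)
            score += p_q * math.log(p_d)
        results.append((score, idx))
    return sorted(results, reverse=True)

docs = [["foraging", "nectar", "food"], ["actin", "filaments", "muscle"]]
topic = {"foraging": 0.6, "food": 0.4}   # topic description used as the query
print(score_documents(topic, docs, mu=10.0))
```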

EXTRACT: Space → Topic/Region
Assume k topics, each represented by a word distribution
Use a k-component mixture model to fit the documents in a given space (EM algorithm)
The estimated k component word distributions are taken as the k topic regions
Likelihood, maximum likelihood estimator, Bayesian estimator: [equations not preserved in the transcript]
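The k-component mixture fit can be sketched with a plain EM loop. This is a generic, PLSA-style illustration under simplifying assumptions (no background model, random initialization, a fixed number of iterations); the function name and data layout are hypothetical rather than taken from BeeSpace.

```python
import random

def extract_topics(docs, k, iters=50, seed=0):
    """Fit a k-topic mixture with EM.

    docs: list of {word: count} dicts.
    Returns (pi, theta): per-document topic weights and per-topic
    word distributions p(w | theta_j).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    theta = []
    for _ in range(k):  # random initial word distributions
        raw = {w: rng.random() + 0.1 for w in vocab}
        total = sum(raw.values())
        theta.append({w: v / total for w, v in raw.items()})
    pi = [[1.0 / k] * k for _ in docs]

    for _ in range(iters):
        new_theta = [dict.fromkeys(vocab, 0.0) for _ in range(k)]
        new_pi = [[0.0] * k for _ in docs]
        for d, doc in enumerate(docs):
            for w, c in doc.items():
                # E-step: p(z = j | d, w) is proportional to pi[d][j] * theta[j][w]
                post = [pi[d][j] * theta[j][w] for j in range(k)]
                z = sum(post) or 1.0
                for j in range(k):
                    expected = c * post[j] / z
                    new_theta[j][w] += expected
                    new_pi[d][j] += expected
        # M-step: renormalize the expected counts
        theta = []
        for j in range(k):
            total = sum(new_theta[j].values()) or 1.0
            theta.append({w: v / total for w, v in new_theta[j].items()})
        pi = [[v / (sum(row) or 1.0) for v in row] for row in new_pi]
    return pi, theta

docs = [{"nectar": 5, "food": 5}, {"nectar": 4, "food": 6},
        {"actin": 5, "muscle": 5}, {"actin": 6, "muscle": 4}]
pi, theta = extract_topics(docs, k=2)
for topic in theta:
    print(sorted(topic.items(), key=lambda kv: -kv[1])[:2])
```

Each returned word distribution plays the role of one topic region.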

A Sample Topic & Corresponding Space
Meaningful labels: actin filaments; flight muscle; flight muscles
Word distribution (language model):
  filaments 0.0410238
  muscle 0.0327107
  actin 0.0287701
  z 0.0221623
  filament 0.0169888
  myosin 0.0153909
  thick 0.00968766
  thin 0.00926895
  sections 0.00924286
  er 0.00890264
  band 0.00802833
  muscles 0.00789018
  antibodies 0.00736094
  myofibrils 0.00688588
  flight 0.00670859
  images 0.00649626
Example documents:
  actin filaments in honeybee-flight muscle move collectively
  arrangement of filaments and cross-links in the bee flight muscle z disk by image analysis of oblique sections
  identification of a connecting filament protein in insect fibrillar flight muscle
  the invertebrate myosin filament subfilament arrangement of the solid filaments of insect flight muscles
  structure of thick filaments from insect flight muscle

Incorporating Topic Priors
Applies to either topic extraction or clustering
User exploration: the user usually has a preference, e.g., wants one topic/cluster to be about foraging behavior
Use a prior to guide topic extraction
Prior as a simple language model, e.g., forage 0.2; foraging 0.3; food 0.05; etc.

Incorporating a Topic Prior
Original EM vs. EM with prior: [equations not preserved in the transcript]
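A common way to write the two updates is given below. It assumes a conjugate (Dirichlet) prior on each topic's word distribution, so the prior language model acts as pseudo-counts added in the M-step. The notation is an assumption for illustration, not taken from the slide: $c(w,d)$ is the count of word $w$ in document $d$, $\pi_{d,j}$ the weight of topic $j$ in $d$, $p(w \mid \theta'_j)$ the prior language model for topic $j$, and $\mu$ the prior strength.

```latex
% E-step (shared by both variants)
p(z_{d,w} = j) =
  \frac{\pi_{d,j}\, p^{(n)}(w \mid \theta_j)}
       {\sum_{j'=1}^{k} \pi_{d,j'}\, p^{(n)}(w \mid \theta_{j'})}

% M-step, original EM (maximum likelihood)
p^{(n+1)}(w \mid \theta_j) =
  \frac{\sum_{d} c(w,d)\, p(z_{d,w} = j)}
       {\sum_{w'} \sum_{d} c(w',d)\, p(z_{d,w'} = j)}

% M-step, EM with prior (maximum a posteriori)
p^{(n+1)}(w \mid \theta_j) =
  \frac{\sum_{d} c(w,d)\, p(z_{d,w} = j) + \mu\, p(w \mid \theta'_j)}
       {\sum_{w'} \sum_{d} c(w',d)\, p(z_{d,w'} = j) + \mu}
```

With $\mu = 0$ the second update reduces to the first; a larger $\mu$ pulls the topic toward the words the user placed in the prior.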

Incorporating Topic Priors: Sample Topic 1
Prior: labor 0.2, division 0.2
Extracted topic: age 0.0672687, division 0.0551497, labor 0.052136, colony 0.038305, foraging 0.0357817, foragers 0.0236658, workers 0.0191248, task 0.0190672, behavioral 0.0189017, behavior 0.0168805, older 0.0143466, tasks 0.013823, old 0.011839, individual 0.0114329, ages 0.0102134, young 0.00985875, genotypic 0.00963096, social 0.00883439

Incorporating Topic Priors: Sample Topic 2
Prior: behavioral 0.2, maturation 0.2
Extracted topic: behavioral 0.110674, age 0.0789419, maturation 0.057956, task 0.0318285, division 0.0312101, labor 0.0293371, workers 0.0222682, colony 0.0199028, social 0.0188699, behavior 0.0171008, performance 0.0117176, foragers 0.0110682, genotypic 0.0106029, differences 0.0103761, polyethism 0.00904816, older 0.00808171, plasticity 0.00804363, changes 0.00794045

Exploit Prior for Concept Switching
First word distribution: foraging 0.142473, foragers 0.0582921, forage 0.0557498, food 0.0393453, nectar 0.03217, colony 0.019416, source 0.0153349, hive 0.0151726, dance 0.013336, forager 0.0127668, information 0.0117961, feeder 0.010944, rate 0.0104752, recruitment 0.00870751, individual 0.0086414, reward 0.00810706, flower 0.00800705, dancing 0.00794827, behavior 0.00789228
Second word distribution: foraging 0.290076, nectar 0.114508, food 0.106655, forage 0.0734919, colony 0.0660329, pollen 0.0427706, flower 0.0400582, sucrose 0.0334728, source 0.0319787, behavior 0.0283774, individual 0.028029, rate 0.0242806, recruitment 0.0200597, time 0.0197362, reward 0.0196271, task 0.0182461, sitter 0.00604067, rover 0.00582791, rovers 0.00306051

Part 3: Task Support

Gene Summarization
Task: Automatically generate a text summary for a given gene
Challenge: Need to summarize different aspects of a gene
Standard summarization methods would generate an unstructured summary
Solution: A new method for generating semi-structured summaries

An Ideal Gene Summary http://flybase.bio.indiana.edu/.bin/fbidq.html?FBgn0000017 GP EL SI GI MP WFPI

Semi-structured Text Summarization

Summary example (Abl)

A General Entity Summarizer
Task: Given any entity and k aspects to summarize, generate a semi-structured summary
Assumption: Training sentences are available for each aspect
Method:
  Train a recognizer for each aspect
  Given an entity, retrieve sentences relevant to the entity
  Classify each sentence into one of the k aspects
  Choose the best sentences in each category
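The four method steps can be sketched as a small pipeline. Everything concrete here is a placeholder: the keyword-count scorer stands in for a trained aspect recognizer, substring matching stands in for retrieval, and the gene names and sentences are invented.

```python
def keyword_scorer(keywords):
    """Stand-in for a trained aspect recognizer: counts keyword hits."""
    return lambda sentence: sum(sentence.lower().count(k) for k in keywords)

def summarize_entity(entity, sentences, aspect_scorers, per_aspect=1):
    """Build a semi-structured summary: the best sentences for each aspect."""
    # Step 2: retrieve sentences relevant to the entity (naive substring match)
    relevant = [s for s in sentences if entity.lower() in s.lower()]
    # Step 3: assign each sentence to its highest-scoring aspect
    buckets = {aspect: [] for aspect in aspect_scorers}
    for s in relevant:
        scores = {a: scorer(s) for a, scorer in aspect_scorers.items()}
        best = max(scores, key=scores.get)
        buckets[best].append((scores[best], s))
    # Step 4: keep the top-scoring sentences in each category
    return {
        aspect: [s for _, s in sorted(items, reverse=True)[:per_aspect]]
        for aspect, items in buckets.items()
    }

# Step 1 would train one recognizer per aspect; two are hard-coded here.
scorers = {
    "expression": keyword_scorer(["expressed"]),
    "interaction": keyword_scorer(["interacts"]),
}
sentences = [
    "geneX is expressed in the brain.",
    "geneX interacts with geneY.",
    "geneZ is unrelated.",
]
print(summarize_entity("geneX", sentences, scorers))
```

Swapping in real aspect classifiers and a retrieval engine changes only the scorers and the retrieval line; the structure of the summary stays the same.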

Summary
All the methods we developed are general and scalable
The problems are hard, but good progress has been made in all the directions
The V3 system has only incorporated the basic research results
More advanced technologies are available for immediate implementation:
  Better tokenization for retrieval
  Domain adaptation techniques
  Automatic topic labeling
  General entity summarizer
More research to be done in:
  Entity & relation extraction
  Graph mining/question answering
  Domain adaptation
  Active learning

Looking Ahead: X-Space…
[Architecture diagram, top to bottom]
Users …
Task Support: Gene Summarizer, Function Annotator
Space Navigation: Space/Region Manager, Navigation Support
Search Engine, Text Miner, Relational Database
Words/Phrases, Entities
Content Analysis: Natural Language Understanding
Meta Data, Literature Text

Thank You! Questions?