Published by Garey Holt. Modified over 9 years ago.
Statistical Language Models for Biomedical Literature Retrieval
ChengXiang Zhai
Department of Computer Science, Institute for Genomic Biology, and Graduate School of Library & Information Science
University of Illinois, Urbana-Champaign

Motivation
Biomedical literature serves as a “complete” documentation of the biomedical knowledge discovered by scientists
Medline: > 10,000,000 literature abstracts (1966-)
Effective access to biomedical literature is essential for
–Understanding related existing discoveries
–Formulating new hypotheses
–Verifying hypotheses
–…
Biologists routinely use PubMed to access literature

Challenges in Biomedical Literature Retrieval
Tokenization
–Many names are irregular with special characters such as “/”, “-”, etc. E.g., MIP-1-alpha, (MIP)-1alpha
–Ambiguous words: “was” and “as” can be genes
Semi-structured queries
–It is often desirable to expand a query about a gene with synonyms of the gene; the expanded query would have several fields (original name + symbols)
–“Find the role of gene A in disease B” (3 fields)
…

TREC Genomics Track
TREC (Text REtrieval Conference):
–Started 1992; sponsored by NIST
–Large-scale evaluation of information retrieval (IR) techniques
Genomics Track
–Started in 2003
–Still continuing
–Evaluation of IR for biomedical literature search

Typical TREC Cycle
Feb: Application for participation
Spring: Preliminary (training) data available
Beginning of Summer: Official test data available
End of Summer: Result submission
Early Fall: Official evaluation; results are out in Oct
Nov: TREC Workshop; plan for next year

UIUC Participation
2003: Obtained initial experience; recognized the problem of “semi-structured queries”
2005: Continued developing semi-structured language models
2006: Applied hidden Markov models to passage retrieval

Outline
Standard IR Techniques
Semi-structured Query Language Models
Parameter Estimation
Experiment Results
Conclusions and Future Work

What is Text Retrieval (TR)?
There exists a collection of text documents
User gives a query to express the information need
A retrieval system returns relevant documents to users
More commonly known as “Information Retrieval” (IR)
Known as “search technology” in industry

TR is Hard!
Under/over-specified query
–Ambiguous: “buying CDs” (money or music?)
–Incomplete: what kind of CDs?
–What if “CD” is never mentioned in document?
Vague semantics of documents
–Ambiguity: e.g., word-sense, structural
–Incomplete: Inferences required
Even hard for people!
–80% agreement in human judgments

TR is “Easy”!
TR CAN be easy in a particular case
–Ambiguity in query/document is RELATIVE to the database
–So, if the query is SPECIFIC enough, just one keyword may get all the relevant documents
PERCEIVED TR performance is usually better than the actual performance
–Users can NOT judge the completeness of an answer

Formal Formulation of TR
Vocabulary $V = \{w_1, w_2, \ldots, w_N\}$ of language
Query $q = q_1, \ldots, q_m$, where $q_i \in V$
Document $d_i = d_{i1}, \ldots, d_{i m_i}$, where $d_{ij} \in V$
Collection $C = \{d_1, \ldots, d_k\}$
Set of relevant documents $R(q) \subseteq C$
–Generally unknown and user-dependent
–Query is a “hint” on which doc is in $R(q)$
Task = compute $R'(q)$, an “approximate $R(q)$”

Computing R’(q)
Strategy 1: Document selection
–$R'(q) = \{d \in C \mid f(d,q) = 1\}$, where $f(d,q) \in \{0,1\}$ is an indicator function or classifier
–System must decide if a doc is relevant or not (“absolute relevance”)
Strategy 2: Document ranking
–$R'(q) = \{d \in C \mid f(d,q) > \theta\}$, where $f(d,q)$ is a relevance measure function; $\theta$ is a cutoff
–System must decide if one doc is more likely to be relevant than another (“relative relevance”)

Document Selection vs. Ranking
[Figure: the true R(q) compared with R’(q) under each strategy. Doc selection uses a binary f(d,q) with values 1 and 0; doc ranking uses a real-valued f(d,q) and produces the ranked list below (+ relevant, - non-relevant).]
0.98 d1 +
0.95 d2 +
0.83 d3 -
0.80 d4 +
0.76 d5 -
0.56 d6 -
0.34 d7 -
0.21 d8 +
0.21 d9 -

Problems of Doc Selection
The classifier is unlikely to be accurate
–“Over-constrained” query (terms are too specific): no relevant documents found
–“Under-constrained” query (terms are too general): over-delivery
–It is extremely hard to find the right position between these two extremes
Even if it is accurate, not all relevant documents are equally relevant
Relevance is a matter of degree!

Ranking is often preferred
Relevance is a matter of degree
A user can stop browsing anywhere, so the boundary is controlled by the user
–High recall users would view more items
–High precision users would view only a few
Theoretical justification: Probability Ranking Principle [Robertson 77]

Evaluation Criteria
Effectiveness/Accuracy
–Precision, Recall
Efficiency
–Space and time complexity
Usability
–How useful for real user tasks?

Methodology: Cranfield Tradition
Laboratory testing of system components
–Precision, Recall
–Comparative testing
Test collections
–Set of documents
–Set of questions
–Relevance judgments

The Contingency Table
Doc \ Action    Retrieved               Not Retrieved
Relevant        Relevant Retrieved      Relevant Rejected
Not relevant    Irrelevant Retrieved    Irrelevant Rejected
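
Precision and recall are both computed from the cells of this table. The following is a minimal sketch, assuming the usual definitions (precision = relevant retrieved / all retrieved; recall = relevant retrieved / all relevant). The function name and the document IDs are illustrative and do not come from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # the "Relevant Retrieved" cell
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: 6 documents returned, 4 relevant documents in the collection.
retrieved = ["d1", "d2", "d3", "d4", "d5", "d6"]
relevant = ["d1", "d2", "d5", "d9"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 3 of 6 retrieved are relevant; 3 of 4 relevant are retrieved
```

Both measures are set-based, so they ignore the order in which documents are returned; that is why ranked output needs the rank-sensitive measures discussed next.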

How to measure a ranking?
Compute the precision at every recall point
Plot a precision-recall (PR) curve
[Figure: two precision-recall plots (axes: precision, recall), one per system. Which is better?]

Summarize a Ranking
Given that n docs are retrieved
–Compute the precision (at rank) where each (new) relevant document is retrieved => p(1),…,p(k), if we have k rel. docs
–E.g., if the first rel. doc is at the 2nd rank, then p(1)=1/2.
–If a relevant document never gets retrieved, we assume the precision corresponding to that rel. doc to be zero
Compute the average over all the relevant documents
–Average precision = (p(1)+…+p(k))/k
This gives us (non-interpolated) average precision, which captures both precision and recall and is sensitive to the rank of each relevant document
Mean Average Precision (MAP)
–MAP = arithmetic mean of average precision over a set of topics
–gMAP = geometric mean of average precision over a set of topics (more affected by difficult topics)
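
The procedure above can be sketched in a few lines. This is an illustrative implementation, not code from the talk: the function names are made up, and the small floor `eps` in the geometric mean is an assumption added so that a topic with zero average precision does not drive the whole product to zero.

```python
import math


def average_precision(ranked_rel, num_relevant):
    """Non-interpolated AP.

    ranked_rel: list of booleans, True where the doc at that rank is relevant.
    num_relevant: total number of relevant docs (k), retrieved or not.
    Relevant docs that are never retrieved contribute a precision of zero.
    """
    hits, total = 0, 0.0
    for rank, is_rel in enumerate(ranked_rel, start=1):
        if is_rel:
            hits += 1
            total += hits / rank  # precision at the rank of this relevant doc
    return total / num_relevant if num_relevant else 0.0


def mean_ap(aps):
    """MAP: arithmetic mean of per-topic average precision."""
    return sum(aps) / len(aps)


def geometric_mean_ap(aps, eps=1e-5):
    """gMAP: geometric mean of per-topic AP (eps is an assumed floor)."""
    return math.exp(sum(math.log(max(ap, eps)) for ap in aps) / len(aps))


# Six returned docs with relevance + + - - + -, and 4 relevant docs in total.
ap = average_precision([True, True, False, False, True, False], 4)
print(ap)  # (1/1 + 2/2 + 3/5 + 0) / 4

# Two hypothetical topics, one easy and one hard.
print(mean_ap([ap, 0.1]), geometric_mean_ap([ap, 0.1]))
```

On the two hypothetical topics the geometric mean comes out lower than the arithmetic mean, which illustrates why gMAP is described as more affected by difficult topics.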

Precision-Recall Curve
[Figure: annotated evaluation output with a precision-recall curve]
Recall = 3212/4728: out of 4728 rel docs, we’ve got 3212
Precision@10docs: about 5.5 docs in the top 10 docs are relevant
Breakeven Point (prec=recall)
Mean Avg. Precision (MAP)
Example: system returns 6 docs (D1 +, D2 +, D3 –, D4 –, D5 +, D6 –); total # rel docs = 4
Average Prec = (1/1+2/2+3/5+0)/4 = 0.65

Typical TR System Architecture
[Figure: architecture diagram with the components User, query, docs, Tokenizer, Indexer, Index, Doc Rep (Index), Query Rep, Scorer, results, judgments, and Feedback]

Tokenization
Normalize lexical units: Words with similar meanings should be mapped to the same indexing term
Stemming: Mapping all inflectional forms of words to the same root form, e.g.
–computer -> compute
–computation -> compute
–computing -> compute (but king->k?)
Porter’s Stemmer is popular for English

Relevance Feedback
[Figure: feedback loop. Query -> Retrieval Engine (over the document collection) -> Results: d1 3.5, d2 2.4, …, dk 0.5, ... -> User -> Judgments: d1 +, d2 -, d3 +, …, dk -, ... -> Feedback -> Updated query]

Pseudo/Blind/Automatic Feedback
[Figure: the same loop without the user. The top 10 results are treated as relevant (Judgments: d1 +, d2 +, d3 +, …, dk -, ...) -> Feedback -> Updated query]
Traditional approach = Vector space model

Vector Space Model
Represent a doc/query by a term vector
–Term: basic concept, e.g., word or phrase
–Each term defines one dimension
–N terms define a high-dimensional space
–Element of vector corresponds to term weight
–E.g., $d=(x_1,…,x_N)$, $x_i$ is “importance” of term i
Measure relevance by the distance between the query vector and document vector in the vector space

VS Model: illustration
[Figure: documents D1 to D11 and the Query plotted as vectors in a space whose axes are the terms Java, Microsoft, and Starbucks; several documents are marked “?”]

What’s a good “basic concept”?
Orthogonal
–Linearly independent basis vectors
–“Non-overlapping” in meaning
No ambiguity
Weights can be assigned automatically and hopefully accurately
Many possibilities: Words, stemmed words, phrases, “latent concept”, …

How to Assign Weights?
Very important!
Why weighting
–Query side: Not all terms are equally important
–Doc side: Some terms carry more contents
How?
–Two basic heuristics
  TF (Term Frequency) = Within-doc-frequency
  IDF (Inverse Document Frequency)
–TF normalization
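
To make the two heuristics concrete, here is a small sketch of TF-IDF weighting with cosine similarity as the vector-space measure. The slides do not fix a particular formula, so the choices below are assumptions: a log-damped TF, `1 + log(tf)`, standing in for TF normalization; an IDF of `log((N + 1) / df)`; and cosine as the similarity. All names and the toy documents are illustrative.

```python
import math
from collections import Counter


def build_idf(docs):
    """IDF per term from a list of token lists (assumed variant: log((N+1)/df))."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((n + 1) / d) for term, d in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse term vector; terms unknown to the collection are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: (1 + math.log(c)) * idf[t] for t, c in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


# Toy collection built around the three axis terms of the illustration slide.
docs = [
    ["java", "coffee", "starbucks"],
    ["java", "microsoft", "programming"],
    ["starbucks", "coffee", "coffee"],
]
idf = build_idf(docs)
query_vec = tfidf_vector(["java", "programming"], idf)
scores = [cosine(query_vec, tfidf_vector(doc, idf)) for doc in docs]
print(scores)  # the second document shares both query terms and scores highest
```

In this toy collection "programming" occurs in only one document, so it receives a larger IDF than "java" and contributes more to the score, which is the behaviour the IDF heuristic is meant to produce.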
Language Modeling Approaches are becoming more and more popular…

What is a Statistical LM?
A probability distribution over word sequences
–p(“Today is Wednesday”) ≈ 0.001
–p(“Today Wednesday is”) ≈ 0.0000000000001
–p(“The eigenvalue is positive”) ≈ 0.00001
Context-dependent!
Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model

The Simplest Language Model (Unigram Model)
Generate a piece of text by generating each word INDEPENDENTLY
Thus, $p(w_1 w_2 \ldots w_n) = p(w_1)p(w_2) \cdots p(w_n)$
Parameters: $\{p(w_i)\}$, with $p(w_1) + \cdots + p(w_N) = 1$ (N is voc. size)
Essentially a multinomial distribution over words
A piece of text can be regarded as a sample drawn according to this word distribution

Text Generation with Unigram LM
(Unigram) Language Model $p(w|\theta)$; a document is produced by sampling from it
Topic 1: Text mining (… text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, …) -> Text mining paper
Topic 2: Health (… food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …) -> Food nutrition paper

Estimation of Unigram LM
Document: a “text mining paper” (total #words=100) with counts text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …
Estimation of the (Unigram) Language Model $p(w|\theta)$=?
–text 10/100, mining 5/100, association 3/100, database 3/100, …, query 1/100, …
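
The estimate on this slide is the maximum likelihood estimate: each word's probability is its count divided by the document length. A minimal sketch follows, using the counts from the slide; the filler token `"other"` is a placeholder for the 75 words the slide elides, and the function names are illustrative.

```python
import math
from collections import Counter


def mle_unigram(tokens):
    """Maximum likelihood unigram model: p(w) = count(w) / total words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def log_prob(tokens, lm):
    """log p(w_1 ... w_n) under the independence assumption.

    Raises KeyError for a word the model has never seen, which is the
    zero-probability problem that smoothing addresses.
    """
    return sum(math.log(lm[w]) for w in tokens)


# The "text mining paper" from the slide, padded to 100 words with a placeholder.
paper = (["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["database"] * 3
         + ["algorithm"] * 2 + ["query"] + ["efficient"] + ["other"] * 75)
lm = mle_unigram(paper)
print(lm["text"], lm["mining"], lm["query"])  # 10/100, 5/100, 1/100
print(log_prob(["text", "mining"], lm))       # log(0.1) + log(0.05)
```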

Language Models for Retrieval (Ponte & Croft 98)
Document -> Language Model
–Text mining paper: … text ?, mining ?, association ?, clustering ?, …, food ?, …
–Food nutrition paper: … food ?, nutrition ?, healthy ?, diet ?, …
Query = “data mining algorithms”
Which model would most likely have generated this query?

Ranking Docs by Query Likelihood
[Figure: each document $d_1, d_2, \ldots, d_N$ has a doc LM $\theta_{d_1}, \theta_{d_2}, \ldots, \theta_{d_N}$; the query q is scored by the query likelihood $p(q|\theta_{d_1}), p(q|\theta_{d_2}), \ldots, p(q|\theta_{d_N})$]

Kullback-Leibler (KL) Divergence Retrieval Model
Unigram similarity model
Retrieval => Estimation of $\theta_Q$ and $\theta_D$
Special case: $\theta_Q$ = empirical distribution of q
Query entropy (ignored for ranking)
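
The equations on this slide are not present in the transcript. As a reference point, the standard form of the KL-divergence scoring function is written out below; the notation ($\theta_Q$ for the query model, $\theta_D$ for the document model, $c(w,q)$ for the count of $w$ in $q$) is assumed here and may differ from the original slide.

```latex
\begin{align*}
\mathrm{score}(d, q)
  &= -D(\theta_Q \,\|\, \theta_D)
   = -\sum_{w \in V} p(w|\theta_Q) \log \frac{p(w|\theta_Q)}{p(w|\theta_D)} \\
  &= \sum_{w \in V} p(w|\theta_Q) \log p(w|\theta_D)
     \;\underbrace{-\sum_{w \in V} p(w|\theta_Q) \log p(w|\theta_Q)}_{\text{query entropy, same for every } d}
\end{align*}

% Special case: empirical query model p(w|\theta_Q) = c(w,q)/|q|
\begin{align*}
\sum_{w \in V} \frac{c(w,q)}{|q|} \log p(w|\theta_D)
  = \frac{1}{|q|} \log p(q|\theta_D)
\end{align*}
```

Because the query entropy does not depend on the document, dropping it leaves the ranking unchanged, and with the empirical query model the remaining term is a rescaled log query likelihood, so the two approaches rank documents identically in that case.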

Estimating p(w|d) (i.e., $\theta_D$)
Simplified Jelinek-Mercer: Shrink uniformly toward p(w|C)
Dirichlet prior (Bayesian): Assume pseudo counts $\mu\, p(w|C)$
Absolute discounting: Subtract a constant $\delta$
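
The three smoothing methods can be sketched as follows, using their commonly published forms. The parameter names (`lam`, `mu`, `delta`), their default values, and the toy documents are assumptions for illustration; the sketch also assumes every query word occurs somewhere in the collection, so that p(w|C) is never zero.

```python
import math
from collections import Counter


def jelinek_mercer(word, counts, doc_len, p_coll, lam=0.1):
    """Interpolate the ML estimate with p(w|C) using a fixed weight lam."""
    return (1 - lam) * counts.get(word, 0) / doc_len + lam * p_coll[word]


def dirichlet_prior(word, counts, doc_len, p_coll, mu=2000):
    """Add mu * p(w|C) pseudo counts to the document counts."""
    return (counts.get(word, 0) + mu * p_coll[word]) / (doc_len + mu)


def absolute_discount(word, counts, doc_len, p_coll, delta=0.7):
    """Subtract delta from each seen count; give the freed mass to p(w|C)."""
    freed = delta * len(counts)  # delta times the number of distinct words in d
    return (max(counts.get(word, 0) - delta, 0) + freed * p_coll[word]) / doc_len


def query_log_likelihood(query, doc, p_coll, smooth=dirichlet_prior):
    """log p(q | theta_D) with a smoothed document model."""
    counts = Counter(doc)
    return sum(math.log(smooth(w, counts, len(doc), p_coll)) for w in query)


# Toy collection of two "papers" and its collection language model p(w|C).
docs = [["text", "mining", "algorithm", "text"], ["food", "nutrition", "diet", "food"]]
coll_counts = Counter(w for d in docs for w in d)
coll_len = sum(coll_counts.values())
p_coll = {w: c / coll_len for w, c in coll_counts.items()}

query = ["mining", "algorithm"]
for smooth in (jelinek_mercer, dirichlet_prior, absolute_discount):
    scores = [query_log_likelihood(query, d, p_coll, smooth) for d in docs]
    print(smooth.__name__, scores)  # the text mining document scores higher
```

Without smoothing, the second document would assign zero probability to both query words and its log-likelihood would be undefined; each method avoids this by backing off to the collection model in a different way.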
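
Estimating $\theta_Q$ (Feedback)
[Figure: Query Q and Document D are matched to produce Results; the feedback docs $F=\{d_1, d_2, \ldots, d_n\}$ go through a generative model to give $\theta_F$, which is combined with the query model: $\theta_{Q'} = (1-\alpha)\theta_Q + \alpha\theta_F$]
$\alpha=0$: No feedback
$\alpha=1$: Full feedback

Generative Mixture Model
Each word w in $F=\{d_1, \ldots, d_n\}$ comes from one of two sources, chosen with probability P(source)
–Topic words: $P(w|\theta)$, with probability $1-\lambda$
–Background words: $P(w|C)$, with probability $\lambda$
$\theta$ is estimated by Maximum Likelihood
$\lambda$ = Noise in feedback documents

How to Estimate $\theta_F$?
Known Background p(w|C), $\lambda=0.7$: the 0.2, a 0.1, we 0.01, to 0.02, …, text 0.0001, mining 0.00005, …
Unknown query topic $p(w|\theta_F)$=? (“Text mining”), $1-\lambda=0.3$: …, text =?, mining =?, association =?, word =?, …
Observed Doc(s)
Suppose we know the identity of each word ... -> ML Estimator

Can We Guess the Identity?
Identity (“hidden”) variable: $z_i \in \{1 \text{ (background)}, 0 \text{ (topic)}\}$
Words: the paper presents a text mining algorithm the paper ...
$z_i$: 1 0 1 0 ...
Suppose the parameters are all known, what’s a reasonable guess of $z_i$?
–depends on $\lambda$ (why?)
–depends on p(w|C) and $p(w|\theta_F)$ (how?)
E-step
M-step
Initially, set $p(w|\theta_F)$ to some random value, then iterate …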
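
The E-step and M-step formulas are not present in the transcript, so the following is a sketch of EM for a two-component mixture of this kind, with the background model p(w|C) and the noise level λ held fixed. The function name, the uniform initialisation (the slide suggests a random one), the iteration count, and the toy data are all assumptions.

```python
from collections import Counter


def estimate_feedback_model(feedback_docs, p_coll, lam=0.7, iters=50):
    """EM estimate of p(w|theta_F) under the mixture
    p(w) = (1 - lam) * p(w|theta_F) + lam * p(w|C), with lam and p(w|C) known."""
    counts = Counter(w for doc in feedback_docs for w in doc)
    vocab = list(counts)
    p_topic = {w: 1.0 / len(vocab) for w in vocab}  # uniform start (assumption)
    for _ in range(iters):
        # E-step: probability that an occurrence of w has z = 1 (background).
        p_bg = {
            w: lam * p_coll[w] / (lam * p_coll[w] + (1 - lam) * p_topic[w])
            for w in vocab
        }
        # M-step: re-estimate the topic model from the counts credited to z = 0.
        credited = {w: counts[w] * (1 - p_bg[w]) for w in vocab}
        norm = sum(credited.values())
        p_topic = {w: credited[w] / norm for w in vocab}
    return p_topic


# Hypothetical background model and a single tiny feedback document.
p_coll = {"the": 0.5, "a": 0.3, "text": 0.1, "mining": 0.1}
feedback = [["the", "text", "mining", "the", "a", "text", "mining", "text"]]
theta_f = estimate_feedback_model(feedback, p_coll)
print(sorted(theta_f.items(), key=lambda kv: -kv[1]))
```

The answer to the slide's "why?" shows up in the E-step: a larger λ, or a larger p(w|C) relative to the topic probability, makes the background the more likely source of a word, so common words such as "the" are explained away and the topic model concentrates on "text" and "mining".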

Example of Feedback Query Model
TREC topic 412: “airport security”
Mixture model approach; Web database; Top 10 docs
[Tables: feedback query models estimated with $\lambda=0.9$ and $\lambda=0.7$]

Problem with Standard IR Methods: Semi-Structured Queries
TREC-2003 Genomics Track, Topic 1:
Find articles about the following gene:
OFFICIAL_GENE_NAME activating transcription factor 2
OFFICIAL_SYMBOL ATF2
ALIAS_SYMBOL HB16
ALIAS_SYMBOL CREB2
ALIAS_SYMBOL TREB7
ALIAS_SYMBOL CRE-BP1
Bag-of-word Representation: activating transcription factor 2, ATF2, HB16, CREB2, TREB7, CRE-BP1
Problems with unstructured representation
–Intuitively, matching “ATF2” should be counted more than matching “transcription”
–Such a query is not a natural sample of a unigram language model, violating the assumption of the language modeling retrieval approach

Problem with Standard IR Methods: Semi-Structured Queries (cont.)
A topic in TREC-2005 Genomics Track:
Find information about the role of the gene interferon-beta in the disease multiple sclerosis
3 different fields
Should be weighted differently?
What about expansion?

Semi-Structured Language Models
Semi-structured query
Semi-structured query model
Semi-structured LM estimation: Fit a mixture model to pseudo feedback documents using Expectation-Maximization (EM)

Parameter Estimation
Synonym queries:
–Each field is estimated using maximum likelihood
–Each field has equal weights: $\lambda_i = 1/k$
Aspect queries:
–Use top-ranked documents to estimate all the parameters
–Similar to single-aspect model, but use query as prior and Bayesian estimation
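
For synonym queries, one plausible reading of these two bullets is a mixture of per-field unigram models, $p(w|\theta_Q) = \sum_{i=1}^{k} \lambda_i\, p(w|\theta_i)$ with $\lambda_i = 1/k$. The slide's own formula is not present in the transcript, so the sketch below is an assumption about its form rather than the authors' exact model; the function name and the lower-cased tokens are illustrative.

```python
from collections import Counter


def field_mixture_query_model(fields, weights=None):
    """Mix per-field ML unigram models; weights default to 1/k per field.

    fields: list of non-empty token lists, one per query field.
    """
    k = len(fields)
    weights = weights or [1.0 / k] * k
    model = Counter()
    for lam, tokens in zip(weights, fields):
        for word, count in Counter(tokens).items():
            model[word] += lam * count / len(tokens)  # lam * p_ML(word | field)
    return dict(model)


# TREC-2003 Genomics Track Topic 1: official name, official symbol, four aliases.
fields = [
    ["activating", "transcription", "factor", "2"],
    ["atf2"], ["hb16"], ["creb2"], ["treb7"], ["cre-bp1"],
]
query_model = field_mixture_query_model(fields)
print(query_model["atf2"], query_model["transcription"])
```

Under this reading a one-word symbol such as "atf2" carries its field's full weight of 1/6, while "transcription" shares the name field's weight with three other words, which matches the intuition that matching "ATF2" should count for more. In the bag-of-words representation every token would instead receive the same weight.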

Maximum Likelihood vs. Bayesian
Maximum likelihood estimation
–“Best” means “data likelihood reaches maximum”
–Problem: small sample
Bayesian estimation
–“Best” means being consistent with our “prior” knowledge and explaining data well
–Problem: how to define prior?

Illustration of Bayesian Estimation
Prior: $p(\theta)$
Likelihood: $p(X|\theta)$, $X=(x_1,\ldots,x_N)$
Posterior: $p(\theta|X) \propto p(X|\theta)\,p(\theta)$
[Figure: the prior, likelihood, and posterior curves, marking the prior mode, the ML estimate $\theta_{ml}$, and the posterior mode]

Experiment Results
               TREC 2003 (Uniform weights)         TREC 2005 (Estimated weights)
Query Model    Unstruct   Semi-struct   Imp.       Unstruct   Semi-struct   Imp.
MAP            0.16       0.185         +13.5%     0.242      0.258         +6.6%
Pr@10docs      0.14       0.154         +10%       0.382      0.412         +7.8%
More Experiment Results (with slightly different model)

Conclusions
Standard IR techniques are effective for biomedical literature retrieval
Modeling and exploiting the structure in a query can improve accuracy
Overall TREC Genomics Track findings
–Domain-specific resources are very useful
–Sound retrieval models and machine learning techniques are helpful

Future Work
Using HMMs to model relevant documents
Incorporate biomedical resources into principled statistical models