Beespace Component: Filtering and Normalization for Biology Literature Qiaozhu Mei 03.16.2005.

Concept Processing Component for Beespace: A Big Picture
- Retrieval: query terms -> relevant documents (from the pre-processed text collection)
- Filtering Module: relevant documents -> a list of representative terms or phrases
- Normalization and Clustering Module: entities & phrases of interest -> similarity groups of terms and phrases (concepts)

Concept Processing Component for Beespace: Input and Output
Input: texts (indices) with entities and phrases tagged.
- Filtering: a group of relevant documents for a query
- Normalization: a list of terms, entities or phrases of interest to be normalized
Output:
- Filtering: a list of highly representative terms & phrases
- Normalization: hierarchical structure of concepts (compacted, loose)
- Concept dictionary
- Texts tagged with concepts

Filtering

Term Filtering: Heuristics
- We want a list of representative terms & phrases that is short enough to enable interactive selection and navigation.
- We want terms with higher frequency in the given documents (high Term Frequency); however...
- Terms that are too frequent in the whole collection are considered harmful: the, is, cell, bee, ... (so we also want low Document Frequency)

Term Filtering: TF*IDF
Adding IDF to the frequency count:
- Weight = tf * log((N - 1)/df)
TF-IDF formula in the Okapi method:
- Weight = IDF part * TF part
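The first weighting above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: tf is counted over the relevant documents, df over the whole collection, N is the number of documents in the collection, and documents are already tokenized. The function name `tfidf_weights` and the toy collection are hypothetical, not part of Beespace. The Okapi variant is not shown.

```python
import math
from collections import Counter

def tfidf_weights(relevant_docs, collection):
    """Score each term in relevant_docs by tf * log((N - 1) / df).

    relevant_docs and collection are lists of token lists (an assumption
    made for this sketch). tf is counted over the relevant documents,
    df over the whole collection.
    """
    n_docs = len(collection)
    df = Counter()
    for doc in collection:
        df.update(set(doc))          # count each term once per document
    tf = Counter()
    for doc in relevant_docs:
        tf.update(doc)               # count every occurrence
    weights = {}
    for term, freq in tf.items():
        # Skip terms for which the formula is undefined.
        if df[term] == 0 or n_docs < 2:
            continue
        weights[term] = freq * math.log((n_docs - 1) / df[term])
    return weights

# Toy collection (hypothetical data, for illustration only).
collection = [
    ["the", "bee", "forages", "pollen"],
    ["the", "bee", "dances"],
    ["the", "queen", "bee"],
    ["the", "pollen", "foraging", "bee"],
]
relevant = [collection[0], collection[3]]
weights = tfidf_weights(relevant, collection)
ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
```

In this toy run, "the" and "bee" occur in every document, so their weights fall below zero and they drop to the bottom of `ranked`, while "pollen", "forages" and "foraging" stay on top.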

Term Filtering (cont.)
Results 1:
- Collection: honeybee.biosis 1980
- Query: "pollen-foraging"
- Select top 2 documents
Results 2:
- Collection: GENIA (on "human & blood cell & transcription factor"), with noun phrases of entities tagged
- Query: "il-2"

Normalization

From Term to Concept: Normalization and Theme Clustering
Normalization: tight concepts
- Group terms/entities/phrases by similarity so that one can represent the others
- Forage: forager, forage-bee, foraging, foragers, pollen-foraging...
Theme clustering: looser concepts
- Group terms/entities/phrases representing the same subtopic (semantically related)
- forage, pollen, food, detect, feeding, dance, ...
In a hierarchical manner.

Normalization
Morphological approach? (stemming)
- Normalize morphological variants of English words, e.g. forag: forage/foraging/forager/foragers
- Concerns:
  - Too aggressive? one -> on; day -> dai; apis -> api; useful -> us
  - Handling biological entities? (some stemmers do nothing when they detect "-")
  - Not sufficient to normalize phrases
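To make the trade-off concrete, here is a deliberately crude suffix stripper in Python. It is not the Porter or Krovetz algorithm: the suffix list, the minimum stem length and the function names are all assumptions chosen for illustration. It shows the useful grouping (the forage variants collapse to one stem), the over-stemming risk (one -> on, apis -> api), and the "do nothing on a hyphen" behaviour.

```python
from collections import defaultdict

# Hypothetical suffix list, longest first. NOT the Porter or Krovetz rules.
SUFFIXES = ("ers", "ing", "er", "es", "e", "s")

def crude_stem(word):
    """Strip the first matching suffix; leave hyphenated tokens untouched."""
    if "-" in word:
        return word                      # mimic stemmers that skip "-" tokens
    for suffix in SUFFIXES:
        # Keep at least two characters of stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

def group_by_stem(words):
    """Map each stem to the list of surface forms that share it."""
    groups = defaultdict(list)
    for word in words:
        groups[crude_stem(word)].append(word)
    return dict(groups)

groups = group_by_stem(
    ["forage", "foraging", "forager", "foragers", "pollen-foraging", "one", "apis"]
)
```

Here `groups["forag"]` collects the four forage variants, "pollen-foraging" stays in a group of its own, and "one" and "apis" are cut down to "on" and "api", which is exactly the kind of damage to ordinary words and biological names that the concerns above describe.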

Normalization: Stemmers
- Porter Stemmer: does not stem words beginning with an uppercase letter
- Krovetz' Stemmer: less aggressive than Porter
- Sample results: Honeybee, Genia

Normalization (cont.)
Semantic and contextual approach:
- Group the terms that are considered "replaceable" with each other in a context. E.g.
  ...the pollen-foraging activity of a mellifera...
  ...the nectar-foraging activity of a cerana...
- Generally handled with clustering approaches based on statistical information in a large corpus
- Usually in the form of hierarchical clusters
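One simple way to picture "replaceable in a context" is to record, for every term, the words immediately to its left and right, and to pair up terms that share at least one such context. The Python sketch below does only that. It is an illustrative assumption rather than the method used in Beespace, which relies on corpus statistics and clustering; the function names and the one-word window are hypothetical.

```python
from collections import defaultdict

def context_sets(sentences, window=1):
    """Map each term to the set of (left words, right words) contexts it occurs in."""
    contexts = defaultdict(set)
    for tokens in sentences:
        for i, term in enumerate(tokens):
            left = tuple(tokens[max(0, i - window): i])
            right = tuple(tokens[i + 1: i + 1 + window])
            contexts[term].add((left, right))
    return contexts

def replaceable_pairs(sentences, window=1):
    """Return sorted pairs of distinct terms that share at least one context."""
    contexts = context_sets(sentences, window)
    terms = sorted(contexts)
    pairs = []
    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if contexts[a] & contexts[b]:
                pairs.append((a, b))
    return pairs

# The two example fragments, tokenized on whitespace.
sentences = [
    "the pollen-foraging activity of a mellifera".split(),
    "the nectar-foraging activity of a cerana".split(),
]
pairs = replaceable_pairs(sentences)
```

On these two fragments the sketch pairs "nectar-foraging" with "pollen-foraging" and "cerana" with "mellifera". With only two sentences this is anecdotal; a real system needs counts from a large corpus before such pairs are trustworthy.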

Normalization: A clustering approach
An N-gram clustering method:
- Ideally, if we consider each term in its N-gram context, the replaceable relation would be global and reliable.
- Concerns: efficiency
  - Computing complexity is high! For 2-grams, NV^2 even after optimization (initially V^5)
  - Space complexity is high! V^3
- Compromise: use 2-grams (equivalent to computing the average mutual information of 2-grams and merging the two terms whose merge brings the smallest loss to this average MI)
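The 2-gram compromise can be sketched as a greedy merge loop: start with every word in its own cluster, and repeatedly merge the pair of clusters whose merge leaves the average mutual information of adjacent cluster labels highest (that is, loses the least). The Python version below is a naive illustration that recomputes the average MI from scratch for every candidate pair, so it has none of the optimizations alluded to above and is only usable on toy input. Function names and the toy token stream are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def average_mi(bigrams, cluster_of):
    """Average mutual information (in nats) between adjacent cluster labels."""
    pair = Counter((cluster_of[a], cluster_of[b]) for a, b in bigrams)
    total = sum(pair.values())
    left, right = Counter(), Counter()
    for (c1, c2), n in pair.items():
        left[c1] += n
        right[c2] += n
    return sum(
        (n / total) * math.log(n * total / (left[c1] * right[c2]))
        for (c1, c2), n in pair.items()
    )

def greedy_merge(tokens, n_clusters):
    """Merge clusters until n_clusters remain, each time keeping the most MI."""
    bigrams = list(zip(tokens, tokens[1:]))
    cluster_of = {w: w for w in set(tokens)}     # every word starts alone
    merges = []
    while len(set(cluster_of.values())) > n_clusters:
        labels = sorted(set(cluster_of.values()))
        best = None
        for a, b in combinations(labels, 2):
            # Trial assignment in which cluster b is folded into cluster a.
            trial = {w: (a if c == b else c) for w, c in cluster_of.items()}
            score = average_mi(bigrams, trial)
            if best is None or score > best[0]:
                best = (score, a, b, trial)
        _, a, b, cluster_of = best
        merges.append((a, b))
    return cluster_of, merges

# Toy token stream (hypothetical): the two terms occur in identical contexts.
tokens = (
    "the pollen-foraging activity of the nectar-foraging activity of "
    "the pollen-foraging activity of the nectar-foraging activity"
).split()
clusters, merges = greedy_merge(tokens, n_clusters=4)
```

Because "pollen-foraging" and "nectar-foraging" have identical left and right neighbours in this stream, merging them loses no mutual information, so they are merged first. Recording the merge order, as `merges` does, is what yields a hierarchical structure.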

Normalization: A clustering approach (cont.)
Toy example on honeybee:
- Vocabulary size: 9100 words
- Collection size: 5505 abstracts (honeybee.biosis1980)
- Terms to be clustered: 18
Genia collection, 2000 abstracts:
- 200 noun phrases (entities) to be clustered

Clustered terms (honeybee toy example): nectar-foraging, foraging-related, pollen-foraging, preforaging, non-foraging, foragers, worker, bee, honeybee, workers, nurseries, nursery, nursing, forage, forager, foraging, queen, queens

Sample clusters on Genia: human_and_mouse_gene mouse_il-2r_alpha_gene saos_2_cells saos-2 human_osteosarcoma_ epstein-barr_virus_ interleukin-2 interleukin-2_ epstein-barr_virus phorbol_myristate_acetate phorbol_12-myristate_13-acetate u937_cells monocytic_cells jurkat_cells human_t_cells ipr_cd4-8-_t_cells j_delta_k_cells lymphoid_cells activated_t_cells hematopoietic_cells transcription_factors transcription_factor b_cells jurkat_t_cells hela_cells thp-1 hl60_cells k562_cells thp-1_cells i_kappa_b_alpha nf_kappa_b 2_gene_expression 2_gene

Normalization: Clustering Methods
Other possible clustering approaches. Cluster terms based on features such as:
- Co-occurring terms (tends to ignore position information)
- Correlation of nouns and verbs
- Dependency-based word similarity
- Proximity-based word similarity
Depend on highly accurate parsing results, which may not be easy to get for biology literature.

Theme Clustering
Looser clusters
- Usually in the form of partitioning clusters
- K-Means, Latent Semantic Indexing, Probabilistic LSI
- Compute loose clusters of terms, or clusters represented by term distributions
- Example: # clusters = 10
- Sometimes helpful for finding normalizations (e.g., when the number of clusters is large; when no stemming was done)
- Comparative Text Mining for concept switching

Future Plan:
- Customize the stemmers
- Try more morphological approaches, e.g. pollen-foraging, nectar-foraging
- Examine more clustering methods:
  - How to use theme clustering to help normalization
  - Find a way to divide the hierarchical clustering structure into concepts

Thanks!
