Beespace Component: Filtering and Normalization for Biology Literature Qiaozhu Mei
Concept Processing Component for Beespace: A Big Picture [Diagram labels: Query terms; Retrieval; Relevant documents; Filtering Module; A list of representative terms or phrases; Normalization and Clustering Module; Pre-processed text collection (entities & phrases of interest); Similarity groups of terms and phrases (concepts)]
Concept Processing Component for Beespace: Input and Output. Input: texts (indices) with entities and phrases tagged. Filtering: a group of relevant documents for a query. Normalization: a list of terms, entities, or phrases of interest to be normalized. Output: Filtering: a list of highly representative terms & phrases. Normalization: a hierarchical structure of concepts (compacted, loose); a concept dictionary; texts tagged with concepts.
Filtering
Term Filtering: Heuristics. We want to find a list of representative terms & phrases that is short enough to enable interactive selection and navigation. We want terms with higher frequency in the given documents (high Term Frequency); however, terms that are too frequent in the whole collection are considered harmful: the, is, cell, bee, … (so we also want low Document Frequency).
Term Filtering: TF*IDF. Adding IDF to the frequency count: Weight = tf * log((N - 1)/df). TF-IDF formula in the Okapi method: Weight = [formula not recovered; its factors were labeled "IDF" and "TF part"]
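The first weighting above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: tf is taken as the term count in the selected (relevant) documents, df as the number of documents in the whole collection that contain the term, and N as the collection size. The function name `tfidf_weights` and the toy collection are hypothetical and are not taken from the slides.

```python
import math
from collections import Counter


def tfidf_weights(selected_docs, collection):
    """Weight each term of the selected documents by tf * log((N - 1) / df)."""
    n_docs = len(collection)
    if n_docs < 2:
        raise ValueError("need at least two documents so that N - 1 > 0")

    # df: number of documents in the whole collection that contain the term.
    df = Counter()
    for doc in collection:
        df.update(set(doc))

    # tf: number of occurrences of the term in the selected documents.
    tf = Counter()
    for doc in selected_docs:
        tf.update(doc)

    return {
        term: count * math.log((n_docs - 1) / df[term])
        for term, count in tf.items()
        if df[term] > 0  # skip terms that never occur in the collection
    }


# Hypothetical toy collection (not data from the slides).
collection = [
    "the bee collects pollen while foraging".split(),
    "the bee queen lays eggs".split(),
    "the bee colony guards the hive".split(),
    "the bee dance signals food".split(),
]
weights = tfidf_weights(collection[:1], collection)
ranked = sorted(weights, key=weights.get, reverse=True)
```

In this toy run, "the" and "bee" occur in every document, so their weights fall below those of "pollen" and "foraging", which is the behavior the heuristic asks for.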
Term Filtering (cont.) Results 1: Collection: honeybee.biosis 1980. Query: “pollen-foraging”. Select top 2 documents. Results 2: Collection: GENIA (on “human & blood cell & transcription factor”), with noun phrases of entities tagged. Query: “il-2”.
Normalization
From Term to Concept: Normalization and Theme Clustering. Normalization (tight concepts): group terms/entities/phrases by similarity so that one can represent the others, e.g. forage: forager, forage-bee, foraging, foragers, pollen-foraging, … Theme clustering (looser concepts): group terms/entities/phrases representing the same subtopic (semantically related), e.g. forage, pollen, food, detect, feeding, dance, … In a hierarchical manner.
Normalization. Morphological approach? (stemming) Normalize English words with morphological variations, e.g. forag: forage/foraging/forager/foragers. Concerns: Too aggressive? one -> on; day -> dai; apis -> api; useful -> us. How to handle biological entities? (Some stemmers do nothing when they detect “-”.) Not sufficient to normalize phrases.
Normalization: Stemmers. Porter Stemmer: does not stem words beginning with an uppercase letter. Krovetz' Stemmer: less aggressive than Porter. Sample results: Honeybee; Genia.
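To make the morphological idea concrete, here is a deliberately naive suffix stripper. It is not the Porter or Krovetz algorithm; the suffix list, the minimum stem length, and the names `naive_stem` and `group_by_stem` are assumptions chosen only for illustration. It leaves hyphenated tokens untouched, mirroring the "do nothing on '-'" behavior mentioned above, and groups words that share a stem.

```python
from collections import defaultdict

# Illustrative suffix list only (longest first); real stemmers use richer rules.
SUFFIXES = ("ers", "ing", "er", "e", "s")


def naive_stem(word, min_stem_len=4):
    """Strip one suffix unless the word is hyphenated or the stem gets too short."""
    if "-" in word:
        return word  # leave hyphenated tokens (often entities) unchanged
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Map each stem to the list of surface forms that reduce to it."""
    groups = defaultdict(list)
    for w in words:
        groups[naive_stem(w)].append(w)
    return dict(groups)


groups = group_by_stem(
    ["forage", "foraging", "forager", "foragers", "pollen-foraging", "apis"]
)
```

The minimum stem length is one simple way to avoid over-stemming short words such as "apis"; it also shows why stemming alone does not attach "pollen-foraging" to the "forag" group.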
Normalization (cont.) Semantic and Contextual Approach: group the terms that are considered “replaceable” with each other in a context, e.g. “…the pollen-foraging activity of a mellifera…” and “…the nectar-foraging activity of a cerana…”. Generally handled with clustering approaches based on statistical information from a large corpus, usually in the form of hierarchical clusters.
Normalization: A clustering approach. An N-gram clustering method: ideally, if we consider each term in its N-gram context, the replaceable relation would be global and reliable. Concerns: efficiency. Computing complexity is high: for 2-grams, $NV^2$ even after optimization (initially $V^5$). Space complexity is high: $V^3$. Compromise: use 2-grams (equivalent to computing the average mutual information of 2-grams and grouping the two terms whose merge brings the smallest loss to this average MI).
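The 2-gram criterion above can be illustrated with a brute-force sketch: compute the average mutual information (AMI) of adjacent class pairs, then repeatedly merge the two classes whose merge loses the least AMI. This is a naive version that re-scores every candidate pair at each step, not the optimized procedure whose costs are quoted above. The function names and the toy token sequence (built from the two example phrases) are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def average_mutual_information(bigrams, assign):
    """AMI of adjacent class pairs under the word -> class map `assign`."""
    pair = Counter((assign[a], assign[b]) for a, b in bigrams)
    total = sum(pair.values())
    left, right = Counter(), Counter()
    for (c1, c2), n in pair.items():
        left[c1] += n   # how often class c1 is the first element of a bigram
        right[c2] += n  # how often class c2 is the second element of a bigram
    return sum(
        (n / total) * math.log(n * total / (left[c1] * right[c2]))
        for (c1, c2), n in pair.items()
    )


def greedy_merge(tokens, num_classes):
    """Merge classes greedily, keeping AMI as high as possible at each step."""
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    bigrams = list(zip(tokens, tokens[1:]))
    assign = {w: w for w in set(tokens)}  # start with one class per word
    while len(set(assign.values())) > num_classes:
        best = None
        for c1, c2 in combinations(sorted(set(assign.values())), 2):
            # Trial merge: relabel every member of c2 as c1.
            trial = {w: (c1 if c == c2 else c) for w, c in assign.items()}
            score = average_mutual_information(bigrams, trial)
            if best is None or score > best[0]:
                best = (score, trial)
        assign = best[1]
    return assign


tokens = ("the pollen-foraging activity of a mellifera "
          "the nectar-foraging activity of a cerana").split()
bigrams = list(zip(tokens, tokens[1:]))
identity = {w: w for w in set(tokens)}
merged = {w: ("X-foraging" if w.endswith("-foraging") else w) for w in identity}
loss = (average_mutual_information(bigrams, identity)
        - average_mutual_information(bigrams, merged))
classes = greedy_merge(tokens, num_classes=4)
```

In this toy input, merging "pollen-foraging" and "nectar-foraging" costs essentially no average MI, because both occur between the same left and right neighbors. That is the sense in which the two terms are "replaceable".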
Normalization: A clustering approach (cont.) Toy example on honeybee (honeybee.biosis1980): vocabulary size: 9100 words; collection size: 5505 abstracts; terms to be clustered: 18. Genia collection: 2000 abstracts; 200 noun phrases (entities) to be clustered.
nectar-foraging, foraging-related, pollen-foraging, preforaging, non-foraging, foragers, worker, bee, honeybee, workers, nurseries, nursery, nursing, forage, forager, foraging, queen, queens
Sample clusters on Genia: human_and_mouse_gene mouse_il-2r_alpha_gene saos_2_cells saos-2 human_osteosarcoma_ epstein-barr_virus_ interleukin-2 interleukin-2_ epstein-barr_virus phorbol_myristate_acetate phorbol_12-myristate_13-acetate u937_cells monocytic_cells jurkat_cells human_t_cells ipr_cd4-8-_t_cells j_delta_k_cells lymphoid_cells activated_t_cells hematopoietic_cells transcription_factors transcription_factor b_cells jurkat_t_cells hela_cells thp-1 hl60_cells k562_cells thp-1_cells i_kappa_b_alpha nf_kappa_b 2_gene_expression 2_gene
Normalization: Clustering Methods. Other possible clustering approaches: cluster terms based on features such as: co-occurring terms (tends to ignore position information); correlation of nouns and verbs; dependency-based word similarity; proximity-based word similarity. Concern: these can depend on highly accurate parsing results, which may not be easy to obtain for biology literature.
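As a small illustration of the first feature type, the sketch below represents each term by the terms it co-occurs with in the same document and compares two terms by cosine similarity. Document-level co-occurrence is an assumption made here for simplicity; it ignores word positions, which is exactly the limitation noted above. The helper names and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(docs):
    """Map each term to a Counter of the other terms sharing a document with it."""
    vectors = defaultdict(Counter)
    for doc in docs:
        terms = set(doc)
        for t in terms:
            for other in terms:
                if other != t:
                    vectors[t][other] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


# Hypothetical toy documents (not data from the slides).
docs = [
    ["pollen-foraging", "activity", "mellifera"],
    ["nectar-foraging", "activity", "cerana"],
    ["queen", "egg", "colony"],
]
vecs = cooccurrence_vectors(docs)
sim_foraging = cosine(vecs["pollen-foraging"], vecs["nectar-foraging"])
sim_unrelated = cosine(vecs["pollen-foraging"], vecs["queen"])
```

Here the two foraging terms share the context term "activity" and therefore score higher than the unrelated pair, which shares no context at all.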
Theme Clustering: Looser Clusters. Usually in the form of partitioning clusters: K-Means, Latent Semantic Indexing, Probabilistic LSI. Compute loose clusters of terms, or clusters represented by term distributions. Example: number of clusters = 10. Sometimes helpful for finding normalizations (e.g., when the number of clusters is large; when no stemming was done). Comparative Text Mining for concept switching.
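A minimal K-Means sketch over term vectors shows what a partitioning cluster of terms looks like. It assumes each term is represented by its counts in a handful of documents, and it takes explicit seed terms instead of random initialization so that the run is reproducible. The `seeds` parameter, the function names, and the toy vectors are all hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, seeds, max_iter=100):
    """Partition terms into len(seeds) clusters; returns a term -> cluster index map."""
    centroids = [list(vectors[s]) for s in seeds]
    assignment = {}
    for _ in range(max_iter):
        # Assignment step: each term goes to its nearest centroid.
        new_assignment = {
            term: min(range(len(centroids)),
                      key=lambda i: sq_dist(vec, centroids[i]))
            for term, vec in vectors.items()
        }
        if new_assignment == assignment:
            break  # converged: no term changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [vectors[t] for t, c in assignment.items() if c == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical term-by-document counts (not data from the slides).
vectors = {
    "forage": [3, 2, 0, 0],
    "pollen": [2, 3, 0, 0],
    "queen": [0, 0, 3, 2],
    "nursing": [0, 0, 2, 3],
}
themes = kmeans(vectors, seeds=["forage", "queen"])
```

With these toy counts the terms split into a foraging theme and a brood-care theme. Such clusters are looser than normalization groups: the members are related, but one cannot stand in for another.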
Future Plan: Customize the stemmers. Try more morphological approaches, e.g. for pollen-foraging, nectar-foraging. Examine more clustering methods: how to use theme clustering to help normalization. Find a way to divide the hierarchical clustering structure into concepts.
Thanks!