Recognizing Names in Biomedical Texts: a Machine Learning Approach GuoDong Zhou 1,*, Jie Zhang 1,2, Jian Su 1, Dan Shen 1,2 and ChewLim Tan 2 1 Institute for Infocomm Research, Singapore 2 School of Computing, National Univ. of Singapore (Bioinformatics, Vol.20, No.7, 2004, p )
2/29 Abstract Presents PowerBioNE, a named entity recognition system for the biomedical domain. Evidential features: (1) word formation pattern; (2) morphological pattern, such as prefix and suffix; (3) POS; (4) head noun trigger; (5) special verb trigger; and (6) name alias feature. The features are integrated in a hidden Markov model (HMM). A k-Nearest Neighbor algorithm resolves the data sparseness problem. Pattern-based post-processing deals with the cascaded entity name phenomenon.
3/29 Special Naming Conventions (1/2) Descriptive naming convention: e.g. "normal thymic epithelial cells", which makes the left boundary difficult to identify; 18.6% of entity names in GENIA 3.0 consist of at least four words. Conjunction and disjunction: e.g. "91 and 84 kDa proteins"; 2.06% of entity names in GENIA 3.0 have such a construction. Non-standardized naming convention: e.g. N-acetylcysteine, N-acetyl-cysteine, NAcetylCysteine.
4/29 Special Naming Conventions (2/2) Abbreviation: frequently used in the biomedical domain and highly ambiguous; 81.2% of abbreviations are ambiguous, with an average of 16.6 senses in MEDLINE abstracts. Cascaded construction: e.g. "kappa 3 binding factor"; 16.7% of entity names in GENIA 3.0 have such a construction.
5/29 GENIA Corpus GENIA V1.1: 670 MEDLINE abstracts of 123K words; used to compare this work with others. GENIA V2.1: adds POS annotation to GENIA V1.1; used to train the POS tagger and to evaluate the usefulness of POS. GENIA V3.0: 2,000 MEDLINE abstracts of 360K words; used for the broad range of experiments. The GENIA ontology includes 23 distinct classes.
6/29 Features Word formation pattern (F_WFP); morphological pattern (F_MP); part-of-speech (F_POS); semantic triggers; name alias feature (F_ALIAS).
7/29 Word Formation Pattern It is useful to distinguish between biomedical entity names and others.
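A word formation pattern feature can be sketched as a function that maps each token to a coarse orthographic category. The category names and the small Greek-letter list below are illustrative assumptions, not the categories of the original system.

```python
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}  # illustrative subset only

def word_formation_pattern(word: str) -> str:
    """Map a token to a coarse, hypothetical word formation category."""
    if word.lower() in GREEK:
        return "GreekLetter"
    if word.isdigit():
        return "Digits"
    has_digit = any(ch.isdigit() for ch in word)
    has_alpha = any(ch.isalpha() for ch in word)
    if word.isalnum() and has_digit and has_alpha:
        return "AlphaDigitMix"      # e.g. CD28
    if word.isalpha() and word.isupper():
        return "AllCaps"            # e.g. TCF
    if word.isalpha() and word[0].isupper() and word[1:].islower():
        return "InitialCap"
    if word.isalpha() and word.islower():
        return "Lowercase"
    return "Other"                  # hyphenated or otherwise mixed tokens
```

Tokens such as "CD28" or "TCF" fall into categories that rarely occur for ordinary English words, which is the kind of evidence this feature is meant to supply.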
8/29 Morphological Pattern (1/2)
9/29 Morphological Pattern (2/2) The authors count the frequency of each prefix/suffix in each entity class and group prefixes/suffixes with similar distributions across the entity classes into one category. On average, 37 prefixes/suffixes are selected from the training data and further grouped into 23 categories.
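The counting and grouping step can be sketched as follows. This is a minimal illustration that assumes fixed-length suffixes taken from the last word of each name, and it treats "similar distribution" as "same dominant class"; the slide does not state the actual similarity criterion, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def suffix_class_counts(entities, suffix_len=3):
    """Count how often each word-final suffix occurs in each entity class."""
    counts = defaultdict(Counter)
    for name, entity_class in entities:
        last_word = name.split()[-1].lower()
        if len(last_word) > suffix_len:
            counts[last_word[-suffix_len:]][entity_class] += 1
    return counts

def group_by_dominant_class(counts, min_freq=2):
    """Group suffixes whose distribution peaks on the same class (assumed criterion)."""
    groups = defaultdict(list)
    for suffix, per_class in counts.items():
        if sum(per_class.values()) >= min_freq:
            dominant = per_class.most_common(1)[0][0]
            groups[dominant].append(suffix)
    return {cls: sorted(suffixes) for cls, suffixes in groups.items()}
```

For example, with the names "lipoxygenase" and "protein kinase" labelled Protein, the suffix "ase" would be grouped under the Protein category.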
10/29 Part-of-Speech POS may provide useful evidence about the boundaries of biomedical entity names. The authors adapt an HMM-based POS tagger to GENIA V2.1 by training it on the Penn Treebank (2,500 WSJ articles, 1M words) and 590 GENIA abstracts.
11/29 Semantic Triggers (1/2) Head noun trigger (F_HEAD): the head noun is the major noun of a noun phrase and often describes the function or property of the NP, e.g. "activated human B cells". Extract unigram and bigram head nouns and rank them by frequency. Select the top 60% as head noun triggers.
12/29 Semantic Triggers (2/2) Special verb trigger (F_VERB): certain verbs may provide evidence about the boundaries and classes of biomedical entity names.
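The head noun selection can be sketched in a few lines. The sketch assumes the head noun is the rightmost word (unigram) or the rightmost two words (bigram) of an annotated entity name, which is a simplification; `head_noun_triggers` and `keep_ratio` are hypothetical names.

```python
from collections import Counter

def head_noun_triggers(entity_names, keep_ratio=0.6):
    """Rank unigram and bigram head nouns by frequency and keep the top share."""
    counts = Counter()
    for name in entity_names:
        words = name.lower().split()
        if words:
            counts[words[-1]] += 1               # unigram head noun
        if len(words) >= 2:
            counts[" ".join(words[-2:])] += 1    # bigram head noun
    ranked = [head for head, _ in counts.most_common()]
    keep = max(1, int(len(ranked) * keep_ratio)) if ranked else 0
    return ranked[:keep]
```

Cutting the ranked list at a fixed ratio drops rare head nouns, which are the ones most likely to be noise.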
13/29 Name Alias Feature Inter-sentential name alias phenomenon: when ‘TCF’ is proposed as an entity name candidate, the name alias algorithm is invoked. If ‘T cell Factor’ is a ‘Protein’ name recognized earlier in the document, ‘TCF’ is determined to be an alias of ‘T cell Factor’, with the name alias feature Protein3L3. Intra-sentential abbreviation: when an abbreviation in parentheses is detected, remove the abbreviation and the parentheses, apply the HMM-based named entity recognizer to the sentence, then restore the abbreviation with its parentheses to its original position. The abbreviation is assigned the same class as its expanded form. The expanded form and its abbreviation are stored in the list of recognized biomedical entity names.
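Both alias cases can be sketched under simple assumptions: abbreviations are short alphanumeric strings in parentheses, and an alias is an initialism of an earlier recognized name. The regular expression, the helper names, and the initialism test are illustrative choices, not the matching rules of the original system.

```python
import re

# Assumed shape of a parenthesized abbreviation: 2 to 10 letters, digits or hyphens.
ABBREV = re.compile(r"\s*\(([A-Za-z0-9-]{2,10})\)")

def strip_abbreviations(sentence):
    """Remove parenthesized abbreviations; return the cleaned sentence and the abbreviations."""
    abbreviations = ABBREV.findall(sentence)
    return ABBREV.sub("", sentence), abbreviations

def is_initialism(alias, full_name):
    """Check whether alias consists of the first letters of the words in full_name."""
    initials = "".join(word[0] for word in full_name.split())
    return alias.lower() == initials.lower()

def alias_class(alias, recognized):
    """Return the class of an earlier recognized name that alias abbreviates, else None."""
    for full_name, entity_class in recognized.items():
        if is_initialism(alias, full_name):
            return entity_class
    return None
```

With `recognized = {"T cell Factor": "Protein"}`, the candidate "TCF" would inherit the class Protein, mirroring the example on the slide.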
14/29 HMM-based Biomedical Named Entity Recognition (1/2) Given an output sequence $O_1^n = o_1 o_2 \ldots o_n$, the purpose of an HMM is to find the most likely tag (state) sequence $S_1^n = s_1 s_2 \ldots s_n$ that maximizes $P(S_1^n \mid O_1^n)$. Here, $o_i = \langle f_i, w_i \rangle$, where $w_i$ is the word and $f_i$ is the feature set of the word $w_i$, and $s_i$ = BOUNDARY_i_ENTITY_i_FEATURE_i, where BOUNDARY_i denotes the position of the current word in the entity; ENTITY_i indicates the class of the entity; and FEATURE_i is the feature set.
15/29 HMM-based Biomedical Named Entity Recognition (2/2) Assume mutual information (MI) independence =>
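A sketch of the derivation that this assumption typically leads to, writing $O_1^n = o_1 \ldots o_n$ for the output sequence and $S_1^n = s_1 \ldots s_n$ for the state sequence (this notation is an assumption here). MI independence is taken to mean that the pointwise mutual information between the two sequences decomposes into a sum over individual states:

```latex
\begin{align}
\log P(S_1^n \mid O_1^n)
  &= \log P(S_1^n) + \log \frac{P(S_1^n, O_1^n)}{P(S_1^n)\,P(O_1^n)} \\
  &= \log P(S_1^n) + \mathrm{MI}(S_1^n, O_1^n) \\
  &\approx \log P(S_1^n) + \sum_{i=1}^{n} \mathrm{MI}(s_i, O_1^n) \\
  &= \log P(S_1^n) - \sum_{i=1}^{n} \log P(s_i) + \sum_{i=1}^{n} \log P(s_i \mid O_1^n)
\end{align}
```

The last step uses $\mathrm{MI}(s_i, O_1^n) = \log P(s_i \mid O_1^n) - \log P(s_i)$, which leaves $P(s_i \mid O_1^n)$ as the quantity that has to be estimated from sparse data.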
16/29 k-NN Algorithm for Computing $P(\cdot \mid E_i)$ (1/2) Assume $P(s_i \mid o_1 \ldots o_n) \approx P(s_i \mid E_i)$, where the pattern entry $E_i = o_{i-N} \ldots o_i \ldots o_{i+N}$. The k-NN algorithm estimates $P(\cdot \mid E_i)$ by first finding the K nearest neighbors of the initial pattern entry $E_i$ among the frequently occurring pattern entries and then aggregating them into an estimate of $P(\cdot \mid E_i)$.
17/29 k-NN Algorithm for Computing $P(\cdot \mid E_i)$ (2/2) The conditional state probability distribution is aggregated using likelihood(E, E_i), the likelihood that a pattern entry E is one of the K nearest neighbors of the initial pattern entry E_i.
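The aggregation idea can be sketched as a similarity-weighted average over the nearest frequent pattern entries. The slide does not give the form of likelihood(E, E_i), so the positional `overlap` function below is a hypothetical stand-in, and the weighted average is one plausible aggregation rather than the formula of the original system.

```python
from collections import defaultdict

def overlap(entry_a, entry_b):
    """Hypothetical stand-in for likelihood(E, E_i): share of window positions that match."""
    matches = sum(a == b for a, b in zip(entry_a, entry_b))
    return matches / max(len(entry_a), len(entry_b))

def knn_state_distribution(query, known, k=2):
    """Estimate P(.|query) as a similarity-weighted average over the k nearest known entries.

    known maps a frequent pattern entry (tuple of observations) to its state distribution.
    """
    neighbors = sorted(known, key=lambda entry: overlap(entry, query), reverse=True)[:k]
    weights = {entry: overlap(entry, query) for entry in neighbors}
    total = sum(weights.values())
    if total == 0:
        return {}  # no neighbor shares anything with the query
    estimate = defaultdict(float)
    for entry, weight in weights.items():
        for state, prob in known[entry].items():
            estimate[state] += weight * prob / total
    return dict(estimate)
```

An unseen window such as ("human", "T", "cells") then borrows its state distribution from frequent windows that differ in only one position.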
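18/29 Post-processing: Cascaded Entity Name Resolution (1/2) Six patterns are extracted from GENIA:
<ENTITY> := <ENTITY> + head noun, e.g. <ENTITY> binding motif
<ENTITY> := <ENTITY> + <ENTITY>
<ENTITY> := modifier + <ENTITY>, e.g. anti <ENTITY>
<ENTITY> := <ENTITY> + word + <ENTITY>, e.g. <ENTITY> infected <ENTITY>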
19/29 Post-processing: Cascaded Entity Name Resolution (2/2)
<ENTITY> := modifier + <ENTITY> + head noun
<ENTITY> := <ENTITY> + <ENTITY> + head noun
In the experiments, all rules matching the six patterns above are extracted from the cascaded entity names in the training data to deal with the cascaded entity name phenomenon.
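Applying such rules after recognition can be sketched as a left-to-right scan that merges any run of chunks matching a rule into one larger entity. The chunk representation, the rule format, and the entity classes in the usage example are hypothetical; the slides do not preserve the classes used in the original rules.

```python
def apply_rules(chunks, rules):
    """Merge adjacent chunks that match a rule into a single cascaded entity.

    chunks: list of (text, entity_class or None)
    rules:  {pattern tuple: resulting class}; a pattern element is either
            "<CLASS>" (an entity of that class) or a literal lowercase word.
    """
    def key(chunk):
        text, entity_class = chunk
        return f"<{entity_class}>" if entity_class else text.lower()

    result, i = [], 0
    while i < len(chunks):
        for pattern, new_class in rules.items():
            window = chunks[i:i + len(pattern)]
            if tuple(key(c) for c in window) == pattern:
                merged = " ".join(text for text, _ in window)
                result.append((merged, new_class))
                i += len(pattern)
                break
        else:
            # No rule matched at this position: keep the chunk unchanged.
            result.append(chunks[i])
            i += 1
    return result
```

For instance, a rule `("<Protein>", "binding", "motif") -> "DNA"` (an illustrative rule of the form entity + head noun) would merge a recognized protein name followed by "binding motif" into a single DNA entity.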
20/29 Experiments (1/5) PowerBioNE is evaluated on GENIA V3.0 and V1.1. For GENIA V1.1, 80 abstracts are selected for testing and the remaining 590 abstracts are used as training data. For GENIA V3.0, 200 abstracts are selected as test data and the remaining 1,800 abstracts are used as training data. All experiments are run 10 times and the evaluations are averaged over the test data. On average, 63 rules are extracted from the cascaded entity names in GENIA V1.1, while 102 rules are extracted from the cascaded entity names in the training data of GENIA V3.0.
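The scoring behind such an evaluation can be sketched as follows, assuming exact matching of entity boundaries and class and a balanced F-measure; the slides do not spell out the matching criterion, so both are assumptions, as are the function names.

```python
from statistics import mean

def precision_recall_f(gold, predicted):
    """Exact-match precision, recall and balanced F-measure over (start, end, class) triples."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def average_f(runs):
    """Average the F-measure over several (gold, predicted) runs."""
    return mean(precision_recall_f(gold, predicted)[2] for gold, predicted in runs)
```

Under exact matching, a prediction with the right class but a wrong left boundary counts as both a false positive and a missed entity, which is why boundary errors weigh heavily in this kind of evaluation.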
21/29 Experiments (2/5)
22/29 Experiments (2/5, continued)
23/29 Experiments (3/5)
24/29 Analysis for Table 8 The contribution of the word formation pattern feature in the biomedical domain is very limited compared with its contribution in the newswire domain. The morphological pattern feature is useful. POS, after adaptation, proves very useful in the biomedical domain. The head noun trigger feature is very useful. The special verb trigger feature decreases recall while leaving precision unchanged. The name alias feature improves the F-measure only slightly, by 0.6; this may be due to the complexity of the name alias phenomenon and the simple strategy applied in the system. The pattern-based post-processing for cascaded entity name resolution proves very useful.
25/29 Experiments (4/5) Adding more verb triggers only decreases performance further.
26/29 Experiments (5/5)
27/29 Analysis for Table 10 The results suggest that a stable and significant performance improvement is achieved only when the POS tags included are accurate enough. HMMs achieve the highest performance because they are better able to capture the locality of various biomedical entity names. The feature vector-based classifiers, such as SVM, C4.5, C4.5 rules and RIPPER, cannot effectively capture local context dependence because they assume independence between the features, while the baseline naïve Bayes classifier fails to capture it because it assumes conditional independence among the elements of the local context.
28/29 Error Analysis 100 errors are chosen at random from the results. Errors that are due to the strict annotation scheme and to annotation inconsistency in the GENIA corpus can be considered acceptable. Counts are given as (total/acceptable): left boundary errors (15/12); cascaded entity name errors (17/13); misclassification errors (16/3); true negative (29/12); false positive (18/10); miscellaneous (11/1).
29/29 Conclusion The work proposes and integrates various evidential features, including word formation pattern, morphological pattern, POS, head noun trigger, special verb trigger and name alias feature. The k-NN algorithm effectively resolves the data sparseness problem. The pattern-based post-processing deals with the cascaded entity name phenomenon. It is the first system that deals with the cascaded entity name phenomenon.