Exploiting Domain Structure for Named Entity Recognition Jing Jiang & ChengXiang Zhai Department of Computer Science University of Illinois at Urbana-Champaign.

Exploiting Domain Structure for Named Entity Recognition Jing Jiang & ChengXiang Zhai Department of Computer Science University of Illinois at Urbana-Champaign

2 Named Entity Recognition A fundamental task in IE An important and challenging task in biomedical text mining  Critical for relation mining  Great variation and different gene naming conventions

3 Need for domain adaptation Performance degrades when test domain differs from training domain Domain overfitting taskNE typestrain → testF1 newsLOC, ORG, PER NYT → NYT0.855 Reuters → NYT0.641 biomedicalgene, protein mouse → mouse0.541 fly → mouse0.281

4 Existing work Supervised learning  HMM, MEMM, CRF, SVM, etc. (e.g., [Zhou & Su 02], [Bender et al. 03], [McCallum & Li 03] ) Semi-supervised learning  Co-training [Collins & Singer 99] Domain adaptation  External dictionary [Ciaramita & Altun 05]  Not seriously studied

5 Outline Observations Method  Generalizability-based feature ranking  Rank-based prior Experiments Conclusions and future work

6 Observation I Overemphasis on domain-specific features in the trained model wingless daughterless eyeless apexless … fly “suffix –less” weighted high in the model trained from fly data Useful for other organisms? in general NO! May cause generalizable features to be downweighted

7 Observation II Generalizable features: generalize well in all domains  …decapentaplegic and wingless are expressed in analogous patterns in each primordium of… (fly)  …that CD38 is expressed by both neurons and glial cells…that PABPC5 is expressed in fetal brain and in a range of adult tissues. (mouse)

8 Observation II Generalizable features: generalize well in all domains  …decapentaplegic and wingless are expressed in analogous patterns in each primordium of… (fly)  …that CD38 is expressed by both neurons and glial cells…that PABPC5 is expressed in fetal brain and in a range of adult tissues. (mouse) “w i+2 = expressed” is generalizable

9 Generalizability-based feature ranking flymouseD3D3 DmDm … training data … -less … expressed … expressed … -less … expressed … -less … expressed … -less … 1234567812345678 1234567812345678 1234567812345678 1234567812345678 s(“expressed”) = 1/6 = 0.167 s(“-less”) = 1/8 = 0.125 … expressed … -less … 0.125 … 0.167 …

10 Feature ranking & learning labeled training data supervised learning algorithm trained classifier... expressed … -less … top k features F

11 Feature ranking & learning labeled training data supervised learning algorithm trained classifier... expressed … top k features F

12 Feature ranking & learning labeled training data supervised learning algorithm trained classifier... expressed … -less … prior F logistic regression model (MEMM) rank-based prior variances in a Gaussian prior

13 Prior variances Logistic regression model MAP parameter estimation σ j 2 is a function of r j prior for the parameters

14 Rank-based prior variance σ 2 rank rr = 1, 2, 3, … a important features → large σ 2 non-important features → small σ 2

15 Rank-based prior variance σ 2 rank rr = 1, 2, 3, … a b = 6 b = 4 b = 2 a and b are set empirically

16 Summary D1D1 DmDm … training data O1O1 OmOm … individual domain feature ranking generalizability-based feature ranking O’ learning E testing entity tagger test data optimal b 1 for D 1 optimal b 2 for D 2 b = λ 1 b 1 + … + λ m b m λ 1, …, λ m rank-based prior optimal b m for D m rank-based prior

17 Experiments Data set  BioCreative Challenge Task 1B  Gene/protein recognition  3 organisms/domains: fly, mouse and yeast Experimental setup  2 organisms for training, 1 for testing  Baseline: uniform-variance Gaussian prior  Compared with 3 regular feature ranking methods: frequency, information gain, chi-square

18 Comparison with baseline ExpMethodPrecisionRecallF1 F+M→YBaseline0.5570.4660.508 Domain0.5750.5160.544 % Imprv.+3.2%+10.7%+7.1% F+Y→MBaseline0.5710.3350.422 Domain0.5820.3810.461 % Imprv.+1.9%+13.7%+9.2% M+Y→FBaseline0.5830.0970.166 Domain0.5910.1390.225 % Imprv.+1.4%+43.3%+35.5%

19 Comparison with regular feature ranking methods generalizability-based feature ranking information gain and chi-square feature frequency

20 Conclusions and future work We proposed  Generalizability-based feature ranking method  Rank-based prior variances Experiments show  Domain-aware method outperformed baseline method  Generalizability-based feature ranking better than regular feature ranking To exploit the unlabeled test data

21 Thank you! The end

22 How to set a and b? a is empirically set to a constant  b is set from cross validation  Let b i be the optimal b for domain D i  b is set as a weighted average of b i ’s, where the weights indicate the similarity between the test domain and the training domains

Exploiting Domain Structure for Named Entity Recognition Jing Jiang & ChengXiang Zhai Department of Computer Science University of Illinois at Urbana-Champaign.

Similar presentations

Presentation on theme: "Exploiting Domain Structure for Named Entity Recognition Jing Jiang & ChengXiang Zhai Department of Computer Science University of Illinois at Urbana-Champaign."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Exploiting Domain Structure for Named Entity Recognition Jing Jiang & ChengXiang Zhai Department of Computer Science University of Illinois at Urbana-Champaign.

Similar presentations

Presentation on theme: "Exploiting Domain Structure for Named Entity Recognition Jing Jiang & ChengXiang Zhai Department of Computer Science University of Illinois at Urbana-Champaign."— Presentation transcript:

Similar presentations

About project

Feedback