Domain Adaptation for Biomedical Information Extraction Jing Jiang BeeSpace Seminar Oct 17, 2007.

1 Domain Adaptation for Biomedical Information Extraction Jing Jiang BeeSpace Seminar Oct 17, 2007

2 Outline
- Why do we need domain adaptation?
- Solutions:
  - Intelligent learning methods
  - Knowledge bases
  - Expert supervision
- Connections with BeeSpace V4

3 Why do we need domain adaptation?
- Many biomedical information extraction problems are solved by supervised machine learning methods such as support vector machines (SVMs).
  - Entity recognition
  - Relation extraction
  - Sentence categorization
- In supervised machine learning, it is assumed that the training data and the test data have the same distribution.

4 Why do we need domain adaptation?
- Existing labeled training data is often limited to certain domains.
  - GENIA corpus: human, blood cells, transcription factors
  - PennBioIE: genetic variation in malignancy, cytochrome P450 inhibition
  - Training data for sentence categorization in gene summarizer: fly
- Even when the training data is diverse (containing multiple domains), it would still be useful to customize the classifier for the particular target domain that we are working on.

5 Why do we need domain adaptation?
NER task | Train → Test | F1
to find PER, LOC, ORG from news text | NYT → NYT | 0.855
to find PER, LOC, ORG from news text | Reuters → NYT | 0.641
to find gene/protein from biomedical literature | mouse → mouse | 0.541
to find gene/protein from biomedical literature | fly → mouse | 0.281

6 Solutions to domain adaptation
- Intelligent learning methods
  - Instance weighting
  - Feature selection
- Knowledge bases
- Expert supervision
[Slide annotations: thesis research, future work, discussion]

7 Domain adaptive learning methods
- Two-stage approach
- Two frameworks
  - Instance weighting
  - Feature selection
- Use of unlabeled data

8 Intuition
[Figure: source domain and target domain]

9 Goal
[Figure: target domain and source domain]

10 Start from the source domain
[Figure: source domain and target domain]

11 Focus on the common part
[Figure: source domain and target domain]

12 Pick up some part from the target domain
[Figure: source domain and target domain]

13 Formal formulation?
[Figure: source domain and target domain]
- How to formally formulate these ideas?

14 Instance weighting
[Figure: source domain and target domain in the instance space; each point represents an example]
- Goal: to assign different weights to different instances in the objective function

15 Instance weighting: Observation
[Figure: source domain and target domain]

16 Instance weighting: Observation
[Figure: source domain and target domain]

17 Instance weighting: Analysis of domain difference
$p(x, y) = p(x)\,p(y \mid x)$
- Labeling difference: $p_s(y \mid x) \neq p_t(y \mid x)$, addressed by labeling adaptation
- Instance difference: $p_s(x) \neq p_t(x)$, addressed by instance adaptation
?
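
The instance-difference case can be made concrete with the standard importance-weighting identity. This is a sketch and not necessarily the formulation used in the talk. It assumes that the labeling distributions agree, $p_s(y \mid x) = p_t(y \mid x)$, and that $p_s(x) > 0$ wherever $p_t(x) > 0$. The loss $L$ and the model $f$ are illustrative symbols that do not appear on the slide.

```latex
\begin{aligned}
\mathbb{E}_{(x,y) \sim p_t}\bigl[L(f(x), y)\bigr]
  &= \sum_{x,y} p_t(x)\, p_t(y \mid x)\, L(f(x), y) \\
  &= \sum_{x,y} p_s(x)\, p_s(y \mid x)\, \frac{p_t(x)}{p_s(x)}\, L(f(x), y) \\
  &= \mathbb{E}_{(x,y) \sim p_s}\!\left[\frac{p_t(x)}{p_s(x)}\, L(f(x), y)\right]
\end{aligned}
```

Under these assumptions, weighting each source instance by $p_t(x)/p_s(x)$ turns the source-domain loss into an estimate of the target-domain loss. When the labeling distributions differ, this identity no longer holds and labeled target data is needed.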

18 Instance weighting: Three sets of instances
[Figure: $D_s$, $D_{t,l}$, $D_{t,u}$]
$X \rightarrow D_s + D_{t,l} + D_{t,u}$ ?

19 Instance weighting: Framework
- A flexible setup covering both standard methods and new domain adaptive methods
[Figure labels: labeled source data, labeled target data, unlabeled target data]
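
A minimal sketch of how per-instance weights can enter a training objective over the three sets, assuming a logistic regression model trained by batch gradient descent. The model choice, the function names, and the per-set weights `lam_s`, `lam_tl`, and `lam_tu` are all assumptions made for illustration; the transcript does not preserve the actual objective from the slide.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weighted_examples(source, target_labeled, target_pseudo,
                      lam_s, lam_tl, lam_tu):
    """Merge the three sets into (features, label, weight) triples.

    lam_s, lam_tl, lam_tu are hypothetical per-set weights. target_pseudo
    holds unlabeled target instances paired with labels predicted by
    some earlier model.
    """
    merged = [(x, y, lam_s) for x, y in source]
    merged += [(x, y, lam_tl) for x, y in target_labeled]
    merged += [(x, y, lam_tu) for x, y in target_pseudo]
    return merged


def weighted_log_loss(w, examples):
    """Instance-weighted negative log-likelihood of a logistic model."""
    total = 0.0
    for x, y, weight in examples:
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
        total -= weight * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total


def train(examples, dim, lr=0.1, steps=500):
    """Plain batch gradient descent on the weighted loss."""
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y, weight in examples:
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            for j in range(dim):
                grad[j] += weight * (p - y) * x[j]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Toy data: one feature whose label flips between the two domains.
source = [([1.0], 1), ([1.0], 1), ([1.0], 1)]
target_labeled = [([1.0], 0)]

source_only = train(
    weighted_examples(source, target_labeled, [], 1.0, 0.0, 0.0), dim=1)
target_heavy = train(
    weighted_examples(source, target_labeled, [], 0.1, 10.0, 0.0), dim=1)
```

With the target weight set to zero the learned coefficient follows the source labels, and with a large target weight it follows the single labeled target instance. Setting all weights to 1 for the source set and 0 elsewhere gives standard supervised learning on the source domain.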

20 Feature selection
[Figure: source domain and target domain in the feature space; each point represents a feature]
- Goal: to identify features that behave similarly across domains

21 Feature selection: Observation
- Domain-specific features
  - Fly genes: wingless, daughterless, eyeless, apexless, …
  - “suffix -less” is weighted high in the model trained from fly data
  - Useful for other organisms? In general, NO!
  - May cause generalizable features to be downweighted

22 Feature selection: Observation
- Generalizable features: generalize well in all domains
  - fly: “…decapentaplegic and wingless are expressed in analogous patterns in each …”
  - mouse: “…that CD38 is expressed by both neurons and glial cells…”, “…that PABPC5 is expressed in fetal brain and in a range of adult tissues.”

23 Feature selection: Observation
- Generalizable features: generalize well in all domains
  - fly: “…decapentaplegic and wingless are expressed in analogous patterns in each …”
  - mouse: “…that CD38 is expressed by both neurons and glial cells…”, “…that PABPC5 is expressed in fetal brain and in a range of adult tissues.”
- “$w_{i+2}$ = expressed” is generalizable

24 Feature selection: Intuition for identification of generalizable features
[Figure: ranked feature lists (ranks 1 to 8), one per source domain: fly, mouse, $D_3$, …, $D_K$. The features “expressed” and “-less” are marked in each list.]
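
One way to read this intuition is: rank the features separately in each source domain, then prefer features that rank highly in every domain. The sketch below scores each feature by its worst rank across domains. That combination rule, the function names, and the toy weights are assumptions for illustration and are not taken from the talk.

```python
def rank_features(scores):
    """Map feature -> rank (1 = largest absolute weight) within one domain."""
    ordered = sorted(scores, key=lambda f: (-abs(scores[f]), f))
    return {f: i + 1 for i, f in enumerate(ordered)}


def generalizable_features(domain_scores, top_k):
    """Return the top_k features by worst rank across all domains.

    domain_scores maps a domain name to {feature: weight}, e.g. weights
    from a classifier trained on that domain alone.
    """
    ranks = {d: rank_features(s) for d, s in domain_scores.items()}
    shared = set.intersection(*(set(r) for r in ranks.values()))
    worst = {f: max(r[f] for r in ranks.values()) for f in shared}
    return sorted(shared, key=lambda f: (worst[f], f))[:top_k]


# Hypothetical per-domain weights; the numbers are made up.
domain_scores = {
    "fly": {"suffix=-less": 2.0, "w+2=expressed": 1.5, "prefix=CD": 0.1},
    "mouse": {"w+2=expressed": 1.8, "prefix=CD": 1.2, "suffix=-less": 0.05},
}
top = generalizable_features(domain_scores, top_k=1)
```

In this toy input, "suffix=-less" ranks first in fly but last in mouse, so its worst rank is poor, while "w+2=expressed" is in the top two in both domains and is selected.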

25 Feature selection: Framework
- Matrix A is for feature selection

26 Feature selection results on gene/protein recognition

27 New directions to explore
- Knowledge bases
- Expert supervision

28 Knowledge bases – entity recognition
- Well-documented nomenclatures
  - Fly, Mouse, Rat
  - Help filter out false positives?
  - Help select features?
- Dictionaries of entities
  - “Dictionary features”
  - Automatic summarization of nomenclatures?
  - Automatic identification of good features?
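
A “dictionary feature” can be sketched as a per-token flag that records whether the token falls inside a dictionary entry found in the sentence. The function below is an illustrative assumption (longest match first, case-insensitive) and the tiny dictionary is a toy; a real system would load entries from a curated resource.

```python
def dictionary_features(tokens, gene_dictionary, max_len=3):
    """Return one feature dict per token.

    in_dict is True when the token is part of a dictionary entry matched
    in the sentence (longest match first, up to max_len tokens).
    """
    lowered = [t.lower() for t in tokens]
    in_dict = [False] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(lowered[i:i + n]) in gene_dictionary:
                matched = n
                break
        if matched:
            for j in range(i, i + matched):
                in_dict[j] = True
            i += matched
        else:
            i += 1
    return [{"token": t, "in_dict": flag} for t, flag in zip(tokens, in_dict)]


# Toy dictionary; "toy gene" is a made-up multi-word entry.
toy_dictionary = {"wingless", "decapentaplegic", "toy gene"}
features = dictionary_features(
    "decapentaplegic and wingless are expressed".split(), toy_dictionary)
```

The resulting flags can be added to the other token features of an entity recognizer, which lets the classifier learn how much to trust the dictionary instead of treating a match as a final decision.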

29 Knowledge bases – sentence categorization in gene summarizer
- For fly, the training sentences are automatically extracted from FlyBase.
- For other organisms, do we have similar resources?

30 Expert supervision – entity recognition
- Computer system selects ambiguous examples for human experts to judge.
- Computer system asks human experts other questions.
  - Similar organisms?
  - Typical surface features? (e.g. cis-regulatory elements, “-RE”)
- Computer system summarizes possible features from pseudo labeled data, and asks human experts for confirmation.
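
Selecting ambiguous examples for experts is commonly done with uncertainty sampling. The sketch below assumes a binary classifier that outputs a probability for each candidate and picks the candidates closest to 0.5. The function name, the example ids, and the probabilities are hypothetical.

```python
def select_for_annotation(predictions, budget):
    """Pick the `budget` examples whose predicted probability is nearest 0.5.

    predictions: list of (example_id, probability of the positive class).
    Ties are broken by example_id so the result is deterministic.
    """
    ranked = sorted(predictions, key=lambda item: (abs(item[1] - 0.5), item[0]))
    return [example_id for example_id, _ in ranked[:budget]]


# Hypothetical classifier outputs for four candidate mentions.
predictions = [("s1", 0.97), ("s2", 0.52), ("s3", 0.10), ("s4", 0.45)]
to_label = select_for_annotation(predictions, budget=2)
```

Here the two candidates nearest the decision boundary, "s2" and "s4", are sent to the expert, while the confident predictions are left alone.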

31 Connections to BeeSpace V4
- A major challenge in BeeSpace V4 is extraction of new types of entities and relations.
- Exploiting knowledge bases and expert supervision is especially important.
- For new types, no labeled data is available even from other domains.
- Use of bootstrapping methods should be explored.

32 New entity types
- Recognition of many new types will be dictionary based: organism, anatomy, biological process, etc.
- Recognition of some new types will need some NER techniques: chemical, regulatory element

33 New relation types
- Bootstrapping (?)
  - Seed patterns from knowledge bases or human experts
  - Human inspection of newly discovered patterns?
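
A minimal sketch of pattern bootstrapping with a human in the loop, under simplifying assumptions: the corpus is already reduced to (entity, connecting pattern, entity) triples, and the `approve` callback stands in for human inspection. All names and the toy corpus are hypothetical.

```python
def bootstrap(corpus, seed_patterns, approve, rounds=2):
    """Grow a pattern set from seeds.

    corpus: list of (entity_a, pattern, entity_b) triples.
    approve: callable that returns True if a person accepts a new pattern.
    Returns the final pattern set and the entity pairs it extracts.
    """
    patterns = set(seed_patterns)
    pairs = set()
    for _ in range(rounds):
        # Extract pairs with the patterns known so far.
        pairs |= {(a, b) for a, p, b in corpus if p in patterns}
        # Unknown patterns that connect an already extracted pair.
        candidates = {p for a, p, b in corpus
                      if (a, b) in pairs and p not in patterns}
        accepted = {p for p in candidates if approve(p)}
        if not accepted:
            break
        patterns |= accepted
    # Final extraction with the full pattern set.
    pairs |= {(a, b) for a, p, b in corpus if p in patterns}
    return patterns, pairs


# Toy corpus with placeholder entity names.
corpus = [
    ("geneA", "regulates", "geneB"),
    ("geneA", "activates", "geneB"),
    ("geneC", "activates", "geneD"),
    ("geneA", "is expressed with", "geneB"),
    ("geneE", "is expressed with", "geneF"),
]
patterns, pairs = bootstrap(
    corpus, {"regulates"}, approve=lambda p: p != "is expressed with")
```

In this run the seed "regulates" finds one pair, that pair surfaces "activates" and "is expressed with" as candidates, and the reviewer keeps only the first, which stops the looser pattern from pulling in unrelated pairs.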

34 The end
