Supervisors: Prof. Gagan Agrawal and Prof. Mikhail Belkin

Bioinformatics, Data Integration and Machine Learning: A Thesis Proposal
Kaushik Sinha
Supervisors: Prof. Gagan Agrawal and Prof. Mikhail Belkin
November 18

Roadmap
Motivation
Our Approach
Current Work
  Learning Layouts of Flat-file Biological Datasets
  Exploratory Tools for Biological Data Analysis
Proposed Work
  Deep Web Mining for Biological Data
  Semi-supervised Ranking
  Multiple-instance Learning
Conclusion

Motivation
Data explosion
  Data size & number of data sources
New analysis tools
Integration is hard
  Autonomous resources
  Heterogeneous data representation & various interfaces
  Frequent updates
New trend: web and grid services

Motivation contd…
In recent years, DNA microarrays and other gene and protein assays have become essential tools for biologists
The next step of biological enquiry is to find out
  What is known about these genes?
  How are these genes related to each other, or to other genes identified in similar studies?
However, the major difficulties are
  How do we extract key properties shared by candidate genes?
  How do we generate reasonable hypotheses to explain them?
  How do we define and evaluate similarity between sets of genes?

Motivating Example
Suppose that after a microarray experiment a biologist suspects that a small set of genes is related to a disease
This can be confirmed by searching the existing literature
  One would expect related genes to appear together in the literature
Due to the sheer volume of literature
  Searching is time consuming and error prone
Some complications could arise as well
  Suppose genes A and C are related, and both of them are weakly related to gene B
  In the literature, one would expect
    A, C to appear together
    OR/AND
    A, B to appear together
    B, C to appear together
  How do we efficiently conclude that A and C are actually related?

Our Approach
Use data mining / machine learning techniques to extract useful information from biological data
Different forms of data
  Flat-file data
  Microarray data
  Online literature abstracts
Develop different forms of tools
  Layout extractor
  Hypergraph mining
  Similarity measure among sets of genes

Learning Layout of a Flat-File
In general: intractable
Try to learn the layout, and have a domain expert verify it
Key issue: what delimiters are being used?

Finding Delimiters
Some knowledge from a domain expert is required (semi-automatic)
Naïve approaches
  Frequency counting
    Counts frequently occurring single tokens (words separated by spaces)
  Sequence mining
    Counts frequently occurring sequences of tokens
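The two naïve approaches can be sketched in a few lines. This is only an illustration under assumptions: tokens are taken to be whitespace-separated, sequence mining is simplified to counting contiguous token sequences of one fixed length, and the sample records are invented.

```python
from collections import Counter


def token_frequencies(lines):
    """Frequency counting: count single whitespace-separated tokens."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def sequence_frequencies(lines, length):
    """Simplified sequence mining: count contiguous token sequences of a fixed length."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for start in range(len(tokens) - length + 1):
            counts[tuple(tokens[start:start + length])] += 1
    return counts


# Hypothetical flat-file fragment, invented for illustration only.
records = [
    "ID P12345",
    "DE Hypothetical protein",
    "ID Q67890",
    "DE Hypothetical enzyme",
]

print(token_frequencies(records).most_common(3))
print(sequence_frequencies(records, 2).most_common(2))
```

In this toy input the non-delimiter token "Hypothetical" is as frequent as the field tags "ID" and "DE", which illustrates why frequency alone can be misleading.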

Assumptions
Biological datasets are written for humans to read
It is very unlikely that delimiters will be scattered across different places in a line
The position of possible delimiters might provide useful information
A combination of positional and frequency information might be a better choice

Positional Weight
Let P be the set of positions in a line where a token can appear
For each position i ∈ P, tot_seq_i^j is the total number of token sequences of length j starting at position i
For each position i ∈ P, tot_unique_seq_i^j is the total number of unique token sequences of length j starting at position i
For any tuple (i, j), p_ratio(i, j) is defined as shown on the slide [formula not reproduced in this transcript]
p_ratio(i, j) can be log normalized to get the positional weight p_wt(i, j), with the property p_wt(i, j) ∈ (0, 1)
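The two per-position counts can be computed directly from the text. The sketch below assumes positions are 0-based token indices and tokens are whitespace-separated. It stops at the counts, because the transcript does not include the definitions of p_ratio or the log normalization.

```python
from collections import defaultdict


def positional_counts(lines, length):
    """Return {position: (tot_seq, tot_unique_seq)} for token sequences of `length`.

    Positions are 0-based token indices (an assumption of this sketch).
    """
    total = defaultdict(int)
    unique = defaultdict(set)
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - length + 1):
            total[i] += 1
            unique[i].add(tuple(tokens[i:i + length]))
    return {i: (total[i], len(unique[i])) for i in total}


# Hypothetical flat-file fragment, invented for illustration only.
records = [
    "ID P12345",
    "DE Hypothetical protein",
    "ID Q67890",
    "DE Hypothetical enzyme",
]

counts = positional_counts(records, 1)
print(counts)
```

At position 0 only two distinct tokens occur across four lines, while later positions are more varied, which is the kind of positional signal the slide refers to.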

Delimiter score (d_score)
The frequency weight f_wt(s_i^j) of any token sequence s_i^j, of length j and starting at position i, is obtained by log normalizing its frequency f(s_i^j)
  f_wt(s_i^j) ∈ (0, 1)
Positional and frequency weights can now be combined to get d_score as follows
  d_score(s_i^j) = α · p_wt(i, j) + (1 − α) · f_wt(s_i^j), where α ∈ (0, 1)
Thus d_score has the following two properties
  d_score(s_i^j) ∈ (0, 1)
  d_score(s_i^j) > d_score(s_k^j) implies s_i^j is more likely to be a delimiter than s_k^j
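A minimal sketch of the combination step follows. It assumes the positional and frequency weights have already been computed and lie in (0, 1); the candidate names and weight values are invented for illustration.

```python
def d_score(p_wt, f_wt, alpha):
    """Convex combination of positional and frequency weight, with alpha in (0, 1)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha * p_wt + (1.0 - alpha) * f_wt


def rank_candidates(weights, alpha):
    """Sort candidate sequences by d_score, most delimiter-like first."""
    scores = {seq: d_score(p, f, alpha) for seq, (p, f) in weights.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical (p_wt, f_wt) pairs, invented for illustration only.
candidates = {
    "ID": (0.9, 0.8),
    "DE": (0.9, 0.7),
    "Hypothetical": (0.3, 0.7),
}

print(rank_candidates(candidates, alpha=0.5))
```

Because the score is a convex combination of two values in (0, 1), it stays in (0, 1) for any admissible α, matching the first property stated on the slide.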

Generating layout descriptor
Once the delimiters are identified, an NFA can be built by scanning the whole database, with the delimiters as the states of the NFA
This NFA can be used to generate a layout descriptor, since it naturally represents optional and repeating states
The following figure shows an NFA where A, B, C, D, and E are delimiters, B is an optional delimiter, and C D are repeating delimiters
[Figure: NFA over delimiters A, B, C, D, E; not reproduced in this transcript]
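One way to picture the construction is to collect, for every delimiter, the set of delimiters observed immediately after it. The sketch below assumes each record has already been reduced to its ordered list of delimiters. The two sample records are invented to match the A to E example, and the function name is hypothetical.

```python
from collections import defaultdict


def build_transitions(records):
    """Map each delimiter (state) to the set of delimiters seen right after it."""
    successors = defaultdict(set)
    for delimiters in records:
        for current, following in zip(delimiters, delimiters[1:]):
            successors[current].add(following)
    return dict(successors)


# Hypothetical delimiter sequences for two records, invented for illustration.
records = [
    ["A", "B", "C", "D", "C", "D", "E"],  # B present, C D repeated
    ["A", "C", "D", "E"],                 # B absent
]

transitions = build_transitions(records)
for state in sorted(transitions):
    print(state, "->", sorted(transitions[state]))
```

In the output, A leads to both B and C, so B can be skipped (optional), and D leads back to C, so C D can repeat.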

Results
By suitably varying α, a tight superset of the possible delimiters is found
A domain expert can then help identify the true delimiters
Results from 3 different flat-file datasets are as follows
[Results table not reproduced in this transcript]

Comparison with naïve approaches
The d_score based approach does a better job than the naïve approaches
The following table shows the improvement
[Table not reproduced in this transcript]

Realistic Situation
Identifying a complete list of correct delimiters is difficult
Most likely we will end up with an incomplete list of delimiters
The delimiters that do not appear in every data record (optional delimiters) are the ones most likely to be missed

Identifying Optional Delimiters
Given an incomplete list of delimiters, how can we identify optional delimiters, if any?
  Build an NFA based on the given incomplete information
  Perform clustering to identify possible crucial delimiters
  Perform contrast analysis

Crucial delimiter
A delimiter is considered crucial if missing delimiters can appear immediately after it
The goal is to create two clusters
  One containing delimiters that are not crucial
  The other containing crucial delimiters

Identifying crucial delimiters: A few definitions
Succ(X): set of delimiters that can immediately follow X
Dist_App: number of groups of occurrences of X, grouped by the number of text lines between X and the immediately following delimiter
Info_Tuple (n_X^i, f_X^i, t_X^i): information for each Dist_App group
Info_Tuple_List L_X: for any X, the list of all its Info_Tuples
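The grouping behind Dist_App and the Info_Tuples can be sketched as follows. This is an interpretation rather than the author's implementation. It assumes a delimiter is the first token of its line, that n is the number of lines strictly between X and the next known delimiter, that f is the number of occurrences of X with that gap, and that t is the text collected from those gaps. The sample lines are invented.

```python
from collections import defaultdict


def info_tuples(lines, delimiters, x):
    """Group occurrences of delimiter `x` by the line gap to the next known delimiter.

    Returns a list of (n, f, text_lines) sorted by decreasing n.
    """
    def leading_delimiter(line):
        tokens = line.split()
        return tokens[0] if tokens and tokens[0] in delimiters else None

    groups = defaultdict(lambda: [0, []])  # gap -> [occurrence count, collected text]
    i = 0
    while i < len(lines):
        if leading_delimiter(lines[i]) == x:
            j = i + 1
            while j < len(lines) and leading_delimiter(lines[j]) is None:
                j += 1
            gap = j - i - 1  # lines strictly between x and the next delimiter
            groups[gap][0] += 1
            groups[gap][1].extend(lines[i + 1:j])
            i = j
        else:
            i += 1
    return sorted(
        ((n, f, text) for n, (f, text) in groups.items()),
        key=lambda item: item[0],
        reverse=True,
    )


# Hypothetical file where an unknown field sometimes follows "ID".
sample = ["ID a", "DE x", "ID b", "note1", "DE y", "ID c", "DE z"]
tuples_for_id = info_tuples(sample, {"ID", "DE"}, "ID")
print(tuples_for_id)
print("Dist_App:", len(tuples_for_id))
```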

Metric for clustering
r_X^f is likely to be low if an optional delimiter appears immediately after X, and high otherwise
Choose a suitable cut-off value r_c and assign delimiters to groups as follows
  If r_X^f < r_c, assign X to the group of possible crucial delimiters
  Else assign X to the group of non-crucial delimiters
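The cut-off rule itself is a simple threshold split. The sketch below assumes the r values have already been computed for each delimiter (their definition is not part of this transcript). The delimiter names and values are invented.

```python
def split_by_cutoff(r_values, r_c):
    """Return (possible_crucial, non_crucial) delimiter sets for cut-off r_c."""
    possible_crucial = {x for x, r in r_values.items() if r < r_c}
    non_crucial = set(r_values) - possible_crucial
    return possible_crucial, non_crucial


# Hypothetical r values per delimiter, invented for illustration only.
r_values = {"ID": 0.9, "DE": 0.2, "OS": 0.5}

crucial, pruned = split_by_cutoff(r_values, r_c=0.4)
print(sorted(crucial), sorted(pruned))
```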

Observations and Facts
Missing optional delimiters can appear ONLY immediately after crucial delimiters
  Non-crucial delimiters can be pruned away
Consider two Info_Tuples (n_X^1, f_X^1, t_X^1) and (n_X^2, f_X^2, t_X^2) in L_X
If a missing delimiter appears immediately after the occurrences corresponding to the first tuple but not the second
  n_X^1 > n_X^2
  The missing delimiter will appear in t_X^1 but not in t_X^2

A hypothetical example illustrating Contrast Analysis
Suppose X is a crucial delimiter with 2 Info_Tuples, L1 and L2, as follows
  L1 = (50, 20, l1.txt)
  L2 = (20, 12, l2.txt)
Sequence mining on l1.txt and l2.txt yields two sets of frequently occurring sequences, S1 and S2, as follows
  S1 = {f1, f5, f6, f8, f13, f21}
  S2 = {f1, f4, f6, f7, f8, f10, f13, f21}
Since f5 ∈ S1 but f5 ∉ S2, f5 is a possible missing delimiter
f5 is accepted as a missing delimiter only if it has a high d_score or is verified by a domain expert as a valid delimiter

Contrast Analysis
For any i, j, if n_X^i > n_X^j, look for frequently occurring sequences in t_X^i and t_X^j; call them fs_X^i and fs_X^j respectively
If there exists a frequent sequence fs such that fs ∈ fs_X^i but fs ∉ fs_X^j, then fs is quite likely to be a possible delimiter
If fs has a fairly high d_score, or is identified by a domain expert as a valid delimiter, add it to the incomplete list as a newly found delimiter
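The contrast step amounts to a set difference followed by a filter. The sketch below reuses the sets S1 and S2 from the hypothetical example earlier in the deck. The d_score values, the threshold, and the function names are invented for illustration.

```python
def contrast_candidates(fs_i, fs_j):
    """Frequent sequences present in the larger-gap text but absent from the other."""
    return set(fs_i) - set(fs_j)


def accept(candidates, d_scores, threshold, expert_approved=frozenset()):
    """Keep candidates with a high enough d_score or explicit expert approval."""
    return {
        fs for fs in candidates
        if d_scores.get(fs, 0.0) >= threshold or fs in expert_approved
    }


S1 = {"f1", "f5", "f6", "f8", "f13", "f21"}
S2 = {"f1", "f4", "f6", "f7", "f8", "f10", "f13", "f21"}

possible = contrast_candidates(S1, S2)
print(possible)

# Hypothetical d_score for f5 and a hypothetical acceptance threshold.
print(accept(possible, {"f5": 0.82}, threshold=0.7))
```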

Generalized Contrast Analysis
In case of more than two Info_Tuples, compute the mean of all n_X^i values
Form one group by appending the text from all Info_Tuples whose n_X^i is above the mean
Form another group by appending the text from all Info_Tuples whose n_X^i is below the mean
Perform contrast analysis among all such possible groups

Another example illustrating Generalized Contrast Analysis
Suppose X is a crucial delimiter with 3 Info_Tuples, L1, L2, L3, as follows
  L1 = (50, 20, l1.txt)
  L2 = (20, 12, l2.txt)
  L3 = (15, 10, l3.txt)
Mean number of lines: (50 + 20 + 15) / 3 ≈ 28.3
Append l2.txt and l3.txt; call the result t2.txt
Sequence mining on l1.txt and t2.txt yields two sets of frequently occurring sequences, S1 and S2, as follows
  S1 = {f1, f5, f6, f8, f13, f21}
  S2 = {f1, f4, f6, f7, f8, f10, f13, f21}
Since f5 ∈ S1 but f5 ∉ S2, f5 is a possible missing delimiter
f5 is accepted as a missing delimiter only if it has a high d_score or is verified by a domain expert as a valid delimiter
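The grouping step of this example can be checked with a short sketch. It assumes the first entry of each Info_Tuple is the line count n, and it places tuples whose n equals the mean in the lower group, which is a choice of this sketch rather than something the slides specify.

```python
def split_by_mean(tuples):
    """Split Info_Tuples (n, f, text) into groups above and not above the mean n."""
    mean_n = sum(n for n, _, _ in tuples) / len(tuples)
    above = [text for n, _, text in tuples if n > mean_n]
    rest = [text for n, _, text in tuples if n <= mean_n]
    return mean_n, above, rest


info = [(50, 20, "l1.txt"), (20, 12, "l2.txt"), (15, 10, "l3.txt")]

mean_n, above, rest = split_by_mean(info)
print(round(mean_n, 1), above, rest)
```

The files in each group would then be concatenated and passed to the pairwise contrast analysis.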

Overall Algorithms

Results: Optional delimiters
% Pruning = [formula and results not reproduced in this transcript]

Results: Non-optional Missing delimiters
Even though it was designed for finding optional delimiters, our algorithm also works, in some cases, for missing non-optional delimiters
If a missing non-optional delimiter appears in exactly the same location in each record, our algorithm fails
If a non-optional delimiter has a backward edge coming from a delimiter that appears later in a topologically sorted NFA, our algorithm works

Hypergraph Mining
Basic Motivation
  To find useful "transitive relations" (hypergraphs) among genes
Example (Gene-Disease Relationship)
  Gene A is related to gene B
  Gene B is related to gene C
  Is gene A related to gene C?
Gene Source
  Microarray experiments
Information Source
  Online literature abstracts

Formal Problem Definition
Given
  A dictionary K_T of keywords
  A dictionary K_M of user-provided keywords (K_T ⊃ K_M)
  A collection of literature abstracts, each represented as a set of keywords
Task
  To find hyperedges exceeding a user-defined threshold, each of which involves a set of keywords from K_M that are potentially connected by another set of linking words from K_T − K_M

Modeling
Purpose
  To use an approach similar to frequent itemset mining
Define
  total weight = support + cross support
  Support: the set of keywords appears together in one document
  Cross support: the set of keywords can be partitioned so that each partition appears in a different document
Issues
  Since the downclosure property does not hold for total weight, a modified downclosure property can be defined

Idea
Support satisfies the downclosure property
  Let X be a set and Ω its power set. A function f : Ω → R+ satisfies the downclosure property if for all A, B ∈ Ω with A ⊃ B, f(B) ≥ f(A)
Cross support can be designed to stay below a particular value, i.e., it is bounded
Form a function h as the sum of two functions, h = f + g
  f satisfies the downclosure property
  g is bounded
h satisfies a modified downclosure property
  For any θ ≥ 0, if h(K_n) ≥ θ then f(K_{n−1}) ≥ max{0, θ − sup(g)}
  (for K_{n−1} ⊂ K_n: f(K_{n−1}) ≥ f(K_n) = h(K_n) − g(K_n) ≥ θ − sup(g), and f is non-negative)
This property can be used to devise an efficient algorithm
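The modified downclosure property suggests an Apriori-style pruning test, sketched below. Several things here are assumptions: support is taken to be the number of abstracts containing all keywords, K_{n−1} is read as any subset of K_n with one keyword removed, and g_bound stands for a known upper bound sup(g) on cross support. The abstracts and function names are invented.

```python
from itertools import combinations


def support(keywords, abstracts):
    """Number of abstracts (keyword sets) that contain every keyword."""
    keywords = set(keywords)
    return sum(1 for abstract in abstracts if keywords <= abstract)


def can_reach_threshold(candidate, abstracts, theta, g_bound):
    """Necessary condition for h(candidate) >= theta under the modified property.

    Every subset with one keyword removed must have support of at least
    max(0, theta - g_bound); otherwise the candidate can be pruned.
    """
    needed = max(0, theta - g_bound)
    return all(
        support(subset, abstracts) >= needed
        for subset in combinations(candidate, len(candidate) - 1)
    )


# Hypothetical abstracts, each reduced to a set of keywords.
abstracts = [{"A", "B"}, {"B", "C"}, {"A", "B", "C"}, {"A", "D"}]

print(can_reach_threshold(("A", "B", "C"), abstracts, theta=3, g_bound=1))
print(can_reach_threshold(("A", "B", "C"), abstracts, theta=3, g_bound=2))
```

A looser bound on g lowers the support required of the subsets, so fewer candidates are pruned. The test is only a necessary condition, so surviving candidates still need their total weight computed.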

Results

Similarity Measure among sets of genes
Each file containing gene names can be considered a discrete random variable (DRV)
Each such DRV can take several values (gene names)
For two such files X, Y and any pair (x, y) with x ∈ X and y ∈ Y, p(x, y) can be computed from online abstracts based on co-occurrence
Defining Z = g(X, Y), Z is a random variable
  The expectation of Z can be used as a similarity measure
  Different choices of g give rise to different similarity measures
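The expectation E[Z] = Σ p(x, y) · g(x, y) can be sketched as below. The slides do not say how p(x, y) is derived from co-occurrence, so normalizing raw co-occurrence counts is an assumption here. The gene names, counts, and the particular g are all invented.

```python
def joint_from_counts(counts):
    """Turn co-occurrence counts {(x, y): count} into probabilities (assumed scheme)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no co-occurrences observed")
    return {pair: count / total for pair, count in counts.items()}


def expected_similarity(joint, g):
    """E[Z] for Z = g(X, Y) under the joint distribution p(x, y)."""
    return sum(p * g(x, y) for (x, y), p in joint.items())


# Hypothetical co-occurrence counts between genes of file X and genes of file Y.
counts = {("g1", "g3"): 3, ("g1", "g4"): 1, ("g2", "g3"): 0, ("g2", "g4"): 4}
joint = joint_from_counts(counts)

# Hypothetical g: 1 for pairs on an invented list of related pairs, else 0.
related = {("g1", "g3"), ("g2", "g4")}
similarity = expected_similarity(joint, lambda x, y: 1.0 if (x, y) in related else 0.0)
print(similarity)
```

Swapping in a different g changes the measure without changing the rest of the computation, which is the flexibility the slide points to.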

Query Planning for Deepweb Mining
A huge amount of online biological information is available in the form of the deep web
  An online query form needs to be filled out
The required information is obtained by filling out many such forms on different websites
  There might be dependencies among these forms
  Requires redundancy elimination

Semi-supervised Ranking
Given a training set of examples with labels / pairwise relationships
The task is to rank an unseen test set, i.e., to find a permutation in which relevant examples are ranked higher than irrelevant ones
This corresponds to learning a ranking function
Semi-supervised ranking
  Incorporates unlabeled examples to learn the ranking function
  Out-of-sample extension

Potential Application
Following a microarray experiment, it might be possible to guess whether gene A is more important than gene B involved in the experiment
However, determining all possible order relationships is time consuming and error prone
Thus, from a small set of order relationships, and using other genes from the experiment as unlabeled data, a semi-supervised ranking function can be learned

Multiple Instance Learning
Instead of instance-label pairs (x, y), bag-label pairs (B, y) are provided as training data
A bag contains multiple instances
  A bag label is negative if every instance in the bag has a negative label
  A bag label is positive if at least one instance in the bag has a positive label
Given an unseen bag, the task is to predict its label
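The bag-labelling rule can be written down directly. The sketch below only illustrates that rule and does not show how a multiple-instance learner is trained. The instance classifier is a hypothetical stand-in, and the numeric instances are invented.

```python
def bag_label(instance_labels):
    """A bag is positive iff at least one instance label is positive."""
    return any(instance_labels)


def predict_bag(bag, instance_classifier):
    """Predict a bag label by applying an instance-level classifier to each instance."""
    return any(instance_classifier(instance) for instance in bag)


# Hypothetical instance classifier: a fixed threshold on a single numeric feature.
def is_positive(value):
    return value > 2.0


print(bag_label([False, True, False]))
print(predict_bag([0.5, 1.2, 3.1], is_positive))
print(predict_bag([0.1, 0.4], is_positive))
```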

Potential Application
Following a microarray experiment, it might be possible to form bags of genes with appropriate labels
From different biological labs doing similar experiments, many such bags can be obtained for use as training data
Before designing a new microarray experiment, the gene set can be selected based on multiple-instance learning

Summary
Use of data mining / machine learning techniques to extract information from biological data
Work done
  Learning layouts of flat-file biological datasets
  Hypergraph mining
  Similarity measure among sets of genes
Proposed work
  Study and application of machine learning techniques
