1
Using Data Mining Techniques to Learn Layouts of Flat-File Biological Datasets
Kaushik Sinha, Xuan Zhang, Ruoming Jin, Gagan Agrawal
2
Overall Goal
Informatics tools for biological data integration, driven by:
- Data explosion: data size & number of data sources
- New analysis tools
- Autonomous resources
- Heterogeneous data representations & varied interfaces
- Frequent updates
Common situations:
- Flat-file datasets
- Ad-hoc sharing of data
3
Current Approaches
- Manually written wrappers
  - Problem: O(N²) wrappers needed, O(N) for a single update
- Mediator-based integration systems
  - Need a common intermediate format
  - Unnecessary data transformation
- Integration using web/grid services
  - Needs all tools to be web services (all data in XML?)
4
Our Approach
- Automatically generate wrappers
  - Transform data in files of arbitrary formats
  - No domain- or format-specific heuristics
  - Layout information provided by users
- Help biologists write layout descriptors using data mining techniques
5
Our Approach: Challenges
- Description language
  - Format and logical view of data in flat files
  - Easy to interpret and write
- Wrapper generation and execution
  - Correspondence between data items
  - Separating wrapper analysis and execution
- Interactive tools for writing layout descriptors
  - What data mining techniques to use?
6
Wrapper Generation System Overview
[Architecture diagram on the original slide; components include the Layout Descriptor, Schema Descriptors, Parser, Mapping Generator, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO, DataReader, Synchronizer, DataWriter, and the Source and Target Datasets.]
7
Key Open Questions
- How hard is it to write layout descriptors?
- Given a flat file, how hard is it to learn its layout?
- Can we make the process semi-automatic?
8
Learning the Layout of a Flat File
- In general: intractable
- Try to learn the layout, then have a domain expert verify it
- Key issue: what delimiters are being used?
9
Finding Delimiters
- A difficult problem: some knowledge from a domain expert is required (semi-automatic)
- Naïve approaches:
  - Frequency counting: counts frequently occurring single tokens (words separated by spaces)
  - Sequence mining: counts frequently occurring sequences of tokens
10
Frequency Counting: Problems and a Possible Solution
- Problems:
  - Some tokens that appear very frequently are not delimiters
  - A delimiter could be a sequence of tokens rather than a single token
- Possible solution: use the frequencies of a token sequence and all of its subsequences to decide on possible delimiter sequences (see the sketch below)
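To make the counting step concrete, here is a minimal sketch (not from the slides) that counts token sequences of length 1 to max_len across the lines of a flat file. Newline tokens are ignored for simplicity, and the sample lines are a hypothetical Pfam-like fragment used only for illustration.

```python
from collections import Counter

def token_ngram_frequencies(lines, max_len=3):
    """Count how often each token sequence of length 1..max_len occurs,
    where tokens are words separated by spaces."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical Pfam-like fragment, for illustration only
sample = ["#=GF AC PF00001", "#=GF DE Example family", "#=GF AC PF00002"]
freqs = token_ngram_frequencies(sample)
print(freqs[("#=GF",)], freqs[("#=GF", "AC")])   # prints: 3 2
```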
11
Sequence Mining Example
- For any sequence of tokens s, f(s) denotes the frequency of s; let A, B, C be tokens
- Case 1: f(ABC)=10, f(AB)=10, f(BC)=10, f(CA)=10
  - The information about AB, BC, CA is already embedded in ABC
  - ABC is a possible delimiter, but AB, BC, CA are not
- Case 2: f(ABC)=10, f(AB)=20, f(BC)=10, f(CA)=10
  - BC and CA occur less frequently than AB
  - ABC cannot be a delimiter; AB is a possible delimiter
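A sketch of the subsumption logic behind the two cases, under the simplified rule that a longer sequence is preferred only when each of its contiguous subsequences occurs no more often than the sequence itself; this rule is an illustration, not necessarily the authors' exact criterion.

```python
def subsumed_by(freqs, seq):
    """Return True if every contiguous subsequence of `seq` occurs only as
    part of `seq` (its count does not exceed the count of `seq` itself)."""
    target = freqs.get(tuple(seq), 0)
    for n in range(1, len(seq)):
        for i in range(len(seq) - n + 1):
            if freqs.get(tuple(seq[i:i + n]), 0) > target:
                return False        # the subsequence also occurs on its own
    return True

# Case 1: f(ABC)=f(AB)=f(BC)=10 -> ABC subsumes its pieces, ABC is the candidate
case1 = {("A","B","C"): 10, ("A","B"): 10, ("B","C"): 10, ("A",): 10, ("B",): 10, ("C",): 10}
# Case 2: f(AB)=20 > f(ABC)=10 -> AB occurs on its own, so ABC is ruled out
case2 = {("A","B","C"): 10, ("A","B"): 20, ("B","C"): 10, ("A",): 20, ("B",): 20, ("C",): 10}
print(subsumed_by(case1, ["A","B","C"]), subsumed_by(case2, ["A","B","C"]))   # True False
```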
12
Limitations of Sequence Mining
- Does not work well when token frequencies are distributed in a skewed manner
- An example where it fails (Pfam dataset): \n, #=GF, AC are tokens with
  - f(\n, #=GF) >> f(#=GF, AC) and f(\n, #=GF) >> f(\n, #=GF, AC)
  - \n #=GF is concluded to be a possible delimiter; in reality, \n #=GF AC is a delimiter
13
Can We Do Better?
- Biological datasets are written for humans to read
- It is very unlikely that delimiters are scattered at arbitrary positions within a line
- The position of a possible delimiter might therefore provide useful information
- A combination of positional and frequency information might be a better choice
14
Positional Weight
- Let P be the set of positions in a line where a token can appear
- For each position i ∈ P, tot_seq(i, j) denotes the total number of token sequences of length j starting at position i
- For each position i ∈ P, tot_unique_seq(i, j) denotes the number of unique token sequences of length j starting at position i
- For any tuple (i, j), p_ratio(i, j) is defined from these two counts (formula on the original slide)
- p_ratio(i, j) is log-normalized to obtain the positional weight p_wt(i, j), with the property p_wt(i, j) ∈ (0, 1)
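A minimal sketch of the positional statistics. The exact p_ratio formula and log normalization are not reproduced here, so the particular choices below (p_ratio as total over unique counts, and log(x + 1) / (1 + log(x + 1)) as the normalization into (0, 1)) are assumptions made only for illustration.

```python
import math
from collections import defaultdict

def positional_stats(lines, j):
    """For sequences of length j, compute per starting position i:
    tot_seq(i, j) and tot_unique_seq(i, j)."""
    tot_seq = defaultdict(int)
    unique = defaultdict(set)
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - j + 1):
            seq = tuple(tokens[i:i + j])
            tot_seq[i] += 1
            unique[i].add(seq)
    return dict(tot_seq), {i: len(s) for i, s in unique.items()}

def positional_weight(lines, j):
    """Assumed p_ratio(i, j) = tot_seq / tot_unique_seq (many occurrences but
    few distinct sequences suggests a delimiter position), log-normalized so
    that p_wt(i, j) lies in (0, 1)."""
    tot_seq, tot_unique = positional_stats(lines, j)
    p_wt = {}
    for i in tot_seq:
        ratio = tot_seq[i] / tot_unique[i]
        x = math.log(ratio + 1)
        p_wt[i] = x / (1 + x)
    return p_wt
```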
15
Delimiter Score (d_score)
- For a token sequence s of length j starting at position i, the frequency weight f_wt(s) is obtained by log-normalizing the frequency f(s); thus f_wt(s) ∈ (0, 1)
- The positional and frequency weights are combined to obtain the delimiter score:
  d_score(s) = α · p_wt(i, j) + (1 − α) · f_wt(s), where α ∈ (0, 1)
- d_score therefore has two properties:
  - d_score(s) ∈ (0, 1)
  - For two sequences s and s' of the same length j starting at positions i and k, d_score(s) > d_score(s') implies s is more likely to be a delimiter than s'
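A sketch of the combination step, reusing positional_weight from the previous sketch. The α-weighted sum follows the formula above; the specific log normalization used for f_wt is again an assumption.

```python
import math
from collections import Counter

def frequency_weight(freq):
    """Assumed log normalization of a raw frequency into (0, 1)."""
    x = math.log(freq + 1)
    return x / (1 + x)

def d_scores(lines, j, p_wt, alpha=0.5):
    """d_score = alpha * p_wt(i, j) + (1 - alpha) * f_wt(s) for every token
    sequence s of length j starting at position i, keyed by (i, s)."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - j + 1):
            counts[(i, tuple(tokens[i:i + j]))] += 1
    return {(i, s): alpha * p_wt.get(i, 0.0) + (1 - alpha) * frequency_weight(c)
            for (i, s), c in counts.items()}
```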
16
Finding Delimiters Using d_score
- Since the delimiter sequence length is not known in advance, an iterative algorithm is used to build a superset S of potential delimiters (the union of the candidate sets found at each iteration)
- At iteration i, a cut-off value c_i is determined by observing a substantial difference in the sorted d_score values
- All token sequences scoring above c_i form the candidate set N_i
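A sketch of the iterative selection, reusing positional_weight and d_scores from the sketches above. Interpreting the "substantial difference" as the largest drop between consecutive sorted scores is an assumption; the slides do not specify how c_i is computed.

```python
def candidates_above_cutoff(scores):
    """Sort d_scores in decreasing order, place the cut-off c_i where the
    largest drop between consecutive scores occurs, and return everything
    above it (the set N_i)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return {key for key, _ in ranked}
    drops = [ranked[k][1] - ranked[k + 1][1] for k in range(len(ranked) - 1)]
    cut = drops.index(max(drops))
    return {key for key, _ in ranked[:cut + 1]}

def potential_delimiters(lines, max_len=3, alpha=0.5):
    """Iterate over candidate lengths j (the true length is unknown) and take
    the union of the per-length candidate sets as the superset S.
    S contains (position, token-sequence) candidates."""
    S = set()
    for j in range(1, max_len + 1):
        p_wt = positional_weight(lines, j)       # from the earlier sketch
        S |= candidates_above_cutoff(d_scores(lines, j, p_wt, alpha))
    return S
```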
17
Generating the Layout Descriptor
- Once the delimiters are identified, an NFA can be built by scanning the whole dataset, with the delimiters as the states of the NFA
- This NFA can be used to generate a layout descriptor, since it naturally represents optional and repeating states
- The figure on the original slide shows an NFA where A, B, C, D, and E are delimiters, with B an optional delimiter and C D a repeating pair of delimiters (a sketch of the construction follows)
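A simplified reconstruction (not the authors' exact construction) of how such an NFA could be built from the per-record order of delimiters: delimiters absent from some records are flagged optional, and a backward edge, i.e. a transition back to a delimiter seen earlier in the record, signals a repeating block such as C D.

```python
from collections import defaultdict

def build_delimiter_nfa(records):
    """records: one ordered list of observed delimiters per data record.
    Returns (edges, optional, backward_edges)."""
    edges = defaultdict(set)        # delimiter -> delimiters that can follow it
    seen_in = defaultdict(int)      # delimiter -> number of records containing it
    backward = set()                # edges that point back to an earlier state
    for rec in records:
        for d in set(rec):
            seen_in[d] += 1
        first = {}
        for idx, d in enumerate(rec):
            first.setdefault(d, idx)    # order of first occurrence in this record
        for prev, cur in zip(rec, rec[1:]):
            edges[prev].add(cur)
            if first[cur] < first[prev]:   # edge points back to an earlier state
                backward.add((prev, cur))
    optional = {d for d, n in seen_in.items() if n < len(records)}
    return dict(edges), optional, backward

# Toy records matching the slide: B is optional, the C D block repeats
recs = [["A", "B", "C", "D", "C", "D", "E"], ["A", "C", "D", "E"]]
print(build_delimiter_nfa(recs))   # optional = {'B'}, backward edge ('D', 'C')
```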
18
Realistic Situation
- Identifying the complete list of correct delimiters is difficult; most likely we end up with an incomplete list
- The delimiters that do not appear in every data record (optional delimiters) are the ones most likely to be missed
19
Identifying Optional Delimiters
- Given an incomplete list of delimiters, how can we identify the optional delimiters, if any?
  - Build an NFA based on the given incomplete information
  - Perform clustering to identify possible crucial delimiters
  - Perform contrast analysis
20
Crucial Delimiters
- A delimiter is considered crucial if a missing delimiter could appear immediately after it
- The goal is to create two clusters: one containing the non-crucial delimiters, the other containing the crucial delimiters
21
Identifying Crucial Delimiters: A Few Definitions
- Succ(X): the set of delimiters that can immediately follow X
- Dist_App: the number of groups of occurrences of X, grouped by the number of text lines between X and the immediately following delimiter
- Info_Tuple (n_Xi, f_Xi, t_Xi): the information recorded for each such group (the number of lines to the next delimiter, the number of occurrences, and the collected text)
- Info_Tuple_List L_X: for any X, the list of all its Info_Tuples
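A sketch of how L_X could be built from the raw file, under two simplifying assumptions: a line is matched to a delimiter when it starts with that delimiter, and occurrences of X are grouped by the exact number of lines separating X from the next known delimiter.

```python
from collections import defaultdict

def info_tuple_list(lines, delimiters, X):
    """Return L_X as a list of Info_Tuples (n, f, text): n = number of lines
    between an occurrence of X and the next known delimiter, f = number of
    occurrences of X with that gap, text = the intervening lines collected
    from all of those occurrences."""
    def delim_of(line):
        for d in delimiters:
            if line.startswith(d):
                return d
        return None

    groups = defaultdict(lambda: [0, []])   # n -> [f, collected text]
    i = 0
    while i < len(lines):
        if delim_of(lines[i]) == X:
            j = i + 1
            while j < len(lines) and delim_of(lines[j]) is None:
                j += 1
            n = j - i - 1                    # lines between X and next delimiter
            groups[n][0] += 1
            groups[n][1].extend(lines[i + 1:j])
            i = j
        else:
            i += 1
    return [(n, f, text) for n, (f, text) in sorted(groups.items())]
```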
22
Metric for Clustering
- The clustering ratio r_X^f (defined on the original slide) is likely to be low if an optional delimiter appears immediately after X, and high otherwise
- Choose a suitable cut-off value r_c and assign delimiters to groups as follows:
  - If r_X^f < r_c, assign X to the group of possible crucial delimiters
  - Otherwise, assign X to the group of non-crucial delimiters
23
Observations and Facts
- Missing optional delimiters can appear immediately after crucial delimiters ONLY, so the non-crucial delimiters can be pruned away
- Consider two Info_Tuples (n_X1, f_X1, t_X1) and (n_X2, f_X2, t_X2) in L_X
- If a missing delimiter appears immediately after the occurrences corresponding to the first tuple but not after those of the second:
  - n_X1 > n_X2
  - The missing delimiter will appear in t_X1 but not in t_X2
24
A Hypothetical Example Illustrating Contrast Analysis
- Suppose X is a crucial delimiter with two Info_Tuples: L1 = (50, 20, l1.txt) and L2 = (20, 12, l2.txt)
- Sequence mining on l1.txt and l2.txt yields two sets of frequently occurring sequences:
  - S1 = { f1, f5, f6, f8, f13, f21 }
  - S2 = { f1, f4, f6, f7, f8, f10, f13, f21 }
- Since f5 ∈ S1 but f5 ∉ S2, f5 is a possible missing delimiter
- f5 is accepted as a missing delimiter only if it has a high d_score or is verified by a domain expert as a valid delimiter
25
Contrast Analysis
- For any i, j with n_Xi > n_Xj, find the frequently occurring sequences in t_Xi and t_Xj; call these sets fs_Xi and fs_Xj respectively
- If there exists a frequent sequence fs such that fs ∈ fs_Xi but fs ∉ fs_Xj, then fs is quite likely to be a possible delimiter
- If fs has a fairly high d_score, or is identified by a domain expert as a valid delimiter, add it to the incomplete list as a newly found delimiter
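A sketch of the pairwise contrast step: mine frequent token sequences in the two collected texts and keep those frequent in t_Xi but not in t_Xj as candidate missing delimiters, to be confirmed via d_score or by a domain expert. The minimum-support threshold and the maximum sequence length are assumed values.

```python
from collections import Counter

def frequent_sequences(text_lines, min_support=3, max_len=3):
    """Frequently occurring token sequences (up to max_len tokens) in a body of text."""
    counts = Counter()
    for line in text_lines:
        tokens = line.split()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {seq for seq, c in counts.items() if c >= min_support}

def contrast_analysis(text_i, text_j, min_support=3):
    """Sequences frequent in text_i (the larger-gap group) but not in text_j."""
    return frequent_sequences(text_i, min_support) - frequent_sequences(text_j, min_support)
```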
26
Generalized Contrast Analysis
- When there are more than two Info_Tuples, compute the mean of all the n_Xi values
- Form one group by appending the text from all Info_Tuples whose n_Xi is above the mean
- Form another group by appending the text from all Info_Tuples whose n_Xi is at or below the mean
- Perform contrast analysis between such groups (see the sketch below)
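A sketch of the generalized version, reusing contrast_analysis from the previous sketch: the Info_Tuples are split around the mean of their n values, the text on each side is appended into one group, and the two groups are contrasted.

```python
def generalized_contrast(info_tuples, min_support=3):
    """info_tuples: list of (n, f, text_lines), e.g. from info_tuple_list().
    Splits the tuples around the mean of the n values and contrasts the
    above-mean text against the at-or-below-mean text."""
    mean_n = sum(n for n, _, _ in info_tuples) / len(info_tuples)
    above = [line for n, _, text in info_tuples if n > mean_n for line in text]
    below = [line for n, _, text in info_tuples if n <= mean_n for line in text]
    return contrast_analysis(above, below, min_support)

# Slide example: n values 50, 20, 15 give a mean of about 28.3, so the text for
# n = 50 (l1.txt) is contrasted against the appended text for n = 20 and n = 15.
```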
27
Another Example Illustrating Generalized Contrast Analysis
- Suppose X is a crucial delimiter with three Info_Tuples: L1 = (50, 20, l1.txt), L2 = (20, 12, l2.txt), L3 = (15, 10, l3.txt)
- Mean number of lines = (50 + 20 + 15) / 3 ≈ 28.3
- Since 20 and 15 fall below the mean, append l2.txt and l3.txt; call the result t2.txt
- Sequence mining on l1.txt and t2.txt yields two sets of frequently occurring sequences:
  - S1 = { f1, f5, f6, f8, f13, f21 }
  - S2 = { f1, f4, f6, f7, f8, f10, f13, f21 }
- Since f5 ∈ S1 but f5 ∉ S2, f5 is a possible missing delimiter
- f5 is accepted as a missing delimiter only if it has a high d_score or is verified by a domain expert as a valid delimiter
28
Overall Algorithms (shown on the original slide)
29
Results: Optional Delimiters
- % Pruning = (definition and results shown on the original slide)
30
Results: Non-Optional Missing Delimiters
- Although designed to find optional delimiters, our algorithm also works, in some cases, for missing non-optional delimiters
- If a missing non-optional delimiter appears in exactly the same location in every record, our algorithm fails
- If the missing non-optional delimiter has a backward edge coming into it from a delimiter that appears later in a topologically sorted NFA, our algorithm works
31
Summary
- A semi-automatic tool for learning the layout of a flat-file dataset, with a mechanism for identifying missing optional delimiters
- An automatic tool for wrapper generation once the layout descriptor is known
- Can ease the integration of new and updated sources
32
Questions?