Let us build a platform for structure extraction and matching that...
Sunita Sarawagi, IIT Bombay
Knows when it failed
- Attaches error-detection logic to every extraction module
- Two types of errors
  - Precision errors: easier to detect
    - Reference databases
    - Alternative models
    - Human feedback
  - Recall errors: much harder; a research challenge
- Represents errors and exposes them to users
  - Imprecise data models for the results of extraction and deduplication: another research challenge
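A minimal sketch of the "error detection attached to every extraction module" idea, assuming a toy year extractor and a reference database of plausible years. All names here (`Extraction`, `extract_years`, `check_against_reference`, `with_error_detection`) are hypothetical and only illustrate the pattern. Note that it flags precision errors only; missed extractions (recall errors) stay invisible to this kind of check.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class Extraction:
    field_name: str
    value: str
    suspect: bool = False                      # set when a precision check fails
    reasons: List[str] = field(default_factory=list)


def extract_years(text: str) -> List[Extraction]:
    """Toy extractor (assumption): every 4-digit token is a publication year."""
    return [Extraction("year", m) for m in re.findall(r"\b\d{4}\b", text)]


def check_against_reference(known: Set[str]) -> Callable[[Extraction], Optional[str]]:
    """Build a precision check backed by a reference database of valid values."""
    def check(e: Extraction) -> Optional[str]:
        if e.value not in known:
            return f"{e.value!r} not in reference database"
        return None
    return check


def with_error_detection(extractor, checks):
    """Wrap an extractor so every result is passed through every check."""
    def run(text: str) -> List[Extraction]:
        results = extractor(text)
        for e in results:
            for check in checks:
                reason = check(e)
                if reason:
                    e.suspect = True
                    e.reasons.append(reason)
        return results
    return run


# Illustrative reference data: years considered plausible for this corpus.
reference_years = {str(y) for y in range(1990, 2011)}
extract = with_error_detection(extract_years,
                               [check_against_reference(reference_years)])
results = extract("Published in 2004, pages 1021 to 1030.")
for e in results:
    print(e.value, "SUSPECT" if e.suspect else "ok", e.reasons)
```

Keeping the `suspect` flag and `reasons` on the result, instead of silently dropping flagged values, is one way to expose the errors to users, as the slide asks.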
Seamlessly integrates rules, humans and statistics
- Existing systems are partitioned along two axes
  - Rule-based vs. statistical
  - Manual vs. learning-based
- Smooth coexistence of all combinations is a must, given the varying difficulty of tasks and sophistication of users
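One way the coexistence of rules, statistics and humans could look in code is shown below. This is a sketch under assumptions, not a description of any existing system: `rule_labeler` stands for a hand-written rule, `statistical_labeler` is a stand-in for a learned model with made-up confidence scores, and `human_queue` is a hypothetical review queue.

```python
import re
from typing import List, Optional, Tuple


def rule_labeler(token: str) -> Optional[str]:
    """Hand-written rule (assumption): a 6-digit token is a postal PIN code."""
    if re.fullmatch(r"\d{6}", token):
        return "pincode"
    return None


def statistical_labeler(token: str) -> Tuple[str, float]:
    """Stand-in for a learned model; returns (label, confidence).

    The score is just the fraction of digits in the token, chosen only
    to make the example self-contained.
    """
    frac = sum(c.isdigit() for c in token) / len(token) if token else 0.0
    return ("number", frac) if frac >= 0.5 else ("word", 1.0 - frac)


def label(token: str, human_queue: List[str],
          threshold: float = 0.8) -> Tuple[str, str]:
    """Try the rule first, then the model, and defer to a human if unsure."""
    by_rule = rule_labeler(token)
    if by_rule is not None:
        return by_rule, "rule"
    guess, confidence = statistical_labeler(token)
    if confidence >= threshold:
        return guess, "statistics"
    human_queue.append(token)            # low confidence: ask a person
    return guess, "human-pending"


queue: List[str] = []
for tok in ["400076", "Mumbai", "B12"]:
    print(tok, label(tok, queue))
print("needs review:", queue)
```

The ordering (rule, then model, then human) is only one of the possible combinations; a different task or user might put the human first or skip the rules entirely.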
Treats models as first-class objects
- Tens of thousands of schema elements
- Cannot afford a separate extraction and matching model for each
- How to share models across different levels of hierarchies, natural languages, formatting languages, and versions over time?
- How quickly can we interactively adapt to new domains, starting from existing libraries of models?
Is selectively lazy
- Cannot run away from the hard tasks
- The only way to attack the long tail of missed extractions is via expensive resources
- Explicitly represent increasing levels of cost and payoff, and do cost-sensitive processing
  - Selective linguistic processing: POS tagging, chunking, dependency parsing, full parsing
  - Database lookups: no lookups, Boolean matches, TF-IDF matches, edit distance, Web searches
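The cost-sensitive processing described above can be sketched as a cascade that tries cheap lookups first and escalates only while a budget allows. The sketch covers just two of the listed stages (Boolean match and edit distance); the cost units, the `STAGES` table and the `lazy_lookup` function are illustrative assumptions.

```python
from typing import List, Optional, Tuple


def exact_match(query: str, db: List[str]) -> Optional[str]:
    """Cheapest stage: Boolean (exact) membership test."""
    return query if query in db else None


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def fuzzy_match(query: str, db: List[str], max_dist: int = 2) -> Optional[str]:
    """More expensive stage: nearest database entry within max_dist edits."""
    if not db:
        return None
    best = min(db, key=lambda s: edit_distance(query, s))
    return best if edit_distance(query, best) <= max_dist else None


# (name, cost, matcher), ordered from cheap to expensive.
# Costs are made-up units, not measurements.
STAGES = [("boolean", 1, exact_match), ("edit-distance", 10, fuzzy_match)]


def lazy_lookup(query: str, db: List[str],
                budget: int) -> Tuple[Optional[str], str, int]:
    """Run stages in order until one hits or the next would exceed the budget."""
    spent = 0
    for name, cost, matcher in STAGES:
        if spent + cost > budget:
            break
        spent += cost
        hit = matcher(query, db)
        if hit is not None:
            return hit, name, spent
    return None, "unresolved", spent


db = ["Sunita Sarawagi", "IIT Bombay"]
print(lazy_lookup("IIT Bombay", db, budget=20))
print(lazy_lookup("IIT Bombey", db, budget=20))
print(lazy_lookup("IIT Bombey", db, budget=5))
```

Further stages such as TF-IDF matching or Web searches would slot into `STAGES` with higher costs; the payoff side of the trade-off is not modeled here.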
Supports multi-spectrum queries
"Knowledge [Schema] should be like a pocket watch, surfaced only when needed; not like a wrist watch, always flaunted." - A Bengali saying
- Fully schema-aware: SQL, XML, …
- Schema-less: keyword queries
- Common-sense schema-aware
  - User understands Is-a, Part-of, Properties
  - Use world knowledge (ontologies, word-nets, etc.) to map both schema and content elements in the query
  - Can use limited rounds of user interaction