From Tables To Frames
Aleksander Pivk (1,2), Philipp Cimiano (2), York Sure (2)
(1) Jozef Stefan Institute, Ljubljana, Slovenia
(2) AIFB Institute, University of Karlsruhe, Karlsruhe
The Third International Semantic Web Conference - ISWC 2004, November 07 – 11, 2004, Hiroshima, Japan
From Tables To Frames - ISWC 2004, Hiroshima, Japan
Outline
  Motivation
  Foundation: Table Model
  Methodology
  Evaluation
  Conclusion
  Future Work
Motivation
  problem: the well-known annotation bottleneck
  solution: automatic metadata generation
  goal: describe the semantics of tables in a model-theoretic way (F-Logic)
    tables with different structure but the same meaning should have the same representation
  benefit: enables e.g. query answering
    all conferences where ‘prof. Studer’ is in the PC
    all tours to COUNTRY at DATE where price<AMOUNT
Foundation: Table Model
  dimensions of the table model [Hurst’00]
    graphical (image processing)
    physical (inter-cell relative location)
    structural (organization of cells indicating their navigational relationship)
    functional (purpose of regions in terms of data access)
      two functional cell types: A-cell and I-cell
      two functional I-cell roles: data and access
    semantic (relation between cell content, structure and orientation)
  a frame makes explicit
    the meaning of the cell contents (F-Logic concepts)
    the functional dimension of the table (method signature)
    the semantic dimension of the table (frame structure)
  example: [figure]
Table model
  [Figure: example table; legend: A-cell, I-cell (access), I-cell (data)]
Simple Table Classes
  1-Dimensional
  2-Dimensional
Complex Table Classes
  1. Over-expanded labels
  2. Partition labels
  3. Combination (running example)
Methodology
  the methodology instantiates the table model stepwise
  main differences:
    we do not consider the graphical component
    we extend the semantic component
Cleaning & Normalization
  construct an initial matrix structure
    DOM tree
    cleaning: fix syntactic errors (CyberNeko HTML parser)
    normalization: aligning the table, resolving cells that span multiple rows/columns (colspan, rowspan)
  example: [figure]
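The normalization step can be illustrated with a minimal sketch. It assumes each parsed cell is available as a hypothetical (text, rowspan, colspan) tuple and that the content of a spanning cell is copied into every position it covers; the slides do not state how spanned positions are filled, so that choice is an assumption.

```python
def normalize_table(rows):
    """Expand rowspan/colspan cells into a rectangular matrix.

    `rows` is a list of rows; each cell is a (text, rowspan, colspan) tuple.
    Assumption: every position covered by a spanning cell receives a copy
    of that cell's text.
    """
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already taken by a rowspan from a row above.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Illustrative input, not taken from the slides.
example = [
    [("Tour", 2, 1), ("Price", 1, 2)],
    [("Adult", 1, 1), ("Child", 1, 1)],
]
matrix = normalize_table(example)
```

After this step every row has the same number of columns, which is what the later similarity comparisons between rows and columns rely on.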
Structure Detection
  detecting table orientation: relies on similarity of cells (size, content, token types)
    intuition:
      if rows are similar, the orientation is vertical (top-to-bottom)
      if columns are similar, the orientation is horizontal (left-to-right)
  initialize logical units and regions
    split the table into logical units (LUs)
    group same-sized, similar cells into regions within LUs
Structure Detection
  heuristics for assigning initial functional types and probabilities to cells:
    I-cell: the content of the cell consists mostly of tokens recognized as dates, numbers, and currencies
    the lower-right cell is always an I-cell (p=1)
    the upper-left cell is always an A-cell (p=1)
Table Orientation
  token type hierarchy [figure]
  the hierarchical ordering permits measuring the distance between different types (i.e. in number of edges)
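Measuring the distance between two token types in number of edges might look as follows. The hierarchy in the figure is not preserved in this text, so the `PARENT` map below is a hypothetical stand-in built from type names that appear elsewhere in the slides; only the edge-counting idea is taken from the source.

```python
# Hypothetical hierarchy (child -> parent); NOT the one shown on the slide.
PARENT = {
    "ALPHANUMERIC": "TOKEN",
    "ALPHA": "ALPHANUMERIC",
    "NUMBER": "ALPHANUMERIC",
    "FIRST_UPPER": "ALPHA",
    "ALL_UPPER": "ALPHA",
    "SMALL_NUMBER": "NUMBER",
    "LARGE_NUMBER": "NUMBER",
}


def path_to_root(token_type):
    """Return the list of types from `token_type` up to the root."""
    path = [token_type]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def type_distance(a, b):
    """Number of edges between two types via their lowest common ancestor."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    depth_in_b = {t: i for i, t in enumerate(path_b)}
    for i, t in enumerate(path_a):
        if t in depth_in_b:
            return i + depth_in_b[t]
    raise ValueError("types do not share a common ancestor")
```

With this stand-in hierarchy, sibling types such as `FIRST_UPPER` and `ALL_UPPER` are two edges apart, while a textual and a numeric type are further apart.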
Table Orientation
  difference between two cells: [formula]
  difference between rows/columns: [formula]
  orientation decision: if [condition], then horizontal (left-to-right), else vertical (top-to-bottom)
  example: orientation set to vertical
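The orientation decision can be sketched under stated assumptions. The actual difference formulas are not preserved in this text, so `cell_difference` below is a hypothetical stand-in (type mismatch plus relative length difference); only the intuition "similar rows mean vertical, similar columns mean horizontal" comes from the slides.

```python
def cell_difference(a, b):
    """Hypothetical cell difference: 1 for a numeric/textual mismatch,
    plus the length difference relative to the longer cell."""
    def is_numeric(s):
        return s.replace(",", "").replace(".", "").isdigit()

    type_diff = 0.0 if is_numeric(a) == is_numeric(b) else 1.0
    longest = max(len(a), len(b), 1)
    return type_diff + abs(len(a) - len(b)) / longest


def mean_adjacent_difference(lines):
    """Average cell difference between each pair of neighboring lines."""
    diffs = [
        cell_difference(x, y)
        for first, second in zip(lines, lines[1:])
        for x, y in zip(first, second)
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0


def detect_orientation(matrix):
    """Return 'vertical' if neighboring rows are more alike than columns."""
    rows = matrix
    cols = [list(col) for col in zip(*matrix)]
    d_rows = mean_adjacent_difference(rows)
    d_cols = mean_adjacent_difference(cols)
    return "vertical" if d_rows <= d_cols else "horizontal"


# Illustrative table, not taken from the slides.
table = [
    ["Room", "Price"],
    ["Single", "35,450"],
    ["Double", "32,500"],
]
transposed = [list(col) for col in zip(*table)]
```

Here the rows of `table` resemble each other more than its columns do, so the sketch reports a vertical orientation, and the transposed table comes out horizontal.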
Discovery of Regions
  algorithm (7 steps):
    determine the table class: 1D, 2D, or complex (partition labels, over-expanded labels, combination)
    reformulate the table
Discovery of Regions
  initialize logical units and regions
    splits:
      every row with a cell spanning multiple columns (vertical orientation)
      every column with a cell spanning multiple rows (horizontal orientation)
    regions: group same-sized, similar cells within one logical unit
    update functional types and probabilities
  learn string patterns of regions
    learn significant forward and backward patterns
    a pattern is a sequence of token types and tokens that describes the content of a significant number of cells
      e.g. the pattern ‘FIRST_UPPER Room’ covers ‘Double Room’ and ‘Single Room’
    implementation of the DATAPROG algorithm [Lerman et al., 2003]
  example: [figure]
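How a pattern covers a cell can be shown with a small sketch. This is not the DATAPROG algorithm itself (which learns the patterns); it only checks whether a given forward pattern matches the start of a cell, and the `token_types` classifier is a simplified assumption.

```python
def token_types(token):
    """Return the set of (assumed) token types that `token` belongs to."""
    types = {"TOKEN"}
    if token.isdigit():
        types.add("NUMBER")
    if token.isalpha():
        types.add("ALPHA")
        if token[0].isupper() and token[1:].islower():
            types.add("FIRST_UPPER")
        if token.isupper():
            types.add("ALL_UPPER")
    return types


def covers(pattern, cell):
    """True if the forward pattern matches the leading tokens of the cell.

    Each pattern element is either a literal token or a token type name.
    """
    elements, tokens = pattern.split(), cell.split()
    if len(tokens) < len(elements):
        return False
    return all(
        element == token or element in token_types(token)
        for element, token in zip(elements, tokens)
    )
```

A backward pattern could be checked the same way by reversing both token lists before matching.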
Discovery of Regions
  do while (distribution in LU not uniform)
    (uniformity: a logical unit consists of logical sub-units, where each sub-unit includes only regions of the same size and orientation)
    choose the best coherent region
      used to propagate and normalize the neighboring regions
      choose the region that maximizes: [formula]
    normalize the logical sub-unit
      choose neighboring regions (i.e. only within the same rows for vertical orientation)
      two options:
        neighboring regions within one column DO NOT extend over the boundaries of the best region
        neighboring regions within one column DO extend over the boundaries of the best region
      update string patterns for updated regions
  example: [figure]
Building FTM
  functional table model (FTM)
    regions as nodes arranged in a tree
    leaf nodes:
      are only regions consisting exclusively of I-cells
      are assigned their functional role (access, data)
      are assigned two semantic labels:
        a label describing the content of the region (instances)
        a label combining the region label with the labels of the parent A-cell nodes
    inner nodes are either regions consisting of A-cells or ‘connection’ nodes (e.g. root)
  construction of FTM
    bottom-up approach (from the lowest logical unit upwards)
    description through an example
Building FTM
  type of the (colored) logical unit = I-cells only
    regions are turned into leaves
    semantic labels and roles are set to a default value
  [Figure: leaves of the running example: an Adult/Child region, a room region (Single Room, Double Room, Extra Bed, Occupation, No Occupat…, Extra Bed), and two regions of price values]
Building FTM
  type of the (colored) logical unit = A-cells only
    regions are turned into inner nodes and connected to the appropriate sub-nodes (leaves)
  [Figure: running example tree with inner nodes ‘Class/Price’, ‘Economic’ and ‘Extended’ above the leaves]
Building FTM
  type of the (colored) logical unit = special case
    close a sub-tree by inserting a ‘connection’ node, which reflects a logical separation in the table (transition from a LU with only A-cells to a LU with I-cells)
    assign functional roles to leaves within a connected sub-tree:
      the functional role access is assigned to all consecutive leaves (from the left) that together form a unique identifier (key)
      all other leaves are assigned the functional role data
    (possible) change of reading orientation in the new logical unit
  [Figure: sub-tree closed by a Connection Node; the Adult/Child and room leaves are marked access, the price leaves are marked data; the next logical unit contains DP9LAX01AB]
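The role assignment rule (the leftmost consecutive leaves that together form a key get the role access, the rest get data) can be sketched as below. Leaves are modeled as plain lists of cell values, which is an assumption; the price values are hypothetical placeholders because the originals are garbled in this text.

```python
def assign_roles(leaves):
    """Assign 'access' to the shortest left prefix of leaves whose combined
    values are unique per row, and 'data' to the remaining leaves.

    Assumption: if no prefix is unique, every leaf ends up as 'access'.
    """
    n_rows = len(leaves[0])
    key_size = len(leaves)
    for k in range(1, len(leaves) + 1):
        keys = {tuple(leaf[r] for leaf in leaves[:k]) for r in range(n_rows)}
        if len(keys) == n_rows:
            key_size = k
            break
    return ["access"] * key_size + ["data"] * (len(leaves) - key_size)


person = ["Adult", "Adult", "Adult", "Child", "Child", "Child"]
room = ["Single Room", "Double Room", "Extra Bed",
        "Occupation", "No Occupat…", "Extra Bed"]
# Placeholder values; the real prices are not recoverable from the slides.
price_a = ["a1", "a2", "a3", "a4", "a5", "a6"]
price_b = ["b1", "b2", "b3", "b4", "b5", "b6"]

roles = assign_roles([person, room, price_a, price_b])
```

Neither the person leaf nor the room leaf is unique on its own ("Extra Bed" occurs twice), but together they identify each row, which matches the two access leaves shown on the slide.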
Building FTM
  type of the (colored) logical unit = A-cells only
    regions are turned into inner nodes and connected to the appropriate sub-nodes (leaves)
  finally, connect all unconnected nodes to a root node
  [Figure: complete tree: Root above the A-cell nodes ‘Tour Code’ (data leaf DP9LAX01AB) and ‘Valid’ (data leaf), and above the Connection Node sub-tree with ‘Class/Price’, ‘Economic’, ‘Extended’ and its access and data leaves]
Building FTM
  recapitulation of FTM:
    consider multiple-level sub-trees for merging
    conditions: same tree structure and at least one level of matching A-cells
    merging step:
      merge nodes at the same position and level (leaf and inner nodes)
      if the merged inner nodes (A-cells) are not equal:
        find a semantic label for the new merged node
        create a new leaf node (with the A-cells as values)
        assign the functional role access to the new leaf
  example: [figure]
Building FTM
  [Figure: the Connection Node sub-tree before and after merging; before: inner nodes ‘Class/Price’, ‘Economic’, ‘Extended’ above the access and data leaves; after: inner nodes ‘Class’ and ‘Price’, plus a new access leaf with the values Economic, Extended]
Semantic Enriching of FTM
  find semantic labels for regions by consulting:
    WordNet lexical ontology: use synsets to find hypernyms
    GoogleSets service: an additional way to find synonyms
  transformations of a region’s cell labels:
    punctuation removal
    stopword removal
    compute IDF (a document is a cell) for each word, and filter out the words with a value lower than the threshold
    select words that appear at the end of the labels (the nominal head of a nominal compound is at the end)
    query GoogleSets with the remaining words to filter out the ones that are not mutually similar
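The label transformations can be sketched as a small pipeline. The stopword list and threshold value are hypothetical, the steps are applied in the order listed on the slide (an assumption about how they compose), and the final GoogleSets and WordNet lookups are left out because they depend on external services.

```python
import math
import string
from collections import Counter

STOPWORDS = {"a", "an", "the", "of", "for", "and"}  # tiny illustrative list


def head_word_candidates(labels, idf_threshold):
    """Clean cell labels, drop low-IDF words, and return the last remaining
    word of each label (the assumed nominal head), sorted and deduplicated."""
    strip_punct = str.maketrans("", "", string.punctuation)
    docs = []
    for label in labels:
        words = [w.lower() for w in label.translate(strip_punct).split()]
        docs.append([w for w in words if w not in STOPWORDS])

    # IDF with each cell treated as one document.
    doc_freq = Counter(w for words in docs for w in set(words))
    idf = {w: math.log(len(docs) / n) for w, n in doc_freq.items()}

    kept = [[w for w in words if idf[w] >= idf_threshold] for words in docs]
    return sorted({words[-1] for words in kept if words})


labels = ["Single Room", "Double Room", "Extra Bed"]
candidates = head_word_candidates(labels, idf_threshold=0.3)
```

With a low threshold the frequent head word "room" survives; raising the threshold removes it, so the threshold choice directly controls which candidates reach the later similarity check.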
Semantic Enriching of FTM
  assign each leaf the semantic label that describes the content (instances) of its region
  [Figure: FTM of the running example with content labels on the leaves: Person (Adult, Child), Room (Single Room, Double Room, Extra Bed, …), Type (Economic, Extended), Date]
Final FTM
  (final) semantic labels of leaves: each label combines the region label with the labels of the parent A-cell nodes
  [Figure: final FTM with the leaf labels PersonClass, RoomClass, TypePrice (access) and Price, Code, DateValid (data)]
Map FTM to a Frame
  a method is a tuple [formula]
  a frame is a pair [formula]
  generation of a frame:
    create a method m for every leaf node whose functional role is data
    the parameters of m are all leaf nodes with the functional role access that are located on the same level of m’s sub-tree or on m’s parent path towards the root node
    set the range of m according to the syntactic token type of its region
    names for parameters and methods are obtained from the final FTM
  example:
    Tour [ Code => ALPHANUMERIC;
           DateValid => DATE;
           Price (PersonClass, RoomClass, TypePrice) => LARGE_NUMBER ].
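The frame generation rules can be sketched over a small tree. The `Node` class and the exact tree shape are assumptions (the shape is one reading of the final FTM figure); the sketch collects access leaves on the way down from the root and emits one method per data leaf.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    role: str = ""        # "access", "data", or "" for inner nodes
    token_type: str = ""  # range of the method, for data leaves
    children: list = field(default_factory=list)


def build_frame(name, root):
    """Render an F-Logic-style frame string from a (hypothetical) FTM tree."""
    methods = []

    def visit(node, inherited):
        # Access leaves directly below this node apply to its whole sub-tree.
        local = [c.label for c in node.children if c.role == "access"]
        scope = inherited + local
        for child in node.children:
            if child.role == "data":
                params = f" ({', '.join(scope)})" if scope else ""
                methods.append(f"{child.label}{params} => {child.token_type}")
            elif child.children:
                visit(child, scope)

    visit(root, [])
    return f"{name} [ " + "; ".join(methods) + " ]."


# Assumed tree shape for the running example.
root = Node("Root", children=[
    Node("Code", "data", "ALPHANUMERIC"),
    Node("DateValid", "data", "DATE"),
    Node("Connection", children=[
        Node("PersonClass", "access"),
        Node("RoomClass", "access"),
        Node("ClassPrice", children=[
            Node("TypePrice", "access"),
            Node("Price", "data", "LARGE_NUMBER"),
        ]),
    ]),
])
frame = build_frame("Tour", root)
```

For this tree the sketch reproduces the frame from the example: `Code` and `DateValid` have no access leaves in scope and become parameterless methods, while `Price` picks up all three access leaves on its path.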
Evaluation
  task:
    for each table, compare the automatically generated frame against two manually created frames
    measure in terms of Precision, Recall and F-measure
  dataset:
    consists of 21 tables: 3 tables for each simple table class (1D, 2D) and 5 tables for each complex table class
    tourism domain
  annotators:
    14 subjects
    each subject had to annotate 3 tables, each belonging to a different table class (14x3 = 21x2 = 42 annotations)
Evaluation
  performed along the following 4 functions:
    example: [m1 (X, Y) => INTEGER] vs. [method1 (X, YY, W) => INTEGER]
    syntactic correctness: how well the functional dimension of the table is captured (SynC=2/3)
    strict comparison: calculate how identical the name_M, range_M, and P_M identifiers of methods are (P=2/4, R=2/5)
    soft comparison:
      for soft matching we used a combination of TFIDF and the Jaro-Winkler string distance scheme [Cohen et al., 2003]
      calculate soft matching for identifiers of methods (P=3/4, R=3/5, where ‘Y’≈‘YY’)
    conceptual comparison:
      conceptually equivalent identifiers have been determined (e.g. ‘RegionType’=‘Region’=‘Location’)
      calculate conceptual matching for identifiers of methods (P=4/4, R=4/5, where ‘m1’≈‘method1’)
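The precision and recall figures of the example can be reproduced with a short sketch. The soft and conceptual matchers are replaced here by explicit equivalence pairs, which is a simplification: the source uses TFIDF with Jaro-Winkler for soft matching and manually determined equivalences for conceptual matching. All function names are hypothetical.

```python
from fractions import Fraction


def identifiers(name, params, range_type):
    """Flatten a method into its list of identifiers."""
    return [name, *params, range_type]


def precision_recall(generated, manual, same=lambda a, b: a == b):
    """Count generated identifiers that match some manual identifier."""
    matched = sum(1 for g in generated if any(same(g, m) for m in manual))
    return Fraction(matched, len(generated)), Fraction(matched, len(manual))


def equivalent_under(pairs):
    """Build a matcher that also accepts the given identifier pairs."""
    accepted = {frozenset(p) for p in pairs}
    return lambda a, b: a == b or frozenset((a, b)) in accepted


generated = identifiers("m1", ["X", "Y"], "INTEGER")
manual = identifiers("method1", ["X", "YY", "W"], "INTEGER")

strict = precision_recall(generated, manual)
soft = precision_recall(generated, manual, equivalent_under([("Y", "YY")]))
conceptual = precision_recall(
    generated, manual, equivalent_under([("Y", "YY"), ("m1", "method1")])
)
```

The generated method has 4 identifiers and the manual one has 5, which gives the denominators in the example; each looser comparison adds one more matching identifier.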
Evaluation
  performed from 2 aspects:
    average: consider all frames
    maximum: choose only the best manually created frame for each generated frame
  results: [figure]
Conclusion
  we have shown that our methodology stepwise instantiates the underlying table model
  experiments show that:
    from a conceptual point of view, the system gets appropriate names for frames in almost 75% of cases
    it gets totally identical names in more than 50% of cases
  we demonstrated and evaluated the successful automatic generation of frames from HTML tables
Future Work
  generate one (most general) frame from multiple tables
  reduction of complexity
  population of ontologies with instances
  show the feasibility of the approach in a practical setting
  use a given ontology as background knowledge
Thank you
Inter-annotator agreement
  max(F_X) = F_conceptual ≈ 60%
  only 2 totally identical frames (2/21=9.52%)
  only 5 identical frames from a conceptual view (5/21=23.81%)
    these 5 tables cover all 1D class tables and 2 (out of 3) 2D class tables
  possible reasons for low agreement:
    the annotators did not follow the guidelines precisely
    the task itself is hard
    the annotation guidelines were not clear/detailed enough
  actual results: [figure]
Example 1
  Generated Frame:
    Tour [ Name (Code) => TOKEN
           Price (Code) => CURRENCY
           Hotel (Code) => TOKEN
           Meal (Code) => TOKEN ]
  Annotator 1:
    Tour [ TourCode => ALPHANUMERIC
           TourName => TOKEN
           Price => CURRENCY
           Hotel => TOKEN
           Meal => TOKEN ]
  Annotator 2:
    TourCode [ TourName => TOKEN
               Price => CURRENCY
               Hotel => ALPHANUMERIC
               Meal => ALPHANUMERIC ]
Example 2
  Generated Frame:
    Trip [ Cost (TimePeriod) => CURRENCY
           Insurance (TimePeriod) => CURRENCY ]
  Annotator 1:
    Trip [ Cost (Duration) => CURRENCY
           Insurance (Duration) => CURRENCY ]
  Annotator 2:
    Trip [ Duration => ALPHANUMERIC
           DurationType => ALPHANUMERIC
           Cost => CURRENCY
           Insurance => CURRENCY ]
Example 3
  Generated Frame:
    Transportation [ Description (Transportation) => STRING
                     HalfDay (Transportation) => CURRENCY
                     FullDay (Transportation) => CURRENCY
                     HoursHakone (Transportation) => CURRENCY ]
  Annotator 1:
    Transportation [ Vehicle => ALPHANUMERIC
                     Seats => NUMBER
                     WheelChairs => NUMBER
                     JumpSeats => NUMBER
                     Baggage => NUMBER
                     Toilet => NUMBER
                     Duration (TourType) => NUMBER
                     Cost (TourType) => CURRENCY ]