Semi-Automatically Generating Data-Extraction Ontology
Yihong Ding
March 6, 2001
Extract information from Web documents

Cars Application Ontology

$Revision: 1.2 $
$Log: cars.osm,v $
-- Revision /02/20 00:15:55 liddl
-- Cleaned up header
-- Revision /02/20 00:14:14 liddl
-- Initial revision
--
Car [-> object];
Car [0:1] has Year [1:*];
Year matches [4] constant
  { extract "\d{2}"; context "([^\$\d]|^)[4-9]\d[^,\dkK]"; substitute "^" -> "19"; },
  { extract "\d{2}"; context "([^\$\d]|^)[4-9]\d,[^\d]"; substitute "^" -> "19"; },
  { extract "\d{2}"; context "\b'[4-9]\d\b"; substitute "^" -> "19"; },
  { extract "\d{2}"; context "([^\$\d]|^)0\d[^,\dkK]"; substitute "^" -> "20"; },
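The Year rules above pair a context pattern with an extract pattern and a substitution. A minimal sketch of how one such rule might be applied, assuming the context pattern locates a span, the extract pattern picks the value inside that span, and the substitution normalizes it. The dict layout and the names YEAR_RULE and apply_rule are hypothetical, not part of the original system.

```python
import re

# First Year rule from the slide: two-digit years 40-99 become 19xx.
# The dict layout is an assumption made for illustration.
YEAR_RULE = {
    "extract": r"\d{2}",
    "context": r"([^\$\d]|^)[4-9]\d[^,\dkK]",
    "substitute": (r"^", "19"),
}

def apply_rule(rule, text):
    """Return the normalized values that the rule extracts from text."""
    values = []
    for ctx in re.finditer(rule["context"], text):
        # Search for the extract pattern only inside the matched context.
        hit = re.search(rule["extract"], ctx.group(0))
        if hit:
            pattern, replacement = rule["substitute"]
            values.append(re.sub(pattern, replacement, hit.group(0), count=1))
    return values

print(apply_rule(YEAR_RULE, "95 Audi 80, white, $4500"))  # prints ['1995']
```

In this sketch the context keeps "80," and "$4500" from being read as years: the first is followed by a comma and the second is preceded by a dollar sign.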
Ontology

A computational entity: a resource containing knowledge about what “concepts” exist in the world and how they relate to one another

Components
  Concepts
    Domain dependent
      Context free
      Context sensitive
    Domain independent
      Context free
      Context sensitive
  Relationship (relational schema between the concepts)
  Constraints

Car [-> object];
Car [0:1] has Make [1:*];
Make matches [10] constant
  { extract "\baudi\b"; };
end;
Car [0:1] has Model [1:*];
Model matches [25] constant
  { extract "80"; context "\baudi\S*\s*80\b"; };
end;
Car [0:1] has Mileage [1:*];
Mileage matches [8] constant
  { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; };
end;
Car [0:1] has Price [1:*];
Price matches [8] constant
  { extract "[1-9]\d{3,6}"; context "\$[1-9]\d{3,6}"; };
end;
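The relationship declarations above combine two of the listed components: the relationship between two concepts and the bracketed constraints on each side. A minimal sketch of reading them into a data structure, assuming every declaration has exactly the "A [range] verb B [range];" form shown on the slide. The names Relationship, DECL, and parse_relationships are hypothetical.

```python
import re
from typing import NamedTuple

class Relationship(NamedTuple):
    subject: str        # e.g. "Car"
    subject_range: str  # bracketed range next to the subject, e.g. "0:1"
    verb: str           # e.g. "has"
    obj: str            # e.g. "Make"
    obj_range: str      # bracketed range next to the object, e.g. "1:*"

# Assumed declaration shape: Name [range] verb Name [range];
DECL = re.compile(r"(\w+)\s*\[([^\]]+)\]\s+(\w+)\s+(\w+)\s*\[([^\]]+)\]\s*;")

def parse_relationships(source):
    """Collect every relationship declaration found in the ontology text."""
    return [Relationship(*m.groups()) for m in DECL.finditer(source)]

source = "Car [-> object]; Car [0:1] has Make [1:*]; Car [0:1] has Model [1:*];"
for rel in parse_relationships(source):
    print(rel.subject, rel.verb, rel.obj, rel.subject_range, rel.obj_range)
```

The object declaration `Car [-> object];` has no verb and no second concept, so the pattern skips it.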
My work

Assumptions
  Given an information knowledge base that already contains domain-dependent and domain-independent concepts
    Pre-defined ontologies: Mikrokosmos, Gene, our ontologies, etc.
    Component recognizers: date, time, price, phone number, etc.
  Given sample training Web documents

Semi-automatically generate the ontology
Architecture

[Figure: architecture diagram. Inputs: information knowledge base; training Web documents. Boxes: classify related concepts for the sample documents; partial completed ontology; pattern learning & updating; raw completed ontology; user control interface; output final ontology. Decision labels: Satisfied; Need modification.]
Example: CIA Factbook

Country: China
Location: Eastern Asia, bordering the East China Sea, Korea Bay, Yellow Sea, and South China Sea, between North Korea and Vietnam
Geographic coordinates: N, E
Map references: Asia
Area:
  total: 9,596,960 sq km
  land: 9,326,410 sq km
  water: 270,550 sq km
Partial completed ontology

CountryName matches [30] constant
  { extract "\bChina\b"; },
  { extract "\bUnited States\b"; };
  …
end;
Location matches [50] constant
  { extract "\bAsia\b"; },
  { extract "\bEurope\b"; },
  …
  { extract "\bYellow Sea\b"; },
  …
end;
Latitude matches [10] constant
  { extract "\b[1-9]\d{0,2}\b[1-9]\d{0,1}(N|S)"; },
end;
Longitude matches [10] constant
  { extract "\b[1-9]\d{0,2}\b[1-9]\d{0,1}(E|W)"; },
end;
Number matches [6] constant
  { extract "[1-9]\d{0,5}"; },
  { extract "[1-9]\d{0,2},\d{3}"; },
end;

Sample document: the China entry from the CIA Factbook example.
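The recognizers above can be run over the sample document to find which known concepts it mentions. A minimal sketch, assuming plain regular-expression matching with no disambiguation. It uses only the CountryName and Location patterns visible on the slide; KNOWLEDGE_BASE and recognize are hypothetical names.

```python
import re

# Only the patterns shown on the slide; the real knowledge base is larger.
KNOWLEDGE_BASE = {
    "CountryName": [r"\bChina\b", r"\bUnited States\b"],
    "Location": [r"\bAsia\b", r"\bEurope\b", r"\bYellow Sea\b"],
}

SAMPLE = ("Country: China Location: Eastern Asia, bordering the East China Sea, "
          "Korea Bay, Yellow Sea, and South China Sea, "
          "between North Korea and Vietnam")

def recognize(knowledge_base, text):
    """Return (start, concept, matched text) triples in document order."""
    hits = []
    for concept, patterns in knowledge_base.items():
        for pattern in patterns:
            for m in re.finditer(pattern, text):
                hits.append((m.start(), concept, m.group(0)))
    return sorted(hits)

for start, concept, matched in recognize(KNOWLEDGE_BASE, SAMPLE):
    print(start, concept, matched)
```

This naive matcher also reports "China" inside "East China Sea" and "South China Sea" as a CountryName, which suggests why raw matches may need refining before they are turned into an ontology.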
Raw completed ontology

Country [-> object];
Country [0:1] has CountryName [1:1];
Country [0:1] has Location1 [1:*];
...
Country [0:1] has Location8 [1:*];
Country [0:1] has Latitude [1:*];
Country [0:1] has Longitude [1:*];
Country [0:1] has Number1 [1:*];
Country [0:1] has Number2 [1:*];
Country [0:1] has Number3 [1:*];
-- ** Generalization/Specializations
Location1 : Location
...
Location8 : Location
Number1 : Number
Number2 : Number
Number3 : Number

Sample document: the China entry from the CIA Factbook example.
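The numbering scheme in the raw completed ontology (Location1 through Location8, Number1 through Number3) can be sketched as follows. This assumes each recognized occurrence of a concept becomes its own numbered attribute and a specialization of that concept. The function raw_ontology is hypothetical, and it emits [1:*] for every attribute, so it does not reproduce the [1:1] shown for CountryName.

```python
from collections import Counter

def raw_ontology(root, concepts):
    """Build declaration lines from recognized concepts listed in document order."""
    counts = Counter(concepts)
    lines = [f"{root} [-> object];"]
    specializations = []
    # dict.fromkeys keeps each concept once, in order of first appearance.
    for concept in dict.fromkeys(concepts):
        n = counts[concept]
        names = [concept] if n == 1 else [f"{concept}{i}" for i in range(1, n + 1)]
        for name in names:
            lines.append(f"{root} [0:1] has {name} [1:*];")
            if name != concept:
                specializations.append(f"{name} : {concept}")
    return lines + specializations

recognized = ["CountryName", "Location", "Location", "Latitude", "Number", "Number"]
print("\n".join(raw_ontology("Country", recognized)))
```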
User control interface

Output to user
  raw completed ontology
  tagged training Web pages
  the query results

User may
  modify attribute names
  combine attributes
  delete useless attributes
  change relationships
  add new attributes, new relations, and constraints
  …

When satisfied, output the final ontology

Sample document tagged with the raw completed ontology:
Country: China {CountryName}
Location: Eastern Asia {Location1}, bordering the East China Sea {Location2}, Korea Bay {Location3}, Yellow Sea {Location4}, and South China Sea {Location5}, between North Korea {Location6}, and Vietnam {Location7}
Geographic coordinates: N {Latitude}, E {Longitude}
Map references: Asia {Location8}
Area:
  total: 9,596,960 {Number1} sq km
  land: 9,326,410 {Number2} sq km
  water: 270,550 {Number3} sq km

After renaming attributes:
Country: China {CountryName}
Location: Eastern Asia {Location1}, bordering the East China Sea {Location2}, Korea Bay {Location3}, Yellow Sea {Location4}, and South China Sea {Location5}, between North Korea {Location6}, and Vietnam {Location7}
Geographic coordinates: N {Latitude}, E {Longitude}
Map references: Asia {MapReference}
Area:
  total: 9,596,960 {TotalArea} sq km
  land: 9,326,410 {LandArea} sq km
  water: 270,550 {WaterArea} sq km

After combining the location attributes:
Country: China {CountryName}
Location: Eastern Asia, bordering the East China Sea, Korea Bay, Yellow Sea, and South China Sea, between North Korea, and Vietnam {Location}
Geographic coordinates: N {Latitude}, E {Longitude}
Map references: Asia {MapReference}
Area:
  total: 9,596,960 {TotalArea} sq km
  land: 9,326,410 {LandArea} sq km
  water: 270,550 {WaterArea} sq km
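The tagged output and the attribute-renaming step can be sketched as below. This assumes each recognized value is stored as a (start, end, label) span over the document text and that spans do not overlap. The helpers span_of, tag, and rename are hypothetical names, not the interface of the actual tool.

```python
def span_of(text, phrase, label):
    """Build a (start, end, label) span for the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)

def tag(text, spans):
    """Insert ' {label}' after each span; spans are assumed not to overlap."""
    out, pos = [], 0
    for start, end, label in sorted(spans):
        out.append(text[pos:end])
        out.append(f" {{{label}}}")
        pos = end
    out.append(text[pos:])
    return "".join(out)

def rename(spans, mapping):
    """Apply the user's attribute renames; unmapped labels stay unchanged."""
    return [(s, e, mapping.get(label, label)) for s, e, label in spans]

text = "Map references: Asia Area: total: 9,596,960 sq km"
spans = [span_of(text, "Asia", "Location8"), span_of(text, "9,596,960", "Number1")]

print(tag(text, spans))
renamed = rename(spans, {"Location8": "MapReference", "Number1": "TotalArea"})
print(tag(text, renamed))
```

Keeping the labels separate from the text means a rename only touches the span list, and the tagged page can be regenerated after each user modification.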
Problems

  Obtain the knowledge base
  Classify related concepts for the sample documents
  Refine
  Tag the document based on the raw completed ontology
  User interface design and control
  Update strategy for the raw completed ontology based on user modifications
Contribution

  Exploit existing knowledge
  Semi-automatically generate an extraction ontology