Understanding System_T By Mao Xianling
Outline
- Introduction to System_T
- Primary tests
- Problem
Installing the Development Environment
1. Download System Text from IBM's alphaWorks site; just search for "System Text".
2. Uncompress the .zip/.tar file onto your computer's hard drive.
3. Run the startup script: sh SystemText-[version]/bin/startserver.sh
4. Start the Development Environment by pointing your web browser at the server's address.
Development Environment
One Example for AQL Code

    create view PhoneNum as
      extract regex /[0-9]{3}-[0-9]{4}/
        on D.text as number
      from Document D;

    output view PhoneNum;
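To make the effect of the view above concrete, here is a rough Python analogue, not AQL and not part of System_T. It applies the same pattern to a string and reports each match with its character offsets, which is roughly what the `number` column holds. The function name `extract_phone_numbers` and the sample sentence are made up for illustration.

```python
import re

# Same pattern as the AQL example: three digits, a hyphen, four digits.
PHONE_PATTERN = re.compile(r"[0-9]{3}-[0-9]{4}")


def extract_phone_numbers(text):
    """Return (begin, end, matched_text) for each non-overlapping match.

    Hypothetical helper: it only imitates what the PhoneNum view produces.
    """
    return [(m.start(), m.end(), m.group()) for m in PHONE_PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = "Call 555-1234 or 555-9876 after noon."
    for begin, end, number in extract_phone_numbers(sample):
        print(begin, end, number)
```

Each result carries offsets as well as text, because an AQL extraction yields spans over the document rather than copied strings.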
Introduction to AQL
- AQL: a language for building annotators that extract structured information from unstructured or semistructured text.
- AQL is the primary method of creating new annotators in System Text for Information Extraction.
Introduction to AQL
The syntax of AQL is similar to that of SQL, but with several important differences:
- AQL is case sensitive.
- AQL allows regular expressions to be expressed in Perl syntax, e.g. /regex/ instead of 'regex'.
- AQL currently does not support advanced SQL features like correlated subqueries and recursive queries.
- AQL has a new statement type, extract, which is not present in SQL.
Data Model
- AQL's data model is similar to the standard relational model used by SQL databases like DB2.
- All data in AQL is stored in tuples: data records of one or more columns, or fields.
- A collection of tuples forms a relation.
- All tuples in a relation must have the same schema, that is, the same names and types of their fields.
Data Model
The fields of an AQL tuple must belong to one of the language's built-in scalar types:
- Integer: A 32-bit signed integer.
- Text: A Unicode string, with additional metadata to indicate which tuple the string belongs to.
- Span: A contiguous region of characters in a Text object.
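The data model can be sketched in Python as follows. This is an illustration only, not System_T's actual representation. Two things here are assumptions: half-open `[begin, end)` offsets, and the names `covered_text` and `same_schema`. The metadata attached to Text is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    """A Unicode string (the extra metadata from the slide is omitted here)."""
    value: str


@dataclass(frozen=True)
class Span:
    """A contiguous region of characters in a Text object.

    Assumption: offsets are half-open, i.e. the span covers [begin, end).
    """
    text: Text
    begin: int
    end: int

    def __post_init__(self):
        if not (0 <= self.begin <= self.end <= len(self.text.value)):
            raise ValueError("span offsets fall outside the text")

    def covered_text(self):
        """Return the characters the span points at."""
        return self.text.value[self.begin:self.end]


def same_schema(relation, schema):
    """True if every tuple has exactly the field names and types in schema.

    Tuples are modelled as dicts; schema maps field name -> Python type.
    """
    return all(
        t.keys() == schema.keys()
        and all(isinstance(t[name], ftype) for name, ftype in schema.items())
        for t in relation
    )


if __name__ == "__main__":
    doc = Text("Call 555-1234")
    number = Span(doc, 5, 13)
    relation = [{"number": number, "length": 8}]
    print(number.covered_text())
    print(same_schema(relation, {"number": Span, "length": int}))
```

A span stores offsets into its Text instead of copying the characters, which is why later operations can compare the positions of two spans in the same document.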
Execution Model
AQL Statement
- The create view Statement
- The extract Statement
  - Extraction Specifications
    - Regular Expressions
    - Dictionaries
    - Splits
- The select Statement
- The create table Statement
- Built-In Functions
  - Predicate Functions
  - Scalar Functions
  - Table Functions
    create view PersonFirstOrLastName as
      extract dictionary 'names.dict'
        on D.text as name
      from Document D
      having MatchesRegex(/[A-Z].+/, name);

    create view PhoneNumber as
      extract regexes /(\d{3})-(\d{3}-\d{4})/
                  and /\(\d{3}\)\s*(\d{3}-\d{4})/
        on D.text as num
      from Document D;

    create view ExtensionNumbers as
      extract regex /[Ee]xt\s*[\.\-\:]?\s*(\d{3,5})/
        on D.text
        return group 1 as num
           and group 0 as completenum
      from Document D;

    create view PhoneNumberWithExtension as
      select CombineSpans(P.num, E.completenum) as num
      from PhoneNumber P, ExtensionNumbers E
      where FollowsTok(P.num, E.completenum, 0, 1);

    create view PhoneNumberAll as
      (select P.num as num from PhoneNumber P)
      union all
      (select E.completenum as num from ExtensionNumbers E)
      union all
      (select P.num as num from PhoneNumberWithExtension P);

    create view PhoneNumberAllConsolidated as
      select P.num as num
      from PhoneNumberAll P
      consolidate on P.num using 'ContainedWithin';

    create view PersonsPhone as
      select person.name as person,
             phone.num as phone,
             CombineSpans(person.name, phone.num) as personphone
      from PersonFirstOrLastName person, PhoneNumberAllConsolidated phone
      where Follows(person.name, phone.num, 0, 30);

    output view PersonsPhone;
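The span operations used in the views above can be approximated in Python. This sketch reflects one reading of their behaviour and is not the System_T implementation. It assumes that `Follows` counts the characters between the end of the first span and the start of the second, that `CombineSpans` returns the smallest span covering both inputs, and that `'ContainedWithin'` consolidation drops any span lying inside another. The token-based `FollowsTok` is not modelled, and all function names are hypothetical.

```python
from collections import namedtuple

# Assumption: half-open character offsets into one shared document.
Span = namedtuple("Span", ["begin", "end"])


def follows(first, second, min_chars, max_chars):
    """True if second starts between min_chars and max_chars after first ends."""
    gap = second.begin - first.end
    return min_chars <= gap <= max_chars


def combine_spans(first, second):
    """Smallest span covering both inputs (this sketch ignores argument order)."""
    return Span(min(first.begin, second.begin), max(first.end, second.end))


def consolidate_contained_within(spans):
    """Drop every span that lies inside a different span; keep one copy of duplicates."""
    unique = sorted(set(spans))
    return [
        s for s in unique
        if not any(o != s and o.begin <= s.begin and s.end <= o.end for o in unique)
    ]


if __name__ == "__main__":
    # Offsets for a text such as "John 555-1234 ext. 1234".
    person = Span(0, 4)
    phone = Span(5, 13)
    extension = Span(14, 23)
    phone_with_ext = combine_spans(phone, extension)

    candidates = [phone, extension, phone_with_ext]
    print(consolidate_contained_within(candidates))
    print(follows(person, phone_with_ext, 0, 30))
    print(combine_spans(person, phone_with_ext))
```

In this sketch, consolidation is the step that keeps only the longest phone span, so a number with an extension is reported once rather than as three overlapping matches.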
Primary Tests
Data set
- From the TianWang crawler
- Chinese
- Firstname.dict/Lastname.dict (for Chinese)
Method
- Using AQL to build annotators
Annotator for extracting phone numbers
Annotator for extracting names
Time && Space
Problem
- English vs. Chinese [extract regex /[0-9]{3}/ on 1 token in D.text]
- Time && Space && Network?
- MultiSet?
- The expressive power of regex?
- No source code && MapReduce?
- Zip?