Understanding System_T By Mao Xianling 2009.02.28.

1 Understanding System_T By Mao Xianling 2009.02.28

2 Outline
- Introduction to System_T
- Primary tests
- Problem

4 Installing the Development Environment
- Download from IBM's AlphaWorks site; just search for "System Text" at http://www.alphaworks.ibm.com/
- Uncompress the .zip/.tar file onto your computer's hard drive.
- Run the startup script: sh SystemText-[version]/bin/startserver.sh
- Start the Development Environment by pointing your web browser at http://localhost:8083/aql

5 Development Environment

8 One Example for AQL Code
    create view PhoneNum as
      extract regex /[0-9]{3}-[0-9]{4}/
        on D.text as number
      from Document D;
    output view PhoneNum;
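
To make the effect of this view concrete, here is a minimal Python sketch of roughly the same extraction. It is an analogue for illustration, not AQL and not System Text's implementation: the function name `extract_phone_numbers` and the sample text are assumptions, and only the regular expression is taken from the slide.

```python
import re

# Same pattern as the AQL example: three digits, a hyphen, four digits.
PHONE_PATTERN = re.compile(r"[0-9]{3}-[0-9]{4}")


def extract_phone_numbers(text):
    """Return one (begin, end, matched_text) tuple per match.

    Hypothetical helper: it mimics a view with a single span-like column
    by reporting the character offsets of each match in the document text.
    """
    return [(m.start(), m.end(), m.group(0)) for m in PHONE_PATTERN.finditer(text)]


# Made-up sample document, used only for illustration.
doc_text = "Call 555-1234 or 555-9876 after noon."
for begin, end, number in extract_phone_numbers(doc_text):
    print(begin, end, number)
```

Each result carries offsets as well as the matched string, which is the idea behind returning a span over the document rather than a copied substring.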

12 Introduction to AQL
AQL is a language for building annotators that extract structured information from unstructured or semistructured text. AQL is the primary method of creating new annotators in System Text for Information Extraction.

13 Introduction to AQL
The syntax of AQL is similar to that of SQL, but with several important differences:
- AQL is case sensitive.
- AQL allows regular expressions to be expressed in Perl syntax, e.g. /regex/ instead of 'regex'.
- AQL currently does not support advanced SQL features like correlated subqueries and recursive queries.
- AQL has a new statement type, extract, which is not present in SQL.

14 Data Model
AQL's data model is similar to the standard relational model used by SQL databases like DB2. All data in AQL is stored in tuples: data records with one or more columns, also called fields. A collection of tuples forms a relation. All tuples in a relation must have the same schema, that is, the same names and types for their fields.
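
The schema rule can be illustrated with a small Python sketch. This is only an analogy to the relational model described on the slide: the names `conforms` and `make_relation`, the dict-based representation, and the sample schema are assumptions, not part of AQL.

```python
def conforms(row, schema):
    """True if the row has exactly the schema's field names and field types."""
    return row.keys() == schema.keys() and all(
        isinstance(row[name], field_type) for name, field_type in schema.items()
    )


def make_relation(rows, schema):
    """Collect rows into a relation, rejecting any row that breaks the schema."""
    rows = list(rows)
    for row in rows:
        if not conforms(row, schema):
            raise ValueError(f"tuple {row!r} does not match schema {sorted(schema)}")
    return rows


# Hypothetical schema: a person name and a phone number, both strings.
schema = {"person": str, "phone": str}
relation = make_relation(
    [
        {"person": "Alice", "phone": "555-1234"},
        {"person": "Bob", "phone": "555-9876"},
    ],
    schema,
)
print(len(relation))
```

A row with a missing field, an extra field, or a field of the wrong type is rejected, which is what "same schema" means for every tuple in a relation.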

15 Data Model
The fields of an AQL tuple must belong to one of the language's built-in scalar types:
- Integer: A 32-bit signed integer.
- Text: A Unicode string, with additional metadata to indicate which tuple the string belongs to.
- Span: A contiguous region of characters in a Text object.
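
As a rough illustration of how a Span relates to a Text, the following Python sketch models a span as a pair of offsets into a text object. The class and method names are assumptions made for this example, the end offset is assumed to be exclusive, and the Text metadata mentioned on the slide is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    """Simplified stand-in for a Text value: just the Unicode string."""
    content: str


@dataclass(frozen=True)
class Span:
    """A contiguous region of characters in a Text (end offset exclusive)."""
    text: Text
    begin: int
    end: int

    def __post_init__(self):
        # Reject offsets that do not describe a region inside the text.
        if not (0 <= self.begin <= self.end <= len(self.text.content)):
            raise ValueError("span offsets fall outside the text")

    def covered_text(self):
        """Return the characters the span covers."""
        return self.text.content[self.begin:self.end]


doc = Text("Call 555-1234 today")
phone = Span(doc, 5, 13)
print(phone.covered_text())
```

Keeping offsets rather than a copied substring is what allows later steps to ask positional questions, such as whether one span follows or contains another.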

16 Execution Model

17 AQL Statement
- The create view Statement
- The extract Statement
  - Extraction Specifications: Regular Expressions, Dictionaries, Splits
- The select Statement
- The create table Statement
- Built-In Functions
  - Predicate Functions
  - Scalar Functions
  - Table Functions

18
    create view PersonFirstOrLastName as
      extract dictionary 'names.dict'
        on D.text as name
      from Document D
      having MatchesRegex(/[A-Z].+/, name);
    create view PhoneNumber as
      extract regexes /(\d{3})-(\d{3}-\d{4})/ and /\(\d{3}\)\s*(\d{3}-\d{4})/
        on D.text as num
      from Document D;
    create view ExtensionNumbers as
      extract regex /[Ee]xt\s*[\.\-\:]?\s*(\d{3,5})/
        on D.text
        return group 1 as num
           and group 0 as completenum
      from Document D;
    create view PhoneNumberWithExtension as
      select CombineSpans(P.num, E.completenum) as num
      from PhoneNumber P, ExtensionNumbers E
      where FollowsTok(P.num, E.completenum, 0, 1);
    create view PhoneNumberAll as
      (select P.num as num from PhoneNumber P)
      union all
      (select E.completenum as num from ExtensionNumbers E)
      union all
      (select P.num as num from PhoneNumberWithExtension P);
    create view PhoneNumberAllConsolidated as
      select P.num as num
      from PhoneNumberAll P
      consolidate on P.num using 'ContainedWithin';
    create view PersonsPhone as
      select person.name as person,
             phone.num as phone,
             CombineSpans(person.name, phone.num) as personphone
      from PersonFirstOrLastName person, PhoneNumberAllConsolidated phone
      where Follows(person.name, phone.num, 0, 30);
    output view PersonsPhone;
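
The span operations used in these views can be approximated in a few lines of Python. This is a sketch under stated assumptions, not System Text's implementation: spans are plain (begin, end) offset pairs with an exclusive end, the helper names are invented for the example, the semantics of combining and consolidating are a simplified reading of the slide, and only the character-based Follows predicate is modeled (the token-based FollowsTok is not).

```python
def combine_spans(first, second):
    """Smallest span that covers both inputs (simplified CombineSpans analogue)."""
    return (min(first[0], second[0]), max(first[1], second[1]))


def follows(first, second, min_chars, max_chars):
    """True if second begins between min_chars and max_chars after first ends."""
    gap = second[0] - first[1]
    return min_chars <= gap <= max_chars


def consolidate_contained_within(spans):
    """Drop spans that lie inside another span; duplicates collapse to one copy."""
    unique = sorted(set(spans))
    return [
        s for s in unique
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in unique)
    ]


# Hand-computed offsets for the made-up text "John 555-123-4567 ext 890".
name = (0, 4)         # "John"
phone = (5, 17)       # "555-123-4567"
extension = (18, 25)  # "ext 890"

with_extension = combine_spans(phone, extension)
candidates = [phone, extension, with_extension]
consolidated = consolidate_contained_within(candidates)
print(consolidated)
print([combine_spans(name, p) for p in consolidated if follows(name, p, 0, 30)])
```

The consolidation step is what keeps the final output from listing the bare phone number, the bare extension, and the combined span as three separate results.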

21 Outline
- Introduction to System_T
- Primary tests
- Problem

22 Primary Tests
- Data set: from the TianWang crawler; Chinese; Firstname.dict/Lastname.dict (for Chinese)
- Method: using AQL to build annotators

23 Annotator for extracting phone numbers

24 Annotator for extracting names

25 Time && Space

26 Outline
- Introduction to System_T
- Primary tests
- Problem

27 Problem
- English vs. Chinese [extract regex /[0-9]{3}/ on 1 token in D.text]
- Time && Space && Network?
- MultiSet?
- The expressive power of regex?
- No source code && MapReduce?
- Zip?
