
CIS 895 – MSE PROJECT: KDD-Service based Numerical Entity Searcher (KSNES). Presentation 3 on April 14th, 2009. Naga Sowjanya Karumuri


1 CIS 895 – MSE PROJECT
KDD-Service based Numerical Entity Searcher (KSNES)
Presentation 3 on April 14th, 2009
Naga Sowjanya Karumuri
sowji@ksu.edu

2 OUTLINE
- Introduction
- Terms
- Motivation
- Goal
- Project Overview
- Project Data Flow Diagram
- Component Design
- Project Evaluation
- Future Work
- Prototype Demonstration
- Questions / Comments

3 TERMS [1]
- Knowledge Discovery in Databases (KDD)
  - A group headed by Dr. Hsu
  - Primary focus is machine learning, data mining, and human-computer intelligent interaction
- Natural Language Processing (NLP)
  - Allows computers to process and understand human languages
  - Some areas:
    - Text segmentation (identifying word boundaries)
    - Part-of-speech tagging
    - Word sense disambiguation (for words with more than one meaning)

4 TERMS [2]
- Named Entity Recognition (NER)
  - Locating and classifying atomic elements (single part of speech) in text into predefined categories such as:
    - Names of Persons
    - Names of Locations
    - Names of Organizations
    - Names of Miscellaneous Entities
- Example
  - Dr. William H. Hsu is a Professor at Kansas State University located in Manhattan, Kansas.
  - Dr. [PER William H. Hsu ] is a Professor at [ORG Kansas State University ] located in [LOC Manhattan ], [LOC Kansas ].

5 TERMS [3]
- Shallow Parsing/Chunking
  - An NLP technique that looks for key phrases without fully parsing the sentence into a parse tree.
  - Output: a series of word groups, mostly noun, verb, and prepositional phrases.
- Example
  - Chunker: [NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only L1.8 billion ]
  - Full Parser: (PRP)He (VBZ)reckons (DT)the (JJ)current (NN)account (NN)deficit (MD)will (VB)narrow (TO)to (RB)only (L)L (CD)1.8 (CD)billion
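
The chunking idea above can be illustrated with a minimal sketch. It assumes word/TAG input and a deliberately crude, hypothetical tag-to-chunk mapping (verbs and modals to VP, IN/TO to PP, everything else to NP); it is not the chunker used in the project, and the class and method names are invented for this example.

```java
final class ChunkSketch {
    // Hypothetical mapping from a POS tag to a chunk label.
    static String chunkType(String tag) {
        if (tag.startsWith("VB") || tag.equals("MD")) return "VP";
        if (tag.equals("IN") || tag.equals("TO")) return "PP";
        return "NP"; // crude default: everything else joins a noun phrase
    }

    // Groups consecutive tokens with the same chunk label into one bracket.
    static String chunk(String tagged) {
        StringBuilder out = new StringBuilder();
        String current = null;
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash < 0) {
                throw new IllegalArgumentException("expected word/TAG, got: " + token);
            }
            String word = token.substring(0, slash);
            String type = chunkType(token.substring(slash + 1));
            if (!type.equals(current)) {
                if (current != null) out.append("] ");
                out.append("[").append(type).append(" ");
                current = type;
            }
            out.append(word).append(" ");
        }
        if (current != null) out.append("]");
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(chunk(
            "He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB"));
    }
}
```

A real chunker decides boundaries from context rather than from a fixed tag table, so this only shows the shape of the output.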

6 PROJECT OVERVIEW [1]
Motivation
- Events are naturally anchored in time within narrative text
  - Is Bush currently the President of America?
  - When was India attacked by Pakistan in the last century?
- To know the quantities of entities
  - How many Oscar awards has Steven Spielberg won?
  - What was the highest temperature recorded in the year 2008?

7 PROJECT OVERVIEW [2]
Goal
- To develop a system that
  - extracts numerical phrases from raw text
  - displays value, unit, and unit-type
- The system is set up as a service on the web server
- The user interacts through a webpage
Numerical Phrase: Types
- Number Phrase: 33 dollars, 100 Watts, 13 years, two miles
- Date Phrase: Aug 1998, Nov 10th 1984, between 1989 and 2006
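
A small sketch of the value, unit, unit-type display for number phrases follows. The unit-to-type table, the type names, and the class name are hypothetical and chosen only to cover the slide's examples; the project's actual recognizer is not shown in the slides.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class NumberPhraseSketch {
    // Hypothetical lookup table; a real system would need a far larger one.
    private static final Map<String, String> UNIT_TYPES = new HashMap<>();
    static {
        UNIT_TYPES.put("dollars", "currency");
        UNIT_TYPES.put("watts", "power");
        UNIT_TYPES.put("years", "time");
        UNIT_TYPES.put("miles", "length");
    }

    // Splits "<value> <unit>" and returns {value, unit, unit-type}.
    static String[] split(String phrase) {
        String[] parts = phrase.trim().split("\\s+", 2);
        if (parts.length < 2) {
            throw new IllegalArgumentException("expected '<value> <unit>': " + phrase);
        }
        String type = UNIT_TYPES.getOrDefault(parts[1].toLowerCase(Locale.ROOT), "unknown");
        return new String[] {parts[0], parts[1], type};
    }

    public static void main(String[] args) {
        for (String phrase : new String[] {"33 dollars", "100 Watts", "13 years", "two miles"}) {
            String[] r = split(phrase);
            System.out.println(r[0] + " - " + r[1] + " - " + r[2]);
        }
    }
}
```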

8 PROJECT OVERVIEW [3]
Purpose
- To understand the timestamp of an event
- To understand the order of occurrence of events
- To understand the persistence of an event, i.e., the time period over which the event occurred and continued
For KDD Group
- To gather certain statistical information from the data they collect by crawling different web pages
  - How many cattle have been affected by the virus?
  - When did the disease break out?
- Sample NABC (National Agricultural Bio-Security Centre) data is given to the system for testing

9 APPLICATION AREAS
- Textual Entailment (TE) Recognition
  - Given two text fragments, decide whether the meaning of one can be inferred from the other.
- Question Answering (QA) System
  - Identifies text that entails the expected answer.
Ex: During 1997, 10,000 cattle were killed because of the RVF.
- Possible inferences (TE)
  - 10,000 cattle were killed because of RVF.
  - RVF occurred during 1997.
- Possible questions (QA)
  - How many cattle were killed during the 1997 RVF outbreak?
  - When did RVF occur?

10 SYSTEM OVERVIEW

11 PROJECT DATA FLOW DIAGRAM: NUMERICAL ENTITY SEARCHER

12 MODULES IN THE PROJECT
- Webpage (JSP): For requesting and receiving information from the service.
- POS Tagger (Java): Stanford POS Tagger
- Numerical Phrase Extractor (Java): Implemented using the Shallow Parsing technique
- Number-Unit/Date Pattern Recognizer (Java): Implemented based on the Numerical Quantifier developed by Benjamin Sapp, UIUC.

13 POS TAGGER TAGSET
http://www.cs.ualberta.ca/~lindek/650/Slides/POSTagging.ppt

14 IMPLEMENTING NUMERICAL PHRASE EXTRACTOR
- Input: Tagged Text
  I/PRP lost/VBD thirty-three/JJ dollars/NNS in/IN 1998/CD
- Regular expressions (regex) are used to determine the numerical patterns in the input.
  thirty-three/JJ dollars/NNS in/IN 1998/CD
- Output: Numerical Phrases
  thirty-three dollars in 1998
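
The extraction step on this slide can be sketched roughly as follows. The two regular expressions are hypothetical, written only to cover the example sentence (a hyphenated adjective followed by a noun, and "in" followed by a four-digit year); they are not the project's patterns, and the class name is invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TaggedPhraseSketch {
    // Hypothetical patterns over word/TAG text, joined with alternation.
    private static final Pattern NUMERICAL = Pattern.compile(
        "[a-z]+-[a-z]+/JJ [a-zA-Z]+/NNS?"   // e.g. thirty-three/JJ dollars/NNS
        + "|(in|In)/IN \\d{4}/CD");         // e.g. in/IN 1998/CD

    // Finds every match in the tagged text and strips the /TAG suffixes.
    static List<String> extract(String tagged) {
        List<String> phrases = new ArrayList<>();
        Matcher m = NUMERICAL.matcher(tagged);
        while (m.find()) {
            phrases.add(m.group().replaceAll("/[A-Z]+", ""));
        }
        return phrases;
    }

    public static void main(String[] args) {
        System.out.println(
            extract("I/PRP lost/VBD thirty-three/JJ dollars/NNS in/IN 1998/CD"));
    }
}
```

The first alternative accepts any hyphenated adjective, not just number words, so a real pattern would need to be stricter.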

15 SOME PATTERNS
- "\\d+-\\d+(/JJ|/CD) [a-zA-Z]+/NN" parses tagged text such as 3-2/JJ lead/NN and 20-20/JJ match/NN
- "(between|Between|from|From|In|in|since|Since|during|During)/IN..../CD (([a-zA-Z]+/CC|[a-z]+/TO)..../CD)?" parses 'between 1987 and 1997', 'in 2007 and 2008'
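
The two patterns on this slide can be tried out with a short sketch. The first regex is copied as written. The second is adapted on the assumption that a single space separates tagged tokens (the transcript shows none after /IN and before the year), so treat that spacing as a guess. The class and helper names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SlidePatternsSketch {
    // First slide pattern: digits-digits tagged JJ or CD, followed by a noun.
    static final Pattern HYPHEN_NUM =
        Pattern.compile("\\d+-\\d+(/JJ|/CD) [a-zA-Z]+/NN");

    // Second slide pattern, with one space assumed between tagged tokens.
    static final Pattern BETWEEN_FROM = Pattern.compile(
        "(between|Between|from|From|In|in|since|Since|during|During)/IN ..../CD"
        + "( ([a-zA-Z]+/CC|[a-z]+/TO) ..../CD)?");

    // Returns each match with its /TAG suffixes removed.
    static List<String> find(Pattern pattern, String tagged) {
        List<String> phrases = new ArrayList<>();
        Matcher m = pattern.matcher(tagged);
        while (m.find()) {
            phrases.add(m.group().replaceAll("/[A-Z]+", ""));
        }
        return phrases;
    }

    public static void main(String[] args) {
        System.out.println(find(HYPHEN_NUM, "a/DT 3-2/JJ lead/NN"));
        System.out.println(find(BETWEEN_FROM, "between/IN 1987/CD and/CC 1997/CD"));
    }
}
```

Each "...." matches any four characters, so the date pattern relies on the /CD tag rather than on digit checks to recognize a year.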

16 COMPONENT DESIGN
- Contains class variables and functions
- Added a separate table to describe the roles of the functions

17 COMPONENT DESIGN (MYPATTERNS) [1]
Patterns | Matching Numerical Phrases
p_words | about, around, approximately, more than, nearly, almost, no more than, at least, less than, no fewer than
p_tnl | this, next, last, since, in
p_inl | between, from, in, since, during
p_words + p_abtfrac | about two-thirds of the vote, millions of books
p_words + p_age | 27 year-old bachelor, 27-year-old bachelor
p_words + p_ampm | About 3:00 a.m., 4:15 p.m. CST
p_and | 3,792 children and adolescents
p_tnl + p_anydate | Oct 1st 1987, Nov 5, December 21, 1998

18 COMPONENT DESIGN (MYPATTERNS) [2]
Patterns | Matching Numerical Phrases
p_inl + p_btwfrm | between 1987 and 1997, in 2007 and 2008
p_inl + p_btwfrmd | from 200 to 300 miles, from 7.5 percent to 6.85 percent
p_date | 18 April 2008
p_tnl + p_days | this Monday, next Saturday, last Friday, Tuesday, Wednesday
p_centuary | 17th century, 17th-century
p_words + p_hyphenww | million-dollar home, six-bedroom home, thirty-three dollars
p_hyphennumnum | the 20-20 match, a 3-2 lead
p_in | 9 in 10 people, 1 in every 8 women
p_mids | mid-1990s, the early 1990s, 1970s
p_months | January, February, December, Jan, Feb, Sept, Dec

19 COMPONENT DESIGN (MYPATTERNS) [3]
Patterns | Matching Numerical Phrases
p_words + p_numunit | 33 USD, about 34 miles, 33,333 tons, 3.3 million dollars, one thing, 3.4 billion
p_words + p_per | $33 per day, about 100 miles per hour
p_words + p_percentinches | 39%, 0.5-1%, about 90 %, 20"
p_ratio | one of the five people, 89 percent of people, 3 out of 5 people
p_tty | today, tomorrow, yesterday, noon
p_twmy | this year, this month, next year, next month, last week, last year, last month
p_xbits | 1024KB, 8MB, 320GB, 1TB
p_words + p_yrange | In 1998-99, during 2000-09

20 SAMPLE SENTENCES [1]
Sentence | Patterns
I have lost 33,000 dinars in 1998 | p_numunit, p_btwfrm
At just 12-years-old, he enrolled as a freshman at F.I.U. in Miami. | p_age
The 20" iMac is cheaper at $1200 and it has a 320GB hard drive. | p_percentinches, p_numunit, p_xbits
Volunteers bring in a heavy crane for work on a bridge last month. | p_twmy
As for those who do not invest, around 40% say capitalism is better. | p_percentinches
As of 7 January 2007, about 75 people have died and another 183 infected. | p_date, p_numunit

21 SAMPLE SENTENCES [2]
Sentence | Patterns
Approximately 1% of human sufferers die of the disease. | p_percentinches
Current listings of 2,000 children and adults who are reported missing, including in-depth coverage of high-profile cases. | p_and
38 of the 62 patients who provided blood samples tested positive. | p_ratio
She became an exotic dancer at Scores in New York City in the mid-1990s. | p_mids
Peterson's three capped the surge, giving New Orleans a 64-51 lead. | p_numunit, p_hyphennumnum

22 PROBLEMS ENCOUNTERED
- Determining the patterns
  - Lots of numerical phrases were found
  - Designed patterns to filter more than one kind of numerical pattern
- Prioritizing the patterns
  - More than one pattern may match the same numerical phrase
  - Priorities are needed to avoid clashes between the patterns
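
One way to resolve such clashes is to try patterns in a fixed priority order and keep the first that matches. The sketch below assumes that approach; the slides do not say how priorities were actually implemented. The three labels and regexes are hypothetical simplifications, loosely modeled on examples such as 1998-99 and 3-2 from the pattern tables.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class PatternPrioritySketch {
    // LinkedHashMap keeps insertion order, so earlier entries win.
    private static final Map<String, Pattern> BY_PRIORITY = new LinkedHashMap<>();
    static {
        BY_PRIORITY.put("year-range", Pattern.compile("\\d{4}-\\d{2}"));
        BY_PRIORITY.put("hyphenated-numbers", Pattern.compile("\\d+-\\d+"));
        BY_PRIORITY.put("number", Pattern.compile("\\d+"));
    }

    // Returns the label of the first pattern that matches the whole phrase.
    static String classify(String phrase) {
        for (Map.Entry<String, Pattern> entry : BY_PRIORITY.entrySet()) {
            if (entry.getValue().matcher(phrase).matches()) {
                return entry.getKey();
            }
        }
        return "none";
    }

    public static void main(String[] args) {
        // "1998-99" fits both of the first two patterns; the earlier one wins.
        System.out.println(classify("1998-99"));
        System.out.println(classify("64-51"));
    }
}
```

Placing the more specific pattern first is the design choice here: swapping the first two entries would label every year range as a plain hyphenated number pair.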

23 PROJECT EVALUATION [1]
Test Case | Main Functionality Tested | Pass/Fail
Test Case 1 | Application Functionality | Pass
Test Case 2 | POS Tagger Functionality | Pass
Test Case 3 | Numerical Phrase Extractor Functionality | Pass
Test Case 4 | Number-Unit/Date Pattern Recognizer Functionality | Pass

24 PROJECT EVALUATION [2]
Phase | Expected Completion | Actual Completion
1 | February 26, 2009 | February 24, 2009
2 | March 26, 2009 | March 31, 2009
3 | April 17, 2009 | April 14, 2009

25 PROJECT EVALUATION [3]
Phase 2 took more time since implementation and testing were done simultaneously.

26 PROJECT EVALUATION [4]
More time was spent on coding and documentation.

27 PROJECT EVALUATION [5]
More time was spent on discussion since it is the initial phase.

28 PROJECT EVALUATION [6]
More time was spent on coding after gathering the requirements in the first phase.

29 PROJECT EVALUATION [7]
A lot of time was spent on documentation to meet the ETDR standards.

30 FUTURE WORK
- Adding more patterns
  - To filter more kinds of numerical phrases
- Improving the output display
  - By displaying the number and date phrases in different colors
  - To make it more readable for the user

31 LESSONS LEARNED
Java Tool Usage Java Eclipse IDE Design Development MS Visio SDLC Documentation

32 PROTOTYPE DEMONSTRATION
- KSNES Project
  - Set up as a service on the CIS server
  - A webpage is set up: http://viper.cis.ksu.edu:11603/numerical/

33 FINAL STEPS
- Final Examination Ballot
- Make necessary changes to the MSE Portfolio
- Deliver the Portfolio

34 Questions? Suggestions? THANK YOU

