Daniel Bevis William King Villanova University Spring 2006 CS9010

Project Status Daniel Bevis William King Villanova University Spring 2006 CS9010

Project Overview
- Complete a subset of the Ontology Project (Project Archive)
  - Generate ontology from existing documentation
- Determine if it is possible to generate an ontological classification (categories) from raw data characteristics
- Support flexibility to define a process that allows the ontology to be naturally extended as raw data is incorporated

Development Plan review
- Select subset of subject areas
  - Initially select a limited subject area
  - Important to support reasonably quick review and analysis of results
  - Expand subject area iteratively if time permits
- Define characteristics associated with a subset of the raw data from the web site
  - Consider processing of subject documentation
    - Natural language
    - Indexing search with cross references
  - Consider simple keyword searches

Development Plan review
- Build categories from the characteristics
  - Consider generating a tool that allows you to describe a different subset from the rest of the raw data
  - Create higher level categories based upon common subsets of characteristics
  - Repeat process until top level categories or characteristics conform to existing high level classifications or prove alternate categories
- Place subjects into categories
- Review categorization
  - Manually analyze results
  - Test existing categorization on remaining subjects of the initially selected subset
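
The step of creating higher level categories from common subsets of characteristics could be sketched as below. This is a minimal illustration in modern Python 3, not the project's code: the function name `higher_level_categories`, the `min_shared` threshold, and the sample categories are all assumptions made for the example.

```python
from itertools import combinations


def higher_level_categories(categories, min_shared=2):
    """Group categories that share at least `min_shared` characteristics.

    `categories` maps a category name to a set of characteristics.
    Returns a dict mapping each shared subset to the categories that hold it.
    """
    parents = {}
    pairs = combinations(sorted(categories.items()), 2)
    for (name_a, chars_a), (name_b, chars_b) in pairs:
        shared = frozenset(chars_a & chars_b)
        if len(shared) >= min_shared:
            parents.setdefault(shared, set()).update({name_a, name_b})
    return parents


# Hypothetical sample data, for illustration only.
sample = {
    "branch prediction": {"processor", "pipeline", "branch"},
    "cache design": {"processor", "memory", "cache"},
    "superscalar": {"processor", "pipeline", "issue"},
}
result = higher_level_categories(sample)
# result == {frozenset({"processor", "pipeline"}):
#            {"branch prediction", "superscalar"}}
```

Feeding the shared subsets back in as the next round of categories is one way to repeat the process until top level categories emerge.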

Development Tools
- Natural Language Recognition via NLTK is the basis for initial research
  - Slow but well documented and supported
- Installation details (Win32 API)
  - NLTK Lite w/ Corpora package
  - Python 2.4.2
  - PyWordNet
  - WordNet 2.1
  - Numarray 1.5

Ontology Subset
- Take SIGMICRO category as a single subject set
- Break data into subsets
  - Initial subset allows for simpler manual verification and validation
    - International Symposium on Microarchitecture
    - Initially a small subset of the available archive material will be used
  - Remaining subsets provide for further testing and validation of technique
- Additional subsets from the ACM documentation will be added as time permits

Defined Process
- Take a subset of the raw data elements and define the elements' characteristics
  - Read text in for processing
  - Tokenize text
  - Perform probabilistic parsing via ViterbiParse
    - Consider other parsing techniques if time permits
    - Consider training parsing process
  - Select tokens for analysis
    - Supposition: Nouns will provide adequate tokens to define characteristics
    - Potential Goal: identify a ‘reasonable’ subset of tokens for use as characteristics
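
The tokenize, parse, and select-nouns steps can be illustrated with a small pure-Python stand-in. The slides name NLTK's ViterbiParse; the sketch below does not use NLTK and instead implements a compact Viterbi (most probable parse) chart over a toy grammar. The grammar, lexicon, probabilities, and function names are all hypothetical, chosen only to make the example self-contained in Python 3.

```python
import math
import re
from collections import defaultdict

# Toy PCFG in Chomsky normal form (hypothetical, for illustration only).
BINARY = {  # (left child, right child) -> [(parent, probability)]
    ("NP", "VP"): [("S", 1.0)],
    ("DT", "NN"): [("NP", 1.0)],
    ("VB", "NP"): [("VP", 1.0)],
}
LEXICAL = {  # word -> [(tag, probability of the word given the tag)]
    "the": [("DT", 0.5)],
    "a": [("DT", 0.5)],
    "processor": [("NN", 0.5)],
    "cache": [("NN", 0.5), ("VB", 0.3)],  # ambiguous on purpose
    "uses": [("VB", 0.7)],
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def viterbi_parse(tokens):
    """Fill a chart with the best log-probability for each span and symbol."""
    n = len(tokens)
    best = defaultdict(dict)  # best[i, j][symbol] = (log_prob, backpointer)
    for i, word in enumerate(tokens):
        for tag, p in LEXICAL.get(word, []):
            best[i, i + 1][tag] = (math.log(p), word)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for left, (lp, _) in best[i, k].items():
                    for right, (rp, _) in best[k, j].items():
                        for parent, p in BINARY.get((left, right), []):
                            score = math.log(p) + lp + rp
                            cell = best[i, j]
                            if parent not in cell or score > cell[parent][0]:
                                cell[parent] = (score, (k, left, right))
    return best


def nouns_in_best_parse(best, tokens, root="S"):
    """Walk the best parse and collect the words tagged NN."""
    n = len(tokens)
    if root not in best[0, n]:
        return []
    nouns = []

    def walk(i, j, symbol):
        _, back = best[i, j][symbol]
        if isinstance(back, str):  # lexical entry: backpointer is the word
            if symbol == "NN":
                nouns.append(back)
            return
        k, left, right = back
        walk(i, k, left)
        walk(k, j, right)

    walk(0, n, root)
    return nouns


tokens = tokenize("The processor uses a cache.")
chart = viterbi_parse(tokens)
print(nouns_in_best_parse(chart, tokens))  # ['processor', 'cache']
```

In this toy grammar "cache" may be a noun or a verb; the chart keeps only the reading that fits the highest-scoring parse, which is how the noun tokens are selected here.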

Defined Process
- Select tokens for analysis (continued)
  - May be reasonable to use only a subset of nouns
    - Proper nouns are likely to have little impact if removed
    - Redundant terms and synonyms should likely be consolidated
    - What impact would the use of other types (e.g. verbs) have in generating characteristics?
  - Limiting to nouns will greatly reduce the amount of information to be processed
    - Reduces processing time, allowing faster generation of results in a time-consuming process
    - Defines a bound on what constitutes a characteristic and thereby reduces the volume of data to be manually reviewed during development
    - Will initially require additional testing to verify concept
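
Dropping proper nouns and consolidating redundant or synonymous terms might look like the following sketch. It assumes tokens already carry part-of-speech tags (`NN` for common nouns, `NNP` for proper nouns) and uses a hand-written synonym table as a stand-in for a WordNet lookup; the table, the tags, and the function name `consolidate` are assumptions for illustration.

```python
# Hypothetical synonym table standing in for a WordNet-based lookup.
SYNONYMS = {
    "cpu": "processor",
    "microprocessor": "processor",
    "caches": "cache",
}


def consolidate(tagged_tokens):
    """Keep common nouns only and map each to one canonical term."""
    kept = []
    for word, tag in tagged_tokens:
        if tag != "NN":  # skips proper nouns (NNP) and all non-nouns
            continue
        canonical = SYNONYMS.get(word.lower(), word.lower())
        if canonical not in kept:  # drop redundant repeats, keep order
            kept.append(canonical)
    return kept


tagged = [
    ("CPU", "NN"),
    ("Intel", "NNP"),
    ("processor", "NN"),
    ("uses", "VB"),
    ("caches", "NN"),
]
print(consolidate(tagged))  # ['processor', 'cache']
```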

Defined Process
- Based on common characteristics develop categories
  - Analyze each individual document’s parse tree
  - Use statistical analysis of parse trees between documents
    - Supposition: Higher frequency of terms relative to all documents implies higher level characteristic
    - Potential Goal: Identify a ‘reasonable’ subset of term inter-relations for use as characteristics
  - Assume that some raw data values will cross categories
  - Group elements into those categories
  - Identify common characteristics associated with other characteristics
- Identify higher level characteristics and categories from the categories generated from the raw data
  - Recursive categorization approach
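
The supposition that terms appearing in more documents indicate higher level characteristics can be tried out with a simple document-frequency count. This is a sketch under assumptions: the 0.6 threshold, the function names, and the sample documents are invented for the example, and a plain term list per document stands in for the parse trees.

```python
from collections import Counter


def document_frequency(doc_terms):
    """Return the fraction of documents in which each term appears."""
    counts = Counter()
    for terms in doc_terms.values():
        counts.update(set(terms))  # count each term once per document
    total = len(doc_terms)
    return {term: count / total for term, count in counts.items()}


def split_levels(doc_terms, threshold=0.6):
    """Split terms into higher and lower level candidates by frequency."""
    freq = document_frequency(doc_terms)
    high = sorted(t for t, f in freq.items() if f >= threshold)
    low = sorted(t for t, f in freq.items() if f < threshold)
    return high, low


# Hypothetical noun lists extracted from three documents.
docs = {
    "doc1": ["processor", "cache", "processor"],
    "doc2": ["processor", "pipeline"],
    "doc3": ["processor", "cache", "branch"],
}
print(split_levels(docs))
# (['cache', 'processor'], ['branch', 'pipeline'])
```

Applying the same split again to the documents within each resulting group is one possible reading of the recursive categorization approach.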

Current Development Focus
- Automating retrieval of documents
  - Obtain documents from web sources automatically
  - Convert documents for use in NLTK environment
- Automate execution of the analysis of documents
  - Python based code to handle processing in batch style execution
  - Use existing NLTK tools where available
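
A batch-style retrieval and conversion step could be sketched with the Python standard library alone. The version below targets Python 3 rather than the Python 2.4.2 listed in the tools slide, and the names `TextExtractor`, `html_to_text`, `fetch_text`, and `run_batch` are hypothetical; the `analyze` argument stands for whatever analysis function the project supplies.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Convert an HTML string to plain text for later tokenizing."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def fetch_text(url):
    """Download one document and return its plain text."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return html_to_text(response.read().decode(charset, errors="replace"))


def run_batch(urls, analyze):
    """Fetch each URL in turn and apply the supplied analysis function."""
    return {url: analyze(fetch_text(url)) for url in urls}


page = "<html><head><style>p{}</style></head><body><h1>Title</h1><p>Branch prediction</p></body></html>"
print(html_to_text(page))  # Title Branch prediction
```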
