Published by Victoria Melton. Modified over 9 years ago.
1
A Model for Learning Words by Crawling the Web
Jeff Thomson, Sygys.com
Rex Gantenbein, University of Wyoming
CAINE, November 2009
2
Overview
Goal: create an autonomous language learning system
– Use Web crawler technology
– Extract meaning from paragraphs and sentences to create language understanding
Major issues
– Irregularity of natural language constructions
– Understanding paragraphs and sentences
– Determining meaning of new words
3
Handling irregularities
Most major parts of a language (English, anyway) can be generalized
– Exceptions require preprocessing to fit them into generalizable categories
– Example: inflectional endings on verbs
    bat        is
    bats       am
    batting    are
    batted     was
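The exception-first idea on this slide can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `EXCEPTIONS` table, the `SUFFIXES` tuple, and the `to_root` function are hypothetical names, and the suffix rules are only rich enough to cover the slide's `bat` example.

```python
# Hypothetical exception table: irregular forms mapped to their root.
EXCEPTIONS = {"is": "be", "am": "be", "are": "be", "was": "be"}

# Hypothetical regular endings, tried in this order.
SUFFIXES = ("ing", "ed", "s")


def to_root(word: str) -> str:
    """Reduce a verb form to its root, checking exceptions first."""
    word = word.lower()
    # Exception-first: irregular forms never reach the general rules.
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            stem = word[: -len(suffix)]
            # Undo consonant doubling ("batt" -> "bat"). This simplified
            # rule over-applies to stems such as "pass".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word
```

Without the exception check, the general rule would strip the final "s" from "was", which is why the irregular forms are handled before the regular endings.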
4
Handling irregularities
Idiomatic phrases require understanding of the entire phrase in a colloquial context
– “Go jump in the lake” vs. “Go cook yourself an egg”
Pronoun resolution
– “Three boys each bought a pizza. They ate them in the park.”
5
Extracting understanding
Paragraph understanding
– Matching paragraph structure to common forms
– Finding the nucleus of the paragraph’s meaning
Sentence understanding
– Matching sentence structure to common forms
– Determining the meaning of the words in the sentence
6
Our approach
Exception-first processing
– Preprocessing to handle irregularities
Linguistic classifications based on tree structure
[Figure: classification tree with nodes Clause, Finite, Imperative, Indicative, Non-finite, Interrogative, Declarative]
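A tree-structured classification like the one on this slide might be represented as nested dictionaries. The sketch below is illustrative only. The parent/child arrangement is an assumption: the transcript preserves the node labels but not the edges, so the hierarchy here follows one common grammatical convention (finite clauses split into imperative and indicative, indicative into interrogative and declarative). `CLAUSE_TREE` and `path_to` are hypothetical names.

```python
# Assumed hierarchy; only the node labels come from the slide.
CLAUSE_TREE = {
    "Clause": {
        "Finite": {
            "Imperative": {},
            "Indicative": {"Interrogative": {}, "Declarative": {}},
        },
        "Non-finite": {},
    }
}


def path_to(label, tree=CLAUSE_TREE, prefix=()):
    """Return the chain of categories from the root down to `label`, or None."""
    for name, children in tree.items():
        here = prefix + (name,)
        if name == label:
            return here
        # Depth-first search through the subcategories.
        found = path_to(label, children, here)
        if found is not None:
            return found
    return None
```

Under that assumed layout, `path_to("Declarative")` yields the full classification of a declarative clause, from the most general category to the most specific.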
7
Our approach
Parser (incorporated into Web crawler) to determine structure
– Some structures are disregarded when keywords are already classified
Word classification
– Type, gender, number
– Unknown words are analyzed according to rules using placement in sentence and surrounding classified words
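Classifying an unknown word from its placement and its classified neighbours could look something like the sketch below. Everything in it is an assumption made for illustration: the slide does not list the actual rules, so `LEXICON`, `CONTEXT_RULES`, and `classify_unknowns` are hypothetical, and the rules cover only word type (not gender or number).

```python
# Hypothetical lexicon of already-classified words.
LEXICON = {"the": "determiner", "a": "determiner", "boy": "noun",
           "ate": "verb", "bought": "verb", "big": "adjective"}

# Hypothetical placement rules: (class before, class after) -> guessed class.
# None stands for a sentence boundary or an unclassified neighbour.
CONTEXT_RULES = {
    ("determiner", "verb"): "noun",
    ("determiner", None): "noun",
    ("adjective", None): "noun",
    ("determiner", "noun"): "adjective",
    ("noun", "determiner"): "verb",
}


def classify_unknowns(sentence: str) -> dict:
    """Guess a class for each unknown word from its classified neighbours."""
    words = sentence.lower().rstrip(".!?").split()
    classes = [LEXICON.get(w) for w in words]
    guesses = {}
    for i, word in enumerate(words):
        if classes[i] is not None:
            continue  # already classified, nothing to learn
        before = classes[i - 1] if i > 0 else None
        after = classes[i + 1] if i + 1 < len(words) else None
        guess = CONTEXT_RULES.get((before, after))
        if guess is not None:
            guesses[word] = guess
    return guesses
```

For example, in "The boy devoured the pizza" both "devoured" and "pizza" are absent from the toy lexicon, and each receives a guess from the classes of the words around it.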
8
Our approach
Keyword recognition
– Use “word chains” (sequences of words) with application of linguistic knowledge
Word-level understanding
– Reduce words to root form to process them as keywords
– Reduce irregular forms using an exception database created at preprocessing
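The slide describes word chains only as "sequences of words". One possible reading, assumed here purely for illustration, is that a chain is a run of consecutive words within a sentence. The function name `word_chains` and the `max_len` parameter are hypothetical.

```python
def word_chains(sentence: str, max_len: int = 3) -> list:
    """Return every run of 1..max_len consecutive words as a tuple."""
    words = sentence.lower().rstrip(".!?").split()
    chains = []
    for size in range(1, max_len + 1):
        # Slide a window of `size` words across the sentence.
        for start in range(len(words) - size + 1):
            chains.append(tuple(words[start:start + size]))
    return chains
```

A keyword recognizer could then look each chain up in a table of known phrases, so that a multi-word expression is matched as a unit before its individual words are considered.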
9
System model
Exception database
– Separates generalizable and exception verbs
– Processes word endings
– Scans exception database for exceptions
– Processes “normal” words according to rules
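Separating generalizable verbs from exceptions might work along these lines: generate the forms that the regular rules predict for a root, and store every observed form that the rules fail to predict. This is a sketch under assumptions, not the system described on the slide. The functions `regular_forms` and `build_exception_db` are hypothetical, and the ending rules are deliberately simplified (they handle one-syllable roots such as "bat" and would mis-double longer roots such as "open").

```python
def regular_forms(root: str) -> set:
    """Forms predicted by simplified regular rules (assumed, English only)."""
    # Double a final consonant after a single vowel, as in bat -> batting.
    doubled = root
    if (len(root) >= 3 and root[-1] not in "aeiouwxy"
            and root[-2] in "aeiou" and root[-3] not in "aeiou"):
        doubled = root + root[-1]
    return {root, root + "s", doubled + "ing", doubled + "ed"}


def build_exception_db(observed: dict) -> dict:
    """Map each form the regular rules fail to predict to its root."""
    exceptions = {}
    for root, forms in observed.items():
        predicted = regular_forms(root)
        for form in forms:
            if form not in predicted:
                exceptions[form] = root
    return exceptions
```

Given the forms of "bat" and "be", the verb "bat" contributes nothing to the database because the rules predict all of its forms, while every listed form of "be" is stored as an exception.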
10
System model
Categorization generator
– Separates generalizable and exception words
– Processes word endings
– Scans exception database for exceptions and processes these first
– Processes “normal” words according to rules
Sentence parser with disregard capacity
Paragraph understanding rules
11
System model
Web crawler searches for source material
– Processes the material and enhances its own rules and exceptions
– Eventually will learn enough to understand most material in a given language
Future work
– Implement a pilot version of this system
– Determine how to control for a “given” language
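The crawl-and-learn loop can be illustrated with a toy example. Since the slides list a pilot implementation as future work, nothing here reflects real code from the authors. To stay self-contained the sketch replaces the Web with an in-memory dictionary, and "learning" is reduced to collecting words the system does not already know. `PAGES` and `crawl_and_learn` are hypothetical names.

```python
from collections import deque

# Hypothetical stand-in for the Web: page name -> (text, outgoing links).
PAGES = {
    "start": ("the boy ate a pizza", ["park"]),
    "park": ("the boy ate in the park", ["start"]),
}


def crawl_and_learn(seed: str, known: set) -> set:
    """Visit pages breadth-first and return the words not in `known`."""
    learned = set()
    queue = deque([seed])
    visited = set()
    while queue:
        page = queue.popleft()
        if page in visited or page not in PAGES:
            continue  # skip repeats and dead links
        visited.add(page)
        text, links = PAGES[page]
        for word in text.split():
            if word not in known:
                learned.add(word)
        queue.extend(links)
    return learned
```

In a fuller system, each newly collected word would be passed through the classification rules and exception handling before being added to the known vocabulary.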
12
Questions?