1
Chapter 7 Lexical Analysis and Stoplists
薛茹分 吳佳真 吳寶鳳
2
Outline
- Introduction
- What counts as a word or token?
- Implementing a Lexical Analyzer
- Implementing Stoplists
- Hashing
- Finite State Machine
- A Lexical Analyzer Generator
- DFA for stoplists
2019/1/14
3
Introduction
Lexical analysis
- The process of converting an input stream of characters into a stream of words or tokens.
- The first stage of automatic indexing and of query processing.
- Tokens are groups of characters with collective significance.
- Automatic indexing is the process of algorithmically examining information items to generate lists of index terms.
- Query processing is the activity of analyzing a query and comparing it to indexes to find relevant items.
4
Introduction (Cont’d)
Stoplists
- Many of the most frequently occurring words in English (such as “the,” “of,” “and,” and “to”) are worthless as index terms.
- Eliminating such words from consideration early in automatic indexing speeds processing, saves huge amounts of space in indexes, and does not damage retrieval effectiveness.
- A list of words filtered out during automatic indexing because they make poor index terms is called a stoplist or a negative dictionary.
5
What counts as a word or token?
Some considerations:
- Digits (e.g. B6, B12)
- Hyphens (e.g. F-16, MS-DOS)
- Other punctuation (e.g. COMMAND.COM, OS/2)
- Case (upper to lower case)
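The considerations above amount to policy choices in the tokenizer. The following is a minimal sketch of how such choices might be exposed as switches; the function name `tokenize` and its parameters are hypothetical, and the rule that other punctuation always splits tokens is an assumption made for illustration only.

```python
import re

def tokenize(text, keep_digits=True, keep_hyphens=True, lowercase=True):
    """Split text into tokens under a few illustrative policy switches."""
    if lowercase:
        # Case policy: fold upper case to lower case before matching.
        text = text.lower()
    # Characters allowed inside a token under the chosen digit policy.
    allowed = "a-zA-Z"
    if keep_digits:
        allowed += "0-9"
    if keep_hyphens:
        # A hyphen may join token parts but may not start or end a token.
        pattern = rf"[{allowed}]+(?:-[{allowed}]+)*"
    else:
        pattern = rf"[{allowed}]+"
    # Any other character (".", "/", spaces) acts as a separator (assumption).
    return re.findall(pattern, text)
```

With the defaults, "F-16" and "MS-DOS" stay whole as "f-16" and "ms-dos"; with `keep_hyphens=False` they split into their parts. Which setting is right depends on the collection being indexed.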
6
Implementing a Lexical Analyzer
Finite state machine
[Figure: finite state machine for the lexical analyzer, with numbered states and transitions labeled letter/digit, space, "(", ")", "&", "|", "^", eos, and other]
7
Implementing a Lexical Analyzer
GetToken
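The original code for this slide did not survive extraction. As a rough sketch only, a finite-state `get_token` driven by the input classes letter/digit, space, "(", ")", "&", "|", "^", eos, and other might look like the following. The token type names, the rule that a term starts with a letter, and the lowercasing of terms are assumptions, not the presenters' implementation.

```python
# Single-character tokens; the type names are hypothetical labels.
SINGLE_CHAR_TOKENS = {
    "(": "LPAREN",
    ")": "RPAREN",
    "&": "AMPERSAND",
    "|": "BAR",
    "^": "CARET",
}

def get_token(text, pos):
    """Return (token_type, lexeme, next_pos) for the next token at or after pos."""
    # Start state: loop on spaces.
    while pos < len(text) and text[pos] == " ":
        pos += 1
    # End of string.
    if pos >= len(text):
        return ("EOS", "", pos)
    ch = text[pos]
    if ch.isalpha():
        # Term state: loop on letters and digits (assumes a term starts with a letter).
        start = pos
        while pos < len(text) and text[pos].isalnum():
            pos += 1
        return ("TERM", text[start:pos].lower(), pos)
    if ch in SINGLE_CHAR_TOKENS:
        return (SINGLE_CHAR_TOKENS[ch], ch, pos + 1)
    # Any other character is reported as an error token.
    return ("ERROR", ch, pos + 1)

def lex(text):
    """Call get_token repeatedly until end of string; return (type, lexeme) pairs."""
    tokens, pos = [], 0
    while True:
        kind, lexeme, pos = get_token(text, pos)
        tokens.append((kind, lexeme))
        if kind == "EOS":
            return tokens
```

Each branch corresponds to one transition out of the start state, so adding a new token type means adding one more branch or table entry.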
8
Implementing Stoplists
There are two ways to filter stoplist words from an input token stream:
1. Examine lexical analyzer output and remove any stopwords.
   - Hashing
2. Remove stopwords as part of lexical analysis.
   - Deterministic finite automata (DFA)
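The first approach, filtering the lexical analyzer's output with a hash lookup, can be sketched in a few lines. Python's built-in `frozenset` is hash-based, so membership tests take constant time on average. The tiny `STOPLIST` below reuses the example words from the introduction, and `remove_stopwords` is a hypothetical name.

```python
# A tiny illustrative stoplist; a real one would hold many more words.
STOPLIST = frozenset({"the", "of", "and", "to"})

def remove_stopwords(tokens, stoplist=STOPLIST):
    """Drop every token whose lowercase form is found in the hash-based stoplist."""
    return [token for token in tokens if token.lower() not in stoplist]
```

The cost of this approach is one hash lookup per token after lexical analysis has already produced it, which is the overhead the DFA approach tries to avoid.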
9
A Lexical Analyzer Generator
Inputs: stop words file, input file. Output: terms.
1. Read the stop words file.
2. Create a DFA.
3. Run lexical analysis on the input file.
4. Close the input file and return the terms.
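The flow on this slide can be sketched end to end as follows. This is an illustration under several assumptions: the function names (`read_stop_words`, `create_dfa`, `analyze`) are hypothetical, the stop words file holds one word per line, tokens are runs of letters and digits, and the DFA is an unminimized trie stored as nested dictionaries rather than whatever table the generator in the original chapter emits.

```python
END = "<end>"  # marker key for accepting states; cannot clash with single characters

def read_stop_words(stop_file):
    """Read one stopword per line from an open file-like object."""
    return [line.strip().lower() for line in stop_file if line.strip()]

def create_dfa(stop_words):
    """Build a trie-shaped DFA: nested dicts keyed by character."""
    root = {}
    for word in stop_words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True  # reaching this node means a complete stopword was read
    return root

def analyze(input_file, dfa):
    """Tokenize the input, dropping stopwords during the scan; close the file."""
    terms = []
    with input_file:
        for line in input_file:
            node, word = dfa, []
            # The trailing space flushes a word that ends the last line.
            for ch in line.lower() + " ":
                if ch.isalnum():
                    word.append(ch)
                    # Follow the DFA while a transition exists; None means "dead".
                    node = node.get(ch) if node is not None else None
                else:
                    is_stopword = node is not None and END in node
                    if word and not is_stopword:
                        terms.append("".join(word))
                    node, word = dfa, []
    return terms
```

A usage sketch with in-memory files: `analyze(io.StringIO("an input stream"), create_dfa(["an"]))` keeps "input" and "stream" and drops "an". The stopword test costs nothing beyond the character scan the tokenizer performs anyway.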
10
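DFA for stoplists
L0 = {a, an, and, in, into, to}
[Figure: DFA for L0 with states q0 to q6. Each state is labeled with the set of suffixes still to be read; ε is the empty string and marks an accepting state.]
- q0: {a, an, and, in, into, to}
- q1: {ε, n, nd}
- q2: {n, nto}
- q3: {o}
- q4: {ε, d}
- q5: {ε, to}
- q6: {ε}
Transitions: q0 -a-> q1, q0 -i-> q2, q0 -t-> q3, q1 -n-> q4, q2 -n-> q5, q3 -o-> q6, q4 -d-> q6, q5 -t-> q3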
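One way to build such a DFA is to label each state with the set of suffixes that remain to be read, so that two prefixes leading to the same suffix set share a state. The sketch below follows that idea; it is an assumed construction for illustration, not necessarily the algorithm used by the generator in the original chapter, and `build_dfa` and `is_stopword` are hypothetical names.

```python
from collections import deque

def build_dfa(stoplist):
    """Return (transitions, accepting) for a DFA accepting exactly the stoplist.

    Each state stands for the set of suffixes still to be read; state 0 is the
    start state, labeled with the whole stoplist.
    """
    start = frozenset(stoplist)
    state_ids = {start: 0}
    transitions = {}   # (state, character) -> next state
    accepting = set()
    queue = deque([start])
    while queue:
        language = queue.popleft()
        state = state_ids[language]
        if "" in language:
            # The empty suffix means a complete stopword has been read.
            accepting.add(state)
        # Group the remaining suffixes by their first character.
        by_char = {}
        for word in language:
            if word:
                by_char.setdefault(word[0], set()).add(word[1:])
        for ch in sorted(by_char):
            target = frozenset(by_char[ch])
            if target not in state_ids:
                state_ids[target] = len(state_ids)
                queue.append(target)
            transitions[(state, ch)] = state_ids[target]
    return transitions, accepting

def is_stopword(word, transitions, accepting):
    """Run the DFA over word; a missing transition rejects immediately."""
    state = 0
    for ch in word:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state in accepting
```

For L0 = {a, an, and, in, into, to} this produces seven states, with "and" and "into" ending in the same accepting state because both have only the empty suffix left.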
11
Questions and Comments
Thank you!!!