Published by Dominick Powell. Modified over 9 years ago.
1
Tokeniser Francisco Miguel Pérez Romero University of Sevilla
2
Roadmap Introduction Class Diagram Libraries Conclusions
3
Roadmap Introduction Class Diagram Libraries Conclusions
4
Web Wrapping Information retrieval Verifier Ontologiser Extractor Query Navigator FormFiller
5
Tokeniser Tokenisation Rules Configuration File Web Page Parser
6
Tokeniser Usage Web Page Classification Information Extraction Learners Information Extraction
7
Example Config File Token List Web Page Tokeniser XML File Token List
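The flow on this slide (rules from a configuration file, a web page in, a token list out) can be sketched roughly as follows. This is only an illustration built on java.util.regex, not the presenter's implementation: the class name TokeniserSketch, the rule names TAG, NUMBER and WORD, and their patterns are all hypothetical, and the rules are hard-coded here where a real configuration file would supply them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokeniserSketch {

    /** A token: the name of the rule that matched plus the matched text. */
    public static final class Token {
        public final String type;
        public final String text;

        Token(String type, String text) {
            this.type = type;
            this.text = text;
        }

        @Override
        public String toString() {
            return type + ":" + text;
        }
    }

    /**
     * Hypothetical rules standing in for a configuration file.
     * Order matters: the first rule that matches at the current position wins.
     */
    public static Map<String, Pattern> defaultRules() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("TAG", Pattern.compile("</?[A-Za-z][^>]*>"));
        rules.put("NUMBER", Pattern.compile("\\d+(?:\\.\\d+)?"));
        rules.put("WORD", Pattern.compile("\\p{L}+"));
        return rules;
    }

    /** Scans the page left to right, emitting one token per rule match. */
    public static List<Token> tokenise(String page, Map<String, Pattern> rules) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < page.length()) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(page);
                m.region(pos, page.length());
                // lookingAt() anchors the match at the start of the region.
                if (m.lookingAt() && m.end() > pos) {
                    tokens.add(new Token(rule.getKey(), m.group()));
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                pos++; // skip characters no rule covers (whitespace, punctuation)
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenise("<b>Price</b> 19.99", defaultRules())) {
            System.out.println(t);
        }
    }
}
```

Keeping the rules in an ordered map means the rule set can be swapped without touching the scanning loop, which is the role the configuration file plays on the slide.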
8
Concepts Configuration File Token Tokenisation types
9
Roadmap Introduction Class Diagram Libraries Conclusions
10
Example
11
Class Diagram: Tokenisation
12
Tokenisation Example
13
Class Diagram: Tokeniser
14
Roadmap Introduction Class Diagram Libraries Conclusions
15
Comparison Features 1
Comparison features:
Javadoc documentation?
Support UNICODE UTF-8
Support UNICODE UTF-16
Named Groups
Indexable Groups > 9
Negative Groups
Nested groups
Lazy quantifiers?
16
Comparison Features 2
Comparison features:
Fuzzy matching?
Support POSIX?
Support Ignore Case?
Support New Line Option?
Use State Machine?
Support accent?
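A few of the compared features can be illustrated with java.util.regex, one of the libraries in the comparison. This is a sketch against the current JDK API, which may differ from the version evaluated in the slides (named groups, for instance, were added in Java 7). The class name RegexFeatureSketch and its helper methods are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexFeatureSketch {

    /** Named group plus ignore-case: pull the href value out of an anchor tag. */
    public static String extractHref(String html) {
        Pattern p = Pattern.compile("<a\\s+href=\"(?<url>[^\"]*)\"",
                                    Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        return m.find() ? m.group("url") : null;
    }

    /** Lazy versus greedy quantifier on the same input. */
    public static String firstTag(String html, boolean lazy) {
        Pattern p = Pattern.compile(lazy ? "<.+?>" : "<.+>");
        Matcher m = p.matcher(html);
        return m.find() ? m.group() : null;
    }

    /** New-line option: MULTILINE lets ^ match at the start of every line. */
    public static int countLinesStartingWith(String text, String prefix) {
        Matcher m = Pattern.compile("^" + Pattern.quote(prefix), Pattern.MULTILINE)
                           .matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(extractHref("<A HREF=\"http://example.org/\">x</A>"));
        System.out.println(firstTag("<b>bold</b>", true));
        System.out.println(firstTag("<b>bold</b>", false));
        System.out.println(countLinesStartingWith("ab\ncd\nab", "ab"));
    }
}
```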
17
Libraries Table 1
18
Libraries Table 2
19
Libraries Table 3
20
Benchmark 1 Regular Expression List String List Matching all one another Time in ms
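The benchmark described on this slide (every regular expression matched against every string, repeated for a number of iterations, timed in ms) might look roughly like the harness below. It is a sketch for java.util.regex only. The patterns and strings are made-up placeholders, not the lists used in the presentation, and the class name RegexBenchmarkSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class RegexBenchmarkSketch {

    /** Compiles each expression once, outside the timed loop. */
    public static List<Pattern> compileAll(List<String> expressions) {
        List<Pattern> patterns = new ArrayList<>();
        for (String expression : expressions) {
            patterns.add(Pattern.compile(expression));
        }
        return patterns;
    }

    /** Matches every pattern against every input; returns the number of full matches. */
    public static long run(List<Pattern> patterns, List<String> inputs, int iterations) {
        long matches = 0;
        for (int i = 0; i < iterations; i++) {
            for (Pattern pattern : patterns) {
                for (String input : inputs) {
                    if (pattern.matcher(input).matches()) {
                        matches++;
                    }
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Placeholder data, assumed for illustration only.
        List<Pattern> patterns = compileAll(
            Arrays.asList("\\d+", "[a-z]+", "\\w+@\\w+\\.\\w+"));
        List<String> inputs = Arrays.asList(
            "12345", "hello", "user@example.com", "hello world");

        for (int iterations : new int[] {10000, 20000, 50000}) {
            long start = System.nanoTime();
            long matches = run(patterns, inputs, iterations);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(iterations + " iterations: " + elapsedMs
                + " ms (" + matches + " matches)");
        }
    }
}
```

Returning the match count gives the loop an observable result, which makes it harder for the JIT compiler to optimise the work away. Early runs also include JVM warm-up, so timings from a harness like this are only indicative.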
21
Benchmark 1: 10000 Iterations
org.apache -> 7078 ms
com.stevesoft -> 19782 ms
kmy.regex -> 781 ms
java.util -> 1266 ms
jregex.Pattern -> 1000 ms
org.apache.oro -> 2156 ms
dk.brics.automaton -> 265 ms
com.karneim.util.collection -> 407 ms
22
Benchmark 1: 20000 Iterations
org.apache -> 11796 ms
com.stevesoft -> 26641 ms
kmy.regex -> 906 ms
java.util -> 1891 ms
jregex.Pattern -> 1422 ms
org.apache.oro -> 3375 ms
dk.brics.automaton -> 312 ms
com.karneim.util.collection -> 610 ms
23
Benchmark 1: 50000 Iterations
org.apache -> 28656 ms
com.stevesoft -> 63297 ms
kmy.regex -> 1781 ms
java.util -> 4281 ms
jregex.Pattern -> 3219 ms
org.apache.oro -> 7641 ms
dk.brics.automaton -> 531 ms
com.karneim.util.collection -> 1312 ms
24
Diagram
25
Benchmark 2 Source Code Matching tags
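Matching tags in a page's source code could be timed along these lines. This is a sketch with java.util.regex. The tag pattern and the inline sample page are assumptions, since the slides do not show the expression or the pages' source. Elapsed time is taken with System.nanoTime() because a millisecond clock can round a short run down to 0.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagMatchSketch {

    // Assumed pattern for opening and closing tags; not taken from the slides.
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][^>]*>");

    /** Counts every tag found in the given page source. */
    public static int countTags(String source) {
        Matcher m = TAG.matcher(source);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for a downloaded page.
        String source = "<html><body><p>Hi</p></body></html>";

        long start = System.nanoTime();
        int tags = countTags(source);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;

        System.out.println(tags + " tags in " + elapsedMicros + " us");
    }
}
```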
26
Benchmark 2: Amazon
org.apache -> 218 ms
com.stevesoft -> 63 ms
kmy.regex -> 94 ms
java.util -> 0 ms
jregex.Pattern -> 93 ms
org.apache.oro -> 32 ms
dk.brics.automaton -> 0 ms
com.karneim.util.collection -> 47 ms
27
Benchmark 2: Marca
org.apache -> 62 ms
com.stevesoft -> 47 ms
kmy.regex -> 93 ms
java.util -> 0 ms
jregex.Pattern -> 94 ms
org.apache.oro -> 16 ms
dk.brics.automaton -> 0 ms
com.karneim.util.collection -> 62 ms
28
Benchmark 2: Ebay
org.apache -> 31 ms
com.stevesoft -> 125 ms
kmy.regex -> 266 ms
java.util -> 0 ms
jregex.Pattern -> 156 ms
org.apache.oro -> 47 ms
dk.brics.automaton -> 0 ms
com.karneim.util.collection -> 172 ms
29
Diagram
30
To sum up…
dk.brics.automaton is the fastest
dk.brics and com.karneim fail with URLs
Remaining candidates: kmy.regex or java.util
31
Roadmap Introduction Class Diagram Libraries Conclusions
32
Tokenisation test Searching information A real project Experience
33
Thanks!