Tokeniser Francisco Miguel Pérez Romero University of Sevilla
Roadmap Introduction Class Diagram Libraries Conclusions
Web Wrapping Information retrieval Verifier Ontologiser Extractor Query Navigator FormFiller
Tokeniser Tokenisation Rules Configuration File Web Page Parser
Tokeniser Usage Web Page Classification Information Extraction Learners Information Extraction
Example Config File Token List Web Page Tokeniser XML File Token List
Concepts Configuration File Token Tokenisation types
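The idea of tokenisation rules applied to a web page can be sketched as follows. This is a minimal illustration built on java.util.regex, not the tokeniser presented in the talk: the class name TokeniserSketch, the Token type, and the three rules (TAG, NUMBER, WORD) are hypothetical, and the rules are hard-coded here where the real tool reads them from a configuration file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokeniserSketch {

    /** A token: the name of the rule that matched plus the matched text. */
    static final class Token {
        final String type;
        final String text;

        Token(String type, String text) {
            this.type = type;
            this.text = text;
        }

        @Override
        public String toString() {
            return type + "(" + text + ")";
        }
    }

    /** Hypothetical rules (assumption); insertion order gives rule priority. */
    static Map<String, String> defaultRules() {
        Map<String, String> rules = new LinkedHashMap<String, String>();
        rules.put("TAG", "</?[A-Za-z][^>]*>");
        rules.put("NUMBER", "\\d+(?:\\.\\d+)?");
        rules.put("WORD", "\\p{L}+");
        return rules;
    }

    /** Scans the page left to right; text that no rule matches is skipped. */
    static List<Token> tokenise(String page, Map<String, String> rules) {
        // Combine all rules into one alternation with a named group per rule.
        StringBuilder regex = new StringBuilder();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            if (regex.length() > 0) {
                regex.append('|');
            }
            regex.append("(?<").append(rule.getKey()).append('>')
                 .append(rule.getValue()).append(')');
        }
        Matcher m = Pattern.compile(regex.toString()).matcher(page);

        List<Token> tokens = new ArrayList<Token>();
        while (m.find()) {
            // The first named group that participated identifies the rule.
            for (String name : rules.keySet()) {
                if (m.group(name) != null) {
                    tokens.add(new Token(name, m.group()));
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenise("<b>Price: 12.50</b>", defaultRules())) {
            System.out.println(t);
        }
    }
}
```

Rule names double as regex group names, so in this sketch they must consist of letters and digits only.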
Example
Class Diagram: Tokenisation
Tokenisation Example
Class Diagram: Tokeniser
Comparison Features 1
Comparison Features:
  Javadoc documentation?
  Support UNICODE UTF-8
  Support UNICODE UTF-16
  Named Groups
  Indexable Groups > 9
  Negative Groups
  Nested groups
  Lazy quantifiers?
Comparison Features 2
Comparison Features:
  Fuzzy matching?
  Support POSIX?
  Support Ignore Case?
  Support New Line Option?
  Use State Machine?
  Support accents?
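A few of the features in this checklist can be illustrated with java.util.regex, one of the libraries compared. The sketch below is only an illustration: the class and method names are hypothetical, the sample patterns are assumptions, and other libraries in the comparison may use different syntax or lack the feature altogether.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFeaturesSketch {

    /** Named groups: read a capture by name instead of by index. */
    static String yearOf(String date) {
        Matcher m = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})").matcher(date);
        return m.find() ? m.group("year") : null;
    }

    /** Lazy vs greedy quantifiers: ".+?" stops at the first possible '>'. */
    static String firstTag(String html, boolean lazy) {
        Matcher m = Pattern.compile(lazy ? "<.+?>" : "<.+>").matcher(html);
        return m.find() ? m.group() : null;
    }

    /** Ignore case on accented letters needs UNICODE_CASE as well. */
    static boolean matchesSurname(String s) {
        // "p\u00e9rez" is "perez" with an acute accent on the first 'e'.
        Pattern p = Pattern.compile("p\u00e9rez",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(yearOf("2008-05"));
        System.out.println(firstTag("<b>bold</b>", true));
        System.out.println(firstTag("<b>bold</b>", false));
        System.out.println(matchesSurname("P\u00c9REZ"));
    }
}
```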
Libraries Table 1
Libraries Table 2
Libraries Table 3
Benchmark 1 Regular Expression List String List Matching all one another Time in ms
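The shape of this benchmark (every regular expression matched against every string, repeated for a number of iterations, with the elapsed time reported in ms) can be sketched as below. This is an assumption-laden illustration using java.util.regex only: the class name, the sample patterns and strings, and the iteration count are all hypothetical, and the naive timing has no JIT warm-up, so its figures are not comparable with those reported in the talk.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RegexBenchmarkSketch {

    /** Matches every pattern against every input, repeated `iterations` times. */
    static long countMatches(List<Pattern> patterns, List<String> inputs, int iterations) {
        long matches = 0;
        for (int i = 0; i < iterations; i++) {
            for (Pattern p : patterns) {
                for (String s : inputs) {
                    if (p.matcher(s).matches()) {
                        matches++;
                    }
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical sample data; patterns are compiled once, outside the timed loop.
        List<Pattern> patterns = Arrays.asList(
                Pattern.compile("[a-z]+"),
                Pattern.compile("\\d{4}"));
        List<String> inputs = Arrays.asList("tokeniser", "2008", "abc123");

        long start = System.nanoTime();
        long matches = countMatches(patterns, inputs, 100000);
        long elapsedMs = (System.nanoTime() - start) / 1000000;

        System.out.println(matches + " matches in " + elapsedMs + " ms");
    }
}
```

Benchmarking another library would mean swapping the Pattern type and the match call for that library's equivalents while keeping the loops unchanged.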
Benchmark 1: Iterations
  org.apache -> 7078 ms
  com.stevesoft -> ms
  kmy.regex -> 781 ms
  java.util -> 1266 ms
  jregex.Pattern -> 1000 ms
  org.apache.oro -> 2156 ms
  dk.brics.automaton -> 265 ms
  com.karneim.util.collection -> 407 ms
Benchmark 1: Iterations
  org.apache -> ms
  com.stevesoft -> ms
  kmy.regex -> 906 ms
  java.util -> 1891 ms
  jregex.Pattern -> 1422 ms
  org.apache.oro -> 3375 ms
  dk.brics.automaton -> 312 ms
  com.karneim.util.collection -> 610 ms
Benchmark 1: Iterations
  org.apache -> ms
  com.stevesoft -> ms
  kmy.regex -> 1781 ms
  java.util -> 4281 ms
  jregex.Pattern -> 3219 ms
  org.apache.oro -> 7641 ms
  dk.brics.automaton -> 531 ms
  com.karneim.util.collection -> 1312 ms
Diagram
Benchmark 2 Source Code Matching tags
Benchmark 2: Amazon
  org.apache -> 218 ms
  com.stevesoft -> 63 ms
  kmy.regex -> 94 ms
  java.util -> 0 ms
  jregex.Pattern -> 93 ms
  org.apache.oro -> 32 ms
  dk.brics.automaton -> 0 ms
  com.karneim.util.collection -> 47 ms
Benchmark 2: Marca
  org.apache -> 62 ms
  com.stevesoft -> 47 ms
  kmy.regex -> 93 ms
  java.util -> 0 ms
  jregex.Pattern -> 94 ms
  org.apache.oro -> 16 ms
  dk.brics.automaton -> 0 ms
  com.karneim.util.collection -> 62 ms
Benchmark 2: Ebay
  org.apache -> 31 ms
  com.stevesoft -> 125 ms
  kmy.regex -> 266 ms
  java.util -> 0 ms
  jregex.Pattern -> 156 ms
  org.apache.oro -> 47 ms
  dk.brics.automaton -> 0 ms
  com.karneim.util.collection -> 172 ms
Diagram
To sum up…
  dk.brics.automaton is the fastest
  dk.brics and com.karneim fail with URLs
  Alternatives: kmy.regex or java.util
Tokenisation test Searching information A real project Experience
Thanks!