Published by Colleen Webster. Modified over 9 years ago.
1
Science: Text and Language
Dr Andy Evans
2
Text analysis
Processing of text.
Natural language processing and statistics.
3
Processing text: Regex
Java regular expressions: java.util.regex
Regular expressions are powerful search, compare (and replace) tools.
(Other regex dialects include direct replace syntax; in Java regex, replacement is done through separate methods.)
4
Regex
Standard Java:
    if ((email.indexOf("@") > 0) && (email.endsWith(".org"))) { return true; }
Regex version:
    if (email.matches("[A-Za-z]+@[A-Za-z]+\\.org")) return true;
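The two checks on this slide are not equivalent, and a small runnable sketch makes the difference visible. The class name, method names, and sample addresses below are illustrative assumptions, not part of the original slides; the pattern itself is the one shown above.

```java
import java.util.regex.Pattern;

class OrgEmailCheck {
    // Compiled once and reused: letters, '@', letters, then a literal ".org".
    private static final Pattern ORG_EMAIL =
            Pattern.compile("[A-Za-z]+@[A-Za-z]+\\.org");

    // Plain String version: only checks that '@' is not the first character
    // and that the text ends with ".org".
    static boolean plainCheck(String email) {
        return email.indexOf("@") > 0 && email.endsWith(".org");
    }

    // Regex version: matches() requires the WHOLE string to fit the pattern.
    static boolean regexCheck(String email) {
        return ORG_EMAIL.matcher(email).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "andy@example.org", "@example.org", "a@b@c.org", "andy@example.com"
        };
        for (String s : samples) {
            System.out.println(s + "  plain=" + plainCheck(s)
                    + "  regex=" + regexCheck(s));
        }
    }
}
```

The regex version is stricter: an address such as "a@b@c.org" passes the plain check but fails the pattern. Neither check is a full email validator; the pattern accepts letters only.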
5
Example components
[abc]           a, b, or c (simple class)
[^abc]          Any character except a, b, or c (negation)
[a-zA-Z]        a through z, or A through Z, inclusive (range)
[a-d[m-p]]      a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]    d, e, or f (intersection)
[a-z&&[^bc]]    a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]   a through z, and not m through p: [a-lq-z] (subtraction)
.               Any character (may or may not match line terminators)
\d              A digit: [0-9]
\D              A non-digit: [^0-9]
\s              A whitespace character: [ \t\n\x0B\f\r]
\S              A non-whitespace character: [^\s]
\w              A word character: [a-zA-Z_0-9]
\W              A non-word character: [^\w]
?               Once or not at all
*               Zero or more times
+               One or more times
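A few of these components can be tried directly with String.matches(), which tests whether the whole input fits the pattern. The helper name and the sample inputs are assumptions chosen for illustration.

```java
class CharClassDemo {
    // Hypothetical helper: true if the WHOLE input fits the regex.
    static boolean fits(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // Subtraction: a through z, except b and c.
        System.out.println(fits("[a-z&&[^bc]]", "a"));
        System.out.println(fits("[a-z&&[^bc]]", "b"));
        // Union: a through d, or m through p.
        System.out.println(fits("[a-d[m-p]]", "n"));
        // Predefined classes combined with quantifiers.
        System.out.println(fits("\\d+", "2024"));
        System.out.println(fits("\\w+", "snake_case9"));
        System.out.println(fits("\\w+", "two words"));
        // '?' makes the preceding 'u' optional.
        System.out.println(fits("colou?r", "color"));
        System.out.println(fits("colou?r", "colour"));
    }
}
```

Note that in Java source the backslash must be doubled, so the component \d is written "\\d" inside a string literal.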
6
Matching
Find all words that start with a digit.
(\b is a word boundary, \d a digit, \w* the rest of the word.)
    Pattern p = Pattern.compile("\\b\\d\\w*");
    Matcher m = p.matcher(stringToSearch);
    while (m.find()) {
        String temp = m.group();
        System.out.println(temp);
    }
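The find() loop can be wrapped into a small self-contained program. This is a minimal sketch: the class name, the sample sentence, and the pattern \b\d\w* (one reasonable reading of "words that start with a number") are assumptions, not taken from the original slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitWordFinder {
    // Word boundary, one digit, then any further word characters.
    private static final Pattern DIGIT_WORD = Pattern.compile("\\b\\d\\w*");

    // Collects every match in order; find() advances through the text.
    static List<String> find(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = DIGIT_WORD.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "The 3rd floor has 42 rooms, including room a7b.";
        for (String word : find(text)) {
            System.out.println(word);
        }
    }
}
```

Unlike matches(), find() looks for the pattern anywhere in the input, so it suits scanning longer text. The word "a7b" is not reported because there is no word boundary immediately before the 7.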
7
Replacing
Methods on String:
replaceFirst(String regex, String replacement)
replaceAll(String regex, String replacement)
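A short sketch of both methods as called on a String. The helper names and sample text are illustrative assumptions; the "$2 $1" replacement uses Java's standard group references.

```java
class ReplaceDemo {
    // replaceAll: every run of whitespace becomes a single space.
    static String collapseWhitespace(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // replaceFirst: only the first run of digits is replaced.
    static String maskFirstNumber(String text) {
        return text.replaceFirst("\\d+", "#");
    }

    // Captured groups can be reused in the replacement as $1, $2, ...
    static String swapNameOrder(String text) {
        return text.replaceAll("(\\w+), (\\w+)", "$2 $1");
    }

    public static void main(String[] args) {
        System.out.println(collapseWhitespace("  a \t b\n\nc "));
        System.out.println(maskFirstNumber("room 12, floor 3"));
        System.out.println(swapNameOrder("Evans, Andy"));
    }
}
```

Because the first argument is always treated as a regex, literal characters such as "." or "$" need escaping; Pattern.quote() and Matcher.quoteReplacement() exist for that purpose.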
8
Regex
A good start is the tutorial at: http://docs.oracle.com/javase/tutorial/essential/regex/
Also Mehran Habibi's Java Regular Expressions.
9
Natural Language Processing
A large part is Part of Speech (POS) tagging: marking up text as nouns, verbs, etc., usually based on the location in the text and other context rules.
These rules are often formulated using machine learning (of various kinds), training the program on corpora of marked-up text.
Used for:
Text understanding.
Knowledge capture and use.
Text forensics.
10
NLP Libraries
Popular libraries include:
Natural Language Toolkit (NLTK; Python) http://www.nltk.org/
OpenNLP (Java) http://opennlp.apache.org/index.html
11
OpenNLP
Sentence recognition and tokenising.
Name extraction (including placenames).
POS tagging.
Text classification.
For clear examples, see the manual at: http://opennlp.apache.org/documentation.html
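To get a feel for sentence recognition and tokenising without installing anything, the sketch below uses the JDK's own java.text.BreakIterator. This is a stand-in and not the OpenNLP API: OpenNLP uses trained models and will generally cope better with abbreviations and unusual punctuation. The class name and sample text are assumptions.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SimpleSegmenter {
    // Splits text at the sentence boundaries reported by the JDK.
    static List<String> sentences(String text) {
        return segments(BreakIterator.getSentenceInstance(Locale.ENGLISH), text, false);
    }

    // Splits text into words, dropping spaces and punctuation.
    static List<String> words(String text) {
        return segments(BreakIterator.getWordInstance(Locale.ENGLISH), text, true);
    }

    private static List<String> segments(BreakIterator it, String text, boolean wordsOnly) {
        List<String> out = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end).trim();
            if (piece.isEmpty()) continue;
            // For word mode, keep only pieces that begin with a letter or digit.
            if (wordsOnly && !Character.isLetterOrDigit(piece.charAt(0))) continue;
            out.add(piece);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Leeds is in Yorkshire. It rains often. Bring a coat!";
        for (String s : sentences(text)) {
            System.out.println(s + " -> " + words(s));
        }
    }
}
```

For real NLP work, the OpenNLP manual linked above describes its sentence detector and tokenizer, which are driven by downloadable models.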
12
Other info
Other than the Numerical Recipes books, the other classic texts are Donald E. Knuth's The Art of Computer Programming:
Fundamental Algorithms
Seminumerical Algorithms
Sorting and Searching
Combinatorial Algorithms
But at this stage, you're better off getting…
13
Other info
Michael T. Goodrich and Roberto Tamassia's Data Structures and Algorithms in Java.
Basic Java, arrays and lists.
Recursion in algorithms.
Key mathematical algorithms.
Algorithm analysis.
Data storage structures (stacks, queues, hashtables, binary trees, etc.).
Search and sort.
Text processing.
Graph/network analysis.
Memory management.