590 Scraping – NER shape features


1 590 Scraping – NER shape features
Topics
Chapter 13 - Testing
Readings: Text – chapter 13
April 4, 2017

2 Today
Scrapers from scrapy_documentation
  loggingSpider.py
  openAllLinks.py
Cleaning NLTK data
  Removing common words
Testing in Python
  unittest
Testing websites

3 Rest of the semester
Tuesday April 4 – testing
Thursday April 6 – cleaning data
Tuesday April 11 – images, "CAPTCHA"s
Thursday April 13 – Test 2
Tuesday April – Mining Social Networks
Thursday April 20 – Ethical and legal issues
Tuesday April 25 – Reading Day
Tuesday May 2 – 9:00 a.m. EXAM

4 Google (morphology in nltk python)
Morphological analyzer
Morphological analysis may be defined as the process of obtaining grammatical information from tokens, given their suffix information.
Morphological analysis can be performed in three ways:
  morpheme-based morphology (an item and arrangement approach)
  lexeme-based morphology (an item and process approach)
  word-based morphology (a word and paradigm approach)
A morphological analyzer is a program that analyzes the morphology of a given input token and outputs morphological information such as gender, number, class, and so on.
In order to perform morphological analysis on a given non-whitespace token, the pyEnchant dictionary is ...
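A minimal sketch of the "grammatical information from suffixes" idea described above. The rule table and the analyze() function are hypothetical, written only for illustration; they are not part of NLTK or pyEnchant, and a real analyzer needs a lexicon and handling for irregular forms.

```python
# Hypothetical suffix rules for illustration only; longer suffixes come first
# so that "ies" is tried before "s".
SUFFIX_RULES = [
    ("ing", {"pos": "verb", "form": "present participle"}),
    ("ed", {"pos": "verb", "tense": "past"}),
    ("ies", {"pos": "noun", "number": "plural"}),
    ("s", {"pos": "noun", "number": "plural"}),
]


def analyze(token):
    """Guess grammatical information from the token's suffix."""
    lowered = token.lower()
    for suffix, info in SUFFIX_RULES:
        # Require at least two characters of stem before the suffix.
        if lowered.endswith(suffix) and len(lowered) > len(suffix) + 1:
            return dict(info, suffix=suffix)
    return {"pos": "unknown"}


for tok in ["walking", "scraped", "spiders", "web"]:
    print(tok, analyze(tok))
```

Suffix rules alone misfire often (for example, "bus" would be tagged as a plural noun), which is why practical analyzers check candidate stems against a dictionary.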

5 Google (Word shape in NER) in nltk python
sklearn-crfsuite

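6 Shape Feature
import re

def shape(word):
    # Note: only the second alternative is anchored with $, so a token
    # such as '123abc' also matches here and is labeled 'number'.
    if re.match(r'[0-9]+(\.[0-9]*)?|[0-9]*\.[0-9]+$', word, re.UNICODE):
        return 'number'
    elif re.match(r'\W+$', word, re.UNICODE):      # only non-word characters
        return 'punct'
    elif re.match(r'\w+$', word, re.UNICODE):      # only word characters
        if word.istitle():
            return 'upcase'
        elif word.islower():
            return 'downcase'
        else:
            return 'mixedcase'
    return 'other'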
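A sketch of how the shape feature might feed a sequence tagger. It repeats shape() so the block runs on its own; token_features() and its feature names are hypothetical, loosely following the dict-per-token input style that CRF toolkits such as sklearn-crfsuite accept.

```python
import re


def shape(word):
    # Same logic as the slide's shape() function.
    if re.match(r'[0-9]+(\.[0-9]*)?|[0-9]*\.[0-9]+$', word, re.UNICODE):
        return 'number'
    elif re.match(r'\W+$', word, re.UNICODE):
        return 'punct'
    elif re.match(r'\w+$', word, re.UNICODE):
        if word.istitle():
            return 'upcase'
        elif word.islower():
            return 'downcase'
        else:
            return 'mixedcase'
    return 'other'


def token_features(tokens, i):
    """Hypothetical feature extractor: one dict per token."""
    feats = {'word.lower': tokens[i].lower(), 'shape': shape(tokens[i])}
    if i > 0:
        feats['prev.shape'] = shape(tokens[i - 1])
    else:
        feats['BOS'] = True  # beginning of sentence
    return feats


tokens = ['Dr.', 'Smith', 'paid', '3.50', 'at', 'USC', '.']
shapes = [shape(t) for t in tokens]
print(shapes)
print([token_features(tokens, i) for i in range(len(tokens))])
```

In this run 'Dr.' comes out as 'other' and 'USC' as 'mixedcase', because the word pattern rejects the trailing period and istitle() is false for all-caps tokens.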

7 str.istitle()
s = "This Is String Example...Wow!!!"
print(s.istitle())    # True
s = "This is string example....wow!!!"
print(s.istitle())    # False
s = "Mr."
print(s.istitle())    # True
s = "M."
print(s.istitle())    # True

8 String functions
capitalize() - Capitalizes the first letter of the string.
center(width, fillchar) - Returns a space-padded string with the original string centered to a total of width columns.
count(str, beg=0, end=len(string)) - Counts how many times str occurs in the string, or in a substring if starting index beg and ending index end are given.
decode(encoding='UTF-8', errors='strict') - Decodes the string using the codec registered for encoding; encoding defaults to the default string encoding. (In Python 3 this is a method of bytes, not str.)
encode(encoding='UTF-8', errors='strict') - Returns the encoded version of the string; on error, the default is to raise a ValueError unless errors is given as 'ignore' or 'replace'.
endswith(suffix, beg=0, end=len(string)) - Determines whether the string (or a substring, if beg and end are given) ends with suffix; returns true if so and false otherwise.
expandtabs(tabsize=8) - Expands tabs in the string to multiple spaces; defaults to 8 spaces per tab if tabsize is not provided.
find(str, beg=0, end=len(string)) - Determines whether str occurs in the string (or in a substring, if beg and end are given); returns the index if found and -1 otherwise.
index(str, beg=0, end=len(string)) - Same as find(), but raises an exception if str is not found.
isalnum() - Returns true if the string has at least 1 character and all characters are alphanumeric, false otherwise.
isalpha() - Returns true if the string has at least 1 character and all characters are alphabetic, false otherwise.

9 String functions (continued)
isdigit() - Returns true if the string contains only digits, false otherwise.
islower() - Returns true if the string has at least 1 cased character and all cased characters are lowercase, false otherwise.
isnumeric() - Returns true if a unicode string contains only numeric characters, false otherwise.
isspace() - Returns true if the string contains only whitespace characters, false otherwise.
istitle() - Returns true if the string is properly "titlecased", false otherwise.
isupper() - Returns true if the string has at least one cased character and all cased characters are uppercase, false otherwise.
join(seq) - Merges (concatenates) the string representations of the elements in sequence seq into a string, with the string as separator.
len(string) - Returns the length of the string (a built-in function, not a method).
ljust(width[, fillchar]) - Returns a space-padded string with the original string left-justified to a total of width columns.
lower() - Converts all uppercase letters in the string to lowercase.
lstrip() - Removes all leading whitespace in the string.
maketrans() - Returns a translation table to be used in the translate function.
max(str) - Returns the max alphabetical character from the string str (a built-in function).
min(str) - Returns the min alphabetical character from the string str (a built-in function).
replace(old, new[, max]) - Replaces all occurrences of old in the string with new, or at most max occurrences if max is given.
rfind(str, beg=0, end=len(string)) - Same as find(), but searches backwards in the string.

10 String functions (continued)
rindex(str, beg=0, end=len(string)) - Same as index(), but searches backwards in the string.
rjust(width[, fillchar]) - Returns a space-padded string with the original string right-justified to a total of width columns.
rstrip() - Removes all trailing whitespace of the string.
split(str="", num=string.count(str)) - Splits the string according to delimiter str (whitespace if not provided) and returns a list of substrings; splits into at most num substrings if given.
splitlines(num=string.count('\n')) - Splits the string at all (or num) NEWLINEs and returns a list of each line with NEWLINEs removed.
startswith(str, beg=0, end=len(string)) - Determines whether the string (or a substring, if beg and end are given) starts with substring str; returns true if so and false otherwise.
strip([chars]) - Performs both lstrip() and rstrip() on the string.
swapcase() - Inverts case for all letters in the string.
title() - Returns the "titlecased" version of the string, that is, all words begin with uppercase and the rest are lowercase.
translate(table, deletechars="") - Translates the string according to translation table str (256 chars), removing those in the del string. (Python 2 signature; in Python 3, str.translate() takes only the table.)
upper() - Converts lowercase letters in the string to uppercase.
zfill(width) - Returns the original string left-padded with zeros to a total of width characters; intended for numbers, zfill() retains any sign given (less one zero).
isdecimal() - Returns true if a unicode string contains only decimal characters, false otherwise.
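A short sketch showing where the similar-sounding predicates in the list above disagree, plus a typical cleaning chain. The describe() helper and the sample strings are made up for illustration; every method called is a standard Python 3 str method.

```python
def describe(s):
    """Collect the results of several str predicates for one string."""
    return {
        'isdigit': s.isdigit(),
        'isdecimal': s.isdecimal(),
        'isnumeric': s.isnumeric(),
        'istitle': s.istitle(),
        'isupper': s.isupper(),
        'islower': s.islower(),
    }


# Superscript two and the one-half sign separate the three numeric tests.
for sample in ["42", "\u00b2", "\u00bd", "Mr.", "USC", "hello world"]:
    print(repr(sample), describe(sample))

# A typical cleaning chain: trim whitespace, lowercase, split on whitespace.
words = "  Hello, World!  ".strip().lower().split()
print(words)
print("-".join(words))
```

The three numeric tests nest: every isdecimal() string is isdigit(), and every isdigit() string is isnumeric(), but not the other way round.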

11 Modifications to shape
Capitalized abbreviation: Dr. Mr. Mrs. …
Initial: [A-Z].
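One possible way to add these two cases, sketched under assumptions: the function name shape2(), the labels 'abbrev' and 'initial', and the three-item abbreviation set are all hypothetical choices for illustration. Unlike the original shape(), this version also anchors the whole number pattern with $.

```python
import re

# Hypothetical abbreviation list; extend it for your own data.
ABBREVIATIONS = {'Dr.', 'Mr.', 'Mrs.'}


def shape2(word):
    """shape() with extra labels for capitalized abbreviations and initials."""
    if word in ABBREVIATIONS:
        return 'abbrev'
    if re.match(r'[A-Z]\.$', word):
        return 'initial'          # a single capital letter plus a period
    if re.match(r'([0-9]+(\.[0-9]*)?|[0-9]*\.[0-9]+)$', word):
        return 'number'
    if re.match(r'\W+$', word):
        return 'punct'
    if re.match(r'\w+$', word):
        if word.istitle():
            return 'upcase'
        if word.islower():
            return 'downcase'
        return 'mixedcase'
    return 'other'


for w in ['Dr.', 'J.', 'Smith', '3.50', '...', 'e.g.']:
    print(w, shape2(w))
```

The abbreviation and initial checks come first because those tokens mix word and non-word characters and would otherwise fall through to 'other'.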

12 Test 2 – inclass sample questions
x

13 Exam – Scraping project
Proposal statement (April 11) – one sentence description
Project description (April 18)
Demo (May 2)

14 Cleaning Natural Language data
Removing common words
Corpus of Contemporary English
In addition to this online interface, you can also download extensive data for offline use: full-text, word frequency, n-grams, and collocates data.
You can also access the data via WordAndPhrase (including the ability to analyze entire texts that you input).

15 Most common words in English
The first 25 words make up 1/3 of English text.
The first 100 make up 1/2.
common = ['the', 'be', …]
if isCommon(word) …
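A minimal sketch of the isCommon() filter outlined above. The ten words in COMMON are an assumed sample, not the full frequency list from the corpus, and remove_common() is a hypothetical helper name. A set is used instead of a list so that each membership test is fast.

```python
# Assumed sample of very frequent English words; substitute the full
# top-100 list from a frequency corpus for real use.
COMMON = {'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have', 'i'}


def isCommon(word):
    """Return True if the word (case-insensitive) is in the common-word set."""
    return word.lower() in COMMON


def remove_common(words):
    """Drop common words, keeping the remaining words in order."""
    return [w for w in words if not isCommon(w)]


text = "The spider has to crawl all of the links in a page"
print(remove_common(text.split()))
```

Because "has" is an inflected form, it survives a filter that lists only "have"; the list needs inflected forms or the text needs lemmatizing first.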

