10-K filing annual report word and document statistics

10-K filing annual report word and document statistics 9-10-2017 David Ling

Document statistics
- Downloaded the 10-K filings of S&P 500 companies filed from 2011-1-1 to 2017-1-1.
- 1 filing per year, so 6 reports per company (some have fewer because they joined recently).
- Item 7 is extracted with a regexp.
- Each extracted item is stored as a separate file.
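The slides do not show the regexp itself, so the following is only a minimal sketch of how an Item 7 extraction could look. The pattern `ITEM7_RE` and the function `extract_item7` are hypothetical names, and the rule of keeping the longest candidate (to skip the short table-of-contents match) is an assumption, not the method used in the presentation.

```python
import re

# Hypothetical pattern: everything between an "Item 7." heading and the next
# "Item 8." heading. "Item 7A." does not start a match, so its text stays
# inside the captured section.
ITEM7_RE = re.compile(
    r"item\s*7\s*[.:](.*?)item\s*8\s*[.:]",
    re.IGNORECASE | re.DOTALL,
)


def extract_item7(filing_text):
    """Return the longest Item 7 candidate, or None if the pattern is not found."""
    candidates = [m.group(1).strip() for m in ITEM7_RE.finditer(filing_text)]
    if not candidates:
        return None
    # The table of contents usually yields a very short match; keep the longest.
    return max(candidates, key=len)


sample = (
    "Item 7. MD&A 12 Item 8. Financial Statements 30 "
    "Item 7. Management's Discussion and Analysis. Revenue increased. "
    "Item 7A. Market risk. Item 8. Financial Statements"
)
print(extract_item7(sample))
```

A filing where the headings are missing or worded differently returns None, which corresponds to the "regexp cannot be found" case.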

A document with fewer than 4000 words may be considered a failed extraction. Possible causes:
- Incomplete extraction (only part of the item is extracted)
- The item refers to content somewhere else
- The regexp finds no match

Extracted document statistics
- Total documents: 2859
- Documents with words > 4000 (valid extraction): 2459
- Companies with valid extraction for the recent 3 years: 409
- Companies with valid extraction for the recent 6 years: 369
- We can rank those 409 companies.

Extracted number of words for some companies:

CIK      2016   2015   2014   2013   2012   2011
93751    40711  41958  41740  31540  28126  27087
9389     8953   7578   7397   7615   7877   8162
940944   89     89     89     89     89     89
943819   7202   6636   6653   6714   6688   6712
96021    18994  18870  22269  19989  18672  19268
97476    4661   5477   80     69     69     69
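The validity rule can be sketched in a few lines using the six example rows from the slide. The names `MIN_WORDS`, `word_counts` and `companies_with_valid_years` are illustrative, and treating a count of exactly 4000 as invalid is an assumption, since the slides only say "< 4000" and "> 4000".

```python
MIN_WORDS = 4000  # threshold from the slides; exactly 4000 is treated as invalid here

# Word counts per company (CIK), ordered 2016, 2015, 2014, 2013, 2012, 2011.
word_counts = {
    "93751": [40711, 41958, 41740, 31540, 28126, 27087],
    "9389": [8953, 7578, 7397, 7615, 7877, 8162],
    "940944": [89, 89, 89, 89, 89, 89],
    "943819": [7202, 6636, 6653, 6714, 6688, 6712],
    "96021": [18994, 18870, 22269, 19989, 18672, 19268],
    "97476": [4661, 5477, 80, 69, 69, 69],
}


def companies_with_valid_years(counts, recent_years):
    """Return the CIKs whose most recent `recent_years` filings all exceed MIN_WORDS."""
    return [
        cik
        for cik, per_year in counts.items()
        if all(n > MIN_WORDS for n in per_year[:recent_years])
    ]


print(companies_with_valid_years(word_counts, 3))
print(companies_with_valid_years(word_counts, 6))
```

For these six rows both calls return the same four companies: 940944 fails in every year, and 97476 fails because its 2014 extraction has only 80 words.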

Top 50 frequent words among valid extracted documents
- 59290 distinct words in the valid extracted documents.
- Stemming and lemmatization were not applied (e.g. cat and cats, play and played, company and company’s are distinct).
- They are also distinct in the downloaded GloVe data.
[Figure: top 50 frequent words; axis: frequency in valid extracted documents]

Frequency percentile
- About 10% of words appear only once.
- The total frequency is heavily dominated by the most frequent 1% of words.
[Figure: axes "Percentage (59290 words)" and "Frequency in valid extracted"]

Percentile   Frequency in valid extracted
(missing)    1
20           2
40           4
60           11
80           51
95           781
96           1148
97           1954
98           3617
99           9369
99.5         20254
100          2061086
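A percentile table of this kind can be computed from the list of per-word frequencies. The sketch below uses the nearest-rank definition of a percentile, which is an assumption (the slides do not say which definition was used), and the toy frequency list is invented for illustration only.

```python
import math


def frequency_percentile(frequencies, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(frequencies)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def top_share(frequencies, fraction):
    """Share of all occurrences contributed by the most frequent `fraction` of words."""
    ordered = sorted(frequencies, reverse=True)
    k = max(1, math.ceil(fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)


# Toy data, not taken from the filings.
freqs = [1, 1, 2, 3, 5, 8, 13, 21, 34, 1000]
for p in (20, 50, 100):
    print(p, frequency_percentile(freqs, p))
print(round(top_share(freqs, 0.1), 3))
```

With the real frequency counter, `top_share(term_freq.values(), 0.01)` would quantify how strongly the top 1% of words dominate.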

Some selected uncommon words

Rank, word, freq., doc freq.
58783, lncome, 1, 1
58784, quality., 1, 1
58785, 2.53x, 1, 1
58786, amrisc, 1, 1
58787, 1.85x, 1, 1
58788, 2.09x, 1, 1
58789, 1.36x, 1, 1
58790, mid-fifties, 1, 1
58951, padding-bottom, 1, 1
58952, post-january, 1, 1
58953, disappear, 1, 1
58954, low-point, 1, 1
58955, -balance, 1, 1
58956, earnings.we, 1, 1
58957, non- deductible.our, 1, 1
58958, decemberr, 1, 1

Some are due to:
- Numbers without spaces
- A full stop not followed by a capital letter (‘…quality. table of …’)
- Missing space (highlighted in blue on the slide)
- Hyphens
- Misspellings

As they appear infrequently, we may simply ignore them, or regard them as noise at this stage.
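If these tokens were to be flagged instead of ignored, a few of the listed causes can be detected with simple heuristics. The function `noise_reasons` and its rules are hypothetical and only illustrate the idea; misspellings such as "lncome" or "decemberr" cannot be caught this way and would need a dictionary lookup.

```python
import re


def noise_reasons(token):
    """Return a list of heuristic reasons why a token looks like extraction noise."""
    reasons = []
    if re.search(r"\d", token):
        reasons.append("contains digits")
    if re.search(r"[a-z]\.[a-z]", token):
        reasons.append("missing space after full stop")
    if token.endswith("."):
        reasons.append("trailing full stop")
    if "-" in token:
        reasons.append("hyphen")
    return reasons


for word in ["2.53x", "quality.", "earnings.we", "mid-fifties", "disappear"]:
    print(word, noise_reasons(word))
```

An empty list means that none of the heuristics fired, as for "disappear", which is simply a rare word.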

Discussions

Next step: term weighting and stop words.

Filtering stop words with a stop word list from the internet (Bill McDonald). Examples:
A ABOUT ABOVE ACROSS AFOREMENTIONED AFORESAID AFTER AFTERWARDS AGAIN AGAINST ALL ALMOST ALONE ALONG ALREADY ALSO ALTHOUGH ALWAYS AMONG AMONGST AN AND ANOTHER ANY ANYHOW ANYONE ANYTHING ANYWHERE ARE AROUND AS AT BE BECAME BECAUSE

Filtering stop words by inverse document frequency:
Idf = log(1 / document frequency)
Because the documents are long, idf cannot distinguish meaningful frequent words from stop words. E.g. both ‘the’ and ‘income’ appear in all documents (same idf), but ‘income’ is much more meaningful than ‘the’.
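The idf limitation can be reproduced on a toy corpus. In this sketch, document frequency is taken to be the fraction of documents containing the word, so log(1 / document frequency) equals log(N / n); that reading of the formula is an assumption. The three toy documents and the short `STOP_WORDS` set (a few entries from the examples above) are for illustration only.

```python
import math

# A few entries from the example stop word list, lowercased.
STOP_WORDS = {"a", "about", "above", "all", "and", "are", "as", "at", "be"}


def idf(term, documents):
    """log(1 / document frequency), with document frequency as a fraction of documents."""
    containing = sum(1 for tokens in documents if term in tokens)
    if containing == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(len(documents) / containing)


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in stop_words]


# Toy corpus: each document is the set of words it contains.
documents = [
    {"the", "income", "rose"},
    {"the", "income", "fell"},
    {"the", "income", "tax"},
]
print(idf("the", documents), idf("income", documents), idf("tax", documents))
print(remove_stop_words(["all", "income", "and", "tax"]))
```

Both "the" and "income" get an idf of 0 because they occur in every document, so idf alone cannot separate them, whereas a stop word list removes only the words it names.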