1 Kap. 60 – Case: Proofreading How Information Technology Is Conquering the World: Workplace, Private Life, and Society Professor Kai A. Olsen, Universitetet.

Presentation transcript:

1 Chapter 60 – Case: Proofreading. How Information Technology Is Conquering the World: Workplace, Private Life, and Society. Professor Kai A. Olsen, University of Bergen and Molde University College

2 A semantic proofreading tool for all languages based on a text repository
Kai A. Olsen, Molde University College and Department of Informatics, University of Bergen, Norway
Bård Indredavik, Technical Manager, Oshaug Metall AS, Molde, Norway

Proofreading is important
- When we write in a foreign language
- If we are not proficient in our own language
- To find typos and other mistakes
- Errors can make the text unreadable and give a very bad impression:
  "I am a student of MSc Logistics and Supply Chain Management from Westminitser University, London. Last weel I had the presentation regarding Molde College University and I heart that you are the module leader of Management of value. I am wondering if you may write me back more about that module, because it not really clear for me? In particular, when I am considering to go foe the second semestr to Molde. I will be really approciate for it."

Manual proofreading
- When we are in doubt about an expression, we could ask a colleague who is proficient in the language
- However, we may not have anybody to ask, and it may be too much to ask somebody to proofread everything that we write
- Can we do it automatically?

Automatic language processing
- An important research area since the 1960s
- The results have been far from what many envisioned
- Natural languages seem to be too complex to be formalized (some argue that you have to be a human being to understand natural language)
- But, thanks to faster computers, we have workable spelling checkers, and studies of syntax have given us grammar checkers that handle at least some types of mistakes
- Still, there are clear limitations; e.g., the language tools in Office 2003 will not find these errors:
  "I have a red far"
  "A forest has many threes"
  "I live at London"
  "We had ice cream for desert"

For our student
- If she had used the spelling and grammar checker in Office, only a few of the mistakes would have been found.

Another approach
- Instead of asking another person to proofread, we could ask the whole world
- That is, use the Web as a text repository and compare our sentences to those of everyone else
- For example, by using Google (phrase: number of hits):
  "we live at the west coast": 0
  "we live on the west coast": 3,500,000
  "we live in the west coast": 5,960,000

Background paper (2004)
- Journal of the American Society for Information Science and Technology, Volume 55, Issue 11, September 2004

What if the alternatives are unknown?
- We can use a wildcard (*)
- Example: "we live * the west coast"
- Study the alternatives, and check the complete sentence with each candidate to get a frequency number
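The wildcard lookup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny `CORPUS` list stands in for the Web, and the function name `wildcard_candidates` is hypothetical, not part of the prototype described in these slides.

```python
import re
from collections import Counter

# Tiny stand-in corpus (assumption); the slides use the Web via Google instead.
CORPUS = [
    "we live on the west coast",
    "they live on the west coast of norway",
    "we live on the west coast and work in the city",
    "we live near the west coast",
]

def wildcard_candidates(pattern, corpus):
    """Count the words that fill the single '*' in pattern, over all corpus lines."""
    before, after = pattern.split("*")
    regex = re.compile(
        r"\b" + re.escape(before.strip())
        + r"\s+(\w+)\s+"
        + re.escape(after.strip()) + r"\b"
    )
    counts = Counter()
    for line in corpus:
        counts.update(regex.findall(line.lower()))
    return counts

candidates = wildcard_candidates("we live * the west coast", CORPUS)
# Each candidate word comes with the frequency of the complete sentence it yields.
print(candidates.most_common())
```

The most frequent filler is the natural first candidate to show to the writer; the frequency numbers play the role of the Google hit counts.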

A tedious process

Disadvantages
- A lot of work
- We have to know where we are in doubt
- It can be difficult to find all the alternatives
- But we can make a tool that does this job automatically

Prototype
Consists of:
1. A spider that collects text from the Web
2. An index builder that creates an index structure
3. An analyzing program that finds alternatives for each word in the user's sentence

Spider
- Starts with a list of seeds, e.g., links to Web sites of universities, newspapers, state organizations, etc.
- Retrieves text from these sites
- "Cleans" the text of formatting data
- Stores all links that are found (.html, .pdf and .doc) if these have not been encountered previously
- Follows .html links recursively (we have separate spiders to parse .pdf and .doc files)
- Stores the text in files, numbered consecutively
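The spider's control flow can be sketched as follows. Everything concrete here is an assumption made for illustration: the `FAKE_WEB` dictionary replaces real HTTP fetching, the URLs are invented, tag stripping is done with a crude regular expression, and a queue is used in place of recursion.

```python
import re
from collections import deque

# Hypothetical in-memory "Web": URL -> raw HTML. A real spider would fetch over HTTP.
FAKE_WEB = {
    "http://uni.example/": (
        '<html><body><h1>Welcome</h1>'
        '<a href="http://uni.example/about.html">About</a></body></html>'
    ),
    "http://uni.example/about.html": (
        '<p>We live on the west coast.</p>'
        '<a href="http://uni.example/">Home</a>'
        '<a href="http://uni.example/report.pdf">Report</a>'
    ),
}

LINK_RE = re.compile(r'href="([^"]+)"')
TAG_RE = re.compile(r"<[^>]+>")

def crawl(seeds):
    """Visit pages reachable from the seeds; return cleaned texts and deferred links."""
    seen = set(seeds)
    queue = deque(seeds)
    texts = []     # stands in for the consecutively numbered text files
    deferred = []  # .pdf/.doc links, left for the separate spiders
    while queue:
        url = queue.popleft()
        html = FAKE_WEB.get(url)
        if html is None:
            continue
        # "Clean" the text of formatting data by dropping tags and extra spaces.
        texts.append(" ".join(TAG_RE.sub(" ", html).split()))
        for link in LINK_RE.findall(html):
            if link in seen:
                continue  # only store links not encountered previously
            seen.add(link)
            if link.endswith((".pdf", ".doc")):
                deferred.append(link)
            else:
                queue.append(link)
    return texts, deferred

texts, deferred = crawl(["http://uni.example/"])
print(texts, deferred)
```

The `seen` set is what keeps the spider from revisiting pages that link back to each other.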

Index builder
- Two index structures: Word → File and Word → Lines
- For each word we get the files that contain at least O occurrences of the word. If O is 1 all words are included, but we may use a higher value to avoid (at least some) misspelled words.
- For each file we have a list of all words in the file, each word giving the lines in the file where the word occurs
- All structures are represented as Boolean arrays stored as .txt files
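A minimal sketch of the two index structures is given below. It departs from the prototype in ways that should be read as assumptions: Python dictionaries and sets are used instead of Boolean arrays stored as .txt files, the `FILES` sample is invented, and the threshold O (here `min_occurrences`) is applied to a word's total frequency across all files.

```python
from collections import Counter, defaultdict

# Hypothetical numbered text files: file id -> lines of cleaned text.
FILES = {
    0: ["we live on the west coast", "the coast is long"],
    1: ["we had ice cream for dessert", "the desert is dry"],
}

def build_indexes(files, min_occurrences=1):
    """Return (word -> set of file ids, file id -> word -> line numbers)."""
    totals = Counter()
    file_word_lines = {}
    for file_id, lines in files.items():
        per_file = defaultdict(list)
        for line_no, line in enumerate(lines):
            for word in line.split():
                totals[word] += 1
                if line_no not in per_file[word]:
                    per_file[word].append(line_no)
        file_word_lines[file_id] = dict(per_file)

    # Drop rare words (threshold O) to filter out at least some misspellings.
    kept = {word for word, count in totals.items() if count >= min_occurrences}
    word_files = defaultdict(set)
    for file_id, words in list(file_word_lines.items()):
        filtered = {w: ls for w, ls in words.items() if w in kept}
        file_word_lines[file_id] = filtered
        for word in filtered:
            word_files[word].add(file_id)
    return dict(word_files), file_word_lines

word_files, file_word_lines = build_indexes(FILES)
frequent_files, _ = build_indexes(FILES, min_occurrences=2)
print(word_files["the"], file_word_lines[0]["coast"], sorted(frequent_files))
```

With the first structure the analyzer can pick only the files worth opening; with the second it can jump straight to the relevant lines within a file.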

In English
- 2.5 GB of text
- 2,500 files (1 MB each) for raw text
- 200,000 words (O=10, i.e., only words with a frequency of 10 or higher) and the same number of text files to show in which files each word occurs
- 43 million text files with line references (for each word in each file)
- No problem for Windows 7

In Norwegian
- 1 GB of text
- 10,000 files (0.1 MB each) for raw text
- 550,000 words (O=1, all words) and the same number of text files to show in which files each word occurs
- 42 million text files with line references (for each word in each file)

The analyzer
- Finds the frequency of the complete sentence (N words) offered by the user
- Parses the files where at least N-1 words of the sentence occur
- Replaces one word at a time with a wildcard
- Collects alternatives
- Checks the frequency of each alternative
- Calculates a confidence value based on the ratio of frequencies and the similarity between the original word and the alternative (Hirschberg's algorithm)
- Suggests improvements where the alternative sentence gets a higher score than the original
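The analyzer steps can be sketched as below. The slides do not give the confidence formula, so the weighted sum used here is purely an assumption, as are the `similarity_weight` parameter, the sample `CORPUS`, and the function name `suggest`. Word similarity is computed with `difflib.SequenceMatcher` from the Python standard library as a stand-in; it is not Hirschberg's algorithm.

```python
from collections import Counter
from difflib import SequenceMatcher

# Tiny stand-in repository (assumption): one lower-cased sentence per entry.
CORPUS = [
    "i live in london",
    "i live in london",
    "i live in london",
    "i live near london",
    "we live in london",
]

def score(freq, total, original, candidate, similarity_weight):
    """Hypothetical confidence: weighted mix of frequency ratio and word similarity."""
    ratio = freq / total
    similarity = SequenceMatcher(None, original, candidate).ratio()
    return (1 - similarity_weight) * ratio + similarity_weight * similarity

def suggest(sentence, corpus, similarity_weight=0.3):
    """Return (confidence, sentence) pairs that score higher than the original."""
    words = sentence.lower().split()
    n = len(words)
    token_lists = [line.split() for line in corpus]
    suggestions = []
    for pos, original in enumerate(words):
        # Replace the word at pos with a wildcard and collect what fills it.
        fillers = Counter()
        for tokens in token_lists:
            for i in range(len(tokens) - n + 1):
                window = tokens[i:i + n]
                if window[:pos] == words[:pos] and window[pos + 1:] == words[pos + 1:]:
                    fillers[window[pos]] += 1
        total = sum(fillers.values())
        if total == 0:
            continue
        baseline = score(fillers[original], total, original, original, similarity_weight)
        for candidate, freq in fillers.items():
            if candidate == original:
                continue
            confidence = score(freq, total, original, candidate, similarity_weight)
            if confidence > baseline:
                alternative = words[:pos] + [candidate] + words[pos + 1:]
                suggestions.append((confidence, " ".join(alternative)))
    return sorted(suggestions, reverse=True)

print(suggest("I live at London", CORPUS))
```

Scoring the original word with the same function gives a baseline, so a rare alternative such as "near" is not suggested merely because the original sentence is missing from the repository.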

Analyzer (example)
- "I live at London" changed to: "I live in London"

Analyzer (example 2)
- "We had ice cream for desert" changed to: "We had ice cream for dessert"

What kind of errors can be found
- Typos, as in: "I have a red far"
- Spelling, using the wrong word, e.g., mixing desert and dessert
- Grammar, using the wrong preposition, verb, etc., e.g., mixing in/at/on
- Facts: "Beethoven was born in 1970", corrected to …
- Punctuation
- That is, most types of mistakes that we make when writing

When the system fails
- Examples:
  "We eat avocado" may be corrected to "we eat apples"
  "Neptune is the outer planet in the solar system" may be corrected to "Pluto is the outer planet…"
  Date-specific data, as in the sentence "the prime minister of Great Britain is"
- In practice these failures will seldom be problematic, as they often concern an area where the user is competent; a learning system can also reduce some of these cases
- In addition, a system that takes dates into consideration should help

The prototype
- Is only a prototype:
  1 or 2.5 GB is not enough to get a wide range of sentences
  Collecting data from the Web gives a repository with many spelling and grammar errors (and a lot of repeated text)
  The system is too slow to handle many users
- Still, it can correct many types of mistakes, e.g., all the examples that we used in our 2004 paper

What we need
- In order to improve the text repository:
  A text quality checker that ignores text with too many errors
  Or, perhaps better, text repositories based on books, company reports, government reports, scientific papers, …
- In order to improve speed:
  A site with many thousands (millions) of simple computers (i.e., a "Google" setup)
  The task is ideal for parallel computing

Parallel computing: MapReduce
- Introduced by Dean and Ghemawat of Google
- Idea: algorithms that work in parallel on large data sets
- In our case:
  The map operation could be applied to each file, giving the frequency of each alternative sentence in that file (one computer can work on one file at a time)
  The reduce operation could combine these intermediate results to compute the final frequencies
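The map and reduce steps for this task might look like the sketch below. It runs on a single machine with Python's built-in `map` and `functools.reduce`, so it only shows the shape of the computation, not a distributed setup; the `FILES` contents and the function names are assumptions for illustration.

```python
from collections import Counter
from functools import reduce

# Hypothetical repository files (assumption), already cleaned and lower-cased.
FILES = [
    "we live on the west coast. we live in the west coast region.",
    "we live on the west coast of norway.",
    "they say we live on the west coast.",
]

ALTERNATIVES = [
    "we live at the west coast",
    "we live on the west coast",
    "we live in the west coast",
]

def map_file(text):
    """Map step: frequency of each alternative sentence within one file."""
    return Counter({alt: text.count(alt) for alt in ALTERNATIVES})

def reduce_counts(total, partial):
    """Reduce step: add one file's intermediate counts to the running totals."""
    total.update(partial)
    return total

# Each map_file call is independent, so files could be handled by different computers.
totals = reduce(reduce_counts, map(map_file, FILES), Counter())
print(totals)
```

Because the map step never looks at more than one file, adding computers speeds up the slow part of the job; only the small per-file counters have to be brought together.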

Discussion
- Do we want to write as the majority does? Yes, when we write in a foreign language, or when we are not very proficient writers.
- Can we leave everything to the proofing tool? No; as with other types of proofing tools, what we get is only a suggestion. What the tool really does is help the user apply reading competency when writing.
- Will the system find examples of all sentences? No.
- Why don't Google and others offer this tool? Perhaps because it would be very resource demanding (or because they are not smart enough).
- What about false positives? The system flagging expressions that are in fact correct may be a problem.

Conclusion
- With a multicomputer setup and a large repository, many mistakes can be indicated
- Works in any language that can be digitized
- Can be an offline or online tool (perhaps online will be achievable some time in the future?)
- We could have repositories that reflect style (academic, business, social…)?

Big data is becoming important
- To analyze buying patterns of customers
- Recommendation systems
- Traffic patterns for planning new flights or new roads (Norwegian to Molde)
- In science (meteorology, medicine, physics, astronomy…)
- In many areas

Data is available
- From the Web
- From user actions on the Web (keywords entered for searching, pages visited…)
- From automatic sensors, modern equipment (such as better telescopes), online activities, cameras…
- The computers and software are here to analyze the data

That is
- BIG DATA can be used to understand many complex processes
- Will become an important issue in the next ten years of computing