Published by Junior Parrish. Modified over 9 years ago.
1
HIGH-LEVEL TEXT ANALYSIS AND TECHNIQUES
Angela Zoss, Data Visualization Coordinator
226 Perkins Library, angela.zoss@duke.edu
Duke University Libraries, Digital Scholarship
Text > Data, October 25
2
DOCUMENTS AS CONTEXT
3
But first, ANGELA AS CONTEXT
4
How I learned to love the document.
B.A. courses: Linguistics, Communication
M.S. courses: Communication, Human-Computer Interaction
Employment: arXiv.org Administrator
Ph.D. courses: Bibliometrics/Scientometrics, Computer Mediated Discourse Analysis, Latent Structure Analysis, Natural Language Processing
5
Now, DOCUMENTS AS CONTEXT
6
Text analysis from…
  documents down to words (“low-level”)
  words up to documents (“high-level”)
7
Using documents to learn about language (or other social phenomena)
Analyzing documents as records/proxies of language, social structures, events, etc.
Linguistic studies: morphology, word counts, syntax, etc. …
  over time (e.g., Google ngram viewer)
  language across corpora (e.g., political speeches)
Underwood, T. (2012). Where to start with text mining.
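A word count "over time" can be sketched in a few lines. The example below is only an illustration under stated assumptions: the two-year toy corpus, the crude regex tokenizer, and the helper names `tokenize` and `relative_frequency` are all hypothetical, not anything from the talk or from the Google ngram viewer.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequency(docs, word):
    """Share of all tokens across `docs` that are equal to `word`."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

# Hypothetical toy corpora keyed by year, for illustration only.
corpus_by_year = {
    1900: ["He said that he would go.", "She stayed home."],
    2000: ["She said that she would go.", "She left and he stayed."],
}

# Print the relative frequency of one pronoun for each year.
for year, docs in sorted(corpus_by_year.items()):
    print(year, round(relative_frequency(docs, "she"), 3))
```

Dividing by the total token count, rather than reporting raw counts, keeps years with more text from dominating the comparison.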
8
Using documents to learn about language Historical culturomics of pronoun frequencies
9
Using documents to learn about language Universal properties of mythological networks
10
Using language to learn about documents
Analyzing documents as artifacts themselves, with their own properties and dynamics
Literary, documentary studies:
  Structural/rhetorical/stylistic analysis
  Document categorization, classification
  Detecting clusters of document features (topic modeling)
Underwood, T. (2012). Where to start with text mining.
11
Using language to learn about documents Literary Empires, Mapping Temporal and Spatial Settings in Swinburne
12
Using language to learn about documents Using Word Clouds for Topic Modeling Results
13
What are documents?
For this discussion, digital versions of works of spoken or written language
Examples: books, articles, transcripts, emails, tweets…
14
Documents as context
Documents have:
  form(at)
  style
  provenance
  entities
  intentions
15
STUDIES OF DOCUMENTS
16
Why study documents?
  Describe a corpus
  Compare/organize documents
  Locate relevant information/filter out irrelevant information
17
Describing a corpus
  Finding regularities/differences across groups of documents
  Developing theories of structure, style, etc. that can then be tested or applied
  May be manual (content analysis) or computer-assisted (statistical)
18
Example: Storylines http://xkcd.com/657/
19
Differences of format, genre, participants…
  Articles may have sections, but these will vary by discipline and type of article
  Books may be fiction or non-fiction (or both)
  Transcripts may refer to multiple speakers, non-text content
  …ad infinitum
20
Example: Literature Fingerprinting
Keim, D. A., & Oelke, D. (2007). Literature fingerprinting: A new method for visual literary analysis. In IEEE Symposium on Visual Analytics Science and Technology, VAST 2007 (pp. 115-122). doi: 10.1109/VAST.2007.4389004
21
Organizing documents
  Detect similarity between documents and a known category (or simply among themselves)
  Supports browsing, sentiment analysis, authorship detection
22
Example: Bohemian Bookshelf
Thudt, A., Hinrichs, U., & Carpendale, S. (2012). The Bohemian Bookshelf: Supporting Serendipitous Book Discoveries through Information Visualization. In CHI '12: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, to appear.
23
Similarity based on…
  common document attributes: authorship, genre
  common language patterns: topics, phrases
  common entity references: characters, citations
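Similarity based on common language patterns is often approximated by comparing word-count vectors. The sketch below uses cosine similarity over bags of words. This is one common choice, not a method named in the talk, and the helper names and the regex tokenizer are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase word tokens in the text (crude regex tokenizer)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical example documents.
doc1 = bag_of_words("The whale hunted the ship across the sea.")
doc2 = bag_of_words("The ship crossed the sea to hunt the whale.")
doc3 = bag_of_words("Stock prices fell sharply on Monday.")

print(cosine_similarity(doc1, doc2) > cosine_similarity(doc1, doc3))
```

The same function works for the other bases of similarity on this slide if the counted items are swapped, for example character names or cited works instead of words.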
24
Example: Quantitative Formalism
Allison, S., Heuser, R., Jockers, M., Moretti, F., & Witmore, M. (2011). Quantitative formalism: An experiment. Pamphlets of the Stanford Literary Lab (vol. 1).
25
Example: Clinton’s DNC Speech http://b.globe.com/TogUqq
26
Example: View DHQ http://digitalliterature.net/viewDHQ/vis3.html
27
Classification
  assigning an object to a single class
  often supervised, using an existing classification scheme and a tagged corpus
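To make "supervised, using a tagged corpus" concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing. Naive Bayes is swapped in as a familiar example and is not a technique the talk names. The tagged toy corpus, the labels, and the function names `train` and `classify` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train(tagged_docs):
    """Collect per-class word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in tagged_docs:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def classify(text, model):
    """Return the single class with the highest log-probability score."""
    word_counts, class_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / n_docs)  # class prior
        total = sum(word_counts[label].values())
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids log(0).
                score += math.log(
                    (word_counts[label][token] + 1) / (total + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical tagged corpus following an existing two-class scheme.
tagged = [
    ("the striker scored a late goal", "sports"),
    ("the team won the match", "sports"),
    ("the senate passed the budget bill", "politics"),
    ("voters elected a new senator", "politics"),
]
model = train(tagged)
print(classify("a goal for the team", model))
```

Each document receives exactly one label, which matches the "single class" definition on this slide.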
28
Example: Relative signatures Jankowska, M., Keselj, V., & Milios, E. (2012). Relative n-gram signatures: Document visualization at the level of character n-grams. In Proceedings of IEEE Conference on Visual Analytics Science and Technology 2012 (pp. 103-112).
29
Categorization
  assigning documents to one or more categories
  suggestive of unsupervised clustering techniques
  design choices made to fit particular tasks or goals
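An unsupervised grouping can be sketched without any tagged corpus. The example below links documents whose word sets overlap above a Jaccard threshold and merges linked documents into one group (single-link clustering). The threshold value, the toy documents, and the function names are assumptions. Note that this simple variant places each document in exactly one group, whereas categorization as described above may assign several.

```python
import re

def word_set(text):
    """Set of lowercase word tokens in the text."""
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(docs, threshold=0.3):
    """Return one cluster label per document (single-link grouping)."""
    sets = [word_set(d) for d in docs]
    labels = list(range(len(docs)))  # each document starts in its own cluster
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            similar = jaccard(sets[i], sets[j]) >= threshold
            if similar and labels[i] != labels[j]:
                # Merge the cluster containing j into the cluster containing i.
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    return labels

# Hypothetical documents; no category labels are supplied.
docs = [
    "galaxy cluster dark matter survey",
    "dark matter halo survey",
    "protein folding enzyme structure",
    "enzyme structure binding protein",
]
print(cluster(docs))
```

The threshold is the kind of design choice the slide mentions: raising it yields more, smaller groups, and lowering it merges groups together.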
30
Example: UCSD Map of Science
Börner, K., Klavans, R., Patek, M., Zoss, A. M., Biberstine, J. R., Light, R. P., Larivière, V., & Boyack, K. W. (2012). Design and update of a classification system: The UCSD Map of Science. PLoS ONE, 7(7), e39464.
31
Example: NIH Map Viewer https://app.nihmaps.org/nih/browser/
32
Reference systems, infrastructure
What do we gain by adding structure? What do we lose?
33
SUMMARIZING DOCUMENTS
34
Text is only one component of a document. Research questions often push us to be creative with how we operationalize constructs. The richness of language and documents is best preserved by using multiple, complementary approaches.
35
QUESTIONS? angela.zoss@duke.edu