Digital Text and Data Processing Week 7


1 Digital Text and Data Processing Week 7

2 Challenges
□ POS: normalise total counts by token count
□ Unicode support
□ Synchronic and diachronic variation (dialects and historical changes)
□ Not knowing beforehand what is possible / relevant
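The first challenge, normalising part-of-speech counts by token count, can be sketched as follows. This is a minimal illustration and not the course's own code: the function name `relative_pos_frequencies`, the Penn-style tags, and the hand-tagged sample are all assumptions made for the example.

```python
from collections import Counter


def relative_pos_frequencies(tagged_tokens):
    """Return each POS tag's share of all tokens in a text.

    `tagged_tokens` is a list of (token, tag) pairs. Dividing by the
    total token count makes texts of different lengths comparable.
    """
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


# Hypothetical, hand-tagged sample (not taken from the slides).
sample = [
    ("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
    ("on", "IN"), ("the", "DT"), ("mat", "NN"),
]
frequencies = relative_pos_frequencies(sample)
```

Raw counts would make a long novel look richer in nouns than a short poem simply because it has more words; the relative frequencies sum to 1 for every text, so they can be compared directly.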

3
□ Digital humanities methodology often demands experimentation
□ The method is mostly inductive (cf. the deductive approach advocated by Stanley Fish)
□ When experiments are not motivated, there is a risk that the research simply exposes "a correlation between a formal feature the computer program just happened to uncover and a significance that has simply been declared, not argued for".
□ Also see Chris Anderson, The End of Theory

4
□ The DH methodology is partly inductive and partly deductive
□ Computational analyses often lead to unexpected results
□ Techniques can help scholars to generate hypotheses

5 Phases
□ Data acquisition
□ Clean up and enrichment (removal of stopwords, POS, lemmatisation)
□ Quantification
□ Data analysis
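The middle phases (clean-up and quantification) might look roughly like the sketch below. It is illustrative only: the stopword list, the function names, and the sample sentence are assumptions, and POS tagging and lemmatisation are left out because they need an external tagger.

```python
import re
import unicodedata
from collections import Counter

# Illustrative stopword list; a real project would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "on", "in", "to"}


def tokenise(text):
    """Lower-case the text and split it into word tokens.

    NFC normalisation first merges combining marks into single code
    points, so that "e" + U+0301 is treated the same as "é".
    """
    text = unicodedata.normalize("NFC", text).lower()
    return re.findall(r"\w+", text)


def clean(tokens, stopwords=STOPWORDS):
    """Clean-up phase: drop stopwords."""
    return [t for t in tokens if t not in stopwords]


def quantify(tokens):
    """Quantification phase: count how often each token occurs."""
    return Counter(tokens)


# Hypothetical input standing in for an acquired text.
text = "The cat sat on the mat, and the dog sat on the cat."
counts = quantify(clean(tokenise(text)))
```

Keeping each phase in its own function makes it easy to rerun one step, for example with a different stopword list, without touching the others.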

6 Data acquisition
□ Page images and machine-readable text (removal of typography and of paratext)
□ Low quality of OCR; see, e.g., Laura Mandell, How to Read a Literary Visualisation
□ Motivation of the choice of a specific edition

7

8 Text mining (TM) on recent scientific articles
□ Text2Genome
□ OSCAR
□ NeuroElectro
□ Peter Murray-Rust's work on chemical compounds

9 Licences
□ The right to read does not imply the right to mine
□ Study commissioned by the EC, led by Prof. Ian Hargreaves

10 Google Books Settlement
Article 7.2 of Settlement:
□ Creation of a “Research Corpus”;
□ Solely for “non-consumptive” reading, or research “in which computational analysis is performed on one or more Books, but not research in which a researcher reads or displays substantial portions of a Book to understand the intellectual content presented within the Book”

11

12

13 Database and Narrative
□ Lev Manovich, The Language of New Media
□ Textual narrative: linearity and reliance on typography
□ Database: random access, non-linear, no form

14

15 The Semantic Web
□ Envisaged by Tim Berners-Lee as “a web of data that can be processed directly and indirectly by machines”
□ RDF triples
□ Example:
Subject: “Book-URI”
Predicate: “hasISBN”
Object: “978-0-252-07829-0”
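The subject-predicate-object structure of a triple can be illustrated with plain Python. This is a sketch of the data model only, not of any RDF library: the `Triple` type, the `match` helper, and the second triple (`hasTitle`) are hypothetical additions; only the ISBN triple comes from the slide.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One statement: a subject, a predicate, and an object."""
    subject: str
    predicate: str
    obj: str


triples = [
    # Taken from the slide.
    Triple("Book-URI", "hasISBN", "978-0-252-07829-0"),
    # Hypothetical extra statement, added so there is something to filter.
    Triple("Book-URI", "hasTitle", "Example Title"),
]


def match(
    store: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that agree with every field that is given.

    A field left as None acts as a wildcard, loosely like a variable
    in a query pattern.
    """
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


isbn_statements = match(triples, predicate="hasISBN")
```

Because every statement has the same three-part shape, a machine can answer a question such as "which ISBN does this book have?" by pattern matching alone, without understanding the surrounding text.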

16 DBpedia

17

18

19 Nano-Publications

20 Semantic Publishing

21 STCN SPARQL Endpoint

