Data and text mining: the search for unknown knowns. Geoffrey Bilder, UKSG, 2007

3 "Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know."

4 The Mining Metaphor


6 Gold Mining

7 Diamond Mining

8 Data Mining

9 Data Mining: What it isn’t

10 ≠ Information Retrieval

11 ≠ Information Extraction

12 ≠ Information Analysis

13 Information Retrieval + Information Extraction + Information Analysis

14 Data Mining: new, previously unknown information

15 And so what is text data mining?

16 Text Mining


18 Information Retrieval + Information Extraction + Information Analysis


20 The crucial question for publishers is: “If ‘hiding’ information in unstructured text is a problem, then shouldn’t we be exploring new ways to ‘publish’?”

21 So how did we get here?

22 The word tobacco originates from the Taino Indians. There is no I in the word Team. The book captured the zeitgeist of the time. I am sure that I turned the gas off.

23 The book captured the zeitgeist of the time. I am sure that I turned the gas off.



26 Semantic Web “Light”






32 But we can do more...

33 The web as a database

34 The Relational Model

Title        Author              ISBN-13          Publisher
Labyrinths   Jorge Luis Borges   978-0811200127   New Directions
Hopscotch    Julio Cortazar      978-0394752846   Pantheon
The Aleph    Jorge Luis Borges   978-0140286809   Penguin
...

35 Rows represent things

36 Columns are properties

37 The book has an author “Jorge Luis Borges”
The thing: the book. The thing’s property: has an author.
Subject: The book. Predicate: has an author. Object: “Jorge Luis Borges”.

38 The book has an author “Jorge Luis Borges”
Subject: The book. Predicate: has an author. Object: “Jorge Luis Borges”.
URI
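As a hedged illustration of the step from table rows to subject-predicate-object statements, the sketch below turns the example book table into triples in Python. The `urn:isbn:` subject identifiers and the `ex:` predicate names are assumptions made for illustration, not vocabulary taken from the talk.

```python
# Each row of the table is one "thing"; each column is a property of it.
ROWS = [
    {"Title": "Labyrinths", "Author": "Jorge Luis Borges",
     "ISBN-13": "978-0811200127", "Publisher": "New Directions"},
    {"Title": "Hopscotch", "Author": "Julio Cortazar",
     "ISBN-13": "978-0394752846", "Publisher": "Pantheon"},
    {"Title": "The Aleph", "Author": "Jorge Luis Borges",
     "ISBN-13": "978-0140286809", "Publisher": "Penguin"},
]


def row_to_triples(row):
    """Return (subject, predicate, object) triples for one table row."""
    # Assumption: the ISBN is used as the row's identifier, written as a URN.
    subject = "urn:isbn:" + row["ISBN-13"].replace("-", "")
    triples = []
    for column, value in row.items():
        if column == "ISBN-13":
            continue  # already used to build the subject
        # Hypothetical predicate names such as "ex:author".
        triples.append((subject, "ex:" + column.lower(), value))
    return triples


TRIPLES = [t for row in ROWS for t in row_to_triples(row)]

for triple in TRIPLES:
    print(triple)
```

Each printed tuple reads like the sentence on the slide: a subject (the book), a predicate (one of its properties) and an object (the value).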

39 has an author RDF: Resource Description Framework

40 Journal A, Journal B, Wiki, Blog, Personal Website, OPAC



43 SPARQL

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name
WHERE {
  ?x rdf:type foaf:Person .
  ?x foaf:name ?name
}
ORDER BY ?name
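To make the query concrete, here is a minimal sketch in plain Python of what it asks for: the distinct names of every resource typed as foaf:Person, in sorted order. It does not use a SPARQL engine, and the small in-memory graph and its `ex:` identifiers are invented for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Hypothetical data: a list of (subject, predicate, object) triples.
GRAPH = [
    ("ex:borges", RDF_TYPE, FOAF_PERSON),
    ("ex:borges", FOAF_NAME, "Jorge Luis Borges"),
    ("ex:cortazar", RDF_TYPE, FOAF_PERSON),
    ("ex:cortazar", FOAF_NAME, "Julio Cortazar"),
    # Has a name but is not typed as a Person, so it must not be returned.
    ("ex:labyrinths", FOAF_NAME, "Labyrinths"),
    # A repeated statement, to show the effect of DISTINCT.
    ("ex:borges", FOAF_NAME, "Jorge Luis Borges"),
]


def person_names(graph):
    """Mimic the query: names of all foaf:Person resources, distinct and sorted."""
    # ?x rdf:type foaf:Person
    people = {s for (s, p, o) in graph if p == RDF_TYPE and o == FOAF_PERSON}
    # ?x foaf:name ?name, bound to the same ?x; the set plays the role of DISTINCT
    names = {o for (s, p, o) in graph if p == FOAF_NAME and s in people}
    # ORDER BY ?name
    return sorted(names)


print(person_names(GRAPH))
```

The two set comprehensions correspond to the two triple patterns in the WHERE clause, joined on the shared variable `?x`.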



46 RSS 1.0, FRBR, Creative Commons, FOAF, Geo, SKOS

47 The Early Modern Internet

48 Data Mining = Information retrieval + Information extraction + Information analysis... with the goal of discovering new, previously unknown information

49 Data Mining = Information retrieval + Information extraction + Information analysis... with the goal of discovering new, previously unknown information
Text Data Mining = Complex data extraction layer + data mining
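The "extraction layer + data mining" split can be sketched as a toy two-stage pipeline in Python. Everything here is an assumption made for illustration: the example sentences, the "was written by" pattern, and the use of a simple count as a stand-in for the mining step. Real extraction from free text is far harder than one regular expression, which is the point of calling the layer "complex".

```python
import re
from collections import Counter

# Hypothetical unstructured input; the last sentence yields no structured fact.
SENTENCES = [
    "Labyrinths was written by Jorge Luis Borges.",
    "Hopscotch was written by Julio Cortazar.",
    "The Aleph was written by Jorge Luis Borges.",
    "The book captured the zeitgeist of the time.",
]

# Stage 1 pattern: "<title> was written by <author>."
PATTERN = re.compile(r"^(?P<title>.+?) was written by (?P<author>.+?)\.$")


def extract(sentences):
    """Extraction layer: turn matching sentences into (subject, predicate, object)."""
    triples = []
    for sentence in sentences:
        match = PATTERN.match(sentence)
        if match:
            triples.append(
                (match.group("title"), "has an author", match.group("author"))
            )
    return triples


def books_per_author(triples):
    """Stand-in for the mining step: count books per author over the triples."""
    return Counter(obj for (_, pred, obj) in triples if pred == "has an author")


TRIPLES = extract(SENTENCES)
COUNTS = books_per_author(TRIPLES)
print(COUNTS.most_common())
```

Once the text has been reduced to triples, the second stage no longer needs to know that the data ever came from prose.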





54 Why do we publish text?

55 Thank You

