Published by David Verney. Modified over 9 years ago.
2
Data and text mining: the search for unknown knowns
Geoffrey Bilder
UKSG, 2007
gbilder@crossref.org
3
"Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know."
4
The Mining Metaphor
6
Gold Mining
7
Diamond Mining
8
Data Mining
9
Data Mining: What it isn’t
10
≠ Information Retrieval
11
≠ Information Extraction
12
≠ Information Analysis
13
Information Retrieval + Information Extraction + Information Analysis
14
Data Mining: discovering new, previously unknown information
15
And so what is text data mining?
16
Text Mining
18
Information Retrieval + Information Extraction + Information Analysis
20
The crucial question for publishers is: “If ‘hiding’ information in unstructured text is a problem, then shouldn’t we be exploring new ways to ‘publish’?”
21
So how did we get here?
22
The word tobacco originates from the Taino Indians.
There is no I in the word Team.
The book captured the zeitgeist of the time.
I am sure that I turned the gas off.
23
The book captured the zeitgeist of the time.
I am sure that I turned the gas off.
26
Semantic Web “Light”
32
But we can do more...
33
The web as a database
34
The Relational Model
Title | Author | ISBN-13 | Publisher
Labyrinths | Jorge Luis Borges | 978-0811200127 | New Directions
Hopscotch | Julio Cortazar | 978-0394752846 | Pantheon
The Aleph | Jorge Luis Borges | 978-0140286809 | Penguin
...
35
Rows represent things
Title | Author | ISBN-13 | Publisher
Labyrinths | Jorge Luis Borges | 978-0811200127 | New Directions
Hopscotch | Julio Cortazar | 978-0394752846 | Pantheon
The Aleph | Jorge Luis Borges | 978-0140286809 | Penguin
...
36
Columns are properties
Title | Author | ISBN-13 | Publisher
Labyrinths | Jorge Luis Borges | 978-0811200127 | New Directions
Hopscotch | Julio Cortazar | 978-0394752846 | Pantheon
The Aleph | Jorge Luis Borges | 978-0140286809 | Penguin
...
37
The thing’s property
Title | Author | ISBN-13 | Publisher
Labyrinths | Jorge Luis Borges | 978-0811200127 | New Directions
Hopscotch | Julio Cortazar | 978-0394752846 | Pantheon
The Aleph | Jorge Luis Borges | 978-0140286809 | Penguin
...
Subject: The book
Predicate: has an author
Object: “Jorge Luis Borges”
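The step from table to subject/predicate/object statements can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `rows_to_triples` and the choice of the ISBN-13 column as the identifier of each "thing" are hypothetical, not taken from the slides.

```python
# Each row is a "thing"; each column is a property of that thing.
# Data copied from the table on the slide.
books = [
    {"Title": "Labyrinths", "Author": "Jorge Luis Borges",
     "ISBN-13": "978-0811200127", "Publisher": "New Directions"},
    {"Title": "Hopscotch", "Author": "Julio Cortazar",
     "ISBN-13": "978-0394752846", "Publisher": "Pantheon"},
    {"Title": "The Aleph", "Author": "Jorge Luis Borges",
     "ISBN-13": "978-0140286809", "Publisher": "Penguin"},
]


def rows_to_triples(rows, key="ISBN-13"):
    """Turn every (row, column, cell) into a (subject, predicate, object) triple.

    Assumption: the `key` column identifies the thing the row describes.
    """
    triples = []
    for row in rows:
        subject = row[key]
        for column, value in row.items():
            if column != key:  # the key names the subject, it is not a property
                triples.append((subject, column, value))
    return triples


triples = rows_to_triples(books)
for triple in triples:
    print(triple)
```

Three rows with three non-key columns each yield nine statements, one per cell.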
38
Subject: The book
Predicate: has an author
Object: “Jorge Luis Borges”
URI
39
RDF: Resource Description Framework
Subject: http://www.amazon.com/isbn/978-0140286809
Predicate: has an author
Object: http://www.wikipedia.com/borges
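As a rough sketch, the same statement can be written as one line of N-Triples, a plain-text RDF serialization in which every part of the triple is a URI. The subject and object URIs are the ones on the slide. The predicate URI is an assumption: the slide only says "has an author", so the Dublin Core `creator` term stands in for it here. The helper `to_ntriple` is hypothetical.

```python
def to_ntriple(subject, predicate, obj):
    """Serialize one triple of URIs as a single N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


BOOK = "http://www.amazon.com/isbn/978-0140286809"  # subject, from the slide
AUTHOR = "http://www.wikipedia.com/borges"          # object, from the slide
# Assumption: Dublin Core "creator" stands in for the slide's "has an author".
HAS_AUTHOR = "http://purl.org/dc/terms/creator"

line = to_ntriple(BOOK, HAS_AUTHOR, AUTHOR)
print(line)
```

Because all three parts are URIs, statements published on different sites can refer to the same book or the same author without sharing a database.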
40
Journal A, Journal B, Wiki, Blog, Personal Website, OPAC
41
Journal A, Journal B, Wiki, Blog, Personal Website, OPAC
43
SPARQL
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name
WHERE {
  ?x rdf:type foaf:Person .
  ?x foaf:name ?name
}
ORDER BY ?name
http://api.ingentaconnect.com/content/cabi/nrr/latest?format=rss
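To make the query concrete, here is a minimal Python sketch of what it asks for: every distinct `foaf:name` of a resource typed as `foaf:Person`, in sorted order. It is not a SPARQL engine. The sample triples under `http://example.org/` and the function `person_names` are hypothetical; only the three vocabulary URIs come from the query's prefixes.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Hypothetical sample graph: a set of (subject, predicate, object) triples.
graph = {
    ("http://example.org/p1", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/p1", FOAF_NAME, "Jorge Luis Borges"),
    ("http://example.org/p2", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/p2", FOAF_NAME, "Julio Cortazar"),
    # Has a name but is not typed as a person, so it must not match.
    ("http://example.org/b1", FOAF_NAME, "The Aleph"),
}


def person_names(triples):
    """Mimic: SELECT DISTINCT ?name WHERE { ?x rdf:type foaf:Person .
    ?x foaf:name ?name } ORDER BY ?name"""
    # First pattern: bind ?x to every subject typed as foaf:Person.
    people = {s for (s, p, o) in triples if p == RDF_TYPE and o == FOAF_PERSON}
    # Second pattern: collect ?name for those ?x; the set gives DISTINCT.
    names = {o for (s, p, o) in triples if p == FOAF_NAME and s in people}
    return sorted(names)  # ORDER BY ?name


names = person_names(graph)
print(names)
```

The two patterns share the variable `?x`, which is what joins the type statement to the name statement.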
46
RSS 1.0, FRBR, Creative Commons, FOAF, Geo, SKOS
47
The Early Modern Internet
48
Data Mining = Information retrieval + Information extraction + Information analysis... with the goal of discovering new, previously unknown information
49
Data Mining = Information retrieval + Information extraction + Information analysis... with the goal of discovering new, previously unknown information
Text Data Mining = Complex data extraction layer + data mining
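A toy Python sketch of that composition, under heavy assumptions: the sample sentences, the "X wrote Y" pattern, and the function names `retrieve`, `extract` and `analyze` are all hypothetical. Real text mining needs a far more complex extraction layer than one regular expression.

```python
import re
from collections import defaultdict

# Hypothetical unstructured text, one "document" per string.
docs = [
    "Borges wrote Labyrinths. Borges admired Kafka.",
    "Cortazar wrote Hopscotch.",
    "The weather was fine.",
    "Borges wrote Ficciones.",
]


def retrieve(documents, term):
    """Information retrieval: keep only documents that mention the term."""
    return [d for d in documents if term.lower() in d.lower()]


def extract(document):
    """Information extraction: pull (author, title) pairs out of free text."""
    return re.findall(r"(\w+) wrote (\w+)", document)


def analyze(facts):
    """Information analysis: group the extracted titles by author."""
    by_author = defaultdict(list)
    for author, title in facts:
        by_author[author].append(title)
    return dict(by_author)


facts = [fact for doc in retrieve(docs, "wrote") for fact in extract(doc)]
result = analyze(facts)
print(result)
```

The extraction step is the part that structured publishing would make unnecessary: if the statements were published as data, only retrieval and analysis would remain.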
54
Why do we publish text?
55
Thank You gbilder@crossref.org