Peter Buneman Research Director Digital Curation Centre and School of Informatics University of Edinburgh Funders: The Research Agenda Digital Curation.

The Research Agenda
Peter Buneman, Research Director, Digital Curation Centre and School of Informatics, University of Edinburgh
Digital Curation Centre: a centre of expertise in data curation and preservation
Funders:

2 What is Digital Curation?
Preserving stuff?
–Librarians and archivists
–Scientists (with huge amounts of regular experimental data)
Publishing stuff?
–Publishers of “reference” data: compendia, dictionaries, bibliographies, gazetteers, etc.
–Scientists (with lots of complex annotated data)
Both communities call themselves “curators”, but at first sight they have almost orthogonal concerns.

3 Their concerns look orthogonal, but…
Shouldn’t the “publishers” be concerned about the long-term usefulness of their findings?
The “preservers” do more than preserve – they classify and annotate.
–Shouldn’t they publish (and preserve) their own work?
As you dig deeper you find that there is a lot of commonality.

4 Curated Databases are Central
Much/most scientific data is now in databases.
They often do not contain source experimental data. Sometimes just annotation/metadata.
They borrow extensively from, and refer to, other databases.
You are now judged by your databases as well as your (paper) publications!!
These databases are built and maintained with a great deal of human or computational effort.
What makes a database?
–It has internal structure or it changes. Size alone doesn’t qualify.

5 The Research Agenda
Data integration and publishing
–Slowly coming to market. Publishing in community formats is a new twist.
Annotation
–Everybody agrees this is important. No-one understands it.
Metadata extraction
–Semantic or otherwise, it’s a key part of annotation.
Archiving and Appraisal
–What do we do about databases – they change!
Legal issues
–Can we at least help to clarify what is going on?
Provenance and data quality
–Again, we don’t fully understand it.
Organisational dynamics of repositories
Economic analyses of curation
Ontologies, performance, registries, structure evolution…

6 Archiving (preserving) databases
How do you preserve something that changes every hour or minute?
–Important for the scientific record – someone might have cited your data at time t.
Current practice
–Create versions (how often?)
–Log changes
–Use diffs
–Do nothing (common!)
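The "use diffs" practice can be sketched as follows. This is a minimal illustration, assuming each version is a list of text lines; the function names (`make_diff`, `apply_diff`, `build_repository`, `checkout`) are hypothetical and do not come from the slides or from any particular archiving tool. It stores the first version in full plus one incremental diff per later version.

```python
from difflib import SequenceMatcher


def make_diff(old, new):
    """Return the edit operations that turn the list `old` into the list `new`."""
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Replace old[i1:i2] with the corresponding slice of the new version.
            ops.append((i1, i2, new[j1:j2]))
    return ops


def apply_diff(old, ops):
    """Apply operations produced by make_diff to a copy of `old`."""
    result = list(old)
    # Apply from the end so that earlier indices remain valid.
    for i1, i2, replacement in reversed(ops):
        result[i1:i2] = replacement
    return result


def build_repository(versions):
    """Store the first version in full and every later one as a diff."""
    diffs = [make_diff(a, b) for a, b in zip(versions, versions[1:])]
    return list(versions[0]), diffs


def checkout(repository, n):
    """Rebuild version n (0-based) by replaying the first n diffs."""
    base, diffs = repository
    lines = list(base)
    for ops in diffs[:n]:
        lines = apply_diff(lines, ops)
    return lines
```

Note that `checkout` must replay every diff up to the requested version, so the cost of retrieving an old state, or of following one object through time, grows with the number of versions stored.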

7 A Sequence of Versions

8 [Driscoll, Sarnak, Sleator, Tarjan: “Making Data Structures Persistent.”]
This relies on a deterministic / keyed model
Pushing time down

9 100 days of OMIM
[Figure: size (bytes ×10^6) against version. Series: archive, inc diff, version, compressed inc diff (gzip), compressed archive (XMill).]
Uncompressed archive size is
–1.01 times diff repository size
–1.04 times size of largest version
Compressed archive size is between 0.94 and 1 times compressed diff repository size
gzip: Unix compression tool
XMill: XML compression tool

10 The Bottom Line
Can archive a whole year of Swissprot or OMIM with < 15% overhead (size of current file)
Retrieval is a linear scan
Works well with compression to less than 30% of current file.
Archive is an XML file
Archive as often as you like! (Almost)
Works well with indexing
Permits temporal queries on objects
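The idea of merging all versions into one archive, with time recorded on the keyed objects themselves, can be sketched as follows. This is a simplified, flat illustration and not the XML archiver behind the figures above: it assumes each version is a dictionary from a key to a value, and the function names and record identifiers are hypothetical.

```python
def merge_version(archive, version_no, snapshot):
    """Merge one keyed snapshot into the archive, in place.

    `archive` maps key -> list of [value, first_version, last_version] runs.
    Versions are assumed to be merged in increasing order: 1, 2, 3, ...
    """
    for key, value in snapshot.items():
        runs = archive.setdefault(key, [])
        if runs and runs[-1][0] == value and runs[-1][2] == version_no - 1:
            runs[-1][2] = version_no  # unchanged since last version: extend the run
        else:
            runs.append([value, version_no, version_no])  # new or changed value
    return archive


def retrieve(archive, version_no):
    """Rebuild one snapshot with a single linear scan over the archive."""
    snapshot = {}
    for key, runs in archive.items():
        for value, first, last in runs:
            if first <= version_no <= last:
                snapshot[key] = value
    return snapshot


def history(archive, key):
    """Temporal query: the (first, last, value) runs recorded for one object."""
    return [(first, last, value) for value, first, last in archive.get(key, [])]


# Hypothetical example data: three versions of a tiny keyed database.
VERSIONS = {
    1: {"P1": "kinase", "P2": "unknown"},
    2: {"P1": "kinase", "P2": "transferase"},
    3: {"P1": "kinase", "P3": "ligase"},
}

ARCHIVE = {}
for number in sorted(VERSIONS):
    merge_version(ARCHIVE, number, VERSIONS[number])
```

An object that never changes is stored once however many versions are merged, which is why archiving frequently adds little to the archive size. Because the history of each object sits in one place, `history` needs no replay of diffs.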

11 How do we cite data?
A URL or citation to an article is already unsatisfactory.
–DCC client complaint: “I spend a lot of time searching [electronic documents] for the part that is relevant to the citation.”
The problem is much worse when you are citing something in a very large database.
How do you use a citation to locate data?
How do you ensure that the citation persists?
–Connections with DB archiving and DOIs

12 File and directory names that contain data
/timit/train/dr1/fcjf0/sa1.wav
speaker-id: cjf0
sex: f
sentence-id: sa1
file-type: waveform
dialect-region: 1
type: training
corpus: timit
Compound keys traditionally indicated location:
BL MS Cotton Nero A.ix
Manuscript in the British Library, which used to be in the library of a Mr. Cotton [which burnt down] under a statue of Nero, top shelf, nine books along from the left.
Location is typically informative?
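The TIMIT path on this slide can be decoded mechanically. The sketch below assumes the layout suggested by that single example: `dr` plus a number for the dialect region, then a directory whose first letter is the sex and whose remainder is the speaker id. The pattern, the `[fm]` alternatives and the lookup tables are assumptions made for illustration, not a specification of the corpus.

```python
import re

# Assumed path layout, inferred from the one example on the slide.
TIMIT_PATH = re.compile(
    r"^/(?P<corpus>[^/]+)/(?P<type>[^/]+)/dr(?P<dialect_region>\d+)"
    r"/(?P<sex>[fm])(?P<speaker_id>[^/]+)/(?P<sentence_id>[^/.]+)\.(?P<ext>\w+)$"
)

# Only the expansions that appear on the slide; other values pass through as-is.
TYPE_NAMES = {"train": "training"}
FILE_TYPES = {"wav": "waveform"}


def parse_timit_path(path):
    """Extract the data fields encoded in a TIMIT-style file path."""
    match = TIMIT_PATH.match(path)
    if match is None:
        raise ValueError(f"path does not follow the assumed layout: {path!r}")
    fields = match.groupdict()
    return {
        "corpus": fields["corpus"],
        "type": TYPE_NAMES.get(fields["type"], fields["type"]),
        "dialect-region": int(fields["dialect_region"]),
        "sex": fields["sex"],
        "speaker-id": fields["speaker_id"],
        "sentence-id": fields["sentence_id"],
        "file-type": FILE_TYPES.get(fields["ext"], fields["ext"]),
    }
```

Calling `parse_timit_path("/timit/train/dr1/fcjf0/sa1.wav")` yields the seven fields listed on the slide, which shows that the path acts as a compound key for the file.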

13 Keys for XML
Implicit keys are ubiquitous in scientific data formats (easily converted to XML)
Some proposals for key specifications in XML work (DTD IDs, XML-Schema)
“Deep citation” in digital libraries.
Natural consequence of translating back from deterministic model to XML (node-labeled)
Interactions with data models/formats

14 Relative keys
General form: Q{P_1, ..., P_n}.Q’{P’_1, ..., P’_n’}...
Example: book{name}.chapter{number}.verse{number}
–number specifies verse only within chapter
–number specifies chapter only within book
Also: bible{}.book{name}.chapter{number}.verse{number}
–empty key: at most one bible node
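The relative key in the example can be checked, and used to locate a node, with a few lines of code. The sketch below assumes that key fields are stored as XML attributes and that the key specification is a list of (tag, fields) pairs. Both are simplifications chosen for illustration, and the function names and sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# book{name}.chapter{number}.verse{number}, relative to the single bible root.
KEY_SPEC = [("book", ["name"]), ("chapter", ["number"]), ("verse", ["number"])]


def violations(context, spec):
    """Return (tag, key) pairs that occur more than once within one context node."""
    if not spec:
        return []
    (tag, fields), rest = spec[0], spec[1:]
    seen, bad = set(), []
    for child in context.findall(tag):
        key = tuple(child.get(field) for field in fields)
        if key in seen:
            bad.append((tag, key))  # the same key twice under the same parent
        seen.add(key)
        bad.extend(violations(child, rest))  # deeper keys are relative to this child
    return bad


def resolve(context, spec, values):
    """Follow one key value per level down to a node; None unless each step is unique."""
    node = context
    for (tag, fields), value in zip(spec, values):
        matches = [child for child in node.findall(tag)
                   if tuple(child.get(field) for field in fields) == value]
        if len(matches) != 1:
            return None
        node = matches[0]
    return node


# Placeholder content; the verse texts are dummy strings.
DOC = ET.fromstring("""
<bible>
  <book name="Genesis">
    <chapter number="1"><verse number="1">a</verse><verse number="2">b</verse></chapter>
    <chapter number="2"><verse number="1">c</verse></chapter>
  </book>
  <book name="Exodus">
    <chapter number="1"><verse number="1">d</verse></chapter>
  </book>
</bible>
""")
```

Verse number 1 appears three times in `DOC` without violating the key, because each occurrence sits in a different chapter. A key path such as `[("Genesis",), ("2",), ("1",)]` passed to `resolve` therefore identifies exactly one verse, which is the property a deep citation needs.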

15 Keys and file formats
Understanding and registering formats is only a first step.
The real issue is still integration and transformation.
Keys and other constraints may help.
Remember: structured files are databases!

16 Data exchange on the Web
All members of a community (industry) agree on a DTD and then exchange data w.r.t. it: e-commerce, health-care, ...
XML Publishing: mapping relational data to XML conforming to the predefined DTD
[Diagram labels: DB1, DB2, XML, DTD, Q: XML view, Web, XML]
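XML publishing in this sense can be sketched as a function from relational tables to an XML tree with an agreed shape. Everything specific below is hypothetical and chosen only for illustration: the `orders` and `items` tables, the element and attribute names, and the helper `demo_connection`. The sketch builds the tree but does not validate it against a DTD, which the Python standard library cannot do.

```python
import sqlite3
import xml.etree.ElementTree as ET


def demo_connection():
    """Create a small in-memory relational database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
    conn.execute("CREATE TABLE items (order_id INTEGER, sku TEXT, qty INTEGER)")
    conn.execute("INSERT INTO orders VALUES (1, 'Ann')")
    conn.executemany("INSERT INTO items VALUES (?, ?, ?)",
                     [(1, "A1", 2), (1, "B7", 1)])
    return conn


def publish_orders(conn):
    """Map the two tables to a nested tree: orders / order / (customer, item*)."""
    root = ET.Element("orders")
    orders = conn.execute(
        "SELECT order_id, customer FROM orders ORDER BY order_id").fetchall()
    for order_id, customer in orders:
        order = ET.SubElement(root, "order", id=str(order_id))
        ET.SubElement(order, "customer").text = customer
        items = conn.execute(
            "SELECT sku, qty FROM items WHERE order_id = ? ORDER BY sku",
            (order_id,)).fetchall()
        for sku, qty in items:
            # The foreign key becomes nesting: items appear inside their order.
            ET.SubElement(order, "item", sku=sku, qty=str(qty))
    return root
```

Calling `ET.tostring(publish_orders(demo_connection()), encoding="unicode")` serialises the result for exchange. The point of the sketch is that the join between the two tables is expressed as element nesting in the published view.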

17 Progress report on DCC research (funding period: -2 weeks)
Four new research fellows at Edinburgh:
–Mags McGinley (legal practice): IP, copyright in databases
–James Cheney (Cornell): programming languages, digital libraries, XML compression
–Tasos Kementsietsidis (Toronto): data integration, P2P databases
–Rajendra Bose (UCSB): earth sciences data, “workflow” provenance in scientific data
At UKOLN
–Michael Day: metadata and interoperability
At CCLRC
–Shoaib Sufi: data portals and metadata
At Glasgow
–Position in metadata extraction advertised

18 Progress report on DCC research (continued)
Pleasant DCC space (thanks to Edina and Informatics) to house DCC and database group.
Collaboration with
–biologists (EBI & Edinburgh) on data publishing and
–astronomers (Edinburgh) on XML manipulation & representation of large data sets.
First DCC research visitor (Michael Lesk)
Work with partners in progress on
–annotation
–DOIs
Please join us!!!

19 DCC has research positions in databases, digital curation, XML, web technology, fundamentals.
Top-rated department.
World-class database group.
Good connections with logical foundations, scientific DBs, distributed computation (Grid)
Edinburgh is a great place!!
Contact Peter Buneman opb@inf.ed.ac.uk
