Published by Allyson Ramsey. Modified over 8 years ago.
1
Data Mining at Scale(s): Collaborating to Build Sharable Skill Sets and Data Sets
Scott Warren, Doug Duhaime, Patrick Williams
2
Who we are
» Doug Duhaime, ProQuest
  » Text and Data Mining Product Manager
» Scott Warren, Syracuse University Libraries
  » Associate Dean for Research and Scholarship
» Patrick Williams, Syracuse University Libraries
  » Librarian for Literature, Communication & Rhetorical Studies, Composition & Cultural Rhetoric, English/Textual Studies
3
Roles
» Patrick: work with the researcher, identify resources
» Scott: negotiate, review licenses, allocate funds, process
» Doug: put together the data set, data mine
4
Why seek data?
5
Knowledge can be distributed in a collection
6
Seeing the big picture – developing new insights – new questions
Sebastian Opitz
7
Researcher questions
» What don’t I know?
» Is this something possibly answered by (enough) data?
» What data?
» Where does it come from?
» What or who is the source of the data?
» Do I have access to this data in a form I can use?
8
Process questions – library and vendor
» Who owns that data? Who provides that data?
» Is there a way to (meaningfully) define that data?
» Once defined, is there a way to extract that data?
» What sort of costs are attached to that process?
» Are legal issues attached to that process?
9
We have the data, what next?!?
» Who does the actual mining? Library or researcher?
» Is statistical, analytical, computing, or visualization help needed? Who provides that?
» Who preserves the data? Library, researcher, or vendor? No one?
  » Re-use would be nice
10
Broken Pencil
» Earlier request by the same faculty member
» PQ does not own perpetual rights to this current title
  » Aggregator title that is licensed, rather than sold
  » Title runs from the 1990s to the present
» Ultimately unable to deliver a data set
  » Publisher sent all print copies to the researcher
  » Old School TDM!!
11
Bookman
» The Bookman was a monthly magazine published in London from 1891 until 1934 by Hodder & Stoughton.
» Part of PQ’s British Periodicals Collection 1
  » PQ has permanent ownership of this, unlike Broken Pencil
  » Which is why they can pass it on to us
» “It was a catalogue of the current publications that also contained reviews, advertising and illustrations.” - Wikipedia
12
Why Bookman?
» Need for a very specific type of information
  » From a specialized historical periodical
  » Not so much just a giant dataset
» A periodical to which we already have “access”
  » But the question is not served by our purchased access
  » Access is distributed article by article
  » Good for reading, not analyzing at scale
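To illustrate why article-by-article access is "good for reading, not analyzing at scale", here is a minimal sketch of the kind of question a delivered full-text data set makes easy: counting how often a term appears per year across every article at once. The `(year, text)` record shape, the function name `term_counts_by_year`, and the sample sentences are all assumptions for illustration; they do not describe the actual Bookman data set or its format.

```python
import re
from collections import Counter


def term_counts_by_year(articles, term):
    """Count case-insensitive whole-word occurrences of `term` per year.

    `articles` is an iterable of (year, text) pairs. This shape is an
    assumption for illustration, not the format of any delivered data set.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = Counter()
    for year, text in articles:
        counts[year] += len(pattern.findall(text))
    return dict(counts)


# Hypothetical sample records standing in for a delivered corpus.
sample = [
    (1891, "A new novel reviewed. The novel is slight."),
    (1891, "Poetry notices for the month."),
    (1892, "Another novel, and a volume of verse."),
]

print(term_counts_by_year(sample, "novel"))
```

With a reading interface, the same tally would mean opening and searching each article by hand; with the full text in one place it is a single pass over the collection.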
13
Service
» TDM is a new type of service provision
» Is it sustainable?
  » For libraries?
  » For vendors/publishers?
» The 80/20 rule
  » Built-in applications on platforms for ‘basic’ mining?
  » Or customized data dumps that the researcher can explore?
14
Cost recovery issues for vendors
» How many TDM requests are going on at the same time?
» Impact on other projects, routine maintenance, development?
» Does the data sit in one central database? Or is it distributed? Normalized or not?
» What third-party storage or delivery costs are involved in the process?
» Will the size impact delivery mode?
» Is the timeframe reasonable – or realistic?
15
Cost recovery issues for libraries
» Price based on???
» Number of users likely to be 1, or perhaps a couple
» Libraries generally do not license or purchase for one-off use
  » At least not at scale and not at 4+ figures per transaction
» Content is already licensed, so the fee is seen as a process fee
  » Rather than a content fee
16
What did we learn? The SU Libraries…
» How things work on the vendor side. Way harder than thought.
» How PQ is anticipating needs we’re seeing locally
» We need to be thinking about
  » Preservation & reuse
  » Policies on maintaining “medium data” collections like this
  » Beyond Patrick having it on a thumbdrive in his desk!!
» So far nothing scales – this is all case by case
17
What did we learn? ProQuest…
» Publishing rights are complex.
  » Two fundamental licenses come into play when discussing data mining
    » The original publisher’s contract with ProQuest
    » ProQuest’s contract with the university
  » Various national laws
» Legacy software is expensive to maintain.
  » Simple tasks, such as retrieving all files in a given newspaper collection, become incredibly difficult.
» Data mining drives algorithmic analysis against the platform
» Publisher worries: researchers sometimes post data
  » EEBO placed on the Internet Archive
» Finding a solution that meets the needs of researchers, librarians, and publishers is important, but not easy
18
Preserving data - remember Patrick’s thumbdrive?!?
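One small, concrete step beyond the thumbdrive is to record a checksum manifest when a data set is delivered, so that later copies can be verified against it. The sketch below is only an illustration under stated assumptions: the function names `build_manifest` and `changed_files` and the file names are hypothetical, and nothing here describes how SU Libraries or ProQuest actually store the data. It reads each file fully into memory, which is fine for a "medium data" collection but would need chunked reads for very large files.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(root):
    """Map each file under `root` (relative POSIX path) to its SHA-256 hex digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def changed_files(old, new):
    """Return paths whose digest differs, or that appear in only one manifest."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))


# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "issue_001.txt").write_text("first issue", encoding="utf-8")
    (Path(tmp) / "issue_002.txt").write_text("second issue", encoding="utf-8")
    before = build_manifest(tmp)
    # Simulate silent alteration of one file between two checks.
    (Path(tmp) / "issue_002.txt").write_text("second issue, altered", encoding="utf-8")
    after = build_manifest(tmp)

print(changed_files(before, after))
```

Storing the manifest alongside the data (and in at least one other place) lets whoever ends up preserving the set, whether library, researcher, or vendor, confirm that a copy is complete and unchanged.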
19
Centralize data
20
Liblicense model license
» Section 3. Authorized Users and Uses
» Clause J. Text and Data Mining.
» Authorized Users may use the Licensed Materials to perform and engage in text and/or data mining activities for academic research, scholarship, and other educational purposes, utilize and share the results of text and/or data mining in their scholarly work, and make the results available for use by others, so long as the purpose is not to create a product for use by third parties that would substitute for the Licensed Materials. Licensor will cooperate with Licensee and Authorized Users as reasonably necessary in making the Licensed Materials available in a manner and form most useful to the Authorized User. If Licensee or Authorized Users request the Licensor to deliver or otherwise prepare copies of the Licensed Materials for text and data mining purposes, any fees charged by Licensor shall be solely for preparing and delivering such copies on a time and materials basis.
21
Some further thoughts
» Association of Research Libraries
  » Issue brief in June 2015
» Text and Data Mining and Fair Use in the United States
  » Krista L. Cox, Director of Public Policy Initiatives
» http://www.arl.org/storage/documents/TDM-5JUNE2015.pdf