Big Data Science Workshop 12 January 2017, Virginia Tech Digital Libraries and Big Data Edward A. Fox (fox@vt.edu), Prashant Chandrasekar, Islam Harb,


Big Data Science Workshop 12 January 2017, Virginia Tech Digital Libraries and Big Data Edward A. Fox (fox@vt.edu), Prashant Chandrasekar, Islam Harb, Liuqing Li, Sunshin Lee Dept. of Computer Science, www.cs.vt.edu http://fox.cs.vt.edu/talks/2017/20170112BigDataWkshpFoxEtAl.pptx

Acknowledgments
Grants:
- NSF CMMI-1638207, CRISP: Coordinated, Behaviorally-Aware Recovery for Transportation and Power Disruptions (CBAR-tpd); PI Pamela Murray-Tuite, Co-PIs Edward Fox, Kris Wernstedt; U. Mich. Ann Arbor, PI Seth Guikema
- NSF IIS-1619028: Global Event and Trend Archive Research (GETAR); PI Fox, Co-PIs Andrea L. Kavanaugh, Chandan Reddy, Donald J. Shoemaker; and Internet Archive, PI Jefferson Bailey
- IMLS LG-71-16-0037-16: Developing Library Cyberinfrastructure Strategy for Big Data Sharing and Reuse; Zhiwu Xie (PI), Tyler Walters, Edward Fox (20%), Pablo Tarazaga; with eval. from University of North Texas
- University of North Texas (NSF flow through): CREST Partnership Supplement: Building Capacity in Information Management through a Partnership with Virginia Tech …, PI Fox
- VT ARC, VT-Rnet: A 10-Gbps Research Network for Virginia Tech; in-kind support to connect the Digital Library Research Laboratory Hadoop Cluster to VT's 10 Gbps network
- NIH Grant 1R01DA039456-01: The Social Interactome of Recovery: Social Media as Therapy Development; PI Warren K. Bickel (VTCRI), Fox as co-PI
- NSF IIS-1319578: Integrated Digital Event Archiving and Library (IDEAL); PI Fox, with co-PIs Donald Shoemaker, Andrea Kavanaugh, Steven Sheetz, and Kristine Hanna (Internet Archive)
Special thanks for: XCaliber Award 2016 "for extraordinary contributions to technology-enriched learning activities" for project "Enhanced problem-based learning connecting big data research with classes", with students: Mohamed Farag, Richard Gruss, Tarek Kanan, Sunshin Lee, Xuan Zhang

Locating Digital Libraries in Computing and Communications Technology Space
[Figure: Digital Libraries technology trajectory, "intellectual access to globally distributed information". Axes: Communications (bandwidth, connectivity); Computing (flops); Digital content. Scale: less to more.]
Note: we should consider 4 dimensions: computing, communications, content, and community (people).
From S. Griffin

DLRL Hadoop Cluster (Cloudera)
- RAM: 2 of 128G, 2 of 64G, 20 of 32G
- Cores: 108
- TB: 160 + backup
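The RAM figures on the cluster slide can be totalled with a few lines of arithmetic. This is only a sketch: it assumes that "2 of 128G" means two nodes with 128 GB each (the slide does not say so explicitly), and the variable names are illustrative.

```python
# Node groups as (node count, RAM per node in GB).
# Assumption: "2 of 128G" on the slide means two nodes with 128 GB each.
node_groups = [(2, 128), (2, 64), (20, 32)]

# Aggregate node count and RAM implied by that reading of the slide.
total_nodes = sum(count for count, _ in node_groups)
total_ram_gb = sum(count * ram_gb for count, ram_gb in node_groups)

print(f"{total_nodes} nodes, {total_ram_gb} GB RAM in total")
```

Under that reading the groups add up to 24 nodes and 1024 GB of RAM.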

5S-based DL Services Taxonomy

IDEAL (Integrated Digital Event Archiving and Library) Motivation Problem definition

IDEAL Data Architecture
Components highlighted in grey relate to Sunshin Lee's research on tweet geo-coding.

Sunshin Lee’s Data Flow

GETAR Architecture - 1

GETAR Architecture - 2

GETAR: Areas, Investigators, Courses

CRISP, CREST
CRISP:
- Multi-Agent Modeling
- Social Media and WWW: Information Extraction and Data Mining
- Coordinated, Behaviorally-Aware Recovery
- Transportation and/or Power Disruptions
- Simulation and Prediction
CREST:
- Semantic Web (Ontology + RDF + SPARQL)
- Data Integration (Autonomous Distributed Heterogeneous Data Sources)
- Education Domain (Student Success in Academia wrt Institutional Programs/Activities/Initiatives)

CREST

Social Interactome Experiment
- 2 networks: small-world, lattice
- 128 per network, 6 buddies/person
- 16 (12 constrained + 4 open) weeks
- Educational (TES, stories) resources, Video meetings, Assessments
- Moderator to stimulate engagement, deal with problems
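To make the two network shapes concrete, here is a minimal sketch that builds a 128-person ring lattice with 6 buddies each and then derives a small-world-like variant. The slides do not say how the study's networks were generated, so the construction is an assumption: `ring_lattice` and `add_shortcuts` are hypothetical helpers, and the degree-preserving edge swap was chosen only because it keeps every person at exactly 6 buddies.

```python
import random

def ring_lattice(n, k):
    """Link each node to its k nearest neighbours on a ring (k must be even)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj

def add_shortcuts(adj, n_swaps, rng):
    """Create long-range links via double-edge swaps, which preserve every degree."""
    new = {i: set(nbrs) for i, nbrs in adj.items()}
    edges = sorted((i, j) for i in new for j in new[i] if i < j)
    done = attempts = 0
    while done < n_swaps and attempts < 1000 * n_swaps:
        attempts += 1
        x, y = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[x], edges[y]
        # Replace a-b and c-d with a-d and c-b; skip loops and duplicate edges.
        if len({a, b, c, d}) < 4 or d in new[a] or b in new[c]:
            continue
        new[a].remove(b); new[b].remove(a)
        new[c].remove(d); new[d].remove(c)
        new[a].add(d); new[d].add(a)
        new[c].add(b); new[b].add(c)
        edges[x], edges[y] = (a, d), (c, b)
        done += 1
    return new

lattice = ring_lattice(128, 6)                       # 128 people, 6 buddies each
small_world = add_shortcuts(lattice, 20, random.Random(0))  # 20 swaps is arbitrary
```

The number of swaps controls how many shortcuts appear; 20 is an arbitrary illustrative value, not a parameter from the study.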

S.I. Assessment Data Collected
- Big Five Inventory: 44 items => 5 dimensions of personality; personality facets within the 5 dimensions
- Social Connectedness Scale (w. buddies)
- Addiction Severity Index: assess stability based on drug/alcohol use, status (family/social, psychiatric, medical, legal), …
- Recovery Capital Scale: internal/external assets to initiate and sustain recovery
- Relapse Data

S.I. Analysis: Data Flow

Communication Analysis in the Social Interactome
Abigail Bartolome, advised by Dr. Edward A. Fox
NIH Grant 1R01DA039456-01: The Social Interactome of Recovery: Social Media as Therapy Development
Acknowledgements to Dr. Chris Franck, Prashant Chandrasekar, Lexie Mellis
Virginia Tech CS 4994, April 2016

Text Classification
- Multinomial naïve Bayes classification considers the count of each feature name when making classifications.
- Training the classifier: built a corpus of 150 documents, 75 of which were sentences clearly indicative of a success story and 75 of which were sentences not indicative of a success story.
- Acknowledgements to Victoria Worrall for her efforts on this classifier last semester.

Network Structures
- Lattice Network, Small-world Network
- 128 participants
- 22 users in the most connected component
- 4 users in the most connected component
- Queried the Friendica database to see whom the participants wrote text to and whom they received text from.
- Generated a graph of the private messaging communication in the lattice social network.
- [Figures: Lattice Network with Administrator Removed; Small-world Network with Administrator Removed]

Samples of Story Classification
- "Since being in recovery I have not been around any drugs or alcohol but if I had to, such as a wedding or something I wouldn't have a problem saying that I don't drink or I'm in recovery." => success
- 'Drove very drunk.' => not_success
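The count-based classification described on the text classification slide can be sketched in plain Python. This is not the project's classifier: the six training sentences are hypothetical stand-ins for the 150-document corpus, the labels reuse the slide's `success` / `not_success` names, and the tokenizer and add-one smoothing are assumptions.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as features."""
    return re.findall(r"[a-z']+", text.lower())

class MultinomialNB:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_docs = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(doc):
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training sentences, NOT the project's corpus.
train = [
    ("I have been sober for two years", "success"),
    ("I am proud of my recovery", "success"),
    ("Recovery gave me my family back", "success"),
    ("I drank again last night", "not_success"),
    ("I was drunk and missed work", "not_success"),
    ("I relapsed and felt drunk", "not_success"),
]
clf = MultinomialNB().fit([t for t, _ in train], [l for _, l in train])
print(clf.predict("Drove very drunk."))
```

With this toy corpus, the slide's sample 'Drove very drunk.' is labelled `not_success`, because "drunk" appears only in the `not_success` training sentences; a real corpus would of course behave differently.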

Summary & Conclusions
Big data is a characteristic of many digital library projects:
- Diverse range of types and instances of services
- Support for tailored needs of diverse user communities
- Connection with linked open data, Semantic Web
Examples:
- Tweet & webpage collection, analysis, value-add, search, visualization
- University education data integration and assessment
- Disaster management with events => modeling, simulation, prediction
- Clinical trials data collection, analysis, for social / behavioral sciences
This can be nicely integrated with:
- Courses
- Team/student projects and research
Invitation: Contact fox@vt.edu for team project this semester (CS4624: Multimedia, Hypertext, and Information Access)