Cyberinfrastructure Supporting Social Science Cyberinfrastructure Workshop
October, Chicago
Geoffrey Fox
Informatics, Computing and Physics
Indiana University Bloomington
Goal of Day
- Come up with a few (3-5) projects that advance Social Sciences Cyberinfrastructure
- Choose them so that together they cover the spectrum of characteristics

Characteristics: A B C … Z
Project 1: X X X
Project 2: X X X
…
Project N: X X
Data Type
- What is large? #Collections v. Collection Size v. #Users
- “Big (Social) Science” v. Long Tail
- # rows v. # columns v. time dependence
- Structured (defined) v. unstructured (inferred/discovered) metadata
- Granularity of metadata
- Data modality: Streaming, video, image, text, “binary” – vector space or not (genomics, network)
- Distributed v. centralized data (production/storage/processing)
- Complex objects v. tables
- Observed v. simulation or modeling
Data Nature (“ilities”)
- Open data
- Sharable
- Data Publication model / Data citation models? – DOI or Handle
- Reproducibility
- Sustainability
- Standards
- Management
- Integration
- Dramatic change in next 10 years
- Data availability as in Public Windy Grid
Mining/Analyzing data
- Access: role of community comments, crowd sourcing
- Processing: “Simple” statistics, linkage software, data visualization, GIS, analytics (SVM, LDA, Clustering...); (new) management tools
- Data Mining (discovering the unexpected) v. Data Analysis (discovering with excellence the ~expected)
- Modeling for data components and regression
- More data v. more/better algorithms (in simulation, algorithm advances ~ as important as machine advances)
- Programming model: Excel, SQL, R, SPSS, Other Scripting, MapReduce, "Fortran/C++/Java", Libraries, workflow, portal/gateway
- Open software & sustainability of it
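Note on the MapReduce programming model listed above: a minimal single-process sketch in plain Python, included only to show the shape of the model (map, then group by key, then reduce). The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_mapper`, `sum_reducer`, `word_count`) are hypothetical and are not taken from the slides or from any MapReduce framework; a real deployment would distribute these phases across machines.

```python
from collections import defaultdict


def map_phase(documents, mapper):
    """Apply the mapper to every document and stream out (key, value) pairs."""
    for doc in documents:
        yield from mapper(doc)


def shuffle(pairs):
    """Group all values that share a key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(doc):
    # Illustrative mapper: emit (word, 1) for each whitespace-separated token.
    for word in doc.lower().split():
        yield word, 1


def sum_reducer(key, values):
    # Illustrative reducer: total the counts emitted for one word.
    return sum(values)


def word_count(documents):
    pairs = map_phase(documents, word_mapper)
    return reduce_phase(shuffle(pairs), sum_reducer)


if __name__ == "__main__":
    print(word_count(["open data", "open software"]))
```

The analyst writes only the mapper and the reducer; grouping and scheduling are left to the framework, which is what separates this model from the scripting and library options in the same list.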
Security & Privacy
- Support sharing
- The law
- Risk of identification, harm from disclosure
- Differential Privacy and nifty obfuscation ideas
- IRB
- Federated Identity
- Enclave
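Note on Differential Privacy: one standard way to release a count with epsilon-differential privacy is the Laplace mechanism, which adds Laplace noise with scale sensitivity/epsilon to the true answer. The sketch below assumes a simple counting query (sensitivity 1). The names `laplace_noise` and `private_count` and the sample ages are hypothetical, chosen for illustration; the slides do not specify any particular mechanism.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records matching predicate.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    ages = [23, 35, 41, 67, 70, 18]  # made-up illustrative data
    noisy = private_count(ages, lambda age: age >= 65, epsilon=0.5)
    print(f"noisy count of respondents aged 65+: {noisy:.2f}")
```

A smaller epsilon gives stronger privacy and noisier answers, which is the trade-off against the risk of identification and harm from disclosure listed on this slide.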
The Infrastructure
- Repository/Archive v. Active (compute + storage) data
- Bring Computing to data
- Commercial Clouds v. XSEDE v. University
- Local v. cloud v. department/university
- Distributed (Federated) clouds as collections distributed
- DropBox, Google docs, Skype etc. v. customized
- Generality of DuraCloud, Dataverse, DataUp etc.
- Tool repository/library
- Cloudbursting (public-private hybrid cloud)
- Connectivity to cloud (can be addressed by I2?)
- Backup v. Main Home
Other Characteristics
- Satisfying NSF Data Management requirements
- Breadth of applicability of solutions
- # Organizations collaborating on project
- Interdisciplinary collaborations
- Data (science) Curricula
- Relation to issues in other fields
- Support and Governance
- Industry ahead of Academia