Science & Technology Centers Program Center for Science of Information Bryn Mawr Howard MIT Princeton Purdue Stanford Texas A&M UC Berkeley UC San Diego.

Slides:



Advertisements
Similar presentations
GMD German National Research Center for Information Technology Darmstadt University of Technology Perspectives and Priorities for Digital Libraries Research.
Advertisements

Information Theory EE322 Al-Sanie.
Science & Technology Centers Program National Science Foundation Science & Technology Centers Program Bryn Mawr Howard MIT Princeton Purdue Stanford UC.
Science & Technology Centers Program National Science Foundation Science & Technology Centers Program Bryn Mawr Howard MIT Princeton Purdue Stanford UC.
Database Theory: Back to the Future Victor Vianu UC San Diego / INRIA.
Huge Raw Data Cleaning Data Condensation Dimensionality Reduction Data Wrapping/ Description Machine Learning Classification Clustering Rule Generation.
FIN 685: Risk Management Topic 5: Simulation Larry Schrenk, Instructor.
What do Computer Scientists and Engineers do? CS101 Regular Lecture, Week 10.
SWE 423: Multimedia Systems
UCB Claude Shannon – In Memoriam Jean Walrand U.C. Berkeley
ToNC workshop Next generation architecture H. Balakrishnan, A. Goel, D. Johnson, S. Muthukrishnan, S.Tekinay, T. Wolf DAY 2, Feb
CPSC 695 Future of GIS Marina L. Gavrilova. The future of GIS.
Advanced Topics COMP163: Database Management Systems University of the Pacific December 9, 2008.
Automating Keyphrase Extraction with Multi-Objective Genetic Algorithms (MOGA) Jia-Long Wu Alice M. Agogino Berkeley Expert System Laboratory U.C. Berkeley.
SING* and ToNC * Scientific Foundations for Internet’s Next Generation Sirin Tekinay Program Director Theoretical Foundations Communication Research National.
Building Knowledge-Driven DSS and Mining Data
1 An Information Theoretic Approach to Network Trace Compression Y. Liu, D. Towsley, J. Weng and D. Goeckel.
Noise, Information Theory, and Entropy
1 Trends in Mathematics: How could they Change Education? László Lovász Eötvös Loránd University Budapest.
CS Machine Learning. What is Machine Learning? Adapt to / learn from data  To optimize a performance function Can be used to:  Extract knowledge.
Noise, Information Theory, and Entropy
Sublinear time algorithms Ronitt Rubinfeld Computer Science and Artificial Intelligence Laboratory (CSAIL) Electrical Engineering and Computer Science.
1 Building National Cyberinfrastructure Alan Blatecky Office of Cyberinfrastructure EPSCoR Meeting May 21,
Fundamentals Rawesak Tanawongsuwan
The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,
National Science Foundation Science & Technology Centers Program Bryn Mawr Howard MIT Princeton Purdue Stanford UC Berkeley UC San Diego UIUC Biology Thrust.
Get More Value from Your Reference Data—Make it Meaningful with TopBraid RDM Bob DuCharme Data Governance and Information Quality Conference June 9.
Bioinformatics Sean Langford, Larry Hale. What is it?  Bioinformatics is a scientific field involving many disciplines that focuses on the development.
Search Engines and Information Retrieval Chapter 1.
Tennessee Technological University1 The Scientific Importance of Big Data Xia Li Tennessee Technological University.
Science & Technology Centers Program National Science Foundation Science & Technology Centers Program Bryn Mawr Howard MIT Princeton Purdue Stanford UC.
Science & Technology Centers Program Center for Science of Information Bryn Mawr Howard MIT Princeton Purdue Stanford Texas A&M UC Berkeley UC San Diego.
Software Engineering ‘The establishment and use of sound engineering principles (methods) in order to obtain economically software that is reliable and.
Knowledge representation
Science & Technology Centers Program Center for Science of Information National Science Foundation Science & Technology Centers Program Bryn Mawr Howard.
Master Thesis Defense Jan Fiedler 04/17/98
National Science Foundation Science & Technology Centers Program Bryn Mawr Howard MIT Princeton Purdue Stanford UC Berkeley UC San Diego UIUC Biology Thrust.
RESOURCES, TRADE-OFFS, AND LIMITATIONS Group 5 8/27/2014.
Science & Technology Centers Program National Science Foundation Science & Technology Centers Program Bryn Mawr Howard MIT Princeton Purdue Stanford UC.
Science & Technology Centers Program Center for Science of Information Bryn Mawr Howard MIT Princeton Purdue Stanford Texas A&M UC Berkeley UC San Diego.
Using Bayesian Networks to Analyze Whole-Genome Expression Data Nir Friedman Iftach Nachman Dana Pe’er Institute of Computer Science, The Hebrew University.
The roots of innovation Future and Emerging Technologies (FET) Future and Emerging Technologies (FET) The roots of innovation Proactive initiative on:
Advanced Computer Networks Topic 2: Characterization of Distributed Systems.
Pascucci-1 Valerio Pascucci Director, CEDMAV Professor, SCI Institute & School of Computing Laboratory Fellow, PNNL Massive Data Management, Analysis,
Major Disciplines in Computer Science Ken Nguyen Department of Information Technology Clayton State University.
Science & Technology Centers Program National Science Foundation Science & Technology Centers Program Bryn Mawr Howard MIT Princeton Purdue Stanford UC.
Coding Theory Efficient and Reliable Transfer of Information
Information Theory in an Industrial Research Lab Marcelo J. Weinberger Information Theory Research Group Hewlett-Packard Laboratories – Advanced Studies.
Chapter 4 Decision Support System & Artificial Intelligence.
CS 127 Introduction to Computer Science. What is a computer?  “A machine that stores and manipulates information under the control of a changeable program”
Theoretic Frameworks for Data Mining Reporter: Qi Liu.
Marwan Al-Namari 1 Digital Representations. Bits and Bytes Devices can only be in one of two states 0 or 1, yes or no, on or off, … Bit: a unit of data.
Analyzing wireless sensor network data under suppression and failure in transmission Alan E. Gelfand Institute of Statistics and Decision Sciences Duke.
National Research Council Of the National Academies
1 Multiagent Teamwork: Analyzing the Optimality and Complexity of Key Theories and Models David V. Pynadath and Milind Tambe Information Sciences Institute.
GeWorkbench Overview Support Team Molecular Analysis Tools Knowledge Center Columbia University and The Broad Institute of MIT and Harvard.
COMP135/COMP535 Digital Multimedia, 2nd edition Nigel Chapman & Jenny Chapman Chapter 2 Lecture 2 – Digital Representations.
Science & Technology Centers Program Center for Science of Information Bryn Mawr Howard MIT Princeton Purdue Stanford Texas A&M UC Berkeley UC San Diego.
Artificial Intelligence: Research and Collaborative Possibilities a presentation by: Dr. Ernest L. McDuffie, Assistant Professor Department of Computer.
Modelling & Simulation of Semiconductor Devices Lecture 1 & 2 Introduction to Modelling & Simulation.
Science & Technology Centers Program National Science Foundation Science & Technology Centers Program Bryn Mawr Howard MIT Princeton Purdue Stanford UC.
Information Theory of High-throughput Shotgun Sequencing David Tse Dept. of EECS U.C. Berkeley Tel Aviv University June 4, 2012 Research supported by NSF.
The Science of Information: From Communication to DNA Sequencing David Tse Dept. of EECS U.C. Berkeley UBC September 14, 2012 Research supported by NSF.
Science & Technology Centers Program Center for Science of Information Bryn Mawr Howard MIT Princeton Purdue Stanford Texas A&M UC Berkeley UC San Diego.
Sub-fields of computer science. Sub-fields of computer science.
Center for Science of Information
Manuel Gomez Rodriguez
What contribution can automated reasoning make to e-Science?
Big Ideas in Computer Science
Emerging Frontiers of Science of Information
Presentation transcript:

Science & Technology Centers Program Center for Science of Information Bryn Mawr Howard MIT Princeton Purdue Stanford Texas A&M UC Berkeley UC San Diego UIUC Center for Science of Information STC Directors Meeting Denver, National Science Foundation/Science & Technology Centers Program

Science & Technology Centers Program 1.Science of Information 2.Center Mission & Center Team 3.Research –Structural Information –Learnable Information: Knowledge Extraction –Information Theoretic Approach to Life Sciences 2

Science & Technology Centers Program The Information Revolution started in 1948, with the publication of: A Mathematical Theory of Communication. The digital age began. Claude Shannon: Shannon information quantifies the extent to which a recipient of data can reduce its statistical uncertainty. “semantic aspects of communication are irrelevant...” Objective: Reproducing reliably data: Fundamental Limits for Storage and Communication. Applications Enabler/Driver: CD, iPod, DVD, video games, Internet, Facebook, WiFi, mobile, Google,.. Design Driver: universal data compression, voiceband modems, CDMA, multiantenna, discrete denoising, space-time codes, cryptography, distortion theory approach to Big Data. 3

Science & Technology Centers Program 4 Claude Shannon laid the foundation of information theory, demonstrating that problems of data transmission and compression can be precisely modeled formulated, and analyzed. Science of information builds on Shannon’s principles to address key challenges in understanding information that nowadays is not only communicated but also acquired, curated, organized, aggregated, managed, processed, suitably abstracted and represented, analyzed, inferred, valued, secured, and used in various scientific, engineering, and socio-economic processes..

Science & Technology Centers Program 5 Extend Information Theory to meet new challenges in biology, economics, data sciences, knowledge extraction, … Understand new aspects of information (embedded) in structure, time, space, and semantics, and dynamic information, limited resources, complexity, representation-invariant information, and cooperation & dependency. Center’s Theme for the Next Five Years: DataInformation Knowledge Framing the Foundation

Science & Technology Centers Program 1.Science of Information 2.Center Mission & Center Team 3.Research –Structural Information –Learnable Information: Knowledge Extraction –Information Theoretic Approach to Life Sciences 6

Science & Technology Centers Program 7 Research Thrusts: 1. Life Sciences 2. Communication 3. Knowledge Management (Data Analysis) S. Subramaniam A. Grama David Tse T. Weissman S. Kulkarni J. Neville CSoI MISSION :Advance science and technology through a new quantitative understanding of the representation, communication and processing of information in biological, physical, social and engineering systems. RESEARCH MISSION : Create a shared intellectual space, integral to the Center’s activities, providing a collaborative research environment that crosses disciplinary and institutional boundaries.

Science & Technology Centers Program 8

1.Science of Information 2.Center Mission & Center Team 3.Research –Structural Information –Learnable Information: Knowledge Extraction –Information Theoretic Approach to Life Sciences 9

Science & Technology Centers Program Structure: Measures are needed for quantifying information embodied in structures (e.g., information in material structures, nanostructures, biomolecules, gene networks, protein networks, social networks, financial transactions). [F. Brooks, JACM, 2003] Szpankowski, Choi, Manger : Information contained in unlabeled graphs & universal graphical compression. Grama & Subramaniam : quantifying role of noise and incomplete data, identifying conserved structures, finding orthologies in biological network reconstruction. Neville : Outlining characteristics (e.g., weak dependence) sufficient for network models to be well-defined in the limit. Yu & Qi: Finding distributions of latent structures in social networks.

Science & Technology Centers Program 11 Information Content of Unlabeled Graphs: A structure model S of a graph G is defined for an unlabeled version. Some labeled graphs have the same structure. Graph Entropy vs Structural Entropy: The probability of a structure S is: P(S) = N(S) · P(G) where N(S) is the number of different labeled graphs having the same structure. Y. Choi and W.S., IEEE Trans. Information Theory, 2012.

Science & Technology Centers Program 12

Science & Technology Centers Program 13 Theorem. [Choi, WS, 2012] Let be the code length. Then for Erd ö s-R é nyi graphs: (i)Lower Bound: (I)Upper Bound: where c is an explicitly computable constant, and is a fluctuating function with a small amplitude or zero.

Science & Technology Centers Program 1.Science of Information 2.Center Mission 3.Integrated Research –Structural Information –Learnable Information: Knowledge Extraction Information Theoretic Models for Querying Systems –Information Theoretic Approach to Life Sciences 14

Science & Technology Centers Program 15 Learnable Information (BigData): Data driven science focuses on extracting information from data. How much information can actually be extracted from a given data repository? Information Theory of Big Data? Big data domain exhibits certain properties: Large (peta and exa scale) Noisy (high rate of false positives and negatives) Multiscale (interaction at different levels of abstractions) Dynamic (temporal and spatial changes) Heterogeneous (high variability over space and time Distributed (collected and stored at distributed locations) Elastic (flexibility to data model and clustering capabilities) Complex dependencies (structural, long term) High dimensional Ad-hoc solutions do not work at scale! ``Big data has arrived but big insights have not..’’ Financial Times, J. Hartford.

Science & Technology Centers Program 16 Recommendation systems make suggestions based on prior information: Recommendations are usually lists indexed by likelihood. on server Prior search history Processor Distributed Data Original (large) database is compressed Compressed version can be stored at several locations Queries about original database can be answered reliably from a compressed version of it. Databases may be compressed for the purpose of answering queries of the form: “Is there an entry similar to y in the database?”. Data is often processed for purposes other than reproduction of the original data: (new goal: reliably answer queries rather than reproduce data!) Courtade, Weissman, IEEE Trans. Information Theory, 2013.

Science & Technology Centers Program 17 Fundamental Tradeoff: what is the minimum description (compression) rate required to generate a quantifiably good set of beliefs and/or reliable answers. General Results : queries can be answered reliably if and only if the compression rate exceeds the identification rate. Quadratic similarity queries on compressed data Distributed (multiterminal) source coding under logarithmic loss

Science & Technology Centers Program 1.Science of Information 2.Center Mission 3.Integrated Research –Structural Information –Learnable Information: Knowledge Extraction –Information Theoretic Approach to Life Sciences DNA Assembly Fundamentals (Tse, Motahari, Bresler) 18

Science & Technology Centers Program Reads are assembled to reconstruct the original genome. State-of-the-art: many sequencing technologies and many assembly algorithms, ad-hoc design tailored to specific technologies. Our goal: a systematic unified design framework. Central question: Given statistics of the genome, for what read length L and # of reads N is reliable reconstruction possible? An optimal algorithm is one that achieves the fundamental limit.

Science & Technology Centers Program (Motahari, Bresler & Tse 12) read length L # of reads N/N cov no coverage of genome many repeats of length L (2 log G) /H renyi reconstructable by the greedy algorithm reconstructable by the greedy algorithm

Science & Technology Centers Program 21 Example: Stapholococcus Aureus length i.i.d. fit data upper and lower bounds based on repeat Statistics (99% reconstruction reliability) A data-driven approach to finding near-optimal assembly algorithms. greedy K-mer improved K-mer optimized de Brujin repeat lower bound N min /N cov read length L coverage lower bound (Bresler, Bresler & Tse 12)

Science & Technology Centers Program 1.Science of Information 2.Center Mission 3.Integrated Research –Structural Information –Learnable Information: Knowledge Extraction –Information Theoretic Approach to Life Sciences Protein Folding Model (Magner, Kihara, Szpankowski) 22

Science & Technology Centers Program 23 Probability of protein folds vs. sequence rank Protein Folds in Nature

Science & Technology Centers Program protein-fold channel s f HPHHPPHPHHPPHHPH Information-Theoretic Model Optimal input distribution from Blahut-Arimoto algorithm:

Science & Technology Centers Program 25