BIG DATA: Challenges & Opportunities
Lei Chen
Outline
Background: “Big data” is a term acknowledging the exponential growth, availability and use of …
Challenges: “Big data” poses fundamental challenges for data capture, storage, analysis …
Opportunities: Many applications can benefit from “Big data” …
Background: We are capturing more data
Super-exponential growth in data volume

We are capturing more data than we can handle: high-resolution images, emerging sensor readings, social activities on the Internet … Petabytes of data are generated every day from satellite imagery, mobile stations, distributed sensor networks, geographical plotting …
Copyright belongs to “Data Analysis Challenges”, JSR-08-142, Dec
Background: We are using more data
[Figures: Intelligent transportation; Digital health care]

As more and more data are captured, we are able to do more with them. For example, data are used to rebuild or simulate real-world events. In medical care, sensors that capture abundant information about the human body let us monitor and diagnose patients remotely in real time. As another example, consider vehicles equipped with a device that captures not only GIS information but also the trajectories of nearby vehicles; these data can help vehicles plan the optimal path and avoid accidents.
Background: We need quick processing of the data
[Figures: Volcano monitor; Hurricane moving path prediction]

In the big data era, we want large volumes of data to be processed quickly. Sometimes the data are incomplete, but they are still useful, and we need to make proper decisions based on fast processing of large volumes of incomplete information. For example, when there is a hurricane, we need to predict its path and estimate its impact in order to make an evacuation plan. Volcano monitoring is similar: we can observe abnormal activity only from the outside, so all sensor measurements are incomplete data from which we must predict whether that activity will result in an eruption.
Background: We are exploring the unknowns with different means of data measurements
[Figures: Exploring the universe; Ocean science]

In addition to the exponential growth of data volume within particular fields, we are combining cross-disciplinary data to explore the unknowns. For example, we study the ocean not only with satellite images and sensor monitoring, but also with biological studies and chemical analysis. Today we explore the universe not only with telescope images and captured cosmic microwave background radiation, but also with chemical and physical analysis of the spectrum. To summarize, we have so much data describing different aspects of this world that we need to combine or integrate them to describe the world better.
Background: We are discovering new rules from data
[Figure: The well-formed.eigenfactor project visualizes information flow in science. This diagram shows the citation links of the journal Nature.]

Data mining techniques are adapted to discover new rules from data of incredible size and of high diversity in source, representation and quality. For example, this picture shows the citation links of the journal Nature. It represents how different fields of scientific research are affected by this journal, which in turn hints at the structure and growth of the academic research world.
Copyright belongs to http://well-formed.eigenfactor.org
Background: Defining Big Data
Wiki: Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics and visualizing.
Gartner (2011): Big data is a popular term used to acknowledge the exponential growth, availability and use of information in the data-rich landscape of tomorrow.

There is still no agreed definition of big data, but we can capture its essential characteristics from existing descriptions. The wiki definition emphasizes volume and the awkward situation that no existing database solution can handle the big data problem, whose difficulties fall mainly into capture, storage, search, sharing, analytics and visualizing. Gartner emphasizes the exponential growth of the data and, more importantly, the utility of big data.
Background: Features of Big Data
3V: Variety, Velocity and Volume

Gartner summarizes the essential features of big data as the 3Vs:
Variety: the ability to handle heterogeneous data sources, representations and quality
Velocity: the ability to capture and analyze data with performance guarantees
Volume: the ability to scale out the storage whenever there is a data allocation requirement
Challenges
[Diagram labels, current data service infrastructure: Applications; Data Processing (processing language, optimization, visualization); Data Model (interpretation, representation): <key,vals>, Object, E-R, Hierarchical; Storage (reliability, scalability, availability); Network Topology; Data Extraction (acquisition, integration, representation)]

In the current data service infrastructure, data are stored on storage nodes connected in a certain network topology. The data are interpreted through different data models and, accordingly, particular data processing tools or APIs are provided for the applications on top. For big data, however, the current storage network is not ready for the tremendous pace of data growth and updates. There are constraints on network traffic volume and on the cost of adding storage nodes and handling failures. Because data come from heterogeneous sources, a single or simply hierarchical data interpretation does not work, and new tools are needed to adapt to the 3V features of big data.
Challenges: Data model challenges
Volume: scale up, scale out, and scale in
Velocity: “interactive” properties to facilitate processing
Variety: simple but unified, to adapt to heterogeneity
Existing data models (<key,vals>, Object, E-R, Hierarchical) are not satisfactory: functionality vs. simplicity

How the 3Vs challenge the data model.
For volume: scale up/down refers to the vertical dimension, meaning increasing or decreasing the processing power of, or the workload on, individual machines. Scale out/in refers to the horizontal dimension, meaning adding machines to increase capacity or limiting the computation to only a few nodes.
For velocity: since the underlying data may come from different sources or even different fields, the data model must be adaptive enough that data from various sources can be processed effectively. This requires the data model to be flexible and interactive, i.e., it must allow upper-level processing to be easily defined and performed.
For variety: since data can come from various sources, the challenge is to keep the model simple yet able to accommodate all the heterogeneity.
None of the existing data models can be applied directly. A new tradeoff between functionality and simplicity must be found.
Challenges: Storage challenges
Storage concerns:
Reliability: data is safe and trustworthy
Availability: data is accessible
Scalability: data operation performance does not decay as data size grows
However, the CAP theorem is the bottleneck. No one-for-all solution exists.

For storage, big data does not pose fundamentally new challenges, because the same challenging problems were already identified for large-scale distributed storage. A storage system is concerned with reliability, availability and scalability. Reliability mainly consists of two parts: fault tolerance and consistency. However, the CAP theorem limits what can be achieved.
Challenges: Storage challenges
CAP Theorem: Consistency, Availability, Partition tolerance

C, A and P cannot all be satisfied at the same time. In the figure, node A updates v from v0 to v1 and then a network partition separates A from B. If the system must keep operating despite the partition, it has to give up one of the other two properties: either B refuses to serve v until it can reach A again, so availability is lost, or B serves the old value v0, so consistency is lost.
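
The figure's scenario can be sketched as a toy simulation. This is a minimal illustration under stated assumptions, not a model of any real storage system: the class names `TwoNodeStore` and `Unavailable`, the mode labels "CP" and "AP", and the choice to let A keep accepting writes during the partition are all hypothetical simplifications.

```python
class Unavailable(Exception):
    """Raised when a replica refuses to answer during a partition."""


class TwoNodeStore:
    """Two replicas (A and B) of a single variable v.

    mode="CP": during a partition, B refuses reads (keeps consistency).
    mode="AP": during a partition, B answers with its local copy (keeps availability).
    """

    def __init__(self, mode, initial="v0"):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value_at_a = initial
        self.value_at_b = initial
        self.partitioned = False

    def partition(self):
        # The link between A and B goes down.
        self.partitioned = True

    def heal(self):
        # The link comes back and B catches up with A.
        self.partitioned = False
        self.value_at_b = self.value_at_a

    def write_at_a(self, value):
        # Simplifying assumption: A always accepts the write locally.
        self.value_at_a = value
        if not self.partitioned:
            self.value_at_b = value  # replicate while the link is up

    def read_at_b(self):
        if self.partitioned and self.mode == "CP":
            raise Unavailable("B cannot confirm the latest value of v")
        return self.value_at_b  # possibly stale in AP mode


def run_scenario(mode):
    """Partition first, then update v at A, then read v at B."""
    store = TwoNodeStore(mode)
    store.partition()
    store.write_at_a("v1")
    try:
        return store.read_at_b()
    except Unavailable:
        return "unavailable"
```

Calling `run_scenario("CP")` and `run_scenario("AP")` shows the two outcomes described above: B either gives no answer or gives the old value.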
Challenges: Storage challenges
ACID vs. BASE
RDBMS (ACID): Atomic, Consistent, Isolated, Durable
NoSQL (BASE): Basically Available, Soft-state, Eventually consistent
[Chart: storage systems placed by their choices among C, A and P. Groups shown: RDBMS; BigTable, HyperTable, HBase, MongoDB, Redis, Scalaris, etc.; Dynamo, CouchDB, Cassandra, SimpleDB, Tokyo Cabinet, Riak, Voldemort, etc.]

Thus there are two storage design methodologies, ACID and BASE. In industry, people like to categorize them as RDBMS-like and NoSQL-like. The chart shows how the most prevalent storage systems choose among C, A and P. Some of them, for example Bigtable, are designed to handle large volumes of relatively large files. However, these kinds of systems support only a very limited set of data processing applications. For big data, there may be no perfect solution either. We believe the NoSQL-like design fits big data better, as scalability and availability could be the primary concerns. However, BASE still needs to be strengthened, as big data also emphasizes utility and throughput performance.
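
To make "eventually consistent" concrete, here is a small sketch in which replicas accept writes independently and converge after exchanging state. It uses a last-writer-wins rule with logical timestamps, which is only one of several possible reconciliation strategies and is chosen here purely for illustration; the class name `LwwReplica` and the assumption that every write carries a unique timestamp are hypothetical.

```python
class LwwReplica:
    """A replica that resolves conflicts by keeping the highest timestamp.

    Assumption: timestamps are unique logical counters supplied by the caller.
    """

    def __init__(self):
        self.store = {}  # key -> (value, stamp)

    def put(self, key, value, stamp):
        current = self.store.get(key)
        # Keep the incoming write only if it is newer than what we hold.
        if current is None or stamp > current[1]:
            self.store[key] = (value, stamp)

    def get(self, key):
        current = self.store.get(key)
        return None if current is None else current[0]

    def sync_from(self, other):
        # Anti-entropy step: pull every entry from the other replica.
        for key, (value, stamp) in other.store.items():
            self.put(key, value, stamp)


def converge(a, b):
    """Exchange state in both directions so both replicas agree."""
    a.sync_from(b)
    b.sync_from(a)
```

Between synchronizations the replicas may return different values for the same key (the "soft state" of BASE); after `converge` they agree.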
Challenges: Management challenges
“Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data” Gartner (2011)
Big data management:
Functionality: indexing & partition
Flexibility: adaptation to new requirements and new components

The challenge of big data does not lie only in storing the huge volume of data; we also want to manage it effectively. In general, the 3V features require the management layer to provide both rich functionality and flexibility. Rich functionality makes high velocity (performance) achievable, and variety cannot be supported without adaptive updates of the management system.
Challenges: Management challenges
E.g., indexing over big data
Volume: large volume of data captured every time unit. Requires: distributed adaptive index. Leads to: significant cost of metadata exchange.
Variety: data captured from different sources. Requires: distributed adaptive index. Leads to: ambiguity in indexing the same object.

We take indexing big data as an example and consider two features, volume and variety. For each feature, the first item is what we want, the second is the state-of-the-art implementation technique, and the third is the inevitable cost or undesired problem. For example, to index a large volume of data we must use a distributed adaptive index; however, maintaining the index is costly given the frequency and volume of updates.
Challenges: Challenges on processing
New query language (algebra), for efficiency & effectiveness
Desired: Flexibility. Sacrifices & overhead: complexity in data modeling
Desired: “Relational” support. Sacrifices & overhead: poor scalability
Desired: “Uncertain” support. Sacrifices & overhead: poor scalability and significant computing overhead
Desired: Scalability. Sacrifices & overhead: less functionality

After all, well-defined data processing operations are what we want. However, current processing tools (or languages) may not be applicable. The table above summarizes the desired features of a new query language for big data; each comes with sacrifices and extra overhead. An optimal tradeoff must be found.
Challenges: Challenges on processing
New computing paradigm for processing
Distributed computing paradigm: Message Passing. Limitations: poor scalability and fault tolerance
Distributed computing paradigm: Unified Access. Limitations: efficiency not validated over large numbers of computing nodes
Distributed computing paradigm: MapReduce. Limitations: poor functionality

Besides the query language, current computing paradigms need to be re-explored too. The three widely acknowledged distributed computing paradigms are message passing, unified access and MapReduce. Although the first two have advantages in handling complex computing processes and in computing efficiency, they are not easy to maintain and are not fault tolerant. MapReduce is popular for its great scalability and fault tolerance, but suffers from a naïve parallelism framework that limits its functionality.
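
The MapReduce paradigm mentioned above can be sketched on a single machine as three steps: map each record to key-value pairs, group the pairs by key, and reduce each group. The sketch below is an illustrative toy, not a distributed implementation; the function names and the word-count example are chosen for illustration. It also hints at the limited functionality noted above, since every job must be phrased as one mapper and one reducer.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    # Emit (word, 1) for every whitespace-separated word, lowercased.
    return [(word.lower(), 1) for word in line.split()]


def sum_reducer(key, values):
    return sum(values)


def word_count(lines):
    """Word count expressed as a map-shuffle-reduce job."""
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
```

In a real deployment the map and reduce calls would run on different nodes and the shuffle would move data across the network; here everything runs in one process to keep the structure visible.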
Challenges: Challenges on processing
New optimization methodology
[Diagram labels: Load Balance; Data Locality; High Parallelism; Merging Cost; Less Network I/O; Replicated Computing]

These conflicting optimization goals have existed in distributed computing for a long time. We do not believe there is a one-for-all solution; therefore, the optimization methodology must be considered case by case, depending on the application.
Opportunities: Why “Big Data”?
We are empowered to learn knowledge and process information more accurately, effectively and efficiently.
[Diagram: Big Data linked to Natural Science Study, Fundamental Scientific Research, Social Civilization, Daily Life]

Big data is affecting every aspect of our world. The ways we think, communicate, live and work have changed significantly. With more and more data available from different sources, we are able to understand the world better and make our lives more interactive and efficient. We will elaborate on how big data helps us face nature, speeds up fundamental scientific research, accelerates social civilization and improves our lives.
Opportunities: Big Data for natural science study
E.g., natural disaster forecasting and management
[Diagram labels: Flood; Earthquake; Extreme Weather; Meteorological data; Geographic data; Forecasting; Management; Population, transportation, urban design data; Economic data]

Natural disasters still threaten our lives. With more data available, not only can we send out alerts effectively but, more importantly, we can deploy disaster relief more promptly and effectively. Population, transportation and similar information can help us estimate the damage more accurately. Thus we can make proper decisions in disaster management.
Opportunities: Big Data for fundamental scientific research
E.g., bioinformatics and medicine
[Figure: The mutually reinforcing relation between gene technology and clinical medicine]

The medical care industry has grown significantly over the decades. It benefits greatly from gene technology, which performs gene sequencing and function detection. Likewise, the huge amount of clinical medicine data helps gene research identify and understand how these mysterious codes work.
Opportunities: Big Data for social civilization
Light-speed information spreading & enormous knowledge
Quick event detection
Easy collaboration: Wondering where to get a really good cup of coffee? JUST tweet your question!!

The way people communicate has changed significantly. Instead of chat-style communication, the whole world is posting statuses and tweeting news. We learn that something has happened at the very first moment from the web and social networks. With all these different sources of data, we can easily keep ourselves up to date on the things we are interested in. We can learn about any new event as long as people talk about it on the web. Moreover, this communication style lets everyone take advantage of social computing. Want some nice coffee but not sure where to get it? Instead of searching on Google, you can just tweet your question on the web…
Opportunities: Big Data for daily life
Our life can be much easier with more data…
E.g., trip planning
Predefined request: travel to Beijing; 3-day stay; budget < $1000; Forbidden City; 10am meeting every day
Real-world incidents: traffic jam; luggage delay; bad weather
Adaptive agenda: updated as incidents occur

Big data can make our lives much easier. With multiple sources, or even cross-disciplinary data, we can develop intelligent traffic systems, intelligent cities, and deeply customized personal services. Take trip planning as an example. Before we travel, some things are set as targets: e.g., a professor travels to Beijing for a 3-day conference and wants to visit the Forbidden City. There may also be a budget limit. However, some things can easily go out of his control, such as traffic jams, luggage delays and terrible weather. A trip planning service should then be able to update his schedule promptly by combining and analyzing data from different sources.
Opportunities: Opportunity highlights
Volume: capturing, storing and analyzing data helps us better understand the world
Velocity: guaranteed effective & efficient data processing
Variety: handling heterogeneous sources of data
Considering all the challenges and constraints, perhaps there is no one-for-all solution. However, application-dependent “Big Data” solutions are promising.

Assuming all 3V features are satisfied, big data has the potential to significantly cut operating costs across all sectors of manufacturing. The reasons are as follows: cross-disciplinary data are shared for easy analysis; data of different qualities are integrated to provide better service; the flexibility of big data management means one investment can serve long-term business growth; moreover, a pay-as-you-go model is applicable.
Opportunities: Applications
Heterogeneous data management
Search doctors
Search universities (in progress)
[Diagram, Search Doctors: data integration from web pages on the Internet, hospital databases, search results from general-purpose search engines, and news / rumors; Data Extraction; Integrated Database (~500,000 doctors & ~30,000 hospitals from 50+ GB of source data); OLAP; Query Processing]

Challenges:
1) Data are collected from a number of different sources, which are usually unknown and unbounded, and arrive in many varying formats.
2) It is challenging to uniquely identify certain objects. For instance, in these projects two different persons may have identical names and locations, so it is difficult to integrate one person's information into the existing data using only his or her name.

How we address the challenge: we assume the data follow certain semantic patterns. This assumption turned out to hold in our experiments. We apply CRF (conditional random field) to extract the expected information from heterogeneous web data and store it in a probabilistic database.

How we query: because the data retrieved from heterogeneous sources are stored in a probabilistic database, queries can be optimized by existing probabilistic query processing techniques.
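
The idea of querying extracted records that carry a confidence can be sketched as follows. This is only an illustration of a probability-aware lookup: it does not implement CRF extraction or any particular probabilistic database engine, and the record type `ExtractedDoctor`, its fields, the function `query_doctors` and the threshold parameter are all hypothetical.

```python
from typing import List, NamedTuple


class ExtractedDoctor(NamedTuple):
    """One extracted record with the extractor's confidence attached."""
    name: str
    city: str
    hospital: str
    probability: float  # assumed confidence in [0, 1] that the record is correct


def query_doctors(
    records: List[ExtractedDoctor],
    name: str,
    city: str,
    min_probability: float = 0.0,
) -> List[ExtractedDoctor]:
    """Return matching records, most probable first.

    Records sharing a name and city are all returned, because the name alone
    cannot tell whether they describe the same person.
    """
    hits = [
        r for r in records
        if r.name == name and r.city == city and r.probability >= min_probability
    ]
    return sorted(hits, key=lambda r: r.probability, reverse=True)
```

With made-up records, two doctors who share a name and a city both come back, ranked by confidence, which mirrors the identification ambiguity described in challenge 2; raising `min_probability` trades recall for precision.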