Web 2.0 Data Analysis DANIEL DEUTCH

Data Management

"Data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets." (DAMA Data Management Body of Knowledge)

A major success: the relational model of databases.

Relational Databases

Developed by Codd (1970), who won the Turing Award for the model.
Huge success and impact:
– The vast majority of organizational data today is stored in relational databases
– Implementations include MS SQL Server, MS Excel, Oracle DB, MySQL, ...
– Two Turing Award winners (Edgar F. Codd and Jim Gray)
Basic idea: data is organized in tables (= relations).
Relations can be derived from other relations using a set of operations called the relational algebra
– On which SQL is largely based
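
To make the "tables plus algebra" idea concrete, here is a minimal sketch using SQLite; the Students/Takes schema mirrors the join example that appears later in the talk, while the actual rows and values are invented for illustration:

```python
import sqlite3

# Data organized in tables (= relations), queried with SQL,
# which is largely based on the relational algebra.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (sid INTEGER, sname TEXT)")
conn.execute("CREATE TABLE Takes (sid INTEGER, cid INTEGER)")
conn.executemany("INSERT INTO Students VALUES (?, ?)",
                 [(1, "Mary"), (2, "John")])
conn.executemany("INSERT INTO Takes VALUES (?, ?)", [(1, 101), (2, 102)])

# Selection, projection and join: the core relational-algebra operators.
rows = conn.execute("""
    SELECT Takes.cid
    FROM Students JOIN Takes ON Students.sid = Takes.sid
    WHERE Students.sname = 'Mary'
""").fetchall()
print(rows)  # [(101,)]
```

The same query can be read as a relational-algebra expression: project cid from the join of Students and Takes, after selecting the tuples where sname = "Mary".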

Research in Data(base) Management

Relational databases (tables)
– Indexing, tuning, query languages, optimizations, expressive power, ...
~20 years ago: emergence of the Web and research on Web data
– XML, text databases, the web graph, ...
– Google is a product of this research (by Stanford PhD students Brin and Page)
Recent years: hot topics include distributed databases, data privacy, data integration, social networks, web applications, crowdsourcing, trust, ...
– Foundations taken from "classical" database research
Theoretical foundations with practical impact

Web 2.0

"Old" web ("Web 1.0"): static pages
– News, encyclopedic knowledge, ...
– No, or very little, interaction between the web page and the user
Web 2.0: a term used very broadly for websites that use newer technologies (Ajax, JavaScript, ...) allowing interaction with the user
– "Network as platform" computing
– The "participatory Web"

Online shopping

Advertisements

Social Networks

Collaborative Applications

Data is all around

The web graph
The "social graph"
Pictures, videos, notifications, messages, ...
Data that the application processes
Advertisements
Even the application structure itself

(A small portion of) the web graph

News Feed

Need to Analyze

A huge amount of data out there
– Billions of web pages, and counting
– Half a billion tweets per day, and counting; an average user "sees" about 600 tweets per day
Most of it is irrelevant for you, and some of it is incorrect

Maybe we shouldn’t

Haaretz, 7/3/13

Static vs. Dynamic data

For "static" web data, the problem is addressed by search engines
– Allow users to specify search criteria (e.g., keywords) and rank the results
For "dynamic" Web 2.0 data (tweets, notifications, web applications):
– No satisfying solution yet
– Data volume is constantly increasing

Filter, Rank, Explain

Filter
– Select the portion of the data that is relevant
– Group similar results
Rank
– Rank data by trustworthiness, relevance, recency, ...
– Present the highest-ranked results first
Explain
– Why is this data considered relevant or highly ranked?
– How has the data propagated?
– "Why do I see this?"
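
A toy sketch of what filter-then-rank could look like over a set of posts; the post fields, the keyword, and the scoring weights are all invented for illustration:

```python
# Hypothetical posts with a trustworthiness score and an age in hours.
posts = [
    {"text": "basketball game tonight", "trust": 0.9, "age_hours": 2},
    {"text": "buy cheap watches",       "trust": 0.1, "age_hours": 1},
    {"text": "basketball highlights",   "trust": 0.7, "age_hours": 30},
]

def relevant(post, keyword="basketball"):
    # Filter: keep only posts matching the user's stated interest.
    return keyword in post["text"]

def score(post):
    # Rank: combine trustworthiness and recency (higher is better).
    return post["trust"] - 0.01 * post["age_hours"]

ranked = sorted(filter(relevant, posts), key=score, reverse=True)
for p in ranked:
    # Explain: say why the post is shown and how it was scored.
    print(p["text"], "-> matched 'basketball', score =", round(score(p), 2))
```

The "explain" step here is just a printed sentence, but it records exactly the two decisions made for each post: why it passed the filter and what its rank score was.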

Approach

Leverage knowledge from "classic" database research
Account for the new challenges
Do so in a generic manner
Leverage unique features such as collaborative contribution, distribution, etc.

[Slide diagram: database-system layers (physical storage, indexing, distribution, ...) around a data model and a query language; an SQL query (SELECT ... FROM ... WHERE ...) shown next to its relational-algebra plan joining Students, Takes and Courses on sid and cid, selecting sname where name = "Mary"]

Foundations

Model
Query language
Query evaluation algorithms
Prototype implementation and optimizations
Getting data and testing

Two concrete examples

Analysis of Web Applications
Analysis of Social Networks

Analysis of Web Applications

Analysis Questions

For the application owner:
– Can the user make a reservation without giving credit card details?
– What is the probability that a user chooses a category for which there are no available products?
For the application user:
– How can I get the best deals (by navigating)?
– How can I navigate in an optimal way (minimal number of clicks)?

ShopIT - Shopping Assistant

Modeling

The model should be declarative and intuitive
An abstract representation allows readability, portability, applicability, optimizations, ...
There is a tradeoff between expressivity and complexity

Application Model

Effects guiding the execution

Dynamic effects, such as the choice of which link to follow
– A probability/cost value can be associated with every such choice
– Cumulative: if the same choice is made twice, the cost is paid twice
Static effects, such as the available products in the database
– Static in the sense that they are constant within a single execution
– Thus non-cumulative
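
The cumulative/non-cumulative distinction can be sketched in a few lines of Python; the event names and the cost values below are invented for illustration:

```python
# Dynamic effects: per-click costs, charged every time the click is made.
link_cost = {"click_category": 1.0, "click_product": 2.0}
# Static effects: facts about the database, constant within one execution.
static_penalty = {"empty_category": 5.0}

def execution_cost(clicks, static_effects):
    # Cumulative: the same choice made twice is paid twice.
    cost = sum(link_cost[c] for c in clicks)
    # Non-cumulative: the database does not change during a single
    # execution, so each static effect is charged at most once.
    cost += sum(static_penalty[e] for e in set(static_effects))
    return cost

print(execution_cost(
    ["click_category", "click_category", "click_product"],
    ["empty_category", "empty_category"]))  # 1 + 1 + 2 + 5 = 9.0
```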

Filter, Rank, Explain

Filter
– Choose "interesting" navigation sequences
Rank
– Rank different sequences by length, purchase cost, ...
Explain
– Why are these the best sequences? What do they entail in terms of purchases? How can they be refined?

Query Language

Desired executions/navigation sequences are typically expressed in temporal logic.
Database analysis is typically done through a first-order-based query language (e.g., SQL).
A query language for data-dependent applications should combine both, while allowing for efficient evaluation: LTL-FO.
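
As an illustration (the predicates and the property itself are assumed here, not taken from the slides), an LTL-FO formula mixes a temporal operator such as "globally" with first-order quantification over the data:

```latex
% Hypothetical property: at every point of the execution, if the user
% is at checkout, then every product in the cart is in stock.
\mathbf{G}\;\big(\mathrm{Checkout} \;\rightarrow\; \forall p\,(\mathrm{InCart}(p) \rightarrow \mathrm{InStock}(p))\big)
```

Evaluating such a formula needs both temporal-logic machinery (for the G operator over navigation sequences) and database-style query evaluation (for the quantified, data-dependent part).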

Query Evaluation

Problem: given a process specification and a query, evaluate the query
– Find all executions that satisfy the criteria
– Find the probability that the criteria are satisfied
– Find the minimal-cost execution realizing the criteria
– ...

Interesting problems are hard (sometimes)

The corresponding Boolean problems are NP-hard
– It is very unlikely that they can be solved efficiently for worst-case input
This is often the case when modeling real-life problems
Hardness is to be considered, but not to be feared!

What do you do with a theoretically hard problem?

Find restricted, yet realistic, cases that are "easy", i.e., admit efficient worst-case algorithms
– This also gives a better understanding of where the hardness lies
– Example: if there are only "few" static choices, the problem becomes easy
Design optimization techniques for practically efficient implementations

Optimization

One idea is based on the notion of data provenance
– Originally developed for "classic" database queries, provenance is a concise representation of how the query computed each result
Here we generalize the approach to temporal queries
– This allows for a generic approach and many optimizations
A system prototype for provenance-aware analysis of processes is under development
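
One common way to make provenance concrete (this sketches the general "provenance token" idea, not the prototype mentioned above; the relations and tokens are invented) is to annotate each input tuple with a symbolic token and record how the query combines tokens:

```python
# Each input tuple carries a provenance token; a join uses two tuples
# together, so the result's annotation combines both tokens.
R = [((1, "a"), "r1"), ((2, "b"), "r2")]   # (tuple, provenance token)
S = [(("a", 10), "s1"), (("a", 20), "s2")]

joined = []
for (x, y), p in R:
    for (y2, z), q in S:
        if y == y2:
            # The output tuple is derived from both inputs: record p*q.
            joined.append(((x, z), f"{p}*{q}"))

print(joined)
# [((1, 10), 'r1*s1'), ((1, 20), 'r1*s2')]
```

The annotation tells us, for every result, exactly which input tuples produced it, which is the information an "explain" step or an incremental optimization can reuse.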

How do we get data?

Some applications (e.g., Yahoo! Shopping) allow a (limited) interface to their database.
The application structure can be obtained by crawling.
This allows us to show that the model is realistic, and to test the feasibility of the algorithms at real-life data scale.

Open problems

Uncertainty and partial knowledge
Data updates by the application
Incremental maintenance
– How can modifications to the database be accounted for without restarting the analysis?

Analysis of Social Networks

The common model of disseminating data in social networks is "push" mode
– Facebook's news feed, Twitter's tweets, ...
A huge load of information, and it continues to grow
There is no standard way of searching and ranking such posts

A Machine Learning Approach

"Learn" users' preferences
– Ask them whether they like a particular post/tweet, or use existing mechanisms
Try to generalize by extracting characteristic features (e.g., topics)
Use the generalized function to predict future preferences
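
A minimal sketch of this learn-then-predict loop; the features are just word occurrences, the learner is a toy perceptron, and all posts and labels are invented for illustration:

```python
# Labeled posts: 1 = the user liked it, 0 = the user did not.
labeled = [("great basketball game", 1), ("celebrity gossip", 0),
           ("basketball finals tonight", 1), ("gossip and rumors", 0)]

weights = {}                             # one weight per word feature
for _ in range(10):                      # a few passes over the data
    for text, like in labeled:
        score = sum(weights.get(w, 0) for w in text.split())
        pred = 1 if score > 0 else 0
        for w in text.split():           # update weights on mistakes only
            weights[w] = weights.get(w, 0) + (like - pred)

def predict(text):
    # Generalized preference function: applied to future posts.
    return sum(weights.get(w, 0) for w in text.split()) > 0
```

After training, `predict("basketball tomorrow")` is True while `predict("celebrity gossip")` is False: the learner has generalized from the labeled examples to a topic-like feature.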

A data management approach

Allow users to tell the system their preferences through a "query" language.
One family of such languages is rule-based languages
– Prolog, Datalog, ...
Simpler than standard programming languages, which allows for easier analysis

Rule-based language (Logic Programming)

descendent(X,Y) :- parent(X,Y)
descendent(X,Y) :- parent(X,Z), descendent(Z,Y)
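
A naive bottom-up evaluation of these two rules can be written directly in Python: apply the rules repeatedly until no new facts are derived (a fixpoint). The family facts are invented for illustration:

```python
parent = {("alice", "bob"), ("bob", "carol")}

# Rule 1: descendent(X,Y) :- parent(X,Y)
descendent = set(parent)

changed = True
while changed:
    changed = False
    # Rule 2: descendent(X,Y) :- parent(X,Z), descendent(Z,Y)
    for (x, z) in parent:
        for (z2, y) in list(descendent):
            if z == z2 and (x, y) not in descendent:
                descendent.add((x, y))
                changed = True

print(sorted(descendent))
# [('alice', 'bob'), ('alice', 'carol'), ('bob', 'carol')]
```

Because the rules are recursive, the second rule fires again on facts it derived earlier; the loop stops exactly when a full pass adds nothing new, which is what makes rule-based programs amenable to automatic analysis.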

For Twitter

See(tweet) :- Tweets(Tova, tweet)
See(tweet) :- Tweets(x, tweet), related(x, "basketball")
Attending(party) :- count[friend(x), like(x, party)] > 10

Learning

Unlike standard database querying, some of the predicates may be difficult to evaluate
– How do we know that a post is related to basketball?
What can we do?

What can we do?

Text analysis
– Problem: an "I love basketball" tweet is not necessarily a good match
Use tagging where it exists
Ask other peers in the network!
– Peers can also be used to propose predicates to users

Influence

Predict effects and allow users to influence friends
– "How do I get at least 5 of my friends to come to the party with me?"
  Need them to (1) see the party, (2) click "attending"
– "How do I get at least 1000 people to go to a political rally?"
Such questions can be answered by an analysis of the rules!

Interface for Specifying Rules

We cannot expect Facebook/Twitter users to be able to program
– We need to design an interface
The interface components can be designed dynamically, according to answers by other peers
This is ongoing research

Conclusions

Web 2.0 data poses new research challenges
– The data is dynamic, the access mode is unique, and the analysis goals differ from those of classic databases
Lots of research on theoretical questions with strong practical applications

Questions?