Defining Data-intensive computing

Slides:



Advertisements
Similar presentations
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Advertisements

Text mining Extract from various presentations: Temis, URI-INIST-CNRS, Aster Data …
What is the Internet? Internet: The Internet, in simplest terms, is the large group of millions of computers around the world that are all connected to.
1 genSpace: Community- Driven Knowledge Sharing for Biological Scientists Gail Kaiser’s Programming Systems Lab Columbia University Computer Science.
WebMiningResearch ASurvey Web Mining Research: A Survey Raymond Kosala and Hendrik Blockeel ACM SIGKDD, July 2000 Presented by Shan Huang, 4/24/2007.
Mastering the Internet, XHTML, and JavaScript Chapter 7 Searching the Internet.
Searching with Lucene Chapter 2. For discussion Information retrieval What is Lucene? Code for indexer using Lucene Pagerank algorithm.
A Topic Specific Web Crawler and WIE*: An Automatic Web Information Extraction Technique using HPS Algorithm Dongwon Lee Database Systems Lab.
How Search Engines Work Source:
Web Data Management Dr. Daniel Deutch. Web Data The web has revolutionized our world Data is everywhere Constitutes a great potential But also a lot of.
Information Retrieval
Topics Problem Statement Define the problem Significance in context of the course Key Concepts Cloud Computing Spatial Cloud Computing Major Contributions.
Overview of Web Data Mining and Applications Part I
WEB SCIENCE: SEARCHING THE WEB. Basic Terms Search engine Software that finds information on the Internet or World Wide Web Web crawler An automated program.
What’s The Difference??  Subject Directory  Search Engine  Deep Web Search.
Copyright © 2014 Pearson Education, Inc. 1 It's what you learn after you know it all that counts. John Wooden Key Terms and Review (Chapter 6) Enhancing.
IDK0040 Võrgurakendused I Building a site: Publicising Deniss Kumlander.
Operational Data Tools Chapter Eight. Copyright © Houghton Mifflin Company. All rights reserved.8–28–2 Chapter Eight Learning Objectives To learn database.
Digital Learning Spaces Leanne Shultz March 2006.
Copyright © 2014 Pearson Education, Inc. 1 It's what you learn after you know it all that counts. John Wooden Key Terms and Review (Chapter 5)
Search Engines and Information Retrieval Chapter 1.
Build a Free Website1 Build A Website For Free 2 ND Edition By Mark Bell.
Copyright © Allyn & Bacon 2008 POWER PRACTICE Chapter 7 The Internet and the World Wide Web START This multimedia product and its contents are protected.
Web 2.0: Concepts and Applications 6 Linking Data.
Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.
Using Hyperlink structure information for web search.
Web Data Management Dr. Daniel Deutch. Web Data The web has revolutionized our world Data is everywhere Constitutes a great potential But also a lot of.
ITIS 1210 Introduction to Web-Based Information Systems Chapter 27 How Internet Searching Works.
Aquenergy Portal Elisabetta Zuanelli, University of Rome “Tor Vergata”, Italy E-Age 2014 Muscat december.
It is impossible to guarantee that all relevant pages are returned (even inspected) (Figure 1): Millions of pages available, many of them not indexed in.
McLean HIGHER COMPUTER NETWORKING Lesson 7 Search engines Description of search engine methods.
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
PEERSPECTIVE.MPI-SWS.ORG ALAN MISLOVE KRISHNA P. GUMMADI PETER DRUSCHEL BY RAGHURAM KRISHNAMACHARI Exploiting Social Networks for Internet Search.
Chapter 5: Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization DECISION SUPPORT SYSTEMS AND BUSINESS.
WEB MINING. In recent years the growth of the World Wide Web exceeded all expectations. Today there are several billions of HTML documents, pictures and.
Internet Architecture and Governance
Understanding Search Engines. Basic Defintions: Search Engine Search engines are information retrieval (IR) systems designed to help find specific information.
OWL Representing Information Using the Web Ontology Language.
Of 33 lecture 1: introduction. of 33 the semantic web vision today’s web (1) web content – for human consumption (no structural information) people search.
Google search in general  Google Search, commonly referred to as Google Web Search or just Google, is a web search engine owned by Google Inc. It is.
Semantic Web COMS 6135 Class Presentation Jian Pan Department of Computer Science Columbia University Web Enhanced Information Management.
A Portrait of the Semantic Web in Action Jeff Heflin and James Hendler IEEE Intelligent Systems December 6, 2010 Hyewon Lim.
Knowledge management By Dhanalakshmi. Contents  Knowledge & knowledge management  Knowledge creation process  Knowledge management system  Knowledge.
Web Design Terminology Unit 2 STEM. 1. Accessibility – a web page or site that address the users limitations or disabilities 2. Active server page (ASP)
Definition, purposes/functions, elements of IR systems Lesson 1.
Chapter 8: Web Analytics, Web Mining, and Social Analytics
Glencoe Introduction to Multimedia Chapter 2 Multimedia Online 1 Internet A huge network that connects computers all over the world. Show Definition.
WEB STRUCTURE MINING SUBMITTED BY: BLESSY JOHN R7A ROLL NO:18.
Harnessing the Deep Web : Present and Future -Tushar Mhaskar Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, Alon Halevy January 7,
Web 2.0: Concepts and Applications 6 Linking Data.
Data mining in web applications
Information Retrieval in Practice
Objectives Overview Identify the four categories of application software Describe characteristics of a user interface Identify the key features of widely.
Marking the Most of the Web’s Resources
Search Engine Architecture
The Internet Industry Week Two.
Chapter Five Web Search Engines
ICT Communications Lesson 1: Using the Internet and the World Wide Web
Business in a Connected World
Data-Intensive Computing
Defining Data-intensive computing
MANAGING KNOWLEDGE FOR THE DIGITAL FIRM
Information Retrieval
Defining Data-intensive computing
Cloud Programming Models
Web Mining Department of Computer Science and Engg.
Large Scale Distributed Computing
Web Mining Research: A Survey
Information Retrieval and Web Design
Word Co-occurrence Chapter 3, Lin and Dryer.
Presentation transcript:

Defining Data-intensive computing 11/28/2018 Defining Data-intensive computing B. Ramamurthy Ch. 1, 2 of AIW

Topics for Discussion Intelligence and scale of data 11/28/2018 Intelligence and scale of data Examples of data-intensive applications Basic elements of data-intensive applications Designing and building data-intensive applications Example (Chapter 2): Searching with Lucene Pagerank algorithm Large-scale computing constraints and solutions Project 1: Searching with Lucene on Google App Engine

Intelligence and Scale of Data 11/28/2018 Intelligence is a set of discoveries made by federating/processing information collected from diverse sources. Information is a cleansed form of raw data. For statistically significant information we need reasonable amount of data. For gathering good intelligence we need large amount of information. Thus intelligence applications are invariably data- heavy, data-driven and data-intensive. Very often the data is gathered from the web (public or private, covert or overt). That is the reason for choosing this text: Algorithms of the intelligent web

Intelligence Search for Extra Terrestrial Intelligence (seti@home project) http://planetary.org/html/UPDATES/seti/index.html The Wow signal http://www.bigear.org/wow.htm 11/28/2018 Copyright 2010 B. Ramamurthy

Characteristics of intelligent applications 11/28/2018 Google search: How is different from regular search in existence before it? It took advantage of the fact the hyperlinks within web pages form an underlying structure that can be mined to determine the importance of various pages. Restaurant and Menu suggestions: instead of “Where would you like to go?” “Would you like to go to CityGrille”? Learning capacity from previous data of habits, profiles, and other information gathered over time. Collaborative and interconnected world inference capable: facebook friend suggestion Large scale data requiring indexing …

Examples of data-intensive applications 11/28/2018 Search engines Recommendation systems: CineMatch of Netflix Inc. movie recommendations Amazon.com: book/product recommendations Biological systems: high throughput sequences (HTS) Analysis: disease-gene match Query/search for gene sequences Space exploration Financial analysis

Data-intensive application characteristics 11/28/2018 Algorithms (thinking) Data structures (infrastructure) AggregatedContent (Raw data) Reference Structures (knowledge)

Basic Elements 11/28/2018 Aggregated content: large amount of data pertinent to the specific application; each piece of information is typically connected to many other pieces. Ex: Reference structures: Structures that provide one or more structural and semantic interpretations of the content. Reference structure about specific domain of knowledge come in three flavors: dictionaries, knowledge bases, and ontologies Algorithms: modules that allows the application to harness the information which is hidden in the data. Applied on aggregated content and some times require reference structure Ex: MapReduce Data Structures: newer data structures to leverage the scale and the WORM characteristics; ex: MS Azure, Apache Hadoop, Google BigTable

More intelligent data-intensive applications 11/28/2018 Social networking sites Mashups : applications that draw upon content retrieved from external sources to create entirely new innovative services. http://www.ibm.com/developerworks/spaces/mashups Portals Wikis: content aggregators; linked data; excellent data and fertile ground for applying concepts discussed in the text Media-sharing sites Online gaming Biological analysis Space exploration

Building your own application 11/28/2018 Determine your functionality: UML model use case diagram is a very nice tool to use at this stage (read the book) Determine the source of your internal and external data Examine the data and its utilization in the application Methods for enhancing the application Web data Crawling and screen scraping RSS feeds RESTful services Web services

Functionality 11/28/2018 Use case diagram is a good tool to discover/define the functionality of your applications Questions to ask: What are the main functions? What kind of data? Structured? Unstructured? Where will be stored? Will it be shared? What are the sources of data? Does it deal with geographic locations (maps)? Does it share content? Does it have search? Any automatic decisions to be made based on rules? What is the security model?

Acquiring the DATA 11/28/2018 Example: Get the houses available from Craigslist and post it on Google maps Enabling technologies for acquiring data: Crawler: spiders, start with a URL and visit the links in the URL collecting data, depth of crawling is parameter (chapter 2) Screen scrappers: extract information that is contained in html pages. Biological sciences: High throughput sequencers Web services: APIs that facilitate the communication between applications. Organizations make available the relevant information as services REST and SOAP are two underlying pipes for WS

Algorithms 11/28/2018 Machine learning is the capability of the software system to generalize based on past experience and the use of these generalization to provide answers to questions related old, new and future data. Data mining Soft computing We also need algorithms that are specially designed for the emerging storage models and data characteristics.

Chapter 2 Searching with Lucene 11/28/2018 Searching with Lucene This is an exercise searching large scale data Information retrieval: index and search Word count is one of the fundamental IR technique Add to this link analysis, user click analysis, natural language processing Termed “deep web” information Pagerank is most successful link analysis algorithm Probabilistic techniques (Aside: bean shell)