
Web Information Retrieval Projects Ida Mele

Rules
- Students can work in teams (max 3 people).
- The project must be delivered by the deadline that will be published on my web site. Usually the project discussion takes place on the same day as the written exam.
- Students who register for the first exam call can present the software project in the first or in the second exam call.
- The project score ranges from 0 to 10. The professor decides the final mark.
- The same project can be assigned to at most 2 groups.
- For any question, doubt, or problem, send me an email.

Project Request
Students have to send me an email with subject "WebIR - project request", specifying:
- Name and last name of each student in the group
- Title of the project and the dataset the students intend to use
- Short description of what the students intend to do (up to 250 words)
Important: all the members of the group should be cc-ed in the email.
If everything is OK, you will receive a confirmation.
There is no deadline for the project request.

Project Delivery
- The presentation of the project takes 15 minutes.
- The presentation should contain:
  - the description of the problem and of the dataset
  - the most important issues related to the implementation, and how they have been addressed
  - the results achieved
- Students can use slides for their presentations and, if they want, they can prepare a demo as well.
- The deadline and more instructions about the project delivery will be published on my web site.

List of Projects
1) Analyze the link structure of a large graph from the Web
2) Find circles in a social network through link analysis
3) Find communities in a network of users
4) Classification of online reviews
5) Topic classification of tweets
6) Personalized ranking of query results
7) Hadoop implementation of a link-based ranking algorithm
8) Hadoop implementation of an inverted index

1) Analyze the link structure of a large graph from the Web
Create the web graph and analyze its link structure by computing degree, in-degree, out-degree, PageRank, TruncatedPageRank, edge reciprocity, graph assortativity, number of triangles, etc. Plot the distributions of the features.
List of datasets you can use:
- use one of the graphs available in Section Larger crawls
- use graphs in Section Web graphs (e.g., web-Google, web-Stanford, web-NotreDame)
- use the graph representing subdomains
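As a starting point for project 1, the simplest of the listed features (in-degree, out-degree, and edge reciprocity) can be computed from a plain edge list. The following is a minimal sketch using only the Python standard library; the function names and the toy edge list are illustrative assumptions, and a real project would read the edges from one of the suggested datasets.

```python
from collections import Counter

def degree_stats(edges):
    """Return in-degree and out-degree counters plus edge reciprocity
    for a directed edge list given as (source, target) pairs."""
    # Drop self-loops and duplicate edges before counting.
    edge_set = {(u, v) for u, v in edges if u != v}
    out_deg = Counter(u for u, _ in edge_set)
    in_deg = Counter(v for _, v in edge_set)
    # Reciprocity: fraction of directed edges whose reverse edge also exists.
    reciprocated = sum(1 for u, v in edge_set if (v, u) in edge_set)
    reciprocity = reciprocated / len(edge_set) if edge_set else 0.0
    return in_deg, out_deg, reciprocity

def degree_distribution(deg, nodes):
    """Map each degree value to the number of nodes having that degree."""
    return Counter(deg.get(n, 0) for n in nodes)

# Toy graph (hypothetical data, not one of the suggested datasets).
edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d"), ("d", "a")]
in_deg, out_deg, reciprocity = degree_stats(edges)
nodes = {n for edge in edges for n in edge}
in_dist = degree_distribution(in_deg, nodes)
```

The `in_dist` counter is the kind of distribution the project asks to plot: degree values on one axis and node counts on the other.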

2) Find circles in a social network through link analysis
Create the graph of the users of a popular social network (e.g., Twitter, Facebook, or Google+). Analyze the network and apply link-based features to identify circles. Check if the circles you get match the ones obtained from the analysis of common features.
List of datasets you can use:
- use one of the ego graphs available in Section Social networks: ego-Facebook, ego-Gplus, or ego-Twitter. Each dataset is made of the ego network, the set of circles for the ego node, and the connections among ego networks. You can use the file with the set of circles as a ground truth.
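The slide does not say how detected circles should be compared with the ground truth. One common option, assumed here purely for illustration, is to score each ground-truth circle by its best Jaccard overlap with any detected circle. The function names and the toy circles below are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of node IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def best_matches(found, truth):
    """For each ground-truth circle, return the best Jaccard score
    achieved by any of the detected circles."""
    return [max((jaccard(t, f) for f in found), default=0.0) for t in truth]

# Hypothetical detected circles and ground-truth circles.
found = [{1, 2, 3}, {4, 5}]
truth = [{1, 2, 3, 4}, {5, 6}]
scores = best_matches(found, truth)
average_score = sum(scores) / len(scores)
```

Averaging the per-circle scores gives a single number that can be used to compare different link-based features.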

3) Find communities in a network of users
Create a graph where nodes are people and a link between two people represents the fact that they have something in common. For example, they are collaborators (DBLP co-authorship network) or they have bought the same product (Amazon product co-purchasing network), etc. Use this graph to find communities of people and check the results with the ground truth provided in the dataset.
List of datasets you can use:
- use one of the graphs available in Section Networks with ground-truth communities (e.g., com-DBLP, com-Amazon, com-YouTube, com-Friendster)
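Besides checking against the ground truth, a candidate partition can also be scored without labels. The sketch below computes Newman modularity for an undirected, unweighted graph; this measure is not named in the slides and is offered only as one possible sanity check. The function name and the toy graph (two triangles joined by one edge) are assumptions.

```python
def modularity(edges, communities):
    """Newman modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    communities: list of disjoint node sets covering all nodes.
    """
    m = len(edges)
    member = {n: i for i, comm in enumerate(communities) for n in comm}
    internal = [0] * len(communities)  # edges with both endpoints in community i
    degree = [0] * len(communities)    # sum of node degrees in community i
    for u, v in edges:
        degree[member[u]] += 1
        degree[member[v]] += 1
        if member[u] == member[v]:
            internal[member[u]] += 1
    return sum(
        internal[i] / m - (degree[i] / (2 * m)) ** 2
        for i in range(len(communities))
    )

# Two triangles connected by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_good = modularity(edges, [{0, 1, 2}, {3, 4, 5}])
q_bad = modularity(edges, [{0, 3}, {1, 4}, {2, 5}])
```

A partition that follows the two triangles scores higher than one that cuts across them, which is the behavior one would expect from a reasonable community-detection result.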

4) Classification of online reviews
Given a set of user reviews about products (food, wine, etc.), analyze the text and other features to classify the reviews. Some possible classifications divide reviews by kind/brand of product, by judgment (positive/neutral/negative), by helpfulness, etc.
List of datasets you can use:
- use data available in Section Online Reviews (e.g., CellarTracker, Amazon reviews, Fine Foods, Movies)
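The slides leave the choice of classifier open. As one possible baseline for the positive/negative judgment task, the sketch below implements multinomial Naive Bayes with add-one smoothing using only the standard library. The tokenizer, function names, and the four toy reviews are illustrative assumptions, not part of the course material.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Very simple whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()

def train(examples):
    """examples: list of (text, label). Returns the counts Naive Bayes needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-posterior for the given text."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training reviews.
reviews = [
    ("great wine lovely taste", "positive"),
    ("excellent food great value", "positive"),
    ("terrible taste awful smell", "negative"),
    ("awful food bad value", "negative"),
]
model = train(reviews)
```

With a real dataset, the labeled reviews would be split into training and test sets so that accuracy can be reported on unseen reviews.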

5) Topic classification of tweets
Given a set of English tweets, implement a topic-classification algorithm which divides tweets into categories. Possible categories are personal updates, news, politics, economics, sports, music, gossip, etc. You can also use ODP categories (http://…) for creating the list of possible topics.
List of datasets you can use:
- Send me an email, and I will give you the link to the dataset you can download

6) Personalized ranking of query results
Create a system for query-result personalization. The users of the system can specify their interests by selecting them from a list of keywords (e.g., gossip, sport, politics, …). You can use an HTML form for the registration to the system. Crawl a portion of the web (e.g., news websites) and create the corresponding web graph. Use a personalized ranking algorithm, for example Topic-Specific PageRank, for ranking the pages according to user interests, and compare the personalized ranking against the non-personalized one.
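Topic-Specific PageRank differs from standard PageRank only in the teleport step: the random surfer jumps to pages of the chosen topic instead of to any page uniformly. The sketch below is a minimal power-iteration version under that assumption; the function name, the damping value of 0.85, the handling of dangling nodes, and the toy graph are all illustrative choices.

```python
def personalized_pagerank(graph, topic_pages, damping=0.85, iterations=50):
    """Power iteration for PageRank with a topic-biased teleport vector.

    graph: dict mapping each node to a list of its out-neighbors.
    topic_pages: non-empty set of nodes the random surfer teleports to.
    """
    nodes = set(graph) | {v for outs in graph.values() for v in outs}
    teleport = {
        n: (1.0 / len(topic_pages) if n in topic_pages else 0.0) for n in nodes
    }
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for u in nodes:
            outs = graph.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new_rank[v] += share
            else:
                # Dangling node: send its mass back through the teleport vector.
                for n in nodes:
                    new_rank[n] += damping * rank[u] * teleport[n]
        rank = new_rank
    return rank

# Toy web graph (hypothetical pages).
graph = {
    "news": ["sport", "politics"],
    "sport": ["news"],
    "politics": ["news"],
    "gossip": ["news"],
}
sport_rank = personalized_pagerank(graph, {"sport"})
uniform_rank = personalized_pagerank(graph, set(graph))
```

Passing all nodes as `topic_pages` gives the non-personalized ranking, so the two result dictionaries can be compared directly as the project requires.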

7) Hadoop implementation of a link-based ranking algorithm
Given a web graph, where nodes represent web pages and the edge between two nodes u and v represents the link from the source page u to the target page v, implement in Hadoop a ranking algorithm (PageRank or HITS) that computes the scores of the nodes. Plot and analyze the distribution of the obtained scores.
List of datasets you can use:
- use one of the graphs available in Section Larger crawls
- use graphs in Section Web graphs (e.g., web-Google, web-Stanford, web-NotreDame)
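Before writing Hadoop code, it can help to prototype the map and reduce logic of one PageRank iteration locally. The sketch below simulates the map, shuffle, and reduce phases in plain Python; it does not use the Hadoop API, and the function names, the record layout `(rank, out_links)`, and the damping value are assumptions. It also assumes that every node has at least one out-link, so dangling nodes would need extra handling.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed damping factor

def mapper(node, record):
    """record = (rank, out_links). Emit the adjacency list and rank shares."""
    rank, out_links = record
    yield node, ("links", out_links)  # pass the graph structure along
    for target in out_links:
        yield target, ("rank", rank / len(out_links))

def reducer(node, values, num_nodes):
    """Sum the incoming rank shares and rebuild the node record."""
    out_links, total = [], 0.0
    for kind, payload in values:
        if kind == "links":
            out_links = payload
        else:
            total += payload
    new_rank = (1.0 - DAMPING) / num_nodes + DAMPING * total
    return node, (new_rank, out_links)

def pagerank_iteration(state):
    """One map -> shuffle -> reduce round over state: node -> (rank, out_links)."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for node, record in state.items():
        for key, value in mapper(node, record):
            grouped[key].append(value)
    return dict(reducer(n, vals, len(state)) for n, vals in grouped.items())

# Toy graph with uniform initial ranks (hypothetical data).
state = {"a": (1 / 3, ["b", "c"]), "b": (1 / 3, ["c"]), "c": (1 / 3, ["a"])}
for _ in range(30):
    state = pagerank_iteration(state)
```

In a real Hadoop job each iteration would be a separate MapReduce pass whose output becomes the input of the next pass, with the same key/value layout.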

8) Hadoop implementation of an inverted index
Given a large collection of documents, create the inverted index, which is made of a dictionary and the posting lists. The dictionary contains indexed terms (remove stop-words and use stemming for preprocessing). For each term in the dictionary, the posting list contains information about documents where the term appears. Each posting has the ID of the document, the frequency of the term in the document, and the positions of the occurrences of the term in the document.
List of datasets you can use:
- Gutenberg project (http://…) offers free ebooks that can be used for creating the document collection
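The structure of the index described in project 8 can be prototyped in memory before moving to Hadoop. The sketch below builds a positional inverted index in plain Python; the stop-word list is a tiny illustrative sample, the `stem` function is a crude stand-in for a real stemmer, and the two documents are made up for the example.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}  # illustrative only

def stem(token):
    """Crude suffix stripping; a placeholder for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def build_index(docs):
    """docs: dict doc_id -> text. Returns term -> {doc_id: [positions]}.

    Positions are token offsets in the original text, counting stop-words.
    The term frequency in a document is the length of its position list.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            if token in STOP_WORDS:
                continue
            index[stem(token)].setdefault(doc_id, []).append(pos)
    return index

# Hypothetical two-document collection.
docs = {1: "The cat chased the cats", 2: "A dog is chasing cats"}
index = build_index(docs)
```

In a MapReduce version, the mapper would emit `(term, (doc_id, position))` pairs and the reducer would group them into the posting list for each term.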

Important Information
- Students can choose one of the projects in the list, or they can propose a different project.
- There are no constraints on the datasets to use: the students can use the datasets suggested in the list of projects or different datasets available on the Web, or they can even create a new dataset for their project.
- Links to other dataset sources:

Important Information
There are no constraints on programming languages, libraries, and tools to use.
Some tools/libraries for working with graphs:
- Graph visualization: Gephi, Graphviz
- Large-graph partitioning: METIS
- Java libraries: WebGraph, JUNG
- Python library: NetworkX