Web Information Retrieval Projects
Ida Mele

Rules
- Students can work in teams (max 3 people).
- The project must be delivered by the deadline that will be published on my web site. Usually the project discussion takes place on the same day as the written exam.
- Students who register for the first exam call can present the software project in the first or in the second exam call.
- The project score ranges from 0 to 10. The professor decides the final mark.
- The same project can be assigned to at most 2 groups.
- For any question, doubt, or problem, send me an email.

Project Request
Students have to send me an email with subject "WebIR - project request", specifying:
- Name and last name of each student in the group
- Title of the project and dataset the students intend to use
- Short description of what the students intend to do (up to 250 words)
Important: all the members of the group should be cc-ed in the email.
If everything is OK, you will receive a confirmation email.
There is no deadline for the request of the project.

Project Delivery
- The presentation of the project takes 15 minutes.
- The presentation should contain:
  - the description of the problem and of the dataset
  - the most important implementation issues and how they were addressed
  - the results achieved
- Students can use slides for their presentations and, if they want, they can prepare a demo as well.
- The deadline and further instructions about the project delivery will be published on my web site.

List of Projects
1) Analyze the link structure of a large graph from the Web
2) Find circles in a social network through link analysis
3) Find communities in a network of users
4) Classification of online reviews
5) Topic classification of tweets
6) Personalized ranking of query results
7) Hadoop implementation of a link-based ranking algorithm
8) Hadoop implementation of an inverted index

1) Analyze the link structure of a large graph from the Web
- Create the web graph and analyze its link structure by computing degree, in-degree, out-degree, PageRank, TruncatedPageRank, edge reciprocity, graph assortativity, number of triangles, etc.
- Plot the distributions of the features.
List of datasets you can use:
- http://law.di.unimi.it/datasets.php (use one of the graphs available in Section Larger crawls)
- http://snap.stanford.edu/data/index.html (use graphs in Section Web graphs, e.g., web-Google, web-Stanford, web-NotreDame)
- http://webdatacommons.org/hyperlinkgraph/ (use the graph representing subdomains)
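Some of the link-structure features named in project 1 can be sketched in a few lines. The block below is only an illustration on a toy edge list, using plain Python so that nothing depends on a particular graph library; the function name `link_features` and the choice to drop self-loops and duplicate edges are assumptions, and a real web graph would need a library such as WebGraph or NetworkX to fit in memory.

```python
from collections import defaultdict

def link_features(edges):
    """Return (in_degree, out_degree, reciprocity, triangles) for a directed edge list."""
    # Assumption: self-loops and duplicate edges are ignored.
    edge_set = {(u, v) for u, v in edges if u != v}
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    undirected = defaultdict(set)
    for u, v in edge_set:
        out_deg[u] += 1
        in_deg[v] += 1
        undirected[u].add(v)
        undirected[v].add(u)
    # Edge reciprocity: fraction of directed edges whose reverse edge also exists.
    reciprocated = sum(1 for u, v in edge_set if (v, u) in edge_set)
    reciprocity = reciprocated / len(edge_set) if edge_set else 0.0
    # Triangles are counted on the undirected version of the graph.
    # Every triangle is found once per edge, hence the division by 3.
    found = 0
    for u in undirected:
        for v in undirected[u]:
            if u < v:
                found += len(undirected[u] & undirected[v])
    return dict(in_deg), dict(out_deg), reciprocity, found // 3
```

The per-node degree dictionaries returned here are what one would feed into a histogram to plot the degree distributions.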

2) Find circles in a social network through link analysis
- Create the graph of the users of a popular social network (e.g., Twitter, Facebook, or Google+).
- Analyze the network and apply link-based features to identify circles.
- Check whether the circles you get match the ones obtained from the analysis of common features.
List of datasets you can use:
- http://snap.stanford.edu/data/index.html (use one of the ego graphs available in Section Social networks: ego-Facebook, ego-Gplus, or ego-Twitter). Each dataset is made of the ego network, the set of circles for the ego node, and the connections among ego networks. You can use the file with the set of circles as a ground truth.
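One possible way to check detected circles against the ground-truth circles in project 2 is to match each ground-truth circle with its most similar detected circle. The sketch below uses the Jaccard coefficient for that; the function names and the best-match scoring scheme are assumptions chosen for illustration, not a measure prescribed by the project.

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections of node IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_match_scores(found_circles, true_circles):
    """For each ground-truth circle, the best Jaccard score over all found circles."""
    return [
        max((jaccard(truth, found) for found in found_circles), default=0.0)
        for truth in true_circles
    ]
```

Averaging the returned scores gives a single number for comparing different circle-detection strategies on the same ego network.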

3) Find communities in a network of users
- Create a graph where nodes are people and a link between two people represents the fact that they have something in common. For example, they are collaborators (DBLP co-authorship network) or they have bought the same product (Amazon product co-purchasing network), etc.
- Use this graph to find communities of people and check the results against the ground truth provided in the dataset.
List of datasets you can use:
- http://snap.stanford.edu/data/index.html (use one of the graphs available in Section Networks with ground-truth communities, e.g., com-DBLP, com-Amazon, com-YouTube, com-Friendster)
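As a starting point for project 3, the sketch below shows one very simple community heuristic: drop every edge whose endpoints share few neighbors, then take the connected components of what remains. This is an illustrative baseline under stated assumptions (the function name, the closed-neighborhood Jaccard overlap, and the 0.5 threshold are all hypothetical choices), not a substitute for established methods such as modularity optimization.

```python
from collections import defaultdict

def communities_by_overlap(edges, threshold=0.5):
    """Split an undirected graph into communities by removing low-overlap edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Keep an edge only if the closed neighborhoods of its endpoints overlap enough.
    strong = {node: set() for node in adj}
    for u, v in edges:
        nu, nv = adj[u] | {u}, adj[v] | {v}
        if len(nu & nv) / len(nu | nv) >= threshold:
            strong[u].add(v)
            strong[v].add(u)
    # Connected components of the filtered graph, found with an iterative DFS.
    seen, result = set(), []
    for start in sorted(strong):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(strong[node] - seen)
        result.append(component)
    return result
```

On two triangles joined by a single bridge edge, the bridge has low overlap and is removed, so the two triangles come out as separate communities.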

4) Classification of online reviews
- Given a set of user reviews about products (food, wine, etc.), analyze the text and other features to classify the reviews.
- Some possible classifications are by kind or brand of product, by judgment (positive/neutral/negative), by helpfulness, etc.
List of datasets you can use:
- http://snap.stanford.edu/data/index.html (use data available in Section Online Reviews, e.g., CellarTracker, Amazon reviews, Fine Foods, Movies)
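One common baseline for text classification tasks like project 4 is a multinomial Naive Bayes classifier with add-one smoothing. The sketch below is a minimal version under simplifying assumptions: tokens are obtained by lowercasing and splitting on whitespace, words unseen in training are ignored at prediction time, and the function names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_reviews):
    """Collect per-class document and word counts from (text, label) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_reviews:
        tokens = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log-posterior for the given text."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)  # class prior
        for token in text.lower().split():
            if token in vocab:  # words never seen in training are skipped
                # Add-one (Laplace) smoothing avoids zero probabilities.
                count = word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The same structure works for other label sets (product kind, helpfulness) as long as the training pairs carry the corresponding labels.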

5) Topic classification of tweets
- Given a set of English tweets, implement a topic-classification algorithm that divides tweets into categories.
- Possible categories are personal updates, news, politics, economics, sports, music, gossip, etc. You can also use ODP categories (http://www.dmoz.org/) for creating the list of possible topics.
List of datasets you can use:
- Send me an email, and I will give you the link to the dataset you can download.

6) Personalized ranking of query results
- Create a system for query-result personalization. The users of the system can specify their interests by selecting them from a list of keywords (e.g., gossip, sport, politics, …). You can use an HTML form for the registration to the system.
- Crawl a portion of the web (e.g., news websites) and create the corresponding web graph.
- Use a personalized ranking algorithm, for example Topic-Specific PageRank, to rank the pages according to user interests, and compare the personalized ranking against the non-personalized one.
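The personalized ranking step of project 6 can be illustrated with a small power-iteration sketch of Topic-Specific PageRank, in which the random jump lands only on pages belonging to the user's topics. The function name, the damping value of 0.85, the fixed iteration count, and the handling of dangling pages through the teleport vector are assumptions made for this example.

```python
def topic_specific_pagerank(out_links, topic_pages, damping=0.85, iterations=50):
    """Power iteration where teleportation is restricted to topic_pages.

    out_links maps each page to the list of pages it links to.
    """
    topic = set(topic_pages)
    nodes = set(out_links) | topic
    for targets in out_links.values():
        nodes.update(targets)
    # Teleport vector: uniform over the topic pages, zero elsewhere.
    teleport = {n: (1.0 / len(topic) if n in topic else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling page: redistribute its rank through the teleport vector.
                for n in nodes:
                    new_rank[n] += damping * rank[u] * teleport[n]
        rank = new_rank
    return rank
```

Passing every page as a topic page makes the teleport vector uniform, which gives the non-personalized ranking to compare against.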

7) Hadoop implementation of a link-based ranking algorithm
- Given a web graph, where nodes represent web pages and the edge between two nodes u and v represents the link from the source page u to the target page v, implement in Hadoop a ranking algorithm (PageRank or HITS) that computes the scores of the nodes.
- Plot and analyze the distribution of the obtained scores.
List of datasets you can use:
- http://law.di.unimi.it/datasets.php (use one of the graphs available in Section Larger crawls)
- http://snap.stanford.edu/data/index.html (use graphs in Section Web graphs, e.g., web-Google, web-Stanford, web-NotreDame)
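Before writing Hadoop code for project 7, it can help to prototype the map and reduce steps locally. The sketch below simulates one PageRank iteration in MapReduce style in plain Python; it does not use the Hadoop API, the function names are hypothetical, and it assumes that every node appears as a key in the input and has at least one out-link (no dangling nodes).

```python
from collections import defaultdict

def pagerank_map(node, value):
    """Mapper: value is (rank, out_links). Emit the link list and the rank shares."""
    rank, out_links = value
    yield node, ("links", out_links)  # pass the graph structure to the reducer
    for target in out_links:
        yield target, ("share", rank / len(out_links))

def pagerank_reduce(node, values, num_nodes, damping=0.85):
    """Reducer: sum the incoming shares and recover the node's link list."""
    out_links, total = [], 0.0
    for kind, payload in values:
        if kind == "links":
            out_links = payload
        else:
            total += payload
    return node, ((1.0 - damping) / num_nodes + damping * total, out_links)

def run_iteration(state, damping=0.85):
    """One full iteration; the grouping dict stands in for Hadoop's shuffle phase."""
    grouped = defaultdict(list)
    for node, value in state.items():
        for key, emitted in pagerank_map(node, value):
            grouped[key].append(emitted)
    return dict(
        pagerank_reduce(key, values, len(state), damping)
        for key, values in grouped.items()
    )
```

Because the reducer outputs records in the same (rank, out_links) shape the mapper consumes, the output of one iteration can be fed directly into the next, which mirrors how chained Hadoop jobs would be organized.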

8) Hadoop implementation of an inverted index
- Given a large collection of documents, create the inverted index, which is made of a dictionary and the posting lists.
- The dictionary contains indexed terms (remove stop-words and use stemming for preprocessing).
- For each term in the dictionary, the posting list contains information about documents where the term appears. Each posting has the ID of the document, the frequency of the term in the document, and the positions of the occurrences of the term in the document.
List of datasets you can use:
- Gutenberg project (http://www.gutenberg.org/) offers free ebooks that can be used for creating the document collection.
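The posting structure described for project 8 (document ID, term frequency, positions) can be prototyped locally before moving to Hadoop. The sketch below simulates the map and reduce steps in plain Python and does not use the Hadoop API; the function names and the tiny stop-word list are assumptions, stemming is left out for brevity, and positions are token offsets counted before stop-word removal.

```python
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "of", "and", "is", "in"}

def index_map(doc_id, text):
    """Mapper: emit (term, (doc_id, position)) for every non-stop-word token."""
    for position, token in enumerate(text.lower().split()):
        term = token.strip(".,;:!?\"'()")
        if term and term not in STOP_WORDS:
            yield term, (doc_id, position)

def index_reduce(term, occurrences):
    """Reducer: build the posting list [(doc_id, frequency, positions), ...]."""
    positions_by_doc = defaultdict(list)
    for doc_id, position in occurrences:
        positions_by_doc[doc_id].append(position)
    postings = [
        (doc_id, len(positions), sorted(positions))
        for doc_id, positions in sorted(positions_by_doc.items())
    ]
    return term, postings

def build_index(documents):
    """Run map, group by term (the shuffle phase), then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for term, occurrence in index_map(doc_id, text):
            grouped[term].append(occurrence)
    return dict(index_reduce(term, occ) for term, occ in sorted(grouped.items()))
```

Keeping the positions in each posting is what later allows phrase and proximity queries, at the cost of a larger index.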

Important Information
- Students can choose one of the projects in the list, or they can propose a different project.
- There are no constraints on the datasets to use: students can use the datasets suggested in the list of projects or different datasets available on the Web, or they can even create a new dataset for their project.
Links to other dataset sources:
- http://vlado.fmf.uni-lj.si/pub/networks/data/default.htm
- http://www.trustlet.org/wiki/Repositories_of_datasets
- http://www-personal.umich.edu/~mejn/netdata/

Important Information
- There are no constraints on programming languages, libraries, and tools to use.
Links to some tools/libraries for working with graphs:
- Graph visualization: Gephi (http://gephi.org/), Graphviz (http://www.graphviz.org/)
- Large-graph partitioning: METIS (http://glaros.dtc.umn.edu/gkhome/metis/metis/overview)
- Java libraries: WebGraph (http://webgraph.di.unimi.it/), JUNG (http://jung.sourceforge.net/)
- Python library: NetworkX (http://networkx.github.io/)
