Published by Russell Jones. Modified over 9 years ago.
1
PageRank + Inverted Index
2
A Search Engine
3
“obama”
4
PageRank Model: Final Version
The Web: a directed graph
– Vertices (pages)
– Edges (links)
[Figure: example graph with vertices a, b, c, d, e, f]
5
Input Structure
31.5 million edges
960,109 nodes
Columns: document-with-link, document-linked
6
Step 0. Start Downloading Datasets
http://aidanhogan.com/teaching/cc5212-1/mdp-lab9-data/
– page_links_es_f.tsv.gz
– wiki_abstracts_es.tsv.gz
– http://aidanhogan.com/teaching/cc5212-1/mdp-lab9.zip
7
Step 1. Dictionary Encode Links
Strings are difficult to fit in memory
Encode strings as OIDs (object ids = integers)
Input line: http://es.wikipedia.org/wiki/Ciencia_ficción http://es.wikipedia.org/wiki/Robot
Output line: 12039 52673
Dictionary:
12039 http://es.wikipedia.org/wiki/Ciencia_ficción
…
52673 http://es.wikipedia.org/wiki/Robot
…
OIDCompress -i [folder]/page_links_es_f.tsv.gz -igz -o [folder]/page_links_es_f.oid.gz -ogz -d [folder]/page_links_es_f.dict.gz -dgz
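The encoding idea in Step 1 can be sketched in a few lines of plain Java. This is not the lab's OIDCompress code: the class name `OidDictionarySketch`, the tab separator, and the choice to number OIDs from 0 in order of first appearance are all assumptions made for illustration (the OIDs 12039 and 52673 above come from the real tool).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy encoder: maps each distinct string to an integer OID. */
final class OidDictionarySketch {
    private final Map<String, Integer> oidByString = new HashMap<>();
    private final List<String> stringByOid = new ArrayList<>();

    /** Returns the OID of s, assigning the next free integer on first sight. */
    int encode(String s) {
        Integer oid = oidByString.get(s);
        if (oid == null) {
            oid = stringByOid.size(); // assumption: sequential OIDs starting at 0
            oidByString.put(s, oid);
            stringByOid.add(s);
        }
        return oid;
    }

    /** Encodes every tab-separated column of a link line as its OID. */
    String encodeLine(String line) {
        String[] cols = line.split("\t");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < cols.length; i++) {
            if (i > 0) {
                out.append('\t');
            }
            out.append(encode(cols[i]));
        }
        return out.toString();
    }

    /** Looks the original string back up from its OID. */
    String decode(int oid) {
        return stringByOid.get(oid);
    }
}
```

Each URL is stored once in the dictionary, so a graph with 31.5 million edges only needs two integers per edge in memory.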
8
Step 2. Copy PageRank Code
Copy PageRankGraph.java from mdp-lab8 to mdp-lab9 (same package)
– Use your code to be marked on it!
– Marked from 20 for this lab
If you weren’t here last week, copy PageRankGraph.java from http://aidanhogan.com/cc5212-1/mdp-lab9-data/
– Marked from 10 for this lab
9
Step 3. Rank and sort full data
Run ranking (PageRankGraph.java)
– 50 iterations: ITERS = 50
-i [folder]/page_links_es_f.oid.gz -igz -o [folder]/page_ranks_es_f.oid.tsv.gz -ogz
Sort ranks by rank score (SortByRank.java)
-i [folder]/page_ranks_es_f.oid.tsv.gz -igz -o [folder]/page_ranks_es_f_s.oid.tsv.gz -ogz
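For reference, an iterative PageRank over OID-encoded edges, followed by a sort on rank score, might look roughly like the sketch below. It is not the lab's PageRankGraph.java or SortByRank.java. The damping factor of 0.85, the uniform redistribution of rank from pages without out-links, and the names `PageRankSketch`, `rank` and `sortByRank` are assumptions; only `ITERS = 50` comes from the slide.

```java
import java.util.Arrays;

/** Hypothetical in-memory PageRank sketch over integer node ids 0..n-1. */
final class PageRankSketch {
    static final int ITERS = 50;         // from the slide
    static final double DAMPING = 0.85;  // assumption: common default value

    /** Each edge is {source, target}. Returns one rank score per node. */
    static double[] rank(int n, int[][] edges, int iters) {
        int[] outDegree = new int[n];
        for (int[] e : edges) {
            outDegree[e[0]]++;
        }
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);

        for (int it = 0; it < iters; it++) {
            double[] next = new double[n];
            // Rank held by pages with no out-links is shared evenly (assumption).
            double dangling = 0.0;
            for (int v = 0; v < n; v++) {
                if (outDegree[v] == 0) {
                    dangling += rank[v];
                }
            }
            // Every page splits its rank equally among its out-links.
            for (int[] e : edges) {
                next[e[1]] += rank[e[0]] / outDegree[e[0]];
            }
            double base = (1.0 - DAMPING) / n + DAMPING * dangling / n;
            for (int v = 0; v < n; v++) {
                next[v] = base + DAMPING * next[v];
            }
            rank = next;
        }
        return rank;
    }

    /** Returns node ids ordered from highest to lowest rank. */
    static Integer[] sortByRank(double[] rank) {
        Integer[] ids = new Integer[rank.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        Arrays.sort(ids, (a, b) -> Double.compare(rank[b], rank[a]));
        return ids;
    }
}
```

Calling `PageRankSketch.rank(n, edges, PageRankSketch.ITERS)` and then `sortByRank` on the result mirrors the two commands above on a toy scale; the scores sum to 1 after every iteration.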
10
Step 4. Make Predictions & Bets Which will be the highest ranked articles in Wikipedia according to PageRank?
11
Step 5. Decode the ranks
Decode the file (OIDDecompress.java)
-d [folder]/page_links_es_f.dict.gz -dgz -i [folder]/page_ranks_es_f_s.oid.tsv.gz -igz -n 0 -o [folder]/page_ranks_es_f_s.tsv
Open the output in a text editor and have a look
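Decoding is the reverse lookup. The sketch below is not OIDDecompress.java; it assumes that each dictionary line is `OID<TAB>string`, that each rank line is tab-separated, and that `-n 0` selects the column to decode. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of replacing an OID column with its original string. */
final class OidDecodeSketch {

    /** Builds an OID-to-string map from lines of the form "OID<TAB>string". */
    static Map<String, String> loadDictionary(Iterable<String> dictLines) {
        Map<String, String> stringByOid = new HashMap<>();
        for (String line : dictLines) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // skip lines that do not match the assumed format
            }
            stringByOid.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return stringByOid;
    }

    /** Decodes one column of a tab-separated line; unknown OIDs are left as-is. */
    static String decodeColumn(String line, int column, Map<String, String> dict) {
        String[] cols = line.split("\t", -1);
        String decoded = dict.get(cols[column]);
        if (decoded != null) {
            cols[column] = decoded;
        }
        return String.join("\t", cols);
    }
}
```

With the dictionary from Step 1 loaded, a line such as `52673<TAB>score` would come back with the article URL in column 0 and the score untouched.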
12
Step 6. Copy Inverted Index Code
Copy IndexTitleAndAbstract.java and SearchIndex.java from mdp-lab7 into mdp-lab9 (if you were here)
Otherwise grab them from http://aidanhogan.com/cc5212-1/mdp-lab9-data/
13
Step 7. Rebuild Inverted Index
IndexTitleAndAbstract.java
-i [folder]/wiki_abstracts_es.tsv.gz -igz -o [folder]/es_wiki_index/
Try searches using SearchIndex.java
– Copy the top 10 results for 5 searches including ‘obama’ and ‘universidad’ into a text file somewhere
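As a reminder of what an inverted index does, here is a minimal plain-Java sketch. It is not IndexTitleAndAbstract.java or SearchIndex.java and it does not use the lab's indexing library: the class `InvertedIndexSketch`, its tokenizer, and the single-keyword search are simplifying assumptions, and results are returned unranked.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical toy inverted index over article titles and abstracts. */
final class InvertedIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> titles = new ArrayList<>();

    /** Indexes one article and returns its document id. */
    int add(String title, String abstractText) {
        int docId = titles.size();
        titles.add(title);
        for (String term : tokenize(title + " " + abstractText)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Returns the titles of all articles containing the keyword, in insertion order. */
    List<String> search(String keyword) {
        String term = keyword.toLowerCase(Locale.ROOT);
        Set<Integer> docIds = postings.getOrDefault(term, Collections.emptySet());
        List<String> hits = new ArrayList<>();
        for (int docId : docIds) {
            hits.add(titles.get(docId));
        }
        return hits;
    }

    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }
}
```

The map from term to document ids is the "inverted" part: a search for ‘obama’ is a single lookup rather than a scan over every abstract.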
14
Step 8. Add in the boost values
Open BoostRanks.java
Follow the board to code
Run:
-o [folder]/es_wiki_index/ -i [folder]/page_ranks_es_f_s.tsv
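The code written on the board is not part of these slides, so the following only illustrates the general idea of boosting: combining a text relevance score with a link-based rank. The formula `textScore * (1 + rank / maxRank)`, the class `BoostSketch`, and both method names are assumptions for illustration, not what BoostRanks.java does.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of re-ranking search hits with a PageRank-derived boost. */
final class BoostSketch {

    /** Assumed formula: the boost grows from 1 (no rank) to 2 (highest rank). */
    static double boostedScore(double textScore, double rank, double maxRank) {
        double boost = maxRank > 0 ? 1.0 + rank / maxRank : 1.0;
        return textScore * boost;
    }

    /** Orders titles by boosted score, best first; titles without a rank get rank 0. */
    static List<String> rerank(Map<String, Double> textScoreByTitle,
                               Map<String, Double> rankByTitle) {
        double maxRank = 0.0;
        for (double r : rankByTitle.values()) {
            maxRank = Math.max(maxRank, r);
        }
        final double max = maxRank;
        List<String> titles = new ArrayList<>(textScoreByTitle.keySet());
        titles.sort(Comparator.comparingDouble((String t) ->
                boostedScore(textScoreByTitle.get(t),
                             rankByTitle.getOrDefault(t, 0.0),
                             max)).reversed());
        return titles;
    }
}
```

Under this assumed formula, a heavily linked article with a slightly weaker text match can overtake an obscure article with a stronger one, which is the effect Step 9 asks you to look for.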
15
Step 9. Profit
Re-run the same five queries as before over the boosted index and see if the results improve
http://www.lucenetutorial.com/lucene-query-syntax.html
17
Course Marking
45% for Weekly Labs (~3% a lab!)
35% for Final Exam
20% for Small Class Project
18
Class Project
Done in pairs (except Alejandro/Mauricio :P)
Goal: Use what you’ve learned to do something cool (basically)
Expected difficulty: More than a lab’s worth
– But from scratch / without my help!
Marked on: Difficulty, appropriateness, scale, good use of techniques, presentation, coolness
– Ambition is appreciated, even if you don’t succeed: feel free to bite off more than you can chew!
Process:
– Pair up (default random) by Wednesday
– Decide on a topic (by June 9th) or let me assign one
– If you need data or get stuck, I will (try to) help out
Deliverables: 10-minute presentation (June 23rd) & 4-page report
– 2 weeks!
19
Groups
Pairings:
– Catalina Espinoza and Felipe Quintanilla
– Eduardo Acha and Jaime Salas
– Francisca Concha and Nicolás Miranda
Lone agents:
– Alejandro Infante
– Mauricio Quezada
20
Topics
Let’s talk topics
– Catalina Espinoza and Felipe Quintanilla
– Eduardo Acha and Jaime Salas
– Francisca Concha and Nicolás Miranda
– Mauricio Quezada
What’s the idea?
What will be the result of your project?
How much data will you process, and where will you source it?
Which techniques from the class will you use?
How cool is it?