Web and Social Media research at SZTAKI. Zsolt Fekete, Andras Benczur, Institute for Computer Science and Control, Hungarian Academy of Sciences and Eötvös University Budapest.

Presentation transcript:

Web and Social Media research at SZTAKI
Zsolt Fekete, Andras Benczur
Institute for Computer Science and Control, Hungarian Academy of Sciences and Eötvös University Budapest
14 June 2013
Web and Social Media TÁMOP C-11/1/KONV

Informatics Laboratory
Data Mining and Search Group
o Zsolt Fekete, head
Data Warehouse and Business Intelligence group
o Csaba Sidlo, head
Groups within the lab
o Lajos Ronyai, Theory of Computing group
o Daniel Marx, ERC Starting Grant winner, Parameterized Complexity
o Andras Kornai, Human Language Technologies
Hardware
o 50-node old dual core Hadoop
o 5-node new Hadoop/HBASE
o 260TB net Isilon
Big Data – „Momentum” group
o Awarded by President of Hungarian Academy of Sciences in 2012

SZTAKI Text Mining Center
Funded by the President of the Hungarian Academy of Sciences
Led by Prof. Laszlo Monostori, Research Laboratory on Engineering & Management Intelligence
o Informatics Laboratory (András Benczúr)
o Laboratory of Parallel and Distributed Systems (Péter Kacsuk)
o Internet Technology Department (István Tétényi)
o Department of Distributed Systems (László Kovács)
Topics:
o trend monitoring; novelty recognition; concept-flow, concept-mapping;
o analysis, monitoring and visualization of themes, professional relations, joint authorship, citations, etc.
o opinion extraction; semantic annotation; domain ontology development;
o identification and resolution of names of persons and organizations;
o plagiarism detection

Connection to FuturICT.hu Work Plan
Science of Science
o SZTAKI Text Mining Center
o Web classification
o Metadata extraction
o SZTAKI Plagiarism Detection toolkit
Fully Distributed Learning (and Networks)
o Recommender systems
o Distributed and streaming architectures
o Network influence in recommender systems

Automatic metadata extraction
Articles in pdf form
Extracting
o Title
o Authors
o References
o Etc.
Used techniques
o Computing features (text, visual info)
o Machine Learning: SVM, CRF
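
The feature computation step can be illustrated with a minimal Python sketch. It assumes the PDF has already been converted into text lines carrying layout information; the `TextLine` type, the feature names, and the choice of features are hypothetical and only show how text and visual cues might be combined into one vector per line. A classifier such as an SVM, or a sequence model such as a CRF, would then be trained on these vectors; that step is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TextLine:
    """One line of a PDF page (hypothetical input format)."""
    text: str
    font_size: float   # in points
    y_position: float  # 0.0 = top of page, 1.0 = bottom
    is_bold: bool


def line_features(line: TextLine, max_font_size: float) -> Dict[str, float]:
    """Combine text and visual cues into one feature dictionary."""
    words = line.text.split()
    n_words = len(words)
    capitalized = sum(1 for w in words if w[:1].isupper())
    return {
        # Visual features: titles tend to be large, bold, near the top.
        "rel_font_size": line.font_size / max_font_size,
        "y_position": line.y_position,
        "is_bold": float(line.is_bold),
        # Text features: author lines often contain commas and
        # capitalized words, reference lines often contain digits.
        "n_words": float(n_words),
        "frac_capitalized": capitalized / n_words if n_words else 0.0,
        "has_digit": float(any(ch.isdigit() for ch in line.text)),
        "has_comma": float("," in line.text),
    }


def page_features(lines: List[TextLine]) -> List[Dict[str, float]]:
    """Compute features for every line, normalizing by the largest font."""
    max_size = max(line.font_size for line in lines)
    return [line_features(line, max_size) for line in lines]


page = [
    TextLine("Web and Social Media Research", 18.0, 0.10, True),
    TextLine("Zsolt Fekete, Andras Benczur", 12.0, 0.20, False),
]
features = page_features(page)
```

Each dictionary corresponds to one line; labels such as title, author, or reference would be predicted per line.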

Crosslingual Web Classification
Save resources, select quality and topic
Legal regulation (porn, illicit content)
Web scale data (Test: ClueWeb09 25TB – 0.5 Billion English language docs)
Julien Masanes, Philippe Rigaux (Internet Memory, Paris)
Cross-Lingual Web Spam Classification. Garzó, Daróczy, Kiss, Siklósi, Benczúr. WebQuality 2013
The classification power of Web features. Erdelyi, Benczur, Daroczy, Garzo, Kiss, Siklosi. Internet Mathematics, under revision

Crosslingual Web Classification
Expensive human labeling task language by language?
How can models be “translated”?
Terms in the English model are translated into Portuguese to classify in the target language.
The strongest positive and negative predictions are used for training a model in the target language.
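
The two steps on this slide can be sketched in Python under strong simplifying assumptions: the English model is a plain bag-of-words weight table, translation is a one-to-one dictionary lookup, and the target-language learner is a toy term counter. All function names, the tiny dictionary, and the example documents are made up for illustration and do not reproduce the classifier used in the cited papers.

```python
from collections import Counter
from typing import Dict, List, Tuple


def translate_model(weights: Dict[str, float],
                    dictionary: Dict[str, str]) -> Dict[str, float]:
    """Step 1: map source-language term weights onto target-language terms."""
    translated: Dict[str, float] = {}
    for term, weight in weights.items():
        target = dictionary.get(term)
        if target is not None:  # terms without a translation are dropped
            translated[target] = translated.get(target, 0.0) + weight
    return translated


def score(doc: str, weights: Dict[str, float]) -> float:
    """Linear bag-of-words score; positive means the spam class."""
    counts = Counter(doc.lower().split())
    return sum(weights.get(term, 0.0) * n for term, n in counts.items())


def select_confident(docs: List[str], weights: Dict[str, float],
                     k: int) -> List[Tuple[str, int]]:
    """Step 2a: keep the k strongest negative and k strongest positive docs.

    Assumes len(docs) >= 2 * k so the two groups do not overlap.
    """
    ranked = sorted(docs, key=lambda d: score(d, weights))
    negatives = [(d, -1) for d in ranked[:k]]
    positives = [(d, +1) for d in ranked[-k:]]
    return positives + negatives


def train_target_model(labeled: List[Tuple[str, int]]) -> Dict[str, float]:
    """Step 2b: toy learner, weight = positive docs minus negative docs."""
    weights: Dict[str, float] = {}
    for doc, label in labeled:
        for term in set(doc.lower().split()):
            weights[term] = weights.get(term, 0.0) + label
    return weights


english_model = {"casino": 2.0, "free": 1.0,
                 "university": -2.0, "research": -1.5}
en_to_pt = {"casino": "cassino", "free": "gratis",
            "university": "universidade", "research": "pesquisa"}
portuguese_docs = [
    "cassino gratis bonus agora",
    "universidade pesquisa artigo ciencia",
    "cassino bonus",
    "pesquisa artigo",
    "noticias de hoje",
]

translated = translate_model(english_model, en_to_pt)
pseudo_labeled = select_confident(portuguese_docs, translated, k=2)
target_model = train_target_model(pseudo_labeled)
```

In this toy run the target model picks up terms such as `bonus` and `artigo` that never appeared in the translation dictionary, which is the point of the second step: the pseudo-labeled documents let the target-language model learn vocabulary the translated model could not cover.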

KopFIRE: Technology in the cloud
BonFIRE FP7 Future Internet Research and Experimentation testbed
KOPI: A plagiarism detection toolkit
o Translation plagiarism (English and Hungarian)
o Now serving Wikipedia
o Service puts very heavy load on search index (sentence based checks, existing suboptimal code)
o Index ported to several distributed key-value stores
o New alpha version now fed with Web data
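
To show why sentence-based checks load the index heavily, here is a minimal Python sketch of a sentence-level lookup over a key-value store. It is not KOPI's code: the `SentenceIndex` class, the normalization, and the SHA-1 keys are assumptions, and an in-memory dictionary stands in for a distributed key-value store. Translation plagiarism is not covered by this exact-match scheme.

```python
import hashlib
import re
from typing import Dict, List, Set


def sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_key(sentence: str) -> str:
    """Normalize case and punctuation, then hash to a fixed-size key."""
    normalized = " ".join(re.findall(r"\w+", sentence.lower()))
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class SentenceIndex:
    """Key-value index: sentence hash -> ids of documents containing it."""

    def __init__(self) -> None:
        # A dict stands in for a distributed key-value store.
        self._store: Dict[str, Set[str]] = {}

    def add_document(self, doc_id: str, text: str) -> None:
        for sentence in sentences(text):
            self._store.setdefault(sentence_key(sentence), set()).add(doc_id)

    def check(self, text: str) -> Dict[str, int]:
        """Count, per indexed document, how many sentences are shared."""
        hits: Dict[str, int] = {}
        for sentence in sentences(text):  # one store lookup per sentence
            for doc_id in self._store.get(sentence_key(sentence), ()):
                hits[doc_id] = hits.get(doc_id, 0) + 1
        return hits


index = SentenceIndex()
index.add_document("wiki:A", "Budapest is the capital of Hungary. It lies on the Danube.")
index.add_document("wiki:B", "Storm is a stream processing system.")
hits = index.check("As we know, cats are nice. budapest is the capital of hungary! "
                   "It lies on the Danube.")
```

Every sentence of a submitted document costs one lookup, so a long document issues hundreds of reads; key-value stores suit this access pattern because each read is an independent point query.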

Search for events in time

SZTAKI Full Text Search Technology

Trend analysis
Temporal data (e.g. blogs)
Visualizing trends
o Words
o Groups of words
Challenges
o Big data techniques
o Temporal text indexing
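
A temporal text index for word and word-group trends can be sketched in a few lines of Python. This is a small in-memory illustration with hypothetical function names and made-up posts; at the scale the slide refers to, the same per-day term counts would be computed with distributed tools rather than in a single process.

```python
from collections import Counter, defaultdict
from datetime import date
from typing import DefaultDict, Iterable, List, Tuple


def build_temporal_index(
        posts: Iterable[Tuple[date, str]]) -> DefaultDict[date, Counter]:
    """Map each day to the term counts of all posts published that day."""
    index: DefaultDict[date, Counter] = defaultdict(Counter)
    for day, text in posts:
        index[day].update(text.lower().split())
    return index


def trend(index: DefaultDict[date, Counter],
          words: Iterable[str]) -> List[Tuple[date, int]]:
    """Daily frequency of a word, or the summed frequency of a word group."""
    group = {w.lower() for w in words}
    return [(day, sum(index[day][w] for w in group))
            for day in sorted(index)]


posts = [
    (date(2013, 6, 12), "hadoop cluster upgrade"),
    (date(2013, 6, 13), "storm storm hadoop"),
    (date(2013, 6, 13), "storm topology"),
]
index = build_temporal_index(posts)
single_word = trend(index, ["storm"])
word_group = trend(index, ["storm", "hadoop"])
```

The resulting lists of (day, count) pairs are what a trend chart would plot, one series per word or word group.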

Network Influence in Recommenders

Mobility Data Stream processing (Orange D4D)

Stream Processing Architecture Overview
Goal is to hide Storm details from user
Streaming infrastructure pluggable (could combine with Stratosphere)
Persistence layer pluggable
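
One way to read "pluggable" here is that user code depends only on two small interfaces, one for the streaming engine and one for persistence. The Python sketch below illustrates that reading; every class and method name is hypothetical, none of it is the Storm or Stratosphere API, and `LocalEngine` is an in-process stand-in for a real engine adapter.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, List

Event = Dict[str, object]


class StreamEngine(ABC):
    """What the user-facing layer needs from any streaming backend."""

    @abstractmethod
    def run(self, source: Iterable[Event],
            operator: Callable[[Event], Event],
            sink: Callable[[Event], None]) -> None:
        ...


class Persistence(ABC):
    """What the user-facing layer needs from any storage backend."""

    @abstractmethod
    def save(self, record: Event) -> None:
        ...


class LocalEngine(StreamEngine):
    """In-process stand-in; a Storm-backed engine would expose the same method."""

    def run(self, source, operator, sink):
        for event in source:
            sink(operator(event))


class MemoryStore(Persistence):
    """List-backed stand-in for a database or key-value store."""

    def __init__(self) -> None:
        self.records: List[Event] = []

    def save(self, record: Event) -> None:
        self.records.append(record)


class Pipeline:
    """User code sees only this class, never the engine or store details."""

    def __init__(self, engine: StreamEngine, store: Persistence) -> None:
        self.engine = engine
        self.store = store

    def process(self, source: Iterable[Event],
                operator: Callable[[Event], Event]) -> None:
        self.engine.run(source, operator, self.store.save)


store = MemoryStore()
pipeline = Pipeline(LocalEngine(), store)
pipeline.process(
    [{"user": 1, "cell": "A"}, {"user": 2, "cell": "B"}],
    lambda event: {**event, "seen": True},
)
```

Swapping the engine or the store then means passing a different object to `Pipeline`; the operator function supplied by the user stays unchanged.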

Conclusions
SZTAKI covers a chain of research topics
o Web data acquisition
o cleansing and metadata extraction
o search, temporal analytics
o influence detection
o recommendation
Science of Science
o SZTAKI Text Mining Center
o Multilingual classification for quality, genre, spam
o Metadata extraction from pdf publications over the Web
o SZTAKI Plagiarism Detection toolkit
Fully Distributed Learning and Networks
o Distributed and streaming architectures
o Network influence in recommender systems

Recent publications
Pálovics, Benczúr. Temporal influence over the Last.fm social network. IEEE ASONAM 2013
Garzó et al. Cross-Lingual Web Spam Classification. WebQuality 2013
Erdélyi et al. The classification power of Web features. Internet Mathematics, under revision
L. Kocsis, A. György, A. N. Bán. BoostingTree: Parallel Selection of Weak Learners in Boosting, with Application to Ranking. Machine Learning, to appear
Garzó et al. Real-time streaming mobility analytics. NetMob 2013
Göbölös-Szabó, Prytkova, Spaniol, Weikum. Cross-Lingual Data Quality for Knowledge Base Acceleration across Wikipedia Editions. QDB 2012
Eom, Frahm, Benczúr, Shepelyansky. Time evolution of Wikipedia network ranking. Arxiv
C. Sidló, A. Garzó, A. Molnár, A. A. Benczúr. Infrastructures and Bounds for Distributed Entity Resolution. In Proc. QDB, in conjunction with VLDB 2011

Questions?
Zsolt Fekete
Head, Data Mining and Search; member of the “Big Data” lab