Presentation is loading. Please wait.

Presentation is loading. Please wait.

Web and Social Media research at SZTAKI Zsolt Fekete, Andras Benczur Insitute for Computer Science and Control Hungarian Academy of Sciences and Eötvös.

Similar presentations


Presentation on theme: "Web and Social Media research at SZTAKI Zsolt Fekete, Andras Benczur Insitute for Computer Science and Control Hungarian Academy of Sciences and Eötvös."— Presentation transcript:

1 Web and Social Media research at SZTAKI Zsolt Fekete, Andras Benczur Insitute for Computer Science and Control Hungarian Academy of Sciences and Eötvös University Budapest benczur@sztaki.mta.hu http://datamining.sztaki.hu 14 June 2013Web and Social Media TÁMOP-4.2.2.C-11/1/KONV-2012-0013

2 Informatics Laboratory Data Mining and Search Group o Zsolt Fekete, head Data Warehouse and Business Intelligence group o Csaba Sidlo, head Groups within the lab o Lajos Ronyai, Theory of Computing group o Daniel Marx, ERC Starting Grant winner, Parameterized Complexity o Andras Kornai, Human Language Technologies Hardware 50-node old dual core Hadoop 5-node new Hadoop/HBASE 260TB net Isilon Big Data – „Momentum” group Awarded by President of Hungarian Academy of Sciences in 2012

3 SZTAKI Text Mining Center Funded by the President of the Hungarian Academy of Sciences Led by Prof. Laszlo Monostori, Research Laboratory on Engineering & Management Intelligence o Informatics Laboratory (András Benczúr) o Laboratory of Parallel and Distributed Systems (Péter Kacsuk) o Internet Technology Department (István Tétényi) o Department of Distributed Systems (László Kovács) Topics: o trend monitoring; novelty recognition; concept-flow, concept-mapping; o analysis, monitoring and visualization of theme, professional relation, joint authorship, citations, etc. o opinion extraction; semantic annotation; domain ontology development; o identification and resolution of names of persons and organization; o plagiarism detection

4 Connection to FuturICT.hu Work Plan Science of Science o SZTAKI Text Mining Center o Web classification o Metadata extraction o SZTAKI Plagiarism Detection toolkit Fully Distributed Learning (and Networks) o Recommender systems o Distributed and streaming architectures o Network influence in recommender systems

5 Automatic metadata extraction Articles in pdf form Extracting o Title o Authors o References o Etc Used techniques o Computing features (text, visual info) o Machine Learning: SVM, CRF

6 Save resources, select quality and topic Legal regulation (porn, illicit content) Web scale data (Test: ClueWeb09 25TB – 0.5 Billion English language docs) JulienPhilippe MasanesRigaux Internet Memory Paris Cross-Lingual Web Spam Classification. Garzó, Daróczy, Kiss, Siklósi, Benczúr. WebQuality 2013 (@WWW) The classification power of Web features. Erdelyi, Benczur, Daroczy, Garzo, Kiss, Siklosi Internet Mathematics, under revision Crosslingual Web Classification

7 Expensive human labeling task language by language? How can models be “translated”? Terms in the English model translated into Portuguese to classify in the target language. Strongest positive and negative predictions are used for training a model in the target language. Crosslingual Web Classification

8 KopFIRE: Technology in the cloud BonFIRE FP7 Future Internet Research and Experimentation testbed KOPI: A plagiarism detection toolkit o http://kopi.sztaki.hu/ http://kopi.sztaki.hu/ o Translation plagiarism (English and Hungarian) o Now serving Wikipedia o Service puts very heavy load on search index (sentence based checks, existing suboptimal code) o Index ported to several distributed key-value stores o New alpha version now fed with Web data

9 Search for events in time

10

11

12 SZTAKI Full Text Search Technology

13 Trend analysis Temporal data (eg. blogs) Visualizing trends o Words o Groups of words Challenges o Big data techniques o Temporal text indexing

14 Network Influence in Recommenders

15 Mobility Data Stream processing (Orange D4D)

16 Stream Processing Architecture Overview Goal is to hide Storm details from user Streaming infrastructure pluggable (could combine with Stratosphere) Persistence layer pluggable

17 Conclusions SZTAKI covers a chain of research topics o Web data acquisition o cleansing and metadata extraction o search, temporal analytics o influence detection o recommendation Science of Science o SZTAKI Text Mining Center o Multilingual classification for quality, genre, spam o Metadata extraction from pdf publications over the Web o SZTAKI Plagiarism Detection toolkit Fully Distributed Learning and Networks o Distributed and streaming architectures o Network influence in recommender systems

18 Recent publications Pálovics,Benczúr. Temporal influence over the Last.fm social network. IEEE ASONAM 2013 Garzó et al., Cross-Lingual Web Spam Classification. WebQuality 2013 Erdélyi et al., The classication power of Web features. Internet Mathematics, under revision L. Kocsis, A. György, A. N. Bán., BoostingTree: Parallel Selection of Weak Learners in Boosting, with Application to Ranking. Machine Learning, to appear. Garzo et al., Real-time streaming mobility analytics. NetMob 2013 Göbölös-Szabó, Prytkova, Spaniol, Weikum. Cross-Lingual Data Quality for Knowledge Base Acceleration across Wikipedia Editions. QDB 2012 Eom, Frahm, Benczur, Shepelyansky. Time evolution of Wikipedia network ranking. Arxiv, 2013. C. Sidló, A. Garzó, A. Molnár, A.A. Benczúr, Infrastructures and Bound for Distributed Entity Resolution, in Proc. QDB in conj. VLDB 2011.

19 Questions? Zsolt Fekete Head, Data Mining and Search member of the “Big Data” lab http://datamining.sztaki.hu/ zsfekete@sztaki.mta.huzsfekete@sztaki.mta.hu 14 June 2013Web and Social Media


Download ppt "Web and Social Media research at SZTAKI Zsolt Fekete, Andras Benczur Insitute for Computer Science and Control Hungarian Academy of Sciences and Eötvös."

Similar presentations


Ads by Google