Big Data at BITEM Research Group
(Text|Web) Mining Research Group
– Research projects: Digital Libraries, Web, Personalized medicine, Patent analytics, Consumer Analytics, Pharmacovigilance, Clinical trials…
Specialised in (semi|un)structured data
– We like text, text and more text
– Especially on the noisy/dirty Web
Technological expertise: CouchDB replication, SolrCloud (distributed indexing and search), indexing/searching in SSD/HDFS/Hadoop, SPARQL endpoints…

Pharmacovigilance on Big Social Media Data
Dynamic and Real Time Data Analysis
[Diagram: Drugbank, Twitter API, RSS, Forum; CouchDB; Cleaning, Normalisation; Trends Analysis, Correlation Analysis, Novelty Detection]
– 26’000 per day
– 19’000 drug names checked every 10 min
– 7 M docs in 9 months
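
The volume figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes 30-day months, assumes that "26’000 per day" refers to collected documents, and assumes that each 10-minute cycle checks all 19’000 drug names; none of these assumptions is stated on the slide, and the function names are illustrative only.

```python
# Rough consistency check of the slide's volume figures.
# Assumptions (not stated on the slide): 30-day months, and every
# 10-minute cycle checks the full list of drug names.

DAYS_PER_MONTH = 30  # assumption


def docs_per_day(total_docs, months, days_per_month=DAYS_PER_MONTH):
    """Average number of documents collected per day."""
    return total_docs / (months * days_per_month)


def checks_per_day(n_names, interval_minutes):
    """Number of name look-ups per day if all names are checked each cycle."""
    cycles_per_day = (24 * 60) // interval_minutes
    return n_names * cycles_per_day


daily_docs = docs_per_day(7_000_000, 9)
daily_checks = checks_per_day(19_000, 10)
print(f"{daily_docs:.0f} docs/day, {daily_checks} name checks/day")
```

Under these assumptions, 7 M docs in 9 months comes out at roughly 26’000 per day, which matches the figure on the slide.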

Managing the data deluge for protein annotation
– Protein annotation based on literature by curators
[Diagram: articles, annotated articles, GOA]
– Manual annotation planned for 2045! (Baumgartner et al)
– 40’000 concepts [Big-scale Multiclass Multilabel Classifier]
– Lazy learning!
– Machine Learning based on Information Retrieval methods
– Assisting curators
– Macro reading of literature
– Profiling any textual content
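
To illustrate what "lazy learning" based on Information Retrieval methods can look like, here is a minimal k-nearest-neighbour sketch: annotated articles are only indexed (no model is trained), and a new text inherits the labels of the most similar annotated articles. This is not the group's actual system; the class name `LazyAnnotator`, the TF-IDF weighting, the whitespace tokenizer, the toy articles and the label names are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Naive whitespace tokenizer (assumption: real systems do much more)."""
    return text.lower().split()


class LazyAnnotator:
    """k-NN multilabel annotator over TF-IDF vectors (illustrative only)."""

    def __init__(self, annotated_docs):
        # annotated_docs: list of (text, set_of_labels)
        self.docs = [(Counter(tokenize(t)), labels) for t, labels in annotated_docs]
        n = len(self.docs)
        df = Counter()
        for tf, _ in self.docs:
            df.update(tf.keys())
        # Smoothed IDF keeps every weight finite and positive.
        self.idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
        self.vectors = [self._weigh(tf) for tf, _ in self.docs]

    def _weigh(self, tf):
        """L2-normalised TF-IDF vector; unknown terms are dropped."""
        vec = {t: c * self.idf[t] for t, c in tf.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def predict(self, text, k=3, top_n=2):
        """Return up to top_n labels voted by the k most similar articles."""
        query = self._weigh(Counter(tokenize(text)))
        sims = []
        for vec, (_, labels) in zip(self.vectors, self.docs):
            sim = sum(w * vec.get(t, 0.0) for t, w in query.items())  # cosine
            sims.append((sim, labels))
        sims.sort(key=lambda pair: pair[0], reverse=True)
        scores = defaultdict(float)
        for sim, labels in sims[:k]:
            for label in labels:
                scores[label] += sim  # similarity-weighted vote
        ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
        return [label for label, score in ranked[:top_n] if score > 0]


# Toy data with hypothetical label names (not real GO identifiers).
annotator = LazyAnnotator([
    ("kinase phosphorylates protein substrate", {"kinase_activity"}),
    ("kinase activity regulates signaling", {"kinase_activity", "signaling"}),
    ("membrane transport of ions across channel", {"transport"}),
])
print(annotator.predict("kinase signaling", k=2))
```

Because nothing is trained up front, adding newly annotated articles only means indexing them, which is one reason a lazy approach can stay practical with tens of thousands of concepts.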

Patent retrieval

The real situation (0.5-1 TB)
Database: 13 million patents
Step            Duration    Output                    Size
Extraction      33 days     XML patents               [value missing] TB
Normalization   33 days     XML patents + metadata    [value missing] TB
Indexing        5 days      Index                     0.1 TB

Experiments
Database: a sample of 1 million patents
Step            Duration    Output                    Size
Extraction      2.5 days    XML patents               17 GB
Normalization   2.5 days    XML patents + metadata    18 GB
Indexing        10 hours    Index                     3 GB
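
The durations for the full collection are close to what a simple linear extrapolation from the 1-million-patent sample would give. The sketch below assumes processing time scales linearly with the number of patents, which the slide does not state; the names `SAMPLE_DAYS` and `extrapolate` are illustrative.

```python
# Linear extrapolation from the 1-million-patent sample to 13 million patents.
# Assumption (not stated on the slide): time scales linearly with patent count.

SAMPLE_PATENTS = 1_000_000
FULL_PATENTS = 13_000_000

# Durations measured on the sample, converted to days.
SAMPLE_DAYS = {
    "extraction": 2.5,
    "normalization": 2.5,
    "indexing": 10 / 24,  # 10 hours
}


def extrapolate(sample_value, sample_size=SAMPLE_PATENTS, full_size=FULL_PATENTS):
    """Scale a value measured on the sample up to the full collection."""
    return sample_value * full_size / sample_size


estimates = {step: extrapolate(days) for step, days in SAMPLE_DAYS.items()}
for step, days in estimates.items():
    print(f"{step}: about {days:.1f} days")
```

This gives about 32.5 days for extraction and for normalization, and about 5.4 days for indexing, in line with the 33 days and 5 days reported for the real situation.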