Big text data mining
Tomáš Jurníček, Jakub Jůza, Lenka Kmeťová
Introduction
- Text data analysis
- Sophisticated analytic methods
- Information extraction from data
Big data and data mining
- Big data: datasets of large size and complexity
- Companies have large amounts of data that needs to be analyzed
- Problem: natural language
- Data mining process: data cleaning, data integration, data selection, mining methods, evaluating results (a toy sketch of these stages follows)
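To make the stages concrete, here is a minimal Python sketch of the process on invented toy data; it is an illustration only, not a method from the slides, and the corpus and helper names are made up.

```python
import re
from collections import Counter

# Hypothetical raw text data standing in for a company's documents.
raw_docs = [
    "Patient reports MILD pain!!",
    "  patient reports mild pain ",
    "Patient sleeps well, no pain reported.",
]

# Data cleaning: normalize case, strip punctuation and extra whitespace.
def clean(doc):
    doc = doc.lower()
    doc = re.sub(r"[^a-z\s]", " ", doc)
    return " ".join(doc.split())

cleaned = [clean(d) for d in raw_docs]

# Data integration + selection: drop exact duplicates produced by cleaning.
selected = list(dict.fromkeys(cleaned))

# Mining method: a trivial frequent-term count.
counts = Counter(word for doc in selected for word in doc.split())

# Evaluating results: inspect the most frequent terms.
print(counts.most_common(5))
```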
Methods
- Information extraction: key phrases and relations from unstructured text
- Categorization: assign categories to documents
- Clustering: group similar documents into clusters (a sketch follows this list)
- Visualization: present data in a form understandable for humans
- Summarization: express only the core information of long documents
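As one concrete instance of the clustering method, here is a minimal sketch using TF-IDF vectors and k-means; it assumes scikit-learn is available, and the documents are invented examples, not data from the slides.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical documents; in practice these come from the mining pipeline.
docs = [
    "stock markets fell sharply today",
    "investors worry about market volatility",
    "the team won the championship game",
    "fans celebrated the football victory",
]

# Turn unstructured text into numeric vectors (TF-IDF weighting).
vectors = TfidfVectorizer().fit_transform(docs)

# Group similar documents into clusters with k-means.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for doc, label in zip(docs, labels):
    print(label, doc)
```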
Tools
Large companies like Facebook or LinkedIn work on open-source projects. For example:
- Apache Hadoop - for data-heavy distributed applications
- Apache S4 - for continuous processing of data streams
- Storm (Twitter) - for streaming distributed data
Open-source tools for big data mining: Apache Mahout, R, MOA, …
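Hadoop is built around the MapReduce model; the following is a toy plain-Python illustration of that model (not the Hadoop API), with made-up input lines.

```python
from itertools import groupby

# Made-up input lines standing in for a distributed text dataset.
lines = ["big data needs mining", "text mining needs big data"]

# Map: emit (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the pairs by key.
mapped.sort(key=lambda kv: kv[0])
grouped = groupby(mapped, key=lambda kv: kv[0])

# Reduce: sum the counts for each word.
counts = {word: sum(v for _, v in pairs) for word, pairs in grouped}
print(counts)
```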
Nursing records
- A specific area of use for big data mining
- Electronic Medical Record (EMR) = information about patients
- This data is not used to its full potential
- Information is written in an unstructured style and expressions are highly subjective -> data mining is more complicated
Nursing records
- Results analyzed with KeyGraph
- KeyGraph finds associations and frequent terms that represent basic concepts in the data (a simplified sketch follows)
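KeyGraph builds a co-occurrence graph over terms; the sketch below shows only a much-simplified frequent-term and co-occurrence step, not the full algorithm, and the nursing-note fragments are invented.

```python
from collections import Counter
from itertools import combinations

# Invented nursing-note fragments; real EMR text is far messier.
notes = [
    "patient calm slept well",
    "patient anxious pain at night",
    "pain medication given patient calm",
]

# Frequent terms: high-frequency words serve as base nodes of the graph.
term_counts = Counter(w for note in notes for w in note.split())
frequent = {w for w, c in term_counts.most_common(4)}

# Associations: count how often frequent terms co-occur within a note.
cooc = Counter()
for note in notes:
    words = sorted(set(note.split()) & frequent)
    for pair in combinations(words, 2):
        cooc[pair] += 1

print("frequent terms:", frequent)
print("strongest associations:", cooc.most_common(3))
```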
Future
There are a lot of challenges:
- Statistical significance - ensuring the quality of statistical results for large data sets
- Distributed mining - more parallelized methods
- Time evolving data - data changes over time
- Hidden big data - a lot of data is unlabeled and unstructured; currently, only 3% of data is usable for data mining!
Conclusion
We are at the beginning of a new era, in which big text data mining will make it possible to discover new, currently unknown knowledge.