PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems IEEE Big Data 2014
Agenda
– Introduction
– Motivation
– Model Structure
– Progressive Learning
– Use Cases
  – Automate MS Annotation (Multi-label Classification)
  – Latent Semantic Discovery
– Conclusion
Introduction
Probabilistic graphical models (PGMs) consist of a structural model and a set of conditional probabilities.
Graphical models can be classified into two major categories:
– (1) directed graphical models (Bayesian networks)
– (2) undirected graphical models (Markov Random Fields)
Motivation
[Figure: MS1, MS2, and MS3 levels with fragments (Frag1, Frag2, ..) and GOG1, GOG2, …]
… * 2,979,334 = 3,873,134,200
Model Structure
[Figure: graph with nodes GOG1 and GOG2 and fragment nodes F1 to F11, arranged over the MS1, MS2, and MS3 levels]
P(GOG1 | F1, F3, F7) = P(GOG1 | F1) * P(GOG1 | F3) * P(F3 | F7) = 50/50 * 20/60 * 10/25
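As a worked check, the arithmetic in the Model Structure example can be carried through to a single value. Reading each ratio as an edge co-occurrence count divided by a node total is an assumption based on the numbers shown on the slide.

```latex
\begin{align*}
P(\mathrm{GOG1} \mid F1, F3, F7)
  &= P(\mathrm{GOG1} \mid F1)\, P(\mathrm{GOG1} \mid F3)\, P(F3 \mid F7) \\
  &= \frac{50}{50} \cdot \frac{20}{60} \cdot \frac{10}{25}
   = 1 \cdot \frac{1}{3} \cdot \frac{2}{5}
   = \frac{2}{15} \approx 0.133
\end{align*}
```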
Progressive Learning
This learning technique is very attractive in the big data age for the following reasons:
– Training the model does not require processing all data upfront.
– The model can easily learn from new data without re-including the previous training data.
– Training can be distributed instead of being done in one long-running session.
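The three properties of progressive learning can be illustrated with a minimal Python sketch. It assumes the model stores only co-occurrence counts, so new batches are folded in by addition and separately trained models are combined by summing counts. The class `EdgeCountModel`, its method names, and the sample batches are hypothetical and are not taken from the PGMHD implementation.

```python
from collections import defaultdict


class EdgeCountModel:
    """Hypothetical count-based model: keeps only running co-occurrence counts."""

    def __init__(self):
        self.edge_counts = defaultdict(int)   # (parent, child) -> co-occurrences
        self.child_totals = defaultdict(int)  # child -> total occurrences

    def update(self, observations):
        """Fold a new batch of (parent, child) pairs into the running counts."""
        for parent, child in observations:
            self.edge_counts[(parent, child)] += 1
            self.child_totals[child] += 1

    def merge(self, other):
        """Add in counts learned elsewhere, e.g. by another worker."""
        for edge, n in other.edge_counts.items():
            self.edge_counts[edge] += n
        for child, n in other.child_totals.items():
            self.child_totals[child] += n

    def prob_parent_given_child(self, parent, child):
        """Edge count divided by the child's total, or 0.0 if the child is unseen."""
        total = self.child_totals.get(child, 0)
        if total == 0:
            return 0.0
        return self.edge_counts.get((parent, child), 0) / total


# Made-up batches, for illustration only.
batch1 = [("GOG1", "F1"), ("GOG1", "F3"), ("GOG2", "F3")]
batch2 = [("GOG2", "F3"), ("GOG2", "F3")]

# Incremental: the second update does not replay batch1.
model = EdgeCountModel()
model.update(batch1)
model.update(batch2)

# Distributed: two workers train separately, then their counts are summed.
worker_a, worker_b = EdgeCountModel(), EdgeCountModel()
worker_a.update(batch1)
worker_b.update(batch2)
worker_a.merge(worker_b)

print(model.prob_parent_given_child("GOG1", "F3"))
print(dict(worker_a.edge_counts) == dict(model.edge_counts))
```

Because the only state is a set of additive counts, the incremental and the distributed routes end with identical counts in this sketch.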
Automate MS Annotation (Multi-label Classification)
Data Set Includes:
Item                 Count
Scan                 1974
Peak Edges           10743
Root                 450
MS2 Fragment Node    5983
MS3 Fragment Node    201
Results
Latent Semantic Discovery
[Figure: classes Java Developer, .NET Developer, Nurse, Health Care linked to terms Java, J2EE, C#, Care giver, RN, Senior Home]
P(Java, J2EE | Java Developer) = P(Java | Java Developer) * P(J2EE | Java Developer) = 5/7 * 10/10
P(Java, C# | Java Dev, .NET Dev) = P(Java | Java Dev) * P(Java | .NET Dev) * P(C# | Java Dev) * P(C# | .NET Dev)
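The two products above can be sketched in Python as follows. The counts for Java and J2EE under Java Developer (5 of 7 and 10 of 10) come from the slide; every other count, along with the names `edge_counts`, `term_totals`, `edge_ratio`, and `joint_score`, is a hypothetical placeholder chosen only so the second product can be evaluated. Dividing each edge count by the term's total is an assumption inferred from the differing denominators on the slide.

```python
from fractions import Fraction
from math import prod

# (class, term) -> co-occurrence count.
edge_counts = {
    ("Java Developer", "Java"): 5,    # from the slide (5/7)
    ("Java Developer", "J2EE"): 10,   # from the slide (10/10)
    (".NET Developer", "Java"): 2,    # hypothetical
    ("Java Developer", "C#"): 2,      # hypothetical
    (".NET Developer", "C#"): 8,      # hypothetical
}
# term -> total occurrences; the C# total is hypothetical.
term_totals = {"Java": 7, "J2EE": 10, "C#": 10}


def edge_ratio(term, cls):
    """Edge count over the term's total, as in 5/7 on the slide."""
    return Fraction(edge_counts.get((cls, term), 0), term_totals[term])


def joint_score(terms, classes):
    """Product of edge ratios over every (term, class) pair."""
    return prod(edge_ratio(t, c) for t in terms for c in classes)


single = joint_score(["Java", "J2EE"], ["Java Developer"])
multi = joint_score(["Java", "C#"], ["Java Developer", ".NET Developer"])
print(single)  # 5/7
print(multi)   # 8/245 with the hypothetical counts above
```

Using `Fraction` keeps the ratios exact, which makes the result easy to compare with the hand-written product on the slide.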
Results
Conclusion
– We propose an efficient and scalable probabilistic graphical model for massive hierarchical data (PGMHD).
– We successfully applied PGMHD in the bioinformatics domain to automatically classify and annotate high-throughput mass spectrometry data.
– We successfully applied the model to large-scale latent semantic discovery, using 1.6 billion search log entries provided by CareerBuilder.com within a Hadoop Map/Reduce framework.
Questions