CMSC423: Bioinformatic Algorithms, Databases and Tools

Slides:



Advertisements
Similar presentations
Request Dispatching for Cheap Energy Prices in Cloud Data Centers
Advertisements

SpringerLink Training Kit
Luminosity measurements at Hadron Colliders
From Word Embeddings To Document Distances
Choosing a Dental Plan Student Name
Virtual Environments and Computer Graphics
Chương 1: CÁC PHƯƠNG THỨC GIAO DỊCH TRÊN THỊ TRƯỜNG THẾ GIỚI
THỰC TIỄN KINH DOANH TRONG CỘNG ĐỒNG KINH TẾ ASEAN –
D. Phát triển thương hiệu
NHỮNG VẤN ĐỀ NỔI BẬT CỦA NỀN KINH TẾ VIỆT NAM GIAI ĐOẠN
Điều trị chống huyết khối trong tai biến mạch máu não
BÖnh Parkinson PGS.TS.BS NGUYỄN TRỌNG HƯNG BỆNH VIỆN LÃO KHOA TRUNG ƯƠNG TRƯỜNG ĐẠI HỌC Y HÀ NỘI Bác Ninh 2013.
Nasal Cannula X particulate mask
Evolving Architecture for Beyond the Standard Model
HF NOISE FILTERS PERFORMANCE
Electronics for Pedestrians – Passive Components –
Parameterization of Tabulated BRDFs Ian Mallett (me), Cem Yuksel
L-Systems and Affine Transformations
Some aspect concerning the LMDZ dynamical core and its use
Bayesian Confidence Limits and Intervals
实习总结 (Internship Summary)
Current State of Japanese Economy under Negative Interest Rate and Proposed Remedies Naoyuki Yoshino Dean Asian Development Bank Institute Professor Emeritus,
Front End Electronics for SOI Monolithic Pixel Sensor
Face Recognition Monday, February 1, 2016.
Solving Rubik's Cube By: Etai Nativ.
CS284 Paper Presentation Arpad Kovacs
انتقال حرارت 2 خانم خسرویار.
Summer Student Program First results
Theoretical Results on Neutrinos
HERMESでのHard Exclusive生成過程による 核子内クォーク全角運動量についての研究
Wavelet Coherence & Cross-Wavelet Transform
yaSpMV: Yet Another SpMV Framework on GPUs
Creating Synthetic Microdata for Higher Educational Use in Japan: Reproduction of Distribution Type based on the Descriptive Statistics Kiyomi Shirakawa.
MOCLA02 Design of a Compact L-­band Transverse Deflecting Cavity with Arbitrary Polarizations for the SACLA Injector Sep. 14th, 2015 H. Maesaka, T. Asaka,
Hui Wang†*, Canturk Isci‡, Lavanya Subramanian*,
Fuel cell development program for electric vehicle
Overview of TST-2 Experiment
Optomechanics with atoms
داده کاوی سئوالات نمونه
Inter-system biases estimation in multi-GNSS relative positioning with GPS and Galileo Cecile Deprez and Rene Warnant University of Liege, Belgium  
ლექცია 4 - ფული და ინფლაცია
10. predavanje Novac i financijski sustav
Wissenschaftliche Aussprache zur Dissertation
FLUORECENCE MICROSCOPY SUPERRESOLUTION BLINK MICROSCOPY ON THE BASIS OF ENGINEERED DARK STATES* *Christian Steinhauer, Carsten Forthmann, Jan Vogelsang,
Particle acceleration during the gamma-ray flares of the Crab Nebular
Interpretations of the Derivative Gottfried Wilhelm Leibniz
Advisor: Chiuyuan Chen Student: Shao-Chun Lin
Widow Rockfish Assessment
SiW-ECAL Beam Test 2015 Kick-Off meeting
On Robust Neighbor Discovery in Mobile Wireless Networks
Chapter 6 并发:死锁和饥饿 Operating Systems: Internals and Design Principles
You NEED your book!!! Frequency Distribution
Y V =0 a V =V0 x b b V =0 z
Fairness-oriented Scheduling Support for Multicore Systems
Climate-Energy-Policy Interaction
Hui Wang†*, Canturk Isci‡, Lavanya Subramanian*,
Ch48 Statistics by Chtan FYHSKulai
The ABCD matrix for parabolic reflectors and its application to astigmatism free four-mirror cavities.
Measure Twice and Cut Once: Robust Dynamic Voltage Scaling for FPGAs
Online Learning: An Introduction
Factor Based Index of Systemic Stress (FISS)
What is Chemistry? Chemistry is: the study of matter & the changes it undergoes Composition Structure Properties Energy changes.
THE BERRY PHASE OF A BOGOLIUBOV QUASIPARTICLE IN AN ABRIKOSOV VORTEX*
Quantum-classical transition in optical twin beams and experimental applications to quantum metrology Ivano Ruo-Berchera Frascati.
The Toroidal Sporadic Source: Understanding Temporal Variations
FW 3.4: More Circle Practice
ارائه یک روش حل مبتنی بر استراتژی های تکاملی گروه بندی برای حل مسئله بسته بندی اقلام در ظروف
Decision Procedures Christoph M. Wintersteiger 9/11/2017 3:14 PM
Online Social Networks and Media
Limits on Anomalous WWγ and WWZ Couplings from DØ
Presentation transcript:

CMSC423: Bioinformatic Algorithms, Databases and Tools Data clustering

Why data clustering? What does this mean? CMSC423 Fall 2012

Data clustering... CMSC423 Fall 2012 >F4BT0V001CZSIM rank=0000138 x=1110.0 y=2700.0 length=57 ACTGCTCTCATGCTGCCTCCCGTAGGAGTGCCTCCCTGAGCCAGGATCAAACGTCTG >F4BT0V001BBJQS rank=0000155 x=424.0 y=1826.0 length=47 ACTGACTGCATGCTGCCTCCCGTAGGAGTGCCTCCCTGCGCCATCAA >F4BT0V001EDG35 rank=0000182 x=1676.0 y=2387.0 length=44 ACTGACTGCATGCTGCCTCCCGTAGGAGTCGCCGTCCTCGACNC >F4BT0V001D2HQQ rank=0000196 x=1551.0 y=1984.0 length=42 ACTGACTGCATGCTGCCTCCCGTAGGAGTGCCGTCCCTCGAC >F4BT0V001CM392 rank=0000206 x=966.0 y=1240.0 length=82 AANCAGCTCTCATGCTCGCCCTGACTTGGCATGTGTTAAGCCTGTAGGCTAGCGTTCATC CCTGAGCCAGGATCAAACTCTG >F4BT0V001EIMFX rank=0000250 x=1735.0 y=907.0 length=46 ACTGACTGCATGCTGCCTCCCGTAGGAGTGTCGCGCCATCAGACTG >F4BT0V001ENDKR rank=0000262 x=1789.0 y=1513.0 length=56 GACACTGTCATGCTGCCTCCCGTAGGAGTGCCTCCCTGAGCCAGGATCAAACTCTG >F4BT0V001D91MI rank=0000288 x=1637.0 y=2088.0 length=56 ACTGCTCTCATGCTGCCTCCCGTAGGAGTGCCTCCCTGAGCCAGGATCAAACTCTG >F4BT0V001D0Y5G rank=0000341 x=1534.0 y=866.0 length=75 GTCTGTGACATGCTGCCTCCCGTAGGAGTCTACACAAGTTGTGGCCCAGAACCACTGAGC CAGGATCAAACTCTG >F4BT0V001EMLE1 rank=0000365 x=1780.0 y=1883.0 length=84 ACTGACTGCATGCTGCCTCCCGTAGGAGTGCCTCCCTGCGCCATCAATGCTGCATGCTGC TCCCTGAGCCAGGATCAAACTCTG CMSC423 Fall 2012

Data clustering... Given a collection of data-points can we identify any patterns? Data-points: DNA sequences Gene expression levels Organism abundances in an environment Vitals Patterns: do certain points group together? CMSC423 Fall 2012

Example 16S rRNA sequences from gut microbiome How many types of organisms are there? Assumption: organisms that are similar have similar 16S sequences Operational Taxonomic Unit – sequences within 2% distance from each other Next step: given OTU abundances for a number of patients, do the patients cluster by age/disease/etc? CMSC423 Fall 2012

Types of clustering algorithms Agglomerative Start with single observations Group similar observations into the same cluster Divisive All datapoints start in the same cluster Iteratively divide cluster until you find good clustering Hierarchical Build a tree – leaves are datapoints, internal nodes represent clusters CMSC423 Fall 2012

Measures of goodness of clustering Homogeneity All points in a cluster must be similar Separation Points in different clusters are disimilar CMSC423 Fall 2012

Microarray (gene abundance) clustering Each gene can be viewed as an array of numbers expression of gene at different time-points expression of gene in different conditions (normal, variants of a disease, etc.) Each time-point or tissue sample can also be viewed as an array of numbers expression levels for all genes Basic idea: cluster genes and/or samples to highlight genes involved in disease CMSC423 Fall 2012

Hierarchical clustering Need: definition of distance between data-points (e.g. individual genes). Some measures: Euclidean distance Manhattan distance Pearson correlation Angle between vectors (centered Pearson correlation) Clustering algorithm group together data-points that are most similar repeat... 𝐷 𝑥,𝑦 = 𝑖 𝑥 𝑖 − 𝑦 𝑖 2 𝐷 𝑥,𝑦 = 𝑖 ∣ 𝑥 𝑖 − 𝑦 𝑖 ∣ 𝐷 𝑥,𝑦 = 𝐸 𝑥− μ 𝑥 𝑦− μ 𝑦 σ 𝑥 σ 𝑦 CMSC423 Fall 2012

Hierarchical clustering Key element: how do you compute distance between two clusters, or a point and a cluster ? UPGMA/average neighbor (average linkage) average distance between all genes in the two clusters Furthest neighbor (complete linkage) largest distance between all genes in clusters Nearest neighbor (single linkage) smallest distance between all genes in clusters Ward's distance inter-cluster distance is variance of inter-gene distances CMSC423 Fall 2012

Hierarchical clustering...cont Irrespective of distance choice, algorithm is the same 1. compute inter-gene/cluster distances 2. join together pair of genes/clusters with smallest distance 3. recompute distances to include the newly created cluster 4. repeat until all points in one cluster Output of program is a tree Cluster sets – defined by “cut” nodes – any subset of internal tree nodes defines a set of clusters – the sets of leaves in the corresponding subtrees Choice of cut can be tricky – usually problem-specifi Note: all this can be done easily in R CMSC423 Fall 2012

Example: gut microbiome in children CMSC423 Fall 2012

k-means clustering Goal: split data into exactly k clusters Basic algorithm: Create k arbitrary clusters - pick k points as cluster centers and assign each other point to the closest center Re-compute the center of each cluster Re-assign points to clusters Repeat Another approach: pick a point at and see if moving it to a different cluster will improve the quality of the overall solution. Repeat! CMSC423 Fall 2012

K-means clustering... K=2 CMSC423 Fall 2012

Other clustering approaches Principal component analysis Identify a direction (vector V) such that the projection of data on V has maximum variance (first principal component) repeat (vector V' != V such that project of data on V' has maximum variance) Usually plot the first 2 or 3 principal components CMSC423 Fall 2012

Other clustering approaches Self-organizing maps Neural-network based approach Output layer of network are points in a low-dimensional space Graph theoretic Points are connected by edges representing strength of "connection" (e.g. similarity or dissimilarity) Pick clusters such that number of "similar" edges spanning boundaries is minimized, or number of "dissimilar" edges within each cluster is minimized Markov chain clustering basic idea – a random walk through a graph will stay within a local strongly connected region CMSC423 Fall 2012

Self organizing map of genomes http://www.jamstec.go.jp/esc/esc/publication/journal/jes_vol.6/pdf/JES6_22-Abe.pdf CMSC423 Fall 2012

Questions Compare nearest, farthest, average, and Ward's cluster distances in terms of how compact the resulting clusters might be. CMSC423 Fall 2012