Unsupervised Learning: Clustering & Model Fitting.

Administrivia. Reminder: office hours are truncated tomorrow, running from “whenever I get in” until noon. HW3 due: Dec 2. Have an excellent Turkey Day!

The art of presentations. Goals of your presentation: similar to the goals of the final paper. Tell us: What was your problem? Why is it important? Why should we care about it? What has been done before (briefly!)? Why was it inadequate, or how could it be extended? What did you do that’s new? How do you know your approach works (math, experiments, expert evaluation, etc.)?

The art of presentations. Do NOT tell us: every detail of every experiment (choose the parts to show us carefully; each thing you show us should be informative about your conclusions; the place for excruciating detail is the paper), every step of all the math, or every background reference. Focus on the “big picture” and the “take-home message”. Listeners will take home ~10 bytes. Make sure they’re the right 10 bytes!

The unsupervised problem. Given: a set of data points. Find: a good description of the data.

Typical tasks. Given: many measurements of flowers. What different breeds are there? Given: many microarray measurements. Which genes act the same? Given: a bunch of documents. What topics are there? How are they related? Which are “good” essays and which are “bad”? Given: long sequences of GUI events. What tasks was the user working on? Are they “flat” or hierarchical?

General unsupervised tasks. Given: a data matrix. Which points are similar? How do points cluster together? How many groups are there? What is a statistical description of the distribution of the data?

5 minutes of bioinformatics. Gene microarray (a.k.a. genechip, DNA chip, etc.): measures thousands (10s or 100s of thousands) of genes simultaneously. Critical tool in bioinformatics: understand the function of genes, networks of gene activity, response to stimuli, etc. Leads to some very nasty analysis problems...

5 minutes of bioinformatics. Back to the cell (& biology 101)... chromosome → transcription → messenger RNA (mRNA) → translation → protein product. http://nobelprize.org/medicine/laureates/1993/illpres/dna-rna.html

5 minutes of bioinformatics. Only mRNA can be (easily) measured. When a gene is “activated”, mRNA is produced. A gene can be “upregulated” or “downregulated” to produce different concentrations of mRNA. It can be active or inactive under different conditions: external stimuli (food, pH, temperature, viral infection, etc.) and internal metabolic processes (cell cycles, pathways, etc.). mRNA measurements are correlated with cell activity.

Measuring many mRNA... Population A of cells → mRNA pool A. Population B of cells → mRNA pool B.

5 minutes of bioinformatics. Pool A and Pool B → µarray (microarray) slide: many wells w/ complementary DNA fragments. DNA hybridization.

5 minutes of bioinformatics. Irradiation → imaging → data vector [x_1, x_2, ..., x_d].

5 minutes of bioinformatics. Measure populations over time: monitor development of the cell, metabolic processes, response to introduction of a stimulus, etc. This gives a time series of data: a matrix whose two dimensions are # timepoints and # genes. Can consider either rows or columns to be “points”, depending on what you want to know.
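The rows-versus-columns choice can be made concrete with a tiny sketch. The matrix below is made-up illustration data, and its orientation (one row per gene, one column per timepoint) is an assumption; the slide does not say which way the matrix is laid out.

```python
# Hypothetical expression matrix (made-up values, for illustration only).
# Assumed layout: one row per gene, one column per timepoint.
expression = [
    [0.1, 0.5, 0.9],  # gene 0
    [0.2, 0.4, 1.0],  # gene 1
    [1.0, 0.6, 0.1],  # gene 2
    [0.9, 0.5, 0.2],  # gene 3
]

# Rows as points: each gene is a point in "timepoint space".
# Clustering these asks "which genes act the same over time?"
gene_points = [tuple(row) for row in expression]

# Columns as points: each timepoint is a point in "gene space".
# Clustering these asks "which timepoints look like the same stage?"
timepoint_points = [tuple(col) for col in zip(*expression)]

print(len(gene_points), "gene points of dimension", len(gene_points[0]))
print(len(timepoint_points), "timepoint points of dimension", len(timepoint_points[0]))
```

Either list can then be handed to the same clustering routine; only the question being asked changes.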

Analysis of µarray data. Huge amount of info in individual or series microarray data. Key questions: Which genes “act” the same? Which genes are correlated w/ a stimulus, instantaneously or over time? What metabolic “stages” are there? When are genes in one stage or another?

Similarity & distance. Most clustering algorithms are based on distances between points. Recall: a distance (metric) function d(x_1, x_2) satisfies: Symmetry: d(x_1, x_2) = d(x_2, x_1). Identity: d(x_1, x_1) = 0. Triangle inequality: d(x_1, x_3) <= d(x_1, x_2) + d(x_2, x_3). E.g., Euclidean distance, kernel distance, etc. Sometimes we have a natural similarity function instead; this can usually be converted to a metric or semi-metric.
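The three properties can be spot-checked numerically on a sample of points. The sketch below is only an empirical check, not a proof, and the helper names (`is_metric_on`, `similarity_to_distance`) are hypothetical. The `1 - s` conversion is one simple choice, assuming a similarity scaled to [0, 1]; whether the result obeys the triangle inequality depends on the similarity function.

```python
import itertools
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def is_metric_on(points, d, tol=1e-9):
    """Check identity, symmetry and the triangle inequality on a finite sample."""
    for x in points:
        if abs(d(x, x)) > tol:                      # identity: d(x, x) = 0
            return False
    for x, y in itertools.combinations(points, 2):
        if abs(d(x, y) - d(y, x)) > tol:            # symmetry
            return False
    for x, y, z in itertools.permutations(points, 3):
        if d(x, z) > d(x, y) + d(y, z) + tol:       # triangle inequality
            return False
    return True

def similarity_to_distance(s):
    """Assumed conversion: similarity in [0, 1] (1 = identical) to a dissimilarity."""
    return 1.0 - s

def squared_euclidean(a, b):
    """Squared Euclidean distance: symmetric, but not a metric."""
    return euclidean(a, b) ** 2

points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (-2.0, 5.0)]
euclid_ok = is_metric_on(points, euclidean)
squared_ok = is_metric_on(points, squared_euclidean)
print("Euclidean passes:", euclid_ok)
print("Squared Euclidean passes:", squared_ok)
```

Squared Euclidean distance fails the triangle-inequality check on this sample, which is a reminder that a function can look like a distance without being a metric.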

Agglomerative clustering. Group clusters by mutual distance. “Bottom-up” method: start w/ points and combine them into groups, combine groups, etc.

The agglom clust alg.
function agglom_cluster
  Input: data matrix X
  Output: cluster tree T
  Initialization: C = { {x_i} : x_i ∈ X }   (each point starts as its own cluster)
  Repeat {
    [c1, c2] = nearest_neighbors(C)
    cnew = merge(c1, c2)
    cnew.children = [c1, c2]
    C.remove(c1, c2)
    C.add(cnew)
  } until (C.size == 1)
  T = first(C)
  return T
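A minimal runnable sketch of the pseudocode above, in Python. The pseudocode leaves the cluster-to-cluster distance unspecified, so this sketch assumes single linkage (smallest point-to-point Euclidean distance). The `Cluster` class and the example data are hypothetical, introduced only for illustration.

```python
import itertools
import math

class Cluster:
    """Tree node: a leaf holds one point; an inner node has two children."""
    def __init__(self, points, children=()):
        self.points = list(points)
        self.children = list(children)

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def single_link(c1, c2):
    """Assumed cluster distance: smallest distance between any two members."""
    return min(euclidean(p, q) for p in c1.points for q in c2.points)

def nearest_neighbors(clusters, cluster_dist):
    """Return the pair of clusters with the smallest mutual distance."""
    return min(itertools.combinations(clusters, 2),
               key=lambda pair: cluster_dist(*pair))

def agglom_cluster(X, cluster_dist=single_link):
    """Bottom-up clustering of the points in X; returns the root of the tree."""
    if not X:
        raise ValueError("need at least one data point")
    C = [Cluster([x]) for x in X]            # every point starts as a cluster
    while len(C) > 1:
        c1, c2 = nearest_neighbors(C, cluster_dist)
        cnew = Cluster(c1.points + c2.points, children=[c1, c2])
        C.remove(c1)
        C.remove(c2)
        C.append(cnew)
    return C[0]

X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
T = agglom_cluster(X)
print([len(child.points) for child in T.children])
```

This version rescans all pairs on every merge, so it is only suitable for small inputs; practical implementations cache the pairwise distances.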

Dist between clusters? Problem: we have distances between pairs of points, but agglomerative clustering requires distances between pairs of clusters (e.g., c1 and c2). A number of measures are possible.
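The transcript does not preserve which measures the slide illustrated. As a sketch, here are four commonly used cluster-to-cluster distances (single, complete, average, and centroid linkage), each built from a point-to-point Euclidean distance. The function names and the example clusters are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def single_link(c1, c2):
    """Distance between the closest pair of points, one from each cluster."""
    return min(euclidean(p, q) for p in c1 for q in c2)

def complete_link(c1, c2):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(euclidean(p, q) for p in c1 for q in c2)

def average_link(c1, c2):
    """Mean distance over all cross-cluster pairs of points."""
    total = sum(euclidean(p, q) for p in c1 for q in c2)
    return total / (len(c1) * len(c2))

def centroid_link(c1, c2):
    """Distance between the coordinate-wise means of the two clusters."""
    m1 = tuple(sum(col) / len(c1) for col in zip(*c1))
    m2 = tuple(sum(col) / len(c2) for col in zip(*c2))
    return euclidean(m1, m2)

c1 = [(0.0, 0.0), (0.0, 2.0)]
c2 = [(3.0, 0.0), (3.0, 2.0)]
for name, link in [("single", single_link), ("complete", complete_link),
                   ("average", average_link), ("centroid", centroid_link)]:
    print(name, link(c1, c2))
```

The choice matters: single linkage tends to produce long, chained clusters, while complete linkage favors compact ones.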

Dendrograms. Agglomerative clustering produces a binary tree of pairs of clusters, often called a dendrogram (Gr. “tree-writing”). Originally used for taxonomies (biology).
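To make the binary-tree structure concrete, here is a small sketch that prints a merge tree as indented text. The nested-tuple representation (a leaf is a label string, an inner node is a pair of subtrees) and the labels `g1`..`g5` are hypothetical. A real dendrogram plot would also show the merge distances, which this sketch omits.

```python
def render(tree, indent=0):
    """Return the lines of an indented text drawing of a binary cluster tree."""
    pad = "  " * indent
    if isinstance(tree, str):          # leaf: a single labelled point
        return [pad + tree]
    left, right = tree                 # inner node: a merge of two subtrees
    return [pad + "+"] + render(left, indent + 1) + render(right, indent + 1)

def leaves(tree):
    """Return the leaf labels in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)

# Hypothetical merge tree over five items.
tree = (("g1", "g2"), ("g3", ("g4", "g5")))
lines = render(tree)
print("\n".join(lines))
```

Each `+` marks one merge, so a tree over n leaves always has n - 1 of them.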

Microarray dendrogram
