Applications of knowledge discovery to molecular biology: Identifying structural regularities in proteins Shaobing Su Supervisor: Dr. Lawrence B. Holder.

Applications of knowledge discovery to molecular biology: Identifying structural regularities in proteins Shaobing Su Supervisor: Dr. Lawrence B. Holder Committee: Dr. Diane J. Cook Dr. Edward Bellion

Outline b Motivation and goal of the research b SUBDUE knowledge discovery system b Proteins and PDB b Methods and results b Discussion and conclusion b Future research

Motivation and Goal b An explosive amount of molecular biology information needs to be analyzed to help understand the underlying structure-function relationships in proteins and other macromolecules. b Apply SUBDUE to the Brookhaven Protein Data Bank (PDB) to identify biologically meaningful patterns

SUBDUE knowledge discovery system b SUBDUE discovers patterns (substructures) in structural data sets b SUBDUE represents data as a labeled graph b Inputs: vertices and edges b Outputs: discovered patterns and instances

Example [Figure: labeled graph with vertices "object", "triangle", and "square" and edges "shape" and "on"; 4 instances of the discovered substructure] b Vertices: objects or attributes b Edges: relationships

SUBDUE’s search algorithm b Minimum Description Length (MDL) principle: The best theory to describe a set of data is the one that minimizes the DL of the entire data set b DL of the graph: the number of bits necessary to completely describe the graph b Search for the substructure that results in the maximum compression
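
The criterion on this slide is often written compactly as below. The symbols $G$ (the input graph), $S$ (a candidate substructure), and $G \mid S$ (the graph after each instance of $S$ is replaced by a single vertex) are notation assumed here; they do not appear on the slides.

```latex
\[
S^{*} \;=\; \arg\min_{S}\,\bigl[\, DL(S) + DL(G \mid S) \,\bigr]
\]
```

Under this reading, a substructure compresses the data whenever $DL(S) + DL(G \mid S) < DL(G)$.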

Inexact graph match approach Find instances with a slight distortion: insertion, deletion, and substitution of edges/vertices. Threshold parameter: specify amount of distortion allowed.
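
A minimal sketch of the idea, restricted to linear chains of labeled vertices (the shape of the secondary structure graphs used later in the talk). The unit costs, the function names, and the reading of the threshold as a fraction of the larger chain's size are assumptions made for illustration; SUBDUE's own matcher works on general graphs and is not reproduced here.

```python
from typing import Sequence


def chain_distortion(a: Sequence[str], b: Sequence[str]) -> int:
    """Fewest vertex insertions, deletions, and substitutions that
    turn chain `a` into chain `b` (unit cost each; an assumption)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete x
                curr[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),   # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def matches(pattern: Sequence[str], candidate: Sequence[str],
            threshold: float) -> bool:
    """Accept the candidate if its distortion stays within
    threshold * (size of the larger chain). Hypothetical rule."""
    limit = threshold * max(len(pattern), len(candidate))
    return chain_distortion(pattern, candidate) <= limit


pattern = ["h_1_10", "h_1_10", "s_0_7"]
candidate = ["h_1_10", "h_1_12", "s_0_7"]   # one substituted vertex
exact_only = matches(pattern, candidate, 0.0)
with_slack = matches(pattern, candidate, 0.4)
```

With a threshold of 0.0 only exact copies are accepted; raising it admits instances that differ by a few edits.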

Overview of proteins b among the most important biomolecules b built from 20 different amino acids b structural hierarchy b very diverse structures and functions

Structural hierarchy in proteins b Primary structure (sequence of the protein) b Secondary structure (helix, sheet, random coil) b Tertiary structure (3-D)

Primary Structure of proteins b Average 100-150 residues (a.a.) linked head to tail b N-terminus and C-terminus b Peptide bond, alpha-carbon [Figure: dipeptide backbone +H3N-Cα1(R1)-C(=O)-N(H)-Cα2(R2)-C(=O)-O−, with the peptide bond joining the first a.a. (N-terminus) to the second a.a. (C-terminus)]

Secondary structure elements b Ordered backbone arrangement: helix and sheet b Helix (0 % to 90 %; average 11 a.a; several types) b Sheet (2 to 15 strands per sheet; parallel and anti-parallel; average 6 a.a. per strand)

Tertiary Structure of protein b Highly complicated 3-D arrangement b Folding of its secondary structure elements

Brookhaven Protein Data Bank (PDB) b Brookhaven National Laboratory b Over 6000 experimentally determined 3-D structures of biomolecules b Majority: protein structures

Contents of PDB b SEQRES: sequence of a.a. (three-letter code) b HELIX: starting, ending, and type b SHEET: starts, ends, sense b ATOM: (x, y, z) coordinates for each atom in the protein

Applications of SUBDUE to PDB - Methods and Results b July 1997 PDB TM release (6000 PDB) b Global data set (4000 PDB) b Category data sets: hemoglobin, myoglobin, ribonuclease A

Flowchart of Research [Flowchart with stages "Preprocessing" and "Application"; boxes: Brookhaven PDB, Graphic representation, Inputs to SUBDUE, Patterns in Category, Patterns in Global, others, Instance mapping]

Preprocessing b compile PDB list for each category b model.c: extract first model b seq.c: extract sequence info and convert to graphic format b secondary.c: extract secondary structure info and convert to graphic format b coor.c: extract 3D coordinates and convert to graphic format
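
The slides name model.c but do not show it. The following is a minimal Python sketch of the same step, assuming that multi-model entries close each model with an ENDMDL record, as the PDB format does for NMR ensembles. The function name `first_model` is hypothetical, and records that follow the first model are simply dropped here.

```python
from typing import Iterable, List


def first_model(lines: Iterable[str]) -> List[str]:
    """Keep every record up to and including the first ENDMDL.

    Entries with no ENDMDL record (single-model entries) are
    returned unchanged. Assumption: nothing after the first model
    is needed by the later preprocessing steps.
    """
    kept = []
    for line in lines:
        kept.append(line)
        if line.startswith("ENDMDL"):
            break
    return kept


# Toy records, not real PDB data.
entry = ["HEADER demo", "MODEL 1", "ATOM a", "ENDMDL",
         "MODEL 2", "ATOM b", "ENDMDL"]
only_first = first_model(entry)
```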

Primary structure and its representation
b Sample PDB lines:
  SEQRES 1 150 ALA ASN LYS THR 1ASH 139
  SEQRES 2 150 LYS SER LEU GLU 1ASH 140
b Sequence (N-terminus to C-terminus): ALA ASN LYS THR LYS SER LEU GLU
b SUBDUE graphic input (ALA ASN):
  v 1 ALA   - - - ALA residue
  v 2 ASN   - - - ASN residue
  e 1 2 bond   - - - a peptide bond between ALA and ASN
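
The conversion on this slide can be sketched in a few lines of Python. This is not the author's seq.c; it assumes the simplified, whitespace-separated SEQRES layout shown on the slide (real PDB files use fixed columns), and the function names `parse_seqres` and `to_subdue` are hypothetical.

```python
from typing import Iterable, List


def parse_seqres(lines: Iterable[str]) -> List[str]:
    """Collect three-letter residue codes from SEQRES records.

    Tokens that are not three uppercase letters (serial number,
    residue count, entry ID, line number) are skipped.
    """
    residues = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SEQRES":
            continue
        residues.extend(
            tok for tok in fields[1:]
            if len(tok) == 3 and tok.isalpha() and tok.isupper()
        )
    return residues


def to_subdue(residues: List[str]) -> List[str]:
    """One vertex per residue, one 'bond' edge per peptide bond."""
    out = [f"v {i} {name}" for i, name in enumerate(residues, start=1)]
    out += [f"e {i} {i + 1} bond" for i in range(1, len(residues))]
    return out


sample = [
    "SEQRES 1 150 ALA ASN LYS THR 1ASH 139",
    "SEQRES 2 150 LYS SER LEU GLU 1ASH 140",
]
graph_lines = to_subdue(parse_seqres(sample))
```

For the two sample records this yields eight vertices and seven bond edges, starting with `v 1 ALA` and `v 2 ASN`.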

Secondary structure and its representation -HELIX
b Sample PDB lines (starting, ending, type):
  HELIX 1 ASN 1 HIS 13 1
  HELIX 2 ASN 20 ASN 36 1
b vertex: h_type_length
b Helix length: Hlength = SeqNum(last a.a.) - SeqNum(first a.a.)
b SUBDUE graphic input:
  v 1 h_1_12   - - - helix 1, type 1, length 12
  v 2 h_1_16   - - - helix 2, type 1, length 16

Secondary structure and its representation - SHEET
b Sample PDB lines (sense, length):
  SHEET 1 TYR 284 ILE 286 0
  SHEET 2 HIS 292 THR 294 -1
b vertex: s_sense_length
b SUBDUE graphic input:
  v 1 s_0_2   - - - strand 1, sense 0, length 2
  v 2 s_-1_2   - - - strand 2, sense -1, length 2
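
The two vertex-label rules (h_type_length for helices, s_sense_length for strands) can be sketched together. This assumes the simplified seven-field records shown on the slides rather than the fixed-column PDB layout, and it assumes strand length follows the same last-minus-first rule as helix length, which is what the sample values suggest. The name `element_label` is hypothetical.

```python
def element_label(record: str) -> str:
    """Build a SUBDUE vertex label from a simplified HELIX or SHEET record.

    Expected fields: kind, serial, first residue name, first sequence
    number, last residue name, last sequence number, type (or sense).
    """
    fields = record.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {record!r}")
    kind, _serial, _first_name, first, _last_name, last, tag = fields
    length = int(last) - int(first)   # SeqNum(last) - SeqNum(first)
    if kind == "HELIX":
        return f"h_{tag}_{length}"
    if kind == "SHEET":
        return f"s_{tag}_{length}"
    raise ValueError(f"unsupported record type: {kind}")


labels = [element_label(r) for r in (
    "HELIX 1 ASN 1 HIS 13 1",
    "HELIX 2 ASN 20 ASN 36 1",
    "SHEET 1 TYR 284 ILE 286 0",
    "SHEET 2 HIS 292 THR 294 -1",
)]
```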

Overall secondary structure representation
b PDB lines:
  HELIX 1 THR 3 MET 13 1
  HELIX 2 ASN 24 ASN 34 1
  HELIX 3 SER 50 GLN 60 1
  SHEET 1 LYS 41 HIS 48 0
  SHEET 2 MET 79 THR 87 -1
b SUBDUE graphic input:
  v 1 h_1_10
  v 2 h_1_10
  e 1 2 sh
  v 3 s_0_7
  e 2 3 sh
  v 4 h_1_10
  e 3 4 sh
  v 5 s_-1_8
  e 4 5 sh
b sequential relationship is represented as edge “sh”
b Visualization: [Figure: the five elements drawn in order from N-terminus to C-terminus]
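
In the slide's example the vertices follow the order of the elements along the sequence, not the order of the records in the file (SHEET 1, starting at residue 41, becomes vertex 3, ahead of HELIX 3 at residue 50). A minimal sketch of that ordering step, assuming each element is given as a (starting sequence number, label) pair; the function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def secondary_structure_graph(
        elements: Iterable[Tuple[int, str]]) -> List[str]:
    """Order elements by starting residue and chain them with 'sh' edges."""
    ordered = sorted(elements, key=lambda e: e[0])
    lines = []
    for i, (_start, label) in enumerate(ordered, start=1):
        lines.append(f"v {i} {label}")
        if i > 1:
            lines.append(f"e {i - 1} {i} sh")
    return lines


# Elements in file order: three helices, then two strands.
elements = [(3, "h_1_10"), (24, "h_1_10"), (50, "h_1_10"),
            (41, "s_0_7"), (79, "s_-1_8")]
graph_lines = secondary_structure_graph(elements)
```

Sorting by starting residue is what places the strand `s_0_7` between the second and third helices.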

Tertiary structure and its representation
b Sample PDB lines (record, atom, residue, residue number, X, Y, Z):
  ATOM CA ALA 1 10.369 0.997 10.519
  ATOM CA ASN 2 6.691 0.239 9.830
b vertex: backbone carbon; edge: distance (vs, s)
b Distance (Å): $\text{distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$
b SUBDUE graphic input:
  v 1 CA_ALA
  v 2 CA_ASN
  e 1 2 vs   - - - very short distance

Rationale for representation choice -Criteria b Patterns identified by SUBDUE must be representative of each category b Patterns discovered by SUBDUE should discriminate one category from the others

Primary sequence
b vertex - a.a. residue name
b edge - peptide bond
b Example (ARG GLU ALA):
  v 1 ARG
  v 2 GLU
  v 3 ALA
  e 1 2 bond
  e 2 3 bond

Secondary structure elements b Type of the helix b starting and ending points (a.a. name and seq number) [Figure: Helix 1 drawn from N-terminus to C-terminus, annotated with type 1, length 12, starts ASN … ends HIS]

Other ways of representing helix b Separate type and length b Combine type and length [Figure: a "Helix" vertex with separate type (1) and length (12) attributes, next to a single combined vertex Helix_1_12]

Tertiary structure b (x, y, z) coordinates vary with the choice of origin b avoid numeric values; use distance labels vs (dist ≤ 4 Å) and s (4 Å < dist ≤ 6 Å) [Figure: atoms C1 and C2 plotted on x, y, z axes with their coordinates, joined by an edge labeled vs]
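
A minimal sketch of the distance labeling, using the two alpha-carbon coordinates from the sample ATOM lines earlier in the talk and the vs/s cut-offs from this slide. Leaving pairs farther apart than 6 Å without an edge is an assumption (the slides do not say), and the function names are hypothetical.

```python
import math
from typing import Optional, Tuple

Point = Tuple[float, float, float]


def ca_distance(p: Point, q: Point) -> float:
    """Euclidean distance in angstroms between two atoms."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))


def distance_label(d: float) -> Optional[str]:
    """Map a distance to the symbolic edge labels used on the slide."""
    if d <= 4.0:
        return "vs"   # very short
    if d <= 6.0:
        return "s"    # short
    return None       # assumption: no edge beyond 6 angstroms


ala = (10.369, 0.997, 10.519)
asn = (6.691, 0.239, 9.830)
d = ca_distance(ala, asn)          # about 3.82 angstroms
label = distance_label(d)

# Shifting both atoms by the same offset leaves the distance unchanged,
# which is why distances sidestep the choice-of-origin problem.
shift = (5.0, -2.0, 7.0)
moved = [tuple(c + s for c, s in zip(p, shift)) for p in (ala, asn)]
d_moved = ca_distance(moved[0], moved[1])
```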

Results: Primary structure patterns
b Hemo_seq (63/65), Hemo_sequence: THR LYS THR TYR PHE PRO HIS PHE ASP LEU SER HIS GLY SER ALA GLN VAL LYS GLY HIS GLY LYS LYS VAL ALA ASP ALA LEU THR ASN ALA VAL ALA HIS VAL ASP ASP MET PRO ASN ALA LEU SER ALA LEU SER THR LEU ALA ALA HIS LEU PRO ALA GLU PHE THR PRO ALA VAL HIS ALA SER LEU ASP LYS PHE LEU ALA SER VAL SER THR VAL LEU THR SER LYS TYR
b Myo_seq (67/103), Myoglo_sequence: VAL LEU SER GLU GLY GLU TRP GLN LEU VAL LEU HIS VAL TRP ALA LYS VAL GLU ALA ASP VAL ALA GLY HIS GLY GLN ASP ILE LEU ILE ARG LEU PHE LYS SER HIS PRO GLU THR LEU GLU LYS PHE ASP ARG
b Ribo_A (59/68), Ribonuclease_A_sequence: GLY GLN THR ASN CYS TYR GLN SER TYR SER THR MET SER ILE THR ASP CYS ARG GLU THR GLY SER SER LYS TYR PRO ASN CYS ALA TYR LYS THR THR GLN ALA ASN LYS HIS ILE ILE VAL ALA CYS GLU GLY ASN PRO TYR VAL PRO VAL HIS PHE ASP ALA SER VAL

Primary structure patterns b Unique to each sample category b hemoglobin and myoglobin proteins share little sequence similarity

Results: Hemo secondary structure patterns
1 : h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
7 : h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20

Results: Myo secondary structure patterns 1 : h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25

Results: Ribo_A secondary structure patterns
1 : h_1_10 -> h_1_10 -> s_0_7 -> s_0_7 -> h_1_10 -> s_0_3 -> s_0_3 -> s_-1_4 -> s_-1_4 -> s_-1_8 -> s_-1_1 -> s_-1_10 -> s_-1_10 -> s_-1_8 -> s_-1_8 -> s_-1_5 -> s_-1_3
10 : h_1_10 -> h_1_10 -> s_0_7 -> h_1_10 -> s_0_3 -> s_-1_4 -> s_-1_8 -> s_-1_8 -> s_-1_6

Results: Tertiary structural patterns b SUBDUE finds small patterns (2 or 3 a.a.) b not unique for each category of proteins b not biologically meaningful

Visualization of secondary structure patterns -hemoglobin [Figure: complete hemoglobin structure, N-terminus to C-terminus, showing 2 instances of the pattern]

Visualization of secondary structure patterns -myoglobin [Figure: complete myoglobin structure, N-terminus to C-terminus, showing 1 instance of the pattern]

Visualization of secondary structure patterns -ribonuclease_A [Figure: complete ribonuclease_A structure, N-terminus to C-terminus, showing 1 instance of the pattern]

Discussion -Hemoglobin b Hemoglobin: A, B, C, D chains b Two types of patterns identified by SUBDUE: one for the A, C chains, the other for the B, D chains b Patterns exist in a majority of hemoglobin proteins b No instances of the best hemoglobin pattern found in other proteins in the global data set

Occurrence of hemo patterns

Occurrence of hemo patterns - continued

Discussion -Myoglobin b Myoglobin: one chain b One dominant pattern identified by SUBDUE b Patterns exist in most myoglobin proteins b No instances of the best myoglobin pattern found in other proteins in the global data set

Discussion: -Hemoglobin and Myoglobin
b Similar secondary structure patterns
  Hemoglobin B, D chains (from N- to C-terminus): h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
  Myoglobin chain (from N- to C-terminus): h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25
  Hemoglobin A, C chains (from N- to C-terminus): h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
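
The similarity among the three patterns can be made explicit by comparing them position by position. This is a small illustrative sketch, not part of SUBDUE; the helper name `differing_positions` is hypothetical, and it only handles patterns of equal length.

```python
from typing import List, Tuple


def differing_positions(a: List[str], b: List[str]) -> List[Tuple[int, str, str]]:
    """Return (position, label in a, label in b) wherever the chains differ."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same number of elements")
    return [(i, x, y)
            for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


def parse(pattern: str) -> List[str]:
    """Split an 'x -> y -> z' pattern string into its vertex labels."""
    return pattern.split(" -> ")


hemo_bd = parse("h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20")
myo = parse("h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25")
hemo_ac = parse("h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20")

bd_vs_myo = differing_positions(hemo_bd, myo)
ac_vs_myo = differing_positions(hemo_ac, myo)
bd_vs_ac = differing_positions(hemo_bd, hemo_ac)
```

Each hemoglobin pattern differs from the myoglobin pattern at three of eight positions, including the sixth and the last helix, and the two hemoglobin patterns differ from each other only at positions 1 and 4.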

Discussion: -Hemoglobin and Myoglobin b Consistent with the genetic studies b Hemoglobin and myoglobin share one ancestral gene b Divergence occurred in the course of evolution: one copy of the gene for myoglobin, four copies for hemoglobin b The last helix of hemoglobin is shorter; one of the helices in the hemoglobin A, C chains almost disappears, allowing conformational change

Discussion: -ribonuclease A proteins b All patterns have three helices of the same size b Several strands appear twice, indicating participation in the formation of two sheets b Ribonuclease S protein (S-protein fragment) also has the pattern

Conclusion of the results b Secondary structure patterns discovered by SUBDUE are representative of each category b Secondary structure patterns discovered by SUBDUE are distinct for each category b SUBDUE has the ability to discover biologically interesting patterns in the PDB and similar molecular biology (MB) databases

Comparison with other related studies b Different graphic representations b Predefined patterns with exact or inexact graph match b Not applied systematically to PDB or other DBs b SUBDUE could perform a similar task if the inexact graph match routine were incorporated

Conclusions of the study b Abstraction over 3D structure to its secondary structural elements is suitable for discovery b The secondary structure patterns SUBDUE discovered for each category can be used as a signature for that class b Inexact graph match is useful for finding similar patterns b SUBDUE is suitable for knowledge discovery in MB structural DBs

Future Research b More consistent and detailed description of secondary structure b Add relative positions of the secondary structural elements to represent spatial relationship b Investigate alternative representation: more suitable 3D coordinates representation; weighting on different edges b Inexact graph match in predefined substructure b More collaboration with domain scientists
