This page reproduces the slides from http://www.slideshare.net/Med_KU/20131019-journal-club, uploaded by Med_KU on 2013/10/20 (category: Technology).

Proteins. 2013 Nov;81(11):1885-99. doi: 10.1002/prot.24330. Epub 2013 Aug 16.

DNABind: A hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches.

Liu R, Hu J.

20131019

Biophysics Young Researchers, Kansai Branch, Journal Club - Topics

Prediction of protein-DNA binding residues

Statistics of network

Machine learning - Result: DNABind, a hybrid method of machine learning and template-based approaches, showed excellent performance in predicting DNA-binding residues.

Machine learning

Template

DNABind

CprK (3E6C:C)

EcoRV(1RVE:A)

True positive residues.

DNABind improves classification.

Query protein, Template protein, TP, FP, FN - Aim

Protein-DNA interactions are important for cell biology.

Their experimental determination is time-consuming and costly.

Computational approaches are desirable. - Computational approaches

Protein Data Bank (PDB)

Characteristics of binding residues

Solvent-exposed

Higher electrostatic potential

More conserved

Hotspots as clusters of conserved residues

Structural properties (DNA-binding residue vs surface)

Packing density

Surface curvature

B-factor

Residue fluctuation

Hydrogen bond donor

http://www.rcsb.org/pdb/home/home.do - Computational algorithms

Feature-based

Extract effective features

Template-based

Align template and retrieve the best match

Template!! - Features used in machine learning

Structure-based

PSSM (position specific scoring matrix)

Evolutionary conservation

Solvent accessibility

Local geometry (depth and protrusion index)

Topological features

degree, closeness, betweenness, clustering coefficient

Relative position (distance to centroid)

Statistical potential (Boltzmann distribution)

Sequence-based (harder than structure-based)

Amino acid identity

Residue physicochemical properties

polarity, secondary structure, molecular volume, codon diversity, electrostatic charge

Predicted structure (no 3D structure needed!!) - Features used in machine learning

Structure-based

PSSM

Relative solvent accessibility

Depth and protrusion index

Topological features

Distance to centroid

Statistical potentials

Sequence-based

PSSM

Predicted structures

Amino acid indices

Statistical potentials

Construct a machine learning model (SVM) - Template-based approach

Used in image recognition, etc…

Recognition of faces in the camera.

Match!!

Template!! - Template-based prediction

Template-based

Structural alignment and statistical potential

The binding residue prediction is conducted only if the target protein is classified as a DNA-binding protein.

312 templates were selected. - Network

Degree is a commonly used measure of the local connectivity of a node.

Closeness is a global centrality metric used to determine how critical a residue is in a residue interaction network.

Betweenness of residue i is defined as the sum of the fractions of shortest paths between all pairs of residues that pass through residue i.

Clustering coefficient (transitivity) quantifies how close a node's neighbors are to being a clique, i.e. the probability that the adjacent vertices of a vertex are connected.

Motifs, hubs, and communities are also important… - Network sample; human protein interactome
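Before the interactome example: two of the measures defined above, degree and clustering coefficient, can be sketched in pure Python on a hypothetical toy contact graph (closeness and betweenness additionally need shortest paths, which a library such as networkx provides):

```python
# Toy residue contact graph (hypothetical data): nodes are residues,
# edges are spatial contacts.
from itertools import combinations

edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def degree(node):
    """Local connectivity: number of direct contacts."""
    return len(adj[node])

def clustering(node):
    """Fraction of neighbor pairs that are themselves in contact."""
    nbrs = adj[node]
    if len(nbrs) < 2:
        return 0.0
    pairs = len(nbrs) * (len(nbrs) - 1) // 2
    linked = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
    return linked / pairs
```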

Scale-free

Small-world

Cluster

Power law (Pareto distribution)

Bioinformatics. 2012 Jan 1;28(1):84-90. - Machine learning

Example; spam

4601 samples, 57 parameters.

Classification; spam or nonspam - Machine learning

Support vector machine (SVM)

Decision tree

RandomForest

Logistic regression

LASSO (Elastic net and Ridge)

Neural networks (Deep learning)

Evolutionary algorithm

Gaussian processes

k nearest neighbor

Clustering

Bayesian networks

Association rule learning

Inductive logic programming (ILP) - Support vector machine (SVM)
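A minimal SVM sketch with scikit-learn's SVC (assumed available; the data is a hypothetical XOR-style layout that no single hyperplane can separate, so the RBF kernel does the non-linear-to-linear mapping):

```python
# XOR-style toy data (hypothetical): not linearly separable.
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [0.1, 0.1], [0.9, 0.9],   # class 0
     [0, 1], [1, 0], [0.1, 0.9], [0.9, 0.1]]   # class 1
y = [0, 0, 0, 0, 1, 1, 1, 1]

# The RBF kernel implicitly maps the points to a space where a
# separating hyperplane exists; C and gamma are the tuning knobs.
clf = SVC(kernel="rbf", C=10.0, gamma=2.0)
clf.fit(X, y)
print(clf.predict([[0.05, 0.05], [0.05, 0.95]]))
```

As the slide warns, the fit is easy to run but C and gamma usually need a grid search.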

Makes a hyperplane to divide the groups.

Kernel method; maps non-linear problems to linear ones.

Easy to use.

Requires much computation time.

Tuning is very difficult. - Decision tree

Builds a tree of decision rules.

Easy to understand graphically.

Performance is not so good. - RandomForest
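"Make many decision trees" can be sketched with scikit-learn's RandomForestClassifier (hypothetical toy data):

```python
# A random forest fits many randomized decision trees and averages
# their votes; hypothetical 1-D data with two well-separated groups.
from sklearn.ensemble import RandomForestClassifier

X = [[0], [1], [2], [3], [10], [11], [12], [13]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(len(forest.estimators_))  # the individual fitted trees
```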

Make many decision trees.

More precise.

A little time-consuming. - Logistic regression
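A minimal logistic regression sketch with scikit-learn (hypothetical 1-D toy data); the model maps a linear score through the sigmoid to a class probability, and the regularization strength C is the main tuning knob:

```python
# Hypothetical 1-D data: class 0 clusters near 0, class 1 near 5.
from sklearn.linear_model import LogisticRegression

X = [[0.0], [0.5], [1.0], [4.0], [4.5], [5.0]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression(C=1.0).fit(X, y)
print(model.predict_proba([[0.2]])[0])  # [P(class 0), P(class 1)]
```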

Many medical researchers use…

Easy to use but tuning is very difficult.

(to tell the truth…) - LASSO, Elastic net, and Ridge regression
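The three penalized regressions differ only in the penalty: Ridge uses L2, LASSO uses L1 (which drives coefficients to exactly zero, i.e. feature selection), and Elastic Net mixes both. A sketch with scikit-learn on hypothetical data where only the first of three features matters:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Hypothetical data: y depends only on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
print(lasso.coef_)  # irrelevant coefficients are exactly zero
```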

Least Absolute Shrinkage and Selection Operator

LASSO

Elastic Net

Ridge - Neural networks
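A tiny multilayer perceptron sketch with scikit-learn (hypothetical data): one hidden layer of 8 units learns the XOR pattern that a purely linear model cannot represent.

```python
# "Hidden multi-layer": input -> 8 hidden units -> output.
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 10   # XOR data, repeated
y = [0, 1, 1, 0] * 10

net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=5000, random_state=1)
net.fit(X, y)
print(len(net.coefs_))  # weight matrices: input->hidden, hidden->output
```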

Modeled on the mammalian brain (perceptron).

Hidden multi-layer.

Deep learning is a hot topic!!

(hard to understand…)

http://opencv.jp/opencv-1.0.0/document/opencvref_ml_nn.html - n-fold cross validation
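The procedure can be sketched with scikit-learn (hypothetical toy data): split the data into n folds, hold each fold out once as the test set, and train on the rest.

```python
# 5-fold cross validation: each fold is the test set exactly once.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X = [[i] for i in range(20)]
y = [0] * 10 + [1] * 10

scores = cross_val_score(LogisticRegression(), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(len(scores))  # one accuracy value per fold
```

Leave-one-out CV is the limiting case with n_splits equal to the number of samples (sklearn.model_selection.LeaveOneOut).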

To evaluate how the results of a statistical analysis will generalize to an independent data set.

Train data

Test

Leave-one-out CV - Performance
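The performance metrics reported on this slide are computed from the confusion matrix; a sketch with hypothetical TP/FP/FN/TN counts:

```python
# Recall, precision, F-score, and MCC from a hypothetical
# confusion matrix.
import math

tp, fp, fn, tn = 90, 5, 10, 95

recall = tp / (tp + fn)
precision = tp / (tp + fp)
f_score = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(round(recall, 3), round(precision, 3), round(f_score, 3), round(mcc, 3))
```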

           SVM    Tree   RandomForest  LASSO  Elastic net  Ridge  Logistic  nnet
Recall     0.917  0.872  0.927         0.894  0.892        0.852  0.893     0.930
Precision  0.948  0.914  0.954         0.932  0.926        0.926  0.930     0.935
F          0.932  0.893  0.940         0.913  0.911        0.887  0.911     0.932
MCC        0.890  0.826  0.902         0.858  0.856        0.821  0.856     0.888

- Combine two approaches

The ML-based and the template-based predictions are combined by a threshold rule; the thresholds are determined by CV and ROC analysis. - Statistical features of structure

A: Binding residues are highly solvent

accessible.

B, C: Binding residues have low depth and

high protrusion.

D-G: Not much difference in the network features.

H: Binding residues are less distant from the centroid. - Performance
- Performance

Higher TM score is required for good prediction.

TM-score is a measure of similarity between two protein structures. A score < 0.2 indicates a random relationship; > 0.5 indicates high structural similarity.

Proteins. 2004 Dec 1;57(4):702-10.

Nucleic Acids Res. 2005 Apr 22;33(7):2302-9. - Performance

Comparison among ML, TL, and DNABind.

Comparison between DNABind and other software. - Result: DNABind, a hybrid method of machine learning and template-based approaches, showed excellent performance in predicting DNA-binding residues.

Machine learning

Template

DNABind

CprK (3E6C:C)

EcoRV(1RVE:A)

True positive residues.

DNABind improves classification.

Query protein, Template protein, TP, FP,

FN