Protein Domains and Remote Homology

in the twilight zone of protein sequence analysis.
Master's in Bioinformatics
UCM - 2013
Luis Sanchez Pulido
Department of Physiology, Anatomy & Genetics
Who Am I?
2005 - PhD Title: Domain-Oriented Computational Protein Sequence Analysis
The most important influences on my career are:
Since 2008
A Valencia
Between 1995 - 2008
MA Andrade-Navarro
Visitor at P. Bork and A. Tramontano Labs
Long-Term Fellowship
Chris Ponting
Department of Physiology, Anatomy & Genetics
The Areas of My Expertise.... ---> PROTEINS
Initially ---> Structural (Homology Modeling, Mutant interpretation, ...)
● What if?
● Insight II
● MolMol
● Rasmol
● PYMOL
Early on, I fell in love with Protein Sequence Analysis:
● BLAST
● HHpred
● HMMer
● Pfam & SMART
Why do we analyse sequences?
Proteins with known sequence
Structure
Function
Both
?????
“There is no darkness but ignorance”
William Shakespeare
Data Overload!!!
Database growth by year
www.ebi.ac.uk/ena/about/statistics
Protein Sequence Databases are becoming BIGGER and more complex every day...
● Protein Sequence Databases are becoming BIGGER every day...
Michael Y. Galperin and Eugene V. Koonin
From complete genome sequence to 'complete' understanding?
Trends Biotechnol. 28. 2010
● Protein Sequence Databases are becoming MORE COMPLEX every day...
Nature of the protein universe. Michael Levitt. PNAS 2009
● The analysis of the known and predicted context of each protein is becoming more difficult every day: every week a new high-throughput experiment is published (cell localization, interactions, function...)
Thanks to the recognition of homology between proteins, we can TRANSFER INFORMATION
Structural and/or Functional
Homologues: two proteins with a common ancestor.
Depending on the type of divergence event, they can be:
• orthologues – speciation
• paralogues – gene duplication
• xenologues – horizontal gene transfer
Admiring life's amazing diversity
[Figure labels: GenBank, dbEST. Copyright Cédric Notredame, 2000, all rights reserved]
Three Generations of Tools in Protein Sequence Analysis
Reference Database
Sequence
MRTSRGH.....
Alignment -> Profile
RTNMSDAQQ----GSWYSDPK---REGWFYN
RTNMSDAQQ----GSWYSDPK---REGWFYN
RTNMSDAQQ----GSWYSDPK---REGWFYN
RTNMSDAQQ----GSWYSDPK---REGWFYN
RTNMSDAQQ----GSWYSDPK---REGWFYN
RTNMSDAQQ----GSWYSDPK---REGWFYN
RTNMSDAQQ----GSWYSDPK---REGWFYN
● First Generation .... 1987: Sequence versus Sequence – BLAST
● Second Generation .... 1997: Profile versus Sequence – PSSMs - PSI-BLAST & HMMer
● Third Generation .... 2005: Profile versus Profile – HHpred
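The step from an alignment to a profile can be sketched as a position-specific scoring matrix (PSSM). This is a minimal illustration, not how PSI-BLAST or HMMer build their models: the uniform background, the flat pseudocount and the three-row toy alignment are assumptions made for the example, and the names `build_pssm` and `score_window` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log-odds score."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
    pssm = []
    for column in zip(*alignment):
        # Gap characters and unknown symbols are simply ignored in the counts.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score_window(pssm, segment):
    """Sum the column scores of an ungapped segment laid over the profile."""
    return sum(col[res] for col, res in zip(pssm, segment))

# Toy alignment (hypothetical variants of the slide's placeholder sequence).
alignment = [
    "RTNMSDAQQ",
    "RTNLSDAQQ",
    "KTNMSEAQQ",
]
pssm = build_pssm(alignment)
print(score_window(pssm, "RTNMSDAQQ") > score_window(pssm, "GGGGGGGGG"))
```

Each column gets its own scores, so a conserved position penalises a mismatch more than a variable one. A single sequence compared with a fixed substitution matrix cannot express that.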
Detection of homologous protein sequences
Reference Database
Three Generations of Tools in Protein Sequence Analysis
1- Sequence versus Sequence (BLAST)
2- Profile versus Sequence - PSSMs (PSI-BLAST & HMMer)
3- Profile versus Profile (HHpred)
Reciprocal!
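One reading of "Reciprocal!" is that a hit only counts when the search also works in the other direction: the query finds the family, and the family finds the query. That reading is an assumption, and so is everything in the sketch below: the function name `reciprocal_pairs`, the identifiers and the best-hit tables are hypothetical and do not come from any real search output.

```python
def reciprocal_pairs(hits_a_vs_b, hits_b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    pairs = []
    for a, b in hits_a_vs_b.items():
        if hits_b_vs_a.get(b) == a:
            pairs.append((a, b))
    return sorted(pairs)

# Hypothetical best-hit tables: query -> best hit in the other set.
forward = {"protA": "famX_1", "protB": "famX_2", "protC": "famY_1"}
backward = {"famX_1": "protA", "famX_2": "protA", "famY_1": "protC"}

print(reciprocal_pairs(forward, backward))
```

Here `protB` is dropped because its best hit, `famX_2`, points back to `protA` instead.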
Why do we analyse sequences?
Because, thanks to the recognition of homology between proteins, we can TRANSFER INFORMATION:
•Structural
from HOMOLOGOUS proteins of known structure (X-Ray, NMR or EM)
•Functional
from experimentally characterised HOMOLOGOUS proteins or their genomic or proteomic context
Structure is better conserved than sequence!
D'Alfonso G, Tramontano A, Lahm A. Structural conservation in single-domain proteins: implications for homology modeling. J Struct Biol. 134, 246-56. (2001)
A Remote Homology example: SBDS Family (1NYN, 1P9Q)
Defining Remote Homology
Rost B. (1999) Twilight zone of protein sequence alignments. Protein Eng. 12:85-94.
[Figure: comparisons between pairs of sequences with known structure. Identity (%, with 20% marked) plotted against Size (10 to 200). Regions labelled true positives and true negatives; legend: Rmsd > 3 Å, Rmsd < 3 Å. Twilight zone curves: Chothia & Lesk, 1986; Rost, 1999]
INFORMATION TRANSFER
•Structural
from HOMOLOGOUS proteins of known structure (X-Ray, NMR or EM)
•Functional
from experimentally characterised HOMOLOGOUS proteins or their genomic or proteomic context
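The twilight-zone idea, that the identity needed to trust an alignment depends on its length, can be sketched as follows. The cut-off formula is written from memory of Rost (1999) and should be checked against the paper before any real use. The function names are hypothetical, and identity is computed here only over columns where both sequences have a residue, which is one of several possible conventions.

```python
import math

def percent_identity(a, b):
    """Return (percent identity, aligned length) over columns without gaps."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0, 0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs), len(pairs)

def rost_threshold(length):
    """Length-dependent identity cut-off (assumption: recalled from Rost 1999)."""
    if length <= 11:
        return 100.0
    if length <= 450:
        return 480.0 * length ** (-0.32 * (1.0 + math.exp(-length / 1000.0)))
    return 19.5

def above_twilight_zone(a, b):
    """True if the aligned pair lies above the length-dependent cut-off."""
    identity, length = percent_identity(a, b)
    return identity > rost_threshold(length)

print(round(rost_threshold(100), 1), rost_threshold(1000))
```

With this curve a short alignment needs very high identity to be convincing, while a long one can be trusted at a much lower identity, approaching the 20% region marked on the plot.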
FUNCTION?
These are homologous proteins... Their roles in the cell are very different.
But... all of them bind GTP.
Key Points in Protein Function prediction:
* Few functional annotations are derived by experiments, and most functional annotations are automated.
* Remote homology, structural information, chromosomal location, phylogenetic information, expression and molecular interaction data... are all being used for function prediction.
* Different methods are better at predicting certain functional aspects.
* Combined approaches of different methods are currently emerging (my favourite --> STRING)
The Transfer of Structural and/or Functional Information between homologous proteins is a Complex Task
How is it done?
Divide each of the Tasks into as many parts as necessary to solve the problem
Domain Definition
From a structural point of view, domains are described as structurally compact units, locally independent in function and folding, and usually characterized by a well-defined hydrophobic core.
From a sequence analysis point of view, we describe domains as evolutionarily conserved regions that are present in different protein families of diverse architecture.
"Hypothetical Domain"
REPEATS – At the limits of Domain Definition
Individual repeats are not structurally independent.......
LRR
HEAT
TPR
PFTA
beta-l
WD40
Very low structural constraints allow high rates of sequence divergence between repeats, making their detection by sequence similarity VERY VERY difficult.
Protein irregularities that hinder sequence analysis
RB Russell & CP Ponting, 1998
● Low complexity regions
● Repeats, Transmembrane and Coiled-coil regions (high mutation rates)
● and Fold irregularities, such as: Circular Permutations and Insertions
[Figure: two diagrams with N term and C term labelled]
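A toy sketch of why circular permutations defeat a straightforward linear comparison: the same residues are present, but the N-terminal piece of one protein is the C-terminal piece of the other. Exact string matching is an assumption made to keep the example short, since real permuted domains have also diverged in sequence. The sequence and the function names are hypothetical.

```python
def is_circular_permutation(a, b):
    """True if b can be obtained by moving a prefix of a to its end."""
    return len(a) == len(b) and b in a + a

def permutation_offset(a, b):
    """Position in a where b starts, or None if b is not a rotation of a."""
    if not is_circular_permutation(a, b):
        return None
    return (a + a).index(b)

original = "MRTSRGHNMSDAQQ"              # hypothetical toy sequence
permuted = original[6:] + original[:6]   # N-terminal six residues moved to the C term

print(is_circular_permutation(original, permuted))
print(permutation_offset(original, permuted))
```

Doubling one sequence before comparing is the design choice that matters here: every rotation of `original` appears as a contiguous stretch of `original + original`, so a permuted relative can be found in one linear pass.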
Protein Evolution: the role of domains
Shuffling + Accretion
SPP1/SET1C
CGBP Family
CGBP_HUMAN
Q03012_Yeast
CxxC
PHD
PHD
dPHD
dPHD
Q9W352_Drome
PHD
CxxC
dPHD
DATF1_HUMAN (DIDO-1)
TFS2M
PHD
SPOC
s_zf
dPHD
PHF3_HUMAN
TFS2M
PHD
s_zf
Q9VG78_Fly
SPOC
TFS2M
PHD
SPOC
dPHD BRK
YKA5_YEAST
PHD
TFS2M
SPOC
RBMF_HUMAN
RRM
SPEN Family
RRM
RRM
SPOC
Q8IL17_Plasmodium
RRM
Q22855_Athaliana
RRM
RRM
RRM
SPOC
SPOC
This increases the functional versatility of proteins
Levitt M. Nature of the protein universe. Proc Natl Acad Sci U S A. 2009 Jul 7
Goals of Protein Sequence Analysis
• Identify domains at the sequence level, assessing their conservation and distribution across different protein families.
• Rationalise and interpret sequence similarity in terms of shared function, such as interactions with other molecules or proteins, shared reaction and/or regulatory mechanisms, etc.
• Ultimately, propose new hypotheses of common function among diverse families of homologous proteins, for subsequent experimental verification.
METHODS ON DOMAIN ORIENTED SEQUENCE ANALYSIS
Common Name
ID or ACC or GI
Reference Database
SRS - EBI
Entrez – NCBI
Search by comparing:
Sequence
MRTSRGH.....
Sequence versus Sequences – BLAST
Sequences versus Profiles – Pfam
Search by comparing:
Alignment -> Profile
RTNMSDAQQ----GSWYSDPK---REGWFYN
RTNMSDAQQ----GSWYSDPK---REGWFYN
RTNMSDAQQ----GSWYSDPK---REGWFYN
RTNMSDAQQ----GSWYSDPK---REGWFYN
RTNMSDAQQ----GSWYSDPK---REGWFYN
RTNMSDAQQ----GSWYSDPK---REGWFYN
RTNMSDAQQ----GSWYSDPK---REGWFYN
Profile versus Sequences – PSI-BLAST or HMMer
Profile versus Profiles – HHpred
http://en.wikipedia.org/wiki/BLAST
It is common to use a high value for G, the gap-opening penalty (around 10-15), and a smaller one for L, the gap-extension penalty (around 1-2)
Why?????
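One way to see why: with an affine scheme, a single long gap (one insertion or deletion event) costs far less than the same number of gapped positions scattered along the alignment. The sketch below assumes the convention cost = G + L x length and picks G = 11, L = 1 from within the ranges quoted above. Other programs define the first gapped residue differently, so treat the exact numbers as illustrative. The function names are hypothetical.

```python
def gap_cost(length, G=11, L=1):
    """Affine cost of a single gap: G for opening it plus L per gapped residue."""
    return G + L * length

def total_gap_cost(gap_lengths, G=11, L=1):
    """Total cost of an alignment's gaps, given the length of each gap."""
    return sum(gap_cost(n, G, L) for n in gap_lengths)

one_long_gap = total_gap_cost([6])        # one indel spanning six residues
six_short_gaps = total_gap_cost([1] * 6)  # six separate one-residue indels

print(one_long_gap, six_short_gaps)
```

Opening a gap is expensive and extending it is cheap, so the alignment prefers a few contiguous indels over many isolated ones.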
Two main Characteristics:
•Combining Multiple Alignment Methods
•Mixing Heterogeneous Information
Domain Oriented Sequence Analysis Flow-Chart
[Flow chart labels: Sequence; Sequence DataBases (GenBank, dbEST); Domain Databases; HMMer; HHpred; ALIGNMENT; DOMAIN; Hypothetical Domain; Biochemical Knowledge; Functional Hypothesis]
"As you will never be sure which are the right problems to work on, most of the time that you spend in the laboratory or at your desk will be wasted. If you want to be creative, then you will have to get used to spending most of your time not being creative, to being becalmed on the ocean of scientific knowledge."
Steven Weinberg
How Do the Pieces of the Functional Assignment Puzzle Fit Together?
David B. Searls (2003) Pharmacophylogenomics: genes, evolution and drug targets. Nature Reviews Drug Discovery. 2, 613-23
Epistemology is the branch of philosophy concerned with the nature and scope of knowledge. It questions what knowledge is, how it is acquired, and the possible extent to which a given subject can be known.
REAL-LIFE EXAMPLES