Working Paper Number 356
Gino Spinelli, David Mayer-Foulkes
A New Method to Study DNA Sequences:
The Languages of Evolution
División de Economía, CIDE
December 2005
www.cide.edu
The CIDE Working Paper series is a means of disseminating advances in
research and of allowing authors to receive comments before final
publication. Comments should be sent directly to the author(s).
• All rights reserved ® 2005. Centro de Investigación y Docencia Económicas,
carretera México-Toluca 3655 (km. 16.5), Lomas de Santa Fe,
01210, México, D.F.
Tel. 5727.9800 exts. 2202, 2203, 2417
Fax: 5727.9885 and 5292.1304.
E-mail: publicaciones@cide.edu
Production is the responsibility of the author(s); the content, style
and wording are therefore theirs.
Acknowledgments
This article was born during the 10th Annual Meeting of the Society
for Chaos Theory in Psychology and Life Sciences, Philadelphia,
Pennsylvania, July 20th-23rd, 2000, when Gino Spinelli asked David
Mayer if the Correlation Dimension Statistic could be applied to the
study of DNA sequences. The affirmative response began a
relationship in which the main part of the work was concluded
before March 24th, 2003, when Gino died from a heart attack.
Since then his companion Cristina Martín-Castellanos has helped to
complete the text.
Gino, with this text we greet you and remember you.
Cristina Martín-Castellanos
David Mayer-Foulkes
Abstract
In recent years several authors have reported finding deterministic
dynamics in the flux of genetic information. Such dynamics suggest that
evolution occurs with the emergence and maintenance of a fractal
landscape in DNA chains. In this work we examine the idea that the
repetition of motifs lies at the origin of these statistical properties of DNA.
To analyse such dynamics of repetition we apply a modification of the BDS
statistic, a method borrowed from economic statistics, and we adapt it to
DNA sequence analysis. We compare the statistical properties of naturally
occurring sequences along the evolutionary tree with simulated randomly
generated sequences and also simulated sequences with repetition motifs.
We provide a new method to analyse DNA information, which is able to
search for a structured signal in genetic information. To better understand
the graphic results, we also define a new statistic for a DNA sequence. On
the basis of a mathematical interpretation of repetition patterns, a specific
fingerprint of DNA sequences is proposed. With this new method we study
the statistical properties of exon and intron DNA sequences, finding specific
statistical differences. Moreover, by analysing DNA sequences of different
species from Bacteria to man, we estimate the evolution of these linguistic
DNA features along the evolutionary tree. The results are consistent with
the idea that the flux of DNA information is not random, but that it is
formed by patterns of repetitions along the evolutionary tree. The
implications for evolutionary theory will be discussed.
Introduction
An enormous amount of sequence data is accumulating as the result of
extensive sequencing projects of whole genomes from Bacteria, Archaea and
Eukaryotes, ranging from the simple Saccharomyces cerevisiae to Homo
sapiens. This provides a huge source of information along the evolutionary
tree.
Although DNA information is constituted by only four nucleotides in a
polymeric polynucleotide chain, the information embedded in the genome is of
high complexity. Many methods have been developed to analyse this
data, but it is clear that not all possible information concerning structure,
specific patterns and evolutionary trends has been extracted so far. In
particular, the interface between statistical mechanics and DNA structure
analysis has attracted great interest (1,2).
It is clear that such complexity reflects the complexity of the organism
itself, but many questions remain open concerning the different trends of
evolution of DNA information. In addition, such evolutionary trends have not
been explored extensively in the light of recent knowledge on emergent
non-random phenomena at the edge of chaos, where order and structures emerge.
In particular, language evolution in DNA information, a sort of self-emerging
order in the flux of genetic information, is a promising field for addressing
basic questions in the light of biological physics. A first and well known
specialisation of DNA information is the compartmentalisation of coding and
non-coding regions of DNA, exons and introns. Such specialisation is already
present in Prokaryotes, but only a small fraction of prokaryotic genomes, both
in Bacteria and Archaea, does not carry coding information for proteins.
Differential statistical properties between introns and exons have been reported
(3), but such reports are contradictory (4,5). However, at present it seems
clear that DNA information does have statistical properties such as linguistic
features (6,7), noise (8), and fractal landscape (9). This indicates that
the application of different mathematical methods may reveal different
statistical properties. Lately, long-range correlation in DNA sequences has
attracted great interest. This property, which has been studied extensively
(10), is not fully explained, and its ultimate meaning is unknown.
The concept of the gene has been revisited by Li, as a sequence of DNA that
performs a specific function (11). Despite this broad definition, the functional
constraints on genes coding for proteins are still important in determining
sequence evolution for this part of the genome. The statistical properties
bound to such functional constraints acting on genome evolution are
currently under investigation. In particular, this type of gene needs
to maintain an open reading frame, obeying the rules of the genetic code. The
genetic code is a source of periodicity in the correlation structure of exons
(12), but their long-range correlation property seems not to be as robust as
for intron sequences. It is not clear at present whether exons are less correlated
than introns or whether they are not long-range correlated at all.
Such differences in statistical correlation between exons and introns are
likely influenced by the different rates and modes of evolution of the coding and
non-coding parts of the genome. The point should be explored by introducing
methods able to address specifically the causes of the
statistical differences, if any, between coding and non-coding parts of the
genome, and able to study the large-scale structure of genomes. In this
sense, Spinelli recently proposed a new theory of heterochromatin, which
is formed by repetitive elements, with a focus on their fractal nature (13). In this
work we consider the idea that at the basis of the statistical properties
described above there is a generalized repetition of motifs in the genome, and
we introduce a new method to analyse DNA sequences. The aim is to interpret
the linguistic and fractal behaviour of genome information, to address its
possible causes and to try to explain such properties in the light of this new
statistical method of DNA analysis.
1.- The algorithm implementation
We modify the BDS statistic method, a known tool in economic statistics (14,
15, 18), and adapt it to analyse DNA sequences with an emphasis on their
repetitive content.
Let $S = \{A, C, G, T\}$ be the possible elements of genetic sequences. Take a
DNA sequence $s_1, s_2, \dots, s_N$, where $s_i \in S$. Define the distance:
$$d(s, t) = 1 \ \text{if } s \neq t, \qquad d(s, t) = 0 \ \text{if } s = t \tag{1}$$
Let m-histories be the sets of ordered m-tuples
$$H_{m,\tau} = \{\, s_{i,m,\tau} = (s_i, s_{i+\tau}, \dots, s_{i+(m-1)\tau}) \ \text{s.t.}\ i = 1, \dots, N - (m-1)\tau \,\} \tag{2}$$
$\tau > 1$ allows us to skip elements of the sequence, but we shall usually work
with $\tau = 1$. Thus $s_{i,m,\tau}$ is the m-history starting at $s_i$ with $\tau$-step sequencing. To
be able to count repetitions we define distances between m-histories:
$$d(s_{i,m,\tau}, s_{j,m,\tau}) = d\big((s_i, \dots, s_{i+(m-1)\tau}), (s_j, \dots, s_{j+(m-1)\tau})\big) = d(s_i, s_j) + \dots + d(s_{i+(m-1)\tau}, s_{j+(m-1)\tau}) \tag{3}$$
This says that the distance between two m-histories $s_{i,m,\tau}$, $s_{j,m,\tau}$ is the
number of times corresponding positions hold different letters. If the distance
is $r$, there are $m-r$ repetitions. Now we count these repetitions. Let
$$C_{m,r,\tau} = \#\{\, (i,j) \ \text{s.t.}\ i < j \ \text{and}\ d(s_{i,m,\tau}, s_{j,m,\tau}) = r \,\} \tag{4}$$
For each $m$, $r$, $\tau$, $C_{m,r,\tau}$ is the number of distinct pairs of m-histories (with
$\tau$-sequencing) which have $m-r$ letters in the same positions. For each DNA
sequence we calculate:
$$C_{m,r,\tau} \ \text{for}\ 1 \le m \le 32 \ \text{(where $m$ is the dimension)} \tag{5}$$
and $0 \le r \le m$. The estimation was repeated for $\tau = 10$ and $\tau = 100$. In
effect this is a cheap way of examining long structures using a lower
dimensional analysis, sampling repetitions only once every $\tau$ positions. An
m-history samples a possible structure of length $\tau m$ at the following positions:
$$1, \ 1+\tau, \ \dots, \ \tau(m-1)+1. \tag{6}$$
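The definitions in equations (1) to (4) can be illustrated with a brute-force sketch. This is not the authors' implementation (the fast BDS-type algorithms they cite are more efficient); the function names `m_histories`, `hamming` and `pair_counts` are hypothetical and chosen only for this illustration, and the quadratic pair enumeration is only practical for short sequences.

```python
from collections import Counter
from itertools import combinations


def m_histories(seq, m, tau=1):
    """All m-histories (s_i, s_{i+tau}, ..., s_{i+(m-1)tau}) of seq, as in eq. (2)."""
    span = (m - 1) * tau
    return [seq[i:i + span + 1:tau] for i in range(len(seq) - span)]


def hamming(a, b):
    """Distance of eq. (3): number of positions where two m-histories differ."""
    return sum(x != y for x, y in zip(a, b))


def pair_counts(seq, m, tau=1):
    """List whose r-th entry is C_{m,r,tau} of eq. (4), for r = 0, ..., m."""
    histories = m_histories(seq, m, tau)
    counts = Counter(hamming(a, b) for a, b in combinations(histories, 2))
    return [counts.get(r, 0) for r in range(m + 1)]


if __name__ == "__main__":
    # Toy sequence, not real data: 7 histories of length 2, hence 21 pairs.
    print(pair_counts("ACGTACGT", m=2))  # [3, 0, 18]
```

In the toy call, three pairs of 2-histories coincide exactly (r = 0) and the remaining eighteen pairs differ in both positions (r = 2).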
2.- Results
2.1.- The study of simulated and naturally occurring DNA
sequences
At this point we compare the statistical properties of naturally occurring
DNA sequences with those of simulated random sequences. This is done in
order to search for a structured signal that can be considered a deviation from
pure randomness in generating DNA information. If some kind of complex
language arising from deterministic dynamics is present in DNA chains, it
should be possible to distinguish this behaviour from random information
produced by a random flux of mutations. A system, or a series of data that
describes a system, is considered random if a time series analysis does not
produce any kind of pattern. A DNA chain can be considered a time series in
which it should be possible to distinguish regularity, distinct from
randomness, indicating that the flux of genetic information tends to generate
ordered signals. Many definitions address this type of phenomenon in natural
systems, searching for descriptive laws in many fields of science. In our
proposition the idea is to calculate the deviation from randomness and
compare it to prototypical forms of language emergence, or to pure
randomness. With this in mind, for each of these cases $C_{m,r,\tau}$ was calculated
for 20 random sequences obtained by drawing from all of the letters at
random (with replacement). In all cases care was taken not to include
m-histories running across a break in the original data, and the random
sequences kept the same breaks in the data. Finally a t-statistic was
calculated to measure how significantly the original $C_{m,r,\tau}$ differed from
the average of the $C_{m,r,\tau}$ of the random sequences. (Footnote: This is the
coefficient for the DNA sample divided by the standard deviation of its 20
randomized controls.) Very high t's were very
often obtained, following a very similar pattern for all of the DNA sequences
we analyzed. An example of this pattern can be seen in Figure 1a. Blue
indicates significantly negative results; red, insignificant or zero results;
yellow, significantly positive results. Significantly positive (negative) means
that for the given $C_{m,r,\tau}$, the DNA had significantly more (fewer) repetitions than
if it were random.
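The comparison against randomized controls described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names are hypothetical, the control mean is subtracted before dividing by the control standard deviation (one reading of "how significantly different from the average"), and breaks in the data are ignored for simplicity.

```python
import random
import statistics
from collections import Counter
from itertools import combinations


def pair_counts(seq, m, tau=1):
    """Brute-force C_{m,r,tau} for r = 0, ..., m (hypothetical helper)."""
    span = (m - 1) * tau
    hist = [seq[i:i + span + 1:tau] for i in range(len(seq) - span)]
    dist = Counter(sum(x != y for x, y in zip(a, b))
                   for a, b in combinations(hist, 2))
    return [dist.get(r, 0) for r in range(m + 1)]


def t_statistics(seq, m, tau=1, n_controls=20, seed=0):
    """t-like score per r: (observed - control mean) / control std. dev.

    Controls are built by drawing letters of seq with replacement.
    Subtracting the control mean is an assumption of this sketch.
    """
    rng = random.Random(seed)
    observed = pair_counts(seq, m, tau)
    controls = [pair_counts("".join(rng.choices(seq, k=len(seq))), m, tau)
                for _ in range(n_controls)]
    scores = []
    for r, obs in enumerate(observed):
        sample = [c[r] for c in controls]
        sd = statistics.stdev(sample)
        mean = statistics.mean(sample)
        scores.append((obs - mean) / sd if sd > 0 else float("nan"))
    return scores


if __name__ == "__main__":
    # A strictly periodic toy sequence, used only to exercise the code.
    for r, t in enumerate(t_statistics("ACGT" * 15, m=4)):
        print(r, round(t, 2))
```

For a strictly periodic toy sequence such as this one, exact matches (r = 0) are far more frequent than in the shuffled controls and intermediate distances are absent, so the scores come out positive at r = 0 and negative at intermediate r.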
We observe that, relative to the random average, there are more comparisons
with a high level of coincidence (low r) and with a low level of coincidence (high
r), and therefore fewer comparisons with intermediate
levels of coincidence. The non-random structure of DNA sequences is
represented by a higher presence of longer than average sequences
interspersed with shorter than average sequences. This is especially observed
for $\tau = 1$ (see an example in Fig. 1a). At $\tau = 10$ this pattern of results
typically becomes less significant (Fig. 1b). By $\tau = 100$ a different structure of
repetition is detected (Fig. 1c). In the case of the Humpdhal genome an
inordinately high number of long sequences is present. The different results
obtained for different $\tau$ indicate that our method can be used to investigate
different aspects of complexity in the language information hidden in DNA
sequences.
To be sure that the results are not spurious, two random DNA sequences
were constructed, one assigning a probability of 1/4 to each letter and
another assigning non-uniform probabilities, to simulate a sequence 8000 bases
long. In both cases it was very unlikely to obtain significant deviations from
the mean and the graphs are almost uniformly red (Figure 1d for $\tau = 1$; the
results for $\tau = 10$ and 100 are similar).
The question is now what the non-random structure is. We constructed
two artificial DNA sequences to try to reproduce the observed patterns of
significance in the (m,r) plane. In the first sequence, called "Random Words",
we gave each of the following sequences an equal probability of appearing
next: A, C, G, T, CA, GA, GC, GT, AGC, ATG, TAC, TTG, TGAG, TCGA, TCCA,
ACGC; we called this sequence Level 1.
Random Words DNA chooses amongst these 16 "words" at random. In the
second simulated sequence, called "Random Sentences", we chose 8
sequences of 50 letters generated as Level 1 Random Words and gave them an
equal probability of appearing. The sentences, which are called Level 2, are
the following:
ACGCATGCGGCATGTCGACAAGCTACACATACGTCCAGATGTACATGATG
AGCTACATGTGAGTACCTGAGAGCAGCGTGTGAGATGTCCAATGGCTGAG
AACGCAGTGAGTTGACCAATGTTGTACGCCAGCCAGCTCGAGATGGCAGC
AACGCGTATGGCTCGAGATTGAGCACGCGTGATCGAGCAGCGCTCCAAGC
TCCATCCATGAGTCGAGTTCCAATGTACATGAGCGATATGGGAGACGCTC
GAAGTGCTCGATCCATACTACATGAATGATGATGCTGAGTGACGCTCCAG
TAATGTCGATGAGATGGCACGCGATGAGAGCACGCTCCATGAGTATGAGT
ACGATCCAATGTAGTTGAGCCATGGAATCCATACTTGCAGTAGTGAGAAG
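The two simulations can be sketched as below. The 16 words are those listed in the text; everything else is an assumption of this illustration: the function names are hypothetical, the sentences are freshly generated rather than copied from the list above, and cutting a word stream at exactly 50 letters is a guess at how fixed-length sentences were obtained.

```python
import random

# The 16 equiprobable Level 1 "words" given in the text.
WORDS = ["A", "C", "G", "T", "CA", "GA", "GC", "GT",
         "AGC", "ATG", "TAC", "TTG", "TGAG", "TCGA", "TCCA", "ACGC"]


def random_words(length, rng):
    """Level 1: concatenate randomly chosen words, then cut to `length` bases.

    Truncating the last word is an assumption of this sketch.
    """
    pieces, total = [], 0
    while total < length:
        word = rng.choice(WORDS)
        pieces.append(word)
        total += len(word)
    return "".join(pieces)[:length]


def random_sentences(length, rng, n_sentences=8, sentence_len=50):
    """Level 2: concatenate randomly chosen fixed sentences built from Level 1."""
    sentences = [random_words(sentence_len, rng) for _ in range(n_sentences)]
    n_blocks = -(-length // sentence_len)  # ceiling division
    return "".join(rng.choice(sentences) for _ in range(n_blocks))[:length]


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the run is reproducible
    print(random_words(60, rng))
    print(random_sentences(150, rng))
```

Feeding such sequences to a pair-counting routine would then allow the Level 1 and Level 2 significance patterns to be compared with those of natural sequences.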
The idea of the simulation is to introduce a bias in the distribution of the
words in the DNA chain. In the genesis of a structure or a language it is
important to define a set of forbidden words. By excluding most of the
possibilities that the alphabet of 4 letters offers for generating sequence information,
and forcing our system always to repeat the same words or sentences, we
introduce a non-random bias in the simulated sequences. We view this as
a simulation of language emergence. One can think of these words as
representing different molecular structures generated from the DNA
sequence.
The Random Sentences DNA simulation chooses amongst these eight 50
base long "sentences" at random, constructing a DNA chain characterized by a
bias of repetition that mimics the basis of language emergence, or molecular
coding through sequences with specific functions. The results of the analysis,
shown in Figure 2, are very clear. In the case of $\tau = 1$, both artificial
sequences generated the same general pattern of significance in the
(m,r) plane as did the naturally occurring DNA sequences (Figs. 2a, 2c). On the
other hand, in the case of $\tau = 10$ the Level 1 artificial DNA sequence gave
almost random results, while the Level 2 sequence preserved a similar
non-random pattern, showing that its significance pattern in the (m,r) plane is
preserved for larger m (Figs. 2b, 2d). As mentioned above, the Humphdal
example in Figs. 1a, 1b, 1c is intermediate between some naturally occurring
DNA sequences that did and some that did not preserve this structure for $\tau = 10$.
It is interesting that we can distinguish quite clearly between the artificial
DNAs generated as Random Words or as Random Sentences.
It is clear that there is a huge number of ways to produce a structure in
a DNA sequence based on the language of repetition of motifs; we simulate only
a very small part. It is intriguing that this simulated part resembles the
behaviour of naturally occurring DNA sequences. This indicates that in
natural sequences there is the persistence of a deterministic dynamic based
on the repetition of words. It is likely that the language thus originated and
evidenced with our method shares common features with the fractal landscapes
suggested by other authors (9). However, the methods used are too different
for a direct comparison. What we emphasize with our formalization is that it
focuses attention on a hypothesis of language formation. A language is formed
by words that can be detected in natural sequences, but not in randomly
simulated sequences. This hypothesis is supported by the fact that our
simulation experiments resemble the statistical properties of naturally
occurring sequences.
2.2.- Introduction of a new statistical tool
In order to summarise the properties of the graphic results we introduce another
kind of formalization, which is a statistic similar to the Lyapunov exponent.
With this new kind of formalization it is possible to obtain a one-dimensional
graph useful for evolutionary comparisons. We call this index the Spinelli and
Mayer index.
If $C(m, m)$ represents the number of m-string comparisons with m
repetitions (that is, pairs of m-histories at distance $r = 0$), we introduce:
$$SM(m) = C(m+1, m+1) / C(m, m) \tag{7}$$
This is equivalent to studying the diagonal of the graphs shown in Figures 1
and 2. The formalization is an intuitive statistic related to the idea of
structure in DNA chains. Its meaning can be expressed as the probability that,
when two strings of DNA with m repetitions are compared, the next entry will
also be repeated. It can be predicted that a rapid decay of the
one-dimensional graph of our SM index indicates the absence of structure in the
DNA sample under investigation. Random DNA sequences are supposed to have
no sign of structure, whereas highly structured DNA will have a persistent
pattern of repetition or a language structure. We simulated these
extreme points of randomness in our study, as shown in Figures 1 and 2. The
SM index is a simple tool to discriminate random signals from structured ones.
In fact, we consider that the properties reported as long-range correlation,
language structures and fractal landscapes have a similar origin in the
different distributions of words or sentences that we described in the previous
section. Such a distribution is far from random when a preferred set of words
or sentences is chosen by evolution to generate structured signals in DNA
chains.
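The SM index of equation (7) can be sketched compactly, since pairs of m-histories with m repetitions are simply pairs of identical m-histories. The function names below are hypothetical, and returning `None` when no m-history repeats is a convention of this sketch rather than something stated in the text.

```python
from collections import Counter


def exact_match_pairs(seq, m, tau=1):
    """C(m, m): number of pairs i < j whose m-histories are identical."""
    span = (m - 1) * tau
    occurrences = Counter(seq[i:i + span + 1:tau]
                          for i in range(len(seq) - span))
    # A history seen n times contributes n * (n - 1) / 2 identical pairs.
    return sum(n * (n - 1) // 2 for n in occurrences.values())


def sm_index(seq, m, tau=1):
    """SM(m) = C(m+1, m+1) / C(m, m), or None when no m-history repeats."""
    denominator = exact_match_pairs(seq, m, tau)
    if denominator == 0:
        return None
    return exact_match_pairs(seq, m + 1, tau) / denominator


if __name__ == "__main__":
    toy = "ACGT" * 15  # strictly periodic toy sequence, not real data
    for m in range(1, 8):
        print(m, sm_index(toy, m))
```

On a strictly periodic toy sequence the index stays close to 1 for every m, which corresponds to the persistent structure described above; for a random sequence one would instead expect it to fall quickly.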
2.3.- Evolutionary study of DNA sequences
It is now time to apply our analysis to open questions that we formalize in this
way: a) Is our method able to distinguish the differential statistical properties
between introns and exons suggested by several authors (3)? b) Do exons have
statistical properties analogous to those of introns, as suggested by other authors (4,5)?
c) Can evolutionary trends of DNA languages from Bacteria to Man be studied
and, if so, what information can be extracted? A sample of
sequences of exons, introns and viruses was extracted from GenBank and
submitted to our analysis. The results for $\tau = 1$ and $\tau = 10$ are shown in Figure
3. It should be noted that it is possible to distinguish between exonic and
intronic sequences. All intron sequences are clustered at a high level of
repetition, for both values of $\tau$. This result is in agreement with the
proposition of Peng and co-workers (3). Thus our approach detects
and explains the long-range correlations in intron sequences. Moreover, the
SM graph of a sample of exons extracted from GenBank shows a pattern
of decay that lies approximately between that of uniformly random sequences,
taken as the lower bound for a structured DNA chain, and Level 1 language. In any
case, we noted that a similar, lower grade of language organization is present
in exons, as is apparent by comparison with random simulated sequences. This
may indicate that while introns are more prone to exhibit repeated
structures, this form of organization of genetic information is nevertheless
also present in exon sequences. Other methods have found the peculiarity of
language organization only in intron sequences. However, we suggest that
exons are on average only less correlated than introns, indicating the
persistence of different evolutionary forces and functional constraints acting
on this type of sequence. The action of such forces allows intron sequences
to evolve linguistically while exon sequences are in some way restricted. Even
though the different methods used to study this phenomenon are not directly
comparable, it is very likely that the results do relate to the same statistical
aspect of DNA language, that is, the intensity of repetitions in DNA.
Our proposition is that rather than considering strong differences between the
statistical properties of exons and introns, it is useful to consider a degree of
fractal landscape: less persistent in exons and sometimes totally absent, more
evident in introns. In fact, as shown in Figure 4a, the statistical decay of the
SM index of some of the exon sequences analysed is very close to the decay
observed in random simulated sequences. This is not observed for intron
sequences, indicating that language structure is more persistent in introns
than in exons. In conclusion, exons exhibit a lower degree of language structure
than intron sequences. It has been suggested that exons do not exhibit
different statistical properties from introns. This has been a source of debate
between authors (3, 4, 5). Since our method directly measures the only
possible source of structure in DNA chains, that is, repetition of words or
sentences, our work furnishes a solution to this debate. The method proposed
here allows a direct measure of linguistic differences between exons and
introns, which can be useful to further analyse the large-scale structure of the
genome in evolution.
At this point we consider the idea of analysing sequences extracted from
different genomes of bacteria, archaea and eukaryotes. In Figures 4a and 4b
the SM index for genomic sequences is shown with $\tau = 1$ and $\tau = 10$
respectively. The SM index decay for the sequences considered fell in
between the simulation results for random sequences ("aleat") and highly
structured repetitive sequences (Level 2). This indicates that the
evolutionary patterns of language along the phylogenetic tree admit many
possibilities of language formation, depending on the species considered. In
addition, the patterns express species-specific trends along the tree that, in
most of the cases considered here, lie far from the behaviour of random
simulated sequences and far from the behaviour of highly repeated simulated
sequences. This may indicate that in the evolution of natural
sequences, complex events must occur to generate DNA language
composition. Our simulation experiments suggest two ideal behaviours for
DNA sequence evolution. They are the tendency to a fully random flux of
mutation and the tendency to generate high repetition of motifs in DNA
sequences. The Level 1 simulation is generated using only four single nucleotides, four dinucleotides, four
trinucleotides and four tetranucleotides. The
results show a linear decay of the SM index for Level 1. This indicates a simple
way to generate linguistic behaviour in a DNA sequence that is not reproduced
in any of the sequences analysed so far. Some differences are observed for $\tau = 10$
in the SM index decay, confirming that a different structure of repetition is
detected.
More observations can be made on the analysis of the genomic
sequences shown in Figures 4a and 4b. The pattern of bacterial and viral genomic
sequences is closer to the pattern of the random simulated sequence (Aleat),
in particular Borrelia burgdorferi, Mycoplasma genitalium, and T4
Bacteriophage. This is consistent with the idea that genomes that contain a
high percentage of coding sequences are restricted in their potential to
exhibit high repetition of motifs, as we also show for exons in Figures 4a and
4b. This suggests that the flux of mutation in this kind of sequence has a
tendency to fix random mutations rather than duplicative events of any
nature, from single dinucleotides to a more complex repetition of motifs. The
only exception is Methanococcus jannaschii, which has a quite
structured genome. This is consistent with the fact that this archaeon shows a
similarity with eukaryotic genome evolution, as suggested in (16). The
situation of Pan paniscus mitochondrial DNA is interesting in that it shows a
rate of decay of the SM index that is close to both the random simulation and
Bacteria. This may indicate that mitochondrial DNA still evolves like bacteria,
as suggested by the endosymbiont hypothesis (17). This is not confirmed for
the chloroplast sequence, which shows a language structured genome,
suggesting that different functional constraints act on chloroplast genome
evolution. The estimation of the SM index for Epstein-Barr virus is also
interesting. This virus sequence shows a modular DNA composition very similar
to that of the Level 2 simulation, indicating that this genome evolves by
accumulating a high percentage of repetition of motifs, as does the simulated
DNA sequence.
It has been suggested that fractal landscape may be connected with
isochores (9). In other words, the percentage of GC in a genome fraction is a
property connected with the observed long-range correlations in DNA
sequences. We calculated the GC content of all the sequences subjected to
our analysis (not shown). There is no clear relationship between the decay of
the SM index and the GC content. It should be noted that our method simply
calculates the repetitiveness of a given DNA sequence. We think that this lies
at the basis of the fractal landscape or language suggested by other authors.
There are no very clear reasons to correlate the DNA base content with the
emergence of a fractal landscape. It might be surmised that some
evolutionary constraint connected with isochores allows the emergence of
long-range correlation, but such a connection is not evident in our approach.
Conclusions
There is a long-standing interest in correlation in DNA sequences, several
aspects of which are currently under investigation. First, long-range
correlations are common in physical systems that occur close to critical
points. Second, the fact that DNA shares this property opens the way for a
new formalization of the flux of DNA information, and of evolution itself, as a
critical self-organized phenomenon. In this connection the main paradigm for
the evolution of DNA sequences is that the flux of mutation in DNA information is
basically random and natural selection acts as the sole force of order. Introns
are generally considered to evolve in the absence of natural selection
mechanisms, or in the presence of fewer functional constraints as compared to
exons. There is a contradiction between the measures of deterministic
dynamical persistence in introns presented here and this evolutionary
paradigm. It seems indeed that the tendency to generate order, expressed in
terms of language emergence, is observed when fewer DNA sequences under
selection are present. The consequence is that natural selection is not the
sole source of order: non-random structures in the flux of DNA information
may emerge from non-random flux patterns of DNA mutation that may
dominate random mutations. This scenario is consistent with the idea that
evolution may occur as a combination of self-organization and natural
selection mechanisms.
Instead of correlation, we can think of the biological function of language
composition in DNA. At present there is no direct experimental evidence that
the fractal landscape of DNA sequences plays a biological role. It can be
speculated that more structured DNA furnishes more biological signals for
gene regulation and for the three-dimensional assembly of chromatin. This idea is
consistent with the fact that enhancers and regulation signals are more often
localized in intron sequences than in exons. With our method we do not
directly measure enhancers or regulation signals, but a general tendency of
evolution to generate order in DNA information. The biological basis of
such order and its ultimate meaning should be further investigated.
Considering that it has been proposed that most of the eukaryotic genome is
junk DNA devoid of canonical genetic function, it is intriguing that this
non-coding part of the eukaryotic genome, including introns and heterochromatin,
shows a fractal landscape with properties resembling artificial DNA sequences
with a language composition. Is this source of regularity in the genome a clue
to a non-trivial code in "junk" DNA? What is the ultimate meaning of this code?
The study of this code and its evolution along the phylogenetic tree may
help to elucidate the large-scale structure of genomes in the
light of a modified evolutionary theory including both self-organization and
natural selection.
References
Ben-Jacob, E.; Shochet, O.; Tenenbaum, A.; Cohen, I.; Czirók, A.; and Vicsek, T.
(1994). "Generic modelling of cooperative growth patterns in bacterial
colonies". Nature, 368, 46-49.
Brock, W. A.; Dechert, W. D.; Sheinkman, J. A.; and LeBaron, B. (1996). "A Test for
Independence Based on the Correlation Dimensión". Econometric Reviews,
15(3), 197-235.
Bult, C. J.; White, O.; Olsen, G. J.; Zhou, L; Fleischmann, R. D.; Sutton, G. G.;
Blake, J. A.; FitzGerald, L. M.; Clayton, R. A.; Gocayne, J. D.; Kerlavage, A. R.;
Dougherty, B. A.; Tomb, J. F.; Adams, M. D.; Reich, C. I.; Overbeek, R.;
Kirkness, E. F.; Weinstock, K. G.; Merrick, J. M.; Glodek, A.; Scott, J. L.;
Geoghagen, N. S.; and Venter, J. C. (1996). "Complete genome sequence of the
methanogenic archaeon, Methanococcus jannaschii". Science, 273, 1058-1073.
De Sousa Vieira, M. (1999). "Statistics of DNA sequences: a low frequency analysis",
Phys Rev E, 60, 5932-5937.
Herzel, H.; Trifonov, E. N.; Weiss, O.; and Große, I. (1998). "Interpreting
correlations in biosequences". Physica A, 249, 449-459.
http://www.nsl1i-genetics.org/dnacorr/
LeBaron, B. (1997). "A Fast Algorithm for the BDS Statistic". Studies in Nonlinear
Dynamics & Econometrics, 2, 53-59.
Li, W. H. (1997). Molecular Evolution. Sinauer Associates, Inc. Sunderland, MA. USA.
Mantegna, R. N.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Peng, C. K.; Simons,
M.; and Stanley, H. E. (1994). "Linguistic features of noncoding DNA
Sequences". Phys Rev Lett, 73, 3169-3172.
Margulis, L. (1970). Origin of Eukaryotic Cells. Yale Univ. Press. New Haven, CT. USA.
Mayer-Foulkes, D. (2000). "A generalized fast algorithm for BDS-type Statistics".
Studies in Nonlinear Dynamics and Econometrics, 4(1), Algorithm 2.
http://www.bepress.com/snde/vol4/iss1/algorithm2
Peng, C. K.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Sciortino, F.; Simons, M.;
and Stanley, H. E. (1992). "Long-range correlations in nucleotide sequences".
Nature, 356, 168-170.
Searls, D. B. (2002). "The language of genes". Nature, 420, 211-217.
Spinelli, G. (2003). "Heterochromatin and complexity: a theoretical approach".
Nonlinear Dynamics, Psychology, and Life Sciences, 7(4) October, 329-361.
Stanley, H. E.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Mantegna, R. N.; Peng, C.
K.; and Simons, M. (1996). "Scale invariant features of coding and noncoding
DNA sequences", in: Iannaccone, P. M. and Khokha, M. (Eds.) Fractal Geometry
in Biological Systems. CRC Press, Inc. pp. 249-266.
Stanley, H. E.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Peng, C. K.; and
Simons, M. (1999). "Scaling features of noncoding DNA". Physica A, 273, 1-18.
Voss, R. (1992). "Evolution of long-range fractal correlations and 1/f noise in DNA
base sequences". Phys Rev Lett, 68, 3805-3808.
Voss, R. (1994). "Long-range fractal correlations in DNA introns and exons". Fractals,
2, 1-6.
Figure 1a. Significance of repetitions for HUMPDHAL DNA sequence at tao = 1
Figure 1b. Significance of repetitions for HUMPDHAL DNA sequence at tao = 10
Figure 1c. Significance of repetitions for HUMPDHAL DNA sequence at tao = 100
Figure 1d. Significance of repetitions for a non-uniformly random DNA sequence, tao = 1
[Plots not reproduced. Horizontal axis: Repetitions. Legend: |t| < 2, t ≤ -2, t ≥ 2.]
Figure 2a. Significance of repetitions for Level 1 artificial DNA sequence at tao = 1
Figure 2b. Significance of repetitions for Level 1 artificial DNA sequence at tao = 10
Figure 2c. Significance of repetitions for Level 2 artificial DNA sequence at tao = 1
Figure 2d. Significance of repetitions for Level 2 artificial DNA sequence at tao = 10
[Plots not reproduced. Horizontal axis: Repetitions. Legend: |t| < 2, t ≤ -2, t ≥ 2.]
Figure 3. SM Index for Various Genomes at Dimension 7, tao = 1, 10
[Bar chart not reproduced. Horizontal axis: Genome. Series: tao = 1 and tao = 10.]
Figure 4a. SM index for sample of natural and simulated DNA sequences (tao = 1)
[Plot not reproduced. Series labels as extracted, in legend order: Level2, Level1, EBV, Pombedbs, Mmchr 01, Leish 1, Methano, ArabiIV, Chorcll, Homo 38, Hogchol, Homo 37, Simbisvi, T4, Yeast VII, Worml, An^, Burgd, Mgnita, Panmtd, Uniform.]
Figure 4b. SM index for sample of natural and simulated DNA sequences (tao = 10)
[Plot not reproduced. Series labels as extracted, in legend order: Level 2, Level 1, EBV, Pombedbs, Leish 1, ArabiIV, Methano, Homo 38, Homo 37, Mmchr 01, Chorcll, Worml, Burgd, Mgnita, Yeast VII, Simbisvi, T4, Panmtd, Anq>, Hogchol, Uniform.]