CIDE, División de Economía
Working Paper (Documento de Trabajo) No. 356

Gino Spinelli, David Mayer-Foulkes
A New Method to Study DNA Sequences: The Languages of Evolution

December 2005
www.cide.edu

The CIDE Working Paper collections are a means of disseminating research in progress and of allowing authors to receive comments before final publication. Comments should be sent directly to the author(s).

D.R. ® 2005. Centro de Investigación y Docencia Económicas, carretera México-Toluca 3655 (km. 16.5), Lomas de Santa Fe, 01210, México, D.F. Tel. 5727.9800 exts. 2202, 2203, 2417. Fax: 5727.9885 and 5292.1304. E-mail: publicaciones@cide.edu. www.cide.edu

Production is in the hands of the author(s), who are therefore responsible for the content, style and wording.

Acknowledgments

This article was born during the 10th Annual Meeting of the Society for Chaos Theory in Psychology and Life Sciences, Philadelphia, Pennsylvania, July 20th-23rd, 2000, when Gino Spinelli asked David Mayer if the Correlation Dimension Statistic could be applied to the study of DNA sequences. The affirmative response began a collaboration in which the main part of the work was concluded before March 24th, 2003, when Gino died from a heart attack. Since then his companion Cristina Martín-Castellanos has helped to complete the text. Gino, with this text we greet you and remember you.
Cristina Martín-Castellanos
David Mayer-Foulkes

Abstract

In recent years several authors have reported finding deterministic dynamics in the flux of genetic information. Such dynamics suggest that evolution occurs with the emergence and maintenance of a fractal landscape in DNA chains. In this work we examine the idea that the repetition of motifs lies at the origin of these statistical properties of DNA. To analyse such dynamics of repetition we apply a modification of the BDS statistic, a method borrowed from economic statistics, and adapt it to DNA sequence analysis. We compare the statistical properties of naturally occurring sequences along the evolutionary tree with simulated randomly generated sequences and with simulated sequences containing repetition motifs. We provide a new method to analyse DNA information, which is able to search for a structured signal in genetic information. To better understand the graphic results, we also define a new statistic for a DNA sequence. On the basis of a mathematical interpretation of repetition patterns, a specific fingerprint of DNA sequences is proposed. With this new method we study the statistical properties of exon and intron DNA sequences, finding specific statistical differences. Moreover, by analysing DNA sequences of different species from Bacteria to man, we estimate the evolution of these linguistic DNA features along the evolutionary tree. The results are consistent with the idea that the flux of DNA information is not random, but is formed by patterns of repetitions along the evolutionary tree. The implications for evolutionary theory are discussed.

Introduction

An enormous amount of sequence data is accumulating as the result of extensive sequencing projects of whole genomes from Bacteria, Archaea and Eukaryotes, ranging from the simple Saccharomyces cerevisiae to Homo sapiens. This provides a huge source of information along the evolutionary tree. Although DNA information is constituted by only four nucleotides in a polynucleotide chain, the information embedded in the genome is of high complexity. Many methods have been developed to analyse these data, but it is clear that not all possible information concerning structure, specific patterns and evolutionary trends has been extracted so far. In particular, the interface between statistical mechanics and DNA structure analysis has attracted great interest (1,2). Such complexity clearly reflects the complexity of the organism itself, but many questions remain open concerning the different trends of evolution of DNA information. In addition, such evolutionary trends have not been explored extensively in the light of recent knowledge on emergent nonrandom phenomena at the edge of chaos, where order and structures emerge. In particular, language evolution in DNA information, a sort of self-emerging order in the flux of genetic information, is a promising field for addressing basic questions in the light of biological physics.

A first and well-known specialisation of DNA information is the compartmentalisation of DNA into coding and non-coding regions, exons and introns. Such specialisation is already present in Prokaryotes, but only a small fraction of prokaryotic genomes, in both Bacteria and Archaea, does not code for proteins.
Differential statistical properties between introns and exons have been reported (3), but such reports are contradictory (4,5). However, at present it seems clear that DNA information does have statistical properties such as linguistic features (6,7), noise (8) and a fractal landscape (9). This indicates that the application of different mathematical methods may reveal different statistical properties. Lately, long-range correlation in DNA sequences has attracted great interest. This property, which has been studied extensively (10), is not fully explained, and its ultimate meaning is unknown.

The concept of gene has been revisited by Li as a sequence of DNA that performs a specific function (11). Despite this broad definition, the functional constraints on genes coding for proteins are still important in determining sequence evolution for this part of the genome. The statistical properties bound to such functional constraints acting on genome evolution are currently under investigation. In fact, this type of gene must maintain an open reading frame, obeying the rules of the genetic code. The genetic code is a source of periodicity in the correlation structure of exons (12), but their long-range correlation property seems not to be as robust as for intron sequences. It is not clear at present whether exons are less correlated than introns or are not long-range correlated at all. Such differences in statistical correlation between exons and introns are likely influenced by the differing rate and mode of evolution of the coding and non-coding parts of the genome. The point should be explored by introducing methods able to address specifically the causes of the statistical differences, if any, between coding and non-coding parts of genomes, and able to study the large-scale structure of genomes.
In this sense, Spinelli recently proposed a new theory for heterochromatin (13), which is formed by repetitive elements, with a focus on their fractal nature. In this work we consider the idea that a generalized repetition of motifs in the genome lies at the basis of the statistical properties described above, and we introduce a new method to analyse DNA sequences. The aim is to interpret the linguistic and fractal behaviour of genome information, to address its possible causes and to try to explain such properties in the light of this new statistical method of DNA analysis.

1.- The algorithm implementation

We modify the BDS statistic, a known tool in economic statistics (14, 15, 18), and adapt it to analyse DNA sequences with an emphasis on their repetitive content. Let $S = \{A, C, G, T\}$ be the possible elements of genetic sequences. Take a DNA sequence $s_1, s_2, \ldots, s_N$, where $s_i \in S$. Define the distance:

$$d(s,t) = 1 \ \text{if } s \neq t, \qquad d(s,t) = 0 \ \text{if } s = t \qquad (1)$$

Let m-histories be the sets of ordered m-tuples

$$H_{m,\tau} = \{\, S_{i,m,\tau} = (s_i, s_{i+\tau}, \ldots, s_{i+(m-1)\tau}) \ \text{s.t.}\ i = 1, \ldots, N-(m-1)\tau \,\} \qquad (2)$$

Taking $\tau > 1$ allows us to skip elements of the sequence, but we shall usually work with $\tau = 1$. Thus $S_{i,m,\tau}$ is the m-history starting at $s_i$ with $\tau$-step sequencing. To be able to count repetitions we define distances between m-histories:

$$d(S_{i,m,\tau}, S_{j,m,\tau}) = d\big((s_i, \ldots, s_{i+(m-1)\tau}), (s_j, \ldots, s_{j+(m-1)\tau})\big) = d(s_i, s_j) + \cdots + d(s_{i+(m-1)\tau}, s_{j+(m-1)\tau}) \qquad (3)$$

This says that the distance between two m-histories $S_{i,m,\tau}$, $S_{j,m,\tau}$ is the number of times corresponding positions hold different letters. If the distance is $r$, there are $m-r$ repetitions. Now we count these repetitions. Let

$$C_{m,r,\tau} = \#\{\,(i,j) \ \text{s.t.}\ i < j \ \text{and}\ d(S_{i,m,\tau}, S_{j,m,\tau}) = r \,\} \qquad (4)$$

For each $m$, $r$, $\tau$, $C_{m,r,\tau}$ is the number of distinct pairs of m-histories (with $\tau$-sequencing) which have $m-r$ letters in the same positions.
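As an illustration of definitions (1) to (4), the following Python sketch counts $C_{m,r,\tau}$ by brute force. It is a minimal sketch and not the authors' implementation: the function names `m_histories`, `hamming` and `repetition_counts` are hypothetical, and the quadratic loop over all pairs is only practical for short sequences.

```python
from collections import Counter
from itertools import combinations

def m_histories(seq, m, tau=1):
    """All m-histories (s_i, s_{i+tau}, ..., s_{i+(m-1)tau}) of seq, as strings."""
    span = (m - 1) * tau
    return [seq[i:i + span + 1:tau] for i in range(len(seq) - span)]

def hamming(a, b):
    """Distance (3): number of positions holding different letters."""
    return sum(x != y for x, y in zip(a, b))

def repetition_counts(seq, m, tau=1):
    """Return [C_{m,0,tau}, ..., C_{m,m,tau}] by comparing every pair i < j."""
    histories = m_histories(seq, m, tau)
    counts = Counter(hamming(a, b) for a, b in combinations(histories, 2))
    return [counts.get(r, 0) for r in range(m + 1)]

# Usage: pairs of 2-histories of a short toy sequence, grouped by distance r.
print(repetition_counts("ACGTACGT", m=2))
```

Entry `r` of the returned list is the number of pairs at distance `r`, so entry 0 counts exact repetitions of length `m`.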
For each DNA sequence we calculate:

$$C_{m,r,\tau} \quad \text{for } 1 \le m \le 32 \ \text{(where } m \text{ is the dimension) and } 0 \le r \le m. \qquad (5)$$

The estimations were repeated for $\tau = 10$ and $\tau = 100$. In effect this is a cheap way of examining long structures using a lower-dimensional analysis that samples repetitions only once every $\tau$ positions. An m-history samples a possible structure of length $\tau m$ at the following positions:

$$1,\ 1+\tau,\ \ldots,\ \tau(m-1)+1. \qquad (6)$$

2.- Results

2.1.- The study of simulated and naturally occurring DNA sequences

At this point we compare the statistical properties of naturally occurring DNA sequences with those of simulated random sequences. This is done in order to search for a structured signal that can be considered a deviation from pure randomness in the generation of DNA information. If some kind of complex language arising from deterministic dynamics is present in DNA chains, it should be possible to distinguish this behaviour from random information produced by a random flux of mutations. A system, or a series of data that describes a system, is considered random if a time series analysis does not produce any kind of pattern. A DNA chain can be considered a time series in which it should be possible to distinguish regularity, distinct from randomness, indicating that the flux of genetic information tends to generate ordered signals. Many definitions address this type of phenomenon in natural systems, searching for descriptive laws in many fields of science. Our proposal is to calculate the deviation from randomness and compare it to prototypical forms of language emergence, or to pure randomness. With this in mind, for each of these cases $C_{m,r,\tau}$ was calculated for 20 random sequences obtained by drawing from all of the letters at random (with replacement). In all cases care was taken not to include m-histories running across a break in the original data, and the random sequences kept the same breaks in the data.
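The randomised controls just described can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the data are assumed to be held as a list of contiguous segments separated by breaks, "all of the letters" is read as the pooled letters of the original data, and the names `random_control` and `histories_within_segments` are hypothetical.

```python
import random

def random_control(segments, rng):
    """Control with the same breaks: each segment keeps its length, and its
    letters are drawn with replacement from the pooled original letters."""
    pool = "".join(segments)
    return ["".join(rng.choice(pool) for _ in seg) for seg in segments]

def histories_within_segments(segments, m, tau=1):
    """m-histories that never run across a break between segments."""
    span = (m - 1) * tau
    return [seg[i:i + span + 1:tau]
            for seg in segments
            for i in range(len(seg) - span)]

# Usage: 20 controls for a toy data set made of two segments.
rng = random.Random(0)
data = ["ACGTTGCAAC", "GGATC"]
controls = [random_control(data, rng) for _ in range(20)]
print(len(controls), [len(seg) for seg in controls[0]])
```

Segments shorter than the span of an m-history simply contribute no histories, which is one simple way of respecting the breaks.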
Finally a t-statistic was calculated to measure how significantly the original $C_{m,r,\tau}$ differed from the average of the $C_{m,r,\tau}$ of the random sequences. (Footnote 1: This is the coefficient for the DNA sample divided by the standard deviation of its 20 randomized controls.) Very high t's were very often obtained, following a very similar pattern for all of the DNA sequences we analyzed. An example of this pattern can be seen in Figure 1a. Blue indicates significantly negative results; red, insignificant or zero results; yellow, significantly positive results. Significantly positive (negative) means that for the given $C_{m,r,\tau}$ the DNA had significantly more (fewer) repetitions than if it were random. We observe that there are more than the average number of comparisons with a high level of coincidence (low $r$) and with a low level of coincidence (high $r$), and therefore fewer than the average number with intermediate levels of coincidence. The non-random structure of DNA sequences is represented by a higher presence of longer than average sequences interspersed with shorter than average sequences. This is especially observed for $\tau = 1$ (see an example in Fig. 1a). At $\tau = 10$ this pattern of results typically becomes less significant (Fig. 1b). By $\tau = 100$ a different structure of repetition is detected (Fig. 1c). In the case of the HUMPDHAL sequence an inordinately high number of long sequences is present. The different results obtained for different $\tau$ indicate that our method can be used to investigate different aspects of complexity in the language information hidden in DNA sequences.

To be sure that the results are not spurious, two random DNA sequences were constructed, one assigning a probability of 1/4 to each letter and another assigning non-uniform probabilities, each simulating a sequence 8000 bases long.
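The significance test described earlier in this section can be sketched in Python. This is a hedged illustration only: it assumes the t-statistic is the observed count minus the mean of the control counts, divided by their sample standard deviation, and it uses the threshold of 2 that appears in the figure legends. The names `t_score` and `significance_colour` are hypothetical.

```python
import math
from statistics import mean, stdev

def t_score(observed, control_counts):
    """Standardised deviation of an observed count from its random controls
    (assumed form: (observed - mean) / sample standard deviation)."""
    centre = mean(control_counts)
    spread = stdev(control_counts)
    if spread == 0:
        # Degenerate controls: no spread to scale the deviation by.
        if observed == centre:
            return 0.0
        return math.copysign(math.inf, observed - centre)
    return (observed - centre) / spread

def significance_colour(t, threshold=2.0):
    """Colour coding used in Figures 1 and 2."""
    if t >= threshold:
        return "yellow"   # significantly more repetitions than random
    if t <= -threshold:
        return "blue"     # significantly fewer repetitions than random
    return "red"          # insignificant or zero

# Usage with made-up control counts (not data from the paper).
t = t_score(20, [10, 12, 14, 12, 12])
print(round(t, 3), significance_colour(t))
```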
For both random sequences, significant deviations from the mean were very unlikely and the graphs are almost uniformly red (Figure 1d for $\tau = 1$; the results for $\tau = 10$ and 100 are similar).

The question is now what the non-random structure is. We constructed two artificial DNA sequences to try to reproduce the observed patterns of significance in the $(m,r)$ plane. In the first sequence, called "Random Words", we gave each of the following sequences an equal probability of appearing next: A, C, G, T, CA, GA, GC, GT, AGC, ATG, TAC, TTG, TGAG, TCGA, TCCA, ACGC; we called this sequence Level 1. Random Words DNA chooses amongst these 16 "words" at random. In the second simulated sequence, called "Random Sentences", we chose 8 sequences of 50 letters generated as Level 1 Random Words and gave them an equal probability of appearing. The sentences, which are called Level 2, are the following:

ACGCATGCGGCATGTCGACAAGCTACACATACGTCCAGATGTACATGATG
AGCTACATGTGAGTACCTGAGAGCAGCGTGTGAGATGTCCAATGGCTGAG
AACGCAGTGAGTTGACCAATGTTGTACGCCAGCCAGCTCGAGATGGCAGC
AACGCGTATGGCTCGAGATTGAGCACGCGTGATCGAGCAGCGCTCCAAGC
TCCATCCATGAGTCGAGTTCCAATGTACATGAGCGATATGGGAGACGCTC
GAAGTGCTCGATCCATACTACATGAATGATGATGCTGAGTGACGCTCCAG
TAATGTCGATGAGATGGCACGCGATGAGAGCACGCTCCATGAGTATGAGT
ACGATCCAATGTAGTTGAGCCATGGAATCCATACTTGCAGTAGTGAGAAG

The idea of the simulation is to introduce a bias in the distribution of the words in the DNA chain. In the genesis of a structure or a language it is important to define a set of forbidden words. By excluding most of the possibilities that the four-letter alphabet offers for generating sequence information, and by forcing our system always to repeat the same words or sentences, we introduce a non-random bias in the simulated sequences. We view this as a simulation of language emergence. One can think of these words as representing different molecular structures generated from the DNA sequence.
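The two artificial sequences can be generated with a few lines of Python. The sketch below uses the 16 words listed above; everything else is an assumption made for illustration: the names `random_words` and `random_sentences` are hypothetical, sequences are truncated to the requested length, and the sentences are drawn afresh rather than taken from the eight printed above.

```python
import random

# The 16 equiprobable "words" of the Level 1 (Random Words) simulation.
WORDS = ["A", "C", "G", "T", "CA", "GA", "GC", "GT",
         "AGC", "ATG", "TAC", "TTG", "TGAG", "TCGA", "TCCA", "ACGC"]

def random_words(length, rng):
    """Level 1: concatenate words chosen uniformly at random, then truncate."""
    parts, total = [], 0
    while total < length:
        word = rng.choice(WORDS)
        parts.append(word)
        total += len(word)
    return "".join(parts)[:length]

def random_sentences(length, rng, n_sentences=8, sentence_len=50):
    """Level 2: build a few fixed Level 1 sentences, then concatenate
    sentences chosen uniformly at random and truncate."""
    sentences = [random_words(sentence_len, rng) for _ in range(n_sentences)]
    n_draws = -(-length // sentence_len)  # ceiling division
    return "".join(rng.choice(sentences) for _ in range(n_draws))[:length]

# Usage: two simulated sequences of 8000 bases each.
rng = random.Random(2005)
level1 = random_words(8000, rng)
level2 = random_sentences(8000, rng)
print(level1[:60])
print(level2[:60])
```

Passing an explicit `random.Random` instance keeps the simulation reproducible, which helps when comparing runs at different values of $\tau$.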
The Random Sentences DNA simulation chooses amongst these eight 50-base-long "sentences" at random, constructing a DNA chain characterized by a bias of repetition that mimics the basis of language emergence, or molecular coding through sequences with specific functions.

The results of the analysis, shown in Figure 2, are very clear. In the case of $\tau = 1$, both artificial sequences generated the same general pattern of significance in the $(m,r)$ plane as did the naturally occurring DNA sequences (Figs. 2a, 2c). On the other hand, in the case of $\tau = 10$ the Level 1 artificial DNA sequence gave almost random results, while the Level 2 sequence preserved a similar nonrandom pattern, showing that its significance pattern in the $(m,r)$ plane is preserved for larger $m$ (Figs. 2b, 2d). As mentioned above, the HUMPDHAL example in Figs. 1a, 1b, 1c is intermediate between some naturally occurring DNA sequences that did and some that did not preserve this structure for $\tau = 10$. It is interesting that we can distinguish quite clearly between the artificial DNAs generated as Random Words and as Random Sentences.

Clearly there is a huge number of ways to produce structure in a DNA sequence based on the language of repetition of motifs; we simulate only a very small part. It is intriguing that this simulated part resembles the behaviour of naturally occurring DNA sequences. This fact indicates that in natural sequences there is the persistence of a deterministic dynamic based on the repetition of words. It is likely that the language thus originated and evidenced with our method shares common features with the fractal landscapes suggested by other authors (9). However, the methods used are too different for a direct comparison. What we emphasize with our formalization is that it focuses attention on a hypothesis of language formation. A language is formed by words that can be detected in natural sequences, but not in randomly simulated sequences.
This hypothesis is confirmed by the fact that our simulation experiments resemble the statistical properties of naturally occurring sequences.

2.2.- Introduction of a new statistic tool

In order to summarise the properties of the graphic results we introduce another kind of formalization, a statistic similar to the Lyapunov exponent. With this new kind of formalization it is possible to obtain a one-dimensional graph useful for evolutionary comparisons. We call this index the Spinelli and Mayer (SM) index. If $C(m,m)$ represents the number of m-string comparisons with $m$ repetitions (that is, distance $r = 0$ in the notation of equation (4)), we introduce:

$$SM(m) = \frac{C(m+1,\, m+1)}{C(m,\, m)} \qquad (7)$$

This is equivalent to studying the diagonal of the graphs shown in Figures 1 and 2. The formalization is an intuitive statistic related to the idea of structure in DNA chains. Its meaning can be expressed as the probability that, when two strings of DNA with $m$ repetitions are compared, the next entry will also be repeated. It can be predicted that a rapid decay of the one-dimensional graph of our SM index indicates the absence of structure in the DNA sample under investigation. Random DNA sequences are supposed to have no sign of structure, whereas highly structured DNA will show a persistent pattern of repetition, or a language structure. We simulated these extreme points in our study, as shown in Figures 1 and 2. The SM index is a simple tool to discriminate random signals from structured ones. In fact, we consider that the properties reported as long-range correlation, language structures and fractal landscapes have a similar origin in the different distributions of words or sentences that we described in the previous section. Such a distribution is far from random when a preferred set of words or sentences is chosen by evolution to generate structured signals in DNA chains.
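Equation (7) can be illustrated with a short Python sketch. It is not the authors' implementation: the names `exact_match_pairs` and `sm_index` are hypothetical, and instead of comparing all pairs it counts identical m-histories with a dictionary, which gives the same number of exactly matching pairs.

```python
from collections import Counter

def exact_match_pairs(seq, m, tau=1):
    """C(m, m): number of pairs i < j whose m-histories coincide in all
    m positions (distance r = 0)."""
    span = (m - 1) * tau
    groups = Counter(seq[i:i + span + 1:tau] for i in range(len(seq) - span))
    # A history occurring n times contributes n * (n - 1) / 2 matching pairs.
    return sum(n * (n - 1) // 2 for n in groups.values())

def sm_index(seq, m, tau=1):
    """SM(m) = C(m+1, m+1) / C(m, m); NaN when no m-history is repeated."""
    denominator = exact_match_pairs(seq, m, tau)
    if denominator == 0:
        return float("nan")
    return exact_match_pairs(seq, m + 1, tau) / denominator

# Usage: a perfectly periodic toy sequence keeps SM(m) close to 1.
periodic = "ACGT" * 10
print([round(sm_index(periodic, m), 3) for m in range(1, 6)])
```

For a structured sequence such as the periodic toy example the index stays high as `m` grows, whereas a rapid decay would point to the absence of structure.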
2.3.- Evolutive study of DNA sequences

It is now time to apply our analysis to open questions, which we formalize in this way:

a) Is our method able to distinguish the differential statistical properties between introns and exons suggested by several authors (3)?
b) Do exons have statistical properties analogous to those of introns, as suggested by other authors (4,5)?
c) Can evolutionary trends of DNA languages from Bacteria to Man be studied and, if so, what information can be extracted?

A sample of sequences of exons, introns and viruses was extracted from GenBank and submitted to our analysis. The results for $\tau = 1$ and $\tau = 10$ are shown in Figure 3. It should be noted that it is possible to distinguish between exonic and intronic sequences. All intron sequences are clustered at a high level of repetition, for both values of $\tau$. This result is in perfect agreement with the proposition of Peng and co-workers (3). Thus our approach actually detects and explains the long-range correlations in intron sequences. Moreover, the SM graph of a sample of exons extracted from GenBank shows a pattern of decay that lies approximately between that of uniformly random sequences, taken as the lower bound of structure in a DNA chain, and that of the Level 1 language. In any case, we noted that a similar, lower grade of language organization is present in exons, as is apparent by comparison with random simulated sequences. This may indicate that while introns are more prone to exhibit repeated structures, this form of organization of genetic information is nevertheless also present in exon sequences. Other methods have found the peculiarity of language organization only in intron sequences. However, we suggest that exons are on average only less correlated than introns, indicating the persistence of different evolutionary forces and functional constraints acting on this type of sequence.
The action of such forces allows intron sequences to evolve linguistically while exon sequences are in some way restricted. Even though the different methods used to study this phenomenon are not directly comparable, it is very likely that the results relate to the same statistical aspect of DNA language, namely the intensity of repetitions in DNA. Our proposition is that, rather than considering strong differences between the statistical properties of exons and introns, it is useful to consider a degree of fractal landscape: less persistent in exons and sometimes totally absent, more evident in introns. In fact, as shown in Figure 4a, the statistical decay of the SM index of some of the exon sequences analysed is very close to the decay observed in random simulated sequences. This is not observed for intron sequences, indicating that language structure is more persistent in introns than in exons. In conclusion, exons exhibit a lower degree of language structure than intron sequences.

It has been suggested that exons do not exhibit different statistical properties from introns. This has been a source of debate between authors (3, 4, 5). Since our method directly measures the only possible source of structure in DNA chains, that is, repetition of words or sentences, our work furnishes a solution to this debate. The method proposed here allows a direct measure of linguistic differences between exons and introns, which can be useful to analyse further the large-scale structure of the genome in evolution.

At this point we consider the idea of analysing sequences extracted from different genomes of bacteria, archaea and eukaryotes. In Figures 4a and 4b the SM index for genomic sequences is shown with $\tau = 1$ and $\tau = 10$ respectively. The SM index decay for the sequences considered fell between the simulation results for random sequences ("Aleat") and for highly structured repetitive sequences (Level 2). This fact indicates that the evolutive patterns of language along the phylogenetic tree assume many possibilities of language formation depending on the species considered. In addition, the patterns express species-specific trends along the tree that, in most of the cases considered here, lie far from the behaviour of random simulated sequences and far from the behaviour of highly repeated simulated sequences. This may indicate that complex events must occur to generate the DNA language composition of natural sequences as they evolve.

Our simulation experiments suggest two ideal behaviours for DNA sequence evolution: the tendency to a fully random flux of mutation, and the tendency to generate high repetition of motifs in DNA sequences. The Level 1 simulation is generated using only four single nucleotides, four dinucleotides, four trinucleotides and four tetranucleotides. The results show a linear decay of the SM index for Level 1. This indicates a simple way to generate linguistic behaviour in a DNA sequence that is not reproduced in any of the sequences analysed so far. Some differences are observed for $\tau = 10$ in the SM index decay, confirming that a different structure of repetition is detected.

More observations can be made on the analysis of the genomic sequences shown in Figures 4a and 4b. The pattern of bacterial and viral genomic sequences is closer to the pattern of the random simulated sequence (Aleat), in particular for Borrelia burgdorferi, Mycoplasma genitalium and T4 bacteriophage. This is consistent with the idea that genomes containing a high percentage of coding sequences are restricted in their potential to exhibit high repetition of motifs, as we also show for exons in Figures 4a and 4b. This suggests that the flux of mutation in this kind of sequence has a tendency to fix random mutations rather than duplicative events of any nature, from single dinucleotides to a more complex repetition of motifs.
The only exception is Methanococcus jannaschii, which has a quite structured genome. This is consistent with the fact that this archaeon shows a similarity with eukaryotic genome evolution, as suggested in (16). The situation of Pan paniscus mitochondrial DNA is interesting in that it shows a rate of decay of the SM index that is close to both the random simulation and Bacteria. This may indicate that mitochondrial DNA still evolves like bacteria, as suggested by the endosymbiont hypothesis (17). This is not confirmed for the chloroplast sequence, which shows a language-structured genome, suggesting that different functional constraints act on chloroplast genome evolution. The estimation of the SM index for Epstein-Barr virus is also interesting. This virus sequence shows a modular DNA composition very similar to that of the Level 2 simulation, indicating that this genome evolves by accumulating a high percentage of repetition of motifs, as does the simulated DNA sequence.

It has been suggested that the fractal landscape may be connected with isochores (9). In other words, the percentage of GC in a genome fraction is a property connected with the observed long-range correlations in DNA sequences. We calculated the GC content of all the sequences subjected to our analysis (not shown). There is no clear relationship between the decay of the SM index and the GC content. It should be noted that our method simply calculates the repetitiveness of a given DNA sequence. We think that this lies at the basis of the fractal landscape or language suggested by other authors. There are no very clear reasons to correlate the DNA base content with the emergence of a fractal landscape. It might be surmised that some evolutionary constraint connected with isochores allows the emergence of long-range correlation, but such a connection is not evident in our approach.
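For reference, the GC content mentioned above is simply the fraction of G and C among the bases of a sequence. A minimal Python sketch follows; the name `gc_content` is hypothetical, and ignoring any symbol other than A, C, G and T is an assumption of this sketch, since the paper does not say how such symbols were handled.

```python
def gc_content(seq):
    """Fraction of G and C among the A, C, G, T bases of seq.
    Other symbols (for example N) are ignored; NaN if no bases remain."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else float("nan")

# Usage on a toy sequence.
print(gc_content("ACGTGGCC"))
```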
Conclusions

There is a long-standing interest in correlation in DNA sequences, several aspects of which are currently under investigation. First, long-range correlations are common in physical systems close to critical points. Second, the fact that DNA shares this property opens the way for a new formalization of the flux of DNA information, and of evolution itself, as a critical self-organized phenomenon. In this connection the main paradigm for the evolution of DNA sequences is that the flux of mutation in DNA information is basically random and natural selection acts as the sole force of order. Introns are generally considered to evolve in the absence of natural selection mechanisms, or under fewer functional constraints than exons. There is a contradiction between the measures of deterministic dynamical persistence in introns presented here and this evolutionary paradigm. It seems indeed that the tendency to generate order, expressed in terms of language emergence, is observed when fewer DNA sequences under selection are present. The consequence is that natural selection is not the sole source of order: non-random structures in the flux of DNA information may emerge from non-random flux patterns of DNA mutation that may dominate random mutations. This scenario is consistent with the idea that evolution may occur as a combination of self-organization and natural selection mechanisms.

Instead of correlation, we can think of the biological function of language composition in DNA. At present there is no direct experimental evidence that the fractal landscape of DNA sequences plays a biological role. It can be speculated that more structured DNA furnishes more biological signals for gene regulation and for the three-dimensional assembly of chromatin. This idea is consistent with the fact that enhancers and regulation signals are more often localized in intron sequences than in exons.
With our method we do not directly measure enhancers or regulation signals, but a general tendency of evolution to generate order in DNA information. The biological basis of such order and its ultimate meaning should be investigated further. Considering that it has been proposed that most of the eukaryotic genome is junk DNA devoid of canonical genetic function, it is intriguing that this noncoding part of the eukaryotic genome, including introns and heterochromatin, shows a fractal landscape with properties resembling artificial DNA sequences with a language composition. Is this source of regularity in the genome a clue to a non-trivial code in "junk" DNA? What is the ultimate meaning of this code? The study of this code and its evolution along the phylogenetic tree may help to elucidate the large-scale structure of genomes in the light of a modified evolutionary theory including both self-organization and natural selection.

References

Ben-Jacob, E.; Shochet, O.; Tenenbaum, A.; Cohen, I.; Czirók, A.; and Vicsek, T. (1994). "Generic modelling of cooperative growth patterns in bacterial colonies". Nature, 368, 46-49.

Brock, W. A.; Dechert, W. D.; Scheinkman, J. A.; and LeBaron, B. (1996). "A Test for Independence Based on the Correlation Dimension". Econometric Reviews, 15(3), 197-235.

Bult, C. J.; White, O.; Olsen, G. J.; Zhou, L.; Fleischmann, R. D.; Sutton, G. G.; Blake, J. A.; FitzGerald, L. M.; Clayton, R. A.; Gocayne, J. D.; Kerlavage, A. R.; Dougherty, B. A.; Tomb, J. F.; Adams, M. D.; Reich, C. I.; Overbeek, R.; Kirkness, E. F.; Weinstock, K. G.; Merrick, J. M.; Glodek, A.; Scott, J. L.; Geoghagen, N. S.; and Venter, J. C. (1996). "Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii". Science, 273, 1058-1073.

De Sousa Vieira, M. (1999). "Statistics of DNA sequences: a low frequency analysis". Phys Rev E, 60, 5932-5937.

Herzel, H.; Trifonov, E. N.; Weiss, O.; and Große, I. (1998). "Interpreting correlations in biosequences". Physica A, 249, 449-459.

http://www.nsl1i-genetics.org/dnacorr/

LeBaron, B. (1997). "A Fast Algorithm for the BDS Statistic". Studies in Nonlinear Dynamics & Econometrics, 2, 53-59.

Li, W. H. (1997). Molecular Evolution. Sinauer Associates, Inc. Sunderland, MA, USA.

Mantegna, R. N.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Peng, C. K.; Simons, M.; and Stanley, H. E. (1994). "Linguistic features of noncoding DNA sequences". Phys Rev Lett, 73, 3169-3172.

Margulis, L. (1970). Origin of Eukaryotic Cells. Yale Univ. Press. New Haven, CT, USA.

Mayer-Foulkes, D. (2000). "A generalized fast algorithm for BDS-type statistics". Studies in Nonlinear Dynamics and Econometrics, 4(1), Algorithm 2. http://www.bepress.com/snde/vol4/iss1/algorithm2

Peng, C. K.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Sciortino, F.; Simons, M.; and Stanley, H. E. (1992). "Long-range correlations in nucleotide sequences". Nature, 356, 168-170.

Searls, D. B. (2002). "The language of genes". Nature, 420, 211-217.

Spinelli, G. (2003). "Heterochromatin and complexity: a theoretical approach". Nonlinear Dynamics, Psychology and Life Sciences, 7(4) October, 329-361.

Stanley, H. E.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Mantegna, R. N.; Peng, C. K.; and Simons, M. (1996). "Scale invariant features of coding and noncoding DNA sequences", in: Iannaccone, P. M. and Khokha, M. (Eds.) Fractal Geometry in Biological Systems. CRC Press, Inc. pp. 249-266.

Stanley, H. E.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Peng, C. K.; and Simons, M. (1999). "Scaling features of noncoding DNA". Physica A, 273, 1-18.

Voss, R. (1992). "Evolution of long-range fractal correlations and 1/f noise in DNA base sequences". Phys Rev Lett, 68, 3805-3808.

Voss, R. (1994). "Long-range fractal correlations in DNA introns and exons". Fractals, 2, 1-6.
Figure 1a. Significance of repetitions for HUMPDHAL DNA sequence at tao = 1
Figure 1b. Significance of repetitions for HUMPDHAL DNA sequence at tao = 10
Figure 1c. Significance of repetitions for HUMPDHAL DNA sequence at tao = 100
Figure 1d. Significance of repetitions for a nonuniformly random DNA sequence, tao = 1
[Plots not reproduced. Axis label: Repetitions. Legend: |t| < 2, t ≥ 2, t ≤ -2.]

Figure 2a. Significance of repetitions for Level 1 artificial DNA sequence at tao = 1
Figure 2b. Significance of repetitions for Level 1 artificial DNA sequence at tao = 10
Figure 2c. Significance of repetitions for Level 2 artificial DNA sequence at tao = 1
Figure 2d. Significance of repetitions for Level 2 artificial DNA sequence at tao = 10
[Plots not reproduced. Axis label: Repetitions. Legend: |t| < 2, t ≥ 2, t ≤ -2.]

Figure 3. SM index for various genomes at dimension 7, tao = 1, 10
[Plot not reproduced. Horizontal axis: Genome. Legend: tao = 1, tao = 10.]

Figure 4a. SM index for sample of natural and simulated DNA sequences (tao = 1)
Figure 4b. SM index for sample of natural and simulated DNA sequences (tao = 10)
[Plots not reproduced. The legible series labels include Level 2, Level 1, EBV, Pombedbs, Mmchr 01, Leish 1, Methano, Homo 38, Homo 37, Hogchol, Simbisvi, T4, Burgd, Mgnita, Panmtd and Uniform; the remaining labels are illegible in the source.]