Comparing Nucleotide Composition and Codon Usage in Hyperthermophilic Species, Study Guides, Projects, Research of Biochemistry

A scientific article published in The Open Bioinformatics Journal in 2008. The authors, Subhash Mohan Agarwal and Atul Grover, discuss the nucleotide composition, codon usage, and amino acid content in hyperthermophilic species. They found that arginine, proline, valine, and tyrosine were the most abundant amino acids in hyperthermophilic proteomes, and similar biases were seen when dipeptidic composition of proteins was compared. The study also suggested that elevated growth temperature imposes selective constraints at all three molecular levels: nucleotide composition, codon usage, and amino acid content.

The Open Bioinformatics Journal, 2008, 2, 11-19
1875-0362/08 2008 Bentham Science Publishers Ltd.
Nucleotide Composition and Amino Acid Usage in AT-Rich
Hyperthermophilic Species
Subhash Mohan Agarwal1,* and Atul Grover2
1Bioinformatics Center, School of Information Technology, Jawaharlal Nehru University, New Delhi 110067, India and
2Department of Bioscience and Biotechnology, Banasthali University, Banasthali 304022, India
Abstract: Nucleotide composition, codon usage and amino acid content are important molecular signatures that vary
among different groups of organisms. AT-rich (or GC-poor) hyperthermophiles have been relatively unexplored in these
respects. In this study, we examined the compositional characteristics of four AT-rich genomes, viz. Methanococcus
jannaschii, Sulfolobus solfataricus, Sulfolobus tokodaii and Nanoarchaeum equitans, by comparing them with four
mesophiles of similar genomic GC content. The analysis revealed a significant increase in the purine content of ORFs
due to an increase in guanine content. Moreover, the influence of dinucleotide composition on protein thermostability
was found to be even larger. Accordingly, increased usage of codons containing RR dinucleotides (R = purine) was
observed. Arginine, proline, valine and tyrosine were the most abundant amino acids in hyperthermophilic proteomes,
and a similar bias was seen when the dipeptide composition of proteins was compared. Further analysis of the amino
acid composition of alpha helices indicated increased usage of E, K and R and decreased usage of N and Q. In summary,
the study suggests that elevated growth temperature imposes selective constraints at all three molecular levels:
nucleotide composition, codon usage and amino acid content.
Keywords: Hyperthermophiles; nucleotide bias; codon usage; amino acid composition.
INTRODUCTION
Hyperthermophiles constantly face the challenge of maintaining the stability of their genomes. Raising the melting
point of their DNA by maintaining a relatively high GC content [1] is one of the strategies they have adopted to
address this problem. However, genomic GC content does not correlate with optimal growth temperature (OGT) [2, 3],
and various additional attributes have been suggested to contribute to the stability of genomic DNA in
hyperthermophiles [2, 3]. In fact, a number of hyperthermophiles have a genomic GC content below 40% [1]. On the
other hand, the GC content of rRNA and tRNA shows a strong correlation with OGT [4, 5]. Various studies have
established that these organisms are subject to a variety of selection pressures that act not only at the level of
the global phenotype but at each level of the cell’s organization, i.e. DNA, RNA and proteins [6]. For example,
there is evidence that the proteins of thermophiles are characterized by a distinct pattern of amino acids [7-10].
Moreover, a difference in the pattern of synonymous codon usage between thermophiles and mesophiles has been
observed [7, 10].
Although many studies have focused on understanding the mechanisms that make life possible under these conditions,
it remains unclear whether this is due to external conditions or natural selection [4, 7, 11-14]. To infer the
molecular adjustments to thermal stress, it is desirable to compare the genomic characteristics of hyperthermophiles
with those of mesophilic genera. Singer and Hickey [14] made such an attempt, considering genera with an optimal
growth temperature (OGT) near or above 50°C, but AT-rich hyperthermophilic genomes were ignored in their analysis.
Das et al. [5] examined some hyperthermophilic genomes with a GC content below 50%. A shortcoming of that study was
the broad OGT range (>13°C) over which the genera under study varied. Thus, to minimize ascertainment bias in codon
usage and nucleotide composition between species, we selected mesophiles and hyperthermophiles within an even
narrower OGT range of 7.8°C for comparative analysis.
*Address correspondence to this author at the Bioinformatics Center, School of Information Technology, Jawaharlal
Nehru University, New Delhi 110067, India; E-mail: smagarwal@yahoo.co.in
The hyperthermophilic archaeon Nanoarchaeum equitans is an interesting candidate for this kind of analysis. It
harbors the smallest non-viral genome known to date, which spans 490 kb and comprises 537 protein-coding genes [15].
The genome displays short intergenic regions, a large number of split genes and few pseudogenes, and lacks many
vital metabolic genes [15]. Phylogenetic analysis further suggested that it diverged early in the archaeal lineage,
even before the emergence of Euryarchaeota and Crenarchaeota, and represents a basal archaeal lineage [16]. Because
N. equitans has one of the simplest genomes among cellular organisms, and the simplest among the genera under study,
its genomic features have been treated as a special case within the hyperthermophilic group.
Thus, the present paper outlines comparisons of nucleo-
tide bias, codon usage patterns and amino acid bias drawn
between mesophiles and hyperthermophiles having average
GC content close to 31%.
MATERIALS AND METHODS
The coding sequences (CDS) and the corresponding amino acid sequences for all eight genomes (Table 1)
were downloaded from the GenBank FTP site. Following CDS
integrity check, a number of genes were excluded from the analysis. For a CDS to be selected, a start codon at its
beginning and a stop codon at its end were required, along with no detectable frameshift. Moreover, CDS shorter than
300 nucleotides were removed. The genes thus shortlisted were analyzed for base compositional bias by studying the
prevalence of mononucleotides and dinucleotides. CodonW (available from
http://bioweb.pasteur.fr/seqanal/interfaces/codonw.html) was used to calculate the count of each codon and the
relative synonymous codon usage (RSCU) for each gene within a genome. Similarly, amino acid and dipeptide
compositional bias was calculated for the predicted peptides. Secondary structures were predicted using GOR(IV) [11]
to study the bias in amino acid frequencies in helical structures. Subsequently, a t-test was performed to identify
patterns showing significant differences between the two groups (mesophilic and hyperthermophilic). Initially, mean
values for the 4 mesophilic and 3 hyperthermophilic genomes, excluding N. equitans, were evaluated to derive a
general pattern of similarities and differences between the groups. Later, the mean value for the hyperthermophilic
genomes including N. equitans was calculated and compared with the mesophilic mean to gain information on the
adaptation of N. equitans to its environment.
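
The CDS selection criteria described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `passes_integrity_check`, the set of accepted start codons, and the use of "length divisible by three with no internal in-frame stop codon" as a stand-in for "no detectable frameshift" are all hypothetical choices.

```python
# Sketch of the CDS integrity check; names and thresholds other than the
# 300-nucleotide cutoff are assumptions made for illustration.
START_CODONS = {"ATG", "GTG", "TTG"}   # assumption: common prokaryotic start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}    # standard stop codons
MIN_LENGTH = 300                       # CDS shorter than this are removed


def passes_integrity_check(cds: str) -> bool:
    """Return True if a CDS satisfies the selection criteria sketched above."""
    seq = cds.upper()
    if len(seq) < MIN_LENGTH:
        return False
    if len(seq) % 3 != 0:              # crude proxy for a frameshift
        return False
    if seq[:3] not in START_CODONS or seq[-3:] not in STOP_CODONS:
        return False
    # Reject sequences with an in-frame stop codon before the terminal one.
    internal = (seq[i:i + 3] for i in range(3, len(seq) - 3, 3))
    return not any(codon in STOP_CODONS for codon in internal)


def filter_cds(sequences):
    """Keep only the sequences that pass the check."""
    return [s for s in sequences if passes_integrity_check(s)]
```

A genome's CDS list would be passed through `filter_cds` before any composition statistics are computed.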

RESULTS AND DISCUSSION

NUCLEOTIDE COMPOSITIONAL BIAS AS THERMOPHILIC ADAPTATION

A comparison of nucleotide occurrence in the four mesophiles and four thermophiles showed a significant decrease in
thymine content in the thermophiles (Table 2), coupled with an overall increase in purine content (A+G). Similarly,
a significant increase in the dinucleotides AG, GA and GG was found in hyperthermophiles (Table 3), coupled with a
fall in the frequency of TT. The long-standing hypothesis is that GC-richness of protein-coding genes cannot be
considered a thermophilic signature [17]. The above results confirm the findings of Paz et al. [13], who reported an
abundance of polypurine tracts in thermophilic mRNA sequences; purine loading of mRNAs is expected to reduce RNA-RNA
interactions and thus prevent the formation of double-stranded RNA molecules [18]. Subsequently, the frequency of
each nucleotide at each of the three codon positions was analyzed. Although no significant difference was observed
for any nucleotide at the first and second codon positions, a marked decrease in thymine content and a corresponding
significant increase in guanine content were seen at the third codon position (Table 2). This may be considered a
means of increasing the GC content at third codon sites (GC3). Hurst and Merchant [17] suggested that thermophiles
maintain a higher GC3. However, Singer and Hickey [14] reported a very significant increase in adenine at all codon
positions and a decrease in cytosine at the first and second codon positions, and attributed the significant
increase in purine content to the increased frequency of adenine. This hypothesis does not hold for GC-poor genomes;
instead, the increase in purine content in these genomes was found to be due to an increase in overall guanine
content. Further, the purine richness of codons, with AG, GA or GG as two of the three bases, is also likely to
influence the supercoiling of double-stranded DNA, which affects thermostability in the absence of nucleosome
structures.
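
The quantities compared in this section (mononucleotide percentages, purine content, dinucleotide percentages and third-codon-position composition) can be computed along the following lines. This is a hedged sketch: the function names are hypothetical, and counting dinucleotides with an overlapping one-base sliding window is an assumption, since the counting scheme is not stated in the text.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def nucleotide_percentages(seq):
    """Percentage of each base, ignoring any non-ACGT symbols."""
    counts = Counter(b for b in seq.upper() if b in BASES)
    total = sum(counts.values())
    return {b: 100.0 * counts[b] / total for b in BASES}


def purine_percentage(seq):
    """Purine content, i.e. A + G, as a percentage."""
    pct = nucleotide_percentages(seq)
    return pct["A"] + pct["G"]


def dinucleotide_percentages(seq):
    """Percentage of each of the 16 dinucleotides (overlapping window: an assumption)."""
    seq = seq.upper()
    all_pairs = ["".join(p) for p in product(BASES, repeat=2)]
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[p] for p in all_pairs)
    return {p: 100.0 * counts[p] / total for p in all_pairs}


def third_position_percentages(cds):
    """Base composition at the third codon position of an in-frame CDS."""
    return nucleotide_percentages(cds.upper()[2::3])
```

Applying `third_position_percentages` to the concatenated, in-frame CDS of a genome gives the G and C shares whose sum is GC3.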

CODON USAGE AND RSCU BIAS IN HYPERTHERMOPHILES

There has been recent interest in relating codon usage to the OGT of an organism [19]. Elevated growth temperatures
impose selective constraints on codon-anticodon interactions as well [20]. Our comparison of codon usage revealed
changes in the absolute frequency of 11 codons: a significant increase in the frequency of five codons (GAG, AAG,
CCA, AGG and AGA), coding for glutamic acid, lysine, proline and arginine, respectively, and a significant decrease
in three codons (AAU, CAA and CUU), coding for asparagine, glutamine and leucine, in the hyperthermophiles
(Table 4). Including N. equitans in the statistical analysis led to a fall in the frequencies of three additional
codons, i.e. AUU, CGU and CGC, encoding isoleucine and

Table 1. List of Organisms Studied in the Analysis

Species Name               Abbreviation  GC Content (%)  OGT (°C)
Mesophile
Campylobacter jejuni       Cjej          31              43
Borrelia burgdorferi       Bbur          28              37
Lactococcus lactis         Llac          35.3            30
Rickettsia prowazekii      Rpro          29              35
Hyperthermophile
Methanococcus jannaschii   Mjan          31.3            85
Sulfolobus solfataricus    Ssol          35.8            80
Sulfolobus tokodaii        Stok          32.8            80
Nanoarchaeum equitans      Nequ          31.6            90


AMINO ACID COMPOSITION IS RELATED TO OGT

The average proportion of each amino acid in the mesophiles and in the hyperthermophiles under study was analyzed to
complement the codon usage data (Table 6). As expected, changes were observed in the frequency of seven amino acids.
The proteome analysis indicated that the frequency of four amino acids (arginine, proline, valine and tyrosine) was
markedly higher, while that of three amino acids (asparagine, phenylalanine and glutamine) was substantially lower.
Earlier, Klipcan et al. [22] associated seven amino acids with thermophiles, the so-called class I amino acids.
Valine was not among these seven, but Suhre and Claverie [23] recognized a preference for valine in thermophilic
proteomes. Similarly, de Farias and Bonato [21] found that an increase in glutamate and lysine corresponded with an
equivalent fall in the frequencies of glutamine and histidine, thereby maintaining the (E+K)/(Q+H) ratio. On the
other hand, a predominance of glutamate and valine in thermophilic proteomes was reported by Pasamontes and
Garcia-Vallve [24].

The abundance of purine-rich codons is a possible reason for the high frequency of arginine in thermophilic
proteomes (Table 6). The skew in amino acid frequencies in hyperthermophilic proteomes has been suggested to be
related to the stability of proteins at extreme temperatures [14, 23, 25, 26]. An increased occurrence of proline
residues in loops is thought to enhance the thermostability of proteins [27, 28]. Similarly, valine is known to
provide rigidity to the three-dimensional structures of proteins, causing a smaller increase in conformational
entropy upon unfolding [29]. The higher frequency of tyrosine in hyperthermophilic proteomes, despite its being
encoded by purine-poor codons (TAT and TAC), is explained by its ability to confer thermostability on protein
structures [30]. On the other hand, a decrease in asparagine and glutamine frequencies reduces the potential
deamidation of proteins and thus confers stability on thermophilic proteins [25].

Further, the amino acid compositions of helices in mesophilic and hyperthermophilic genomes were also found to vary.
The oppositely charged residues glutamate, lysine and arginine were more frequent, while asparagine and glutamine
were under-represented, in hyperthermophilic genomes (Fig. 1). It has been suggested that an increase in charged
residues is responsible for an increased number of salt bridges in hyperthermophiles and thereby provides
thermostability to proteins [25]. It is well recognized that minute changes in local weak interactions can bring
about thermostability in proteins [31], while the overall protein conformation may remain unchanged. For
example, Goldstein [32] found measures of thermostability

Table 3. Distribution of Dinucleotide Frequencies

Dinucleotide  Cjej  Bbur  Rpro  Llac  Avg-Meso  Mjan  Ssol  Stok  Avg-Thermo  Sig.  Nequ  Avg-Thermo+Nequ  Sig.
AT            10.0  11.1  11.8  9.1   10.5      10.9  9.7   10.3  10.3        ns    11.3  10.6             ns
AG            6.7   6.4   6.5   5.7   6.3       8.5   8.5   8.2   8.4         ***   7.7   8.2              ***
AC            3.4   3.3   4.2   4.7   3.9       3.5   4.6   4.3   4.1         ns    4.0   4.1              ns
AA            16.1  16.7  14.2  13.5  15.1      15.3  11.6  12.8  13.2        ns    16.7  14.1             ns
TA            9.1   9.5   11.7  6.5   9.2       9.4   10.2  10.8  10.1        ns    11.5  10.5             ns
TG            6.5   6.0   5.8   7.5   6.4       6.8   5.6   5.7   6.0         ns    4.9   5.7              ns
TC            3.5   3.8   3.8   5.2   4.1       2.9   4.3   4.1   3.7         ns    2.6   3.5              ns
TT            13.9  14.1  11.5  11.6  12.8      10.8  9.1   10.2  10.0        *     10.1  10.1             *
CA            4.7   4.7   5.0   6.0   5.1       4.7   4.7   4.6   4.7         ns    4.6   4.6              ns
CT            4.9   4.6   4.8   5.3   4.9       3.8   5.2   5.3   4.8         ns    3.9   4.6              ns
CG            1.4   1.0   1.8   2.5   1.7       0.7   2.2   1.5   1.4         ns    1.4   1.4              ns
CC            1.7   1.8   1.7   2.6   2.0       2.0   2.8   2.4   2.4         ns    3.0   2.6              ns
GC            4.2   3.1   3.5   3.9   3.7       2.9   3.1   2.9   3.0         ns    3.3   3.0              ns
GT            4.1   3.6   4.7   4.7   4.3       4.3   5.2   4.9   4.8         ns    3.8   4.5              ns
GA            6.3   6.7   5.9   7.0   6.5       8.8   7.9   7.5   8.0         *     6.9   7.8              *
GG            3.4   3.6   3.1   4.2   3.6       4.8   5.4   4.5   4.9         *     4.3   4.7              *

The values shown are the percentages of dinucleotides in the complete coding sequences of each genome. Mean values for the mesophilic (Avg-Meso) and hyperthermophilic (Avg-Thermo; Avg-Thermo+Nequ) genomes are shown, together with significance based on a t-test: ns (p>0.05); * (p<0.05); *** (p<0.001).
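
As an illustration of the group comparison behind the significance columns, the AG row of Table 3 can be tested with a two-sample t statistic. This is a sketch under an explicit assumption: the text only says "t-test", so the pooled-variance (Student) form used here is a guess, and the function name `pooled_t_statistic` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Student's two-sample t statistic (equal variances assumed) and its degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return (mean(group_b) - mean(group_a)) / standard_error, df


# AG percentages from Table 3: four mesophiles versus three hyperthermophiles
# (N. equitans excluded, as in the first round of the analysis).
meso_ag = [6.7, 6.4, 6.5, 5.7]
thermo_ag = [8.5, 8.5, 8.2]
t_value, dof = pooled_t_statistic(meso_ag, thermo_ag)
```

For the AG row this gives a t value of about 7.67 on 5 degrees of freedom, which exceeds the tabulated two-tailed critical value of 6.869 for p = 0.001 and is therefore consistent with the *** entry in Table 3.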


Table 4. Number of Codons Per Thousand

Codon  Cjej  Bbur  Rpro  Llac  Avg-Meso  Mjan  Ssol  Stok  Avg-Thermo  Sig.         Nequ  Avg-Thermo+Nequ  Sig.
GGG    5.8   7.8   5.6   7.8   6.8       10.4  9.7   7.3   9.1         ns (0.0790)  10.4  9.4              *
GAG    12.8  17.6  13.3  11.7  13.8      34.8  29.4  23.0  29.1        **           18.9  26.5             *
AGG    2.7   6.4   3.4   1.4   3.5       9.8   17.5  11.9  13.1        **           11.8  12.7             **
AGA    16.0  20.9  15.0  8.1   15.0      27.4  25.1  26.1  26.2        *            24.2  25.7             **
AAG    12.9  21.4  15.5  11.9  15.4      30.7  37.3  27.9  32.0        **           18.5  28.6             *
AAU    54.0  59.2  56.5  41.5  52.8      15.5  32.9  34.9  27.8        *            35.1  29.6             **
AUU    43.7  59.6  51.9  53.6  52.2      48.6  33.7  40.3  40.8        ns (0.0850)  30.4  38.2             *
CGU    6.4   1.8   9.5   15.0  8.2       0.3   1.7   1.4   1.1         ns (0.0860)  0.7   1.0              *
CGC    3.8   0.9   1.9   3.9   2.6       0.1   0.6   0.4   0.4         ns (0.0520)  0.6   0.4              *
CAA    28.4  18.7  24.6  31.1  25.7      9.0   15.5  15.6  13.4        *            20.5  15.2             *
CUA    6.8   8.7   11.6  7.4   8.6       8.5   19.2  16.4  14.7        ns (0.0940)  18.4  15.7             *
CUU    32.1  30.5  20.4  25.5  27.1      9.1   15.2  18.6  14.3        *            6.4   12.3             **
CCA    8.7   9.0   10.8  15.6  11.0      22.5  16.1  17.3  18.6        *            18.8  18.7             *

The values shown are the numbers of codons within each genome, scaled to a total of 1000 for each genome. Only those codons that show significant differences are listed. Significance based on a t-test is also shown, with the p-value in parentheses for non-significant comparisons: ns (p>0.05); * (p<0.05); ** (p<0.01).

Table 5. Relative Synonymous Codon Usage

Codon  Cjej  Bbur  Rpro  Llac  Avg-Meso  Mjan  Ssol  Stok  Avg-Thermo  Sig.        Nequ  Avg-Thermo+Nequ  Sig.
GAG    0.36  0.52  0.46  0.34  0.42      0.81  0.87  0.66  0.78        **          0.49  0.52             *
GAA    1.64  1.48  1.54  1.66  1.58      1.19  1.13  1.34  1.22        **          1.51  1.49             *
AGG    0.53  1.2   0.61  0.23  0.64      1.55  2.25  1.74  1.85        **          1.84  1.07             **
AAU    1.72  1.63  1.7   1.6   1.66      1.41  1.33  1.43  1.39        **          1.33  1.50             ***
AAC    0.28  0.37  0.3   0.4   0.34      0.59  0.67  0.57  0.61        **          0.67  0.50             ***
AUA    0.91  1.12  1.26  0.33  0.91      1.30  1.58  1.51  1.46        ns (0.078)  1.91  1.11             *
AUU    1.52  1.67  1.43  2.1   1.68      1.39  1.07  1.22  1.23        ns (0.065)  0.87  1.51             *
UAU    1.73  1.59  1.73  1.58  1.66      1.55  1.30  1.48  1.44        *           1.56  1.59             *
UAC    0.27  0.41  0.27  0.42  0.34      0.45  0.70  0.52  0.56        *           0.44  0.41             *
UUU    1.86  1.81  1.71  1.58  1.74      1.59  1.19  1.39  1.39        *           1.38  1.57             *
UUC    0.14  0.19  0.29  0.42  0.26      0.41  0.81  0.61  0.61        *           0.62  0.43             *
UCC    0.18  0.27  0.23  0.26  0.24      0.37  0.68  0.43  0.49        *           0.67  0.38             *
CGU    1.29  0.33  1.69  2.52  1.46      0.04  0.21  0.20  0.15        ns (0.06)   0.11  1.03             *
CGC    0.76  0.16  0.34  0.66  0.48      0.01  0.07  0.06  0.05        *           0.09  0.31             *
CUA    0.37  0.5   0.69  0.45  0.50      0.54  1.11  0.96  0.87        ns (0.075)  1.06  0.64             *
CUU    1.78  1.76  1.21  1.55  1.58      0.58  0.88  1.09  0.85        *           0.37  1.02             **
CCU    2.34  1.77  2.01  1.45  1.89      1.02  1.30  1.46  1.26        ns (0.051)  1.25  1.40             *

The values shown are the relative frequencies of synonymous codon usage within each codon group. Only those codons that show significant differences are listed. Significance based on a t-test is also shown, with the p-value in parentheses for non-significant comparisons: ns (p>0.05); * (p<0.05); ** (p<0.01); *** (p<0.001).
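
RSCU values such as those in Table 5 are conventionally obtained by dividing the observed count of a codon by the count expected if all synonymous codons for that amino acid were used equally. The sketch below follows that standard definition; it is not CodonW itself, the function names are hypothetical, and `CODON_FAMILIES` is a deliberately small subset of the genetic code chosen for illustration.

```python
from collections import Counter

# Illustrative subset of synonymous codon families (standard genetic code, RNA alphabet).
CODON_FAMILIES = {
    "Glu": ["GAA", "GAG"],
    "Lys": ["AAA", "AAG"],
    "Asn": ["AAU", "AAC"],
    "Tyr": ["UAU", "UAC"],
}


def count_codons(cds):
    """Count in-frame codons of a CDS, reported in the RNA alphabet."""
    rna = cds.upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3
    return Counter(rna[i:i + 3] for i in range(0, usable, 3))


def rscu(codon_counts, families=CODON_FAMILIES):
    """RSCU = observed count / (family total / number of synonymous codons)."""
    values = {}
    for codons in families.values():
        total = sum(codon_counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid absent: RSCU undefined for this family
        expected = total / len(codons)
        for c in codons:
            values[c] = codon_counts.get(c, 0) / expected
    return values


# Hypothetical counts chosen so that the result matches the Cjej column of Table 5.
example = rscu({"GAG": 18, "GAA": 82})
```

With this definition the values within a two-codon family sum to 2, as GAG (0.36) and GAA (1.64) do in the Cjej column of Table 5.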


being increased number of charged residues. Das et al. [5] reported that these residues are positively charged and
found a positive correlation between OGT and the P/N ratio of amino acids in the proteome. Further, salt bridges
were reported to be a characteristic feature of mesophilic and psychrophilic protein folds [32], an observation
consistent with the insignificant presence of cysteine residues in thermophiles. Thus, simple amino acid
substitutions can shift the balance towards thermophilic adaptation. Klipcan et al. [22] suggested that the
thermophilic adaptation of proteins is a sequence-based rather than a structure-based phenomenon. The suggested
(E+K)/(Q+H) ratio [26], which can be used as an indicator for discriminating organisms according to their OGT, was
calculated. The average ratio for hyperthermophilic genomes, which is expected to be higher than 4.5 [26], was found
to be 4.7 and 5.1 when calculated without and with N. equitans, respectively. The reason for the higher ratio is the
greater abundance of purine tracts in hyperthermophiles compared with mesophiles, because glutamic acid and lysine
are encoded only by purely purine codons.

Another discriminating factor between mesophilic and hyperthermophilic genomes is the absolute difference between
the frequencies of charged and polar amino acid residues, the CvP-bias [23, 26]. Variations in the use of charged
and polar residues have been related to large differences in the surface accessibility of proteins [23, 26, 33],
and therefore the CvP-bias was analyzed further (Fig. 2). N. equitans exhibited a strong bias towards charged
residues (Asp, Glu, Lys, Arg) at the expense of polar residues (Asn, Gln, Ser, Thr).

The genomic and proteomic composition of N. equitans is thus biased, like that of other hyperthermophilic organisms,
causing an increase in charged residues on the molecular surface of proteins that allows more ion pairs to form and
thus enhances protein stability at temperature extremes [25].
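
The two proteome-level indicators discussed in this section, the (E+K)/(Q+H) ratio and the CvP-bias, can be sketched as below. The function names are hypothetical, and the CvP-bias is computed here as the signed difference "charged minus polar" in percentage points, matching the "Charged-Polar" category of Fig. 2; treating it as signed rather than absolute is an assumption.

```python
from collections import Counter

CHARGED = "DEKR"  # Asp, Glu, Lys, Arg
POLAR = "NQST"    # Asn, Gln, Ser, Thr


def residue_counts(proteins):
    """Pool residue counts over all protein sequences of a proteome."""
    return Counter("".join(proteins).upper())


def ek_qh_ratio(proteins):
    """(E+K)/(Q+H) ratio over the pooled proteome."""
    counts = residue_counts(proteins)
    return (counts["E"] + counts["K"]) / (counts["Q"] + counts["H"])


def cvp_bias(proteins):
    """Percentage of charged residues minus percentage of polar residues."""
    counts = residue_counts(proteins)
    total = sum(counts.values())
    charged = 100.0 * sum(counts[a] for a in CHARGED) / total
    polar = 100.0 * sum(counts[a] for a in POLAR) / total
    return charged - polar
```

A proteome would be supplied as a list of one-letter amino acid strings; a ratio above the 4.5 threshold cited from [26] would then point towards a hyperthermophile.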

DIPEPTIDE COMPOSITION

The trends observed for single amino acids are also followed at the dipeptide level (Table 7). For example, the
marked increase in tyrosine (Y) content is reflected in the fact that all 14 tyrosine-containing dipeptides that
exhibited a significant difference had an increased frequency, even when tyrosine occurred with an amino acid that
shows a significant decrease (YN and NY). Similarly, increases in arginine, valine and glutamic acid produced the
same effect. On the other hand, the significant decrease in glutamine led to a decrease in 14 dipeptides in
hyperthermophiles, even when glutamine occurred with amino acids that show an increase (K, E, I). Amino acids that
show increased occurrence in hyperthermophiles frequently occurred in tandem with lysine, which by itself did not
exhibit a significant bias between mesophiles and hyperthermophiles. Thus, certain dipeptides differ significantly
in frequency between mesophiles and hyperthermophiles, including N. equitans, and thereby influence the
thermostability of proteins. These trends support the hypothesis put forward by Klipcan et al. [22] that
thermophilicity has been achieved at the level of sequence without any significant change in protein conformation.
Conformational change in protein structures is undesirable, as it would greatly affect the nature of vital metabolic
reactions. The preferential occurrence of particular amino acid residues adjacent to each other, in contrast,
affects intramolecular interactions and is thus instrumental in adjusting proteins to the growth temperature of the
organism [31].
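
Dipeptide composition of the kind summarized in Table 7 can be tallied as sketched below. The function names are hypothetical, and two choices are assumptions rather than details given in the text: adjacent residue pairs are counted with an overlapping window, and pairs are never formed across the boundary between two proteins.

```python
from collections import Counter


def dipeptide_percentages(proteins):
    """Percentage of each observed dipeptide, pooled over a list of protein sequences."""
    counts = Counter()
    for protein in proteins:
        p = protein.upper()
        # Overlapping window within one protein only (assumption).
        counts.update(p[i:i + 2] for i in range(len(p) - 1))
    total = sum(counts.values())
    return {pair: 100.0 * n / total for pair, n in counts.items()}


def share_containing(percentages, residue):
    """Summed percentage of all dipeptides that contain the given residue."""
    return sum(value for pair, value in percentages.items() if residue in pair)
```

Comparing, say, `share_containing(pct, "Y")` between the mesophilic and hyperthermophilic proteomes gives a quick view of whether tyrosine-containing dipeptides follow the single-residue trend.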

CONCLUSION

The present study examined the contributions of nucleotide, amino acid and synonymous codon usage patterns in the
genomes of four GC-poor hyperthermophilic archaeal species. Nucleotide composition analysis indicated that the
influence of dinucleotide composition on protein thermostability is larger than that of mononucleotide composition.
Codon usage analysis pointed to a compositional constraint acting on the genome. Further, minor amino acid
substitutions are seemingly sufficient for thermo-adaptation, without drastic structural or conformational changes,
and thus also preserve the intrinsic nature of various metabolic reactions. Together, these minor adjustments in
genomic and proteomic content may be considered the means that have enabled hyperthermophiles to survive in extreme
environments.

Fig. (2). Plot of the sum of percentages of charged, polar amino acids and the difference between the two categories in mesophilic and hyperthermophilic genomes. Categories: Charged, Polar, Charged-Polar; vertical axis: Percentage; series: Mesophilic, Hyperthermophilic, Hyperthermophilic including N. equitans.

REFERENCES

[1] R. M. Atlas, and R. Bartha, “Microbial ecology-fundamentals and applications”, Pearson Education (Singapore) Pte. Ltd, pp. 305-311, 2005.
[2] D. W. Grogan, “Hyperthermophiles and the problem of DNA instability”, Mol. Microbiol., vol. 28, pp. 1043-1049, 1998.
[3] R. J. Klein, Z. Misulovin, and S. R. Eddy, “Noncoding RNA genes identified in AT-rich hyperthermophiles”, Proc. Natl. Acad. Sci. USA, vol. 99, pp. 7542-7547, 2002.
[4] N. Galtier, and J. R. Lobry, “Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperatures in prokaryotes”, J. Mol. Evol., vol. 44, pp. 632-636, 1997.
[5] S. Das, S. Paul, S. K. Bag, and C. Dutta, “Analysis of Nanoarchaeum equitans genome and proteome composition: indications for hyperthermophilic and parasitic adaptations”, BMC Genomics, vol. 7, pp. 186, 2006.
[6] D. P. Kreil, and C. A. Ouzounis, “Identification of thermophilic species by the amino acid compositions deduced from their genomes”, Nucleic Acids Res, vol. 29, pp. 1608-1615, 2001.
[7] R. Schwartz, C. S. Ting, and J. King, “Whole proteome pI values correlate with subcellular localizations of proteins for organisms within the three domains of life”, Genome Res, vol. 11, pp. 703-709, 2001.
[8] J. R. Lobry, and D. Chessel, “Internal correspondence analysis of codon and amino-acid usage in thermophilic bacteria”, J. Appl. Genet., vol. 44, pp. 235-261, 2003.
[9] R. Friedman, J. W. Drake, and A. L. Hughes, “Genome-wide patterns of nucleotide substitution reveal stringent functional constraints on the protein sequences of thermophiles”, Genetics, vol. 167, pp. 1507-1512, 2004.
[10] K. U. Foerstner, C. von Mering, S. D. Hooper, and P. Bork, “Environments shape the nucleotide composition of genomes”, EMBO Rep., vol. 6, pp. 1208-1213, 2005.
[11] J. Garnier, J. F. Gibrat, and B. Robson, “GOR method for predicting protein secondary structure from amino acid sequence”, Methods Enzymol., vol. 266, pp. 540-553, 1996.
[12] T. Kawashima, N. Amano, H. Koike, et al, “Archaeal adaptation to higher temperatures revealed by genomic sequence of Thermo-

Table 7. Dipeptides that Exhibit Significant Differences Between Mesophilic and Thermophilic Genomes

M A C D E F G H I K L N P Q R S T V W Y

M * + * - * - * +

A * -

C

D ** - * - * +

E * + ** + * + * - ** + * + * +

F * - * - ** -

G * + * + *** + * + *** +

H ** - ** -

I ** + ** - ** + ** +

K ** + * + *** + *** + * +

L * +

N *** - * - * - ** - * -

P * + ** + * + *** +

Q * - ** - ** - * - ** - ** - ** - ** -

R ** + *** + ** + *** + *** +

S ** - * + ** +

T

V * + * + **** + ** + ** +

W ** + * + ** +

Y ** + *** + *** + *** + * + * +

+ indicates an increase in the content of a particular dipeptide in the direction mesophilic to hyperthermophilic genomes; - indicates a decrease in the content of a particular dipeptide in the direction mesophilic to hyperthermophilic genomes. Significance based on a t-test is shown: * (p<0.05); ** (p<0.01); *** (p<0.001) and **** (p<0.0001).