1st semester

Mini-review of the Bacteroides thetaiotaomicron genome:
protein length distribution and codon frequencies

Abstract

Bacteroides thetaiotaomicron is a prominent member of the intestinal microbiota of humans and other mammals, notable for its role in the digestion of complex polysaccharides. Here we analyze genome features of B. thetaiotaomicron VPI-5482: the coding sequence composition of its replicons, proteome characteristics, and codon frequencies. We found a significant difference in codon bias between identified and hypothetical proteins and suggest that the hypothetical protein group is heterogeneous.

1. Introduction

Taxonomy rank	Name
Domain	Bacteria
Kingdom	Pseudomonadati
Phylum	Bacteroidota
Class	Bacteroidia
Order	Bacteroidales
Family	Bacteroidaceae
Species	Bacteroides thetaiotaomicron

Table 1. Taxonomy of Bacteroides thetaiotaomicron [1].

Bacteroides thetaiotaomicron is a typical representative of the genus Bacteroides. It is known for its role in the intestinal microbiota of humans and other mammals [2].

Bacteroides thetaiotaomicron is notable for its role in the digestion of various complex polysaccharides. This species is one of the main gut symbionts, which makes it an important subject for research on symbiotic relationships, gastrointestinal health, and the development of the immune system [3]. Its ability to utilize various polysaccharides is ensured, in part, by gene clusters termed polysaccharide utilisation loci (PULs) [4].

In this mini-review, we analyze the Bacteroides thetaiotaomicron genome. We investigated the coding sequence composition of its replicons and several features of the coding regions, such as protein length and codon frequencies.

2. Materials and methods

We used genome feature tables and coding sequence files obtained from the NCBI database (1). In the first part of the analysis, we used Google Sheets (3) to examine the distribution of coding sequences across the bacterial replicons. With data from the file GCF_000011065.1_ASM1106v1_cds_from_genomic.fna (2), we generated a histogram of protein lengths. To analyze codon frequencies, we processed the same file (2) with Python (4). We estimated the difference in synonymous codon frequencies between identified and hypothetical proteins with a two-proportion z-test:

$$ z = \frac{p_1 - p_2}{\sqrt{p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} $$
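As a minimal sketch of the statistic above, the following Python function computes z from two proportions and two sample sizes, taking p as the pooled proportion (the usual choice for this test). The function names are hypothetical, and using the group sizes (4117 identified and 577 hypothetical sequences) as n₁ and n₂ is an assumption; with them, the GAC percentages reported in this review (40.928% and 34.313%) give a z value of about 3.04.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """z statistic for H0: p1 == p2, with p the pooled proportion."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)  # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def two_sided_p_value(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Assumed inputs: GAC share among Asp codons, group sizes as n1 and n2.
z_gac = two_proportion_z(0.40928, 4117, 0.34313, 577)
p_gac = two_sided_p_value(z_gac)
```

The p-value helper is included only to show how a threshold on z maps to a significance level; it is not described in the methods.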

3. Results

3.1. Coding sequences of replicons of Bacteroides thetaiotaomicron

We examined the number of protein-coding genes and RNA genes in both replicons of the genome (1). The results are summarized in Table 2. The genome of Bacteroides thetaiotaomicron comprises two replicons: one circular chromosome with a length of 6.26 Mb and one circular plasmid with a length of 33 kb. Most genes on the chromosome are protein-coding (4,649 in total); it additionally contains 70 tRNA genes, 15 ribosomal RNA genes, and 1 tmRNA gene. The second replicon, the plasmid p5482, contains 38 protein-coding genes, whose primary function is environmental sensing [5].

Replicon	Protein coding genes	tRNA	rRNA	tmRNA
Chromosome	4,649	70	15	1
p5482 plasmid	38	0	0	0

Table 2. Numbers of protein-coding and RNA genes in the replicons of Bacteroides thetaiotaomicron.
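The per-replicon counts were obtained in Google Sheets; the same tally could be sketched in Python as below. The column names `# feature` and `seq_type` are assumed to match the NCBI feature table layout, the function name is hypothetical, and the sample rows are invented for illustration rather than taken from the real file.

```python
import csv
import io
from collections import Counter

# Made-up rows with only the columns used here (assumed column names).
SAMPLE = "\n".join([
    "# feature\tclass\tseq_type",
    "gene\tprotein_coding\tchromosome",
    "CDS\twith_protein\tchromosome",
    "tRNA\t\tchromosome",
    "gene\tprotein_coding\tplasmid",
    "CDS\twith_protein\tplasmid",
])


def count_features(handle):
    """Count features per (replicon type, feature type) pair."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        feature = row["# feature"]
        if feature == "gene":
            # Skip gene rows so each CDS or RNA is counted only once.
            continue
        counts[(row["seq_type"], feature)] += 1
    return counts


counts = count_features(io.StringIO(SAMPLE))
```

For the real table, `io.StringIO(SAMPLE)` would be replaced by an open file handle.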

3.2. Lengths of proteins encoded in the genome of the bacterium Bacteroides thetaiotaomicron

Analysis of the protein length distribution in the Bacteroides thetaiotaomicron genome (Figure 1) reveals a characteristic right-skewed profile: short polypeptides are abundant, and frequency declines rapidly with increasing length. In total, 4,801 proteins were analyzed. The distribution is unimodal, with a pronounced peak in the shorter length bins, where the count reaches a maximum of 440 proteins per bin. The most populated bins lie around the median length of 390 amino acids, which agrees with our data in Google Sheets (2). The distribution has a long tail: many bins at greater lengths still contain proteins, albeit at low frequencies (e.g., fewer than 50 proteins per bin beyond the mid-range), confirming the presence of a limited set of very long proteins. This overall pattern, dominated by short proteins, is consistent with genomic characteristics commonly observed in prokaryotic organisms [6].

**Figure 1.** Protein length distribution histogram.
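A histogram of this kind can be derived directly from a CDS FASTA file. The sketch below is one possible way to do it, not the procedure used for Figure 1: it assumes each record is a complete coding sequence ending in a stop codon, so the protein length is the number of codons minus one, and the bin width is a free parameter. All function names are hypothetical.

```python
import io
import statistics
from collections import Counter


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def protein_length(cds):
    """Length in amino acids, assuming the last codon is a stop codon."""
    return len(cds) // 3 - 1


def length_histogram(lengths, bin_width=50):
    """Map the lower edge of each bin to the number of proteins in it."""
    return Counter((n // bin_width) * bin_width for n in lengths)


# Tiny invented example: two records, the second split over two lines.
SAMPLE = ">a\nATGAAATAA\n>b\nATGAAAAAA\nAAATAA\n"
lengths = [protein_length(seq) for _, seq in read_fasta(io.StringIO(SAMPLE))]
hist = length_histogram(lengths, bin_width=3)
median_length = statistics.median(lengths)
```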

3.3. Frequencies of codons in the B. thetaiotaomicron genome

We examined codon frequencies in the bacterial coding sequences. We hypothesized that codon frequencies would differ between groups of CDSs. Therefore, we divided the sequences into three groups according to their description in the CDS table (2): identified proteins, which have defined names in the CDS table (4117 sequences); hypothetical proteins, which are annotated without further details (577 sequences); and pseudogenes (107 sequences).

This separation is motivated by the selective pressure on codon usage in protein-coding sequences, as codon frequencies constitute one of the mechanisms regulating translation [7]. In contrast, codons in pseudogene sequences are no longer constrained by selection because these sequences are no longer expressed [8].

Hypothetical proteins were also separated from the set of identified proteins, as no reference sequences are available for them. Consequently, it is not possible to reliably determine whether they contain insertions such as mobile genetic elements or self-splicing introns, which are present in numerous bacterial genomes [9].
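The three-way grouping can be expressed as a small rule on each FASTA header. The sketch below assumes headers in the style of NCBI `cds_from_genomic.fna` files, with bracketed tags such as `[protein=...]` and `[pseudo=true]`; the exact rule used in this work is not spelled out, so the function name, the tag handling, and the sample headers are illustrative assumptions.

```python
import re

# Assumed header tag carrying the product name.
PROTEIN_TAG = re.compile(r"\[protein=([^\]]*)\]")


def classify_cds(header):
    """Assign a CDS header to one of the three groups."""
    if "[pseudo=true]" in header:
        return "pseudogene"
    match = PROTEIN_TAG.search(header)
    name = match.group(1).strip().lower() if match else ""
    # A missing name is treated as hypothetical (an assumption).
    if name in ("", "hypothetical protein"):
        return "hypothetical"
    return "identified"


# Invented headers for illustration only.
examples = [
    "lcl|X_cds_1 [locus_tag=XX_0001] [protein=DNA polymerase III subunit beta]",
    "lcl|X_cds_2 [locus_tag=XX_0002] [protein=hypothetical protein]",
    "lcl|X_cds_3 [locus_tag=XX_0003] [protein=transposase] [pseudo=true]",
]
groups = [classify_cds(h) for h in examples]
```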

3.3.1. Start codons in three groups of CDS

We identified the first codon (the first three nucleotides) of the identified protein, hypothetical protein, and pseudogene sequences. According to Figure 2, the dominant start codon is the canonical methionine codon ATG. Both identified and hypothetical sequences also contain alternative start codons: CTG and TTG, which encode leucine; GTG, which encodes valine; and ATT, ATC, and ATA, which encode isoleucine.

In pseudogenes, however, we detected an additional 21 codons (5 of which also appear in genomic sequences) that are not known to function as start codons.

These ‘novel start codons’ in pseudogenes are most likely the result of accumulated neutral mutations [8]. The occurrence of various amino-acid-encoding codons at putative start positions in genomic proteins suggests that some of these sequences may also be pseudogenes, as pseudogenes are often misannotated as hypothetical proteins during initial genome annotation [10].

**Figure 2.** Start codon frequencies in the three groups of CDS, log scale.

3.3.2. Stop codons in three groups of CDS

We identified the terminal three nucleotides (stop codons) of the identified protein, hypothetical protein, and pseudogene sequences. In both identified and hypothetical sequences, only the canonical stop codons TAA, TGA, and TAG were observed, with TAA occurring approximately five times more frequently than TGA or TAG (Figure 3).

Pseudogenes also predominantly use these three stop codons, but we additionally detected 23 codons that encode amino acids. This likely reflects the mutational decay of pseudogenes, which can obscure their true stop codons [8].

**Figure 3.** Stop codon frequencies in the three groups of CDS, log scale.
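The first-codon and last-codon tallies described in sections 3.3.1 and 3.3.2 amount to counting the first and last three nucleotides of each sequence within each group. A minimal sketch follows; the function names and the example records are hypothetical, and the group labels are assumed to have been assigned beforehand.

```python
from collections import Counter, defaultdict

CANONICAL_STOPS = {"TAA", "TAG", "TGA"}


def terminal_codons(records):
    """Tally first and last codons per group from (group, sequence) pairs."""
    starts, stops = defaultdict(Counter), defaultdict(Counter)
    for group, seq in records:
        seq = seq.upper()
        if len(seq) < 6:
            continue  # too short to hold separate start and stop codons
        starts[group][seq[:3]] += 1
        stops[group][seq[-3:]] += 1
    return starts, stops


def noncanonical_stops(stops):
    """Keep only terminal codons that are not TAA, TAG or TGA."""
    return {
        group: {c: n for c, n in counter.items() if c not in CANONICAL_STOPS}
        for group, counter in stops.items()
    }


# Invented records for illustration.
records = [("identified", "ATGAAATAA"), ("pseudogene", "CTGAAAGCA")]
starts, stops = terminal_codons(records)
odd_stops = noncanonical_stops(stops)
```

Plotting the resulting counts on a logarithmic axis, as in Figures 2 and 3, keeps rare codons visible next to the dominant ones.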

3.3.3. Amino acid codons in three groups of CDS

We compared the frequencies of synonymous codons between the identified and hypothetical protein sets. This analysis revealed significant differences in codon usage for 10 amino acid codons, z-criterion > 2.59 (Table 3; full dataset in (5)). The data further show that hypothetical proteins exhibit a weaker codon bias than identified proteins, with codon usage appearing more stochastic.

Based on these observations, we propose that hypothetical proteins do not represent a single category of uncharacterized proteins but rather a heterogeneous set that likely includes horizontally acquired genes, pseudogenes, and genuinely functional proteins. These subgroups, and how they can be distinguished, are discussed in detail in the following section.

AA	Codon	Identified (%)	Hypothetical (%)	Z-criterion
A	GCC	27.645	22.715	2.497
D	GAC	40.928	34.313	3.0362
D	GAT	59.072	65.687	-3.0362
F	TTC	51.085	42.28	3.9616
F	TTT	48.915	57.72	-3.9616
G	GGG	9.936	12.739	-2.0762
I	ATC	39.171	28.64	4.8839
I	ATA	22.884	32.536	-5.079
L	CTG	32.142	23.993	3.9588
L	CTA	4.93	7.071	-2.1706

Table 3. Amino acid codons with statistically significant differences in frequency between identified and hypothetical proteins.
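The percentages in the table are shares of each codon among the synonymous codons of one amino acid. A sketch of how such shares can be computed is given below. It uses the standard amino acid assignments (shared by the bacterial genetic code) and, as an assumption of this example, skips the first and last codon of every sequence so that start and stop codons do not enter the counts; the function name is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_percentages(sequences):
    """Return {amino acid: {codon: percent among its synonymous codons}}."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Internal codons only: skip the start and the terminal codon.
        for i in range(3, len(seq) - 3, 3):
            codon = seq[i:i + 3]
            aa = CODON_TABLE.get(codon)
            if aa and aa != "*":
                counts[codon] += 1
    totals = Counter()
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n
    result = defaultdict(dict)
    for codon, n in counts.items():
        aa = CODON_TABLE[codon]
        result[aa][codon] = 100.0 * n / totals[aa]
    return dict(result)


# Invented sequence: ATG TTT TTC TTT TAA
usage = synonymous_percentages(["ATGTTTTTCTTTTAA"])
```

Running the function once per group gives the two percentage columns that enter the z-test.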

4. Discussion

Our analysis generated a set of bioinformatic characteristics for one of the sequenced B. thetaiotaomicron genomes, which can be applied to several practical tasks. In particular, frequencies of start, stop, and synonymous codons can help prioritize hypothetical proteins for further study. Hypothetical proteins that use canonical start and stop codons and display codon preferences similar to identified proteins are more likely to be functional.

For such candidates, calculating the Codon Adaptation Index (CAI) [12] may further indicate how closely they resemble known functional proteins. Conversely, sequences that diverge from these patterns are likely pseudogenes, potentially recently formed if they still retain features that led to their initial annotation as proteins [10].
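CAI was not computed in this work, so the following is only a sketch of how it could be done. It follows the general definition in [12]: each codon receives a weight equal to its count divided by the count of the most frequent synonymous codon in a reference set, and the CAI of a sequence is the geometric mean of the weights of its codons. Replacing zero counts by 0.5 is a common convention assumed here, as is the exclusion of Met, Trp and stop codons; the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(seq):
    """Split a sequence into complete, in-frame codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def relative_adaptiveness(reference_seqs):
    """Weight of each codon relative to the top synonymous codon."""
    counts = Counter(c for s in reference_seqs for c in split_codons(s))
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa not in "*MW":  # skip stops and single-codon amino acids
            families[aa].append(codon)
    weights = {}
    for family in families.values():
        # Assumed convention: unobserved codons get a pseudo-count of 0.5.
        adjusted = {c: counts[c] or 0.5 for c in family}
        top = max(adjusted.values())
        for c in family:
            weights[c] = adjusted[c] / top
    return weights


def cai(seq, weights):
    """Geometric mean of the weights of the scored codons in seq."""
    logs = [math.log(weights[c]) for c in split_codons(seq) if c in weights]
    if not logs:
        raise ValueError("sequence contains no codons that can be scored")
    return math.exp(sum(logs) / len(logs))


# Invented reference set in which TTT is twice as frequent as TTC.
weights = relative_adaptiveness(["TTTTTTTTC"])
score = cai("TTTTTC", weights)
```

In practice the reference set would be the identified proteins, or a subset of highly expressed genes, and each hypothetical protein would be scored against it.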

Distinct codon biases may also indicate horizontal gene transfer, which can be assessed by comparing codon usage and GC content with other predominant species of the mammalian gut microbiome [13].
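GC content is straightforward to compute per gene and can be compared with the genome-wide value or with values from other gut species. The sketch below also reports GC at third codon positions, which is often examined alongside codon usage; the function names are hypothetical and the approach is an illustration, not part of the analysis reported here.

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / total


def gc3_content(cds):
    """GC fraction at third codon positions of an in-frame CDS."""
    return gc_content(cds.upper()[2::3])


# Invented sequences for illustration.
overall = gc_content("ATGC")
third = gc3_content("ATGAAC")
```

A gene whose values sit far from the bulk of the genome would be a candidate for closer inspection, not proof of horizontal transfer.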

5. References

1. NCBI Taxonomy Browser. Bacteroides thetaiotaomicron. https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=info&id=818
2. Wexler HM. Bacteroides: the good, the bad, and the nitty-gritty. Clin Microbiol Rev. 2007 Oct;20(4):593–621. doi: 10.1128/CMR.00008-07.
3. Wexler HM. Bacteroides: the good, the bad, and the nitty-gritty. Clin Microbiol Rev. 2007;20(4):593–621. https://doi.org/10.1128/cmr.00008-07
4. Ndeh DA, Nakjang S, Kwiatkowski KJ, et al. A Bacteroides thetaiotaomicron genetic locus encodes activities consistent with mucin O-glycoprotein processing and N-acetylgalactosamine metabolism. Nat Commun. 2025;16:3485. https://doi.org/10.1038/s41467-025-58660-2
5. Xu J, Bjursell MK, Himrod J, Deng S, Carmichael LK, Chiang HC, Hooper LV, Gordon JI. A genomic view of the human-Bacteroides thetaiotaomicron symbiosis. Science. 2003 Mar 28;299(5615):2074-6. doi: 10.1126/science.1080029. PMID: 12663928.
6. Tiessen A, Pérez-Rodríguez P, Delaye-Arredondo LJ. Mathematical modeling and comparison of protein size distribution in different plant, animal, fungal and microbial species reveals a negative correlation between protein size and protein number, thus providing insight into the evolution of proteomes. BMC Res Notes. 2012 Feb 1;5:85. doi: 10.1186/1756-0500-5-85. PMID: 22296664; PMCID: PMC3296660.
7. Moutinho AF, Eyre-Walker A. No evidence that selection on synonymous codon usage affects patterns of protein evolution in bacteria. Genome Biol Evol. 2024 Feb 1;16(2):evad232. doi: 10.1093/gbe/evad232. PMID: 38149940; PMCID: PMC10849182.
8. Anand A, Olson CA, Yang L, et al. Pseudogene repair driven by selection pressure applied in experimental evolution. Nat Microbiol. 2019;4:386–389. https://doi.org/10.1038/s41564-018-0340-2
9. Martínez-Abarca F, Toro N. Group II introns in the bacterial world. Mol Microbiol. 2000 Dec;38(5):917-26. doi: 10.1046/j.1365-2958.2000.02197.x. PMID: 11123668.
10. Lerat E, Ochman H. Recognizing the pseudogenes in bacterial genomes. Nucleic Acids Res. 2005 Jun 2;33(10):3125-32. doi: 10.1093/nar/gki631. PMID: 15933207; PMCID: PMC1142405.
11. Plotkin J, Kudla G. Synonymous but not the same: the causes and consequences of codon bias. Nat Rev Genet. 2011;12:32–42. https://doi.org/10.1038/nrg2899
12. Sharp PM, Li WH. The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987 Feb 11;15(3):1281-95. doi: 10.1093/nar/15.3.1281. PMID: 3547335; PMCID: PMC340524.
13. Coyne MJ, Zitomersky NL, McGuire AM, Earl AM, Comstock LE. Evidence of extensive DNA transfer between Bacteroidales species within the human gut. mBio. 2014;5:10.1128/mbio.01305-14. https://doi.org/10.1128/mbio.01305-14

Supplementary

(1) Source of GCF_000011065.1_ASM1106v1_cds_from_genomic.fna and GCF_000011065.1_ASM1106v1_feature_table.txt: https://matrix.bio.anl.gov/pub/CSGID/refseq_genomes/226186/
(2) Google Sheets, CDS and protein histogram: https://docs.google.com/spreadsheets/d/1XbahvMNXKQrrfqRPgOilaAQ4DVCzk24BECx4-EA_q-U/edit?gid=337291676#gid=337291676
(3) Google Sheets, sorted feature table: https://docs.google.com/spreadsheets/d/1KIt_8OwgEM9qDl_4EO755V7a-_yZoyjifE0wmc7ysAo/edit?gid=20401416#gid=20401416
(4) Google Colab: https://colab.research.google.com/drive/1Gh0VCjafwtMCFGdQQOBF6O6HIbE53yGx#scrollTo=AtYA88VWUdIG
(5) Amino acid codon frequencies, full table: https://docs.google.com/spreadsheets/d/1QNqwS3keoe2XPftwZywPGcV52q6g_TAOi7uRfERHcVg/edit?usp=sharing
