Mini-Review

Genome and proteome analysis of Campylobacter coli

Uliana I. Demeneva ^1*

^1*Faculty of Bioengineering and Bioinformatics, Lomonosov State University, Moscow, Russia

Contact: demenevauliana@fbb.msu.ru
Summary: This mini-review presents an analysis of the genome and proteome of the microaerophilic bacterium Campylobacter coli, carried out using spreadsheet functions, bioinformatics packages (EMBOSS, UGENE) and the Python programming language. The results are consistent with known patterns of bacterial genome organization and can be explained biologically.

1. INTRODUCTION

Campylobacter coli (C. coli) is a gram-negative, S-shaped bacterium of the genus Campylobacter (Prescott LM, 2005). The species is microaerophilic: after prolonged exposure to air, the cells acquire a coccoid (spherical) shape. The organism is therefore highly sensitive to changes in its environment. Aerobic conditions, fluctuating temperature and osmotic pressure, as well as starvation, put the bacterium under stress. Nonetheless, within the genus Campylobacter, C. coli is more aerotolerant than C. jejuni, the species most closely related to the one under study (Karki et al., 2019). The optimal temperature for its normal activity is reported to be 42 °C (Allos, B. M., 2001).

C. coli causes campylobacteriosis in humans, the most commonly reported diarrheal foodborne illness (EFSA Journal, 2018). Members of the genus are thought to be transmitted to humans primarily through ingestion of contaminated fresh foods such as meat and milk. Several virulence factors determine the ability of C. coli to cause disease, including adhesion, invasion and motility. Campylobacter also secretes cytolethal distending toxin (CDT), an AB toxin with DNase activity that causes DNA double-strand breaks during the G2 phase of the cell cycle. This eventually leads to apoptosis of the affected cells (Prescott LM, 2005).

**Table 1.** Taxonomic position of *Campylobacter coli* (Doyle, 1948) (Veron and Chatelain, 1973).

**Fig. 1.** *Campylobacter jejuni*, related species (De Wood, Pooley, USDA, ARS, EMU).

2. MATERIALS AND METHODS

Data on the genome and proteome of the studied bacterium were taken from the National Center for Biotechnology Information (NCBI) database. Most of the analysis was carried out using spreadsheet functions, the Python programming language and the EMBOSS bioinformatics package installed on kodomo. All results were additionally verified with third-party programs, including UGENE, in order to improve their quality and reliability. The additional studies rely on the same set of tools.

2.1 Genome analysis

2.1.1 Replication origin, terminus prediction and GC composition

Basic data on the genome of the bacterium (nucleotide and amino acid sequences, feature table) were taken from the NCBI database (Supplementary materials 5, 6). The online version of the GenSkew program (Jennifer Lu, 2022) was then used to plot the cumulative GC skew and to determine its minimum and maximum. GenSkew calculates the incremental and the cumulative skew of two selectable nucleotides for a given sequence according to the formula:

GC skew = (G - C) / (G + C)

where G and C are the numbers of guanine and cytosine residues in a window. The GC content of the chromosome and plasmid was analysed using the geecee command of the EMBOSS bioinformatics package.
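The skew calculation can be sketched in a few lines of Python. This is a minimal illustration, not the GenSkew or EMBOSS implementation: the window size, the function names and the toy sequence are assumptions made for the example.

```python
def gc_skew_windows(seq, window=1000):
    """GC skew (G - C) / (G + C) for consecutive non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # A window without G or C contributes a skew of zero.
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def cumulative(values):
    """Running sum of the per-window skew values."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out


def skew_extrema(seq, window=1000):
    """0-based start positions of the windows with minimum and maximum cumulative skew."""
    cum = cumulative(gc_skew_windows(seq, window))
    i_min = min(range(len(cum)), key=cum.__getitem__)
    i_max = max(range(len(cum)), key=cum.__getitem__)
    return i_min * window, i_max * window


def gc_content(seq):
    """Fraction of G and C among all residues of the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Toy sequence (hypothetical): a G-rich half followed by a C-rich half.
toy = "GGGC" * 5 + "CCCG" * 5
min_pos, max_pos = skew_extrema(toy, window=4)
```

For a real genome the sequence would be read from the FASTA file, and the minimum and maximum positions would serve as candidates for the origin and terminus of replication.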

2.1.2 Open reading frames and restriction sites

As an additional study, we analysed open reading frames and restriction sites using the Unipro UGENE program. All figures for this analysis were also produced in UGENE.

2.1.3 Codon and nucleotide usage analysis

The sequence data were obtained from the NCBI (Supplementary materials 5, 6). Using the feature table of the bacterium (Supplementary material 1), we counted the number of genes (row by row, using a filter). To analyse the frequency of codons encoding amino acids, we used a Python script (Supplementary material 4, paragraphs 1, 2). Codons were intentionally sorted alphabetically to make the resulting data easier to work with. The frequencies of amino acids and individual nucleotides were also calculated with a Python script (Supplementary material 4, paragraph 3).
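A codon and nucleotide count of this kind might look as follows. This is a hedged sketch and not the script from Supplementary material 4: the function names are hypothetical, and the input is assumed to be a coding sequence that starts in frame.

```python
from collections import Counter


def count_codons(cds):
    """Count codons in a coding sequence, ignoring an incomplete trailing codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def frequencies(counts):
    """Relative frequencies, with keys sorted alphabetically."""
    total = sum(counts.values())
    return {key: counts[key] / total for key in sorted(counts)}


def count_nucleotides(seq):
    """Count individual nucleotides in a sequence."""
    return Counter(seq.upper())


# Toy coding sequence (hypothetical): start codon, two sense codons, stop codon.
codon_counts = count_codons("ATGAAAGCTTAA")
codon_freqs = frequencies(codon_counts)
nucleotide_counts = count_nucleotides("ATGAAAGCTTAA")
```

Sorting the keys alphabetically mirrors the ordering described in the text and makes tables for the chromosome and plasmid directly comparable.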

2.2 Proteome analysis

2.2.1 Protein length distribution estimation

To analyse the proteome, the feature table of Campylobacter coli was imported into Google Sheets. For convenience, a column of bins (“pockets”) with a width of 30 amino acids was added to the table (Supplementary material 2). The number of proteins whose length falls into each bin was then counted (Fig. 7) using the standard spreadsheet function COUNTIFS.
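The same binning can be reproduced outside a spreadsheet. The sketch below is an assumption-laden illustration: the bins are taken to be half-open intervals [lower, lower + 30), which may differ from the boundaries used in Supplementary material 2, and the protein lengths are invented for the example.

```python
def bin_lengths(lengths, width=30):
    """Count how many lengths fall into each bin [lower, lower + width)."""
    counts = {}
    for n in lengths:
        lower = (n // width) * width  # lower edge of the bin containing n
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical protein lengths in amino acids.
toy_lengths = [30, 45, 59, 60, 1750]
histogram = bin_lengths(toy_lengths)
```

Each key of the resulting dictionary is the lower edge of a bin, so the output can be plotted directly as a histogram.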

2.2.2 Protein distribution in + and - DNA strands and in leading

As one of the additional studies, we determined how many proteins are encoded on the + and - DNA strands. As in the previous analysis, built-in spreadsheet functions were used: =COUNTIF(CDS!E:E; "+") and =COUNTIF(CDS!E:E; "-"), where CDS!E:E is a reference to the “strand” column of the feature table of the bacterium (Supplementary material 1).
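For comparison, the strand count can also be done in Python directly on a tab-separated feature table. This is only a sketch: the column names "# feature" and "strand" are assumptions about the table layout, and the three-column toy table is invented for the example.

```python
import csv
import io
from collections import Counter


def count_cds_by_strand(handle, feature_col="# feature", strand_col="strand"):
    """Count CDS rows per strand in a tab-separated feature table."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        if row[feature_col] == "CDS":  # skip RNA genes and other features
            counts[row[strand_col]] += 1
    return counts


# Hypothetical miniature feature table.
toy_table = io.StringIO(
    "# feature\tstrand\tname\n"
    "CDS\t+\tchromosomal replication initiator protein DnaA\n"
    "CDS\t-\thypothetical protein\n"
    "tRNA\t+\ttRNA-Ala\n"
    "CDS\t+\t50S ribosomal protein L2\n"
)
strand_counts = count_cds_by_strand(toy_table)
```

With a real file, the `io.StringIO` object would be replaced by an open file handle.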

2.2.3 Determination of the number of ribosomal proteins

To find ribosomal proteins, filters were applied to two columns, “#feature” and “name”, selecting the values “CDS” and “ribosomal”, respectively (Supplementary material 1).

2.2.4 Hypothetical proteins. Their number and contribution to the proteome

The number of hypothetical proteins was counted using the built-in Excel function: =IFERROR(VLOOKUP(“hypothetical”,B1,1)+). A table of features of the bacterium was also used (Supplementary material 1).

2.2.5 Number of transport proteins and their contribution to the proteome

The number of transport proteins was calculated using the VLOOKUP function mentioned in the previous section; the result was rechecked with built-in filters applied to the feature table of the bacterium (Supplementary material 1).
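The keyword-based counts used for ribosomal, hypothetical and transport proteins can be cross-checked with a short Python sketch. It is an illustration under stated assumptions, not the spreadsheet procedure itself: matching is a case-insensitive substring search on product names, and the list of names is invented.

```python
def count_by_keyword(names, keyword):
    """Number of product names containing the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return sum(1 for name in names if keyword in name.lower())


def share(part, total):
    """Percentage of `part` in `total`; zero for an empty total."""
    return 100.0 * part / total if total else 0.0


# Hypothetical product names from the "name" column of a feature table.
toy_names = [
    "30S ribosomal protein S1",
    "hypothetical protein",
    "ABC transporter permease",
    "50S ribosomal protein L2",
    "Hypothetical protein",
    "MFS transporter",
]
n_ribosomal = count_by_keyword(toy_names, "ribosomal")
n_hypothetical = count_by_keyword(toy_names, "hypothetical")
n_transport = count_by_keyword(toy_names, "transport")
```

A substring search for "transport" also matches "transporter", so the choice of keyword affects the count and is worth rechecking by hand, as was done here with filters.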

2.3 RNA gene statistics. Distribution of RNA by types

Using filters (“RNA”) on the feature table of the bacterium (Supplementary material 1), we counted the number of RNA genes in the genome.

3. RESULTS AND DISCUSSION

3.1 Genome analysis

3.1.1 Replication origin, terminus prediction and GC composition

The chromosome is 1678432 bp long and the plasmid is 38104 bp long; these lengths were used in the subsequent analyses. For the chromosome, the maximum of the cumulative GC skew corresponds to the ter region (607464 bp), where replication terminates, and the minimum corresponds to oriC (1425996 bp), where it begins.

**Fig. 2.1.** Cumulative GC skew plot, created using the GenSkew program (for the chromosome) (Jennifer Lu, 2022).

**Fig. 2.2.** Cumulative GC skew plot, created using the GenSkew program (for the plasmid) (Jennifer Lu, 2022).

Using the EMBOSS bioinformatics package, it was found that guanine (G) and cytosine (C) together make up 31.41% of all nucleotide residues of the chromosome. The GC content of the plasmid, in turn, is 28% (Table 2.1 and Table 2.2).

**Table 2.1.** Main points of genomic analysis (for chromosome).

**Table 2.2.** Main points of genomic analysis (for plasmid).

3.1.2 Open reading frames and restriction sites

Figs. 3 and 4 illustrate the approximate locations of open reading frames and restriction sites, respectively. Table 3.1 additionally lists the number of relevant restriction sites.

**Fig. 3.** Approximate location of open reading frames (chromosome). Made with Unipro UGENE (Supplementary material 6).

**Fig. 4.** Approximate location of restriction sites (chromosome). Made with Unipro UGENE (Supplementary material 6).

**Table 3.1.** Restriction site map (chromosome). Made with Unipro UGENE (Supplementary material 6).

A restriction site is a DNA sequence of approximately 6-8 base pairs (bp) that is recognized by a given restriction enzyme. Bacteria produce restriction enzymes to destroy foreign DNA that enters the cell and could otherwise transform it. Restriction sites are also important for inserting target genes into constructs such as plasmids. A large number of the following restriction sites were found in the genome: HindIII (2569 sites; the enzyme cleaves the palindromic sequence AAGCTT by hydrolysis) and DraI (7053 sites; the endonuclease hydrolyses DNA at TTT^AAA) (Erik K.R. Hanko, 2019).
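Counting recognition sites is a simple pattern search, sketched below in Python. This is not the UGENE algorithm: the sketch treats the sequence as linear (sites spanning the junction of a circular molecule are missed), and the toy sequence is invented. Because both AAGCTT and TTTAAA are palindromic, scanning one strand is sufficient.

```python
import re

# Recognition sequences of the two enzymes discussed in the text.
SITES = {"HindIII": "AAGCTT", "DraI": "TTTAAA"}


def count_sites(seq, site):
    """Count occurrences of a recognition site, including overlapping ones."""
    # A zero-width lookahead matches at every position where the site starts.
    return len(re.findall(f"(?={re.escape(site)})", seq.upper()))


def site_positions(seq, site):
    """0-based start positions of all occurrences of the site."""
    return [m.start() for m in re.finditer(f"(?={re.escape(site)})", seq.upper())]


# Toy sequence (hypothetical) containing two HindIII sites and one DraI site.
toy_seq = "AAGCTTTAAAGCTT"
site_counts = {enzyme: count_sites(toy_seq, site) for enzyme, site in SITES.items()}
```

The positions returned by `site_positions` could be used to draw a map similar to the restriction site figures.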

These results are to be expected: the recognition sites of both enzymes consist largely of thymine (T) and adenine (A), and the genome itself is AT-rich. A 2007 study showed that Campylobacter phage DNA is practically not cleaved by enzymes whose recognition sites contain cytosine (C) and guanine (G), owing to an as yet unknown DNA modification (Hansen, V.M.; Rosenquist, H., 2007). In contrast, restriction endonucleases that recognize pure A/T sequences (e.g., DraI) can be used to cut phage DNA and compare restriction patterns on standard agarose gels, allowing cost-effective and time-saving analysis.

Open reading frames (ORFs) are defined as spans of DNA sequence between a start and a stop codon. The resulting images illustrate the data in the feature tables of the genome. The same analysis was performed for the plasmid:

**Table 3.2.** Restriction site map (plasmid). Made with Unipro UGENE (Supplementary material 6).

**Fig. 5.** Approximate location of restriction sites (plasmid). Made with Unipro UGENE (Supplementary material 6).

**Fig. 6.** Approximate location of open reading frames (plasmid). Made with Unipro UGENE (Supplementary material 6).

3.1.3 Codon and nucleotide usage analysis

The complete genome of the bacterium consists of a chromosome (1,678,432 bp) and a single plasmid (38,104 bp), as mentioned above. According to the feature table of the bacterium, the total number of genes is 1778 (all kinds of genes are included).
The codon frequencies for the chromosome and plasmid are shown in Table 5.1 and Table 5.2.
The nucleotide composition of the chromosome and plasmid is shown in Table 4.1 and Table 4.2.
The result is consistent with Chargaff's second rule, which states that within a single DNA strand the number of oligonucleotides of a given sequence is approximately equal to the number of their reverse complements; for single nucleotides this means A ≈ T and G ≈ C. In our case, the amount of adenine (570,530 nt) is approximately equal to the amount of thymine (580,654 nt), and that of guanine (262,980 nt) to that of cytosine (264,268 nt). This can be confirmed by calculation: A makes up 49.56% of the sum of A and T, and G makes up 49.88% of the sum of G and C. For the chromosome, the values are thus very close to the “50/50” ratio.
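The arithmetic behind this comparison can be checked with a short Python sketch. The nucleotide counts are those reported above for the chromosome; the variable names are chosen for the example.

```python
# Nucleotide counts reported for the chromosome.
counts = {"A": 570530, "T": 580654, "G": 262980, "C": 264268}

total = sum(counts.values())  # should equal the chromosome length

# Share of A among A + T and of G among G + C, in percent.
a_share = 100 * counts["A"] / (counts["A"] + counts["T"])
g_share = 100 * counts["G"] / (counts["G"] + counts["C"])

# GC content of the chromosome, in percent.
gc_percent = 100 * (counts["G"] + counts["C"]) / total

print(f"total = {total} nt")
print(f"A / (A + T) = {a_share:.2f}%")
print(f"G / (G + C) = {g_share:.2f}%")
print(f"GC content  = {gc_percent:.2f}%")
```

The counts sum to 1,678,432 nt, the reported chromosome length, and the GC content computed from them (31.41%) agrees with the value obtained with EMBOSS.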

**Table 4.1.** The frequency of occurrence of nucleotides in the chromosome. From left to right: nucleotide type, quantity, frequency of occurrence.

**Table 4.2.** The frequency of occurrence of nucleotides in the plasmid. From left to right: nucleotide type, quantity, frequency of occurrence.

**Table 5.1.** The number of codons of each type and their frequency of occurrence in the coding sequences of the chromosome. The start codon is highlighted in red, and the stop codon is highlighted in orange.

**Table 5.2.** The number of codons of each type and their frequency of occurrence in the coding sequences of the plasmid. The start codon is highlighted in red, and the stop codon is highlighted in orange.

3.2 Proteome analysis

3.2.1 Protein length distribution estimation

Below is a graph showing the distribution of proteins by their lengths, generated using spreadsheet methods.

It is worth noting that the histogram has positive skewness (a long right “tail”). For a positively skewed distribution, the mean is typically larger than the median, which is confirmed by the descriptive statistics of the proteome (Table 7). The distribution is also two-humped, a shape that is likewise characteristic of many archaea.
The shortest protein is 30 amino acids long and the longest is 1750 amino acids long.
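The relation between mean and median in a right-skewed sample can be illustrated with the Python standard library. The lengths below are invented for the example (only the minimum of 30 and the maximum of 1750 are taken from the text), so the numbers do not reproduce Table 7.

```python
import statistics

# Hypothetical protein lengths (amino acids) with a long right tail.
toy_lengths = [30, 120, 150, 180, 210, 240, 300, 450, 900, 1750]

mean_length = statistics.mean(toy_lengths)
median_length = statistics.median(toy_lengths)

# A mean above the median is the usual signature of positive skewness.
is_right_skewed = mean_length > median_length

print(f"mean = {mean_length}, median = {median_length}")
```

A few very long proteins pull the mean upward while leaving the median almost unchanged, which is why the two statistics diverge.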

**Table 7.** Descriptive statistics of proteins.

3.2.2 Protein distribution in + and - DNA strands and in leading

We found that 855 proteins are encoded on the forward strand of the chromosome and 785 on the reverse strand (Fig. 8).
The plasmid shows a different distribution: 11 proteins are encoded on the forward strand and 35 on the reverse strand (Fig. 9).

**Fig. 8.** Number of chromosome proteins on the forward and reverse strands.

**Fig. 9.** Number of plasmid proteins on the forward and reverse strands.

3.2.3 Determination of the number of ribosomal proteins

Using the above spreadsheet methods it was found that there are 56 ribosomal proteins in the proteome.

3.2.4 Hypothetical proteins. Their number and contribution to the proteome

The result of the search for hypothetical proteins is shown in the diagram (Fig. 10).
As can be seen, proteins whose existence was predicted, but for which there is no experimental evidence that they are expressed in vivo, make up 13.9% of the proteome. It may be possible to predict the function of these proteins by searching for domain homology with different levels of confidence, or by using homology modelling, in which a hypothetical protein must correspond to a known protein sequence whose three-dimensional structure is known.

**Fig. 10.** Ratio of hypothetical proteins to non-hypothetical.

3.2.5 Number of transport proteins and their contribution to the proteome

138 transport proteins of various functions were found in the proteome. Their contribution to the proteome is shown in the diagram (Fig. 11).
As we can see, transport proteins make up a relatively small part of the proteome: the ratio of their number to the number of all other proteins is 0.087. In percentage terms, transport proteins account for about 8% (roughly 2/25) of the entire proteome. The obtained data are quite consistent with the statistical distribution.
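The link between the ratio of transport proteins to all other proteins and their share of the whole proteome can be made explicit with a short calculation. The total of 1722 protein-coding features used here is an assumption: it is obtained by subtracting the 56 RNA genes from the 1778 genes reported in this review and is not a figure stated in the text.

```python
def share_from_ratio(ratio):
    """Convert a part-to-rest ratio into a part-to-whole share."""
    return ratio / (1.0 + ratio)


n_transport = 138
n_total = 1778 - 56  # assumption: all non-RNA genes are protein-coding
n_other = n_total - n_transport

ratio_to_other = n_transport / n_other   # transport : all other proteins
share_of_total = n_transport / n_total   # transport : whole proteome

print(f"ratio to other proteins = {ratio_to_other:.3f}")
print(f"share of the proteome   = {100 * share_of_total:.1f}%")
```

Under this assumption the ratio is about 0.087 and the share is about 8.0%, i.e. close to 2/25 of the proteome.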

**Fig. 11.** The percentage of transport proteins in the total number of proteins.

3.3 RNA gene statistics. Distribution of RNA by types

We found that the genome contains 56 RNA genes: 44 tRNAs, 9 rRNAs, 2 ncRNAs and 1 tmRNA. The pie chart (Fig. 12) clearly shows the percentage of RNA genes in the total number of genes. The second pie chart (Fig. 13) reflects the percentage of RNA genes relative to each other.
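The percentages shown in the two pie charts follow directly from these counts. The sketch below recomputes them in Python, using the RNA gene counts given above and the total of 1778 genes reported in section 3.1.3; the variable names are chosen for the example.

```python
# RNA gene counts reported above.
rna_genes = {"tRNA": 44, "rRNA": 9, "ncRNA": 2, "tmRNA": 1}
total_genes = 1778  # total number of genes reported in section 3.1.3

total_rna = sum(rna_genes.values())

# Share of RNA genes among all genes, in percent.
rna_share = 100 * total_rna / total_genes

# Share of each RNA type among all RNA genes, in percent.
type_shares = {kind: 100 * n / total_rna for kind, n in rna_genes.items()}

print(f"RNA genes: {total_rna} of {total_genes} ({rna_share:.2f}%)")
for kind, pct in type_shares.items():
    print(f"{kind}: {pct:.2f}%")
```

RNA genes thus account for about 3.15% of all genes, and tRNA genes for about 78.57% of the RNA genes.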

**Fig. 12.** The percentage of RNA genes in the total number of genes.

**Fig. 13.** The percentage of RNA genes relative to each other.

SUPPLEMENTARY INFORMATION

REFERENCES

1. Allos, B. M. (2001). Campylobacter jejuni infections: Update on emerging issues and trends. Clinical Infectious Diseases, 32(8), 1201-1206.
2. Erik K.R. Hanko, Nigel P. Minton, Naglis Malys, Chapter Nine - Design, cloning and characterization of transcription factor-based inducible gene expression systems. Methods in Enzymology, Academic Press, Volume 621, 2019, pages 153-169.
3. Hansen, V.M.; Rosenquist, H.; Baggesen, D.L.; Brown, S.; Christensen, B.B. Characterization of Campylobacter Phages Including Analysis of Host Range by Selected Campylobacter Penner Serotypes. BMC Microbiol. 2007, 7, 90.
4. Karki A.B., Wells H., Fakhr M.K. Retail liver juices enhance the survivability of Campylobacter jejuni and Campylobacter coli at low temperatures. Sci. Rep. 2019;9:2733.
5. Prescott LM, Harley JP, Klein DA (2005). “Campylobacter”. Microbiology (6th ed.). pp. 430-433, 500.
6. SkewIT, https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008439 (06.04.2022), SkewIT: The Skew Index Test for large scale GC Skew analysis of bacterial genomes, Steven L. Salzberg. Specific link: https://genskew.csb.univie.ac.at/webskew
7. The European Union One Health 2018 Zoonoses Report, EFSA Journal.