Yet another approach to whole-genome phylogenetics is the comparison of gene content. This technique works by predicting orthologues in pairs of organisms and then assigning a "distance" between each
pair based on the putative number of shared genes. This technique was originally proposed by Snel et al. [13] and was subsequently revisited with larger groups of organisms [14, 15]. However, horizontal gene transfer is a major complicating factor in using these methods to infer evolutionary relationships in prokaryotes [16].

Recently, a new subfield called pan-genomics has become established as a framework for exploring the genomic relatedness of bacterial groups. Unlike the studies cited in the previous paragraph, pan-genomics does not involve inferring phylogeny from genome content; rather, it encompasses broad-based characterizations of gene- or protein-content relationships in a given group of organisms. Pan-genomics was introduced by Tettelin et al. [17], who sequenced several strains of the bacterium Streptococcus agalactiae and then analyzed the genomic diversity of those isolates in terms of a "core genome" (genes present in all isolates) and a "dispensable genome" (genes not present in all isolates). Two more examples of pan-genomic analyses
are those done for Vibrio [18] and for Escherichia coli [19]. Review articles summarizing concepts and developments in microbial pan-genomics are also available [20, 21].

Despite the increasing interest in pan-genomics, we do not know of a study providing a general characterization and comparison of gene/protein content relationships in many different bacterial groups. To fill this gap, this study reports the results of several different analyses that compare the protein content of different bacteria.

When beginning this study, we were faced with the choice of comparing either gene content or protein content. Both have been examined in previous work; for example, Tettelin et al. [17] studied both gene sets and predicted protein sets, whereas Rasko et al. [19] used
predicted proteins exclusively. For two reasons, we chose to explore protein content rather than gene content. First, since protein content is more directly related to function and physiology than gene content, the use of protein content was more appropriate for relating pan-genomic properties to factors like habitats, environmental niches, and selective pressures. Second, since we perform comparisons across diverse genera, the lower level of variability in protein sequences compared to gene sequences (due to the degeneracy of the genetic code) may provide an advantage when using BLAST to compare the more divergent organisms. The popularity of tools such as tblastx [22, 23] also speaks to the desirability of comparing gene sequences via the corresponding proteins.
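
The second reason can be illustrated with a small example. The following Python sketch is not part of the analyses reported here; the two short sequences (`gene_a`, `gene_b`) and the helper functions `translate` and `identity` are hypothetical and chosen only for illustration. It assumes the standard genetic code and shows how two coding sequences that differ at synonymous positions can have low nucleotide identity while encoding identical proteins.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
# Codons are enumerated TTT, TTC, TTA, TTG, TCT, ... and paired with
# the matching one-letter amino acid ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding sequence in frame 1, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical sequences: same protein, different synonymous codons.
gene_a = "ATGCTGAGCCGTGGC"
gene_b = "ATGTTATCAAGAGGA"

protein_a = translate(gene_a)
protein_b = translate(gene_b)

print(f"nucleotide identity: {identity(gene_a, gene_b):.2f}")  # 0.47
print(f"protein identity:    {identity(protein_a, protein_b):.2f}")  # 1.00
```

In this toy case, fewer than half of the nucleotide positions match, yet the translated sequences are identical. Real genes also accumulate non-synonymous changes, so the effect is less extreme in practice.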