Reverse complementary DNA sequences (sequences that are inadvertently given backwards with all purines and pyrimidines transposed) can affect sequence analysis detrimentally unless taken into account. We present an open-source, high-throughput software tool, v-revcomp (http://www.cmde.science.ubc.ca/mohn/software.html), to detect and reorient reverse complementary entries of the small-subunit rRNA (16S) gene from sequencing datasets, particularly from environmental sources. The software supports sequence lengths ranging from full length down to the short reads that are characteristic of next-generation sequencing technologies. We evaluated the reliability of v-revcomp by screening all 406 781 16S sequences deposited
in release 102 of the curated SILVA database and demonstrated that the tool has a detection accuracy of virtually 100%. We subsequently used v-revcomp to analyse 1 171 646 16S sequences deposited in the International Nucleotide Sequence Databases and found that about 1% of these user-submitted sequences were reverse complementary. In addition, a nontrivial proportion of the entries were otherwise anomalous, including reverse complementary chimeras, sequences associated with wrong taxa, nonribosomal genes, sequences of poor quality or otherwise erroneous sequences without a reasonable match to any other entry in the database. Thus, v-revcomp is highly efficient in detecting and reorienting reverse
complementary 16S sequences of almost any length and can be used to detect various sequence anomalies.

The bacterial and archaeal small-subunit rRNA (SSU rRNA, 16S) gene has emerged as the gold standard genetic marker for determining
the diversity and structure of prokaryotic communities in the environment and for the assessment of phylogenetic relationships within the microbial tree of life (reviewed in Tringe & Hugenholtz, 2008; Pace, 2009). Numerous international efforts to characterize microbial communities have led to an unparalleled accumulation of 16S sequences in the International Nucleotide Sequence Databases (INSDs, Sayers et al., 2010) and warranted the establishment of curated 16S reference databases such as SILVA (Pruesse et al., 2007), RDP (Cole et al., 2007) and Greengenes (DeSantis et al., 2006). As of the October 2010 release of SILVA (version 104), close to 3 million 16S sequences are deposited in the INSDs, not counting the enormous number of short reads currently generated by massively parallel sequencing technologies (Margulies et al., 2005) and typically deposited as raw data in the Sequence Read Archive (Leinonen et al., 2011). The contribution of these data repositories to scientific progress is indisputable. However, as the number of public 16S sequences increases, so does the number of sequences exhibiting poor read quality, chimaerism and incomplete or incorrect taxonomic annotation (Bridge et al., 2003; Hugenholtz & Huber, 2003; Ashelford et al., 2005; Bidartondo et al.