Now, for the grand finale: we’ll use all the knowledge we’ve gathered so far to study a fully developed microbiology machine learning tool in R: IDTAXA.
IDTAXA is part of DECIPHER, a package hosted on the Bioconductor website. DECIPHER is a “software toolset that can be used for deciphering and managing biological sequences efficiently using R”. It contains functions for many different uses, including maintaining databases, aligning sequences, finding genes, and, the main use we’ll be studying, analysing sequences for classification (Murali, Bhargava, and Wright 2018).
IDTAXA offers 2 forms of identification: taxonomy by organism and taxonomy by function. “IDTAXA: classify organisms” takes an rRNA or ITS sequence and assigns it to a taxonomy of organisms. “IDTAXA: classify functions” takes a protein or coding sequence and assigns it to a taxonomy of functions.
The IDTAXA site contains a guide on how to perform an IDTAXA identification yourself in R. This guide will be performed on A: a pre-assembled mock data set, B: 3 16S rRNA sequences taken from an online database, and C: 16S rRNA sequences from a random published study.
The process goes as follows: first, store the location of the FASTA file and read it using readDNAStringSet() (or readRNAStringSet() if it’s RNA data). Then use RemoveGaps() to remove any gaps found in the data.
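library(DECIPHER)  # provides RemoveGaps() and IdTaxa()
fas <- "./data.raw/HMP_16S.fas.txt"  # location of the FASTA file
seqs <- readDNAStringSet(fas)        # read the sequences
seqs <- RemoveGaps(seqs)             # strip any gap characters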
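To make the gap-removal step concrete, here is a small base R sketch of the idea: strip the gap characters "-" and "." from each sequence. It only illustrates the effect on plain character strings and is not how RemoveGaps() is implemented; the helper name strip_gaps and the toy sequences are made up for this example.

```r
# Hypothetical helper: drop alignment gap characters ("-" and ".")
# from plain character sequences. Illustration only, not DECIPHER code.
strip_gaps <- function(x) {
  gsub("[-.]", "", x)
}

# Toy aligned sequences (invented for this example)
aligned <- c(seq1 = "ACG--TTA.G", seq2 = "A-GC.TT--A")
ungapped <- strip_gaps(aligned)

print(ungapped)
print(nchar(ungapped))  # lengths after the gaps are gone
```

In the real workflow RemoveGaps() does this job directly on the DNAStringSet, so a helper like this is not needed there.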
Then, load in the training data set downloaded from the DECIPHER website.
(Please keep in mind that loading this data and running the classification is a very CPU-heavy process. To keep this .Rmd accessible to those without a powerful computer, I’ve run the code myself and will only show the resulting plots as output.)
load("./data.raw/Contax_v1_March2018.RData")
Now, let the IdTaxa algorithm run, store the output, and use plot() to immediately plot the result.
ids <- IdTaxa(seqs,
              trainingSet,
              strand = "both",
              threshold = 60,
              processors = NULL)
## Time difference of 8.44 secs
print(ids)
## A test set of class 'Taxa' with length 118
## confidence name taxon
## [1] 77% Acinetobacter bau... Root; Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales; Moraxellaceae; Acinetobacter
## [2] 77% Acinetobacter bau... Root; Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales; Moraxellaceae; Acinetobacter
## [3] 77% Acinetobacter bau... Root; Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales; Moraxellaceae; Acinetobacter
## [4] 77% Acinetobacter bau... Root; Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales; Moraxellaceae; Acinetobacter
## [5] 77% Acinetobacter bau... Root; Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales; Moraxellaceae; Acinetobacter
## ... ... ... ...
## [114] 58% Streptococcus mut... Root; unclassified_Root
## [115] 81% Streptococcus pne... Root; Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus
## [116] 81% Streptococcus pne... Root; Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus
## [117] 81% Streptococcus pne... Root; Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus
## [118] 81% Streptococcus pne... Root; Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus
plot(ids)
And, as easy as that, you’ve used IdTaxa to classify the given 16S rRNA sequences.
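One detail in the printed output is worth a closer look: entry [114] has a confidence of 58%, which is below the threshold of 60 passed to IdTaxa, and its lineage stops at "Root; unclassified_Root". The base R sketch below mimics that truncation idea on made-up per-rank confidences. It is a simplified illustration under my own assumptions, not IdTaxa's actual code; truncate_lineage and all the numbers are hypothetical.

```r
# Hypothetical helper: keep ranks while confidence >= threshold, then
# mark the remainder as "unclassified_<last kept rank>".
truncate_lineage <- function(taxon, confidence, threshold = 60) {
  below <- which(confidence < threshold)
  if (length(below) == 0) {
    return(taxon)                      # every rank passes the threshold
  }
  keep <- max(below[1] - 1, 1)         # always keep at least "Root"
  c(taxon[seq_len(keep)], paste0("unclassified_", taxon[keep]))
}

# Made-up lineage and per-rank confidences (percent)
taxon <- c("Root", "Bacteria", "Firmicutes", "Bacilli")
confident <- truncate_lineage(taxon, c(100, 95, 81, 81))
uncertain <- truncate_lineage(taxon, c(100, 95, 58, 40))

print(paste(confident, collapse = "; "))
print(paste(uncertain, collapse = "; "))
```

A lower threshold would keep more ranks in the lineage, at the cost of accepting less certain assignments.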
That was for a data set that was already pre-assembled. But what if you have a number of individual 16S rRNA sequences and want to identify all of them? For that, we’ll need to collect some 16S rRNA sequences ourselves and combine them in R.
I’ve taken 3 random bacterial DNA sequences:
fas1 <- "./data.raw/AB002521.1.fasta"
fas2 <- "./data.raw/AB243005.1.fasta"
fas3 <- "./data.raw/AB594754.1.fasta"
seqs <- readDNAStringSet(c(fas1, fas2, fas3))
seqs <- RemoveGaps(seqs)
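Here the three files are combined in memory, because readDNAStringSet() accepts a vector of file paths. If a single combined FASTA file on disk is preferred instead, a minimal base R sketch could look like the following. The helper combine_fasta and the two demo records are invented for illustration; the real accession files are not touched.

```r
# Hypothetical helper: concatenate several FASTA files into one file.
combine_fasta <- function(paths, out) {
  lines <- unlist(lapply(paths, readLines))
  lines <- lines[nzchar(lines)]        # drop empty lines between records
  writeLines(lines, out)
  invisible(out)
}

# Demo with two throwaway FASTA files (invented records)
tmp <- tempdir()
in1 <- file.path(tmp, "demo1.fasta")
in2 <- file.path(tmp, "demo2.fasta")
writeLines(c(">demo1", "ACGTACGT"), in1)
writeLines(c(">demo2", "TTGACCA", ""), in2)

combined <- combine_fasta(c(in1, in2), file.path(tmp, "combined.fasta"))

# Count the records by counting header lines
n_records <- sum(startsWith(readLines(combined), ">"))
print(n_records)
```

The resulting file could then be read with a single readDNAStringSet() call, just like the mock data set earlier.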
ids <- IdTaxa(seqs,
              trainingSet,
              strand = "both",
              threshold = 60,
              processors = NULL)
## Time difference of 0.84 secs
print(ids)
## A test set of class 'Taxa' with length 3
## confidence name taxon
## [1] 80% ENA|AB002521|AB00... Root; Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus
## [2] 81% ENA|AB243005|AB24... Root; Bacteria; Firmicutes; Bacilli; Bacillales; Staphylococcaceae; Staphylococcus
## [3] 85% ENA|AB594754|AB59... Root; Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Salmonella
plot(ids)
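For a quick overview, the printed lineages can be reduced to their lowest assigned rank. The sketch below does this in base R on the three lineage strings and confidences copied from the printed output above, rather than on the ids object itself; the column names are my own choice.

```r
# Lineages and confidences copied from the printed output above
lineages <- c(
  "Root; Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus",
  "Root; Bacteria; Firmicutes; Bacilli; Bacillales; Staphylococcaceae; Staphylococcus",
  "Root; Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Salmonella"
)
confidence <- c(80, 81, 85)  # percent, as printed

# Split each lineage into its ranks and keep the last (lowest) one
ranks <- strsplit(lineages, "; ", fixed = TRUE)
result <- data.frame(
  accession = c("AB002521", "AB243005", "AB594754"),
  lowest_rank = vapply(ranks, function(r) r[length(r)], character(1)),
  confidence = confidence,
  stringsAsFactors = FALSE
)

print(result)
```

All three sequences are resolved down to the last rank of their lineage, each with a confidence well above the threshold of 60.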