Share this post on:

In the raw reads of each genome. However, with sufficient coverage, genome assembly can increase the quality of the k-mer representation by eliminating sequencing errors. This reduces the number of unique k-mers and thus, the size of the feature space.Choosing the k-mer lengthConclusionsThe identification of PD98059 web genomic biomarkers is a key step towards improving diagnostic tests and therapies. In this study, we have demonstrated how machine learning canThe k-mer length is an important parameter of the proposed method. Exceedingly small values of k will yield k-mers that ambiguously map to multiple genomic loci. Yet, an exceedingly large k will yield very specific k-mers that only occur in few genomes. To our knowledge, a general protocol for selecting the k-mer length does not exist. We therefore propose two approaches to selecting an appropriate length. The first consists of using prior biological knowledge about the organism under study. For instance, the mutation rate is an important factor to consider. If it is expected to be high (e.g., viruses), small k-mers are preferable. Conversely, if the mutation rate is low, longer k-mers can be used, allowing the identification of additional genomic variations, such as DNA tandem repeats, which can be relevant for predicting the phenotype [58]. Extensive testingDrouin et al. BMC Genomics (2016) 17:Page 11 ofhas shown that k = 31 appears to be optimal for bacterial genome assembly [57] and recent studies have employed it for reference-free bacterial genome comparisons [19, 20]. Hence, this value was used in the current study. The second method is better suited for contexts where no prior knowledge is available. It consists of considering k as a hyperparameter of the learning algorithm and setting its value by cross-validation. In this case, the algorithm is trained using various values of k and the best value is selected based on the cross-validation score. This process is more computationally intensive, since the algorithm needs to be trained multiple times. However, it ensures that the k-mer length is selected based on the evidence of a good generalization performance. In this study, both approaches were PubMed ID:https://www.ncbi.nlm.nih.gov/pubmed/25768400 compared and shown to yield similar results. Indeed, no significant variation in accuracy was observed for the models obtained with k = 31 and with k selected from 15, 21, 31, 51, 71, 91 by cross-validation (Additional file 1: Appendix 3). This corroborates that k-mers of length 31 are well-suited for bacterial genome comparisons. Moreover, it indicates that cross-validation can recover a good k-mer length in the absence of prior knowledge.Applying the set covering machine to genomestypes of boolean-valued rules: presence rules and absence rules, which rely on the vectors (x) to determine their outcome. For each k-mer ki K, we define a presence def rule as pki ((x)) = I[ i (x) = 1] and an absence rule def as aki ((x)) = I[ i (x) = 0], where I[ a] = 1 if a is true and I[ a] = 0 otherwise. The SCM, which is detailed in Additional file 1: Appendix 1, can then be applied by using ((x1 ), y1 ), . . . , (xm ), ym ) as the set S of learning examples and by using the set of presence/absence rules defined above as the set R of boolean-valued rules. This yields a phenotypic model which explicitly highlights the importance of a small set of k-mers. In addition, this model has a form which is simple to interpret, since its predictions are the result of a simple logical operation.Tiebreaker functionWe r.

Share this post on:

Author: heme -oxygenase