Article Navigation
Article Contents
-
Abstract
-
1 Introduction
-
2 Materials and methods
-
3 Results
-
4 Discussion
-
Funding
-
References
- < Previous
- Next >
Journal Article
, Marek Kokot Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland Search for other works by this author on: Oxford Academic Maciej Długosz Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland Search for other works by this author on: Oxford Academic Sebastian Deorowicz Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland To whom correspondence should be addressed. Email: sebastian.deorowicz@polsl.pl Search for other works by this author on: Oxford Academic
Bioinformatics, Volume 33, Issue 17, September 2017, Pages 2759–2761, https://doi.org/10.1093/bioinformatics/btx304
Published:
04 May 2017
Article history
Received:
30 January 2017
Revision received:
19 April 2017
Accepted:
03 May 2017
Published:
04 May 2017
- Split View
- Views
- Article contents
- Figures & tables
- Video
- Audio
- Supplementary Data
-
Cite
Cite
Marek Kokot, Maciej Długosz, Sebastian Deorowicz, KMC 3: counting and manipulating k-mer statistics, Bioinformatics, Volume 33, Issue 17, September 2017, Pages 2759–2761, https://doi.org/10.1093/bioinformatics/btx304
Close
Search
Close
Search
Advanced Search
Search Menu
Abstract
Summary
Counting all k-mers in a given dataset is a standard procedure in many bioinformatics applications. We introduce KMC3, a significant improvement of the former KMC2 algorithm together with KMC tools for manipulating k-mer databases. Usefulness of the tools is shown on a few real problems.
Availability and implementation
Program is freely available at http://sun.aei.polsl.pl/REFRESH/kmc.
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
In many applications related to genome sequencing, e.g. de novo assembly, read correction, repeat detection and comparison of genomes, the first step is k-mer counting. This procedure consists in determining all unique k-symbol long strings (usually with counters) in the read collection.
From the conceptual point of view k-mer counting is quite simple task. Nevertheless, the problems appear when we deal with real data, which could be huge. There are many articles published in the recent years that discuss this problem, which shows that providing an efficient tool solving this task is far from trivial.
The obtained k-mer statistics are of course only a point of departure for following analyzes. In some of them various operations on sets of k-mers are necessary. We introduce a new version of KMC2 (Deorowicz et al., 2015), which is more memory frugal and much faster. We also introduce KMC tools for easy manipulation of sets produced by KMC. The usefulness of KMC tools is shown on a few literature case studies in which various utilities were replaced by operations from our package with significant gains in processing time and memory usage.
2 Materials and methods
KMC3 follows the same two-stage processing scheme as KMC2. In the first stage the reads are split into several hundred bins (disk files) according to the signatures (short m-symbol long substrings) of k-mers. The bins are then sorted one by one to remove duplicates in the second stage. A similar processing was used in several recently published algorithms for k-mer counting. Their first stages are similar to KMC2, but then the bins are handled in different ways, e.g. Gerbil (Erbert et al., 2017) employs hash tables, DSK2 (Rizk et al., 2013) uses multistage processing in a case of small disk space. Recently published, KCMBT (Mamun et al., 2016) applies burst tries for k-mer storage. Jellyfish (Marcais and Kingsford, 2011) utilizes thread-safe hash tables.
There are several main novelties in KMC3. Concerning the first stage, input files, especially in gzipped FASTQ format, are loaded faster due to better input/output (I/O) subsystem. Signatures are assigned to bins in an improved way, i.e. larger default signature length (nine instead of seven) is picked and larger part of input data is taken in the pre-processing stage in which the signatures are assigned to bins. These lead to better balance of bin sizes (at the cost of slightly larger total size of bins), which results in smaller main memory requirements and faster processing in the second stage. The most significant improvements are in the second stage. Both KMC (Deorowicz et al., 2013) and KMC2 used the least-significant-digit (LSD) radix sort. When the keys to sort are long (which happens when k > 32) the LSD radix sort requires unnecessarily many iterations over the data. Therefore, in KMC3 we implemented the fastest existing most-significant-digit radix sort (Kokot et al., 2016). We also improved the parallelization scheme of other routines (not directly related to sorting) at this stage.
The second part of the package consists of various tools for manipulation of sets containing k-mers. Figure 1 shows a general scheme of the package (complete description is given in the Supplementary Material). The filtering allows to extract from FASTQ files the reads satisfying some criteria, e.g. with sufficiently large number of k-mers occurrences. The family of transform operations allows to modify the KMC database, e.g. remove too frequent/rare k-mers, remove counters, build histogram. The family of simple operations is composed of several set operations involving more than one KMC database. For example, the user is able to intersect, union the databases. Finally, the complex operations allow to construct compound expressions taking as parameters KMC databases.
Fig. 1
Scheme of KMC 3 package
Open in new tabDownload slide
3 Results
We used six datasets for KMC3 evaluation. Table 1 shows the results for two representatives. The description of the datasets, the platform used for tests, the remaining results are given in the Supplementary Material. FASTQ files are almost always stored in compressed form. Thus, we provide the running times for both: uncompressed and gzipped FASTQ.
Table 1.
Open in new tab
Comparison of k-mers counting algorithms
Algorithm | k = 28 | k = 55 | ||||
---|---|---|---|---|---|---|
RAM | Disk | Time/gz-Time | RAM | Disk | Time/gz-Time | |
G. gallus (35 Gbases in total) | ||||||
DSK 2 | 14 | 39 | 783/1180 | 15 | 30 | 719/1123 |
Gerbil | 2 | 24 | 607/631 | 4 | 17 | 615/551 |
GTester4 | 74 | 0 | 2494/— | unsupported k | ||
Jellyfish 2 | 33 | 0 | 909/946 | 77 | 0 | 1048/919 |
KCMBT | 107 | 0 | 1335/— | unsupported k | ||
KMC 2 | 12 | 25 | 489/321 | 12 | 18 | 861/672 |
KMC 3 | 12 | 28 | 492/292 | 12 | 18 | 455/245 |
H. sapiens 3 (729 Gbases in total) | ||||||
Gerbil | 28 | 523 | 11 994/12 730 | 62 | 364 | 11 968/12 469 |
Jellyfish 2 | 84 | 251 | 38 338/20 284 | 104 | 636 | 31 783/31 345 |
KMC 2 | 64 | 551 | 10 777/9036 | 72 | 381 | 13 774/11 804 |
KMC 3 | 33 | 596 | 9631/5985 | 34 | 389 | 8750/5331 |
Algorithm | k = 28 | k = 55 | ||||
---|---|---|---|---|---|---|
RAM | Disk | Time/gz-Time | RAM | Disk | Time/gz-Time | |
G. gallus (35 Gbases in total) | ||||||
DSK 2 | 14 | 39 | 783/1180 | 15 | 30 | 719/1123 |
Gerbil | 2 | 24 | 607/631 | 4 | 17 | 615/551 |
GTester4 | 74 | 0 | 2494/— | unsupported k | ||
Jellyfish 2 | 33 | 0 | 909/946 | 77 | 0 | 1048/919 |
KCMBT | 107 | 0 | 1335/— | unsupported k | ||
KMC 2 | 12 | 25 | 489/321 | 12 | 18 | 861/672 |
KMC 3 | 12 | 28 | 492/292 | 12 | 18 | 455/245 |
H. sapiens 3 (729 Gbases in total) | ||||||
Gerbil | 28 | 523 | 11 994/12 730 | 62 | 364 | 11 968/12 469 |
Jellyfish 2 | 84 | 251 | 38 338/20 284 | 104 | 636 | 31 783/31 345 |
KMC 2 | 64 | 551 | 10 777/9036 | 72 | 381 | 13 774/11 804 |
KMC 3 | 33 | 596 | 9631/5985 | 34 | 389 | 8750/5331 |
The times are given for: uncompressed input FASTQ file (‘Time’) and gzipped input FASTQ files (‘gz-Time’) The units are: seconds (time), GB (Disk and RAM). All programs were executed in 12-threads mode (at the Xeon-based workstation using HDD, see Supplementary Material, Section 2 for details), except for KCMBT which was executed in 8-threads mode due to large main memory consumption (exceeding 128 GB RAM for 12 threads). ‘—’ means that the mode is not supported. There are no results for some programs for H. sapiens 3 due to: processing longer than 12 h (DSK 2, GTester4) or memory requirements larger than 128 GB (KCMBT).
Table 1.
Open in new tab
Comparison of k-mers counting algorithms
Algorithm | k = 28 | k = 55 | ||||
---|---|---|---|---|---|---|
RAM | Disk | Time/gz-Time | RAM | Disk | Time/gz-Time | |
G. gallus (35 Gbases in total) | ||||||
DSK 2 | 14 | 39 | 783/1180 | 15 | 30 | 719/1123 |
Gerbil | 2 | 24 | 607/631 | 4 | 17 | 615/551 |
GTester4 | 74 | 0 | 2494/— | unsupported k | ||
Jellyfish 2 | 33 | 0 | 909/946 | 77 | 0 | 1048/919 |
KCMBT | 107 | 0 | 1335/— | unsupported k | ||
KMC 2 | 12 | 25 | 489/321 | 12 | 18 | 861/672 |
KMC 3 | 12 | 28 | 492/292 | 12 | 18 | 455/245 |
H. sapiens 3 (729 Gbases in total) | ||||||
Gerbil | 28 | 523 | 11 994/12 730 | 62 | 364 | 11 968/12 469 |
Jellyfish 2 | 84 | 251 | 38 338/20 284 | 104 | 636 | 31 783/31 345 |
KMC 2 | 64 | 551 | 10 777/9036 | 72 | 381 | 13 774/11 804 |
KMC 3 | 33 | 596 | 9631/5985 | 34 | 389 | 8750/5331 |
Algorithm | k = 28 | k = 55 | ||||
---|---|---|---|---|---|---|
RAM | Disk | Time/gz-Time | RAM | Disk | Time/gz-Time | |
G. gallus (35 Gbases in total) | ||||||
DSK 2 | 14 | 39 | 783/1180 | 15 | 30 | 719/1123 |
Gerbil | 2 | 24 | 607/631 | 4 | 17 | 615/551 |
GTester4 | 74 | 0 | 2494/— | unsupported k | ||
Jellyfish 2 | 33 | 0 | 909/946 | 77 | 0 | 1048/919 |
KCMBT | 107 | 0 | 1335/— | unsupported k | ||
KMC 2 | 12 | 25 | 489/321 | 12 | 18 | 861/672 |
KMC 3 | 12 | 28 | 492/292 | 12 | 18 | 455/245 |
H. sapiens 3 (729 Gbases in total) | ||||||
Gerbil | 28 | 523 | 11 994/12 730 | 62 | 364 | 11 968/12 469 |
Jellyfish 2 | 84 | 251 | 38 338/20 284 | 104 | 636 | 31 783/31 345 |
KMC 2 | 64 | 551 | 10 777/9036 | 72 | 381 | 13 774/11 804 |
KMC 3 | 33 | 596 | 9631/5985 | 34 | 389 | 8750/5331 |
The times are given for: uncompressed input FASTQ file (‘Time’) and gzipped input FASTQ files (‘gz-Time’) The units are: seconds (time), GB (Disk and RAM). All programs were executed in 12-threads mode (at the Xeon-based workstation using HDD, see Supplementary Material, Section 2 for details), except for KCMBT which was executed in 8-threads mode due to large main memory consumption (exceeding 128 GB RAM for 12 threads). ‘—’ means that the mode is not supported. There are no results for some programs for H. sapiens 3 due to: processing longer than 12 h (DSK 2, GTester4) or memory requirements larger than 128 GB (KCMBT).
As we can observe for the Gallus gallus data, KMC3 is usually the fastest, especially for larger k. The times for hom*o sapiens 3, the largest of our dataset, are much longer. Nevertheless, KMC3 was able to complete in <100 min for typical input format for both examined k. It can be observed, that the improved I/O subsystem as well as the new sorting routine and better parallelization of the second stage gave substantial benefits comparing to KMC2. It is also worth noting that KMC3 even for the largest dataset used a reasonable amount of memory.
To evaluate KMC tools we picked three studies described recently in the literature. Below we briefly describe the goals of the studies and total gains in time and memory. The detailed results together with the scripts showing KMC tools usage are given in the Supplementary Material.
DIAMUND (Salzberg et al., 2014) is a novel approach for variant detection. It is dedicated to comparison of family trios or normal and diseased samples of the same individual. Instead of mapping the reads onto a reference genome, DIAMUND compares directly the raw reads. We followed this protocol for the Ashkenazim Jewish ancestry trio (Zook et al., 2016). DIAMUND needed 13 h to complete its work and used 107 GB RAM. As most of the stages can be made using KMC tools we replaced the original tools used in DIAMUND and reduced the processing time to 4 h (reducing memory usage to 12 GB RAM).
NIKS (Nordström et al., 2013) is another tool that uses raw sequencing reads for mutation identification taking into account mainly k-mer statistics. We replaced its stages related to k-mer counting and manipulation of k-mer databases by KMC tools operations. In a single stage it was necessary to use KMC API to prepare a short C ++ program to reproduce the format of intermediary data used in NIKS. The processing time for Oryza sativa dataset (83.3 GB input FASTQ) was reduced from ∼40 to 5 min in this case (RAM usage was reduced from 92 to 12 GB).
Finally, we made the same experiment as used in the GenomeTester4 article (Kaplinski et al., 2015), introducing a tool that is able to perform similar operations as KMC tools. The investigated problem was to find group-specific k-mers to identify bacteria. The processing time was reduced from 5 h to 15 min (RAM usage reduction from 75 to 12 GB).
4 Discussion
We proposed a new version of our k-mer counter. It is much faster (even a few times for large k) than its predecessor and faster than the existing competitors. The KMC tools offer a number of operations on k-mer databases that can be used in projects making use of k-mer statistics.
Funding
This work was supported by the Polish National Science Centre under the project DEC-2015/17/B/ST6/01890 and by Silesian University of Technology grant no. (BKM507/RAU2/2016). We used the infrastructure supported by POIG.02.03.01-24-099/13 grant: ‘GeCONiI–Upper Silesian Center for Computational Science and Engineering’.
Conflict of Interest: none declared.
References
Deorowicz S.
2013
) Disk-based k-mer counting on a PC, BMC Bioinformatics,
14
, 160.
Deorowicz S.
2015
)
KMC 2: Fast and resource-frugal k-mer counting
.
Bioinformatics
,
31
,
1569
–
1576
.
Erbert M.
2017
)
Gerbil: a fast and memory-efficient k-mer counter with GPU-support
.
Algorithms Mol. Biol
.,
12
, 9.
OpenURL Placeholder Text
Kaplinski L.
2015
)
GenomeTester4: a toolkit for performing basic set operations—union, intersection and complement on k-mer lists
.
GigaScience
,
4
,
58
.
Kokot M.
2016
) Sorting data on ultra-large scale with RADULS. In: Kozielski S.
Springer
,
pp. 235–245
.
Mamun A.A.
2016
)
KCMBT: a k-mer counter based on multiple burst trees
.
Bioinformatics
,
32
,
2783
–
2790
.
Marçais G. Kingsford C.
2011
)
A fast, lock-free approach for efficient parallel counting of occurrences of k-mers
.
Bioinformatics
,
27
,
764
–
770
.
Nordström K.J.V.
2013
)
Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers
.
Nat. Biotechnol
.,
31
,
325
–
330
.
Rizk G.
2013
)
DSK: k-mer counting with very low memory usage
.
Bioinformatics
,
29
,
652
–
653
.
Salzberg S.L.
2014
)
DIAMUND: Direct comparison of genomes to detect mutations
.
Hum. Mutat
.,
35
,
283
–
288
.
Zook J.M.
2016
)
Extensive sequencing of seven human genomes to characterize benchmark reference materials
.
Scientific Data
,
3
, Article no.
160025.
© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)
Associate Editor: Bonnie Berger Bonnie Berger Associate Editor Search for other works by this author on: Oxford Academic
Download all slides
Advertisem*nt
Citations
Views
17,029
Altmetric
More metrics information
Metrics
Total Views 17,029
13,779 Pageviews
3,250 PDF Downloads
Since 5/1/2017
Month: | Total Views: |
---|---|
May 2017 | 93 |
June 2017 | 51 |
July 2017 | 36 |
August 2017 | 56 |
September 2017 | 130 |
October 2017 | 65 |
November 2017 | 72 |
December 2017 | 44 |
January 2018 | 36 |
February 2018 | 56 |
March 2018 | 28 |
April 2018 | 58 |
May 2018 | 53 |
June 2018 | 33 |
July 2018 | 39 |
August 2018 | 44 |
September 2018 | 82 |
October 2018 | 64 |
November 2018 | 90 |
December 2018 | 100 |
January 2019 | 61 |
February 2019 | 96 |
March 2019 | 93 |
April 2019 | 117 |
May 2019 | 150 |
June 2019 | 108 |
July 2019 | 133 |
August 2019 | 142 |
September 2019 | 159 |
October 2019 | 183 |
November 2019 | 187 |
December 2019 | 109 |
January 2020 | 124 |
February 2020 | 192 |
March 2020 | 165 |
April 2020 | 200 |
May 2020 | 121 |
June 2020 | 191 |
July 2020 | 225 |
August 2020 | 120 |
September 2020 | 185 |
October 2020 | 256 |
November 2020 | 293 |
December 2020 | 266 |
January 2021 | 239 |
February 2021 | 219 |
March 2021 | 283 |
April 2021 | 266 |
May 2021 | 240 |
June 2021 | 246 |
July 2021 | 225 |
August 2021 | 224 |
September 2021 | 288 |
October 2021 | 229 |
November 2021 | 271 |
December 2021 | 224 |
January 2022 | 305 |
February 2022 | 234 |
March 2022 | 294 |
April 2022 | 308 |
May 2022 | 347 |
June 2022 | 333 |
July 2022 | 297 |
August 2022 | 294 |
September 2022 | 304 |
October 2022 | 289 |
November 2022 | 310 |
December 2022 | 300 |
January 2023 | 280 |
February 2023 | 322 |
March 2023 | 318 |
April 2023 | 345 |
May 2023 | 289 |
June 2023 | 281 |
July 2023 | 215 |
August 2023 | 236 |
September 2023 | 297 |
October 2023 | 283 |
November 2023 | 289 |
December 2023 | 308 |
January 2024 | 319 |
February 2024 | 349 |
March 2024 | 368 |
April 2024 | 346 |
May 2024 | 428 |
June 2024 | 81 |
Altmetrics
Email alerts
Article activity alert
Advance article alerts
New issue alert
In progress issue alert
Receive exclusive offers and updates from Oxford Academic
Citing articles via
Google Scholar
-
Latest
-
Most Read
-
Most Cited
More from Oxford Academic
Bioinformatics and Computational Biology
Biological Sciences
Science and Mathematics
Books
Journals
Pittsburg, Pennsylvania
Burlington, Vermont
Long Island, New York
Long Island, New York
Advertisem*nt