KMC 3: counting and manipulating k-mer statistics (2024)

Article Navigation

Volume 33 Issue 17 September 2017

Article Contents

  • Abstract

  • 1 Introduction

  • 2 Materials and methods

  • 3 Results

  • 4 Discussion

  • Funding

  • References

  • < Previous
  • Next >

Journal Article

,

Marek Kokot

Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland

Search for other works by this author on:

Oxford Academic

,

Maciej Długosz

Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland

Search for other works by this author on:

Oxford Academic

Sebastian Deorowicz

Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland

To whom correspondence should be addressed. Email: sebastian.deorowicz@polsl.pl

Search for other works by this author on:

Oxford Academic

Bioinformatics, Volume 33, Issue 17, September 2017, Pages 2759–2761, https://doi.org/10.1093/bioinformatics/btx304

Published:

04 May 2017

Article history

Received:

30 January 2017

Revision received:

19 April 2017

Accepted:

03 May 2017

Published:

04 May 2017

Search

Close

Search

Advanced Search

Search Menu

Abstract

Summary

Counting all k-mers in a given dataset is a standard procedure in many bioinformatics applications. We introduce KMC3, a significant improvement of the former KMC2 algorithm together with KMC tools for manipulating k-mer databases. Usefulness of the tools is shown on a few real problems.

Availability and implementation

Program is freely available at http://sun.aei.polsl.pl/REFRESH/kmc.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

In many applications related to genome sequencing, e.g. de novo assembly, read correction, repeat detection and comparison of genomes, the first step is k-mer counting. This procedure consists in determining all unique k-symbol long strings (usually with counters) in the read collection.

From the conceptual point of view k-mer counting is quite simple task. Nevertheless, the problems appear when we deal with real data, which could be huge. There are many articles published in the recent years that discuss this problem, which shows that providing an efficient tool solving this task is far from trivial.

The obtained k-mer statistics are of course only a point of departure for following analyzes. In some of them various operations on sets of k-mers are necessary. We introduce a new version of KMC2 (Deorowicz et al., 2015), which is more memory frugal and much faster. We also introduce KMC tools for easy manipulation of sets produced by KMC. The usefulness of KMC tools is shown on a few literature case studies in which various utilities were replaced by operations from our package with significant gains in processing time and memory usage.

2 Materials and methods

KMC3 follows the same two-stage processing scheme as KMC2. In the first stage the reads are split into several hundred bins (disk files) according to the signatures (short m-symbol long substrings) of k-mers. The bins are then sorted one by one to remove duplicates in the second stage. A similar processing was used in several recently published algorithms for k-mer counting. Their first stages are similar to KMC2, but then the bins are handled in different ways, e.g. Gerbil (Erbert et al., 2017) employs hash tables, DSK2 (Rizk et al., 2013) uses multistage processing in a case of small disk space. Recently published, KCMBT (Mamun et al., 2016) applies burst tries for k-mer storage. Jellyfish (Marcais and Kingsford, 2011) utilizes thread-safe hash tables.

There are several main novelties in KMC3. Concerning the first stage, input files, especially in gzipped FASTQ format, are loaded faster due to better input/output (I/O) subsystem. Signatures are assigned to bins in an improved way, i.e. larger default signature length (nine instead of seven) is picked and larger part of input data is taken in the pre-processing stage in which the signatures are assigned to bins. These lead to better balance of bin sizes (at the cost of slightly larger total size of bins), which results in smaller main memory requirements and faster processing in the second stage. The most significant improvements are in the second stage. Both KMC (Deorowicz et al., 2013) and KMC2 used the least-significant-digit (LSD) radix sort. When the keys to sort are long (which happens when k > 32) the LSD radix sort requires unnecessarily many iterations over the data. Therefore, in KMC3 we implemented the fastest existing most-significant-digit radix sort (Kokot et al., 2016). We also improved the parallelization scheme of other routines (not directly related to sorting) at this stage.

The second part of the package consists of various tools for manipulation of sets containing k-mers. Figure 1 shows a general scheme of the package (complete description is given in the Supplementary Material). The filtering allows to extract from FASTQ files the reads satisfying some criteria, e.g. with sufficiently large number of k-mers occurrences. The family of transform operations allows to modify the KMC database, e.g. remove too frequent/rare k-mers, remove counters, build histogram. The family of simple operations is composed of several set operations involving more than one KMC database. For example, the user is able to intersect, union the databases. Finally, the complex operations allow to construct compound expressions taking as parameters KMC databases.

3 Results

We used six datasets for KMC3 evaluation. Table 1 shows the results for two representatives. The description of the datasets, the platform used for tests, the remaining results are given in the Supplementary Material. FASTQ files are almost always stored in compressed form. Thus, we provide the running times for both: uncompressed and gzipped FASTQ.

Table 1.

Open in new tab

Comparison of k-mers counting algorithms

Algorithmk = 28k = 55
RAMDiskTime/gz-TimeRAMDiskTime/gz-Time
G. gallus (35 Gbases in total)
DSK 21439783/11801530719/1123
Gerbil224607/631417615/551
GTester47402494/—unsupported k
Jellyfish 2330909/9467701048/919
KCMBT10701335/—unsupported k
KMC 21225489/3211218861/672
KMC 31228492/2921218455/245
H. sapiens 3 (729 Gbases in total)
Gerbil2852311 994/12 7306236411 968/12 469
Jellyfish 28425138 338/20 28410463631 783/31 345
KMC 26455110 777/90367238113 774/11 804
KMC 3335969631/5985343898750/5331
Algorithmk = 28k = 55
RAMDiskTime/gz-TimeRAMDiskTime/gz-Time
G. gallus (35 Gbases in total)
DSK 21439783/11801530719/1123
Gerbil224607/631417615/551
GTester47402494/—unsupported k
Jellyfish 2330909/9467701048/919
KCMBT10701335/—unsupported k
KMC 21225489/3211218861/672
KMC 31228492/2921218455/245
H. sapiens 3 (729 Gbases in total)
Gerbil2852311 994/12 7306236411 968/12 469
Jellyfish 28425138 338/20 28410463631 783/31 345
KMC 26455110 777/90367238113 774/11 804
KMC 3335969631/5985343898750/5331

The times are given for: uncompressed input FASTQ file (‘Time’) and gzipped input FASTQ files (‘gz-Time’) The units are: seconds (time), GB (Disk and RAM). All programs were executed in 12-threads mode (at the Xeon-based workstation using HDD, see Supplementary Material, Section 2 for details), except for KCMBT which was executed in 8-threads mode due to large main memory consumption (exceeding 128 GB RAM for 12 threads). ‘—’ means that the mode is not supported. There are no results for some programs for H. sapiens 3 due to: processing longer than 12 h (DSK 2, GTester4) or memory requirements larger than 128 GB (KCMBT).

Table 1.

Open in new tab

Comparison of k-mers counting algorithms

Algorithmk = 28k = 55
RAMDiskTime/gz-TimeRAMDiskTime/gz-Time
G. gallus (35 Gbases in total)
DSK 21439783/11801530719/1123
Gerbil224607/631417615/551
GTester47402494/—unsupported k
Jellyfish 2330909/9467701048/919
KCMBT10701335/—unsupported k
KMC 21225489/3211218861/672
KMC 31228492/2921218455/245
H. sapiens 3 (729 Gbases in total)
Gerbil2852311 994/12 7306236411 968/12 469
Jellyfish 28425138 338/20 28410463631 783/31 345
KMC 26455110 777/90367238113 774/11 804
KMC 3335969631/5985343898750/5331
Algorithmk = 28k = 55
RAMDiskTime/gz-TimeRAMDiskTime/gz-Time
G. gallus (35 Gbases in total)
DSK 21439783/11801530719/1123
Gerbil224607/631417615/551
GTester47402494/—unsupported k
Jellyfish 2330909/9467701048/919
KCMBT10701335/—unsupported k
KMC 21225489/3211218861/672
KMC 31228492/2921218455/245
H. sapiens 3 (729 Gbases in total)
Gerbil2852311 994/12 7306236411 968/12 469
Jellyfish 28425138 338/20 28410463631 783/31 345
KMC 26455110 777/90367238113 774/11 804
KMC 3335969631/5985343898750/5331

The times are given for: uncompressed input FASTQ file (‘Time’) and gzipped input FASTQ files (‘gz-Time’) The units are: seconds (time), GB (Disk and RAM). All programs were executed in 12-threads mode (at the Xeon-based workstation using HDD, see Supplementary Material, Section 2 for details), except for KCMBT which was executed in 8-threads mode due to large main memory consumption (exceeding 128 GB RAM for 12 threads). ‘—’ means that the mode is not supported. There are no results for some programs for H. sapiens 3 due to: processing longer than 12 h (DSK 2, GTester4) or memory requirements larger than 128 GB (KCMBT).

As we can observe for the Gallus gallus data, KMC3 is usually the fastest, especially for larger k. The times for hom*o sapiens 3, the largest of our dataset, are much longer. Nevertheless, KMC3 was able to complete in <100 min for typical input format for both examined k. It can be observed, that the improved I/O subsystem as well as the new sorting routine and better parallelization of the second stage gave substantial benefits comparing to KMC2. It is also worth noting that KMC3 even for the largest dataset used a reasonable amount of memory.

To evaluate KMC tools we picked three studies described recently in the literature. Below we briefly describe the goals of the studies and total gains in time and memory. The detailed results together with the scripts showing KMC tools usage are given in the Supplementary Material.

DIAMUND (Salzberg et al., 2014) is a novel approach for variant detection. It is dedicated to comparison of family trios or normal and diseased samples of the same individual. Instead of mapping the reads onto a reference genome, DIAMUND compares directly the raw reads. We followed this protocol for the Ashkenazim Jewish ancestry trio (Zook et al., 2016). DIAMUND needed 13 h to complete its work and used 107 GB RAM. As most of the stages can be made using KMC tools we replaced the original tools used in DIAMUND and reduced the processing time to 4 h (reducing memory usage to 12 GB RAM).

NIKS (Nordström et al., 2013) is another tool that uses raw sequencing reads for mutation identification taking into account mainly k-mer statistics. We replaced its stages related to k-mer counting and manipulation of k-mer databases by KMC tools operations. In a single stage it was necessary to use KMC API to prepare a short C ++ program to reproduce the format of intermediary data used in NIKS. The processing time for Oryza sativa dataset (83.3 GB input FASTQ) was reduced from ∼40 to 5 min in this case (RAM usage was reduced from 92 to 12 GB).

Finally, we made the same experiment as used in the GenomeTester4 article (Kaplinski et al., 2015), introducing a tool that is able to perform similar operations as KMC tools. The investigated problem was to find group-specific k-mers to identify bacteria. The processing time was reduced from 5 h to 15 min (RAM usage reduction from 75 to 12 GB).

4 Discussion

We proposed a new version of our k-mer counter. It is much faster (even a few times for large k) than its predecessor and faster than the existing competitors. The KMC tools offer a number of operations on k-mer databases that can be used in projects making use of k-mer statistics.

Funding

This work was supported by the Polish National Science Centre under the project DEC-2015/17/B/ST6/01890 and by Silesian University of Technology grant no. (BKM507/RAU2/2016). We used the infrastructure supported by POIG.02.03.01-24-099/13 grant: ‘GeCONiI–Upper Silesian Center for Computational Science and Engineering’.

Conflict of Interest: none declared.

References

Deorowicz

S.

et al. (

2013

) Disk-based k-mer counting on a PC, BMC Bioinformatics,

14

, 160.

Deorowicz

S.

et al. (

2015

)

KMC 2: Fast and resource-frugal k-mer counting

.

Bioinformatics

,

31

,

1569

1576

.

Erbert

M.

et al. (

2017

)

Gerbil: a fast and memory-efficient k-mer counter with GPU-support

.

Algorithms Mol. Biol

.,

12

, 9.

Google Scholar

OpenURL Placeholder Text

Kaplinski

L.

et al. (

2015

)

GenomeTester4: a toolkit for performing basic set operations—union, intersection and complement on k-mer lists

.

GigaScience

,

4

,

58

.

Kokot

M.

et al. (

2016

) Sorting data on ultra-large scale with RADULS. In:

Kozielski

S.

et al. (eds) Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation.

Springer

,

pp. 235–245

.

Mamun

A.A.

et al. (

2016

)

KCMBT: a k-mer counter based on multiple burst trees

.

Bioinformatics

,

32

,

2783

2790

.

Marçais

G.

,

Kingsford

C.

(

2011

)

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

.

Bioinformatics

,

27

,

764

770

.

Nordström

K.J.V.

et al. (

2013

)

Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers

.

Nat. Biotechnol

.,

31

,

325

330

.

Rizk

G.

et al. (

2013

)

DSK: k-mer counting with very low memory usage

.

Bioinformatics

,

29

,

652

653

.

Salzberg

S.L.

et al. (

2014

)

DIAMUND: Direct comparison of genomes to detect mutations

.

Hum. Mutat

.,

35

,

283

288

.

Zook

J.M.

et al. (

2016

)

Extensive sequencing of seven human genomes to characterize benchmark reference materials

.

Scientific Data

,

3

, Article no.

160025.

© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Associate Editor: Bonnie Berger

Bonnie Berger

Associate Editor

Search for other works by this author on:

Oxford Academic


Download all slides

  • Supplementary data

  • Supplementary data

    Advertisem*nt

    Citations

    Views

    17,029

    Altmetric

    More metrics information

    Metrics

    Total Views 17,029

    13,779 Pageviews

    3,250 PDF Downloads

    Since 5/1/2017

    Month: Total Views:
    May 2017 93
    June 2017 51
    July 2017 36
    August 2017 56
    September 2017 130
    October 2017 65
    November 2017 72
    December 2017 44
    January 2018 36
    February 2018 56
    March 2018 28
    April 2018 58
    May 2018 53
    June 2018 33
    July 2018 39
    August 2018 44
    September 2018 82
    October 2018 64
    November 2018 90
    December 2018 100
    January 2019 61
    February 2019 96
    March 2019 93
    April 2019 117
    May 2019 150
    June 2019 108
    July 2019 133
    August 2019 142
    September 2019 159
    October 2019 183
    November 2019 187
    December 2019 109
    January 2020 124
    February 2020 192
    March 2020 165
    April 2020 200
    May 2020 121
    June 2020 191
    July 2020 225
    August 2020 120
    September 2020 185
    October 2020 256
    November 2020 293
    December 2020 266
    January 2021 239
    February 2021 219
    March 2021 283
    April 2021 266
    May 2021 240
    June 2021 246
    July 2021 225
    August 2021 224
    September 2021 288
    October 2021 229
    November 2021 271
    December 2021 224
    January 2022 305
    February 2022 234
    March 2022 294
    April 2022 308
    May 2022 347
    June 2022 333
    July 2022 297
    August 2022 294
    September 2022 304
    October 2022 289
    November 2022 310
    December 2022 300
    January 2023 280
    February 2023 322
    March 2023 318
    April 2023 345
    May 2023 289
    June 2023 281
    July 2023 215
    August 2023 236
    September 2023 297
    October 2023 283
    November 2023 289
    December 2023 308
    January 2024 319
    February 2024 349
    March 2024 368
    April 2024 346
    May 2024 428
    June 2024 81

    Citations

    Powered by Dimensions

    252 Web of Science

    Altmetrics

    ×

    Email alerts

    Article activity alert

    Advance article alerts

    New issue alert

    In progress issue alert

    Receive exclusive offers and updates from Oxford Academic

    Citing articles via

    Google Scholar

    • Latest

    • Most Read

    • Most Cited

    Subtype-MGTP: a cancer subtype identification framework based on Multi-Omics translation
    Figeno: multi-region genomic figures with long-read support
    Biotextgraph: graphical summarization of functional similarities from textual information
    spillR: Spillover compensation in mass cytometry data

    More from Oxford Academic

    Bioinformatics and Computational Biology

    Biological Sciences

    Science and Mathematics

    Books

    Journals

    Advertisem*nt

    KMC 3: counting and manipulating k-mer statistics (2024)

    References

    Top Articles
    Latest Posts
    Article information

    Author: Errol Quitzon

    Last Updated:

    Views: 5891

    Rating: 4.9 / 5 (79 voted)

    Reviews: 94% of readers found this page helpful

    Author information

    Name: Errol Quitzon

    Birthday: 1993-04-02

    Address: 70604 Haley Lane, Port Weldonside, TN 99233-0942

    Phone: +9665282866296

    Job: Product Retail Agent

    Hobby: Computer programming, Horseback riding, Hooping, Dance, Ice skating, Backpacking, Rafting

    Introduction: My name is Errol Quitzon, I am a fair, cute, fancy, clean, attractive, sparkling, kind person who loves writing and wants to share my knowledge and understanding with you.