KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies

Volume 33 Issue 4 February 2017

Journal Article

Daniel Mapleson

Earlham Institute, Norwich Research Park, Norwich, UK

Gonzalo Garcia Accinelli

Earlham Institute, Norwich Research Park, Norwich, UK

George Kettleborough

Earlham Institute, Norwich Research Park, Norwich, UK

Jonathan Wright

Earlham Institute, Norwich Research Park, Norwich, UK

Bernardo J Clavijo

Earlham Institute, Norwich Research Park, Norwich, UK

To whom correspondence should be addressed. Email: bernardo.clavijo@earlham.ac.uk

Bioinformatics, Volume 33, Issue 4, February 2017, Pages 574–576, https://doi.org/10.1093/bioinformatics/btw663

Article history

Received: 19 July 2016

Revision received: 06 October 2016

Accepted: 17 October 2016

Published: 28 November 2016


Abstract

Motivation

De novo assembly of whole genome shotgun (WGS) next-generation sequencing (NGS) data benefits from high-quality input with high coverage. However, in practice, determining the quality and quantity of useful reads quickly and in a reference-free manner is not trivial. Gaining a better understanding of the WGS data, and how that data is utilized by assemblers, provides useful insights that can inform the assembly process and result in better assemblies.

Results

We present the K-mer Analysis Toolkit (KAT): a multi-purpose software toolkit for reference-free quality control (QC) of WGS reads and de novo genome assemblies, primarily via their k-mer frequencies and GC composition. KAT enables users to assess levels of errors, bias and contamination at various stages of the assembly process. In this paper we highlight KAT’s ability to provide valuable insights into assembly composition and quality of genome assemblies through pairwise comparison of k-mers present in both input reads and the assemblies.

Availability and Implementation

KAT is available under the GPLv3 license at: https://github.com/TGAC/KAT.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Rapid analysis of high-throughput whole genome shotgun (WGS) datasets is challenging due to their large size (Metzker, 2010), with genome size and complexity creating additional challenges (Schatz et al., 2012). Reference-free approaches for analyzing WGS data typically involve examining base calling quality, read length, GC content (Yang et al., 2013) and exploring k-mer (words of size k) spectra (Chor et al., 2009; Lo and Chain, 2014). A frequently used reference-free quality control tool is FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

K-mer spectra reveal information not only about data quality (level of errors, sequencing biases, completeness of sequencing coverage and potential contamination) but also about genomic complexity (size, karyotype, levels of heterozygosity and repeat content; Simpson, 2014). Additional information can be extracted through pairwise comparisons of WGS datasets (Anvar et al., 2014), which can identify problematic samples by highlighting differences between spectra.
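As an illustration of what a k-mer spectrum is, the following minimal Python sketch counts k-mers in a handful of sequences and then tabulates how many distinct k-mers occur at each frequency. It is not KAT's implementation (KAT relies on a modified Jellyfish for counting); the function names, the use of canonical k-mers (a k-mer and its reverse complement counted together) and the toy reads are all assumptions made for this example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(sequences, k):
    """Count canonical k-mers over an iterable of DNA strings.

    K-mers containing anything other than A, C, G or T are skipped.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def spectrum(counts):
    """Map each frequency to the number of distinct k-mers seen that many times."""
    return Counter(counts.values())


# Toy reads (hypothetical, far too small to show a realistic spectrum).
reads = ["ACGTACGT", "CGTACGTA", "TTTTACGT"]
spec = spectrum(count_kmers(reads, k=4))
for freq in sorted(spec):
    print(freq, spec[freq])
```

On real WGS data the same table, plotted as a histogram, is where the error peak at low frequency and the heterozygous and homozygous peaks at higher frequencies become visible.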

KAT, the K-mer Analysis Toolkit, is a suite of tools for rapidly counting, comparing and analysing spectra for k-mers of arbitrary length directly from sequence data (see Supplementary section 2 for a discussion on choice of k and Supplementary section 3 for a comparison of k-mer tools).

2 The K-mer analysis toolkit

KAT is a C++11 application containing multiple tools, each of which exploits multi-core machines via multi-threading where possible. Core functionality is contained in a library designed to promote rapid development of new tools. Runtime and memory requirements depend on input data size, error and bias levels, and properties of the biological sample but, as a rule of thumb, machines capable of de novo assembly of a dataset will be sufficient to run KAT on the dataset (see Supplementary section 4 for details). K-mer counting in KAT is performed by an integrated and modified version of Jellyfish2 (Marçais and Kingsford, 2011), which supports large k values and is among the fastest k-mer counters available (Zhang et al., 2014).

2.1 Assembly validation by comparison of read spectrum and assembly copy number

The KAT comp tool generates a matrix with one sequence set’s k-mer frequency on one axis and another set’s frequency on the other; each cell holds the number of distinct k-mers found at that pair of frequencies. When reads are compared against an assembly, the matrix highlights properties of assembly composition and quality. Represented as a stacked histogram, the read k-mer spectrum is split by copy number in the assembly (see Supplementary section 5 for a primer on how to interpret KAT’s stacked histograms). In addition, KAT provides the sect tool to study specific assembled sequences and track k-mer coverage across both the read and the assembly spectra. This can help identify assembly artefacts such as collapsing or expanding events, or detect repeat regions. Figure 1 shows plots, generated using the comp and sect tools, relating to two Fraxinus excelsior assemblies created from the same dataset. The plots highlight different strategies taken by the assembler: in (a) and (c) some homozygous content is duplicated, and in (b) and (d) some heterozygous content is eliminated.
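The comparison matrix described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not KAT's code: the two inputs are taken to be plain dictionaries mapping k-mers to counts, and the names comp_matrix and split_by_copy_number, as well as the toy counts, are hypothetical.

```python
from collections import Counter


def comp_matrix(read_counts, asm_counts):
    """Return {(read_freq, asm_freq): number of distinct k-mers}.

    A k-mer missing from one table is treated as having frequency 0 there.
    """
    matrix = Counter()
    for kmer in read_counts.keys() | asm_counts.keys():
        cell = (read_counts.get(kmer, 0), asm_counts.get(kmer, 0))
        matrix[cell] += 1
    return matrix


def split_by_copy_number(matrix):
    """Regroup the matrix as {asm_copy_number: {read_freq: distinct k-mers}}.

    Each inner table is one layer of a stacked histogram of the read spectrum.
    """
    layers = {}
    for (read_freq, asm_freq), n in matrix.items():
        layers.setdefault(asm_freq, Counter())[read_freq] += n
    return layers


# Toy count tables (hypothetical values, not real data).
read_counts = {"AAC": 50, "ACG": 100, "CGT": 100, "GTT": 3}
asm_counts = {"AAC": 1, "ACG": 2, "CGT": 1}

layers = split_by_copy_number(comp_matrix(read_counts, asm_counts))
for copies in sorted(layers):
    print(copies, dict(layers[copies]))
```

In this toy case the low-frequency k-mer GTT falls in the copy-number 0 layer (read content absent from the assembly), while ACG, seen at the same read frequency as CGT, lands in the copy-number 2 layer, which is the kind of signal that would suggest a duplication.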


Fig. 1.

(a) and (b), generated using KAT comp, show stacked histograms of read k-mer frequency versus assembly copy number for two different assemblies of a heterozygous Fraxinus excelsior genome (http://ftp-oadb.tsl.ac.uk/fraxinus_excelsior). Read content in black is absent from the assembly, red occurs once, purple twice, etc. Both k-mer spectra show an error distribution under 25×, heterozygous content around 50× and homozygous content around 100×. (a) contains most (but not all) of the heterozygous content, and introduces more duplications of homozygous content. (b) is more collapsed, including mostly a single copy of the homozygous content and less of the heterozygous content. (c) and (d), generated using KAT sect, show k-mer coverage across example assembled loci. The assembly k-mer coverage (black line) of assembly (a) in plot (c) shows that the assembly has two copies of this locus, whereas the read k-mer coverage (red line) implies there should be only a single copy. This incorrect duplication has been corrected in assembly (b), with the read and assembly k-mer coverage agreeing in plot (d). The increased read and assembly k-mer coverage at positions 100 and 400 indicates small regions of repetitive sequence in the genome. The halved read k-mer coverage after position 400 indicates a heterozygous locus, which likely caused the duplication of this locus in assembly (a). See Supplementary Section 5 for a more extensive analysis of all sequences from this locus and their impact on (a) and (b).
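The per-position coverage plotted in panels (c) and (d) can be illustrated with a short Python sketch. It assumes k-mer count tables are available as plain dictionaries and, for brevity, looks up k-mers as written rather than in canonical form; the function names and the toy locus are hypothetical and this is not how KAT sect is implemented.

```python
def kmer_coverage(sequence, counts, k):
    """Per-position coverage: the count of the k-mer starting at each position (0 if absent)."""
    return [counts.get(sequence[i:i + k], 0) for i in range(len(sequence) - k + 1)]


def coverage_profile(sequence, read_counts, asm_counts, k):
    """Pair read and assembly k-mer coverage at every k-mer start position."""
    reads = kmer_coverage(sequence, read_counts, k)
    assembly = kmer_coverage(sequence, asm_counts, k)
    return list(zip(reads, assembly))


# Toy locus and count tables (hypothetical values, not real data).
locus = "ACGTTG"
read_counts = {"ACG": 100, "CGT": 100, "GTT": 50, "TTG": 50}
asm_counts = {"ACG": 2, "CGT": 2, "GTT": 2, "TTG": 2}

profile = coverage_profile(locus, read_counts, asm_counts, k=3)
for position, (read_cov, asm_cov) in enumerate(profile):
    print(position, read_cov, asm_cov)
```

Reading the two columns side by side mirrors the interpretation in the caption: an assembly coverage of 2 where the read coverage matches single-copy content points to a duplicated locus, and a drop to roughly half the read coverage points to heterozygous sequence.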


2.2 Other KAT tools

KAT also includes the hist tool for computing the spectrum of a single sequence set and the gcp tool for analysing GC content against k-mer frequency. The filter tool can be used to isolate sequences from a set according to their k-mer coverage or GC content in a given spectrum (see Supplementary section 1 for details on all the tools). These tools can be used for various tasks, including contaminant detection and extraction in both raw reads and assemblies, analysis of GC bias, and checking consistency between paired-end reads and other types of libraries.
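The idea of relating GC content to k-mer frequency can be sketched as a second small matrix. The Python below is an illustration only: it assumes a dictionary of k-mer counts as input, and the names gc_count and gcp_matrix and the toy values are hypothetical rather than part of KAT.

```python
from collections import Counter


def gc_count(kmer):
    """Number of G or C bases in a k-mer."""
    return sum(base in "GC" for base in kmer)


def gcp_matrix(counts):
    """Return {(gc_count, frequency): number of distinct k-mers}."""
    matrix = Counter()
    for kmer, freq in counts.items():
        matrix[(gc_count(kmer), freq)] += 1
    return matrix


# Toy count table (hypothetical values, not real data).
counts = {"AAC": 50, "TTG": 50, "ACG": 100, "GGC": 100}
for (gc, freq), n in sorted(gcp_matrix(counts).items()):
    print(gc, freq, n)
```

In a real dataset, a cluster of k-mers whose GC content and frequency both differ from the main body of the matrix is the kind of pattern that would prompt a check for contamination.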

3 Summary

KAT is a user-friendly, scalable toolkit for rapidly counting, comparing and analyzing k-mers from various data sources. The tools in KAT assist the user with a wide range of tasks, including error profiling, assessing sequencing bias, identifying contaminants, and de novo genome assembly QC and validation.

Acknowledgements

Thanks to David Swarbreck and Federica Di Palma for their support and all KAT users for their valuable feedback. This research was supported in part by the NBIP Computing infrastructure for Science (CiS) group.

Funding

This work was strategically funded by the BBSRC, Institute Strategic Programme Grant BB/J004669/1.

Conflict of Interest: none declared.

References

Anvar S.Y. et al. (2014) Determining the quality and complexity of next-generation sequencing data without a reference genome. Genome Biol., 15, 555.

Chor B. et al. (2009) Genomic DNA k-mer spectra: models and modalities. Genome Biol., 10, R108.

Lo C.C., Chain P.S.G. (2014) Rapid evaluation and quality control of next generation sequencing data with FaQCs. BMC Bioinformatics, 15, 366.

Marçais G., Kingsford C. (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27, 764–770.

Metzker M.L. (2010) Sequencing technologies - the next generation. Nat. Rev. Genet., 11, 31–46.

Schatz M.C. et al. (2012) Current challenges in de novo plant genome sequencing and assembly. Genome Biol., 13, 243.

Simpson J.T. (2014) Exploring genome characteristics and sequence quality without a reference. Bioinformatics, 30, 1228–1235.

Yang X. et al. (2013) HTQC: a fast quality control toolkit for Illumina sequencing data. BMC Bioinformatics, 14, 33.

Zhang Q. et al. (2014) These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PLoS One, 9, e101271.

© The Author 2016. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Associate Editor: Bonnie Berger


