KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies

Volume 33 Issue 4 February 2017

Daniel Mapleson

Earlham Institute, Norwich Research Park, Norwich, UK

Gonzalo Garcia Accinelli

Earlham Institute, Norwich Research Park, Norwich, UK

George Kettleborough

Earlham Institute, Norwich Research Park, Norwich, UK

Jonathan Wright

Earlham Institute, Norwich Research Park, Norwich, UK

Bernardo J Clavijo

Earlham Institute, Norwich Research Park, Norwich, UK

To whom correspondence should be addressed. Email: bernardo.clavijo@earlham.ac.uk

Bioinformatics, Volume 33, Issue 4, February 2017, Pages 574–576, https://doi.org/10.1093/bioinformatics/btw663

Article history

Received: 19 July 2016
Revision received: 06 October 2016
Accepted: 17 October 2016
Published: 28 November 2016

Abstract

Motivation

De novo assembly of whole genome shotgun (WGS) next-generation sequencing (NGS) data benefits from high-quality input with high coverage. However, in practice, determining the quality and quantity of useful reads quickly and in a reference-free manner is not trivial. Gaining a better understanding of the WGS data, and how that data is utilized by assemblers, provides useful insights that can inform the assembly process and result in better assemblies.

Results

We present the K-mer Analysis Toolkit (KAT): a multi-purpose software toolkit for reference-free quality control (QC) of WGS reads and de novo genome assemblies, primarily via their k-mer frequencies and GC composition. KAT enables users to assess levels of errors, bias and contamination at various stages of the assembly process. In this paper we highlight KAT’s ability to provide valuable insights into assembly composition and quality of genome assemblies through pairwise comparison of k-mers present in both input reads and the assemblies.

Availability and Implementation

KAT is available under the GPLv3 license at: https://github.com/TGAC/KAT.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Rapid analysis of high-throughput whole genome shotgun (WGS) datasets is challenging due to their large size (Metzker, 2010), with genome size and complexity creating additional challenges (Schatz et al., 2012). Reference-free approaches for analyzing WGS data typically involve examining base calling quality, read length, GC content (Yang et al., 2013) and exploring k-mer (words of size k) spectra (Chor et al., 2009; Lo and Chain, 2014). A frequently used reference-free quality control tool is FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

K-mer spectra reveal information not only about data quality (level of errors, sequencing biases, completeness of sequencing coverage and potential contamination) but also about genomic complexity (size, karyotype, levels of heterozygosity and repeat content; Simpson, 2014). Additional information can be extracted through pairwise comparisons of WGS datasets (Anvar et al., 2014), which can identify problematic samples by highlighting differences between spectra.

KAT, the K-mer Analysis Toolkit, is a suite of tools for rapidly counting, comparing and analysing spectra for k-mers of arbitrary length directly from sequence data (see Supplementary section 2 for a discussion on choice of k and Supplementary section 3 for a comparison of k-mer tools).
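To make the notion of a k-mer spectrum concrete, the following is a minimal Python sketch, not KAT's implementation (KAT relies on a modified Jellyfish2). It counts k-mers in a handful of toy reads and tabulates how many distinct k-mers occur at each frequency. The function names, the toy reads and the decision to merge each k-mer with its reverse complement (canonical counting) are assumptions made for illustration.

```python
from collections import Counter

# Translation table used to build reverse complements (illustrative helper).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(sequences, k):
    """Count canonical k-mers over an iterable of DNA strings.

    K-mers containing characters other than A, C, G, T are skipped
    (an assumption of this sketch).
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def spectrum(counts):
    """Map each frequency to the number of distinct k-mers seen that many times."""
    return Counter(counts.values())


# Toy input; real WGS data would contain millions of reads.
reads = ["ACGTACGT", "CGTACGTA", "ACGTTTTT"]
kmer_counts = count_kmers(reads, k=4)
for frequency, n_distinct in sorted(spectrum(kmer_counts).items()):
    print(frequency, n_distinct)
```

On real data, plotting the number of distinct k-mers against frequency gives the spectrum whose peaks are interpreted in the sections that follow.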

2 The K-mer analysis toolkit

KAT is a C++11 application containing multiple tools, each of which exploits multi-core machines via multi-threading where possible. Core functionality is contained in a library designed to promote rapid development of new tools. Runtime and memory requirements depend on input data size, error and bias levels, and properties of the biological sample, but as a rule of thumb, a machine capable of de novo assembly of a dataset will be sufficient to run KAT on that dataset (see Supplementary section 4 for details). K-mer counting in KAT is performed by an integrated and modified version of Jellyfish2 (Marçais and Kingsford, 2011), which supports large k values and is among the fastest k-mer counters available (Zhang et al., 2014).

2.1 Assembly validation by comparison of read spectrum and assembly copy number

The KAT comp tool generates a matrix with one sequence set’s k-mer frequency on one axis and another set’s frequency on the other, where each cell holds the number of distinct k-mers found at that pair of frequencies. When comparing reads against an assembly, KAT highlights properties of assembly composition and quality. If represented as a stacked histogram, the read k-mer spectrum is split by copy number in the assembly (see Supplementary section 5 for a primer on how to interpret KAT’s stacked histograms). In addition, KAT provides the sect tool to study specific assembled sequences and track the k-mer coverage across both the read and the assembly spectra. This can help identify assembly artefacts such as collapsing or expanding events, or detect repeat regions. Figure 1 shows plots relating to two Fraxinus excelsior assemblies created from the same dataset using the comp and sect tools. The plots highlight different strategies taken by the assembler: in (a) and (c) we see some homozygous content being duplicated, and in (b) and (d) some heterozygous content eliminated.
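The comparison matrix described above can be sketched in a few lines of Python. This is an illustrative reconstruction from the description, not KAT's code: the function names, the toy counts and the choice to cap frequencies into the last bin are assumptions. The second function slices the matrix by assembly copy number, which is the decomposition behind the stacked histograms.

```python
def comp_matrix(read_counts, asm_counts, max_read_freq, max_asm_freq):
    """Build matrix[i][j] = number of distinct k-mers seen i times in the
    reads and j times in the assembly.

    Frequencies above the given maxima are capped into the last bin
    (an assumption of this sketch).
    """
    matrix = [[0] * (max_asm_freq + 1) for _ in range(max_read_freq + 1)]
    for kmer in read_counts.keys() | asm_counts.keys():
        i = min(read_counts.get(kmer, 0), max_read_freq)
        j = min(asm_counts.get(kmer, 0), max_asm_freq)
        matrix[i][j] += 1
    return matrix


def stacked_layers(matrix):
    """Split the read spectrum by assembly copy number.

    Returns {copy_number: [distinct k-mers at read frequency 0, 1, 2, ...]}.
    """
    n_copy_numbers = len(matrix[0])
    return {j: [row[j] for row in matrix] for j in range(n_copy_numbers)}


# Hypothetical counts: one likely error k-mer absent from the assembly,
# one k-mer present once and one k-mer present twice.
read_counts = {"AAC": 50, "ACG": 100, "CGT": 2}
asm_counts = {"AAC": 1, "ACG": 2}

matrix = comp_matrix(read_counts, asm_counts, max_read_freq=100, max_asm_freq=2)
layers = stacked_layers(matrix)
print(layers[0][2], layers[1][50], layers[2][100])
```

Layer 0 corresponds to read content missing from the assembly, layer 1 to content present once, and so on, so plotting the layers on top of each other against read frequency reproduces the kind of view shown in Figure 1(a) and (b).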

Fig. 1.

(a) and (b), generated using KAT comp, show read k-mer frequency versus assembly copy number stacked histograms for two different assemblies of a heterozygous Fraxinus excelsior genome (http://ftp-oadb.tsl.ac.uk/fraxinus_excelsior). Read content in black is absent from the assembly, red occurs once, purple twice, etc. Both k-mer spectra show an error distribution under 25×, heterozygous content around 50× and homozygous content around 100×. (a) contains most (but not all) of the heterozygous content, and introduces more duplications of homozygous content. (b) is more collapsed, including mostly a single copy of the homozygous content and less of the heterozygous content. (c) and (d), generated using KAT sect, show k-mer coverage across example assembled loci. The assembly k-mer coverage (black line) of assembly (a) in plot (c) shows that the assembly has two copies of this locus, whereas the read k-mer coverage (red line) implies there should be only a single copy. This incorrect duplication has been corrected in assembly (b), with the read and assembly k-mer coverage agreeing in plot (d). The increased read and assembly k-mer coverage at positions 100 and 400 indicates small regions of repetitive sequence in the genome. The halved read k-mer coverage after position 400 indicates a heterozygous locus, which likely caused the duplication of this locus in assembly (a). See Supplementary Section 5 for a more extensive analysis of all sequences from this locus and their impact on (a) and (b).
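The per-position coverage tracks in panels (c) and (d) can be approximated as below. This is a hedged sketch rather than the sect implementation: it assumes both count tables are keyed by forward-strand k-mers, and the flagging rule (expected copies = read coverage divided by an assumed single-copy depth, with a minimum of one) is a simple heuristic introduced here for illustration only.

```python
def kmer_coverage(sequence, counts, k):
    """Return the count of the k-mer starting at each position (0 if absent)."""
    seq = sequence.upper()
    return [counts.get(seq[i:i + k], 0) for i in range(len(seq) - k + 1)]


def flag_excess_copies(read_cov, asm_cov, single_copy_depth):
    """Return positions where the assembly holds more copies of a k-mer
    than the read coverage suggests.

    single_copy_depth is the read k-mer coverage expected for content
    present once in the assembly (a user-supplied assumption).
    """
    flagged = []
    for pos, (reads, copies) in enumerate(zip(read_cov, asm_cov)):
        expected = max(1, round(reads / single_copy_depth))
        if copies > expected:
            flagged.append(pos)
    return flagged


# Hypothetical locus and counts, with k = 3 for readability.
locus = "ACGTAC"
read_counts = {"ACG": 100, "CGT": 98, "GTA": 203, "TAC": 101}
asm_counts = {"ACG": 2, "CGT": 2, "GTA": 2, "TAC": 1}

read_cov = kmer_coverage(locus, read_counts, k=3)
asm_cov = kmer_coverage(locus, asm_counts, k=3)
print(flag_excess_copies(read_cov, asm_cov, single_copy_depth=100))
```

In this toy case the first two positions are flagged as possible duplications, while the third is not, because its doubled read coverage is consistent with a genuine two-copy repeat.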


2.2 Other KAT tools

KAT also includes the hist tool for computing the spectrum of a single sequence set and the gcp tool to analyse GC content against k-mer frequency. The filter tool can be used to isolate sequences from a set according to their k-mer coverage or GC content from a given spectrum (see Supplementary section 1 for details on all the tools). These tools can be used for various tasks including contaminant detection and extraction in both raw reads and assemblies, and analysis of GC bias and of consistency between paired-end reads and other types of libraries.
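As a rough illustration of the idea behind gcp and filter, the sketch below tabulates distinct k-mers by GC count and frequency, then selects k-mers falling inside a chosen frequency and GC window. The function names, thresholds and toy counts are hypothetical, and the selection works at the k-mer level only; it does not reproduce how KAT extracts whole sequences.

```python
from collections import Counter


def gc_count(kmer):
    """Number of G or C bases in a k-mer."""
    return sum(base in "GC" for base in kmer.upper())


def gc_vs_frequency(counts, max_freq):
    """Return {(gc_count, frequency): number of distinct k-mers}.

    Frequencies above max_freq are capped (an assumption of this sketch).
    """
    matrix = Counter()
    for kmer, freq in counts.items():
        matrix[(gc_count(kmer), min(freq, max_freq))] += 1
    return matrix


def select_kmers(counts, min_freq, max_freq, min_gc, max_gc):
    """Return the k-mers whose frequency and GC count fall inside the window."""
    return {
        kmer
        for kmer, freq in counts.items()
        if min_freq <= freq <= max_freq and min_gc <= gc_count(kmer) <= max_gc
    }


# Hypothetical counts: two k-mers near 100x and two GC-rich k-mers near 10x,
# the kind of separate cluster a contaminant might produce.
counts = {"ACGT": 100, "AATT": 98, "GGCC": 12, "GCGC": 11}

print(sorted(gc_vs_frequency(counts, max_freq=200).items()))
print(sorted(select_kmers(counts, min_freq=5, max_freq=20, min_gc=3, max_gc=4)))
```

A cluster of k-mers that is separated from the main distribution in both GC content and coverage is a typical signal worth isolating and inspecting.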

3 Summary

KAT is a user-friendly, scalable toolkit for rapidly counting, comparing and analyzing k-mers from various data sources. The tools in KAT assist the user with a wide range of tasks, including error profiling, assessing sequencing bias, identifying contaminants, and de novo genome assembly QC and validation.

Acknowledgements

Thanks to David Swarbreck and Federica Di Palma for their support and all KAT users for their valuable feedback. This research was supported in part by the NBIP Computing infrastructure for Science (CiS) group.

Funding

This work was strategically funded by the BBSRC, Institute Strategic Programme Grant BB/J004669/1.

Conflict of Interest: none declared.

References

Anvar S.Y. et al. (2014) Determining the quality and complexity of next-generation sequencing data without a reference genome. Genome Biol, 555.

Chor et al. (2009) Genomic DNA k-mer spectra: models and modalities. Genome Biol, R108.

Lo C.C., Chain P.S.G. (2014) Rapid evaluation and quality control of next generation sequencing data with FaQCs. BMC Bioinformatics, 366.

Marçais, Kingsford (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 764–770.
