kmdiff, large-scale and user-friendly differential k-mer analyses

Volume 38 Issue 24 15 December 2022


Journal Article

Téo Lemane
Univ. Rennes, Inria, CNRS, IRISA - UMR 6074, Rennes, F-35000, France

Rayan Chikhi
Institut Pasteur, Université Paris Cité, Sequence Bioinformatics, Paris, F-75015, France

Pierre Peterlongo
Univ. Rennes, Inria, CNRS, IRISA - UMR 6074, Rennes, F-35000, France
To whom correspondence should be addressed. Email: pierre.peterlongo@inria.fr

Bioinformatics, Volume 38, Issue 24, 15 December 2022, Pages 5443–5445, https://doi.org/10.1093/bioinformatics/btac689

Article history

Received: 24 June 2022
Revision received: 23 September 2022
Editorial decision: 18 October 2022
Published: 31 October 2022
Corrected and typeset: 04 November 2022


Abstract

Summary

Genome-wide association studies elucidate links between genotypes and phenotypes. Recent studies highlight the value of conducting such experiments using k-mers as the base signal instead of single-nucleotide polymorphisms. We propose a tool, kmdiff, that performs differential k-mer analyses on large sequencing cohorts in an order of magnitude less time and memory than previously possible.

Availability and implementation

https://github.com/tlemane/kmdiff

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Genome-wide association studies (GWAS) determine links between genotypes, i.e. genomic variants, and phenotypes such as diseases. GWAS are generally performed either by genotyping known variants using micro-arrays or by mapping vast amounts of sequencing data to reference genomes. In both cases, the data are biased and incomplete, as genotypes are restricted to a predefined set of single-nucleotide polymorphisms (SNPs) defined with respect to a particular reference genome. Parts of individual genomes from a population that are absent from this reference, or that do not align to it, are simply omitted. Recent approaches (Mehrab et al., 2021; Rahman et al., 2018; Voichek and Weigel, 2020) propose to overcome these limitations by directly comparing raw sequencing data without resorting to a reference genome. Despite being of fundamental interest, these tools are clearly under-exploited, likely because of important practical limitations: the high level of expertise required to install and run them and, more importantly, prohibitive computational requirements even for only dozens of individuals.

Here, we present kmdiff, a new tool that performs large reference-free GWAS experiments using k-mers. kmdiff is based on the state-of-the-art statistical models described in HAWK (Rahman et al., 2018), which detect k-mers with significantly different frequencies between two cohorts while taking population stratification into account. The main novelties offered by kmdiff are its usability (user-friendly installation and usage) and its performance: it is up to 16× faster than HAWK while using 9× less RAM and nearly 3× less disk space. These features enable kmdiff to compare dozens of human whole-genome sequencing experiments in a few hours using reasonable hardware resources.

2 Methods

2.1 Kmdiff pipeline

For the statistical part, kmdiff follows HAWK both in terms of k-mer detection and population stratification correction. Each k-mer is tested for significant association with either cohort using a likelihood ratio test, which assumes that k-mer counts are Poisson-distributed. To take population stratification into account and thus compute corrected P-values, a random sample of k-mers (<1/100th of the total) is used to infer a stratification using the Eigenstrat software (Patterson et al., 2006; Price et al., 2006; Rahman et al., 2018). Finally, P-values are adjusted for multiple testing (Salkind, 2006) using Bonferroni correction (though Benjamini–Hochberg can also be used).
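As an illustration of the test described above, the following is a minimal Python sketch of a two-group Poisson likelihood ratio test. It assumes that each sample's count for a k-mer is Poisson-distributed with a rate proportional to that sample's total k-mer count. The function name `poisson_lrt`, its arguments and this normalization are assumptions made for illustration and are not taken from the kmdiff or HAWK source code. The sketch does not include the population stratification correction.

```python
import math


def poisson_lrt(case_counts, control_counts, case_totals, control_totals):
    """Two-group Poisson likelihood ratio test for a single k-mer.

    case_counts / control_counts: occurrences of the k-mer in each sample.
    case_totals / control_totals: total k-mer count of each sample
    (hypothetical normalization, assumed for this sketch).
    Returns (statistic, p_value).
    """
    k1, k2 = sum(case_counts), sum(control_counts)
    n1, n2 = sum(case_totals), sum(control_totals)
    if k1 + k2 == 0:
        return 0.0, 1.0  # k-mer never observed: nothing to test

    # Null hypothesis: one shared rate for both cohorts.
    theta0 = (k1 + k2) / (n1 + n2)

    def term(k, n):
        # Contribution k * log(theta_group / theta0), with 0 * log(0) = 0.
        return 0.0 if k == 0 else k * math.log((k / n) / theta0)

    # 2 * (log-likelihood under two rates - log-likelihood under one rate).
    stat = max(2.0 * (term(k1, n1) + term(k2, n2)), 0.0)

    # Chi-square survival function with one degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

Under the null hypothesis the statistic is asymptotically chi-square distributed with one degree of freedom, because the alternative model has one more free parameter than the null model. This is why the P-value can be obtained from `math.erfc` without an external statistics library.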

kmdiff deviates from HAWK in the k-mer counting part. HAWK counts the k-mers of each sample before loading and testing batches of them using a hash table. The k-mer abundance tables are obtained using a slightly modified version of Jellyfish (Marçais and Kingsford, 2011) bundled with the tool. Instead, kmdiff constructs a k-mer matrix, i.e. an abundance matrix with k-mers in rows and samples in columns. For efficiency reasons and to drastically limit memory usage, this matrix is never represented as a whole; instead, sub-matrices are streamed in parallel using kmtricks (Lemane et al., 2022). An overview of the procedure is shown in Figure 1.
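The idea of streaming matrix rows without holding the whole matrix in memory can be sketched as a merge of sorted per-sample count streams. The Python example below is an illustration only: it assumes each sample provides `(k-mer, count)` pairs sorted by k-mer, and the function names are hypothetical. It does not reproduce the kmtricks API, its partitioning scheme or its parallelism.

```python
import heapq
from itertools import groupby


def _tag(stream, sample_index):
    # Attach the sample (column) index to every (kmer, count) pair.
    for kmer, count in stream:
        yield kmer, sample_index, count


def stream_matrix_rows(sample_streams):
    """Yield (kmer, row) pairs, one matrix row at a time.

    sample_streams: one iterable per sample, each yielding (kmer, count)
    pairs sorted by kmer. Only the current row is kept in memory.
    """
    n_samples = len(sample_streams)
    merged = heapq.merge(*(_tag(s, i) for i, s in enumerate(sample_streams)))
    for kmer, group in groupby(merged, key=lambda item: item[0]):
        row = [0] * n_samples  # absent k-mers have abundance 0
        for _, sample_index, count in group:
            row[sample_index] = count
        yield kmer, row


# Toy input: S1 and S2 are controls, S3 and S4 are cases.
s1 = [("AAAGC", 1), ("ACGTC", 7)]
s2 = [("ACGTC", 5)]
s3 = [("AAAGC", 9), ("ACGTC", 1)]
s4 = [("AAAGC", 8)]
rows = list(stream_matrix_rows([s1, s2, s3, s4]))
```

Each yielded row can be tested for significance and then discarded, so memory usage depends on the number of samples rather than on the number of distinct k-mers.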

Fig. 1.

kmdiff pipeline overview on two cohorts composed of two samples: S1 and S2 for controls in round boxes and S3 and S4 for cases in square boxes. (A) First stage corresponds to partitioned k-mer counting with kmtricks. (B) Matrix streaming process during which k-mers are tested for significance and sampled to contribute to the PCA. (C) Significant P-values are corrected to account for the population stratification and are then screened by common controlling procedures. The k-mers ACGTC and AAAGC are over-represented in controls and cases, respectively.

2.2 Implementation

kmdiff is a well-documented and user-friendly command line tool implemented in C++. It makes extensive use of the kmtricks tools and APIs for efficient k-mer matrix construction. It also supports C++ plugins for easily prototyping new stream-friendly models while preserving the efficiency of the pipeline. Sources and documentation are available at https://github.com/tlemane/kmdiff.

3 Results

We compare the performance of kmdiff with the state-of-the-art tool HAWK and demonstrate that kmdiff is more scalable while producing an equivalent output. We present medium- and large-scale experiments adapted from Rahman et al. (2018), on bacterial and human data, respectively. Extended results, together with a description of the benchmark environment and resources, are available as a supplement (see Supplementary Section S1).

We also compared the computational performance of kmdiff to that of kmerGWAS (Voichek and Weigel, 2020), but not the quality of results, as kmerGWAS uses a different statistical model which does not compare two cohorts but instead considers phenotypes as continuous real values. Because of the high memory usage of kmerGWAS, results are limited to the bacterial dataset (see Supplementary Section S1.2).

3.1 Ampicillin resistance

This dataset consists of sequencing data from 241 strains of Escherichia coli from Earle et al. (2016). Among them, 189 are resistant to ampicillin and 52 are sensitive. On this dataset, kmdiff is 6× faster than HAWK and reduces memory and disk usage by 8× and 4.5×, respectively. The difference in memory usage is explained by the use of kmtricks, a disk-based counting algorithm. The difference in disk usage is due to the compressed representation of counted k-mers. The k-mers found are exactly the same for both tools: 13196814 over-represented k-mers occur in cases, and 16804587 in controls. After population stratification correction, results differ because of stochasticity: 4542 (for HAWK) and 4591 (for kmdiff) k-mers from controls pass significance filters. The difference can be explained by imprecise floating-point arithmetic and non-deterministic sub-sampling during population stratification correction. Thus, some k-mers with P-values close to the significance threshold may not be found by both tools. In this experiment, 98% of k-mers found by HAWK are also found by kmdiff. The distribution of the significant P-values reported by both tools is available in the Supplementary Material.
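The significance filters and the overlap figure discussed above can be illustrated with a short Python sketch of the two multiple-testing procedures named in Section 2.1 and of a set-overlap measure. The function names and the significance level `alpha=0.05` are assumptions chosen for illustration; they are not taken from kmdiff or HAWK.

```python
def bonferroni_significant(pvalues, alpha=0.05):
    """Flag P-values that pass the Bonferroni threshold alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg_significant(pvalues, alpha=0.05):
    """Flag P-values kept by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank r such that the r-th smallest P-value is <= alpha * r / m.
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            cutoff_rank = rank
    keep = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            keep[i] = True
    return keep


def overlap_fraction(reference, other):
    """Fraction of the reference k-mer set that is also present in `other`."""
    reference = set(reference)
    if not reference:
        return 1.0
    return len(reference & set(other)) / len(reference)


pvalues = [0.001, 0.012, 0.03, 0.2]
bonf = bonferroni_significant(pvalues)
bh = benjamini_hochberg_significant(pvalues)
shared = overlap_fraction({"AAAGC", "ACGTC", "TTGCA", "GGCAT"},
                          {"AAAGC", "ACGTC", "TTGCA", "CCCAT"})
```

With four tests, the Bonferroni threshold is 0.0125, so the second P-value (0.012) only just passes. A small numerical perturbation of such a borderline value is enough to change the outcome, which is consistent with the small differences between tools reported above.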

3.2 Human cohorts

To illustrate the scalability of kmdiff, we compared it to HAWK on several datasets of different sizes from the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015). We used whole-genome sequencing data from two populations, TSI (Toscani in Italia) and YRI (Yoruba in Ibadan, Nigeria), to build benchmark datasets composed of 20, 40 and 80 individuals. As shown in Figure 2, kmdiff scales better than HAWK, being at least 13 times faster while using significantly less memory and disk space.

Fig. 2.

Scalability of HAWK and kmdiff on human cohorts. Both tools support multi-threading and were executed using 20 threads. kmdiff reduces computation times by 13–16× and memory usage by 8×.

4 Conclusion

kmdiff enables differential k-mer analysis over large cohorts of sequencing data. It provides results that are equivalent to those of the state-of-the-art tool HAWK, but it is an order of magnitude more efficient. It additionally has the advantage of being easy to install and use. Finally, kmdiff is designed to allow the simple addition of new streaming-friendly models, making future updates possible while maintaining the efficiency of the pipeline.

Acknowledgements

The authors are grateful to Atif Rahman who provided links to sequencing datasets used in HAWK experiments.

Funding

This work was supported by the IPL Inria Neuromarkers, ANR Inception (ANR-16-CONV-0005), ANR Prairie (ANR-19-P3IA-0001), ANR SeqDigger (ANR-19-CE45-0008), H2020 ITN ALPACA (956229).

Conflict of Interest: none declared.

References

Earle S.G. et al. (2016) Identifying lineage effects when controlling for population structure improves power in bacterial association studies. Nat. Microbiol., 1, 1–8.

Lemane T. et al. (2022) Kmtricks: efficient and flexible construction of bloom filters for large sequencing data collections. Bioinformatics Adv., 2(1).

Marçais G., Kingsford C. (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27, 764–770.

Mehrab Z. et al. (2021) Efficient association mapping from k-mers—an application in finding sex-specific sequences. PLoS One, 16, e0245058.

Patterson N. et al. (2006) Population structure and eigenanalysis. PLoS Genet., 2, e190.

Price A.L. et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet., 38, 904–909.

Rahman A. et al. (2018) Association mapping from sequencing reads using k-mers. Elife, 7, e32920.

Salkind N. (2006) Encyclopedia of Measurement and Statistics, SAGE publications.

The 1000 Genomes Project Consortium. (2015) A global reference for human genetic variation. Nature, 526, 68–74.

Voichek Y., Weigel D. (2020) Identifying genetic variants underlying phenotypic variation in plants without complete genomes. Nat. Genet., 52, 534–540.

© The Author(s) 2022. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Issue Section:

APPLICATIONS NOTE > Sequence analysis

Associate Editor: Inanc Birol


