UCD Bioinformatics Core Workshop (2024)

Make sure our workspace is setup properly

mkdir -p /share/workshop/genome_assembly/$USERcd /share/workshop/genome_assembly/$USER

We can use K-mers (sequences of length K) to estimate biases, repeat content, sequencing coverage, and heterozygosity.

It is always a good idea to plot K-mer frequencies to get a picture of the genome composition and sequencing coverage/quality. For example, haploid and diploid genomes have differing K-mer distributions.

UCD Bioinformatics Core Workshop (1)

The 31-mer histograms of paired-end sequence data. Each histogram shows a bimodal distribution typical of a diploid heterozygous genome. The relative fraction of the distribution under the left (haploid) peak is proportional to the genome heterozygosity. Using the relative proportions of the two peaks the genomes can be ranked by their heterozygosity.

Kristian A. Stevens, Keith Woeste, Sandeep Chakraborty, Marc W. Crepeau, Charles A. Leslie, Pedro J. Martínez-García, Daniela Puiu, Jeanne Romero-Severson, Mark Coggeshall, Abhaya M. Dandekar, Daniel Kluepfel, David B. Neale, Steven L. Salzberg, Charles H. Langley, Genomic Variation Among and Within Six Juglans Species, G3: GENES, GENOMES, GENETICS July 1, 2018 vol. 8 no. 7 2153-2165. doi: 10.1534/g3.118.200030."

What are K-mers

A K-mer is a substring of length K in a string of DNA bases. For example: All 2-mers of the sequence AATTGGCCG are AA, AT, TT, TG, GG, GC, CC, CG. Similarly, all 3-mers of the sequence AATTGGCCG are AAT, ATT, TTG, TGG, GGC, GCC, CCG. There are an exponentially increasing number of possible K-mers for increasing numbers of K. There are 16 possible 2-mers for DNA if we assume there are only 4 types of bases (A,T,G,C). The equation for the number of K-mers possible for a given K is then 4^K.

Bases K-mer size Total possible kmers
4 1 4
4 2 16
4 3 64
4 4 256
4 5 1,024
4 6 4,096
4 7 16,384
4 8 65,536
4 9 262,144
4 10 1,048,576
4
4 21 4.4e+12
4 27 1.8e+16
4 31 4.6e+18

As you can see, there are 64 possibilities for a 3-mer and over 4 Trillion possibilities for a 21-mer!

For a given sequence of length L, and a K-mer size of K, the total k-mer’s possible will be given by ( L – k ) + 1

e.g. For the sequence of length of 14 , and a K-mer length of 8, the number of K-mer’s generated will be:

GATCCTACTGATGC

GATCCTAC, ATCCTACT, TCCTACTG, CCTACTGA, CTACTGAT, TACTGATG, ACTGATGC

For shorter fragments, as in the above example, the total number of K-mers estimated is n = 7, which it is not close to actual fragment size of L which is 14 bps. But for larger fragments , the total number of K-mer’s (n) provide a good approximation to the actual genome size. The following table tries to illustrate the approximation:

K=18

Genome Sizes Total K-mers of k=18 % error in genome estimation
L N=(L-K)+1
100 83 17
1000 983 1.7
10000 9983 0.17
100000 99983 0.017
1000000 999983 0.0017

So for a genome size of 1 Mb and K-mer size of 18, the error between estimation and reality is only .0017%. Which is a very good approximation of actual size.

In choosing a K-mer size, it should be large enough to allow the K-mer to map uniquely to the genome. So the total available K-mers should be sufficiently larger than the genome size and therefor has the ability to store all the K-mers in the genome (a K-mer size of 21 is large enough for most genomes). However, too large K-mers leads to need for substantial computational resources, as well as producing more erroneous K-mers caused by sequencing errors. In other words, the higher error rate in the sequencing data, the smaller k-mer size should be used. Extremely large (haploid size»10GB) and/or very repetitive genomes may benefit from larger kmer lengths to increase the number of unique k-mers.

Genome copies

A significant issue that we face in a real genome sequencing project is non-uniform coverage of genome. This is attributed to both technical and biological variables.

ex: amplification bias of certain genomic regions during PCR (a step in preparation of Illumina sequencing libraries) [Technical] and presence of repetitive sequences in genome [Biological].

Extending the above example, if we had 10 copies (C) of 1Mb genome, then the total no of K-mer’s seen (n) of length k = 21 will be 9999700.

n = [( L - k ) + 1 ] * C = [(1000000 - 21 ) + 1] * 10 = 9999700

To get the fragement size (ala genome size), we simply need to divide the total by the number of copies:

L = n / C = 9999700 / 10 = 9999700

The above example helps to understand that we never sequence a single copy of genome but rather a population. Rather we end up sequencing C copies of genome. This is also referred as coverage in sequencing. To obtain actual genome size (N), divide the total K-mers seen (n) by coverage (C).

Estimating genome characteristics

In the first step, a K-mer frequency is calculated. There are software tools that help in finding the K-mer frequency in sequencing projects. The K-mer frequency of a shotgun library follows a pseudo-normal distribution (actually it is a Poisson distribution) around the mean coverage.

K-mer counting tools

Computing K-mer frequencies with Jellyfish

Lets first get an interactive session on the cluster

cd /share/workshop/genome_assembly/$USERsrun -t 03:30:00 -c 8 -n 1 --mem 32000 --partition production --account genome_workshop --reservation genome_workshop --pty /bin/bash

Running jellyfish, here we use a K-mer size of 21. You should always use “canonical kmers” (-C) since the sequencing reads will come from both the forward and reverse strand of DNA.

module load jellyfishjellyfish count -C -m 21 -s 1000000000 -t 8 /share/workshop/genome_assembly/pacbio_2020_data_drosophila/hifi_long_read_data/ELF_19kb.m64001_190914_015449.Q20.38X.fasta -o reads.jfjellyfish histo -t 8 reads.jf > reads.histo

K-mer frequencies

Once the K-mer frequencies are calculated a histogram can be plotted to visualize the distribution.

Download the K-mer frequencing table

spec1_21 <- read.table("reads.histo")colnames(spec1_21) <- c("freq","k21")summary(spec1_21)barplot(spec1_21$k21)barplot(spec1_21$k21,xlim=c(0,100),ylim=c(0,5000000))# genome size estimateerror=7peak=36g_est = round(sum(as.numeric(spec1_21[error:nrow(spec1_21),1]*spec1_21[error:nrow(spec1_21),2]))/peak, digits=0)g_est# [1] 139970181# Single copy sumsinglecopy=60singleC = round(sum(as.numeric(spec1_21[error:singlecopy,1]*spec1_21[error:singlecopy,2]))/peak, digits=0)singleC# [1] 116938146# Percent in single copypsinglecopy <- round(singleC/g_est, digits=2)psinglecopy# [1] 0.84# Estimated coveragetotal_bp <- 5319759793cov <- round(total_bp/g_est, digits=2)cov# [1] 38.01

Genomescope

Genomescope is an R script that uses k-mer frequencies generated from raw read data to estimate the genome size, abundance of repetitive elements and rate of heterozygosity.

There is also an online version, which offers all of the same functionality in an easy to use web interface

Running on the command line

Rscript genomescope.R histogram_file k-mer_length read_length output_dir [kmer_max] [verbose]
cd /share/workshop/genome_assembly/$USERgit clone https://github.com/schatzlab/genomescope.gitmodule load RRscript ./genomescope/genomescope.R reads.histo 21 19000 genomescope_reads

The Rscript generates a directory called “genomescope_reads” with the following files

  • model.txt
  • plot.log.png
  • plot.png
  • progress.txt
  • summary.txt

The summary.txt show the following results

GenomeScope version 1.0k = 21property min maxHeterozygosity 0.645301% 0.647448%Genome Haploid Length 134,957,059 bp 134,998,768 bpGenome Repeat Length 20,721,599 bp 20,728,003 bpGenome Unique Length 114,235,461 bp 114,270,765 bpModel Fit 96.7426% 97.8227%Read Error Rate 0.182922% 0.182922%

The two plots allow you to visualize the K-mer frequencies

UCD Bioinformatics Core Workshop (2) UCD Bioinformatics Core Workshop (3)

The large number of unique (near unique) K-mers that have a frequency of 1-7 (the extreme left side of the graph) is due to technical error in sequencing. The big peak at 36 in the graph above is in fact the hom*ozygous portions of the genome that account for the identical 21-mers from both strands of the DNA. The dotted line corresponds to the predicted center of that peak. The small peak/shoulder to the left of the peak corresponds to the heterozygous portions of the genome that accounts for different 21-mers from each strand of the DNA. The two dotted lines to the right of the main peak (at coverage = 54 and 72) are the duplicated heterozygous regions and duplicated hom*ozygous regions and correspond to two smaller peaks. The shape of these peaks are affected by the sequencing errors and sequencing duplicates.

Genomescope model

Genomescope uses an equation that accurately models the shape and size of the K-mer graph using four negative binomial peaks whose shape and size are determined by % heterozygosity, % sequencing duplication, % sequencing Error. All very useful pieces of information to learn about your genome from your raw data.

For polyploid organisms, they are working on genomescope2.0https://github.com/tbenavi1/genomescope2.0

Continued Exploration

  • Given the estimated genome size and sum of read length = 5B basepairs, what is the coverage est.
  • What happens to the estimates and plot if you select 0.25% (68,000), 0.50% (136,000), 0.75% (204,000) of the reads in the file? File has 272,412 reads in it. These subsets can be found here:
    • /share/workshop/genome_assembly/pacbio_2020_data_drosophila/hifi_long_read_data/ELF_19kb.m64001_190914_015449.Q20.9X.fasta
    • /share/workshop/genome_assembly/pacbio_2020_data_drosophila/hifi_long_read_data/ELF_19kb.m64001_190914_015449.Q20.19X.fasta
    • /share/workshop/genome_assembly/pacbio_2020_data_drosophila/hifi_long_read_data/ELF_19kb.m64001_190914_015449.Q20.28X.fasta
  • What does changing the K-mer size do, for low (~13) and high (~31)?
  • What does Illumina data look like? Illumina data for the parents can be found here:
    /share/workshop/genome_assembly/pacbio_2020_data_drosophila/short_read_data
  • What does high noise Oxford Nanopore data look like?
    jellyfish count -C -m 21 -s 1000000000 -t 8 <(cat /share/biocore/workshops/Genome-Assembly-Workshop-Jun2020/Nanopore/00-RawData/fastq_pass/*.fastq) -o ONT.jf

    This one takes a long time to run

All of the resulting histo files can be found here:
/share/biocore/workshops/Genome-Assembly-Workshop-Jun2020/kmers

UCD Bioinformatics Core Workshop (2024)

References

Top Articles
Latest Posts
Article information

Author: Laurine Ryan

Last Updated:

Views: 5887

Rating: 4.7 / 5 (77 voted)

Reviews: 92% of readers found this page helpful

Author information

Name: Laurine Ryan

Birthday: 1994-12-23

Address: Suite 751 871 Lissette Throughway, West Kittie, NH 41603

Phone: +2366831109631

Job: Sales Producer

Hobby: Creative writing, Motor sports, Do it yourself, Skateboarding, Coffee roasting, Calligraphy, Stand-up comedy

Introduction: My name is Laurine Ryan, I am a adorable, fair, graceful, spotless, gorgeous, homely, cooperative person who loves writing and wants to share my knowledge and understanding with you.