UCD Bioinformatics Core Workshop (2024)

Make sure our workspace is setup properly

mkdir -p /share/workshop/genome_assembly/$USERcd /share/workshop/genome_assembly/$USER

We can use K-mers (sequences of length K) to estimate biases, repeat content, sequencing coverage, and heterozygosity.

It is always a good idea to plot K-mer frequencies to get a picture of the genome composition and sequencing coverage/quality. For example, haploid and diploid genomes have differing K-mer distributions.

The 31-mer histograms of paired-end sequence data. Each histogram shows a bimodal distribution typical of a diploid heterozygous genome. The relative fraction of the distribution under the left (haploid) peak is proportional to the genome heterozygosity. Using the relative proportions of the two peaks the genomes can be ranked by their heterozygosity.

Kristian A. Stevens, Keith Woeste, Sandeep Chakraborty, Marc W. Crepeau, Charles A. Leslie, Pedro J. Martínez-García, Daniela Puiu, Jeanne Romero-Severson, Mark Coggeshall, Abhaya M. Dandekar, Daniel Kluepfel, David B. Neale, Steven L. Salzberg, Charles H. Langley, Genomic Variation Among and Within Six Juglans Species, G3: GENES, GENOMES, GENETICS July 1, 2018 vol. 8 no. 7 2153-2165. doi: 10.1534/g3.118.200030."

What are K-mers

A K-mer is a substring of length K in a string of DNA bases. For example: All 2-mers of the sequence AATTGGCCG are AA, AT, TT, TG, GG, GC, CC, CG. Similarly, all 3-mers of the sequence AATTGGCCG are AAT, ATT, TTG, TGG, GGC, GCC, CCG. There are an exponentially increasing number of possible K-mers for increasing numbers of K. There are 16 possible 2-mers for DNA if we assume there are only 4 types of bases (A,T,G,C). The equation for the number of K-mers possible for a given K is then 4^K.

Bases	K-mer size	Total possible kmers
4	1	4
4	2	16
4	3	64
4	4	256
4	5	1,024
4	6	4,096
4	7	16,384
4	8	65,536
4	9	262,144
4	10	1,048,576
4	…	…
4	21	4.4e+12
4	27	1.8e+16
4	31	4.6e+18

As you can see, there are 64 possibilities for a 3-mer and over 4 Trillion possibilities for a 21-mer!

For a given sequence of length L, and a K-mer size of K, the total k-mer’s possible will be given by ( L – k ) + 1

e.g. For the sequence of length of 14 , and a K-mer length of 8, the number of K-mer’s generated will be:

GATCCTACTGATGC

n = ( L - K ) + 1 = (14 - 8 ) + 1 = 7

Genome Sizes	Total K-mers of k=18	% error in genome estimation
L	N=(L-K)+1
100	83	17
1000	983	1.7
10000	9983	0.17
100000	99983	0.017
1000000	999983	0.0017

Genome copies

A significant issue that we face in a real genome sequencing project is non-uniform coverage of genome. This is attributed to both technical and biological variables.

ex: amplification bias of certain genomic regions during PCR (a step in preparation of Illumina sequencing libraries) [Technical] and presence of repetitive sequences in genome [Biological].

Extending the above example, if we had 10 copies (C) of 1Mb genome, then the total no of K-mer’s seen (n) of length k = 21 will be 9999700.

n = [( L - k ) + 1 ] * C = [(1000000 - 21 ) + 1] * 10 = 9999700

To get the fragement size (ala genome size), we simply need to divide the total by the number of copies:

L = n / C = 9999700 / 10 = 9999700

The above example helps to understand that we never sequence a single copy of genome but rather a population. Rather we end up sequencing C copies of genome. This is also referred as coverage in sequencing. To obtain actual genome size (N), divide the total K-mers seen (n) by coverage (C).

N = n / C

Estimating genome characteristics

In the first step, a K-mer frequency is calculated. There are software tools that help in finding the K-mer frequency in sequencing projects. The K-mer frequency of a shotgun library follows a pseudo-normal distribution (actually it is a Poisson distribution) around the mean coverage.

K-mer counting tools

Computing K-mer frequencies with Jellyfish

Lets first get an interactive session on the cluster

cd /share/workshop/genome_assembly/$USERsrun -t 03:30:00 -c 8 -n 1 --mem 32000 --partition production --account genome_workshop --reservation genome_workshop --pty /bin/bash

Running jellyfish, here we use a K-mer size of 21. You should always use “canonical kmers” (-C) since the sequencing reads will come from both the forward and reverse strand of DNA.

module load jellyfishjellyfish count -C -m 21 -s 1000000000 -t 8 /share/workshop/genome_assembly/pacbio_2020_data_drosophila/hifi_long_read_data/ELF_19kb.m64001_190914_015449.Q20.38X.fasta -o reads.jfjellyfish histo -t 8 reads.jf > reads.histo

K-mer frequencies

Once the K-mer frequencies are calculated a histogram can be plotted to visualize the distribution.

Download the K-mer frequencing table

spec1_21 <- read.table("reads.histo")colnames(spec1_21) <- c("freq","k21")summary(spec1_21)barplot(spec1_21$k21)barplot(spec1_21$k21,xlim=c(0,100),ylim=c(0,5000000))# genome size estimateerror=7peak=36g_est = round(sum(as.numeric(spec1_21[error:nrow(spec1_21),1]*spec1_21[error:nrow(spec1_21),2]))/peak, digits=0)g_est# [1] 139970181# Single copy sumsinglecopy=60singleC = round(sum(as.numeric(spec1_21[error:singlecopy,1]*spec1_21[error:singlecopy,2]))/peak, digits=0)singleC# [1] 116938146# Percent in single copypsinglecopy <- round(singleC/g_est, digits=2)psinglecopy# [1] 0.84# Estimated coveragetotal_bp <- 5319759793cov <- round(total_bp/g_est, digits=2)cov# [1] 38.01

Genomescope

Genomescope is an R script that uses k-mer frequencies generated from raw read data to estimate the genome size, abundance of repetitive elements and rate of heterozygosity.

There is also an online version, which offers all of the same functionality in an easy to use web interface

Running on the command line

Rscript genomescope.R histogram_file k-mer_length read_length output_dir [kmer_max] [verbose]

cd /share/workshop/genome_assembly/$USERgit clone https://github.com/schatzlab/genomescope.gitmodule load RRscript ./genomescope/genomescope.R reads.histo 21 19000 genomescope_reads

The Rscript generates a directory called “genomescope_reads” with the following files

model.txt
plot.log.png
plot.png
progress.txt
summary.txt

The summary.txt show the following results

GenomeScope version 1.0k = 21property min maxHeterozygosity 0.645301% 0.647448%Genome Haploid Length 134,957,059 bp 134,998,768 bpGenome Repeat Length 20,721,599 bp 20,728,003 bpGenome Unique Length 114,235,461 bp 114,270,765 bpModel Fit 96.7426% 97.8227%Read Error Rate 0.182922% 0.182922%

The two plots allow you to visualize the K-mer frequencies

The large number of unique (near unique) K-mers that have a frequency of 1-7 (the extreme left side of the graph) is due to technical error in sequencing. The big peak at 36 in the graph above is in fact the hom*ozygous portions of the genome that account for the identical 21-mers from both strands of the DNA. The dotted line corresponds to the predicted center of that peak. The small peak/shoulder to the left of the peak corresponds to the heterozygous portions of the genome that accounts for different 21-mers from each strand of the DNA. The two dotted lines to the right of the main peak (at coverage = 54 and 72) are the duplicated heterozygous regions and duplicated hom*ozygous regions and correspond to two smaller peaks. The shape of these peaks are affected by the sequencing errors and sequencing duplicates.

Genomescope model

Genomescope uses an equation that accurately models the shape and size of the K-mer graph using four negative binomial peaks whose shape and size are determined by % heterozygosity, % sequencing duplication, % sequencing Error. All very useful pieces of information to learn about your genome from your raw data.

For polyploid organisms, they are working on genomescope2.0https://github.com/tbenavi1/genomescope2.0

Continued Exploration

Given the estimated genome size and sum of read length = 5B basepairs, what is the coverage est.
What happens to the estimates and plot if you select 0.25% (68,000), 0.50% (136,000), 0.75% (204,000) of the reads in the file? File has 272,412 reads in it. These subsets can be found here:
- /share/workshop/genome_assembly/pacbio_2020_data_drosophila/hifi_long_read_data/ELF_19kb.m64001_190914_015449.Q20.9X.fasta
- /share/workshop/genome_assembly/pacbio_2020_data_drosophila/hifi_long_read_data/ELF_19kb.m64001_190914_015449.Q20.19X.fasta
- /share/workshop/genome_assembly/pacbio_2020_data_drosophila/hifi_long_read_data/ELF_19kb.m64001_190914_015449.Q20.28X.fasta
What does changing the K-mer size do, for low (~13) and high (~31)?
What does Illumina data look like? Illumina data for the parents can be found here:
/share/workshop/genome_assembly/pacbio_2020_data_drosophila/short_read_data
What does high noise Oxford Nanopore data look like?
jellyfish count -C -m 21 -s 1000000000 -t 8 <(cat /share/biocore/workshops/Genome-Assembly-Workshop-Jun2020/Nanopore/00-RawData/fastq_pass/*.fastq) -o ONT.jf

This one takes a long time to run

All of the resulting histo files can be found here:
/share/biocore/workshops/Genome-Assembly-Workshop-Jun2020/kmers