Genome Size Estimation Tutorial | Computational Biology Core (2024)

How can K-mer estimation help to find genome sizes?

For a given sequence of length L, and a k-mer size of k, the total k-mer’s possible will be given by ( L – k ) + 1

e.g. In a sequence of length of 14, and a k-mer length of 8, the number of k-mer’s generated will be:

GATCCTACTGATGC ( L = 14 ) on decomposition of k-mers of length k = 8,Total number of k-mer's generated will be n = ( L - k ) + 1 = (14 - 8 ) + 1 = 7GATCCTAC, ATCCTACT, TCCTACTG, CCTACTGA, CTACTGAT, TACTGATG, ACTGATGC

For shorter fragment sizes, as in the above example, the Total number of k-mers estimated is n = 7, and it is not close to actual fragment size of L which is 14 bps. But for larger fragment size, the total number of k-mer’s (n) provide a good approximation to the actual genome size. The following table tries to illustrate the approximation:

k=18

Genome Sizes

Total K-mers of k=18

% error in genome estimation

LN=(L-K)+1
1008317
10009831.7
1000099830.17
100000999830.017
10000009999830.00171MB genome size

So for a genome size of 1 Mb, the error between estimation and reality is only .0017%. Which is a very good approximation of actual size.

In the case of, 10 copies (C) of GATCCTACTGATGC sequence, then the total no of k-mer’s (n) of length k = 8 will be 70.

n = [( L - k ) + 1 ] * C = [(14 - 8 ) + 1] * 10 = 70

To get the actual genome size, we simply need to divide the total by the number of copies:

 = n / C = 70 / 10 = 7

That will help us to understand, that we never sequence a single copy of genome but a population. Hence we end up sequencing C copies of genome. This is also referred as coverage in sequencing. To obtain actual genome size (N), divide the total k-mers (n) by coverage (C).

 N = n / C

k-mer Distribution of a Typical Real World Genome

Major issue that is faced in a real world genome sequencing projects is a non-uniform coverage of genome. This can be accounted to technical and biological variables.

example: biased amplification of certain genomic regions during PCR (a step in preparation of sequencing libraries) and presence of repetitive sequences in genome.

k-mer size:

The size of k-mers should be large enough allowing the k-mer to map uniquely to the genome (a concept used in designing primer/oligo length for PCR). Too large k-mers leads to overuse of computational resources.

In the first step, k-mer frequency is calculated to determine the coverage of genome achieved during sequencing. There are software tools like Jellyfish that helps in finding the k-mer frequency in sequencing projects. The k-mer frequency follows a pseudo-normal distribution (actually it is a Poisson distribution) around the mean coverage in histogram of k-mer counts.

Once the k-mer frequencies are calculated a histogram is plotted to visualize the distribution and to calculate mean coverage.

Genome Size Estimation Tutorial | Computational Biology Core (1)

Figure 1: (A2) An example of k-mer histogram. The x-axis (V1), is the frequency or the number of times a given k-mer is observed in the sequencing data. The y-axis (V2), is the total number of k-mers with a given frequency.

The first peak is (in red region) primarily due to rare and random sequencing errors in reads. The values in graph can be trimmed to remove reads with sequencing errors.

Genome Size Estimation Tutorial | Computational Biology Core (2)

With the assumption that k-mers are uniquely mapped to genome, they should be present only once in a genome sequence. So their frequency will reflect the coverage of the genome.

For calculation purposes we use mean coverage which is 14 in above graph. The area under the curve will represent the total number of k-mers.

So the genome estimation will be:

 N = Total no. of k-mers/Coverage = Area under curve /mean coverage(14)

Methodology

The study was conducted to estimate the genome size of the species with low coverage short read data to validate existing estimates (flow cytometry sourced) or produce a new computational estimate. The genome size is calculated from short sub-sequences (k-mers) of Illumina short read data. A larger k-mer size need to be considered for the genome estimation.

Reads should be quality controlled before the genome estimate is provided. Numerous programs exist for this purpose but Sickle (https://github.com/ucdavis-bioinformatics/sickle) is the application of choice in this tutorial. We require a minimum Phred-scaled quality value of 25 to for estimates.

Upon performing quality control, the k-mer distribution was calculated using the Jellyfish k-mer counting program. Then a histogram was constructed to perform the genome estimation using the same program.

The R statistical package is used to plot the binned distributions for the selected k-mer lengths. Initally the full data set is plotted and initial data points are often very high number due to the low frequency data points, and thus it should be avoided. Once the peak position is determined the total number of k-mers in the distribution is calculated, and then the genome size can be estimated using the peak position. In the ideal situation (or theoretically) the peak shape should be a Poisson distribution. In order to come up with a k-mer size a number of k-mer sizes are selected and genome estimation is done, where we see a regular distribution of the genome size.

Data sets used in this tutorial are available on BBC cluster:

Path: /common/Tutorial/Genome_estimationData sets: sample_read_1.fastq sample_read_2.fastqScript: GenomeEstimationScript.sh

Tutorial Outline

  1. Count k-mer occurrence using Jellyfish 2.2.6
  2. Generate histogram using R
  3. Determine the single copy region and the total k-mers
  4. Determine peak position and genome size
  5. Compare the peak shape with Poisson distribution

1. Count k-mer occurrence using Jellyfish 2.2.6

jellyfish count -t 8 -C -m 19 -s 5G -o 19mer_out --min-qual-char=? /common/Tutorial/Genome_estimation/sample_read_1.fastq /common/Tutorial/Genome_estimation/sample_read_2.fastq

options used in the counting k-mers:

-t -treads=unit32 Number of treads to be used in the run. eg: 1,2,3,..etc.-C -both-strands Count both strands-m -mer-len=unit32 Length of the k-mer -s -size=unit32 Hash size / memory allocation -o -output=string Output file name--min-quality-char Base quality value. Version 2.2.3 of Jellyfish uses the “Phred” score, where "?" = 30

If you need to look at a detailed usage of ‘count’ in jellyfish, type the following after loading the jellyfish module.

jellyfish <cmd> [options]jellyfish count --help orjellyfish count --full-help

In order to run the program Jellyfish in the bbc cluster, compose the following script and save it or you can copy GenomeEstimationScript.sh from the bbc cluster, which is located at the following path. (More on the commands used in the script can be found in here)

#!/bin/bash#$ -N jellyfish#$ -M user_email_id#$ -q highmem.q#$ -m bea#$ -S /bin/bash#$ -cwd#$ -pe smp 8#$ -o JellyFish_$JOB_ID.out#$ -e Jellyfish_$JOB_ID.errmodule load jellyfish/2.2.6jellyfish count -t 8 -C -m 19 -s 5G -o 19mer_out --min-qual-char=? /common/Tutorial/Genome_estimation/sample_read_1.fastq /common/Tutorial/Genome_estimation/sample_read_2.fastq

It will create a out put file called 19mer_out. Then using the above created file (19mer_out), data points for a histogram will be created using the following command:

jellyfish histo -o 19mer_out.histo 19mer_out

If you open the 19mer_out.histo file, it will have the frequency counting of each k-mer with the k-mer length of 19.

1 130166942 31596773 39382734 54161305 71401736 88609567 104619028 119387829 1327737210 1441950111 1523076212 1559490713 1541649914 1465569015 1344258916 1183464617 1004924018 823940819 653364120 504744521 380893022 281766523 206905024 152236525 1134562

2. Plot results using R

To visualize and to plot the data we use R Studio. Using the following command, we will load the data from the out_19mer.histo file in to a data frame called dataframe19.

dataframe19 <- read.table("19mer_out.histo") #load the data into dataframe19plot(dataframe19[1:200,], type="l") #plots the data points 1 through 200 in the dataframe19 using a line

This will create a line graph using the first 200 data points and the graph will look like the following:

Genome Size Estimation Tutorial | Computational Biology Core (3)

In general, very low frequency k-mers represent high numbers that would skew the y –axis. If we look at the data points in the histogram file, we can see that the very first data point has a exceptionally high value than the next (second) data point. So we remove just the first line and re-plot using R. From now on we will disregard the first data point for our calculations.

plot(dataframe19[2:200,], type="l")

Genome Size Estimation Tutorial | Computational Biology Core (4)

3. Determine the single copy region and the total number of k-mers

To determine the cut off points in the single copy region, we need to see the actual data point positions in the graph. In the initial examination the peak starts from the 2nd data point onward. So we will disregard the first data point in determining the single copy region.

Now using R we will re-plot the graph to determine the single copy region. Then we will include the points for clarity on the same graph.

plot(dataframe19[2:100,], type="l") #plot line graph points(dataframe19[2:100,]) #plot the data points from 2 through 100

Genome Size Estimation Tutorial | Computational Biology Core (5)

According to the graph, the single copy region would be between points 2 and 28.

Assuming the total number of data points is 9325, we can now calculate the total k-mers in the distribution.

sum(as.numeric(dataframe19[2:9325,1]*dataframe19[2:9325,2]))

It will produce the results as:

[1] 3667909811

4. Determine peak position and genome size

From the plotted graph we can get an idea where the peak position lies, and it should be between 5-20 data points. Now to determine the peak, thus, we need to look at the actual data points in that region. Using the below command we will examine the actual data points between 5 and 20.

data[10:20,] V1 V25 5 71401736 6 88609567 7 104619028 8 119387829 9 1327737210 10 1441950111 11 1523076212 12 1559490713 13 1541649914 14 1465569015 15 1344258916 16 1183464617 17 1004924018 18 823940819 19 6533641

In this case the peak is at, 12. So the Genome Size can be estimated as:

sum(as.numeric(dataframe19[2:9325,1]*dataframe19[2:9325,2]))/12

Where the genome size is:

[1] 305659151

~ 305 Mb

It would be interesting to see the proportionality of the single copy region compared to the total genome size. In this data set the single copy region is between data point 2 and 28, So the size of single copy region can be roughly calculated as:

sum(as.numeric(dataframe19[2:28,1]*dataframe19[2:28,2]))/12

[1] 213956126

~ 213 Mb

The proportion can be calculated as:

(sum(as.numeric(dataframe19[2:28,1]*dataframe19[2:28,2]))) / (sum(as.numeric(dataframe19[2:9325,1]*dataframe19[2:9325,2])))

[1] 0.6999827

~ 70 %

5. Compare the peak shape with Poisson distribution

Since we have a nice curve, we can compare our curve to a theoretical curve, which is the Poission distribution.

singleC <- sum(as.numeric(dataframe19[2:28,1]*dataframe19[2:28,2]))/12poisdtb <- dpois(1:100,12)*singleCplot(poisdtb, type='l', lty=2, col="green")lines(dataframe19[1:100,12] * singleC, type = "l", col=3)#, Ity=2)lines(dataframe19[1:100,],type= "l")

Genome Size Estimation Tutorial | Computational Biology Core (6)

Likewise the procedure can be iterated across different number of k-mers.

K-mer length19212325272931
total No of fields9741965394369325916089398818
Total K-mer count3.66E+094.4E+094.53E+094.67E+094.26E+094.44E+093.98E+09
Genome size3.05E+082.9E+083.02E+083.14E+083.04E+083.41E+083.06E+08
single copy region2.13E+082E+082.08E+082.25E+082.21E+082.36E+082.3E+08
Proportion0.699980.694030.6889150.7161510.7258140.6902770.750801
Genome Size Estimation Tutorial | Computational Biology Core (2024)

References

Top Articles
Latest Posts
Article information

Author: Stevie Stamm

Last Updated:

Views: 5893

Rating: 5 / 5 (80 voted)

Reviews: 95% of readers found this page helpful

Author information

Name: Stevie Stamm

Birthday: 1996-06-22

Address: Apt. 419 4200 Sipes Estate, East Delmerview, WY 05617

Phone: +342332224300

Job: Future Advertising Analyst

Hobby: Leather crafting, Puzzles, Leather crafting, scrapbook, Urban exploration, Cabaret, Skateboarding

Introduction: My name is Stevie Stamm, I am a colorful, sparkling, splendid, vast, open, hilarious, tender person who loves writing and wants to share my knowledge and understanding with you.