From Blind Screens to Base-Perfect Genomes#

A 25-Year Journey Through C. elegans Genomics#

Prerequisites: Session (1) Introduction to Bioinformatics File Formats


The C. elegans Genome Reference Project#

Before we begin, it is worth pausing on what made this story possible.

In 1998, an international consortium published the complete genome sequence of Caenorhabditis elegans, the first time any multicellular organism had its genome read from end to end. The paper appeared in Science and represented over a decade of coordinated sequencing effort across the Sanger Centre (UK) and Washington University (USA). At the time, the human genome project was still three years from its first draft.

The C. elegans Sequencing Consortium (1998). Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 282(5396), 2012–2018. https://doi.org/10.1126/science.282.5396.2012

The genome was 97 Mb across six chromosomes, encoding approximately 19,000 protein-coding genes, more than anyone expected from such a small animal. It established C. elegans as a reference point for comparative genomics and validated the strategy of using a small, tractable organism to develop methods that would later be applied to human and agricultural species.

But as this journal club will show, “complete” in 1998 was not the same as “complete” in 2025. Every act in this session is a chapter in the ongoing story of what it really means to sequence a genome.


Pre-Session Preparation#

You should have completed Session 1 and already have these files in your data/ directory:

data/
├── WS220.64.fa
├── c_elegans.PRJNA13758.WS295.genomic.fa.gz
├── N2_proof_of_principle.bam
├── ot266_proof_of_principle.bam
├── WI.20250625.hard-filter.vcf.gz
├── c_elegans.PRJNA13758.WS295.annotations.gff3.gz
└── SRR065390_1.fastq.gz

Download the additional file needed before the session begins:

# CGC1 — first telomere-to-telomere C. elegans genome (Ichikawa et al. 2025)
# Available via NCBI under BioProject PRJNA1103966
wget -P data/ "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/036/399/295/GCA_036399295.1_CGC1/GCA_036399295.1_CGC1_genomic.fna.gz"

ls -lh data/GCA_036399295.1_CGC1_genomic.fna.gz
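An interrupted download can still produce a file that looks plausible in ls output. The sketch below shows one way to check that a gzip file is intact before the session, using gzip -t on a small toy file. The file names and the check_gz helper are hypothetical and not part of the course materials.

```shell
# Toy stand-in for a downloaded genome; the file names are hypothetical.
tmpdir=$(mktemp -d)
printf '>chrToy\nACGTACGT\n' | gzip -c > "$tmpdir/toy.fna.gz"

# check_gz prints OK if the gzip stream decompresses cleanly, CORRUPT otherwise.
check_gz() {
    if gzip -t "$1" 2>/dev/null; then
        echo "OK"
    else
        echo "CORRUPT"
    fi
}

check_gz "$tmpdir/toy.fna.gz"

# Simulate an interrupted download by keeping only the first 20 bytes.
head -c 20 "$tmpdir/toy.fna.gz" > "$tmpdir/truncated.fna.gz"
check_gz "$tmpdir/truncated.fna.gz"
```

The same helper could be pointed at data/GCA_036399295.1_CGC1_genomic.fna.gz once the wget command has finished.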

Why Does a Worm Matter to You?#

C. elegans is a 1 mm soil nematode. It is not a crop. It is not livestock. So why is it the organism we use to teach genomics?

The answer is conservation. The signalling pathways that control development, metabolism, and ageing in C. elegans are the same ones operating in your study organisms:

| Pathway | C. elegans gene | Human/crop equivalent | What it controls |
|---|---|---|---|
| Wnt | bar-1, pop-1 | WNT, TCF | Cell fate, axis formation |
| TGF-β | dbl-1, sma-2 | BMP, SMAD | Body size, immunity |
| Insulin/IGF-1 | daf-2, daf-16 | INSR, FOXO | Lifespan, stress response |
| RNAi | dcr-1, rde-1 | DICER, AGO | Gene silencing |

When researchers discovered that daf-2 mutations doubled C. elegans lifespan, the same pathway was subsequently shown to regulate ageing in mice. When plant scientists developed RNAi as a gene silencing tool, they were building on mechanisms first characterised in C. elegans (Fire & Mello, Nobel Prize 2006).

The compact genome also means the file sizes are manageable. The entire C. elegans genome fits in a FASTA file you can download in seconds. A single chromosome is smaller than most FASTQ files from a single Illumina lane. This makes it the ideal teaching organism, not because it is simple, but because the small scale lets you see the whole picture clearly.

Discussion question: Can you name one pathway or gene family from your own research that has a C. elegans homolog?


Something is Wrong with This Worm#

Year: 2010 | Setting: A forward genetics screen for dopaminergic neuron mutants

Background#

Doitsidou M, Poole RJ, Sarin S, Bigelow H, Hobert O (2010). C. elegans mutant identification with a one-step whole-genome-sequencing and SNP mapping strategy. PLoS ONE, 5(11), e15435. https://doi.org/10.1371/journal.pone.0015435

Minevich G, Park DS, Blankenberg D, Poole RJ, Hobert O (2012). CloudMap: a cloud-based pipeline for analysis of mutant genome sequences. Genetics, 192(4), 1249–1269. https://doi.org/10.1534/genetics.112.144204

A lab is running a forward genetics screen. They mutagenise thousands of worms with EMS, a chemical that introduces random point mutations, and look for animals with abnormal neuronal identity. One mutant, ot266, has ectopic dopaminergic neurons: cells that have taken on the wrong identity. Something in the genome has gone wrong.

The traditional approach would be years of mapping crosses. Instead, in 2010, this lab did something new: they crossed ot266 into the Hawaiian strain CB4856, let the mutant chromosome recombine with Hawaiian DNA, sequenced the whole genome, and used the pattern of Hawaiian versus N2 SNPs to locate the mutation. The causal variant, a premature stop codon in vab-3 (a Pax-6 homolog), was sitting at chromosome X position 10,517,587.
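The mapping logic can be sketched on toy data. Animals selected for the mutant phenotype carry N2-derived DNA around the causal mutation, so the share of reads showing the Hawaiian allele should drop toward zero near it. The table, the bin size, and the hawaiian_fraction_by_bin helper below are invented for illustration; this is not the pipeline used in the paper.

```shell
# Toy table: chromosome, position, reads with Hawaiian allele, total reads.
# All values are invented for illustration.
snps='X 1000000 12 24
X 3000000 9 22
X 5000000 6 25
X 9000000 1 23
X 10500000 0 26
X 11000000 0 21
X 13000000 4 24
X 16000000 10 20'

# hawaiian_fraction_by_bin: group SNPs into fixed-size bins and print
# chromosome, bin start, and the pooled Hawaiian allele fraction per bin.
hawaiian_fraction_by_bin() {
    bin_size=$1
    awk -v bin="$bin_size" '
        {
            start = int($2 / bin) * bin
            key = $1 "\t" start
            hw[key] += $3
            tot[key] += $4
        }
        END {
            for (k in tot) printf "%s\t%.3f\n", k, hw[k] / tot[k]
        }' | sort -k1,1 -k2,2n
}

printf '%s\n' "$snps" | hawaiian_fraction_by_bin 4000000

# The bin with the lowest Hawaiian fraction is the candidate interval.
candidate=$(printf '%s\n' "$snps" | hawaiian_fraction_by_bin 4000000 |
    sort -k3,3n | head -n 1 | cut -f2)
echo "candidate bin starts at: $candidate"
```

In this toy table the lowest fraction falls in the bin starting at 8,000,000, which is the bin that would contain a mutation at position 10,517,587.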

You have the same BAM files they used. Let’s find it.

Learning objectives#

After this exercise, you will be able to:

  • Sort and index a BAM file

  • Interpret samtools flagstat output

  • Query a specific genomic region in a BAM file

  • Describe how SNP mapping with two strains localises a mutation

Hands-on exercises#

# Sort and index both BAM files
samtools sort data/N2_proof_of_principle.bam -o data/N2.sorted.bam
samtools sort data/ot266_proof_of_principle.bam -o data/ot266.sorted.bam
samtools index data/N2.sorted.bam
samtools index data/ot266.sorted.bam

# Compare overall alignment statistics
samtools flagstat data/N2.sorted.bam
samtools flagstat data/ot266.sorted.bam

# Check read distribution across chromosomes
samtools idxstats data/N2.sorted.bam
samtools idxstats data/ot266.sorted.bam

# Zoom into the mutation site — Chromosome X position 10,517,587
samtools view data/N2.sorted.bam    CHROMOSOME_X:10515000-10520000 | head -10
samtools view data/ot266.sorted.bam CHROMOSOME_X:10515000-10520000 | head -10

# Count reads covering the exact mutation position
samtools view -c data/N2.sorted.bam    CHROMOSOME_X:10517587-10517587
samtools view -c data/ot266.sorted.bam CHROMOSOME_X:10517587-10517587
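The categories that samtools flagstat reports are tallies of bits in the FLAG field, the second column of each line printed by samtools view. As a minimal sketch, the decode_flag helper below (a hypothetical name, not a samtools command) spells out which bits are set in a single FLAG value, using the bit definitions from the SAM specification.

```shell
# decode_flag: print the names of the SAM FLAG bits set in an integer.
decode_flag() {
    flag=$1
    out=""
    for pair in \
        1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
        16:reverse 32:mate_reverse 64:read1 128:read2 \
        256:secondary 512:qc_fail 1024:duplicate 2048:supplementary
    do
        bit=${pair%%:*}     # numeric bit value before the colon
        name=${pair#*:}     # label after the colon
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
    done
    echo "${out:-none}"
}

decode_flag 99    # a typical first read of a properly paired fragment
decode_flag 4     # an unmapped read
```

Trying it on a few FLAG values copied from the samtools view output above is a quick way to connect individual reads to the flagstat summary.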

Discussion questions#

  1. Both BAM files were aligned to the same reference (WS220). Why is it important that the reference is identical when comparing two samples?

  2. A premature stop codon in vab-3 (a Pax-6 homolog) causes ectopic neuron identity. PAX6 is conserved across nearly all animals. Can you think of a phenotype in your study organism that might have an analogous cause?

  3. If you had a BAM file from your own organism and suspected a causal variant in a specific gene, what would your first samtools command be?

The mutation was found. The paper was published. The reference genome (WS220) did its job. But a question was quietly building in the community: what exactly was WS220 built from? The N2 strain used in every lab is not a single, carefully preserved clone. It has been growing on petri dishes for decades, accumulating drift mutations, in labs all over the world. In 2019, a team decided to find out how different the reference actually was from the real organism.


But What Was the Reference Built On?#

Year: 2008 → 2019 | Setting: Validating and rebuilding the reference genome

Background#

Hillier LW, Marth GT, Quinlan AR, Dooling D, Fewell G, Barnett D, … Wilson RK (2008). Whole-genome sequencing and variant discovery in C. elegans. Nature Methods, 5(2), 183–188. https://doi.org/10.1038/nmeth.1179

Yoshimura J, Ichikawa K, Shoura MJ, Artiles KL, Gabdank I, Wahba L, … Fire AZ (2019). Recompleting the Caenorhabditis elegans genome. Genome Research, 29(6), 1009–1022. https://doi.org/10.1101/gr.244830.118

Cook DE, Zdraljevic S, Roberts JP, Andersen EC (2017). CeNDR, the Caenorhabditis elegans natural diversity resource. Nucleic Acids Research, 45(D1), D650–D657. https://doi.org/10.1093/nar/gkw1019

The WS220 reference genome was assembled from a strain called N2, the standard lab strain. But N2 is not a single clone. Different labs received their N2 stock at different times, and worms accumulate spontaneous mutations every generation. The original sequencing was done with Sanger reads from cloned fragments, long before long-read sequencing existed, leaving gaps and errors in repetitive regions.

In 2019, Yoshimura and colleagues resequenced the VC2010 strain — a controlled, pedigreed N2 derivative — using PacBio long reads. Their assembly was 1.8 Mb larger than WS220. That is 1.8 megabases of C. elegans genome that had been missing from every analysis done against WS220 for the previous two decades.

We will now compare the reference recorded in the WS220-aligned BAM header with the WS295 reference sequence and its annotation.

Learning objectives#

After this exercise, you will be able to:

  • Compare two reference genome assemblies using basic shell commands

  • Interpret differences in chromosome sizes between reference versions

  • Use samtools view -H to identify which reference a BAM was aligned to

  • Explain why reference genome version matters for reproducibility

Hands-on exercises#

# Compare the sequences in WS220 and WS295
# WS220: names and lengths from the BAM header
samtools view -H data/N2.sorted.bam | grep "^@SQ"

# WS295: sequence names from the FASTA headers
# (gzip -dc is used throughout because it is portable across Linux and macOS)
gzip -dc data/c_elegans.PRJNA13758.WS295.genomic.fa.gz | grep "^>" | head -10

# Count the sequences in each
samtools view -H data/N2.sorted.bam | grep -c "^@SQ"
gzip -dc data/c_elegans.PRJNA13758.WS295.genomic.fa.gz | grep -c "^>"

# How many annotated genes in WS295?
gzip -dc data/c_elegans.PRJNA13758.WS295.annotations.gff3.gz \
    | grep -v "^#" | awk '$3 == "gene"' | wc -l

# What chromosomes are in the annotation?
gzip -dc data/c_elegans.PRJNA13758.WS295.annotations.gff3.gz \
    | grep -v "^#" | awk '$3 == "gene"' | cut -f1 | sort -u

# Explore the raw WGS reads used to validate the original reference
gzip -dc data/SRR065390_1.fastq.gz | wc -l    # divide by 4 for read count
# List the distinct read lengths, sorted numerically
gzip -dc data/SRR065390_1.fastq.gz | awk 'NR%4==2 {print length($0)}' | sort -nu
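Comparing two references by eye from their @SQ lines gets tedious quickly. The sketch below shows one way to line up sequence names and lengths from two SAM headers and report only the sequences whose lengths differ. The headers, sequence names, and the sq_lengths and compare_sq helpers are invented for illustration. With real data, the two references may also name their chromosomes differently, in which case the names would need to be mapped first, because join silently drops names found in only one header.

```shell
# Two toy headers; sequence names and lengths are invented for illustration.
header_a=$(printf '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chrToyI\tLN:1000\n@SQ\tSN:chrToyX\tLN:2500\n')
header_b=$(printf '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chrToyI\tLN:1000\n@SQ\tSN:chrToyX\tLN:2650\n')

# sq_lengths: read a SAM header on stdin, print "name<TAB>length" per @SQ line.
sq_lengths() {
    awk -F'\t' '$1 == "@SQ" {
        name = ""; len = ""
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^SN:/) name = substr($i, 4)
            if ($i ~ /^LN:/) len = substr($i, 4)
        }
        print name "\t" len
    }' | sort -k1,1
}

# compare_sq: print name, length in A, length in B, and the difference
# for every sequence whose length differs between the two headers.
compare_sq() {
    a=$(mktemp); b=$(mktemp)
    printf '%s\n' "$1" | sq_lengths > "$a"
    printf '%s\n' "$2" | sq_lengths > "$b"
    join -t "$(printf '\t')" "$a" "$b" |
        awk -F'\t' '$2 != $3 {print $1, $2, $3, $3 - $2}'
    rm -f "$a" "$b"
}

compare_sq "$header_a" "$header_b"
```

For real files, the first argument could be the output of samtools view -H on a BAM, passed in as "$(samtools view -H file.bam)".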

Discussion questions#

  1. The header of a BAM file records which reference it was aligned to. Why is this critical information when sharing data with a collaborator?

  2. Your study organism probably has multiple reference genome versions. What problems could arise if two collaborators in the same project use different versions?

  3. Yoshimura et al. found 1.8 Mb of sequence missing from WS220. What types of sequence are most likely to be missing or mis-assembled in short-read references?

WS295 is better than WS220. But even WS295 has gaps, regions where repetitive sequences defeated short-read assembly. The centromeres, telomeres, and ribosomal DNA arrays were still missing or fragmented. In 2025, a new technology and a new team finally closed them all.


The Complete Worm, Finally#

Year: 2025 | Setting: Telomere-to-telomere assembly

Background#

Ichikawa K, Yoshimura J, Imai T, Aoki M, Abe M, Shibata T, Toyoda A, Fujiyama A (2025). CGC1: the first complete, gapless C. elegans genome. Genome Research, 35, 583–595. https://doi.org/10.1101/gr.280274.124

In 2025, Ichikawa and colleagues published CGC1, the first truly complete C. elegans genome. 106.4 megabases. Zero gaps. Every telomere resolved. A 772 kilobase ribosomal DNA array that had been invisible in every previous assembly. Forty-three tandem repeat regions that had resisted sequencing for decades.

This was made possible by combining two technologies: PacBio HiFi reads, which are long and highly accurate, and ultralong Oxford Nanopore reads, which can span even the largest repeat arrays. The result is what the 1998 consortium was aiming for: a base-perfect reference.

We will now compare the size of this assembly with the two references you already know, and explore what the extra sequence represents.

Learning objectives#

After this exercise, you will be able to:

  • Download a genome assembly from NCBI using wget

  • Compare chromosome counts and sizes across three assembly versions

  • Explain what telomere-to-telomere sequencing adds over previous assemblies

  • Describe the role of long-read sequencing in resolving repetitive regions

Hands-on exercises#

# Decompress and inspect the CGC1 assembly
gunzip -k data/GCA_036399295.1_CGC1_genomic.fna.gz

# How many sequences are in CGC1?
grep -c "^>" data/GCA_036399295.1_CGC1_genomic.fna

# What are their names and lengths?
awk '/^>/ {if (name) print name, len; name=$0; len=0; next}
     {len += length($0)}
     END {print name, len}' \
    data/GCA_036399295.1_CGC1_genomic.fna | head -20

# Compare total assembly sizes across all three versions
# WS220 (from BAM header)
samtools view -H data/N2.sorted.bam | grep "^@SQ" | \
    awk -F'\t' '{for(i=1;i<=NF;i++) if($i~/^LN:/) print $i}' | \
    awk -F: '{sum+=$2} END {print "WS220 total bp:", sum}'

# WS295 (sum the lengths of all non-header lines)
gzip -dc data/c_elegans.PRJNA13758.WS295.genomic.fa.gz | \
    awk '!/^>/ {sum += length($0)} END {print "WS295 total bp:", sum}'

# CGC1
awk '!/^>/ {sum += length($0)} END {print "CGC1 total bp:", sum}' \
    data/GCA_036399295.1_CGC1_genomic.fna
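Total size is only one way to compare assemblies. Unresolved regions are conventionally written as runs of N, so counting those runs gives a rough test of how gapped a reference is. The sketch below does this per sequence on a toy FASTA. The sequences and the count_gaps helper are invented for illustration, and it assumes gaps are encoded as N, which should be checked for any particular assembly.

```shell
# Toy FASTA; names and sequences are invented for illustration.
toy_fasta='>toy1
ACGTNNNNAC
GTACNNACGT
>toy2
ACGTACGTAC'

# count_gaps: read FASTA on stdin and print, per sequence,
# name, number of N runs, and total number of N bases.
count_gaps() {
    awk '
        function report() {
            if (name != "") {
                s = toupper(seq)
                runs = gsub(/N+/, "", s)    # each match is one run of Ns
                printf "%s\t%d\t%d\n", name, runs, length(seq) - length(s)
            }
        }
        /^>/ { report(); name = substr($1, 2); seq = ""; next }
        { seq = seq $0 }
        END { report() }'
}

printf '%s\n' "$toy_fasta" | count_gaps
```

Lines are joined before counting so that a gap split across two FASTA lines is counted once. The same function could be fed each of the three assemblies, for example with gzip -dc on the compressed files.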

Discussion questions#

  1. CGC1 resolved a 772 kb ribosomal DNA array. Why would a large tandem repeat array like this be invisible in short-read assemblies?

  2. Your study organism likely still has a short-read reference assembly. What regions of its genome are most likely to be missing or fragmented? How would you test this?

  3. Does having a complete reference change how you would design a variant calling experiment compared to using WS220?


Closing Module: The Bridge#

Everything you just did with C. elegans (downloading a BAM, checking quality, comparing references, asking what the genome is missing) is exactly what you do for any organism. The worm just lets us do it in 100 Mb instead of 3 Gb. Whether you work on a crop genome, an endangered animal, or a livestock breed, the file formats, the tools, and the questions are identical.

Reflection exercise#

Answer the following:

  1. One file format from this session that you will use in your own work within the next month — and what data it will hold.

  2. One samtools or bcftools command that solves a problem you currently have (or have had in the past).

  3. One question raised by this session that you want to explore further.

Group discussion#

  • Share reflections across the group.

  • Instructor maps each student’s organism to the closest C. elegans tool or dataset covered today.


Summary: The Three Genomes#

| Assembly | Year | Size | Gaps | Technology | Paper |
|---|---|---|---|---|---|
| WS220 | 1998–2008 | ~97 Mb | Many | Sanger + short reads | Consortium 1998 |
| WS295 / VC2010 | 2019 | ~100.3 Mb | Fewer | PacBio long reads | Yoshimura et al. 2019 |
| CGC1 | 2025 | 106.4 Mb | Zero | HiFi + ultralong ONT | Ichikawa et al. 2025 |


All Papers Referenced#

| Paper | Journal | DOI |
|---|---|---|
| The C. elegans Sequencing Consortium (1998). Genome sequence of the nematode C. elegans: a platform for investigating biology. | Science | 10.1126/science.282.5396.2012 |
| Doitsidou et al. (2010). C. elegans mutant identification with a one-step whole-genome-sequencing and SNP mapping strategy. | PLoS ONE | 10.1371/journal.pone.0015435 |
| Minevich et al. (2012). CloudMap: a cloud-based pipeline for analysis of mutant genome sequences. | Genetics | 10.1534/genetics.112.144204 |
| Hillier et al. (2008). Whole-genome sequencing and variant discovery in C. elegans. | Nature Methods | 10.1038/nmeth.1179 |
| Yoshimura et al. (2019). Recompleting the Caenorhabditis elegans genome. | Genome Research | 10.1101/gr.244830.118 |
| Cook et al. (2017). CeNDR, the Caenorhabditis elegans natural diversity resource. | Nucleic Acids Research | 10.1093/nar/gkw1019 |
| Ichikawa et al. (2025). CGC1: the first complete, gapless C. elegans genome. | Genome Research | 10.1101/gr.280274.124 |
| Fire A & Mello CC (2006). Nobel Prize: discovery of RNA interference. | Nobel Foundation | nobelprize.org |