Introduction to Bioinformatics File Formats#
Learning Objectives#
By the end of this session, you will be able to:
Identify common bioinformatics file formats and their use cases
Work with FASTA and FASTQ files
Learn how to install and run your first bioinfomatics tool
Understand quality scores in sequencing data
Navigate BED and GFF annotation formats
Apply best practices for organising bioinformatics projects
Before We Start: Installing the Tools#
Before downloading any data, we need to install the software we will use throughout this session. The bioinformatics community has standardised on conda, a package manager that installs tools along with all their dependencies automatically.
Step 1: Install Miniconda#
Miniconda is a lightweight version of Anaconda. Follow the installation instructions for your operating system.
Once installed, verify it works:
conda --version
Step 2: Install the bioinformatics tools#
# samtools — for working with FASTA, BAM, and CRAM files
conda install bioconda::samtools
# bcftools — for working with VCF files
conda install bioconda::bcftools
# SRA Toolkit — for downloading raw reads from NCBI SRA
conda install -c bioconda sra-tools
Verify each installation:
samtools --version
bcftools --version
fasterq-dump --version
On an HPC cluster: Your system administrator may have already installed these tools as modules. Check with
module avail samtoolsbefore installing via conda.
Downloading the Training Data#
Now that the tools are installed, we can download the data. All files in this session come from Caenorhabditis elegans, a 1 mm soil nematode and the first multicellular organism to have its genome completely sequenced. Its compact ~100 Mb genome makes the files small enough to download and explore in minutes, and every command you run here works identically on a sheep, wheat, or camel genome, just with larger files.
The BAM files are from a 2010 forward genetics screen (Doitsidou et al., PLoS ONE) that used whole-genome sequencing to find a mutation in a worm mutant called ot266. You will revisit these files in the journal club session.
How to download files: wget#
wget fetches a file directly from a URL to your current directory. It is pre-installed on most Linux systems.
wget <URL> # download to current directory
wget -P data/ <URL> # save into a specific directory
wget -c <URL> # resume an interrupted download
How to download raw reads: fasterq-dump#
Raw sequencing reads are stored in NCBI’s Sequence Read Archive (SRA). Each dataset has an accession number (e.g. SRR065390). Use fasterq-dump to download them:
fasterq-dump --split-files SRR065390 # download all reads
fasterq-dump --split-files -X 100000 SRR065390 # first 100,000 reads only
The -X flag limits how many reads are downloaded — useful when you only need a subset to explore the format.
Download all session files#
# Create a data directory
mkdir -p data
cd data
# 1. MiModD teaching archive — N2 and ot266 BAM files + matched WS220 reference
# Source: Minevich et al. 2012, Genetics
wget -O worm_ot266.tar.gz \
"https://sourceforge.net/projects/mimodd/files/Example%20Datasets/worm_ot266.tar.gz/download"
tar -xzvf worm_ot266.tar.gz
# 2. Current C. elegans reference genome from WormBase (release WS295)
# The WS220.64.fa above is kept only to match the BAM files from the 2010 study.
# For any new analysis always use the latest reference from WormBase.
wget https://downloads.wormbase.org/releases/WS295/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS295.genomic.fa.gz
# 3. C. elegans variant calls (VCF) from CaeNDR
wget https://caendr-open-access-data-bucket.s3.us-east-2.amazonaws.com/dataset_release/c_elegans/20250625/variation/WI.20250625.hard-filter.vcf.gz
wget https://caendr-open-access-data-bucket.s3.us-east-2.amazonaws.com/dataset_release/c_elegans/20250625/variation/WI.20250625.hard-filter.vcf.gz.tbi
# 4. WormBase gene annotation, release WS295
wget https://downloads.wormbase.org/releases/WS295/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS295.annotations.gff3.gz
# 5. N2 wild-type WGS reads — first 100,000 reads (Hillier et al. 2008)
cd ..
fasterq-dump --split-files -X 100000 -O data/ SRR065390
gzip data/SRR065390_1.fastq data/SRR065390_2.fastq
# Confirm all files are present
ls -lh data/
Why two reference genomes?
WS220.64.facomes bundled with the BAM files — they were aligned against it, so it must be kept for BAM-related exercises.WS295is the current release and should be used for any new analysis or annotation work. Always match your reference to the one used during alignment.
File |
Format |
Description |
|---|---|---|
|
FASTA |
C. elegans reference genome, WormBase WS220 (matches BAM files) |
|
FASTA |
C. elegans reference genome, WormBase WS295 (current) |
|
BAM |
Wild-type N2 WGS reads aligned to WS220 |
|
BAM |
ot266 mutant WGS reads aligned to WS220 |
|
VCF |
C. elegans natural variants (CaeNDR) |
|
GFF3 |
Gene annotations, WormBase WS295 |
|
FASTQ |
N2 WGS reads, 100k read subset |
1. Common Bioinformatics File Formats#
1.1 Sequence Data Formats#
Nucleotide, protein, and reference genome sequences are stored in two plain-text formats widely used in bioinformatics: FASTA and FASTQ. We will discuss both file formats below and explore tools for working with data in these formats.
FASTA Format#
FASTA is a plain text format and one of the simplest and most widely used formats for storing nucleotide, protein sequences, coding DNA sequences (CDS), transcript sequences, reference genome files, and so on. It originates from the FASTA alignment suite, created by William R. Pearson and David J. Lipman.
Structure: FASTA files are composed of sequence entries, each containing a description and the sequence data.
The description line (header line) starts with the greater than symbol
>followed by a sequence identifier and descriptionSequence data follows on subsequent lines and continues until there is another description line starting with
>
>sequence_id description
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
Example from our training data:
# View the first two lines of the reference genome
head -2 data/WS220.64.fa
# List all chromosome names
grep "^>" data/WS220.64.fa
# Count how many sequences (chromosomes) are in the file
grep -c "^>" data/WS220.64.fa
FASTQ Format#
FASTQ extends FASTA by including quality scores for each base. This is the standard output format for most sequencing data produced by modern high-throughput sequencing platforms. The per-base quality score indicates the confidence for each base call.
Structure (4 lines per read):
Header line starting with
@followed by sequence identifier and other informationRaw sequence
Separator line starting with
+Quality scores (same length as sequence)
Example from our training data:
# View the first read from the training FASTQ file
gzcat data/SRR065390_1.fastq.gz | head -4
@SRR065390.1 1/1
TCCTGCGTTGATTGAAAGAAATCGAAATGGAAATGGTTCATATTTTGTTTT...
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...
Quality Score Reference#
You will hear about the quality score of sequenced reads, as different sequencing instruments produce reads with varying quality metrics.
Phred Score |
Error Probability |
Accuracy |
|---|---|---|
10 |
1 in 10 |
90% |
20 |
1 in 100 |
99% |
30 |
1 in 1,000 |
99.9% |
40 |
1 in 10,000 |
99.99% |
Rule of thumb: Q30 or higher is generally considered high quality for Illumina data.
ASCII Encoding: Quality scores are encoded as ASCII characters. For Illumina 1.8+ (Phred+33):
!= 0 (worst)I= 40 (excellent)#= 2 (very poor, often indicates N base call)
The fastq or fasta files contains million of short/long reads from a high-throughput sequencer where their genomic position, function and consequences relative to the reference genome is not known. This is a result of digesting the nucleic acid during library preperation, where DNA or RNA is cut into a particular size, adapters are added and then loaded to be sequenced. There isn’t a genomic sequencer that is capable of sequencing a genome on it’s entire length, despite the fact that a good progress has been made with pore sequencing. However, reconstructing reads in respect to a genome reference is still required in a process commpnly known as alignemnt or mapping.
1.2 Alignment Formats#
SAM/BAM Format#
SAM (Sequence Alignment/Map) is a text format for storing sequence alignments. BAM is the compressed binary version.
Key fields in SAM format:
QNAME: Query (read) name
FLAG: Bitwise flag (paired, mapped, strand, etc.)
RNAME: Reference sequence name
POS: 1-based leftmost mapping position
MAPQ: Mapping quality
CIGAR: Alignment description string
RNEXT: Reference name of mate/next read
PNEXT: Position of mate/next read
TLEN: Template length
SEQ: Segment sequence
QUAL: Quality scores
CRAM Format#
CRAM is a reference-based compressed columnar format for storing sequence alignments, developed as a more space-efficient alternative to BAM. It achieves 30-60% smaller file sizes through reference-based compression and advanced encoding schemes.
Key Differences from BAM Reference dependency: CRAM requires the reference genome (.fasta) for both reading and writing, as it stores only differences from the reference rather than full sequences. Compression modes: Supports both lossless (preserving all data) and lossy compression (discarding quality scores or other fields for additional space savings).
Storage architecture: Uses columnar storage organized into containers and slices, rather than sequential read-by-read storage like SAM/BAM.
CRAM File Contents
CRAM files contain the same alignment information as SAM/BAM but organized differently:
File header: SAM-compatible header including reference sequences, read groups, and program information, plus reference genome MD5 checksum for validation.
Containers and slices: Data organized into hierarchical containers (units of compression) containing slices (subsets of reads from specific genomic regions).
Alignment fields: Same core fields as SAM/BAM: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR RNEXT, PNEXT, TLEN SEQ (stored as differences from reference) QUAL (optionally preserved or discarded) Optional tags
Compression codecs: Multiple algorithms (gzip, bzip2, lzma, rANS) applied per field type for optimal compression.
Example from our training data:
In order to view or manipualte the alignment file we need to use a specialised bioinfomatics tools developed for interacting and post processing of genomic DNA aligments in the SAM, BAM and CRAM format. Samtools is a suite of programs for interacting with high-throughput sequencing read alignments.
# Investigate samtools command line options
samtools
# View alignments from the C. elegans BAM files
samtools view data/N2_proof_of_principle.bam | head -2
samtools view data/ot266_proof_of_principle.bam | head -2
Understanding CIGAR Strings#
The CIGAR string describes how a read aligns to the reference. Each operation is represented by a number followed by a letter:
Operation |
Description |
Consumes Query |
Consumes Reference |
|---|---|---|---|
M |
Match or mismatch |
Yes |
Yes |
I |
Insertion to reference |
Yes |
No |
D |
Deletion from reference |
No |
Yes |
N |
Skipped region (intron) |
No |
Yes |
S |
Soft clipping |
Yes |
No |
H |
Hard clipping |
No |
No |
Examples:
50M- 50 bases match/mismatch30M2I18M- 30 matches, 2 base insertion, 18 matches25M100N25M- 25 matches, 100 base intron (RNA-seq), 25 matches16S109M- 16 bases soft clipped, 109 matches (seen in our BAM file)
BAM vs CRAM: Choosing the Right Format#
Feature |
BAM |
CRAM |
|---|---|---|
Compression |
~25% of SAM |
~50% of BAM |
Speed |
Faster |
Slower |
Reference required |
No |
Yes (for full compression) |
Tool support |
Universal |
Growing |
Best for |
Active analysis |
Long-term storage |
Recommendation: Use BAM during active analysis, convert to CRAM for archival storage.
# Convert BAM to CRAM
samtools view -T reference.fasta -C -o output.cram input.bam
# Convert CRAM back to BAM
samtools view -T reference.fasta -b -o output.bam input.cram
Working with SAM/BAM Files#
samtools is the standard tool for manipulating alignment files: Here are common command line tools for using samtools
# View BAM file header
samtools view -H data/N2_proof_of_principle.bam
# Convert SAM to BAM
samtools view -bS aligned.sam > aligned.bam
# Sort BAM file by coordinate
samtools sort aligned.bam -o aligned.sorted.bam
# Index sorted BAM file (required for most downstream tools)
samtools index aligned.sorted.bam
# View alignments in a specific region
samtools view aligned.sorted.bam CHROMOSOME_X:10000000-11000000
# Get alignment statistics
samtools flagstat data/N2_proof_of_principle.bam
samtools flagstat data/ot266_proof_of_principle.bam
# Calculate coverage depth
samtools depth aligned.sorted.bam > coverage.txt
1.3 Variant Formats#
VCF (Variant Call Format)#
VCF is the standard format for storing genetic variant data. It is used by virtually all variant calling pipelines.
Structure:
Meta-information lines (starting with
##)Header line (starting with
#)Data lines (one variant per line)
Required columns:
CHROM - Chromosome
POS - Position (1-based)
ID - Variant identifier (e.g., rs number)
REF - Reference allele
ALT - Alternative allele(s)
QUAL - Phred-scaled quality score
FILTER - Filter status (PASS or failed filter names)
INFO - Additional information (key=value pairs)
Example from our training data:
# View VCF header
bcftools view -h data/celegans_variants.vcf.gz | head -20
# View first few variants
bcftools view -H data/celegans_variants.vcf.gz | head -3
Understanding Genotypes in VCF#
The genotype field (GT) uses numbers to represent alleles:
Genotype |
Meaning |
Description |
|---|---|---|
0/0 |
Homozygous reference |
Both alleles match reference |
0/1 |
Heterozygous |
One reference, one alternate |
1/1 |
Homozygous alternate |
Both alleles are alternate |
1/2 |
Heterozygous (multi-allelic) |
Two different alternate alleles |
./. |
Missing |
Genotype could not be determined |
Phased vs Unphased:
0/1- Unphased (we don’t know which chromosome has which allele)0|1- Phased (first allele is on the maternal/paternal chromosome)
Our C. elegans VCF from CaeNDR contains natural variants across multiple wild strains.
Working with VCF Files#
bcftools is the standard tool for VCF manipulation. BCFtools contains a set of utilities to manipulate variant calling format (VCF):
# View VCF file
bcftools view data/celegans_variants.vcf.gz | less
# Filter variants by quality
bcftools filter -i 'QUAL>30' data/celegans_variants.vcf.gz > data/filtered.vcf
# Extract only SNPs
bcftools view -v snps data/celegans_variants.vcf.gz > data/snps_only.vcf
# Get variant statistics
bcftools stats data/celegans_variants.vcf.gz > data/stats.txt
# Count variants per chromosome
bcftools view -H data/celegans_variants.vcf.gz | cut -f1 | sort | uniq -c
# Extract specific region
bcftools view data/celegans_variants.vcf.gz CHROMOSOME_X:10500000-10525000 > data/chrX_region.vcf
1.4 Annotation Formats#
BED Format (Browser Extensible Data)#
BED is a tab-delimited format for describing genomic intervals. It is simpler than GFF/GTF and commonly used for:
Peak calling results (ChIP-seq, ATAC-seq)
Target regions for capture sequencing
Genomic features and intervals
Important: BED uses 0-based, half-open coordinates (start is 0-based, end is excluded).
Standard BED columns:
Column |
Name |
Description |
|---|---|---|
1 |
chrom |
Chromosome name |
2 |
chromStart |
Start position (0-based) |
3 |
chromEnd |
End position (excluded) |
4 |
name |
Feature name (optional) |
5 |
score |
Score 0-1000 (optional) |
6 |
strand |
+ or - (optional) |
Example:
chr1 0 100 feature1 500 +
chr1 200 300 feature2 750 -
chr2 1000 2000 feature3 1000 +
GFF/GTF Format (Gene Feature Format)#
GFF (General Feature Format) and GTF (Gene Transfer Format) are used for gene annotations. GTF is a stricter version of GFF2.
Important: GFF/GTF uses 1-based, fully-closed coordinates (both start and end are included).
GFF3 columns:
Column |
Name |
Description |
|---|---|---|
1 |
seqid |
Chromosome/scaffold name |
2 |
source |
Program or database |
3 |
type |
Feature type (gene, exon, CDS, etc.) |
4 |
start |
Start position (1-based) |
5 |
end |
End position (included) |
6 |
score |
Score or “.” |
7 |
strand |
+ or - |
8 |
phase |
Reading frame (0, 1, 2) or “.” |
9 |
attributes |
Key=value pairs |
Example from our training data:
# View the first few annotation records
zcat data/c_elegans.PRJNA13758.WS295.annotations.gff3.gz | grep -v "^#" | head -5
# Count feature types
zcat data/c_elegans.PRJNA13758.WS295.annotations.gff3.gz \
| grep -v "^#" | cut -f3 | sort | uniq -c | sort -rn
Coordinate System Comparison#
Understanding coordinate systems is crucial to avoid off-by-one errors:
Format |
Start |
End |
Example (bases 1-10) |
|---|---|---|---|
BED |
0-based |
excluded |
0-10 |
GFF/GTF |
1-based |
included |
1-10 |
SAM |
1-based |
included |
1-10 |
VCF |
1-based |
included |
1-10 |
Conversion:
BED to GFF: start + 1, end stays the same
GFF to BED: start - 1, end stays the same
2. File Format Summary Table#
Format |
Extension |
Type |
Coordinates |
Use Case |
Compressed Version |
|---|---|---|---|---|---|
FASTA |
.fa, .fasta |
Text |
N/A |
Reference sequences |
.fa.gz |
FASTQ |
.fq, .fastq |
Text |
N/A |
Raw sequencing reads |
.fq.gz |
SAM |
.sam |
Text |
1-based |
Sequence alignments |
- |
BAM |
.bam |
Binary |
1-based |
Sequence alignments |
Already compressed |
CRAM |
.cram |
Binary |
1-based |
Sequence alignments |
Higher compression |
VCF |
.vcf |
Text |
1-based |
Variant calls |
.vcf.gz |
BCF |
.bcf |
Binary |
1-based |
Variant calls |
Already compressed |
BED |
.bed |
Text |
0-based |
Genomic intervals |
.bed.gz |
GFF/GTF |
.gff, .gtf |
Text |
1-based |
Gene annotations |
.gff.gz |
3. Hands-On Exercise#
Using the training datasets provided, complete the following tasks:
Exercise 1: Exploring the FASTQ File#
# 1. Count the total number of reads in the FASTQ file
# Hint: Each read has 4 lines, so divide total lines by 4
gzcat data/SRR065390_1.fastq.gz | wc -l
# Then divide by 4
# 2. Extract the first 10 read identifiers
gzcat data/SRR065390_1.fastq.gz | grep "^@SRR" | head -10
# 3. Find reads that start with 'N' (unknown base)
gzcat data/SRR065390_1.fastq.gz | awk 'NR%4==2' | grep "^N" | wc -l
Exercise 2: Exploring the BAM File#
# 1. View the BAM header to see which chromosomes are present
samtools view -H data/N2_proof_of_principle.bam | grep "^@SQ"
# 2. Count total alignments
samtools view -c data/N2_proof_of_principle.bam
# 3. Count mapped reads only (exclude unmapped)
samtools view -c -F 4 data/N2_proof_of_principle.bam
# 4. Get alignment statistics
samtools flagstat data/N2_proof_of_principle.bam
Exercise 3: Exploring the VCF File#
# 1. Count total number of variants
bcftools view -H data/celegans_variants.vcf.gz | wc -l
# 2. Count variants on Chromosome X
bcftools view -H data/celegans_variants.vcf.gz CHROMOSOME_X | wc -l
# 3. Find how many samples are in the VCF
bcftools query -l data/celegans_variants.vcf.gz | wc -l
# 4. Extract high-quality variants
bcftools view -i 'QUAL>30' data/celegans_variants.vcf.gz | bcftools view -H | wc -l