Introduction to Bioinformatics File Formats#

Learning Objectives#

By the end of this session, you will be able to:

Identify common bioinformatics file formats and their use cases
Work with FASTA and FASTQ files
Learn how to install and run your first bioinfomatics tool
Understand quality scores in sequencing data
Navigate BED and GFF annotation formats
Apply best practices for organising bioinformatics projects

Before We Start: Installing the Tools#

Before downloading any data, we need to install the software we will use throughout this session. The bioinformatics community has standardised on conda, a package manager that installs tools along with all their dependencies automatically.

Step 1: Install Miniconda#

Miniconda is a lightweight version of Anaconda. Follow the installation instructions for your operating system.

Once installed, verify it works:

conda --version

Step 2: Install the bioinformatics tools#

# samtools — for working with FASTA, BAM, and CRAM files
conda install bioconda::samtools

# bcftools — for working with VCF files
conda install bioconda::bcftools

# SRA Toolkit — for downloading raw reads from NCBI SRA
conda install -c bioconda sra-tools

Verify each installation:

samtools --version
bcftools --version
fasterq-dump --version

On an HPC cluster: Your system administrator may have already installed these tools as modules. Check with module avail samtools before installing via conda.

Downloading the Training Data#

Now that the tools are installed, we can download the data. All files in this session come from Caenorhabditis elegans, a 1 mm soil nematode and the first multicellular organism to have its genome completely sequenced. Its compact ~100 Mb genome makes the files small enough to download and explore in minutes, and every command you run here works identically on a sheep, wheat, or camel genome, just with larger files.

The BAM files are from a 2010 forward genetics screen (Doitsidou et al., PLoS ONE) that used whole-genome sequencing to find a mutation in a worm mutant called ot266. You will revisit these files in the journal club session.

How to download files: `wget`#

wget fetches a file directly from a URL to your current directory. It is pre-installed on most Linux systems.

wget <URL>              # download to current directory
wget -P data/ <URL>     # save into a specific directory
wget -c <URL>           # resume an interrupted download

How to download raw reads: `fasterq-dump`#

Raw sequencing reads are stored in NCBI’s Sequence Read Archive (SRA). Each dataset has an accession number (e.g. SRR065390). Use fasterq-dump to download them:

fasterq-dump --split-files SRR065390             # download all reads
fasterq-dump --split-files -X 100000 SRR065390   # first 100,000 reads only

The -X flag limits how many reads are downloaded — useful when you only need a subset to explore the format.

Download all session files#

# Create a data directory
mkdir -p data
cd data

# 1. MiModD teaching archive — N2 and ot266 BAM files + matched WS220 reference
#    Source: Minevich et al. 2012, Genetics
wget -O worm_ot266.tar.gz \
  "https://sourceforge.net/projects/mimodd/files/Example%20Datasets/worm_ot266.tar.gz/download"
tar -xzvf worm_ot266.tar.gz

# 2. Current C. elegans reference genome from WormBase (release WS295)
#    The WS220.64.fa above is kept only to match the BAM files from the 2010 study.
#    For any new analysis always use the latest reference from WormBase.
wget https://downloads.wormbase.org/releases/WS295/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS295.genomic.fa.gz

# 3. C. elegans variant calls (VCF) from CaeNDR
wget https://caendr-open-access-data-bucket.s3.us-east-2.amazonaws.com/dataset_release/c_elegans/20250625/variation/WI.20250625.hard-filter.vcf.gz
wget https://caendr-open-access-data-bucket.s3.us-east-2.amazonaws.com/dataset_release/c_elegans/20250625/variation/WI.20250625.hard-filter.vcf.gz.tbi

# 4. WormBase gene annotation, release WS295
wget https://downloads.wormbase.org/releases/WS295/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS295.annotations.gff3.gz

# 5. N2 wild-type WGS reads — first 100,000 reads (Hillier et al. 2008)
cd ..
fasterq-dump --split-files -X 100000 -O data/ SRR065390
gzip data/SRR065390_1.fastq data/SRR065390_2.fastq

# Confirm all files are present
ls -lh data/

Why two reference genomes? WS220.64.fa comes bundled with the BAM files — they were aligned against it, so it must be kept for BAM-related exercises. WS295 is the current release and should be used for any new analysis or annotation work. Always match your reference to the one used during alignment.

File	Format	Description
`WS220.64.fa`	FASTA	C. elegans reference genome, WormBase WS220 (matches BAM files)
`c_elegans.PRJNA13758.WS295.genomic.fa.gz`	FASTA	C. elegans reference genome, WormBase WS295 (current)
`N2_proof_of_principle.bam`	BAM	Wild-type N2 WGS reads aligned to WS220
`ot266_proof_of_principle.bam`	BAM	ot266 mutant WGS reads aligned to WS220
`WI.20250625.hard-filter.vcf.gz`	VCF	C. elegans natural variants (CaeNDR)
`c_elegans.PRJNA13758.WS295.annotations.gff3.gz`	GFF3	Gene annotations, WormBase WS295
`SRR065390_1.fastq.gz`	FASTQ	N2 WGS reads, 100k read subset

1. Common Bioinformatics File Formats#

1.1 Sequence Data Formats#

Nucleotide, protein, and reference genome sequences are stored in two plain-text formats widely used in bioinformatics: FASTA and FASTQ. We will discuss both file formats below and explore tools for working with data in these formats.

FASTA Format#

FASTA is a plain text format and one of the simplest and most widely used formats for storing nucleotide, protein sequences, coding DNA sequences (CDS), transcript sequences, reference genome files, and so on. It originates from the FASTA alignment suite, created by William R. Pearson and David J. Lipman.

Structure: FASTA files are composed of sequence entries, each containing a description and the sequence data.

The description line (header line) starts with the greater than symbol > followed by a sequence identifier and description
Sequence data follows on subsequent lines and continues until there is another description line starting with >

>sequence_id description
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG

Example from our training data:

# View the first two lines of the reference genome
head -2 data/WS220.64.fa

# List all chromosome names
grep "^>" data/WS220.64.fa

# Count how many sequences (chromosomes) are in the file
grep -c "^>" data/WS220.64.fa

FASTQ Format#

FASTQ extends FASTA by including quality scores for each base. This is the standard output format for most sequencing data produced by modern high-throughput sequencing platforms. The per-base quality score indicates the confidence for each base call.

Structure (4 lines per read):

Header line starting with @ followed by sequence identifier and other information
Raw sequence
Separator line starting with +
Quality scores (same length as sequence)

Example from our training data:

# View the first read from the training FASTQ file
gzcat data/SRR065390_1.fastq.gz | head -4

@SRR065390.1 1/1
TCCTGCGTTGATTGAAAGAAATCGAAATGGAAATGGTTCATATTTTGTTTT...
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...

Quality Score Reference#

You will hear about the quality score of sequenced reads, as different sequencing instruments produce reads with varying quality metrics.

Phred Score	Error Probability	Accuracy
10	1 in 10	90%
20	1 in 100	99%
30	1 in 1,000	99.9%
40	1 in 10,000	99.99%

Rule of thumb: Q30 or higher is generally considered high quality for Illumina data.

ASCII Encoding: Quality scores are encoded as ASCII characters. For Illumina 1.8+ (Phred+33):

! = 0 (worst)
I = 40 (excellent)
# = 2 (very poor, often indicates N base call)

The fastq or fasta files contains million of short/long reads from a high-throughput sequencer where their genomic position, function and consequences relative to the reference genome is not known. This is a result of digesting the nucleic acid during library preperation, where DNA or RNA is cut into a particular size, adapters are added and then loaded to be sequenced. There isn’t a genomic sequencer that is capable of sequencing a genome on it’s entire length, despite the fact that a good progress has been made with pore sequencing. However, reconstructing reads in respect to a genome reference is still required in a process commpnly known as alignemnt or mapping.

1.2 Alignment Formats#

SAM/BAM Format#

SAM (Sequence Alignment/Map) is a text format for storing sequence alignments. BAM is the compressed binary version.

Key fields in SAM format:

QNAME: Query (read) name
FLAG: Bitwise flag (paired, mapped, strand, etc.)
RNAME: Reference sequence name
POS: 1-based leftmost mapping position
MAPQ: Mapping quality
CIGAR: Alignment description string
RNEXT: Reference name of mate/next read
PNEXT: Position of mate/next read
TLEN: Template length
SEQ: Segment sequence
QUAL: Quality scores

CRAM Format#

CRAM is a reference-based compressed columnar format for storing sequence alignments, developed as a more space-efficient alternative to BAM. It achieves 30-60% smaller file sizes through reference-based compression and advanced encoding schemes.

Key Differences from BAM Reference dependency: CRAM requires the reference genome (.fasta) for both reading and writing, as it stores only differences from the reference rather than full sequences. Compression modes: Supports both lossless (preserving all data) and lossy compression (discarding quality scores or other fields for additional space savings).

Storage architecture: Uses columnar storage organized into containers and slices, rather than sequential read-by-read storage like SAM/BAM.

CRAM File Contents

CRAM files contain the same alignment information as SAM/BAM but organized differently:

File header: SAM-compatible header including reference sequences, read groups, and program information, plus reference genome MD5 checksum for validation.

Containers and slices: Data organized into hierarchical containers (units of compression) containing slices (subsets of reads from specific genomic regions).

Alignment fields: Same core fields as SAM/BAM: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR RNEXT, PNEXT, TLEN SEQ (stored as differences from reference) QUAL (optionally preserved or discarded) Optional tags

Compression codecs: Multiple algorithms (gzip, bzip2, lzma, rANS) applied per field type for optimal compression.

Example from our training data:

In order to view or manipualte the alignment file we need to use a specialised bioinfomatics tools developed for interacting and post processing of genomic DNA aligments in the SAM, BAM and CRAM format. Samtools is a suite of programs for interacting with high-throughput sequencing read alignments.

# Investigate samtools command line options
samtools

# View alignments from the C. elegans BAM files
samtools view data/N2_proof_of_principle.bam | head -2
samtools view data/ot266_proof_of_principle.bam | head -2

Understanding CIGAR Strings#

The CIGAR string describes how a read aligns to the reference. Each operation is represented by a number followed by a letter:

Operation	Description	Consumes Query	Consumes Reference
M	Match or mismatch	Yes	Yes
I	Insertion to reference	Yes	No
D	Deletion from reference	No	Yes
N	Skipped region (intron)	No	Yes
S	Soft clipping	Yes	No
H	Hard clipping	No	No

Examples:

50M - 50 bases match/mismatch
30M2I18M - 30 matches, 2 base insertion, 18 matches
25M100N25M - 25 matches, 100 base intron (RNA-seq), 25 matches
16S109M - 16 bases soft clipped, 109 matches (seen in our BAM file)

BAM vs CRAM: Choosing the Right Format#

Feature	BAM	CRAM
Compression	~25% of SAM	~50% of BAM
Speed	Faster	Slower
Reference required	No	Yes (for full compression)
Tool support	Universal	Growing
Best for	Active analysis	Long-term storage

Recommendation: Use BAM during active analysis, convert to CRAM for archival storage.

# Convert BAM to CRAM
samtools view -T reference.fasta -C -o output.cram input.bam

# Convert CRAM back to BAM
samtools view -T reference.fasta -b -o output.bam input.cram

Working with SAM/BAM Files#

samtools is the standard tool for manipulating alignment files: Here are common command line tools for using samtools

# View BAM file header
samtools view -H data/N2_proof_of_principle.bam

# Convert SAM to BAM
samtools view -bS aligned.sam > aligned.bam

# Sort BAM file by coordinate
samtools sort aligned.bam -o aligned.sorted.bam

# Index sorted BAM file (required for most downstream tools)
samtools index aligned.sorted.bam

# View alignments in a specific region
samtools view aligned.sorted.bam CHROMOSOME_X:10000000-11000000

# Get alignment statistics
samtools flagstat data/N2_proof_of_principle.bam
samtools flagstat data/ot266_proof_of_principle.bam

# Calculate coverage depth
samtools depth aligned.sorted.bam > coverage.txt

1.3 Variant Formats#

VCF (Variant Call Format)#

VCF is the standard format for storing genetic variant data. It is used by virtually all variant calling pipelines.

Structure:

Meta-information lines (starting with ##)
Header line (starting with #)
Data lines (one variant per line)

Required columns:

CHROM - Chromosome
POS - Position (1-based)
ID - Variant identifier (e.g., rs number)
REF - Reference allele
ALT - Alternative allele(s)
QUAL - Phred-scaled quality score
FILTER - Filter status (PASS or failed filter names)
INFO - Additional information (key=value pairs)

Example from our training data:

# View VCF header
bcftools view -h data/celegans_variants.vcf.gz | head -20

# View first few variants
bcftools view -H data/celegans_variants.vcf.gz | head -3

Understanding Genotypes in VCF#

The genotype field (GT) uses numbers to represent alleles:

Genotype	Meaning	Description
0/0	Homozygous reference	Both alleles match reference
0/1	Heterozygous	One reference, one alternate
1/1	Homozygous alternate	Both alleles are alternate
1/2	Heterozygous (multi-allelic)	Two different alternate alleles
./.	Missing	Genotype could not be determined

Phased vs Unphased:

0/1 - Unphased (we don’t know which chromosome has which allele)
0|1 - Phased (first allele is on the maternal/paternal chromosome)

Our C. elegans VCF from CaeNDR contains natural variants across multiple wild strains.

Working with VCF Files#

bcftools is the standard tool for VCF manipulation. BCFtools contains a set of utilities to manipulate variant calling format (VCF):

# View VCF file
bcftools view data/celegans_variants.vcf.gz | less

# Filter variants by quality
bcftools filter -i 'QUAL>30' data/celegans_variants.vcf.gz > data/filtered.vcf

# Extract only SNPs
bcftools view -v snps data/celegans_variants.vcf.gz > data/snps_only.vcf

# Get variant statistics
bcftools stats data/celegans_variants.vcf.gz > data/stats.txt

# Count variants per chromosome
bcftools view -H data/celegans_variants.vcf.gz | cut -f1 | sort | uniq -c

# Extract specific region
bcftools view data/celegans_variants.vcf.gz CHROMOSOME_X:10500000-10525000 > data/chrX_region.vcf

1.4 Annotation Formats#

BED Format (Browser Extensible Data)#

BED is a tab-delimited format for describing genomic intervals. It is simpler than GFF/GTF and commonly used for:

Peak calling results (ChIP-seq, ATAC-seq)
Target regions for capture sequencing
Genomic features and intervals

Important: BED uses 0-based, half-open coordinates (start is 0-based, end is excluded).

Standard BED columns:

Column	Name	Description
1	chrom	Chromosome name
2	chromStart	Start position (0-based)
3	chromEnd	End position (excluded)
4	name	Feature name (optional)
5	score	Score 0-1000 (optional)
6	strand	+ or - (optional)

Example:

chr1    0       100     feature1    500     +
chr1    200     300     feature2    750     -
chr2    1000    2000    feature3    1000    +

GFF/GTF Format (Gene Feature Format)#

GFF (General Feature Format) and GTF (Gene Transfer Format) are used for gene annotations. GTF is a stricter version of GFF2.

Important: GFF/GTF uses 1-based, fully-closed coordinates (both start and end are included).

GFF3 columns:

Column	Name	Description
1	seqid	Chromosome/scaffold name
2	source	Program or database
3	type	Feature type (gene, exon, CDS, etc.)
4	start	Start position (1-based)
5	end	End position (included)
6	score	Score or “.”
7	strand	+ or -
8	phase	Reading frame (0, 1, 2) or “.”
9	attributes	Key=value pairs

Example from our training data:

# View the first few annotation records
zcat data/c_elegans.PRJNA13758.WS295.annotations.gff3.gz | grep -v "^#" | head -5

# Count feature types
zcat data/c_elegans.PRJNA13758.WS295.annotations.gff3.gz \
    | grep -v "^#" | cut -f3 | sort | uniq -c | sort -rn

Coordinate System Comparison#

Understanding coordinate systems is crucial to avoid off-by-one errors:

Format	Start	End	Example (bases 1-10)
BED	0-based	excluded	0-10
GFF/GTF	1-based	included	1-10
SAM	1-based	included	1-10
VCF	1-based	included	1-10

Conversion:

BED to GFF: start + 1, end stays the same
GFF to BED: start - 1, end stays the same

2. File Format Summary Table#

Format	Extension	Type	Coordinates	Use Case	Compressed Version
FASTA	.fa, .fasta	Text	N/A	Reference sequences	.fa.gz
FASTQ	.fq, .fastq	Text	N/A	Raw sequencing reads	.fq.gz
SAM	.sam	Text	1-based	Sequence alignments	-
BAM	.bam	Binary	1-based	Sequence alignments	Already compressed
CRAM	.cram	Binary	1-based	Sequence alignments	Higher compression
VCF	.vcf	Text	1-based	Variant calls	.vcf.gz
BCF	.bcf	Binary	1-based	Variant calls	Already compressed
BED	.bed	Text	0-based	Genomic intervals	.bed.gz
GFF/GTF	.gff, .gtf	Text	1-based	Gene annotations	.gff.gz

3. Hands-On Exercise#

Using the training datasets provided, complete the following tasks:

Exercise 1: Exploring the FASTQ File#

# 1. Count the total number of reads in the FASTQ file
# Hint: Each read has 4 lines, so divide total lines by 4
gzcat data/SRR065390_1.fastq.gz | wc -l
# Then divide by 4

# 2. Extract the first 10 read identifiers
gzcat data/SRR065390_1.fastq.gz | grep "^@SRR" | head -10

# 3. Find reads that start with 'N' (unknown base)
gzcat data/SRR065390_1.fastq.gz | awk 'NR%4==2' | grep "^N" | wc -l

Exercise 2: Exploring the BAM File#

# 1. View the BAM header to see which chromosomes are present
samtools view -H data/N2_proof_of_principle.bam | grep "^@SQ"

# 2. Count total alignments
samtools view -c data/N2_proof_of_principle.bam

# 3. Count mapped reads only (exclude unmapped)
samtools view -c -F 4 data/N2_proof_of_principle.bam

# 4. Get alignment statistics
samtools flagstat data/N2_proof_of_principle.bam

Exercise 3: Exploring the VCF File#

# 1. Count total number of variants
bcftools view -H data/celegans_variants.vcf.gz | wc -l

# 2. Count variants on Chromosome X
bcftools view -H data/celegans_variants.vcf.gz CHROMOSOME_X | wc -l

# 3. Find how many samples are in the VCF
bcftools query -l data/celegans_variants.vcf.gz | wc -l

# 4. Extract high-quality variants
bcftools view -i 'QUAL>30' data/celegans_variants.vcf.gz | bcftools view -H | wc -l

Introduction to Bioinformatics File Formats

Contents

Introduction to Bioinformatics File Formats#

Learning Objectives#

Before We Start: Installing the Tools#

Step 1: Install Miniconda#

Step 2: Install the bioinformatics tools#

Downloading the Training Data#

How to download files: wget#

How to download raw reads: fasterq-dump#

Download all session files#

1. Common Bioinformatics File Formats#

1.1 Sequence Data Formats#

FASTA Format#

FASTQ Format#

Quality Score Reference#

1.2 Alignment Formats#

SAM/BAM Format#

CRAM Format#

Understanding CIGAR Strings#

BAM vs CRAM: Choosing the Right Format#

Working with SAM/BAM Files#

1.3 Variant Formats#

VCF (Variant Call Format)#

Understanding Genotypes in VCF#

Working with VCF Files#

1.4 Annotation Formats#

BED Format (Browser Extensible Data)#

GFF/GTF Format (Gene Feature Format)#

Coordinate System Comparison#

2. File Format Summary Table#

3. Hands-On Exercise#

Exercise 1: Exploring the FASTQ File#

Exercise 2: Exploring the BAM File#

Exercise 3: Exploring the VCF File#

Additional Resources#

How to download files: `wget`#

How to download raw reads: `fasterq-dump`#