Organising Bioinformatics Projects#
Learning Objectives#
By the end of this session, you will be able to:
Design a logical directory structure for bioinformatics projects
Apply consistent naming conventions for files and samples
Document your analysis workflow effectively
Implement version control basics with Git
Create reproducible analysis environments
1. Why Organisation Matters#
Poor project organisation leads to:
Lost or overwritten data
Inability to reproduce results
Wasted time searching for files
Confusion when collaborating
Difficulty publishing or sharing work
The Reproducibility Crisis#
Studies show that a significant portion of published bioinformatics analyses cannot be reproduced, even by the original authors. Good organisation is the foundation of reproducible research.
Principles of Good Organisation#
Separation: Keep raw data, code, and results separate
Documentation: Record what you did and why
Consistency: Use the same structure across projects
Automation: Reduce manual steps that introduce errors
Version control: Track changes to code and documents
2. Standard Directory Structure#
Recommended Project Layout#
project_name/
├── README.md # Project overview and instructions
├── LICENSE # Licence for sharing
├── config/ # Configuration files
│ ├── samples.tsv # Sample metadata
│ └── config.yaml # Pipeline parameters
├── data/
│ ├── raw/ # Original, immutable data
│ │ ├── fastq/ # Raw sequencing reads
│ │ └── checksums.md5 # Data integrity verification
│ ├── reference/ # Reference genomes, annotations
│ │ ├── genome.fasta
│ │ ├── genome.fasta.fai
│ │ └── annotation.gtf
│ └── processed/ # Cleaned/filtered data
│ ├── trimmed/
│ └── aligned/
├── scripts/ # Analysis scripts
│ ├── 01_quality_control.sh
│ ├── 02_alignment.sh
│ ├── 03_variant_calling.sh
│ └── utils/ # Helper functions
├── envs/ # Environment specifications
│ ├── environment.yaml # Conda environment
│ └── requirements.txt # Python packages
├── results/ # Analysis outputs
│ ├── figures/
│ ├── tables/
│ └── reports/
├── docs/ # Documentation
│ ├── methods.md
│ └── analysis_log.md
└── notebooks/ # Jupyter notebooks for exploration
├── 01_eda.ipynb
└── 02_visualisation.ipynb
Setting Up a New Project#
# Create project structure in one command
mkdir -p my_project/{config,data/{raw/fastq,reference,processed/{trimmed,aligned}},scripts/utils,envs,results/{figures,tables,reports},docs,notebooks}
# Create essential files
touch my_project/README.md
touch my_project/config/samples.tsv
touch my_project/docs/analysis_log.md
# View the structure
tree my_project/
3. Protecting Raw Data#
Raw data is sacred. Once modified, you may never be able to get it back.
Golden Rules for Raw Data#
Never modify raw files - always work on copies
Store checksums - verify data integrity
Make read-only - prevent accidental changes
Back up - store copies in multiple locations
Implementing Data Protection#
# Generate checksums for all raw FASTQ files
cd data/raw/fastq/
md5sum *.fastq.gz > checksums.md5
# Verify checksums later
md5sum -c checksums.md5
# Make raw data read-only
chmod -R a-w data/raw/
# If you need to add more raw data later
chmod u+w data/raw/
# ... add files ...
chmod -R a-w data/raw/
Sample Checksum File#
# checksums.md5
a1b2c3d4e5f60718293a4b5c6d7e8f90  sample_001_R1.fastq.gz
b2c3d4e5f60718293a4b5c6d7e8f90a1  sample_001_R2.fastq.gz
c3d4e5f60718293a4b5c6d7e8f90a1b2  sample_002_R1.fastq.gz
d4e5f60718293a4b5c6d7e8f90a1b2c3  sample_002_R2.fastq.gz
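The verification step can be wrapped in a small helper that fails loudly, so it can run at the start of a pipeline or from a backup job. This is a minimal sketch: the function takes the directory to check as an argument, and assumes the checksum file is named checksums.md5 as above.

```shell
# verify_checksums: re-verify the recorded MD5 sums in a directory.
# A minimal sketch; 'checksums.md5' is the file name used above.
verify_checksums() {
    local dir="$1"
    ( cd "$dir" && md5sum -c --quiet checksums.md5 ) \
        && echo "OK: all files in $dir match their recorded checksums" \
        || { echo "ERROR: checksum mismatch in $dir" >&2; return 1; }
}

# Example: verify_checksums data/raw/fastq
```

The `--quiet` flag suppresses the per-file "OK" lines, so only mismatches are printed; the non-zero return status lets a calling script abort before touching suspect data.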
4. File Naming Conventions#
Good file names are descriptive, consistent, and machine-readable.
Naming Guidelines#
| Rule | Bad Example | Good Example |
|---|---|---|
| No spaces | my file.txt | my_file.txt |
| No special characters | results(final)!.csv | results_final.csv |
| Use underscores or hyphens | sheepliverdata.txt | sheep_liver_data.txt |
| Include leading zeros | sample_1.fastq.gz | sample_001.fastq.gz |
| Use ISO dates | 15-01-24_run.log | 2024-01-15_run.log |
| Be descriptive | data2.csv | sheep_liver_counts.csv |
Recommended Naming Patterns#
For sequencing data:
{sample_id}_{condition}_{replicate}_{read}.fastq.gz
Examples:
sheep_liver_rep1_R1.fastq.gz
sheep_liver_rep1_R2.fastq.gz
sheep_muscle_rep1_R1.fastq.gz
For processed files:
{sample_id}_{processing_step}_{date/version}.{extension}
Examples:
sheep_liver_aligned_oar4.bam
sheep_liver_variants_filtered.vcf
cohort_analysis_v2.csv
For scripts:
{step_number}_{action}_{description}.{extension}
Examples:
01_quality_control.sh
02_trim_adapters.sh
03_align_reads.sh
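A naming convention is only useful if it is actually followed, so it is worth checking filenames automatically. The sketch below flags FASTQ files that do not match the sequencing-data pattern above; the regular expression is an assumption matching names like sheep_liver_rep1_R1.fastq.gz, and you would tighten it for your own scheme.

```shell
# check_fastq_names: flag FASTQ files that do not follow the
# {sample_id}_{condition}_{replicate}_{read} pattern described above.
# The regex is an assumption; adapt it to your own naming scheme.
check_fastq_names() {
    local bad=0
    for f in "$@"; do
        if [[ ! $(basename "$f") =~ ^[a-z0-9]+_[a-z0-9]+_rep[0-9]+_R[12]\.fastq\.gz$ ]]; then
            echo "Non-conforming name: $f" >&2
            bad=1
        fi
    done
    return $bad
}

# Example: check_fastq_names data/raw/fastq/*.fastq.gz
```

Running a check like this before an analysis starts catches stray spaces and inconsistent labels while they are still cheap to fix.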
5. Sample Metadata Management#
A well-organised sample sheet is essential for tracking samples through analysis.
Sample Sheet Format (TSV/CSV)#
sample_id condition replicate read1 read2 batch sequencing_date
sheep_L_1 liver 1 data/raw/sheep_L_1_R1.fastq.gz data/raw/sheep_L_1_R2.fastq.gz batch1 2024-01-15
sheep_L_2 liver 2 data/raw/sheep_L_2_R1.fastq.gz data/raw/sheep_L_2_R2.fastq.gz batch1 2024-01-15
sheep_M_1 muscle 1 data/raw/sheep_M_1_R1.fastq.gz data/raw/sheep_M_1_R2.fastq.gz batch1 2024-01-15
sheep_M_2 muscle 2 data/raw/sheep_M_2_R1.fastq.gz data/raw/sheep_M_2_R2.fastq.gz batch1 2024-01-15
Essential Metadata Fields#
| Field | Description | Example |
|---|---|---|
| sample_id | Unique identifier | sheep_L_1 |
| condition | Experimental group | liver, muscle, treated |
| replicate | Biological replicate number | 1, 2, 3 |
| batch | Processing batch | batch1, batch2 |
| sequencing_date | When sequenced | 2024-01-15 |
| platform | Sequencing platform | Illumina NovaSeq |
| library_type | Library preparation | RNA-Seq, WGS |
Tips for Sample Management#
Create the sample sheet before starting analysis
Store it in version control
Never manually rename files - use the sample sheet to link IDs
Include all relevant experimental factors for downstream analysis
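The sample sheet can also drive the analysis itself: instead of typing file paths by hand, loop over the sheet and generate one command per sample. This sketch assumes the tab-separated layout shown above (header row, then sample_id, condition, replicate, read1, read2, ...); it prints the FastQC command rather than running it, so you can inspect the output before swapping the echo for a real call.

```shell
# qc_from_sheet: print one FastQC command per sample in a sample
# sheet laid out as above (tab-separated, with a header row).
# A sketch - replace the echo with the real fastqc invocation
# once results/qc exists.
qc_from_sheet() {
    local sheet="$1"
    tail -n +2 "$sheet" | while IFS=$'\t' read -r sample condition rep read1 read2 rest; do
        echo "fastqc --outdir results/qc $read1 $read2  # $sample ($condition, rep $rep)"
    done
}

# Example: qc_from_sheet config/samples.tsv
```

Because every path comes from the sheet, adding a sample to the experiment means adding one row, not editing every script.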
6. Documentation Best Practices#
The README File#
Every project should have a README.md explaining:
# Project Title
Brief description of the project and its goals.
## Data
- Source of raw data
- Number of samples
- Sequencing platform and parameters
## Methods
Overview of the analysis pipeline.
## Directory Structure
Brief explanation of folder organisation.
## Usage
How to reproduce the analysis.
## Dependencies
Required software and versions.
## Authors
Who worked on this project.
## Licence
Terms for using this work.
Analysis Log#
Keep a running log of your analysis:
# Analysis Log
## 2024-01-15
### Quality Control
- Ran FastQC on all samples
- Sample sheep_003 showed adapter contamination
- Decision: Include in trimming, monitor downstream
### Reference Download
- Downloaded Oar_v4.0 from NCBI
- Accession: GCF_000298735.2
- Command: wget [URL]
## 2024-01-16
### Trimming
- Used Trimmomatic v0.39
- Parameters: ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3
- Results: 95% reads retained on average
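Log entries are easiest to keep up when adding one is a single command. The helper below is a sketch of that idea: it appends a timestamped note to a log file, creating the day's `##` heading on first use so the log keeps the date-based structure shown above. The function name and argument order are assumptions, not an established tool.

```shell
# log_entry: append a timestamped note to an analysis log,
# adding a "## YYYY-MM-DD" heading the first time each day.
# A sketch; pass the log path first, then the note text.
log_entry() {
    local log="$1"; shift
    local today
    today=$(date +%F)
    # Add today's heading if it is not in the log yet
    grep -q "^## $today" "$log" 2>/dev/null || printf '\n## %s\n' "$today" >> "$log"
    printf -- '- %s %s\n' "$(date +%H:%M)" "$*" >> "$log"
}

# Example: log_entry docs/analysis_log.md "Ran FastQC on all samples"
```

Calling this from inside analysis scripts means the log records what actually ran, not just what you remembered to write down afterwards.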
7. Managing Software Environments#
Reproducibility requires recording exact software versions.
Using Conda Environments#
# envs/environment.yaml
name: sheep_rnaseq
channels:
- conda-forge
- bioconda
dependencies:
- python=3.10
- bwa=0.7.17
- samtools=1.17
- bcftools=1.17
- fastqc=0.12.1
- multiqc=1.14
- hisat2=2.2.1
Creating and Using Environments#
# Create environment from file
conda env create -f envs/environment.yaml
# Activate environment
conda activate sheep_rnaseq
# Export current environment (for sharing)
conda env export > envs/environment_exported.yaml
# List installed packages with versions
conda list > envs/package_versions.txt
Recording Software Versions in Scripts#
#!/bin/bash
# 01_alignment.sh
# Log software versions
echo "=== Software Versions ===" > alignment.log
bwa 2>&1 | head -3 >> alignment.log
samtools --version | head -2 >> alignment.log
echo "========================" >> alignment.log
# Run alignment
bwa mem reference.fasta reads_R1.fastq reads_R2.fastq > aligned.sam 2>> alignment.log
8. Version Control with Git#
Git tracks changes to your code and documents, allowing you to revert mistakes and collaborate.
Essential Git Commands#
# Initialise a new repository
cd my_project/
git init
# Create .gitignore to exclude large data files
echo "data/raw/" >> .gitignore
echo "data/processed/" >> .gitignore
echo "*.fastq*" >> .gitignore
echo "*.bam" >> .gitignore
echo "*.bam.bai" >> .gitignore
# Add files to staging
git add scripts/ config/ README.md .gitignore
# Commit changes
git commit -m "Initial project setup"
# View history
git log --oneline
# Check status
git status
What to Track vs Ignore#
| Track (git add) | Ignore (.gitignore) |
|---|---|
| Scripts (.sh, .py, .R) | Raw data (.fastq, .bam) |
| Configuration files | Processed data |
| Sample sheets | Large reference files |
| Documentation | Log files |
| Environment specs | Temporary files |
| Small result tables | Figures (can be regenerated) |
Sample .gitignore for Bioinformatics#
# Data files
*.fastq
*.fastq.gz
*.fq
*.fq.gz
*.bam
*.bam.bai
*.cram
*.vcf
*.vcf.gz
*.bcf
# Reference files
*.fasta
*.fa
*.fasta.fai
*.dict
# Indices
*.bt2
*.bwt
*.pac
*.ann
*.amb
*.sa
# Logs and temp
*.log
*.tmp
.snakemake/
# System files
.DS_Store
Thumbs.db
9. Hands-On Exercise#
Exercise: Organise the Training Data#
Create a properly organised project structure for analysing our sheep training data:
# Step 1: Create project structure
PROJECT="sheep_training_analysis"
mkdir -p $PROJECT/{config,data/{raw,reference,processed},scripts,envs,results/{figures,tables},docs}
# Step 2: Create README
cat > $PROJECT/README.md << 'EOF'
# Sheep Training Data Analysis
Analysis of sheep genomic data for bioinformatics training.
## Data
- FASTQ: SRR10532784 (truncated)
- BAM: u9_liver_100.bam (sheep liver RNA-seq)
- VCF: sheep.snp.vcf.gz (~300 samples)
## Methods
Training exercises for file format exploration.
EOF
# Step 3: Create an inventory sheet for the training data files
cat > $PROJECT/config/samples.tsv << 'EOF'
file_type file_name description
FASTQ SRR10532784_1.fastq.gz RNA-seq reads (truncated)
BAM u9_liver_100.bam Sheep liver aligned reads
VCF sheep.snp.vcf.gz Sheep population SNPs
EOF
# Step 4: Link the training data (or copy)
ln -s $(pwd)/training_datasets/* $PROJECT/data/raw/
# Step 5: Generate checksums
cd $PROJECT/data/raw
md5sum *.gz *.bam > checksums.md5
# Step 6: Create .gitignore
cat > $PROJECT/.gitignore << 'EOF'
data/raw/
data/processed/
*.fastq*
*.bam*
*.vcf*
*.log
EOF
# Step 7: View final structure
tree $PROJECT/
10. Summary#
In this session, we covered:
Directory Structure: Separating raw data, code, and results
Raw Data Protection: Checksums, read-only permissions, backups
Naming Conventions: Machine-readable, descriptive file names
Sample Metadata: Tracking samples with structured sheets
Documentation: README files and analysis logs
Environments: Conda for reproducible software stacks
Version Control: Git basics for tracking changes
Checklist for Every New Project#
Create standard directory structure
Write README.md
Create sample sheet
Set up conda environment
Initialise git repository
Create .gitignore
Generate checksums for raw data
Make raw data read-only
Next Steps#
With your project properly organised, you are ready to begin the actual bioinformatics analysis. The following sessions will cover quality control, alignment, and downstream analysis.