Organising Bioinformatics Projects#

Learning Objectives#

By the end of this session, you will be able to:

  1. Design a logical directory structure for bioinformatics projects

  2. Apply consistent naming conventions for files and samples

  3. Document your analysis workflow effectively

  4. Implement version control basics with Git

  5. Create reproducible analysis environments

1. Why Organisation Matters#

Poor project organisation leads to:

  • Lost or overwritten data

  • Inability to reproduce results

  • Wasted time searching for files

  • Confusion when collaborating

  • Difficulty publishing or sharing work

The Reproducibility Crisis#

Studies show that a significant portion of published bioinformatics analyses cannot be reproduced, even by the original authors. Good organisation is the foundation of reproducible research.

Principles of Good Organisation#

  1. Separation: Keep raw data, code, and results separate

  2. Documentation: Record what you did and why

  3. Consistency: Use the same structure across projects

  4. Automation: Reduce manual steps that introduce errors

  5. Version control: Track changes to code and documents
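The automation principle in miniature: number your step scripts and run them in order with a single driver command, so no step depends on manual intervention. A minimal sketch (the two demo scripts are placeholders, not part of any real pipeline):

```shell
#!/bin/bash
# Sketch of the automation principle: numbered step scripts executed in
# order by one driver. The two demo scripts below are placeholders.
set -euo pipefail

mkdir -p scripts
printf '%s\n' '#!/bin/bash' 'echo "step 1: quality control"' > scripts/01_qc.sh
printf '%s\n' '#!/bin/bash' 'echo "step 2: trimming"'        > scripts/02_trim.sh

# Run every numbered step in lexical order; set -e aborts on first failure
for step in scripts/0*_*.sh; do
    bash "$step"
done
```

Because the steps sort lexically (01_, 02_, ...), the leading-zero naming convention from the principles above is what makes this loop run them in the right order.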

2. Standard Directory Structure#

Setting Up a New Project#

# Create project structure in one command
mkdir -p my_project/{config,data/{raw/fastq,reference,processed/{trimmed,aligned}},scripts/utils,envs,results/{figures,tables,reports},docs,notebooks}

# Create essential files
touch my_project/README.md
touch my_project/config/samples.tsv
touch my_project/docs/analysis_log.md

# View the structure
tree my_project/

3. Protecting Raw Data#

Raw data is irreplaceable. Once modified or deleted, it may be impossible to recover.

Golden Rules for Raw Data#

  1. Never modify raw files - always work on copies

  2. Store checksums - verify data integrity

  3. Make read-only - prevent accidental changes

  4. Back up - store copies in multiple locations

Implementing Data Protection#

# Generate checksums for all raw FASTQ files
cd data/raw/fastq/
md5sum *.fastq.gz > checksums.md5

# Verify checksums later
md5sum -c checksums.md5

# Make raw data read-only
chmod -R a-w data/raw/

# If you need to add more raw data later
chmod u+w data/raw/
# ... add files ...
chmod -R a-w data/raw/

Sample Checksum File#

# checksums.md5
a1b2c3d4e5f60718293a4b5c6d7e8f90  sample_001_R1.fastq.gz
b2c3d4e5f60718293a4b5c6d7e8f90a1  sample_001_R2.fastq.gz
c3d4e5f60718293a4b5c6d7e8f90a1b2  sample_002_R1.fastq.gz
d4e5f60718293a4b5c6d7e8f90a1b2c3  sample_002_R2.fastq.gz

4. File Naming Conventions#

Good file names are descriptive, consistent, and machine-readable.

Naming Guidelines#

| Rule | Bad example | Good example |
|------|-------------|--------------|
| No spaces | sample 1.fastq | sample_001.fastq |
| No special characters | sample#1.fastq | sample_001.fastq |
| Use underscores or hyphens | sampleone.fastq | sample_001.fastq |
| Include leading zeros | sample_1.fastq | sample_001.fastq |
| Use ISO dates | sample_15-1-24.fastq | sample_2024-01-15.fastq |
| Be descriptive | data.bam | sheep_liver_001_aligned.bam |
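The rules above can be enforced with a small shell check before files enter the project. A sketch, assuming a permitted character set of lowercase letters, digits, dot, underscore, and hyphen (widen it if your project allows more):

```shell
#!/bin/bash
# Sketch: reject file names that break the naming conventions.
# The allowed character set is an assumption; adjust to your project.
valid_name() {
    case "$1" in
        *[!a-z0-9._-]*) return 1 ;;  # spaces, #, uppercase, etc. all fail
        *)              return 0 ;;
    esac
}

valid_name "sample_001.fastq" && echo "sample_001.fastq: OK"
valid_name "sample 1.fastq"   || echo "sample 1.fastq: rename needed"
```

Run such a check when new files arrive, before generating checksums, so bad names never propagate into downstream scripts.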

5. Sample Metadata Management#

A well-organised sample sheet is essential for tracking samples through analysis.

Sample Sheet Format (TSV/CSV)#

sample_id	condition	replicate	read1	read2	batch	sequencing_date
sheep_L_1	liver	1	data/raw/sheep_L_1_R1.fastq.gz	data/raw/sheep_L_1_R2.fastq.gz	batch1	2024-01-15
sheep_L_2	liver	2	data/raw/sheep_L_2_R1.fastq.gz	data/raw/sheep_L_2_R2.fastq.gz	batch1	2024-01-15
sheep_M_1	muscle	1	data/raw/sheep_M_1_R1.fastq.gz	data/raw/sheep_M_1_R2.fastq.gz	batch1	2024-01-15
sheep_M_2	muscle	2	data/raw/sheep_M_2_R1.fastq.gz	data/raw/sheep_M_2_R2.fastq.gz	batch1	2024-01-15

Essential Metadata Fields#

| Field | Description | Example |
|-------|-------------|---------|
| sample_id | Unique identifier | sheep_L_1 |
| condition | Experimental group | liver, muscle, treated |
| replicate | Biological replicate number | 1, 2, 3 |
| batch | Processing batch | batch1, batch2 |
| sequencing_date | When sequenced | 2024-01-15 |
| platform | Sequencing platform | Illumina NovaSeq |
| library_type | Library preparation | RNA-Seq, WGS |

Tips for Sample Management#

  1. Create the sample sheet before starting analysis

  2. Store it in version control

  3. Never manually rename files - use the sample sheet to link IDs

  4. Include all relevant experimental factors for downstream analysis
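Tip 3 in practice: rather than renaming files, loop over the sample sheet and let each row supply the paths. A minimal sketch (the inline demo sheet and the echoed hisat2 command line are illustrative, not a prescribed pipeline):

```shell
#!/bin/bash
# Sketch: drive per-sample commands from a sample sheet instead of
# renaming files. The demo sheet and echoed command are placeholders.
cat > samples_demo.tsv << 'EOF'
sample_id	condition	read1	read2
sheep_L_1	liver	data/raw/sheep_L_1_R1.fastq.gz	data/raw/sheep_L_1_R2.fastq.gz
sheep_M_1	muscle	data/raw/sheep_M_1_R1.fastq.gz	data/raw/sheep_M_1_R2.fastq.gz
EOF

# tail -n +2 skips the header row; read splits each line on tabs
tail -n +2 samples_demo.tsv | while IFS=$'\t' read -r sample condition read1 read2; do
    echo "would run: hisat2 -x index -1 $read1 -2 $read2 -S ${sample}.sam"
done
```

Because the sheet, not the filesystem, links sample IDs to files, adding a sample means adding one row, and the raw files keep their original names.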

6. Documentation Best Practices#

The README File#

Every project should have a README.md explaining:

# Project Title

Brief description of the project and its goals.

## Data

- Source of raw data
- Number of samples
- Sequencing platform and parameters

## Methods

Overview of the analysis pipeline.

## Directory Structure

Brief explanation of folder organisation.

## Usage

How to reproduce the analysis.

## Dependencies

Required software and versions.

## Authors

Who worked on this project.

## Licence

Terms for using this work.

Analysis Log#

Keep a running log of your analysis:

# Analysis Log

## 2024-01-15

### Quality Control
- Ran FastQC on all samples
- Sample sheep_003 showed adapter contamination
- Decision: Include in trimming, monitor downstream

### Reference Download
- Downloaded Oar_v4.0 from NCBI
- Accession: GCF_000298735.2
- Command: wget [URL]

## 2024-01-16

### Trimming
- Used Trimmomatic v0.39
- Parameters: ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3
- Results: 95% reads retained on average

7. Managing Software Environments#

Reproducibility requires recording exact software versions.

Using Conda Environments#

# envs/environment.yaml
name: sheep_rnaseq
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.10
  - bwa=0.7.17
  - samtools=1.17
  - bcftools=1.17
  - fastqc=0.12.1
  - multiqc=1.14
  - hisat2=2.2.1

Creating and Using Environments#

# Create environment from file
conda env create -f envs/environment.yaml

# Activate environment
conda activate sheep_rnaseq

# Export current environment (for sharing)
conda env export > envs/environment_exported.yaml

# List installed packages with versions
conda list > envs/package_versions.txt

Recording Software Versions in Scripts#

#!/bin/bash
# 01_alignment.sh

# Log software versions
echo "=== Software Versions ===" > alignment.log
bwa 2>&1 | head -3 >> alignment.log
samtools --version | head -2 >> alignment.log
echo "========================" >> alignment.log

# Run alignment
bwa mem reference.fasta reads_R1.fastq reads_R2.fastq > aligned.sam 2>> alignment.log

8. Version Control with Git#

Git tracks changes to your code and documents, allowing you to revert mistakes and collaborate.

Essential Git Commands#

# Initialise a new repository
cd my_project/
git init

# Create .gitignore to exclude large data files
echo "data/raw/" >> .gitignore
echo "data/processed/" >> .gitignore
echo "*.fastq*" >> .gitignore
echo "*.bam" >> .gitignore
echo "*.bam.bai" >> .gitignore

# Add files to staging
git add scripts/ config/ README.md .gitignore

# Commit changes
git commit -m "Initial project setup"

# View history
git log --oneline

# Check status
git status

What to Track vs Ignore#

| Track (git add) | Ignore (.gitignore) |
|-----------------|---------------------|
| Scripts (.sh, .py, .R) | Raw data (.fastq, .bam) |
| Configuration files | Processed data |
| Sample sheets | Large reference files |
| Documentation | Log files |
| Environment specs | Temporary files |
| Small result tables | Figures (can be regenerated) |

Sample .gitignore for Bioinformatics#

# Data files
*.fastq
*.fastq.gz
*.fq
*.fq.gz
*.bam
*.bam.bai
*.cram
*.vcf
*.vcf.gz
*.bcf

# Reference files
*.fasta
*.fa
*.fasta.fai
*.dict

# Indices
*.bt2
*.bwt
*.pac
*.ann
*.amb
*.sa

# Logs and temp
*.log
*.tmp
.snakemake/

# System files
.DS_Store
Thumbs.db

9. Hands-On Exercise#

Exercise: Organise the Training Data#

Create a properly organised project structure for analysing our sheep training data:

# Step 1: Create project structure
PROJECT="sheep_training_analysis"
mkdir -p $PROJECT/{config,data/{raw,reference,processed},scripts,envs,results/{figures,tables},docs}

# Step 2: Create README
cat > $PROJECT/README.md << 'EOF'
# Sheep Training Data Analysis

Analysis of sheep genomic data for bioinformatics training.

## Data

- FASTQ: SRR10532784 (truncated)
- BAM: u9_liver_100.bam (sheep liver RNA-seq)
- VCF: sheep.snp.vcf.gz (~300 samples)

## Methods

Training exercises for file format exploration.
EOF

# Step 3: Create sample sheet for the VCF samples
cat > $PROJECT/config/samples.tsv << 'EOF'
file_type	file_name	description
FASTQ	SRR10532784_1.fastq.gz	RNA-seq reads (truncated)
BAM	u9_liver_100.bam	Sheep liver aligned reads
VCF	sheep.snp.vcf.gz	Sheep population SNPs
EOF

# Step 4: Link the training data (or copy)
ln -s $(pwd)/training_datasets/* $PROJECT/data/raw/

# Step 5: Generate checksums
cd $PROJECT/data/raw
md5sum *.gz *.bam > checksums.md5

# Step 6: Create .gitignore
cat > $PROJECT/.gitignore << 'EOF'
data/raw/
data/processed/
*.fastq*
*.bam*
*.vcf*
*.log
EOF

# Step 7: View final structure
tree $PROJECT/

10. Summary#

In this session, we covered:

  1. Directory Structure: Separating raw data, code, and results

  2. Raw Data Protection: Checksums, read-only permissions, backups

  3. Naming Conventions: Machine-readable, descriptive file names

  4. Sample Metadata: Tracking samples with structured sheets

  5. Documentation: README files and analysis logs

  6. Environments: Conda for reproducible software stacks

  7. Version Control: Git basics for tracking changes

Checklist for Every New Project#

  • Create standard directory structure

  • Write README.md

  • Create sample sheet

  • Set up conda environment

  • Initialise git repository

  • Create .gitignore

  • Generate checksums for raw data

  • Make raw data read-only

Next Steps#

With your project properly organised, you are ready to begin the actual bioinformatics analysis. The following sessions will cover quality control, alignment, and downstream analysis.

Additional Resources#