Organising Bioinformatics Projects#

Learning Objectives#

By the end of this session, you will be able to:

  1. Design a logical directory structure for bioinformatics projects

  2. Apply consistent naming conventions for files and samples

  3. Document your analysis workflow effectively

  4. Implement version control basics with Git

  5. Create reproducible analysis environments

1. Why Organisation Matters#

Poor project organisation leads to:

  • Lost or overwritten data

  • Inability to reproduce results

  • Wasted time searching for files

  • Confusion when collaborating

  • Difficulty publishing or sharing work

The Reproducibility Crisis#

Studies show that a significant portion of published bioinformatics analyses cannot be reproduced, even by the original authors. Good organisation is the foundation of reproducible research.

Principles of Good Organisation#

  1. Separation: Keep raw data, code, and results separate

  2. Documentation: Record what you did and why

  3. Consistency: Use the same structure across projects

  4. Automation: Reduce manual steps that introduce errors

  5. Version control: Track changes to code and documents
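The automation principle in miniature: number your step scripts and run them in order with a single driver command, so no step depends on manual intervention. A minimal sketch (the two demo scripts are placeholders, not part of any real pipeline):

```shell
#!/bin/bash
# Sketch of the automation principle: numbered step scripts executed in
# order by one driver. The two demo scripts below are placeholders.
set -euo pipefail

mkdir -p scripts
printf '%s\n' '#!/bin/bash' 'echo "step 1: quality control"' > scripts/01_qc.sh
printf '%s\n' '#!/bin/bash' 'echo "step 2: trimming"'        > scripts/02_trim.sh

# Run every numbered step in lexical order; set -e aborts on first failure
for step in scripts/0*_*.sh; do
    bash "$step"
done
```

Because the steps sort lexically (01_, 02_, ...), the leading-zero naming convention from the principles above is what makes this loop run them in the right order.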

2. Standard Directory Structure#

Setting Up a New Project#

# Create project structure in one command
mkdir -p my_project/{config,data/{raw/fastq,reference,processed/{trimmed,aligned}},scripts/utils,envs,results/{figures,tables,reports},docs,notebooks}

# Create essential files
touch my_project/README.md
touch my_project/config/samples.tsv
touch my_project/docs/analysis_log.md

# View the structure
tree my_project/

3. Protecting Raw Data#

Raw data is irreplaceable. Once modified or deleted, it may be impossible to recover.

Golden Rules for Raw Data#

  1. Never modify raw files - always work on copies

  2. Store checksums - verify data integrity

  3. Make read-only - prevent accidental changes

  4. Back up - store copies in multiple locations

Implementing Data Protection#

# Generate checksums for all raw FASTQ files
cd data/raw/fastq/
md5sum *.fastq.gz > checksums.md5

# Verify checksums later
md5sum -c checksums.md5

# Make raw data read-only
chmod -R a-w data/raw/

# If you need to add more raw data later
chmod u+w data/raw/
# ... add files ...
chmod -R a-w data/raw/

Sample Checksum File#

# checksums.md5
a1b2c3d4e5f60718293a4b5c6d7e8f90  sample_001_R1.fastq.gz
b2c3d4e5f60718293a4b5c6d7e8f90a1  sample_001_R2.fastq.gz
c3d4e5f60718293a4b5c6d7e8f90a1b2  sample_002_R1.fastq.gz
d4e5f60718293a4b5c6d7e8f90a1b2c3  sample_002_R2.fastq.gz

4. File Naming Conventions#

Good file names are descriptive, consistent, and machine-readable.

Naming Guidelines#

| Rule | Bad example | Good example |
|------|-------------|--------------|
| No spaces | sample 1.fastq | sample_001.fastq |
| No special characters | sample#1.fastq | sample_001.fastq |
| Use underscores or hyphens | sampleone.fastq | sample_001.fastq |
| Include leading zeros | sample_1.fastq | sample_001.fastq |
| Use ISO dates | sample_15-1-24.fastq | sample_2024-01-15.fastq |
| Be descriptive | data.bam | sheep_liver_001_aligned.bam |
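The rules above can be enforced with a small shell check before files enter the project. A sketch, assuming a permitted character set of lowercase letters, digits, dot, underscore, and hyphen (widen it if your project allows more):

```shell
#!/bin/bash
# Sketch: reject file names that break the naming conventions.
# The allowed character set is an assumption; adjust to your project.
valid_name() {
    case "$1" in
        *[!a-z0-9._-]*) return 1 ;;  # spaces, #, uppercase, etc. all fail
        *)              return 0 ;;
    esac
}

valid_name "sample_001.fastq" && echo "sample_001.fastq: OK"
valid_name "sample 1.fastq"   || echo "sample 1.fastq: rename needed"
```

Run such a check when new files arrive, before generating checksums, so bad names never propagate into downstream scripts.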

5. Sample Metadata Management#

A well-organised sample sheet is essential for tracking samples through analysis.

Sample Sheet Format (TSV/CSV)#

sample_id	condition	replicate	read1	read2	batch	sequencing_date
sheep_L_1	liver	1	data/raw/sheep_L_1_R1.fastq.gz	data/raw/sheep_L_1_R2.fastq.gz	batch1	2024-01-15
sheep_L_2	liver	2	data/raw/sheep_L_2_R1.fastq.gz	data/raw/sheep_L_2_R2.fastq.gz	batch1	2024-01-15
sheep_M_1	muscle	1	data/raw/sheep_M_1_R1.fastq.gz	data/raw/sheep_M_1_R2.fastq.gz	batch1	2024-01-15
sheep_M_2	muscle	2	data/raw/sheep_M_2_R1.fastq.gz	data/raw/sheep_M_2_R2.fastq.gz	batch1	2024-01-15

Essential Metadata Fields#

| Field | Description | Example |
|-------|-------------|---------|
| sample_id | Unique identifier | sheep_L_1 |
| condition | Experimental group | liver, muscle, treated |
| replicate | Biological replicate number | 1, 2, 3 |
| batch | Processing batch | batch1, batch2 |
| sequencing_date | When sequenced | 2024-01-15 |
| platform | Sequencing platform | Illumina NovaSeq |
| library_type | Library preparation | RNA-Seq, WGS |

Tips for Sample Management#

  1. Create the sample sheet before starting analysis

  2. Store it in version control

  3. Never manually rename files - use the sample sheet to link IDs

  4. Include all relevant experimental factors for downstream analysis
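Tip 3 in practice: rather than renaming files, loop over the sample sheet and let each row supply the paths. A minimal sketch (the inline demo sheet and the echoed hisat2 command line are illustrative, not a prescribed pipeline):

```shell
#!/bin/bash
# Sketch: drive per-sample commands from a sample sheet instead of
# renaming files. The demo sheet and echoed command are placeholders.
cat > samples_demo.tsv << 'EOF'
sample_id	condition	read1	read2
sheep_L_1	liver	data/raw/sheep_L_1_R1.fastq.gz	data/raw/sheep_L_1_R2.fastq.gz
sheep_M_1	muscle	data/raw/sheep_M_1_R1.fastq.gz	data/raw/sheep_M_1_R2.fastq.gz
EOF

# tail -n +2 skips the header row; read splits each line on tabs
tail -n +2 samples_demo.tsv | while IFS=$'\t' read -r sample condition read1 read2; do
    echo "would run: hisat2 -x index -1 $read1 -2 $read2 -S ${sample}.sam"
done
```

Because the sheet, not the filesystem, links sample IDs to files, adding a sample means adding one row, and the raw files keep their original names.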

6. Documentation Best Practices#

The README File#

Every project should have a README.md explaining:

# Project Title

Brief description of the project and its goals.

## Data

- Source of raw data
- Number of samples
- Sequencing platform and parameters

## Methods

Overview of the analysis pipeline.

## Directory Structure

Brief explanation of folder organisation.

## Usage

How to reproduce the analysis.

## Dependencies

Required software and versions.

## Authors

Who worked on this project.

## Licence

Terms for using this work.

Analysis Log#

Keep a running log of your analysis:

# Analysis Log

## 2024-01-15

### Quality Control
- Ran FastQC on all samples
- Sample sheep_003 showed adapter contamination
- Decision: Include in trimming, monitor downstream

### Reference Download
- Downloaded Oar_v4.0 from NCBI
- Accession: GCF_000298735.2
- Command: wget [URL]

## 2024-01-16

### Trimming
- Used Trimmomatic v0.39
- Parameters: ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3
- Results: 95% reads retained on average

7. Managing Software Environments#

Reproducibility requires recording exact software versions.

Using Conda Environments#

# envs/environment.yaml
name: sheep_rnaseq
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.10
  - bwa=0.7.17
  - samtools=1.17
  - bcftools=1.17
  - fastqc=0.12.1
  - multiqc=1.14
  - hisat2=2.2.1

Creating and Using Environments#

# Create environment from file
conda env create -f envs/environment.yaml

# Activate environment
conda activate sheep_rnaseq

# Export current environment (for sharing)
conda env export > envs/environment_exported.yaml

# List installed packages with versions
conda list > envs/package_versions.txt

Recording Software Versions in Scripts#

#!/bin/bash
# 01_alignment.sh

# Log software versions
echo "=== Software Versions ===" > alignment.log
bwa 2>&1 | head -3 >> alignment.log
samtools --version | head -2 >> alignment.log
echo "========================" >> alignment.log

# Run alignment
bwa mem reference.fasta reads_R1.fastq reads_R2.fastq > aligned.sam 2>> alignment.log

8. Version Control with Git#

Git tracks changes to your code and documents, allowing you to revert mistakes and collaborate.

Essential Git Commands#

# Initialise a new repository
cd my_project/
git init

# Create .gitignore to exclude large data files
echo "data/raw/" >> .gitignore
echo "data/processed/" >> .gitignore
echo "*.fastq*" >> .gitignore
echo "*.bam" >> .gitignore
echo "*.bam.bai" >> .gitignore

# Add files to staging
git add scripts/ config/ README.md .gitignore

# Commit changes
git commit -m "Initial project setup"

# View history
git log --oneline

# Check status
git status

What to Track vs Ignore#

| Track (git add) | Ignore (.gitignore) |
|-----------------|---------------------|
| Scripts (.sh, .py, .R) | Raw data (.fastq, .bam) |
| Configuration files | Processed data |
| Sample sheets | Large reference files |
| Documentation | Log files |
| Environment specs | Temporary files |
| Small result tables | Figures (can be regenerated) |

Sample .gitignore for Bioinformatics#

# Data files
*.fastq
*.fastq.gz
*.fq
*.fq.gz
*.bam
*.bam.bai
*.cram
*.vcf
*.vcf.gz
*.bcf

# Reference files
*.fasta
*.fa
*.fasta.fai
*.dict

# Indices
*.bt2
*.bwt
*.pac
*.ann
*.amb
*.sa

# Logs and temp
*.log
*.tmp
.snakemake/

# System files
.DS_Store
Thumbs.db

9. Hands-On Exercise#

Exercise: Organise the Training Data#

Create a properly organised project structure for analysing our sheep training data:

# Step 1: Create project structure
PROJECT="sheep_training_analysis"
mkdir -p $PROJECT/{config,data/{raw,reference,processed},scripts,envs,results/{figures,tables},docs}

# Step 2: Create README
cat > $PROJECT/README.md << 'EOF'
# Sheep Training Data Analysis

Analysis of sheep genomic data for bioinformatics training.

## Data

- FASTQ: SRR10532784 (truncated)
- BAM: u9_liver_100.bam (sheep liver RNA-seq)
- VCF: sheep.snp.vcf.gz (~300 samples)

## Methods

Training exercises for file format exploration.
EOF

# Step 3: Create sample sheet for the VCF samples
cat > $PROJECT/config/samples.tsv << 'EOF'
file_type	file_name	description
FASTQ	SRR10532784_1.fastq.gz	RNA-seq reads (truncated)
BAM	u9_liver_100.bam	Sheep liver aligned reads
VCF	sheep.snp.vcf.gz	Sheep population SNPs
EOF

# Step 4: Link the training data (or copy)
ln -s $(pwd)/training_datasets/* $PROJECT/data/raw/

# Step 5: Generate checksums
cd $PROJECT/data/raw
md5sum *.gz *.bam > checksums.md5

# Step 6: Create .gitignore
cat > $PROJECT/.gitignore << 'EOF'
data/raw/
data/processed/
*.fastq*
*.bam*
*.vcf*
*.log
EOF

# Step 7: View final structure
tree $PROJECT/

10. Summary#

In this session, we covered:

  1. Directory Structure: Separating raw data, code, and results

  2. Raw Data Protection: Checksums, read-only permissions, backups

  3. Naming Conventions: Machine-readable, descriptive file names

  4. Sample Metadata: Tracking samples with structured sheets

  5. Documentation: README files and analysis logs

  6. Environments: Conda for reproducible software stacks

  7. Version Control: Git basics for tracking changes

Checklist for Every New Project#

  • Create standard directory structure

  • Write README.md

  • Create sample sheet

  • Set up conda environment

  • Initialise git repository

  • Create .gitignore

  • Generate checksums for raw data

  • Make raw data read-only

Next Steps#

With your project properly organised, you are ready to begin the actual bioinformatics analysis. The following sessions will cover quality control, alignment, and downstream analysis.

Additional Resources#