Linux Command Line Essentials for Bioinformatics#
Welcome to KCGEB Computing#
At the Khalifa Centre for Genetic Engineering & Biotechnology (KCGEB), we process terabytes of genomic data daily. Mastering the Linux command line is essential for working efficiently with sequencing data, running analysis pipelines, and managing files on our high-performance computing cluster.
Learning Objectives#
By the end of this session, you will be able to:
Navigate the Linux file system confidently
Create, move, copy, and delete files and directories
View and manipulate text files
Use wildcards and pipes for efficient workflows
Apply these skills to common bioinformatics tasks at KCGEB
2. File and Directory Management#
At KCGEB, proper file organisation is critical. With multiple researchers working on different projects, keeping data organised prevents confusion and data loss.
Creating Directories and Files#
# Create a new project directory for a rice genome study
mkdir rice_genome_2024
# Create nested directories in one command
mkdir -p rice_genome_2024/data/raw rice_genome_2024/data/processed rice_genome_2024/results
# Create an empty file (useful for logs or placeholder files)
touch rice_genome_2024/README.txt
Copying and Moving Files#
# Copy a reference genome to your project
cp /data/kcgeb/references/rice_IR64.fasta rice_genome_2024/data/
# Copy an entire directory with all contents
cp -r /data/kcgeb/templates/pipeline_scripts rice_genome_2024/scripts/
# Move (rename) a file
mv sample1.fastq sample_001_R1.fastq
# Move files to a different directory
mv *.fastq.gz rice_genome_2024/data/raw/
Removing Files and Directories#
# Remove a file
rm temporary_file.txt
# Remove an empty directory
rmdir empty_folder
# Remove a directory and all its contents (use with caution!)
rm -r old_project
# Interactive removal (asks for confirmation)
rm -i important_file.txt
Warning
The rm command deletes files permanently: the command line has no recycle bin to restore them from. Always double-check the path before using rm -r.
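One cautious habit is to preview what a pattern matches before handing the same pattern to rm. The sketch below illustrates this in a throwaway directory; the file names are made up for the example and are not KCGEB files.

```shell
# Work in a scratch directory so nothing real is at risk (hypothetical file names)
workdir=$(mktemp -d)
touch "$workdir/tmp_a.txt" "$workdir/tmp_b.txt" "$workdir/keep_me.fasta"

# Step 1: preview what the pattern matches, using ls instead of rm
ls "$workdir"/tmp_*.txt

# Step 2: only after checking the list, delete with the same pattern
rm "$workdir"/tmp_*.txt

# Count what is left; only keep_me.fasta should remain
remaining=$(ls "$workdir" | wc -l)
echo "Files remaining: $remaining"

# Clean up the scratch directory
rm -r "$workdir"
```

Running ls first costs a few seconds and shows whether a wildcard is broader than intended.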
3. Viewing and Inspecting Files#
Bioinformatics files can be massive. A single whole-genome sequencing FASTQ file can be 50 GB or more. These commands help you peek at files without loading them entirely into memory.
Quick File Inspection#
# View the first 10 lines of a FASTQ file
head sample_001_R1.fastq
# View the first 20 lines
head -n 20 sample_001_R1.fastq
# View the last 10 lines (useful for checking whether a file was written completely)
tail sample_001_R1.fastq
# Watch a file in real-time (great for monitoring running jobs)
tail -f alignment.log
# View an entire small file
cat README.txt
# Page through a large file (use q to quit)
less reference_genome.fasta
Counting and Summarising#
# Count lines, words, and characters
wc sample_001_R1.fastq
# Output: 4000000 4500000 180000000 sample_001_R1.fastq
# Count only lines (divide by 4 to get number of reads in FASTQ)
wc -l sample_001_R1.fastq
# Output: 4000000 sample_001_R1.fastq (= 1,000,000 reads)
# Check file size
ls -lh sample_001_R1.fastq
# Output: -rw-r--r-- 1 trainee kcgeb 2.5G Jan 15 14:30 sample_001_R1.fastq
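The divide-by-4 step can also be done in the shell itself. This is a minimal sketch on a tiny made-up FASTQ file with three reads; the variable names are illustrative only.

```shell
# Build a tiny FASTQ file with 3 made-up reads (4 lines per read)
fastq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n@read3\nGGCC\n+\nIIII\n' > "$fastq"

# Reading from stdin (<) makes wc print only the number, without the file name
lines=$(wc -l < "$fastq")

# Shell integer arithmetic: one read = 4 lines
reads=$((lines / 4))
echo "$lines lines = $reads reads"

# Clean up
rm "$fastq"
```

This assumes a standard four-line-per-read FASTQ layout; files with wrapped sequence lines would need a different approach.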
4. Searching and Filtering#
Finding specific sequences or patterns in large files is a daily task at KCGEB. The grep command is your best friend.
Using grep for Pattern Matching#
# Find all headers in a FASTA file
grep ">" reference_genome.fasta
# Count the number of sequences in a FASTA file
grep -c ">" reference_genome.fasta
# Output: 12 (12 chromosomes)
# Find a specific gene in a GFF annotation file
grep "gene_id=Os01g0100100" rice_annotation.gff
# Case-insensitive search
grep -i "chromosome" reference_genome.fasta
# Show line numbers with matches
grep -n "ATGCATGC" sequences.fasta
# Find lines that do NOT match a pattern (here: skip VCF header lines, which start with #)
grep -v "^#" variants.vcf
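When filtering with grep -v, it matters whether the pattern is anchored to the start of the line with ^. The sketch below uses a made-up VCF fragment in which one ID field happens to contain a "#" character, to show the difference.

```shell
# Tiny made-up VCF: two header lines and two variant records
vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\n' > "$vcf"
printf 'chr1\t100\trs#1\nchr1\t200\t.\n' >> "$vcf"

# Unanchored: drops every line containing "#" anywhere, including the rs#1 record
unanchored=$(grep -v "#" "$vcf" | wc -l)

# Anchored with ^: drops only lines that START with "#", i.e. the header
anchored=$(grep -v "^#" "$vcf" | wc -l)

echo "unanchored: $unanchored record(s), anchored: $anchored record(s)"

# Clean up
rm "$vcf"
```

The anchored form keeps both variant records, while the unanchored form silently loses one.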
Finding Files#
# Find all FASTQ files in the current directory and subdirectories
find . -name "*.fastq"
# Find all files (not directories) modified in the last 7 days
find /data/kcgeb/projects -type f -mtime -7
# Find large files (over 1 GB)
find . -size +1G
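By default, find matches directories as well as files. A small sketch, using a made-up project tree, of how -type f narrows a name search to regular files:

```shell
# Build a small made-up project tree
proj=$(mktemp -d)
mkdir -p "$proj/data/raw" "$proj/results"
touch "$proj/data/raw/s1.fastq" "$proj/data/raw/s2.fastq" "$proj/results/notes.txt"
mkdir "$proj/old.fastq"   # a directory whose name also ends in .fastq

# -type f restricts matches to regular files, so the directory is skipped
n_files=$(find "$proj" -type f -name "*.fastq" | wc -l)

# Without -type f, the directory is matched as well
n_any=$(find "$proj" -name "*.fastq" | wc -l)

echo "regular files: $n_files, all matches: $n_any"

# Clean up
rm -r "$proj"
```

Quoting the pattern ("*.fastq") keeps the shell from expanding it before find sees it.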
5. Wildcards and Patterns#
Wildcards allow you to work with multiple files at once, saving time when processing batches of samples.
Common Wildcards#
| Wildcard | Meaning | Example |
|---|---|---|
| * | Matches any characters | |
| ? | Matches single character | |
| [ ] | Matches any character in brackets | |
| { } | Matches any pattern in braces | |
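The four wildcard forms can be tried safely in a scratch directory. The file names below are invented for illustration; the brace form is a bash feature and may not work in other shells.

```shell
# Scratch directory with made-up sample files
d=$(mktemp -d)
touch "$d/s1_R1.fastq" "$d/s2_R1.fastq" "$d/s10_R1.fastq" "$d/s1_R2.fastq"
touch "$d/ref.fasta" "$d/ref.gff"

# * matches any run of characters: every .fastq file
n_star=$(ls "$d"/*.fastq | wc -l)

# ? matches exactly one character: s1 and s2, but not s10
n_qmark=$(ls "$d"/s?_R1.fastq | wc -l)

# [12] matches one character from the set: R1 and R2 of sample s1
n_bracket=$(ls "$d"/s1_R[12].fastq | wc -l)

# {fasta,gff} expands to each listed alternative (bash brace expansion)
n_brace=$(ls "$d"/ref.{fasta,gff} | wc -l)

echo "star=$n_star qmark=$n_qmark bracket=$n_bracket brace=$n_brace"

# Clean up
rm -r "$d"
```

Note that ? excluding s10 is a common surprise when sample numbers grow past one digit.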
Practical Examples#
# List all FASTQ files (compressed or not)
ls *.fastq *.fastq.gz
# Move all R1 (forward) reads to a folder
mv *_R1*.fastq.gz data/forward/
# Count lines in all FASTQ files (divide each count by 4 to get the number of reads)
wc -l *.fastq
# Compress all FASTA files
gzip *.fasta
6. Pipes and Redirection#
Pipes (|) connect commands together, passing the output of one command as input to the next. This is the essence of the Unix philosophy: small tools that do one thing well, combined to perform complex tasks.
Using Pipes#
# Count sequences in a FASTA file
grep ">" reference.fasta | wc -l
# List the 10 largest BAM files in the current directory
ls -lS *.bam | head -n 10
# Extract chromosome names and sort them
grep ">" reference.fasta | cut -d " " -f1 | sort
# Count feature types (column 3) in a GFF file, skipping comment lines
grep -v "^#" annotation.gff | cut -f3 | sort | uniq -c | sort -rn
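To see what each stage of a counting pipeline contributes, it helps to run it on a file small enough to check by eye. The GFF records below are made up for this sketch.

```shell
# Tiny made-up GFF: one comment line, one gene, two exons (tab-separated)
gff=$(mktemp)
printf '##gff-version 3\n' > "$gff"
printf 'chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=g1\n' >> "$gff"
printf 'chr1\tsrc\texon\t1\t300\t.\t+\t.\tID=e1\n' >> "$gff"
printf 'chr1\tsrc\texon\t500\t900\t.\t+\t.\tID=e2\n' >> "$gff"

# Stage by stage: drop comments, keep column 3, sort so identical values
# are adjacent, count each group, then order by count (largest first)
counts=$(grep -v "^#" "$gff" | cut -f3 | sort | uniq -c | sort -rn)
echo "$counts"

# Most frequent feature type = second field of the first output line
top=$(echo "$counts" | head -n 1 | awk '{print $2}')
echo "Most common feature: $top"

# Clean up
rm "$gff"
```

The first sort is required because uniq only merges lines that are next to each other.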
Redirection#
# Save output to a file (overwrites existing file)
grep ">" reference.fasta > chromosome_headers.txt
# Append output to a file
echo "Analysis completed at $(date)" >> analysis.log
# Redirect errors (stderr) to a file; regular output still goes to the terminal
bwa mem reference.fasta reads.fastq 2> alignment_errors.log
# Redirect output and errors to separate files
bwa mem reference.fasta reads.fastq > aligned.sam 2> alignment.log
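The two output streams can be explored without running a real aligner. In this sketch, fake_tool is a hypothetical stand-in function that writes one line to stdout and one to stderr.

```shell
# Hypothetical stand-in for a tool such as an aligner:
# results go to stdout, progress messages go to stderr
fake_tool() {
    echo "RESULT line"
    echo "progress: 50%" >&2
}

out=$(mktemp)
log=$(mktemp)
both=$(mktemp)

# > captures stdout, 2> captures stderr, each in its own file
fake_tool > "$out" 2> "$log"

# 2>&1 placed after the file redirection sends both streams to the same file
fake_tool > "$both" 2>&1

out_text=$(cat "$out")
log_text=$(cat "$log")
both_lines=$(wc -l < "$both")
echo "stdout file: $out_text | stderr file: $log_text | combined lines: $both_lines"

# Clean up
rm "$out" "$log" "$both"
```

Keeping the streams separate is usually the safer choice when stdout is data, since log messages mixed into a results file can corrupt it.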
7. Working with Compressed Files#
Sequencing data is almost always compressed to save storage space. At KCGEB, we use gzip compression for most files.
Compression Commands#
# Compress a file (replaces original with .gz version)
gzip sample_001.fastq
# Decompress a file
gunzip sample_001.fastq.gz
# Keep the original file while compressing
gzip -k sample_001.fastq
# View compressed file without decompressing (use gzcat on macOS)
zcat sample_001.fastq.gz | head
gzcat sample_001.fastq.gz | head # macOS
# Search in compressed files
zgrep "@SRR" sample_001.fastq.gz
# Count lines in a compressed file (use gzcat on macOS)
zcat sample_001.fastq.gz | wc -l
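A quick round trip on a small made-up file confirms that compression loses nothing. This sketch uses gzip -c and gzip -dc, which write to standard output and behave the same wherever gzip is installed, so the zcat/gzcat naming difference does not come up.

```shell
# Scratch directory with a tiny made-up FASTQ file (2 reads)
d=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$d/demo.fastq"

# -c writes the compressed data to stdout, so the original file is kept
gzip -c "$d/demo.fastq" > "$d/demo.fastq.gz"

# -dc decompresses to stdout without creating an uncompressed file on disk
gz_lines=$(gzip -dc "$d/demo.fastq.gz" | wc -l)
plain_lines=$(wc -l < "$d/demo.fastq")

echo "compressed: $gz_lines lines, original: $plain_lines lines"

# Clean up
rm -r "$d"
```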
8. File Permissions#
On shared systems like the KCGEB cluster, understanding permissions ensures your data is protected while allowing collaborators appropriate access.
Understanding Permission Strings#
-rw-r--r-- 1 trainee kcgeb 2.5G Jan 15 14:30 sample.fastq
| Position | Meaning |
|---|---|
| - | File type (- = file, d = directory) |
| rw- | Owner permissions (read, write, no execute) |
| r-- | Group permissions (read only) |
| r-- | Others permissions (read only) |
Changing Permissions#
# Make a script executable
chmod +x run_pipeline.sh
# Give group members read and write access
chmod g+rw project_data/
# Remove read access for others
chmod o-r sensitive_data.txt
# Set specific permissions (owner: rwx, group: rx, others: none)
chmod 750 scripts/
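The numeric form of chmod can be checked by reading the mode back. This sketch assumes GNU coreutils stat (the usual case on Linux); on macOS the stat options differ. The file name is only a placeholder.

```shell
# Scratch directory with a placeholder script file
d=$(mktemp -d)
touch "$d/run_pipeline.sh"

# Each octal digit is the sum of read=4, write=2, execute=1
# 7 = 4+2+1 (rwx) for owner, 5 = 4+1 (r-x) for group, 0 = no access for others
chmod 750 "$d/run_pipeline.sh"

# GNU stat: %a prints the octal mode, %A the symbolic permission string
octal=$(stat -c '%a' "$d/run_pipeline.sh")
symbolic=$(stat -c '%A' "$d/run_pipeline.sh")
echo "$octal $symbolic"

# Clean up
rm -r "$d"
```

Reading the symbolic string next to the octal value is a simple way to learn the mapping between the two notations.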
9. Hands-On Exercise#
Using the training datasets, practise your Linux skills:
Exercise: Working with the Training Data#
# 1. Navigate to the training datasets directory
cd training_datasets
ls -lh
# 2. Count how many lines are in the compressed FASTQ file
# (on macOS, use gzcat in place of zcat throughout this exercise)
zcat SRR10532784_1.fastq.gz | wc -l
# Divide by 4 to get number of reads
# 3. Extract the first 100 reads (400 lines) to a new file
zcat SRR10532784_1.fastq.gz | head -n 400 > first_100_reads.fastq
# 4. Count the reads that contain the sequence "ATCGATCG"
zcat SRR10532784_1.fastq.gz | grep -c "ATCGATCG"
# 5. View the VCF header and count the number of samples
bcftools view -h sheep.snp.vcf.gz | tail -1 | tr '\t' '\n' | tail -n +10 | wc -l
# 6. Use samtools to view BAM statistics
samtools flagstat u9_liver_100.bam
# 7. Create a summary report
echo "Training Data Summary" > summary.txt
echo "===================" >> summary.txt
echo "FASTQ reads: $(zcat SRR10532784_1.fastq.gz | wc -l | awk '{print $1/4}')" >> summary.txt
echo "BAM alignments: $(samtools view -c u9_liver_100.bam)" >> summary.txt
cat summary.txt
10. Summary#
In this session, we covered essential Linux commands for bioinformatics work at KCGEB:
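| Category | Commands |
|---|---|
| Navigation | pwd, ls, cd |
| File Management | mkdir, touch, cp, mv, rm, rmdir |
| Viewing Files | head, tail, cat, less, wc |
| Searching | grep, find |
| Compression | gzip, gunzip, zcat, zgrep |
| Permissions | chmod |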
Key Takeaways#
Always know where you are (pwd) before making changes
Use ls -l to check file details before operations
Be careful with rm: there is no undo
Combine commands with pipes for powerful workflows
Compress files to save storage space
Next Session#
In the next session, we will explore NCBI and UCSC databases, learning how to download reference genomes, annotations, and public sequencing data for your analyses.