Navigating NCBI and UCSC Genome Databases#
Learning Objectives#
By the end of this session, you will be able to:
Navigate the NCBI database suite to find genomic resources
Search for and download reference genomes and annotations
Use the UCSC Genome Browser to visualise genomic regions
Download public sequencing data from the SRA
Retrieve gene and protein information programmatically
1. Introduction to Biological Databases#
Biological databases are repositories that store and organise vast amounts of genomic, proteomic, and other biological data. Two of the most widely used resources are:
NCBI (National Center for Biotechnology Information): A comprehensive collection of databases including GenBank, RefSeq, SRA, and more
UCSC Genome Browser: A powerful visualisation tool and database for genome annotations
Why Use These Databases?#
| Task | Database |
|---|---|
| Download reference genomes | NCBI RefSeq, UCSC |
| Find gene sequences | NCBI Gene, UCSC |
| Access raw sequencing data | NCBI SRA |
| Visualise genomic regions | UCSC Genome Browser |
| Find protein sequences | NCBI Protein, UniProt |
| Identify genetic variants | dbSNP, ClinVar |
2. NCBI Database Overview#
NCBI hosts over 40 interconnected databases. Here are the most relevant for bioinformatics:
Core NCBI Databases#
| Database | URL | Purpose |
|---|---|---|
| GenBank | ncbi.nlm.nih.gov/genbank | Primary nucleotide sequence repository |
| RefSeq | ncbi.nlm.nih.gov/refseq | Curated reference sequences |
| SRA | ncbi.nlm.nih.gov/sra | Sequence Read Archive (raw data) |
| Gene | ncbi.nlm.nih.gov/gene | Gene-centred information |
| Assembly | ncbi.nlm.nih.gov/assembly | Genome assemblies |
| Taxonomy | ncbi.nlm.nih.gov/taxonomy | Organism classification |
GenBank vs RefSeq#
Understanding the difference is crucial:
| Feature | GenBank | RefSeq |
|---|---|---|
| Submission | Anyone can submit | NCBI curated |
| Redundancy | Contains duplicates | Non-redundant |
| Quality | Variable | High quality, reviewed |
| Accession prefix | Various (e.g., AB, AY) | NM_, NR_, XM_, NC_ |
| Best for | All available sequences | Reference analyses |
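The accession prefixes in the table can be checked mechanically. The following is a minimal sketch, assuming that RefSeq accessions always begin with two capital letters and an underscore, as in the examples above. The function name `accession_source` and the two accession strings are made up for illustration; the check looks only at the shape of the string and does not query NCBI.

```shell
# Hypothetical helper: guess the source database from an accession's shape.
# Assumption: RefSeq accessions look like NM_..., NR_..., XM_..., NC_...;
# anything else is labelled GenBank-style. This is a format check only.
accession_source() {
    case "$1" in
        [A-Z][A-Z]_[0-9]*) echo "RefSeq" ;;
        *)                 echo "GenBank-style" ;;
    esac
}

# Made-up accession strings, used only to exercise the function
accession_source NM_123456.1   # prints RefSeq
accession_source AB123456.1    # prints GenBank-style
```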
3. Searching NCBI#
Using the Web Interface#
Step 1: Go to ncbi.nlm.nih.gov
Step 2: Select the appropriate database from the dropdown menu
Step 3: Enter your search query
Search Query Examples#
Finding a Reference Genome (Assembly Database):
Search: "Ovis aries"[Organism] AND "reference genome"[Filter]
Finding a Gene (Gene Database):
Search: MSTN[Gene Name] AND "Ovis aries"[Organism]
Finding Sequencing Data (SRA Database):
Search: "RNA-Seq"[Strategy] AND "Ovis aries"[Organism] AND "liver"[All Fields]
Using Search Filters#
NCBI supports advanced search syntax:
| Syntax | Example | Description |
|---|---|---|
| [Organism] | "Ovis aries"[Organism] | Filter by species |
| [Gene Name] | MSTN[Gene Name] | Search for a specific gene |
| [Title] | muscle[Title] | Search in title field |
| AND, OR, NOT | sheep AND muscle NOT fat | Boolean operators |
| [Filter] | refseq[Filter] | Apply specific filters |
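The same query syntax can be assembled in a script and passed to NCBI's E-utilities `esearch` endpoint. The sketch below is illustrative rather than a recommended client: `ncbi_and` and `esearch_ids` are hypothetical helper names, and `esearch_ids` is defined but not called because it needs network access and `curl`.

```shell
# Join fielded search terms with the Boolean operator AND.
ncbi_and() {
    term="$1"; shift
    for part in "$@"; do
        term="$term AND $part"
    done
    printf '%s\n' "$term"
}

# Sketch of an E-utilities search (assumes curl and network access).
# $1 = database name (e.g. gene), $2 = query string.
# curl -G sends the URL-encoded fields as a GET query string.
esearch_ids() {
    curl -sG "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi" \
        --data-urlencode "db=$1" \
        --data-urlencode "term=$2" \
        --data-urlencode "retmode=json"
}

query=$(ncbi_and 'MSTN[Gene Name]' '"Ovis aries"[Organism]')
echo "$query"   # prints MSTN[Gene Name] AND "Ovis aries"[Organism]

# Example call, not run here:
# esearch_ids gene "$query"
```

Building the term separately from the request makes it easy to paste the same string into the web search box to check that both routes return the same records.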
4. Downloading Reference Genomes#
From NCBI Assembly Database#
Example: Downloading the Sheep Reference Genome (Oar_v4.0)
Search: "Ovis aries"[Organism] AND "reference genome"[Filter]
Click on the assembly (e.g., Oar_v4.0)
Click “Download Assembly” button
Select file types:
*_genomic.fna.gz - FASTA sequence
*_genomic.gff.gz - Gene annotations
*_genomic.gtf.gz - GTF annotations
Using Command Line (wget/curl)#
# Download sheep reference genome
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/298/735/GCF_000298735.2_Oar_v4.0/GCF_000298735.2_Oar_v4.0_genomic.fna.gz
# Download corresponding annotation
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/298/735/GCF_000298735.2_Oar_v4.0/GCF_000298735.2_Oar_v4.0_genomic.gff.gz
# Decompress
gunzip GCF_000298735.2_Oar_v4.0_genomic.fna.gz
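The long FTP paths above follow a regular pattern, so they can be rebuilt from the accession and assembly name instead of being copied by hand. This is a sketch under one assumption: that other assemblies use the same layout as the two URLs above (accession prefix, then the nine digits split into three groups of three). `assembly_dir` is a hypothetical helper name.

```shell
# Hypothetical helper: rebuild the FTP directory of an assembly.
# $1 = accession (e.g. GCF_000298735.2), $2 = assembly name (e.g. Oar_v4.0)
# The ( ... ) body runs in a subshell, so the variables stay local.
assembly_dir() (
    acc="$1"; name="$2"
    prefix=${acc%%_*}                        # GCF
    digits=${acc#*_}; digits=${digits%%.*}   # 000298735
    a=$(printf '%s' "$digits" | cut -c1-3)
    b=$(printf '%s' "$digits" | cut -c4-6)
    c=$(printf '%s' "$digits" | cut -c7-9)
    printf 'https://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s/%s/%s_%s\n' \
        "$prefix" "$a" "$b" "$c" "$acc" "$name"
)

dir=$(assembly_dir GCF_000298735.2 Oar_v4.0)
echo "$dir/GCF_000298735.2_Oar_v4.0_genomic.fna.gz"
```

The printed URL matches the first `wget` command above. NCBI assembly directories also contain an `md5checksums.txt` file, which can be used to confirm that a download completed intact.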
Using NCBI Datasets Tool#
NCBI provides a command-line tool called datasets for easier downloads:
# Install NCBI datasets (conda)
conda install -c conda-forge ncbi-datasets-cli
# Download genome by accession
datasets download genome accession GCF_000298735.2 --include genome,gff3
# Download by organism name
datasets download genome taxon "Ovis aries" --reference
5. Accessing the Sequence Read Archive (SRA)#
The SRA stores raw sequencing data from published studies. This is invaluable for:
Replicating published analyses
Meta-analyses across studies
Training and testing pipelines
Understanding SRA Accession Numbers#
| Prefix | Level | Example | Description |
|---|---|---|---|
| SRP/ERP/DRP | Study | SRP012345 | Entire project |
| SRS/ERS/DRS | Sample | SRS123456 | Biological sample |
| SRX/ERX/DRX | Experiment | SRX123456 | Library/experiment |
| SRR/ERR/DRR | Run | SRR1234567 | Actual sequencing run |
Our training FASTQ file SRR10532784 is an SRA run accession.
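The prefix table translates directly into a small lookup, which is handy when a script receives a mixed list of accessions. A minimal sketch; `sra_level` is a hypothetical function name, and it classifies by prefix only without checking that the accession exists.

```shell
# Report the SRA hierarchy level implied by an accession's prefix.
# The first letter is S, E or D; the third letter gives the level.
sra_level() {
    case "$1" in
        [SED]RP[0-9]*) echo "study" ;;
        [SED]RS[0-9]*) echo "sample" ;;
        [SED]RX[0-9]*) echo "experiment" ;;
        [SED]RR[0-9]*) echo "run" ;;
        *)             echo "unknown" ;;
    esac
}

sra_level SRR10532784   # prints run
sra_level SRP012345     # prints study
```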
Finding Data in SRA#
Example: Finding RNA-Seq data for sheep liver
Go to ncbi.nlm.nih.gov/sra
Search: "Ovis aries"[Organism] AND "RNA-Seq"[Strategy] AND liver[All Fields]
Filter by:
Source: TRANSCRIPTOMIC
Platform: ILLUMINA
Access: Public
Downloading SRA Data#
# Install SRA Toolkit
conda install -c bioconda sra-tools
# Download and convert to FASTQ (recommended method)
fasterq-dump SRR10532784
# For paired-end data, this creates:
# SRR10532784_1.fastq (forward reads)
# SRR10532784_2.fastq (reverse reads)
# Download multiple runs, one accession per call
for run in SRR10532784 SRR10532785 SRR10532786; do fasterq-dump "$run"; done
# Compress the output
gzip *.fastq
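Before compressing, it can be worth confirming that the conversion produced complete files. The sketch below counts reads in an uncompressed FASTQ file, assuming the usual four lines per record; `fastq_reads` is a hypothetical helper, and the toy file stands in for real data.

```shell
# Count reads in an uncompressed FASTQ file (four lines per record assumed).
fastq_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Toy file with two records, created only to demonstrate the function
tmp=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$tmp"
fastq_reads "$tmp"   # prints 2
rm -f "$tmp"
```

For paired-end data, running the function on the `_1.fastq` and `_2.fastq` files should give the same number; a mismatch suggests an interrupted download or conversion.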
6. UCSC Genome Browser#
The UCSC Genome Browser (genome.ucsc.edu) is a powerful tool for visualising genomic data and downloading annotations.
Key Features#
Interactive genome visualisation
Multiple annotation tracks
Custom track upload
Table Browser for data export
BLAT sequence search
Navigating to a Region#
Example: Viewing the MSTN (myostatin) gene in sheep
Go to genome.ucsc.edu
Click “Genome Browser” or “Genomes”
Select assembly: Sheep (oviAri4)
In the search box, enter: MSTN
Click “Go”
Position Format#
UCSC uses specific position formats:
chr2:118,171,687-118,180,018 # Chromosome 2, specific coordinates
chr2:118171687-118180018 # Without commas
MSTN # Gene name
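Positions typed into the browser are 1-based and inclusive, whereas BED files use 0-based starts. When copying a browser position into a BED file, the start therefore drops by one and the end stays the same. A minimal sketch of that conversion, assuming the chromosome name contains no colon or hyphen; `position_to_bed` is a hypothetical helper name.

```shell
# Convert a UCSC browser position (with or without commas) to BED fields.
# Browser: 1-based, inclusive.  BED: 0-based start, end unchanged.
position_to_bed() {
    printf '%s\n' "$1" | tr -d ',' |
        awk -F'[:-]' '{ printf "%s\t%d\t%d\n", $1, $2 - 1, $3 }'
}

position_to_bed "chr2:118,171,687-118,180,018"
# prints chr2, 118171686 and 118180018 separated by tabs
```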
7. UCSC Table Browser#
The Table Browser allows you to download annotations and sequences in various formats.
Accessing the Table Browser#
Select your assembly and track
Define region and output format
Download
Download Examples#
Example 1: Download all RefSeq genes as a BED file
Assembly: Sheep oviAri4
Group: Genes and Gene Predictions
Track: NCBI RefSeq
Table: refGene
Region: genome
Output format: BED
Command Line Downloads#
# Download chromosome sizes for sheep
wget https://hgdownload.soe.ucsc.edu/goldenPath/oviAri4/bigZips/oviAri4.chrom.sizes
# Download 2bit genome file (compact format)
wget https://hgdownload.soe.ucsc.edu/goldenPath/oviAri4/bigZips/oviAri4.2bit
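A chrom.sizes file is plain text with one sequence name and its length per line, separated by a tab, so it can be summarised with standard tools. The sketch below totals the lengths; `genome_size` is a hypothetical helper, and the toy file stands in for the real download.

```shell
# Sum the second column of a chrom.sizes file (name<TAB>length per line).
# %.0f avoids integer overflow in awk implementations with 32-bit %d.
genome_size() {
    awk -F'\t' '{ total += $2 } END { printf "%.0f\n", total }' "$1"
}

# Toy file used only to demonstrate the function
tmp=$(mktemp)
printf 'chr1\t1000\nchr2\t500\n' > "$tmp"
genome_size "$tmp"   # prints 1500
rm -f "$tmp"
```

After the download above, `genome_size oviAri4.chrom.sizes` would report the total assembly length in bases.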
8. BLAT Sequence Search#
BLAT (BLAST-Like Alignment Tool) quickly maps sequences to a genome. It is faster than BLAST for finding locations of known sequences.
Using BLAT#
Select assembly
Paste your sequence
Click “Submit”
Example: Finding a Primer Location#
You have designed a PCR primer and want to verify its location:
Primer sequence: ATGCGATCGATCGATCGATCG
BLAT will show:
Chromosome and coordinates
Strand orientation
Alignment score
Number of mismatches
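When BLAT reports a hit on the minus strand, the genome sequence at that location is the reverse complement of the primer. A small sketch for producing it with standard tools; `revcomp` is a hypothetical helper name, and it handles only the four unambiguous bases.

```shell
# Reverse-complement a DNA sequence (A, C, G, T only, either case).
revcomp() {
    printf '%s\n' "$1" | tr 'ACGTacgt' 'TGCAtgca' |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }'
}

revcomp ATGCGATCGATCGATCGATCG   # prints CGATCGATCGATCGATCGCAT
```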
When to Use BLAT vs BLAST#
| Use BLAT | Use BLAST |
|---|---|
| Finding known sequence in genome | Finding similar sequences |
| Mapping primers or probes | Identifying homologues |
| Quick lookups | Sensitive searches |
| Same species | Cross-species |
9. Hands-On Exercise#
Exercise: Tracing the Origin of Our Training Data#
Our training datasets include sheep genomic data. Let’s trace where they came from:
Task 1: Find information about SRR10532784
Go to NCBI SRA: ncbi.nlm.nih.gov/sra
Search: SRR10532784
Answer these questions:
What organism is this from?
What type of sequencing (RNA-Seq, WGS, etc.)?
What tissue/sample type?
What sequencing platform was used?
Task 2: Find the sheep reference genome
Go to NCBI Assembly: ncbi.nlm.nih.gov/assembly
Search: "Ovis aries"[Organism]
Find the current reference genome
Note the assembly accession (GCF_…)
Task 3: Explore the VCF data context
Our sheep VCF contains SNP data. Use bcftools to examine it:
# How many chromosomes have variants?
bcftools view -H trining_datasets/sheep.snp.vcf.gz | cut -f1 | sort -u | wc -l
# What chromosomes are represented?
bcftools view -H trining_datasets/sheep.snp.vcf.gz | cut -f1 | sort -u
# How many samples were sequenced?
bcftools query -l trining_datasets/sheep.snp.vcf.gz | wc -l
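A natural follow-up is to count variants per chromosome rather than only listing the names. The sketch below reads uncompressed VCF text from standard input, so it can be tested on a toy example and then fed from `bcftools view`; `variants_per_chrom` is a hypothetical helper name.

```shell
# Count records per chromosome from VCF text on stdin.
# Header lines start with '#' and are skipped.
variants_per_chrom() {
    awk -F'\t' '!/^#/ { n[$1]++ } END { for (c in n) print c "\t" n[c] }' | sort
}

# Toy VCF with two variants on chromosome 1 and one on chromosome 2
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n1\t100\t.\tA\tG\n1\t200\t.\tC\tT\n2\t50\t.\tG\tA\n' |
    variants_per_chrom

# With the training data (requires bcftools, not run here):
# bcftools view trining_datasets/sheep.snp.vcf.gz | variants_per_chrom
```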
10. Summary#
In this session, we covered:
NCBI Databases: GenBank, RefSeq, SRA, Gene, Assembly
Search Strategies: Using filters and Boolean operators
Downloading Data: Web interface and command-line tools
UCSC Genome Browser: Navigation and visualisation
Table Browser: Exporting annotations and sequences
BLAT: Quick sequence mapping
Quick Reference#
| Task | Resource | Tool/Method |
|---|---|---|
| Download reference genome | NCBI | datasets, wget |
| Download raw reads | SRA | fasterq-dump |
| View genomic region | UCSC | Genome Browser |
| Export annotations | UCSC | Table Browser |
| Map sequence to genome | UCSC | BLAT |
Next Session#
In the next session, we will learn how to organise bioinformatics projects for reproducibility and collaboration.