Navigating NCBI and UCSC Genome Databases#

Learning Objectives#

By the end of this session, you will be able to:

Navigate the NCBI database suite to find genomic resources
Search for and download reference genomes and annotations
Use the UCSC Genome Browser to visualise genomic regions
Download public sequencing data from the SRA
Retrieve gene and protein information programmatically

1. Introduction to Biological Databases#

Biological databases are essential repositories that store and organise vast amounts of genomic, proteomic, and other biological data. Two of the most widely used resources are:

NCBI (National Center for Biotechnology Information): A comprehensive collection of databases including GenBank, RefSeq, SRA, and more
UCSC Genome Browser: A powerful visualisation tool and database for genome annotations

Why Use These Databases?#

Task	Database
Download reference genomes	NCBI RefSeq, UCSC
Find gene sequences	NCBI Gene, UCSC
Access raw sequencing data	NCBI SRA
Visualise genomic regions	UCSC Genome Browser
Find protein sequences	NCBI Protein, UniProt
Identify genetic variants	dbSNP, ClinVar

2. NCBI Database Overview#

NCBI hosts over 40 interconnected databases. Here are the most relevant for bioinformatics:

Core NCBI Databases#

Database	URL	Purpose
GenBank	ncbi.nlm.nih.gov/genbank	Primary nucleotide sequence repository
RefSeq	ncbi.nlm.nih.gov/refseq	Curated reference sequences
SRA	ncbi.nlm.nih.gov/sra	Sequence Read Archive (raw data)
Gene	ncbi.nlm.nih.gov/gene	Gene-centred information
Assembly	ncbi.nlm.nih.gov/assembly	Genome assemblies
Taxonomy	ncbi.nlm.nih.gov/taxonomy	Organism classification

GenBank vs RefSeq#

Understanding the difference is crucial:

Feature	GenBank	RefSeq
Submission	Anyone can submit	NCBI curated
Redundancy	Contains duplicates	Non-redundant
Quality	Variable	High quality, reviewed
Accession prefix	Various (e.g., AB, AY)	NM_, NR_, XM_, NC_
Best for	All available sequences	Reference analyses
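The accession prefix alone is often enough to tell what kind of record you are looking at. The following is a minimal sketch in POSIX shell that covers only the four RefSeq prefixes from the table. `refseq_type` is a hypothetical helper name, and the accessions passed to it are made-up placeholders rather than real records.

```shell
# Classify an accession by its prefix.
# Only the four RefSeq prefixes from the table are recognised;
# anything else is reported as "other".
refseq_type() {
    case "$1" in
        NM_*) echo "curated mRNA" ;;
        NR_*) echo "curated non-coding RNA" ;;
        XM_*) echo "predicted mRNA (model)" ;;
        NC_*) echo "genomic molecule (e.g. chromosome)" ;;
        *)    echo "other (possibly GenBank)" ;;
    esac
}

# Placeholder accessions, for illustration only
refseq_type NM_123456.1
refseq_type XM_123456.1
refseq_type AB123456.1
```

A `case` statement keeps the mapping readable and easy to extend with further prefixes if you need them.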

3. Searching NCBI#

Using the Web Interface#

Step 1: Go to ncbi.nlm.nih.gov

Step 2: Select the appropriate database from the dropdown menu

Step 3: Enter your search query

Search Query Examples#

Finding a Reference Genome (Assembly Database):

Search: "Ovis aries"[Organism] AND "reference genome"[Filter]

Finding a Gene (Gene Database):

Search: MSTN[Gene Name] AND "Ovis aries"[Organism]

Finding Sequencing Data (SRA Database):

Search: "RNA-Seq"[Strategy] AND "Ovis aries"[Organism] AND "liver"[All Fields]

Using Search Filters#

NCBI supports advanced search syntax:

Syntax	Example	Description
`[Organism]`	"Ovis aries"[Organism]	Filter by species
`[Gene Name]`	MSTN[Gene Name]	Search specific gene
`[Title]`	muscle[Title]	Search in title field
`AND`, `OR`, `NOT`	sheep AND muscle NOT fat	Boolean operators
`[Filter]`	refseq[Filter]	Apply specific filters
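The same query syntax can be sent to NCBI programmatically through the E-utilities `esearch` endpoint. The sketch below only builds the request URL, so it runs without network access. `esearch_url` is a hypothetical helper, and its encoder handles just the characters that appear in this session's queries (percent sign, space, double quote, square brackets), so treat it as an illustration rather than a general URL encoder.

```shell
# Build an E-utilities esearch URL from a database name and a query term.
# Runs in a subshell so the helper variables do not leak into your session.
esearch_url() (
    db=$1
    # Encode % first so the escapes added afterwards are not re-encoded.
    term=$(printf '%s' "$2" | sed \
        -e 's/%/%25/g' \
        -e 's/ /+/g' \
        -e 's/"/%22/g' \
        -e 's/\[/%5B/g' \
        -e 's/]/%5D/g')
    printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=%s&term=%s\n' "$db" "$term"
)

# Print the URL for the gene search used earlier in this section
esearch_url gene 'MSTN[Gene Name] AND "Ovis aries"[Organism]'

# To actually run the search (needs network access):
#   curl -s "$(esearch_url gene 'MSTN[Gene Name] AND "Ovis aries"[Organism]')"
```

The response is XML containing the matching record IDs, which can then be passed to other E-utilities calls to fetch the records themselves.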

4. Downloading Reference Genomes#

From NCBI Assembly Database#

Example: Downloading the Sheep Reference Genome (Oar_v4.0)

Go to ncbi.nlm.nih.gov/assembly
Search: "Ovis aries"[Organism] AND "reference genome"[Filter]
Click on the assembly (e.g., Oar_v4.0)
Click “Download Assembly” button
Select file types:
- *_genomic.fna.gz - FASTA sequence
- *_genomic.gff.gz - Gene annotations
- *_genomic.gtf.gz - GTF annotations

Using Command Line (wget/curl)#

# Download sheep reference genome
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/298/735/GCF_000298735.2_Oar_v4.0/GCF_000298735.2_Oar_v4.0_genomic.fna.gz

# Download corresponding annotation
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/298/735/GCF_000298735.2_Oar_v4.0/GCF_000298735.2_Oar_v4.0_genomic.gff.gz

# Decompress
gunzip GCF_000298735.2_Oar_v4.0_genomic.fna.gz
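The long FTP paths in the wget commands above follow a regular pattern: the nine digits of the assembly accession are split into three groups of three. The sketch below rebuilds the directory URL from an accession and an assembly name. The layout is inferred from the URL used in this section, and `assembly_ftp_dir` is a hypothetical helper, so check the result in a browser before relying on it for other assemblies.

```shell
# Derive the NCBI FTP directory for an assembly from its accession and name.
# Runs in a subshell so the helper variables stay local.
assembly_ftp_dir() (
    acc=$1                      # e.g. GCF_000298735.2
    name=$2                     # e.g. Oar_v4.0
    prefix=${acc%%_*}           # GCF
    digits=${acc#*_}            # 000298735.2
    digits=${digits%%.*}        # 000298735
    p1=$(printf '%s' "$digits" | cut -c1-3)
    p2=$(printf '%s' "$digits" | cut -c4-6)
    p3=$(printf '%s' "$digits" | cut -c7-9)
    printf 'https://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s/%s/%s_%s\n' \
        "$prefix" "$p1" "$p2" "$p3" "$acc" "$name"
)

dir=$(assembly_ftp_dir GCF_000298735.2 Oar_v4.0)
echo "$dir"

# The individual files share the directory name as a prefix, for example:
echo "$dir/GCF_000298735.2_Oar_v4.0_genomic.fna.gz"
```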

Using NCBI Datasets Tool#

NCBI provides a command-line tool called datasets for easier downloads:

# Install NCBI datasets (conda)
conda install -c conda-forge ncbi-datasets-cli

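# Download genome by accession (saved as ncbi_dataset.zip by default)
datasets download genome accession GCF_000298735.2 --include genome,gff3

# Unpack the archive; the files land under ncbi_dataset/data/
unzip ncbi_dataset.zip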

# Download by organism name
datasets download genome taxon "Ovis aries" --reference

5. Accessing the Sequence Read Archive (SRA)#

The SRA stores raw sequencing data from published studies. This is invaluable for:

Replicating published analyses
Meta-analyses across studies
Training and testing pipelines

Understanding SRA Accession Numbers#

Prefix	Level	Example	Description
SRP/ERP/DRP	Study	SRP012345	Entire project
SRS/ERS/DRS	Sample	SRS123456	Biological sample
SRX/ERX/DRX	Experiment	SRX123456	Library/experiment
SRR/ERR/DRR	Run	SRR1234567	Actual sequencing run

Our training FASTQ file comes from SRR10532784. The SRR prefix identifies it as an SRA run accession.
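Because the level is encoded in the third letter of the prefix, it can be checked with a simple pattern match. This is a minimal sketch based only on the prefixes in the table above; `sra_level` is a hypothetical helper name.

```shell
# Report the SRA hierarchy level of an accession from its prefix.
# The first letter (S, E or D) varies; the third letter gives the level.
sra_level() {
    case "$1" in
        [SED]RP*) echo "study" ;;
        [SED]RS*) echo "sample" ;;
        [SED]RX*) echo "experiment" ;;
        [SED]RR*) echo "run" ;;
        *)        echo "unknown" ;;
    esac
}

sra_level SRR10532784   # prints: run
sra_level SRP012345     # prints: study
```

A check like this is handy at the top of a download script, since tools such as fasterq-dump are normally given run accessions.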

Finding Data in SRA#

Example: Finding RNA-Seq data for sheep liver

Go to ncbi.nlm.nih.gov/sra
Search: "Ovis aries"[Organism] AND "RNA-Seq"[Strategy] AND liver[All Fields]
Filter by:
- Source: TRANSCRIPTOMIC
- Platform: ILLUMINA
- Access: Public

Downloading SRA Data#

# Install SRA Toolkit
conda install -c bioconda sra-tools

# Download and convert to FASTQ (recommended method)
fasterq-dump SRR10532784

# For paired-end data, this creates:
# SRR10532784_1.fastq (forward reads)
# SRR10532784_2.fastq (reverse reads)

# Download multiple runs, one at a time
for run in SRR10532784 SRR10532785 SRR10532786; do
    fasterq-dump "$run"
done

# Compress the output
gzip *.fastq
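After downloading paired-end data, it is worth confirming that the forward and reverse files contain the same number of reads before compressing them. The sketch below assumes standard four-line FASTQ records and uncompressed files. `count_reads` is a hypothetical helper, and the two tiny files are made-up stand-ins for the `_1.fastq` and `_2.fastq` outputs.

```shell
# Count reads in an uncompressed FASTQ file (assumes 4 lines per read).
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Two tiny made-up files standing in for <run>_1.fastq and <run>_2.fastq
demo_dir=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' > "$demo_dir/demo_1.fastq"
printf '@r1/2\nCCGT\n+\nIIII\n@r2/2\nGGAA\n+\nIIII\n' > "$demo_dir/demo_2.fastq"

n1=$(count_reads "$demo_dir/demo_1.fastq")
n2=$(count_reads "$demo_dir/demo_2.fastq")

if [ "$n1" -eq "$n2" ]; then
    echo "OK: $n1 read pairs"
else
    echo "Mismatch: $n1 forward vs $n2 reverse reads"
fi

rm -r "$demo_dir"
```

A mismatch usually points to an interrupted download or conversion, in which case the run should be fetched again.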

6. UCSC Genome Browser#

The UCSC Genome Browser (genome.ucsc.edu) is a powerful tool for visualising genomic data and downloading annotations.

Key Features#

Interactive genome visualisation
Multiple annotation tracks
Custom track upload
Table Browser for data export
BLAT sequence search

Navigating to a Region#

Example: Viewing the MSTN (myostatin) gene in sheep

Go to genome.ucsc.edu
Click “Genome Browser” or “Genomes”
Select assembly: Sheep (oviAri4)
In the search box, enter: MSTN
Click “Go”

Position Format#

The UCSC search box accepts a position in any of these forms:

chr2:118,171,687-118,180,018    # Chromosome 2, specific coordinates
chr2:118171687-118180018        # Without commas
MSTN                             # Gene name
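When coordinates copied from the browser are used in scripts, the commas have to be removed and the string split into its parts. The following sketch does that with POSIX parameter expansion and also reports the region length. `parse_position` is a hypothetical helper, and it assumes the input is a coordinate string rather than a gene name.

```shell
# Split a UCSC position string into chromosome, start, end and length.
# Runs in a subshell so the helper variables stay local.
parse_position() (
    pos=$(printf '%s' "$1" | tr -d ',')   # drop thousands separators
    chrom=${pos%%:*}                      # text before the colon
    range=${pos#*:}                       # text after the colon
    start=${range%%-*}
    end=${range#*-}
    # Both ends are included in the region, hence the +1
    length=$(( end - start + 1 ))
    printf '%s\t%s\t%s\t%s\n' "$chrom" "$start" "$end" "$length"
)

parse_position 'chr2:118,171,687-118,180,018'
parse_position 'chr2:118171687-118180018'
```

Both calls print the same tab-separated line, showing that the two coordinate forms are equivalent.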

7. UCSC Table Browser#

The Table Browser allows you to download annotations and sequences in various formats.

Accessing the Table Browser#

Go to genome.ucsc.edu/cgi-bin/hgTables
Select your assembly and track
Define region and output format
Download

Download Examples#

Example: Download all RefSeq genes as a BED file

Assembly: Sheep oviAri4
Group: Genes and Gene Predictions
Track: NCBI RefSeq
Table: refGene
Region: genome
Output format: BED
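One detail to keep in mind when working with BED output: BED start coordinates are 0-based and the end is exclusive, whereas positions shown in the Genome Browser search box are 1-based and inclusive. The sketch below converts a browser position into a three-column BED line. `position_to_bed` is a hypothetical helper written for this illustration.

```shell
# Convert a 1-based, inclusive browser position into a BED3 line
# (0-based start, exclusive end). Runs in a subshell to keep variables local.
position_to_bed() (
    pos=$(printf '%s' "$1" | tr -d ',')
    chrom=${pos%%:*}
    range=${pos#*:}
    start=${range%%-*}
    end=${range#*-}
    # Only the start shifts by one; the end value stays the same
    printf '%s\t%s\t%s\n' "$chrom" "$(( start - 1 ))" "$end"
)

position_to_bed 'chr2:118,171,687-118,180,018'
```

With this convention the region length is simply end minus start, which is one reason many command-line tools use BED-style coordinates.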

Command Line Downloads#

# Download chromosome sizes for sheep
wget https://hgdownload.soe.ucsc.edu/goldenPath/oviAri4/bigZips/oviAri4.chrom.sizes

# Download 2bit genome file (compact format)
wget https://hgdownload.soe.ucsc.edu/goldenPath/oviAri4/bigZips/oviAri4.2bit
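A chrom.sizes file is a two-column, tab-separated table of sequence names and lengths, which makes it easy to summarise with standard tools. The sketch below uses a toy file with made-up sizes so that it runs without a download; `total_bases` is a hypothetical helper, and the same commands should work on the real oviAri4.chrom.sizes.

```shell
# Sum the second column (sequence length) of a chrom.sizes file.
# printf "%.0f" avoids scientific notation for genome-scale totals.
total_bases() {
    awk -F '\t' '{ total += $2 } END { printf "%.0f\n", total }' "$1"
}

# Toy file with made-up sizes, same layout as a real chrom.sizes file
sizes=$(mktemp)
printf 'chrA\t1000\nchrB\t250\nchrC\t50\n' > "$sizes"

# Total length across all sequences
genome_size=$(total_bases "$sizes")
echo "Total bases: $genome_size"

# Two longest sequences, largest first
sort -k2,2nr "$sizes" | head -n 2

rm "$sizes"
```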

8. BLAT Sequence Search#

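BLAT (BLAST-Like Alignment Tool) quickly maps sequences to a genome. It keeps an index of the whole genome in memory, so it is faster than BLAST at locating sequences that are nearly identical to the reference, such as primers or transcripts from the same species.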

Using BLAT#

Go to genome.ucsc.edu/cgi-bin/hgBlat
Select assembly
Paste your sequence
Click “Submit”

Example: Finding a Primer Location#

You have designed a PCR primer and want to verify its location:

Primer sequence: ATGCGATCGATCGATCGATCG

BLAT will show:

Chromosome and coordinates
Strand orientation
Alignment score
Number of mismatches
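A hit on the minus strand means that the reverse complement of your query matches the forward strand of the reference. To see what that sequence looks like, the primer can be reverse-complemented on the command line. This is a minimal sketch that handles only the bases A, C, G and T; `revcomp` is a hypothetical helper name.

```shell
# Reverse-complement a DNA sequence (A, C, G, T only; case is preserved).
revcomp() {
    printf '%s\n' "$1" |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }' |
        tr 'ACGTacgt' 'TGCAtgca'
}

# The example primer from this section
revcomp ATGCGATCGATCGATCGATCG   # prints: CGATCGATCGATCGATCGCAT
```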

When to Use BLAT vs BLAST#

Use BLAT	Use BLAST
Finding known sequence in genome	Finding similar sequences
Mapping primers or probes	Identifying homologues
Quick lookups	Sensitive searches
Same species	Cross-species

9. Hands-On Exercise#

Exercise: Tracing the Origin of Our Training Data#

Our training datasets include sheep genomic data. Let’s trace where they came from:

Task 1: Find information about SRR10532784

Go to NCBI SRA: ncbi.nlm.nih.gov/sra
Search: SRR10532784
Answer these questions:
- What organism is this from?
- What type of sequencing (RNA-Seq, WGS, etc.)?
- What tissue/sample type?
- What sequencing platform was used?

Task 2: Find the sheep reference genome

Go to NCBI Assembly: ncbi.nlm.nih.gov/assembly
Search: "Ovis aries"[Organism]
Find the current reference genome
Note the assembly accession (GCF_…)

Task 3: Explore the VCF data context

Our sheep VCF contains SNP data. Use bcftools to examine it:

# How many chromosomes have variants?
bcftools view -H trining_datasets/sheep.snp.vcf.gz | cut -f1 | sort -u | wc -l

# What chromosomes are represented?
bcftools view -H trining_datasets/sheep.snp.vcf.gz | cut -f1 | sort -u

# How many samples were sequenced?
bcftools query -l trining_datasets/sheep.snp.vcf.gz | wc -l
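As an optional extension to Task 3, the per-chromosome variant counts can be tabulated with standard text tools. The sketch below reads VCF text from standard input, so it can take the output of `bcftools view` through a pipe. It is demonstrated on a tiny made-up VCF so that it runs without bcftools, and `vcf_chrom_counts` is a hypothetical helper name.

```shell
# Count variant records per chromosome from VCF text on standard input.
# Header lines (starting with #) are skipped; output is "chromosome<TAB>count".
vcf_chrom_counts() {
    grep -v '^#' | cut -f1 | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Tiny made-up VCF: two variants on chromosome 1, one on chromosome 2
toy_vcf() {
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '1\t100\t.\tA\tG\t.\t.\t.\n'
    printf '1\t200\t.\tC\tT\t.\t.\t.\n'
    printf '2\t50\t.\tG\tA\t.\t.\t.\n'
}

toy_vcf | vcf_chrom_counts

# With a real compressed VCF (needs bcftools), the same helper can be used as:
#   bcftools view -H your_file.vcf.gz | vcf_chrom_counts
```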

10. Summary#

In this session, we covered:

NCBI Databases: GenBank, RefSeq, SRA, Gene, Assembly
Search Strategies: Using filters and Boolean operators
Downloading Data: Web interface and command-line tools
UCSC Genome Browser: Navigation and visualisation
Table Browser: Exporting annotations and sequences
BLAT: Quick sequence mapping

Quick Reference#

Task	Resource	Tool/Method
Download reference genome	NCBI	datasets, wget
Download raw reads	SRA	fasterq-dump
View genomic region	UCSC	Genome Browser
Export annotations	UCSC	Table Browser
Map sequence to genome	UCSC	BLAT

Next Session#

In the next session, we will learn how to organise bioinformatics projects for reproducibility and collaboration.
