Navigating NCBI and UCSC Genome Databases

Learning Objectives

By the end of this session, you will be able to:

  1. Navigate the NCBI database suite to find genomic resources

  2. Search for and download reference genomes and annotations

  3. Use the UCSC Genome Browser to visualise genomic regions

  4. Download public sequencing data from the SRA

  5. Retrieve gene and protein information programmatically

1. Introduction to Biological Databases

Biological databases are essential repositories that store and organise vast amounts of genomic, proteomic, and other biological data. The two most widely used resources are:

  • NCBI (National Center for Biotechnology Information): A comprehensive collection of databases including GenBank, RefSeq, SRA, and more

  • UCSC Genome Browser: A powerful visualisation tool and database for genome annotations

Why Use These Databases?

| Task | Database |
| --- | --- |
| Download reference genomes | NCBI RefSeq, UCSC |
| Find gene sequences | NCBI Gene, UCSC |
| Access raw sequencing data | NCBI SRA |
| Visualise genomic regions | UCSC Genome Browser |
| Find protein sequences | NCBI Protein, UniProt |
| Identify genetic variants | dbSNP, ClinVar |

2. NCBI Database Overview

NCBI hosts over 40 interconnected databases. Here are the most relevant for bioinformatics:

Core NCBI Databases

| Database | URL | Purpose |
| --- | --- | --- |
| GenBank | ncbi.nlm.nih.gov/genbank | Primary nucleotide sequence repository |
| RefSeq | ncbi.nlm.nih.gov/refseq | Curated reference sequences |
| SRA | ncbi.nlm.nih.gov/sra | Sequence Read Archive (raw data) |
| Gene | ncbi.nlm.nih.gov/gene | Gene-centred information |
| Assembly | ncbi.nlm.nih.gov/assembly | Genome assemblies |
| Taxonomy | ncbi.nlm.nih.gov/taxonomy | Organism classification |

GenBank vs RefSeq

Understanding the difference is crucial:

| Feature | GenBank | RefSeq |
| --- | --- | --- |
| Submission | Anyone can submit | NCBI curated |
| Redundancy | Contains duplicates | Non-redundant |
| Quality | Variable | High quality, reviewed |
| Accession prefix | Various (e.g., AB, AY) | NM_, NR_, XM_, NC_ |
| Best for | All available sequences | Reference analyses |

3. Searching NCBI

Using the Web Interface

Step 1: Go to ncbi.nlm.nih.gov

Step 2: Select the appropriate database from the dropdown menu

Step 3: Enter your search query

Search Query Examples

Finding a Reference Genome (Assembly Database):

Search: "Ovis aries"[Organism] AND "reference genome"[Filter]

Finding a Gene (Gene Database):

Search: MSTN[Gene Name] AND "Ovis aries"[Organism]

Finding Sequencing Data (SRA Database):

Search: "RNA-Seq"[Strategy] AND "Ovis aries"[Organism] AND "liver"[All Fields]

Using Search Filters

NCBI supports advanced search syntax:

| Syntax | Example | Description |
| --- | --- | --- |
| [Organism] | "Ovis aries"[Organism] | Filter by species |
| [Gene Name] | MSTN[Gene Name] | Search specific gene |
| [Title] | muscle[Title] | Search in title field |
| AND, OR, NOT | sheep AND muscle NOT fat | Boolean operators |
| [Filter] | refseq[Filter] | Apply specific filters |
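The same fielded queries can be assembled in a script and sent to NCBI's E-utilities, which is one way to retrieve results programmatically. The sketch below is illustrative only: `build_ncbi_query` is a hypothetical helper name, and the `curl` call (commented out because it needs network access) assumes the public `esearch.fcgi` endpoint with its `db`, `term`, and `retmax` parameters.

```shell
# Hypothetical helper: join fielded search terms with the AND operator.
build_ncbi_query() {
    query=""
    for term in "$@"; do
        if [ -z "$query" ]; then
            query=$term
        else
            query="$query AND $term"
        fi
    done
    printf '%s\n' "$query"
}

# Rebuild the SRA example query from the fielded terms shown above.
sra_query=$(build_ncbi_query '"Ovis aries"[Organism]' '"RNA-Seq"[Strategy]' 'liver[All Fields]')
echo "$sra_query"

# Send the query to E-utilities (requires network access, so left commented out).
# curl -G "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi" \
#     --data-urlencode "db=sra" \
#     --data-urlencode "term=$sra_query" \
#     --data-urlencode "retmax=5"
```

Letting `curl` URL-encode each parameter avoids hand-escaping the quotes, spaces, and brackets in the query.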

4. Downloading Reference Genomes

From NCBI Assembly Database

Example: Downloading the Sheep Reference Genome (Oar_v4.0)

  1. Go to ncbi.nlm.nih.gov/assembly

  2. Search: "Ovis aries"[Organism] AND "reference genome"[Filter]

  3. Click on the assembly (e.g., Oar_v4.0)

  4. Click “Download Assembly” button

  5. Select file types:

    • *_genomic.fna.gz - FASTA sequence

    • *_genomic.gff.gz - Gene annotations

    • *_genomic.gtf.gz - GTF annotations

Using Command Line (wget/curl)

# Download sheep reference genome
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/298/735/GCF_000298735.2_Oar_v4.0/GCF_000298735.2_Oar_v4.0_genomic.fna.gz

# Download corresponding annotation
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/298/735/GCF_000298735.2_Oar_v4.0/GCF_000298735.2_Oar_v4.0_genomic.gff.gz

# Decompress
gunzip GCF_000298735.2_Oar_v4.0_genomic.fna.gz
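The download URLs above follow a regular pattern: the nine digits of the assembly accession are split into three directory levels, followed by a folder named after the accession and assembly name. As a minimal sketch, assuming that pattern holds for the assembly you need, a hypothetical helper `assembly_url` could build the URL instead of pasting it by hand:

```shell
# Hypothetical helper: build the download URL for an assembly file from its
# accession (e.g. GCF_000298735.2) and assembly name (e.g. Oar_v4.0).
# The optional third argument is the file suffix (default: genomic.fna.gz).
assembly_url() {
    accession=$1
    asm_name=$2
    suffix=${3:-genomic.fna.gz}
    prefix=${accession%%_*}      # GCF or GCA
    digits=${accession#*_}       # e.g. 000298735.2
    digits=${digits%%.*}         # e.g. 000298735
    part1=$(printf '%s' "$digits" | cut -c1-3)
    part2=$(printf '%s' "$digits" | cut -c4-6)
    part3=$(printf '%s' "$digits" | cut -c7-9)
    base="https://ftp.ncbi.nlm.nih.gov/genomes/all"
    dir="${accession}_${asm_name}"
    printf '%s/%s/%s/%s/%s/%s/%s_%s\n' \
        "$base" "$prefix" "$part1" "$part2" "$part3" "$dir" "$dir" "$suffix"
}

# Print the FASTA and GFF URLs for the sheep assembly.
assembly_url GCF_000298735.2 Oar_v4.0
assembly_url GCF_000298735.2 Oar_v4.0 genomic.gff.gz

# To download (requires network access):
# wget "$(assembly_url GCF_000298735.2 Oar_v4.0)"
```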

Using NCBI Datasets Tool

NCBI provides a command-line tool called datasets for easier downloads:

# Install NCBI datasets (conda)
conda install -c conda-forge ncbi-datasets-cli

# Download genome by accession
datasets download genome accession GCF_000298735.2 --include genome,gff3

# Download by organism name
datasets download genome taxon "Ovis aries" --reference
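The datasets tool delivers its results as a zip archive rather than as loose files. The sketch below assumes the default archive name `ncbi_dataset.zip` and an internal layout of `ncbi_dataset/data/<accession>/`; check both against your own download. `list_dataset_files` is a hypothetical helper, and `sheep_genome` is an arbitrary output folder.

```shell
# Hypothetical helper: list FASTA and GFF files inside an unpacked datasets archive.
# $1 is the directory the archive was extracted into.
list_dataset_files() {
    find "$1/ncbi_dataset/data" -type f \( -name '*.fna' -o -name '*.gff' \) | sort
}

# Typical use after a download (assumes ncbi_dataset.zip is in the current directory):
# unzip -q ncbi_dataset.zip -d sheep_genome
# list_dataset_files sheep_genome
```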

5. Accessing the Sequence Read Archive (SRA)

The SRA stores raw sequencing data from published studies. This is invaluable for:

  • Replicating published analyses

  • Meta-analyses across studies

  • Training and testing pipelines

Understanding SRA Accession Numbers

| Prefix | Level | Example | Description |
| --- | --- | --- | --- |
| SRP/ERP/DRP | Study | SRP012345 | Entire project |
| SRS/ERS/DRS | Sample | SRS123456 | Biological sample |
| SRX/ERX/DRX | Experiment | SRX123456 | Library/experiment |
| SRR/ERR/DRR | Run | SRR1234567 | Actual sequencing run |

Our training FASTQ file is named after its SRA run accession, SRR10532784.
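When a script receives a mixed list of accessions, the prefix alone is enough to tell which level each one refers to. The following is a small sketch of that lookup; `sra_level` is a hypothetical helper, not part of the SRA Toolkit.

```shell
# Hypothetical helper: report the SRA level encoded in an accession prefix.
# The first letter (S, E, or D) varies; the third letter gives the level.
sra_level() {
    case $1 in
        [SED]RP[0-9]*) echo "study" ;;
        [SED]RS[0-9]*) echo "sample" ;;
        [SED]RX[0-9]*) echo "experiment" ;;
        [SED]RR[0-9]*) echo "run" ;;
        *)             echo "unknown" ;;
    esac
}

# Classify the example accessions from the table.
for acc in SRP012345 SRS123456 SRX123456 SRR10532784; do
    printf '%s\t%s\n' "$acc" "$(sra_level "$acc")"
done
```

Only run-level accessions can be passed directly to `fasterq-dump`, so a check like this is a cheap guard before starting a long download.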

Finding Data in SRA

Example: Finding RNA-Seq data for sheep liver

  1. Go to ncbi.nlm.nih.gov/sra

  2. Search: "Ovis aries"[Organism] AND "RNA-Seq"[Strategy] AND liver[All Fields]

  3. Filter by:

    • Source: TRANSCRIPTOMIC

    • Platform: ILLUMINA

    • Access: Public

Downloading SRA Data

# Install SRA Toolkit
conda install -c bioconda sra-tools

# Download and convert to FASTQ (recommended method)
fasterq-dump SRR10532784

# For paired-end data, this creates:
# SRR10532784_1.fastq (forward reads)
# SRR10532784_2.fastq (reverse reads)

# Download multiple runs, one at a time
for run in SRR10532784 SRR10532785 SRR10532786; do
    fasterq-dump "$run"
done

# Compress the output
gzip *.fastq
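After a download it is worth confirming that the files are complete. One simple check, sketched below, counts the reads in each file; for paired-end data the forward and reverse counts should match. `count_fastq_reads` is a hypothetical helper and assumes the standard four-lines-per-record FASTQ layout.

```shell
# Hypothetical helper: count reads in a FASTQ file (plain or gzip-compressed).
# Assumes four lines per record, with no wrapped sequence lines.
count_fastq_reads() {
    case $1 in
        *.gz) lines=$(gzip -cd "$1" | wc -l) ;;
        *)    lines=$(wc -l < "$1") ;;
    esac
    echo $(($lines / 4))
}

# Example (assumes the compressed files from the download above exist):
# count_fastq_reads SRR10532784_1.fastq.gz
# count_fastq_reads SRR10532784_2.fastq.gz   # should equal the forward-read count
```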

6. UCSC Genome Browser

The UCSC Genome Browser (genome.ucsc.edu) is a powerful tool for visualising genomic data and downloading annotations.

Key Features

  • Interactive genome visualisation

  • Multiple annotation tracks

  • Custom track upload

  • Table Browser for data export

  • BLAT sequence search

Navigating to a Region

Example: Viewing the MSTN (myostatin) gene in sheep

  1. Go to genome.ucsc.edu

  2. Click “Genome Browser” or “Genomes”

  3. Select assembly: Sheep (oviAri4)

  4. In the search box, enter: MSTN

  5. Click “Go”

Position Format

UCSC uses specific position formats:

chr2:118,171,687-118,180,018    # Chromosome 2, specific coordinates
chr2:118171687-118180018        # Without commas
MSTN                             # Gene name
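Positions typed into the browser are 1-based and include both end coordinates, whereas BED files use a 0-based start and an exclusive end. Converting between the two therefore means subtracting 1 from the start only. The sketch below shows that conversion; `position_to_bed` is a hypothetical helper name.

```shell
# Hypothetical helper: convert a browser position (1-based, inclusive) into a
# BED line (0-based start, exclusive end). Commas in the coordinates are removed.
position_to_bed() {
    pos=$(printf '%s' "$1" | tr -d ',')
    chrom=${pos%%:*}
    range=${pos#*:}
    start=${range%%-*}
    end=${range#*-}
    printf '%s\t%s\t%s\n' "$chrom" "$(($start - 1))" "$end"
}

# Convert the example region shown above.
position_to_bed "chr2:118,171,687-118,180,018"
```

For the example region this prints `chr2`, `118171686`, and `118180018` separated by tabs; the region length is simply the BED end minus the BED start.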

7. UCSC Table Browser

The Table Browser allows you to download annotations and sequences in various formats.

Accessing the Table Browser

  1. Go to genome.ucsc.edu/cgi-bin/hgTables

  2. Select your assembly and track

  3. Define region and output format

  4. Download

Download Examples

Example 1: Download all RefSeq genes as BED file

  • Assembly: Sheep oviAri4

  • Group: Genes and Gene Predictions

  • Track: NCBI RefSeq

  • Table: refGene

  • Region: genome

  • Output format: BED

Command Line Downloads

# Download chromosome sizes for sheep
wget https://hgdownload.soe.ucsc.edu/goldenPath/oviAri4/bigZips/oviAri4.chrom.sizes

# Download 2bit genome file (compact format)
wget https://hgdownload.soe.ucsc.edu/goldenPath/oviAri4/bigZips/oviAri4.2bit
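The chrom.sizes file is a two-column, tab-separated table of sequence names and lengths, which makes it easy to summarise with standard tools. As a minimal sketch, the hypothetical helper `total_genome_size` sums the second column; the commented lines assume the file downloaded above and the UCSC `twoBitToFa` utility are available locally.

```shell
# Hypothetical helper: total assembly length from a two-column chrom.sizes file.
# "%.0f" is used so that totals above 2^31 print correctly in every awk variant.
total_genome_size() {
    awk -F '\t' '{ total += $2 } END { printf "%.0f\n", total }' "$1"
}

# Example (assumes oviAri4.chrom.sizes has been downloaded):
# total_genome_size oviAri4.chrom.sizes

# Convert the compact 2bit file to FASTA (assumes twoBitToFa is installed):
# twoBitToFa oviAri4.2bit oviAri4.fa
```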

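8. BLAT Sequence Search

BLAT (BLAST-Like Alignment Tool) quickly maps sequences to a genome. It is faster than BLAST at locating sequences that closely match the target genome, and for DNA it is designed to find matches of about 95% identity or higher over 25 bases or more.

Using BLAT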

  1. Go to genome.ucsc.edu/cgi-bin/hgBlat

  2. Select assembly

  3. Paste your sequence

  4. Click “Submit”

Example: Finding a Primer Location

You have designed a PCR primer and want to verify its location:

Primer sequence: ATGCGATCGATCGATCGATCG

BLAT will show:

  • Chromosome and coordinates

  • Strand orientation

  • Alignment score

  • Number of mismatches
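When BLAT reports a hit on the minus strand, the sequence that appears in the forward reference is the reverse complement of what you pasted. A short sketch for computing it, using the example primer above, follows; `reverse_complement` is a hypothetical helper built from the standard `rev` and `tr` utilities.

```shell
# Hypothetical helper: reverse-complement a DNA sequence (A<->T, C<->G).
# Characters other than A, C, G, T (either case) are passed through unchanged.
reverse_complement() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Reverse complement of the example primer.
reverse_complement ATGCGATCGATCGATCGATCG
```

For the example primer this prints `CGATCGATCGATCGATCGCAT`.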

When to Use BLAT vs BLAST

| Use BLAT | Use BLAST |
| --- | --- |
| Finding known sequence in genome | Finding similar sequences |
| Mapping primers or probes | Identifying homologues |
| Quick lookups | Sensitive searches |
| Same species | Cross-species |

9. Hands-On Exercise

Exercise: Tracing the Origin of Our Training Data

Our training datasets include sheep genomic data. Let’s trace where they came from:

Task 1: Find information about SRR10532784

  1. Go to NCBI SRA: ncbi.nlm.nih.gov/sra

  2. Search: SRR10532784

  3. Answer these questions:

    • What organism is this from?

    • What type of sequencing (RNA-Seq, WGS, etc.)?

    • What tissue/sample type?

    • What sequencing platform was used?

Task 2: Find the sheep reference genome

  1. Go to NCBI Assembly: ncbi.nlm.nih.gov/assembly

  2. Search: "Ovis aries"[Organism]

  3. Find the current reference genome

  4. Note the assembly accession (GCF_…)

Task 3: Explore the VCF data context

Our sheep VCF contains SNP data. Use bcftools to examine it:

# How many chromosomes have variants?
bcftools view -H training_datasets/sheep.snp.vcf.gz | cut -f1 | sort -u | wc -l

# What chromosomes are represented?
bcftools view -H training_datasets/sheep.snp.vcf.gz | cut -f1 | sort -u

# How many samples were sequenced?
bcftools query -l training_datasets/sheep.snp.vcf.gz | wc -l
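A natural follow-up is to count how many variants fall on each chromosome. The sketch below extends the same `cut` and `sort` pipeline with `uniq -c`; `count_per_chrom` is a hypothetical helper that reads header-less VCF lines on standard input, and `$vcf` stands for the path to the sheep VCF used above.

```shell
# Hypothetical helper: tally records per chromosome from header-less VCF lines.
# Prints one line per chromosome: name, a tab, then the number of records.
count_per_chrom() {
    cut -f1 | sort | uniq -c | awk '{ printf "%s\t%s\n", $2, $1 }'
}

# Example (assumes bcftools is installed and $vcf points at the sheep VCF):
# bcftools view -H "$vcf" | count_per_chrom
```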

10. Summary

In this session, we covered:

  1. NCBI Databases: GenBank, RefSeq, SRA, Gene, Assembly

  2. Search Strategies: Using filters and Boolean operators

  3. Downloading Data: Web interface and command-line tools

  4. UCSC Genome Browser: Navigation and visualisation

  5. Table Browser: Exporting annotations and sequences

  6. BLAT: Quick sequence mapping

Quick Reference

| Task | Resource | Tool/Method |
| --- | --- | --- |
| Download reference genome | NCBI | datasets, wget |
| Download raw reads | SRA | fasterq-dump |
| View genomic region | UCSC | Genome Browser |
| Export annotations | UCSC | Table Browser |
| Map sequence to genome | UCSC | BLAT |

Next Session

In the next session, we will learn how to organise bioinformatics projects for reproducibility and collaboration.

Additional Resources

  • NCBI Education Resources

  • UCSC Genome Browser User Guide

  • SRA Handbook

  • NCBI Datasets Documentation


By Talal Al-Yazeedi

© Copyright 2026, Talal Al-Yazeedi.