Genomes Download (FTP) FAQ

  1. What is the easiest way to download data for multiple genome assemblies?
  2. What is the best protocol to use to download large data sets?
  3. Why has the NCBI genomes FTP site been reorganized?
  4. What are the highlights of the redesigned FTP site?
  5. Will the content of the old FTP site go away? What is the timeline for transitioning to the new FTP site?
  6. How can I stay informed about changes to the NCBI genomes FTP site?
  7. Are all genomes available in NCBI nucleotide available on the FTP site?
  8. Are files on the FTP site updated following annotation updates?
  9. My organism of interest is available in both GenBank and RefSeq. Is the genome the same? Which one should I use?
  10. How are the new FTP directories structured?
  11. What is the file content within each specific assembly directory?
  12. How can I find the sequence and annotation of my genome of interest?
  13. Where can I find information to help me choose between the many different assemblies for a species?
  14. How can I download only the current version of each assembly?
  15. How can I download RefSeq data for all complete bacterial genomes?
  16. How can I download all genome assemblies from the Human Microbiome Project, or other project?
  17. Why was the sequence identifier format in the FASTA files changed?
  18. Why do some species directory names start with an underscore?
  19. Do you provide assembly data formatted for use by sequence read alignment pipelines?
  20. Are repetitive sequences in eukaryotic genomes masked?
  21. How do alignment programs treat the lower-case masking in genomic fasta files?
  22. How can sequence with lower-case masking be converted to unmasked sequence?
  23. How can sequence with lower-case masking be converted to sequence masked with Ns?
  24. Firefox truncates long FTP directory and file names. How can I see the full names?
  25. Do ftp://ftp.ncbi.nlm.nih.gov/ and ftp://ftp.ncbi.nih.gov/ provide the same content?
  26. Why does my FTP client not handle some FTP directories or files correctly?

  1. What is the easiest way to download data for multiple genome assemblies?

    The genome download service in the Assembly resource makes it easy to download data for multiple genomes without having to write scripts. To use the download service, run a search in Assembly, use facets to refine the set of genome assemblies of interest, open the "Download Assemblies" menu, choose the source database (GenBank or RefSeq), choose the file type, then click the Download button to start the download. An archive file will be saved to your computer that can be expanded into a folder containing the genome data files from your selections.

    For example, to download genomic FASTA sequence for all RefSeq bacterial complete genome assemblies:

    • Start with an "all[filter]" query on Assembly
    • Select "Bacteria" from the "Organism group" facet in the left-hand sidebar
    • Select "Complete genome" from the "Assembly level" facet in the left-hand sidebar
    • Click on the "Download Assemblies" button to open the download menu
    • Leave "Source database" set to RefSeq
    • Select "Genomic FASTA" from the "File type" menu
    • Wait for the "calculating size..." message to be replaced by an estimated size
    • Click Download; you may get a pop-up window asking if/where you want to save the genome_assemblies.tar archive file
    • After the download has finished, expand the tar archive
    • The resulting folder named "genome_assemblies" will contain:
      • a report.txt file that provides a summary of what was downloaded
      • a folder named like "ncbi-genomes-YYYY-MM-DD", where YYYY-MM-DD is the date of the download, containing:
        • a README.txt file
        • an md5checksums.txt file
        • many data files with names like *_genomic.fna.gz, in which the first part of the name is the assembly accession followed by the assembly name

    Simple variations on these steps can be used to obtain different file types or data for different sets of genome assemblies. If "All file types (including assembly structure directory)" is selected from the "File type" menu, the "ncbi-genomes-YYYY-MM-DD" folder will contain a folder for each of the selected genome assemblies containing all the content from the FTP directory for that assembly.

    The genome download service is best for small to moderately sized data sets. Selecting very large numbers of genome assemblies may result in a download that takes a very long time (depending on the speed of your internet connection). Scripting with rsync is the recommended way to download very large data sets (see below).

  2. What is the best protocol to use to download large data sets?

    We recommend using the rsync file transfer program from a Unix command line to download large data files because it is much more efficient than older protocols. The next best options for downloading multiple files are to use the HTTPS protocol, or the even older FTP protocol, using a command line tool such as wget or curl. Web browsers are very convenient options for downloading single files even though they will use the FTP protocol because of how our URLs are constructed. Other FTP clients are also widely available but do not all correctly handle the symbolic links used widely on the genomes FTP site (see below).

    To use rsync

    Replace the "ftp:" at the beginning of the FTP path with "rsync:". For example, if the FTP path is ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1, then the directory and its contents could be downloaded using the following rsync command:

    rsync --copy-links --recursive --times --verbose rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1 my_dir/

    A file with FTP path ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz could be downloaded using the following rsync command:

    rsync --copy-links --times --verbose rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz my_dir/

    To use HTTPS

    Replace the "ftp:" at the beginning of the FTP path with "https:". Also append a '/' to the path if it is a directory. For example, if the FTP path is ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1, then the directory and its contents could be downloaded using the following wget command:

    wget --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/ -P my_dir/

    A file with FTP path ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz could be downloaded using either of the following commands:

    wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz -P my_dir/

    curl --remote-name --remote-time https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz

    To use FTP

    Append a '/' to the path if it is a directory. For example, if the FTP path is ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1, then the directory and its contents could be downloaded using the following wget command:

    wget --recursive --no-host-directories --cut-dirs=6 ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/ -P my_dir/

    A file with FTP path ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz could be downloaded using either of the following commands:

    wget --timestamping ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz -P my_dir/

    curl --remote-name --remote-time ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1/GCF_001696305.1_UCN72.1_genomic.gbff.gz
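
    The path substitutions described above can also be scripted. The following is a minimal sketch rather than an NCBI-provided tool: the function names ftp_to_rsync and ftp_to_https are hypothetical, and because a script cannot tell from the path alone whether it names a directory, the caller passes "dir" as a second argument to have the trailing '/' appended.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of any NCBI tool).

# Print the rsync form of an FTP path: "ftp:" is replaced with "rsync:".
ftp_to_rsync() {
    printf '%s\n' "$1" | sed 's|^ftp:|rsync:|'
}

# Print the HTTPS form of an FTP path: "ftp:" is replaced with "https:".
# Pass "dir" as the second argument to append the trailing '/' that
# directory paths need; an existing trailing '/' is not duplicated.
ftp_to_https() {
    url=$(printf '%s\n' "$1" | sed 's|^ftp:|https:|')
    if [ "${2:-}" = "dir" ]; then
        url="${url%/}/"
    fi
    printf '%s\n' "$url"
}
```

    The output of either function can be passed to the rsync, wget or curl commands shown above in place of the literal URL.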

  3. Why has the NCBI genomes FTP site been reorganized?

    Historically, the genomes FTP site has been populated by different process flows and NCBI working groups leading to undesirable differences in available content and file formats. Also, data for GenBank genomes and RefSeq genomes were located in different areas of the NCBI FTP site that had different organization.

    NCBI has redesigned the genomes FTP site to expand the content and facilitate data access through an organized predictable directory hierarchy with consistent file names and formats. The updated site provides greater support for downloading assembled genome sequences and/or corresponding annotation data. The new FTP site structure provides a single entry point to access content representing either GenBank or RefSeq data.

  4. What are the highlights of the redesigned FTP site?

    The updated genomes FTP provides more uniformity across species. It offers a consistent core set of files for the genome sequence and annotation products of all organisms and assemblies in scope.

    The reorganized genomes FTP site supports download needs such as:

    • Retrieve the unmasked or soft-masked genome sequence for a specific genome assembly
    • Retrieve GenBank or RefSeq Gene, RNA and protein annotation for a specific organism and a specific assembly, or a specific RefSeq annotation release
    • Retrieve annotation in either GenBank flat-file or GFF format
    • Use matching sequence identifiers in FASTA and GFF files to facilitate RNA-Seq and other analyses
    • Confirm downloaded content is complete using provided md5checksums

  5. Will the content of the old FTP site go away? What is the timeline for transitioning to the new FTP site?

    The initial release of the redesigned genomes FTP site in August 2014 added three new directories, namely ‘genbank’, ‘refseq’, and ‘all’ to the existing ftp area – ftp://ftp.ncbi.nlm.nih.gov/genomes/. These directories provide a core set of files representing both sequence and annotation content in several formats (see below). Additional file formats will be added in future updates.

    The content of most of the old directories on the ftp://ftp.ncbi.nlm.nih.gov/genomes/ site, and the content previously at ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/, are no longer being updated. Many old directories from these two areas were moved to archival subdirectories within the /genomes/ area on 2 December 2015. More old directories will be moved to the archive in 2018. Details of which FTP directories and files were moved are as follows.

    • All directories and files from ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/ were archived to ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank
    • The following directories from ftp://ftp.ncbi.nlm.nih.gov/genomes/ were archived to ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/
      • Aedes_aegypti
      • Anopheles_gambiae
      • Arabidopsis_lyrata
      • Arabidopsis_thaliana
      • ASSEMBLY_BACTERIA
      • Bacteria
      • Bacteria_DRAFT
      • Branchiostoma_floridae
      • Caenorhabditis_elegans
      • Chloroplasts
      • CLUSTERS
      • Drosophila_melanogaster
      • Drosophila_pseudoobscura
      • Fungi
      • Medicago_truncatula
      • MITOCHONDRIA
      • Physcomitrella_patens
      • PLANTS
      • Plasmids
      • Populus_trichocarpa
      • Protozoa
      • Sorghum_bicolor
    • The file old_genomeID2nucGI from ftp://ftp.ncbi.nlm.nih.gov/genomes/ was archived to ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/
    • The IDS directory from ftp://ftp.ncbi.nlm.nih.gov/genomes/ was moved to ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/

  6. How can I stay informed about changes to the NCBI genomes FTP site?

    Subscribe to the genomes-announce mail list.

  7. Are all genomes available in NCBI nucleotide available on the FTP site?

    Genome sequence and annotation data are provided for organisms in scope for NCBI’s Assembly resource. Data are provided for both GenBank and RefSeq assembly versions. The FTP directories for the latest version in each assembly chain, and directories for many older assembly versions, include a core set of files and formats plus additional files relevant to the data content of the specific assembly. Directories for old assembly versions that predate the genomes FTP site reorganization contain only the assembly report, assembly stats & assembly status files.

  8. Are files on the FTP site updated following annotation updates?

    Yes, the FTP files for the latest version of an assembly are updated after the annotation on any of the sequences in the assembly changes.

    The FTP files for the latest version of an assembly may also be updated:

    • to make the files conform to the latest specifications for a particular data format
    • to correct errors in conversion of the primary data from the NCBI databases into the various FTP file formats

    Files for old versions of assemblies will not usually be updated; consequently, most users will want to download data only for the latest version of each assembly. For more information, see "How can I download only the current version of each assembly?".

  9. My organism of interest is available in both GenBank and RefSeq. Is the genome the same? Which one should I use?

    GenBank content includes genome assemblies that are submitted to members of the International Nucleotide Sequence Database Collaboration. GenBank submissions may or may not include annotation information which, when provided, was generated by different groups using different methods. Note that for prokaryotes, GenBank annotation may have been generated using NCBI’s prokaryotic genome annotation service. In contrast, RefSeq genomes are selected from, and are a subset of, the available GenBank genomes and annotation data is available for all RefSeq genomes, except for some viruses. RefSeq annotation content originates from NCBI’s prokaryotic, eukaryotic, organelle, or viral annotation pipelines, or is propagated from the GenBank submission.

    For some assemblies, both GenBank and RefSeq content may be available. RefSeq genomes are a copy of the submitted GenBank assembly. In some cases the assemblies are not completely identical as RefSeq has chosen to add a non-nuclear organelle unit to the assembly or to drop very small contigs or reported contaminants. Equivalent RefSeq and GenBank assemblies, whether or not they are identical, and RefSeq to GenBank sequence ID mapping, can be found in the assembly report files available on the FTP site or by download from the Assembly resource.

  10. How are the new FTP directories structured?

    The base structure of the revised genomes ftp site includes several main directory areas that provide sequence and annotation content, or report files. Sequence and annotation content is further organized by major taxonomic groupings, then by species, then by assembly. Sequence content is defined by the Assembly resource. The revised genomes FTP site provides directories for:

    • GenBank content organized by taxonomic group, then by species and assembly
    • RefSeq content organized by taxonomic group, then by species and assembly
    • all (union of GenBank and RefSeq) organized by individual assembly
    • Assembly reports
    • Genome reports

    Within the GenBank and RefSeq directories, the directory hierarchy is:

    • Taxonomic group
      • Genus_species
        • All assemblies
          • Individual assemblies
        • Latest assembly versions
          • Individual assemblies
        • RefSeq representative genomes (if any)
          • Individual assemblies
        • RefSeq reference genomes (if any)
          • Individual assemblies
        • Future: additional groupings will be added in the future. For example:
          • Annotation release data sets from NCBI’s eukaryotic annotation pipeline

    The first layer of organization consists of the following directories:

    1. genbank: content includes primary submissions of assembled genome sequence and associated annotation data, if any, as exchanged among members of the International Nucleotide Sequence Database Collaboration, of which NCBI’s GenBank database is a member. The GenBank directory area includes genome sequence data for a larger number of organisms than the RefSeq directory area; however, some assemblies are unannotated. The subdirectory structure includes:
      • archaea
      • bacteria
      • fungi
      • invertebrate
      • metagenomes
      • other – this directory is only provided for GenBank and includes submissions of synthetic genomes.
      • plant
      • protozoa
      • vertebrate_mammalian
      • vertebrate_other
      • viral
    2. refseq: content includes assembled genome sequence and RefSeq annotation data. All RefSeq genomes have annotation. RefSeq annotation data may be calculated by NCBI annotation pipelines or propagated from the GenBank submission. The RefSeq directory area includes fewer organisms than the GenBank directory area because not all genome assemblies are selected for the RefSeq project. Subdirectories include:
    3. all: content is the union of GenBank and RefSeq assemblies. The two directories under "all" are named for the accession prefix (GCA or GCF) and these directories contain another three levels of directories named for digits 1-3, 4-6 & 7-9 of the assembly accession. The next level is the data directories for individual assembly versions. 'all' contains many directories for old versions of assemblies; these are archival and will not be updated to add new file formats or to refresh the data.
    4. ASSEMBLY_REPORTS: content consists of four summary report files that include meta-data details of all the latest GenBank assemblies, all the latest RefSeq assemblies, the historical GenBank assemblies, or the historical RefSeq assemblies. These summary files provide a ftp path that can be used to retrieve the sequence and annotation data. Another file provides the expected genome assembly size range for different species as applied to submissions to GenBank.
    5. GENOME_REPORTS: content consists of summary reports of genome sequencing projects, associated annotation statistics, and some defined reference datasets within the RefSeq project. Reports are provided by the Genomes resource.
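
    The layout of the "all" directory described in item 3 can be derived from an assembly accession. The following is a minimal sketch under that description; the function name asm_parent_path is hypothetical. It prints only the path down to the three digit-named levels, because the final directory name also includes the assembly name, which cannot be derived from the accession.

```shell
#!/bin/sh
# Hypothetical helper: print the directory under "all" that contains the
# data directories for the versions of an assembly accession.
asm_parent_path() {
    acc=$1                      # e.g. GCF_001696305.1
    prefix=${acc%%_*}           # accession prefix: GCA or GCF
    digits=${acc#*_}            # drop the prefix and the underscore
    digits=${digits%%.*}        # drop the version, leaving nine digits
    d1=$(printf '%s' "$digits" | cut -c 1-3)
    d2=$(printf '%s' "$digits" | cut -c 4-6)
    d3=$(printf '%s' "$digits" | cut -c 7-9)
    printf 'genomes/all/%s/%s/%s/%s\n' "$prefix" "$d1" "$d2" "$d3"
}
```

    For example, asm_parent_path GCF_001696305.1 prints genomes/all/GCF/001/696/305, which matches the example paths used elsewhere in this FAQ.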

    Example directory hierarchy:

    The directory hierarchy for the GenBank Escherichia coli str. K-12 substr. MG1655 genome, which has the assembly accession GCA_000005845.2 and default assembly name of ‘ASM584v2’, looks like this:

    • genomes
      • genbank
        • bacteria
          • Escherichia_coli
            • all_assembly_versions
              • GCA_000005845.2_ASM584v2 – this directory layer is named using the pattern: [Assembly accession.version]_[assembly name]

    The directory hierarchy for the annotated human reference genome looks like this:

    • genomes
      • refseq
        • vertebrate_mammalian
          • Homo_sapiens
            • all_assembly_versions
            • latest_assembly_versions
            • reference
              • GCF_000001405.33_GRCh38.p7

  11. What is the file content within each specific assembly directory?

    Assembly directories for all current assemblies, and for many previous assembly versions, include a core set of files and formats plus additional files relevant to the data content of the specific assembly. Directories for old assembly versions that predate the genomes FTP site reorganization contain only the assembly report, assembly stats & assembly status files. All data files are named according to the pattern:
    [assembly accession.version]_[assembly name]_content.[format]
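
    The naming pattern can be applied mechanically when building file names in a script. The following is a minimal sketch; the function names asm_accession and asm_file_name are hypothetical, and asm_accession assumes that the accession part of a directory name contains exactly one underscore (as in GCA_ or GCF_ accessions).

```shell
#!/bin/sh
# Hypothetical helpers for the
# [assembly accession.version]_[assembly name] naming pattern.

# Print the accession.version part of an assembly directory name.
# Assumes the accession itself contains exactly one underscore.
asm_accession() {
    printf '%s\n' "$1" | cut -d '_' -f 1,2
}

# Print a data file name from an assembly directory name, a content
# type (e.g. genomic) and a format (e.g. gbff.gz).
asm_file_name() {
    printf '%s_%s.%s\n' "$1" "$2" "$3"
}
```

    For example, asm_file_name GCF_001696305.1_UCN72.1 genomic gbff.gz prints GCF_001696305.1_UCN72.1_genomic.gbff.gz.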

    assembly_status.txt

    A text file reporting the current status of this version of the assembly ("latest", "replaced", or "suppressed"). Any assembly anomalies are also reported.

    *_assembly_report.txt

    Tab-delimited text file reporting the name, role and sequence accession.version for objects in the assembly. The file header contains meta-data for the assembly including: assembly name, assembly accession.version, scientific name of the organism and its taxonomy ID, assembly submitter, and sequence release date.

    *_assembly_stats.txt

    Tab-delimited text file reporting statistics for the assembly including: total length, ungapped length, contig & scaffold counts, contig-N50, scaffold-L50, scaffold-N50, scaffold-N75 & scaffold-N90.

    *_assembly_regions.txt

    Provided for assemblies that include alternate or patch assembly units. Tab-delimited text file reporting the location of genomic regions and the alt/patch scaffolds placed within those regions.

    *_assembly_structure directory

    Contains AGP files that define how component sequences are organized into scaffolds and/or chromosomes. Other files define how scaffolds and chromosomes are organized into non-nuclear and other assembly-units, and how any alternate or patch scaffolds are placed relative to the chromosomes. Only present if the assembly has internal structure.

    *_cds_from_genomic.fna.gz

    FASTA format of the nucleotide sequences corresponding to all CDS features annotated on the assembly, based on the genome sequence.

    *_feature_count.txt.gz

    Tab-delimited text file reporting counts of gene, RNA, CDS, and similar features, based on data reported in the *_feature_table.txt.gz file.

    *_feature_table.txt.gz

    Tab-delimited text file reporting locations and attributes for a subset of annotated features. Included feature types are: gene, CDS, RNA (all types), operon, C/V/N/S_region, and V/D/J_segment. Replaces the .ptt & .rnt format files that were provided in the old genomes FTP directories.

    *_genomic.fna.gz

    FASTA format of the genomic sequence(s) in the assembly. Repetitive sequences in eukaryotes are masked to lower-case. The genomic.fna.gz file includes all top-level sequences in the assembly (chromosomes, plasmids, organelles, unlocalized scaffolds, unplaced scaffolds, and any alternate loci or patch scaffolds). Scaffolds that are part of the chromosomes are not included because they are redundant with the chromosome sequences; sequences for these placed scaffolds are provided under the assembly_structure directory.

    *_genomic.gbff.gz

    GenBank flat file format of the genomic sequence(s) in the assembly. This file includes both the genomic sequence and the CONTIG description (for CON records), hence, it replaces both the .gbk & .gbs format files that were provided in the old genomes FTP directories.

    *_genomic.gff.gz

    Annotation of the genomic sequence(s) in Generic Feature Format Version 3 (GFF3). Additional information about NCBI's GFF files is available at ftp://ftp.ncbi.nlm.nih.gov/genomes/README_GFF3.txt.

    *_protein.faa.gz

    FASTA format of the accessioned protein products annotated on the genome assembly.

    *_protein.gpff.gz

    GenPept format of the accessioned protein products annotated on the genome assembly.

    *_rm.out.gz

    RepeatMasker output; provided for eukaryotes.

    *_rm.run

    Documentation of the RepeatMasker version, parameters, and library; provided for eukaryotes.

    *_rna.fna.gz

    FASTA format of accessioned RNA products annotated on the genome assembly; provided for RefSeq assemblies as relevant. (Note: RNA and mRNA products are not instantiated as separate accessioned records in GenBank; they are provided for some RefSeq genomes, most notably the eukaryotes.)

    *_rna.gbff.gz

    GenBank flat file format of RNA products annotated on the genome assembly; Provided for RefSeq assemblies as relevant.

    *_rna_from_genomic.fna.gz

    FASTA format of the nucleotide sequences corresponding to all RNA features annotated on the assembly, based on the genome sequence.

    *_translated_cds.faa.gz

    FASTA sequences of individual CDS features annotated on the genomic records, conceptually translated into protein sequence. The sequence corresponds to the translation of the nucleotide sequence provided in the *_cds_from_genomic.fna.gz file.

    *_wgsmaster.gbff.gz

    GenBank flat file format of the WGS master for the assembly (present only if a WGS master record exists for the sequences in the assembly).

    annotation_hashes.txt

    Tab-delimited text file reporting hash values for different aspects of the annotation data. The hashes are useful to monitor for when annotation has changed in a way that is significant for a particular use case and warrants downloading the updated records.

    md5checksums.txt

    File checksums are provided for all data files in the directory.
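
    The checksums can be used to confirm that downloaded files are complete. The following is a minimal sketch, assuming GNU md5sum is installed and that each line of md5checksums.txt has the form "<md5 hash>  <relative file path>"; both are assumptions, and the function name verify_downloaded is hypothetical. Files that are listed but were not downloaded are skipped, so a partial download of an assembly directory can still be checked.

```shell
#!/bin/sh
# Hypothetical helper: check every file that is listed in md5checksums.txt
# and present in the given directory. Assumes GNU md5sum and lines of the
# form "<md5 hash>  <relative file path>".
verify_downloaded() {
    (
        cd "$1" || exit 1
        status=0
        while read -r sum path; do
            [ -f "$path" ] || continue          # not downloaded, skip
            actual=$(md5sum "$path" | cut -d ' ' -f 1)
            if [ "$actual" = "$sum" ]; then
                echo "OK $path"
            else
                echo "FAILED $path"
                status=1
            fi
        done < md5checksums.txt
        exit "$status"                          # non-zero if any file failed
    )
}
```

    The function takes the downloaded assembly directory as its only argument and returns a non-zero status if any checked file does not match its listed checksum.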

  12. How can I find the sequence and annotation of my genome of interest?

    Genome assemblies of interest can be found using one of two methods.

    Using the NCBI Assembly resource

    Genome assemblies of interest can be found using the search bar, advanced search page or browse by organism table provided by the Assembly resource.

    GenBank or RefSeq data for the assembly can be obtained by following the links to the FTP site from the "Access the data" section of the right-hand sidebar.

    Using the assembly summary report files

    Download the relevant assembly summary files that report assembly meta-data.

    Search the meta-data fields, or filter the files, to find assemblies of interest (see README_assembly_summary.txt for a description of the columns).

    The field named "ftp_path" provides the path to the FTP directory containing the data for each assembly.

  13. Where can I find information to help me choose between the many different assemblies for a species?

    There can be many different genome assemblies available for species with medical, agricultural or scientific relevance. The Genus_species directories under the "genbank" and "refseq" directory trees each contain an assembly_summary.txt file that provides general information on all assembly versions included in the directory, such as release date, submitter organization, assembly level and status. See for example ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/Sulfolobus_islandicus/assembly_summary.txt

    After assemblies of interest have been identified using the data from the species-specific assembly_summary.txt file, they can be accessed via the "all_assembly_versions" directory for that species.

    Alternatively, any assemblies that the NCBI Reference Sequence (RefSeq) group has selected to be reference or representative genomes can be readily accessed via the directories named "reference" or "representative" in the Genus_species directories under the "genbank" and "refseq" directory trees.
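
    The reference and representative genomes can also be picked out of an assembly_summary.txt file directly. The following is a minimal sketch that uses the column positions given elsewhere in this FAQ (column 5 is refseq_category, column 20 is the FTP path); the function name ref_or_rep_paths is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list the FTP paths of reference or representative
# genomes in an assembly_summary.txt file.
# Column numbers follow this FAQ: 5 = refseq_category, 20 = FTP path.
ref_or_rep_paths() {
    awk -F '\t' '
        $5 == "reference genome" || $5 == "representative genome" { print $20 }
    ' "$1"
}
```

    To restrict the output to current assembly versions, the condition could be extended with a test that column 11 (version_status) is "latest".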

  14. How can I download only the current version of each assembly?

    Any changes to the sequences included in a particular assembly accession result in an increment of the assembly version, which means that an assembly accession.version (e.g. GCF_000001405.28) represents a fixed set of sequences. It also means that a particular assembly may have several versions, where only the most recent version is considered to be "latest", and earlier versions are marked as either "replaced" or "suppressed". In some cases the last version of an assembly may be "suppressed", for example if it was removed from the RefSeq collection due to changes in scope or quality concerns.

    Only FTP files for the "latest" version of an assembly are updated when annotation is updated, new file formats are added or improvements to existing formats are released. Consequently, most users will want to download data only for the latest version of each assembly. You can select data from only the latest assemblies in several ways:

    1. Use the Assembly database and select the "Latest" filter from the left sidebar, or add the term 'AND "latest"[Filter]' to your query.
    2. Use the /genbank or /refseq FTP paths to navigate to the species level directory and then select assemblies from the "latest_assembly_versions" subdirectory. See "How are the new FTP directories structured?" for more details.
    3. Use either of the two master assembly summary files, or the assembly_summary.txt file for the species or taxonomic group of interest (see above), select those assemblies that are marked as "latest" in the version_status column (11), and then use the FTP path indicated in column 20 to download the data.

  15. How can I download RefSeq data for all complete bacterial genomes?

    The easiest way to download RefSeq data for all complete bacterial genomes is to use the genome download service in the Assembly resource, as described above.

    Alternatively, the assembly summary report files provide information that can be used to identify a set of assemblies of interest along with their FTP file paths. For example, to obtain the GenBank flat file format annotation for all complete bacterial genomes in the NCBI Reference Sequences collection (RefSeq):

    1. Download the /refseq/bacteria/assembly_summary.txt file
    2. List the FTP path (column 20) for the assemblies of interest, in this case those that have "Complete Genome" assembly_level (column 12) and "latest" version_status (column 11). One way to do this would be using the following awk command:
      awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $20}' assembly_summary.txt > ftpdirpaths
    3. Append the filename of interest, in this case "*_genomic.gbff.gz" to the FTP directory names. One way to do this would be using the following awk command:
      awk 'BEGIN{FS=OFS="/";filesuffix="genomic.gbff.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' ftpdirpaths > ftpfilepaths
    4. Use a script to download the data file for each FTP path in the list

    Variants of these instructions can be used to download all draft bacterial genomes in RefSeq (assembly_level is not "Complete Genome"), all RefSeq reference or representative bacterial genomes (refseq_category (column 5) is "reference genome" or "representative genome"), etc.

    Also see the Downloading Genomic Data Factsheet.
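
    The download script in step 4 could look something like the following. This is a minimal sketch, not an NCBI-provided script: the function name fetch_list and the DRY_RUN variable are hypothetical, and it assumes that every line of the list is a file path beginning with "ftp:" (an "https:" prefix is also handled, on the assumption that some summary files may list paths in that form). It uses the rsync options recommended earlier in this FAQ for single files.

```shell
#!/bin/sh
# Hypothetical sketch of step 4: fetch every file named in a path list.
# Usage: fetch_list ftpfilepaths my_dir
# Set DRY_RUN to any non-empty value to print the commands without
# running them.
fetch_list() {
    list=$1
    dest=$2
    while IFS= read -r path; do
        [ -n "$path" ] || continue              # skip blank lines
        # Rewrite the scheme so that rsync can be used for the transfer.
        url=$(printf '%s\n' "$path" |
            sed -e 's|^ftp:|rsync:|' -e 's|^https:|rsync:|')
        if [ -n "${DRY_RUN:-}" ]; then
            echo "rsync --copy-links --times --verbose $url $dest/"
        else
            # Keep rsync from reading the list that feeds the loop.
            rsync --copy-links --times --verbose "$url" "$dest/" < /dev/null
        fi
    done < "$list"
}
```

    Running the function first with DRY_RUN set makes it possible to review the commands before starting a large transfer.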

  16. How can I download all genome assemblies from the Human Microbiome Project, or other project?

    All genome assemblies linked to a particular BioProject can be downloaded using the genome download service in the Assembly resource described above.

    The following example will download all reference genomes for the Human Microbiome Project (HMP), which has the BioProject accession PRJNA28331.

    • Search in BioProject for PRJNA28331
    • Follow the link to "Assembly" under "Related information" in the right-hand sidebar
    • Click on the "Download Assemblies" button to open the download menu
    • Select the "Source database", either GenBank or RefSeq
    • Select a "File type", e.g. "Genomic FASTA"
    • Wait for the "calculating size..." message to be replaced by an estimated size
    • Click Download; you may get a pop-up window asking if/where you want to save the genome_assemblies.tar archive file
    • After the download has finished, expand the tar archive

  17. Why was the sequence identifier format in the FASTA files changed?

    We changed the sequence identifier format in the FASTA files to make our datasets more usable by the community.

    NCBI has traditionally used a compound FASTA sequence identifier string in which multiple IDs were separated by '|' characters. This format provides more information but requires that the individual sequence identifiers be parsed out of the compound string. The FASTA files on the redesigned genomes FTP site have a simple sequence identifier string that is just the sequence accession.version, for example:
    >U00096.3 Escherichia coli str. K-12 substr. MG1655, complete genome
    >NC_000001.11 Homo sapiens chromosome 1, GRCh38 Primary Assembly

    This sequence identifier is identical to that used in the GFF annotation files on the genomes FTP site. Providing sequence and annotation files with matching sequence identifiers supports their use in commonly used RNA-Seq analysis packages and in other analysis pipelines that rely on simple string comparison to match sequence identifiers.
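
    Because the identifier is simply the first word of each FASTA header, it can be extracted and compared with the first column of a GFF file using standard tools. The following is a minimal sketch; the function names fasta_ids and gff_ids are hypothetical, and both functions read uncompressed text from the named files or from standard input.

```shell
#!/bin/sh
# Print the sequence identifier (accession.version) from each FASTA header.
fasta_ids() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$@"
}

# Print the distinct sequence identifiers found in column 1 of a GFF3 file,
# skipping comment and directive lines that start with '#'.
gff_ids() {
    awk -F '\t' '!/^#/ && NF > 1 { print $1 }' "$@" | sort -u
}
```

    The files on the FTP site are gzip-compressed, so a typical use would pipe the output of gzip -dc into either function.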

  18. Why do some species directory names start with an underscore?

    Certain symbols and punctuation marks have a special meaning to computer operating systems and can therefore cause problems if they are included in directory or file names. Examples include spaces, (, ), [, ] and '. Whenever one or more of these special characters appear in the organism name, they are replaced by underscores.

    Taxonomy places square brackets around the genus for some species to indicate that they are misclassified. The current names continue to be used with square brackets until the species has been formally renamed. The square brackets around the genus are converted to underscores when a directory name is created for one of these misclassified species, resulting in a directory name that begins with an underscore.
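
    The replacement rule can be sketched in Python as follows. This is an approximation under two assumptions: the set of special characters is limited to the examples listed above, and each special character is replaced by its own underscore. The function name directory_name is hypothetical, and the actual procedure used to build the FTP directories may differ.

```python
# Assumption: only the characters given as examples above are treated as special.
SPECIAL_CHARACTERS = " ()[]'"


def directory_name(organism_name):
    """Replace each special character in an organism name with an underscore."""
    return "".join(
        "_" if ch in SPECIAL_CHARACTERS else ch for ch in organism_name
    )
```

    Under these assumptions, a name written with a bracketed genus, such as "[Clostridium] bolteae", becomes "_Clostridium__bolteae": the opening bracket produces the leading underscore, and the closing bracket and the space each produce one more.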

  19. Do you provide assembly data formatted for use by sequence read alignment pipelines?

    Genomic FASTA with modified sequence identifiers and index files convenient for analysis with Next Generation Sequencing tools are currently provided for the Genome Reference Consortium's human and mouse assemblies: GRCh38 and GRCm38.p3. RefSeq annotation in GFF3 format with sequence identifiers matching those in the FASTA files is also provided to facilitate use in RNA-Seq analysis pipelines.

    The four analysis sets provided for GRCh38 (no_alt_analysis_set, full_analysis_set, full_plus_hs38d1_analysis_set, no_alt_plus_hs38d1_analysis_set) and the two analysis sets provided for GRCm38.p3 (no_alt_analysis_set, full_analysis_set) differ from the corresponding full assemblies by one or more of the following:

    • omission of alternate locus and patch scaffolds that cause complications for sequence read alignment programs that are not alt-aware
    • hard masking of duplicate copies of the pseudo-autosomal regions and centromeric arrays
    • addition of "decoy" sequences

    Index files generated by BWA, Samtools and Bowtie are provided. See the GRCh38 README or GRCm38 README for a full description.

  20. Are repetitive sequences in eukaryotic genomes masked?

    Repetitive sequences in eukaryotic genome assembly sequence files, as identified by WindowMasker, have been masked to lower-case.

    The location and identity of repeats found by RepeatMasker are also provided in a separate file. These spans could be used to mask the genomic sequences if desired. Be aware, however, that many less-studied organisms do not have good repeat libraries available for RepeatMasker to use.
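
    As an illustration of masking by spans, the Python sketch below lower-cases a list of intervals in a sequence held in memory. It does not parse the RepeatMasker file, whose layout is not described here; the function name mask_spans and the 0-based, half-open (start, end) convention are assumptions. If the repeat file reports 1-based inclusive coordinates, subtract 1 from each start before calling the function.

```python
def mask_spans(sequence, spans):
    """Return the sequence in upper case with the given spans in lower case.

    spans is an iterable of (start, end) pairs, 0-based and half-open.
    Any existing lower-case masking is discarded first.
    """
    chars = list(sequence.upper())
    for start, end in spans:
        # Clip each span to the sequence bounds.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)
```

    For example, mask_spans("ACGTACGTACGT", [(2, 5)]) returns "ACgtaCGTACGT".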

  21. How do alignment programs treat the lower-case masking in genomic fasta files?

    Alignment programs typically have parameters that control whether the program will ignore lower-case masking, treat it as soft-masking (i.e. only for finding initial matches) or treat it as hard-masking. The program's documentation should indicate the default behavior.

    By default, NCBI BLAST ignores lower-case masking, but this can be changed by adding options to the blastn command line.
    To have blastn treat lower-case masking in the query sequence as soft-masking add:

          -lcase_masking

    To have blastn treat lower-case masking in the query sequence as hard-masking add:

          -lcase_masking -soft_masking false
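
    When blastn is launched from a script, the three behaviors can be selected as in the Python sketch below. The function name blastn_command and the masking argument are hypothetical; the sketch only builds the argument list, using the options shown above together with the standard -query and -db options.

```python
def blastn_command(query, database, masking="ignore"):
    """Build a blastn argument list for the requested lower-case masking mode.

    masking is one of "ignore" (blastn default), "soft" or "hard".
    """
    command = ["blastn", "-query", query, "-db", database]
    if masking == "soft":
        command += ["-lcase_masking"]
    elif masking == "hard":
        command += ["-lcase_masking", "-soft_masking", "false"]
    elif masking != "ignore":
        raise ValueError("masking must be 'ignore', 'soft' or 'hard'")
    return command
```

    The resulting list can be passed to subprocess.run() on a system where blastn is installed.
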
  22. How can sequence with lower-case masking be converted to unmasked sequence?

    Here are two examples of commands that will remove lower-case masking:

    perl -pe '/^[^>]/ and $_=uc' genomic.fna > genomic.unmasked.fna

    -or-

    awk '{if(/^[^>]/)$0=toupper($0);print $0}' genomic.fna > genomic.unmasked.fna
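
    For pipelines written in Python, the same conversion could be sketched as follows. The function names are hypothetical; as in the commands above, definition lines (those starting with '>') are copied unchanged and all other lines are upper-cased.

```python
def unmask_line(line):
    """Upper-case a sequence line; leave FASTA definition lines untouched."""
    return line if line.startswith(">") else line.upper()


def unmask_fasta(in_path, out_path):
    """Write a copy of a FASTA file with lower-case masking removed."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(unmask_line(line))
```

    For example, unmask_fasta("genomic.fna", "genomic.unmasked.fna") produces the same output file as the commands above.
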
  23. How can sequence with lower-case masking be converted to sequence masked with Ns?

    Here are two examples of commands that will convert lower-case masking to masking with Ns (hard-masked):

    perl -pe '/^[^>]/ and $_=~ s/[a-z]/N/g' genomic.fna > genomic.N-masked.fna

    -or-

    awk '{if(/^[^>]/)gsub(/[a-z]/,"N");print $0}' genomic.fna > genomic.N-masked.fna
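
    An equivalent Python sketch is shown below, again with hypothetical function names. Every lower-case letter on a sequence line is replaced by N, and definition lines are copied unchanged.

```python
import re

# Matches any single lower-case ASCII letter.
LOWER_CASE = re.compile(r"[a-z]")


def n_mask_line(line):
    """Replace lower-case letters with N; leave FASTA definition lines untouched."""
    return line if line.startswith(">") else LOWER_CASE.sub("N", line)


def n_mask_fasta(in_path, out_path):
    """Write a copy of a FASTA file with lower-case masking converted to Ns."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(n_mask_line(line))
```

    For example, n_mask_fasta("genomic.fna", "genomic.N-masked.fna") produces the same output file as the commands above.
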
  24. Firefox truncates long FTP directory and file names. How can I see the full names?

    The Firefox web browser is unable to display long FTP directory and file names in HTTP mode. The problem can be circumvented by changing the URL from "http://ftp..." to "ftp://ftp...".

  25. Do ftp://ftp.ncbi.nlm.nih.gov/ and ftp://ftp.ncbi.nih.gov/ provide the same content?

    These two paths are equivalent and currently provide the same content. However, ftp://ftp.ncbi.nlm.nih.gov/ is the preferred path; the abbreviated path, ftp://ftp.ncbi.nih.gov/, may not be supported indefinitely.

  26. The NCBI genomes FTP site makes extensive use of symbolic links to provide alternative paths to the same FTP files without duplicating the data. Many FTP clients have an incomplete implementation of the FTP symbolic link specification, or other bugs, that cause them to incorrectly treat symbolic links as files or directories. This may lead to the following problems:

    • a symbolic link to a file is presented as a folder/directory
    • a symbolic link to a directory is presented as a file
      • nevertheless, clicking on the "file" may still reveal it to be a folder/directory that can be browsed
    • a symbolic link is copied as an alias instead of being resolved

    To avoid these problems:

    • download files using either the rsync or HTTPS protocols instead of the FTP protocol (see above)
    • if using wget, append a '/' after the directory/folder name
    • try a different FTP client:
      • use a web browser that correctly shows what is a file, a directory or a symlink, such as Chrome or Firefox
      • for FileZilla
        • Windows: use the latest version of FileZilla
        • Mac OS X: the bug causing symlinks to be shown as files has been reported in FileZilla ticket #4490 but has not yet been fixed

Last updated: 2017-06-12T15:44:55-04:00