Upcoming Changes to EST and GSS Databases



Update: NCBI is now in the process of merging EST and GSS records into the Nucleotide database, and we expect to complete this process in early 2019. Accession.version and GI identifiers will not change during this process.

As of December 1, 2018, all records from the databases for Expressed Sequence Tags (EST) and Genome Survey Sequences (GSS) will reside in NCBI’s Nucleotide database. This change will provide a single point of access for all GenBank sequence data with a common look and feel.
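For scripts that call E-utilities, the practical effect is likely a change of target database. The sketch below assumes that searches formerly sent to the separate EST and GSS databases now go to Nucleotide, whose E-utilities name is `nuccore`. The helper name `build_esearch_url` and the example search term are illustrative and do not come from NCBI documentation.

```python
from urllib.parse import urlencode

# Public ESearch endpoint of the NCBI E-utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nuccore", retmax=20):
    """Build an ESearch URL (hypothetical helper, for illustration only).

    Assumption: EST and GSS records are searched through the Nucleotide
    database (db=nuccore) after the merge.
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # Only prints the URL; no network request is made.
    print(build_esearch_url("Zea mays[Organism]"))
```

The function only assembles the request URL, so you can inspect it before sending anything to NCBI.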

Read more to learn about how this change affects these resources:

  • Websites (Entrez)
  • APIs (E-utilities)
  • FTP sites
  • Submission procedures
  • BLAST
  • TSA (have a look if you’re not familiar!)

Continue reading

Join NCBI at PAG in San Diego, January 12–16, 2019


Next week, NCBI staff will attend the Plant and Animal Genome (PAG) Conference. We have several activities planned, including 1 booth (#223), 4 workshops and 2 posters.

Read on to learn more about what you can look forward to if you’re attending PAG this year. (Note: The listed times are Pacific time.)

Continue reading

Apply now to join the Seattle Biological Data Science Hackathon, February 4-6, 2019


From February 4-6, 2019, the NCBI will help with a data science hackathon at the Fred Hutchinson Cancer Research Center in Seattle. To apply, complete this form (approximately 10 minutes to complete). Initial applications are due Friday, January 11th by 11 pm ET.

The hackathon will focus on genomics as well as general data science. This event is for researchers, including students and postdocs, who have already engaged in the use of large datasets or in the development of pipelines for analyses from high-throughput experiments. Some projects are available to other non-scientific developers, mathematicians, or librarians.

Continue reading

BLAST+ 2.8.1 with New Databases and Better Performance


BLAST+ 2.8.1 is now available for download from our FTP site. This is the first production release of standalone BLAST to support the new BLAST v5 databases (BLASTDBv5), which are also now available. The new databases have taxonomy information for the database sequences built in. This gives you the following important advantages over the v4 databases.

  1. The ability to limit your search by taxonomic group — species level as well as higher taxa.
  2. Improved performance when limiting BLAST search with accessions.
  3. Retrieval of sequences by taxonomic group from a BLAST database with blastdbcmd.
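As a rough sketch of how the taxonomy features might be used from a script, the Python below assembles command lines for `blastn` and `blastdbcmd`. It assumes both programs accept a `-taxids` option taking comma-separated NCBI taxonomy IDs, which is our understanding of the 2.8.1 interface; please confirm with `blastn -help` and `blastdbcmd -help`. The database name `my_v5_db`, the file names, and the helper function names are placeholders.

```python
import shlex


def blastn_taxid_command(query, db, taxids, out="results.tsv"):
    """Return a blastn argument list limited to the given taxonomy IDs.

    Assumption: blastn accepts -taxids with a comma-separated list.
    """
    taxid_arg = ",".join(str(t) for t in taxids)
    return ["blastn", "-query", query, "-db", db,
            "-taxids", taxid_arg, "-outfmt", "6", "-out", out]


def blastdbcmd_taxid_command(db, taxids, out):
    """Return a blastdbcmd argument list that retrieves sequences by taxon.

    Assumption: blastdbcmd accepts -taxids in the same form as blastn.
    """
    taxid_arg = ",".join(str(t) for t in taxids)
    return ["blastdbcmd", "-db", db, "-taxids", taxid_arg, "-out", out]


if __name__ == "__main__":
    # 9606 is the NCBI taxonomy ID for Homo sapiens.
    # The commands are printed, not executed.
    print(shlex.join(blastn_taxid_command("query.fa", "my_v5_db", [9606])))
    print(shlex.join(blastdbcmd_taxid_command("my_v5_db", [9606], "human.fa")))
```

Building the argument list separately from running it makes it easy to review the command first and then pass it to `subprocess.run` once the flags are confirmed for your installed version.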

There are some additional enhancements to the search program options.

  1. A new option (-subject_besthit) culls HSPs on a per subject sequence basis by removing HSPs that are completely enveloped by another HSP. This is an experimental option and is subject to change.
  2. Use of the -max_target_seqs option for formats 0-4 is now allowed. The number of alignments and descriptions will be set to the value of -max_target_seqs.
  3. BLAST now issues a warning about the possibility of not seeing all equivalent matches if -max_target_seqs is set to less than five.
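To make "completely enveloped" concrete, here is a toy illustration of the kind of culling described for -subject_besthit. It is not the BLAST implementation. It assumes envelopment is judged on query coordinates alone, and that two HSPs with identical ranges are both kept. The `HSP` type and the `cull_enveloped` function are hypothetical names.

```python
from collections import defaultdict
from typing import NamedTuple


class HSP(NamedTuple):
    subject: str   # subject sequence identifier
    q_start: int   # first aligned query position
    q_end: int     # last aligned query position


def cull_enveloped(hsps):
    """Drop HSPs whose query range lies inside a larger HSP on the same subject."""
    by_subject = defaultdict(list)
    for hsp in hsps:
        by_subject[hsp.subject].append(hsp)

    kept = []
    for group in by_subject.values():
        for hsp in group:
            enveloped = any(
                other.q_start <= hsp.q_start
                and hsp.q_end <= other.q_end
                # Identical ranges do not envelop each other in this sketch.
                and (other.q_start, other.q_end) != (hsp.q_start, hsp.q_end)
                for other in group
            )
            if not enveloped:
                kept.append(hsp)
    return kept


if __name__ == "__main__":
    hits = [HSP("s1", 1, 100), HSP("s1", 10, 50),
            HSP("s1", 90, 120), HSP("s2", 10, 50)]
    for hsp in cull_enveloped(hits):
        print(hsp)
```

In this example only the `s1` HSP covering positions 10 to 50 is removed, because it sits entirely inside the `s1` HSP covering positions 1 to 100. The same range on `s2` is kept, since culling is done per subject sequence.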

The new release also includes a few bug fixes.  Please see the release notes for additional details and, as always, write to us at blast-help@ncbi.nlm.nih.gov with any questions.

Virus Hunting Data Science Hackathon next week in San Diego


From January 9th – 11th, the NCBI will help run a bioinformatics hackathon in Southern California hosted by the Computational Sciences Research Center at San Diego State University. We reached out to the global computational biology and virology community as part of this effort to make data more accessible.

The hackathon teams look forward to leveraging metagenomic datasets in the cloud to find data based on organismal content and update taxonomy – but most of all – hunt down new viruses!

Follow along with the event with NCBI tweets and see our work on GitHub.

Update single records easily with ClinVar’s Single SCV Update


The ClinVar Team is happy to announce a new online form in the ClinVar Submission Portal, the Single SCV Update, which makes it easier for you to update a single record.

The new ClinVar Single SCV Update form, showing the sections for editing the evaluation date, clinical significance, condition, and citations.

Continue reading

IgBLAST 1.12 now available


We’ve released a new version of IgBLAST, v1.12. This new version increases the allowed distance between V gene end and J gene start positions (from 90 bp to 150 bp) as well as between V gene end and D gene start positions (from 55 bp to 120 bp) to accommodate the extremely long VDJ junctions found in some antibodies.

IgBLAST 1.12 now uses a 1-based sequence coordinate system, reflecting the change in the new AIRR Rearrangement Schema. It also includes fixes for minor bugs found in previous versions.
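If your downstream code slices sequences using reported coordinates, the switch to 1-based positions matters. The sketch below assumes the coordinates are 1-based and closed (both start and end are included), which is our reading of the AIRR convention; the function names are illustrative.

```python
def one_based_closed_to_slice(start, end):
    """Convert a 1-based, closed interval to a Python slice (0-based, half-open)."""
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based interval: {start}..{end}")
    return slice(start - 1, end)


def extract_region(sequence, start, end):
    """Return the subsequence covered by 1-based closed coordinates."""
    return sequence[one_based_closed_to_slice(start, end)]


if __name__ == "__main__":
    # Positions 2 through 4 of ACGTACGT are C, G and T.
    print(extract_region("ACGTACGT", 2, 4))
```

Only the start is shifted by one. The end value stays the same because a closed 1-based end and a half-open 0-based end refer to the same boundary.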

The new executables are available on the NCBI FTP site.

For more information, please read:

IgBLAST facilitates the analysis of immunoglobulin and T cell receptor variable domain sequences.

500 organisms annotated with the Eukaryotic Genome Annotation Pipeline


This month, the NCBI Eukaryotic Genome Annotation Pipeline annotated its 500th organism! The lucky winner is Pocillopora damicornis, a stony reef-building coral frequently used as an experimental model, whose larval dispersal and development are affected by environmental changes in the oceans.

Stony coral (Pocillopora damicornis)

Continue reading

Visit NCBI at the ASCB | EMBO meeting in San Diego, December 9-11, 2018


Going to the ASCB | EMBO meeting? Stop by the NCBI booth (#327) to learn about all that NCBI has to offer, ask questions, and provide feedback on how we can better meet your needs for research and teaching.

Booth #327, Exhibit Hall:

  • Sunday, December 9, 9:30 AM – 4:00 PM
  • Monday, December 10, 9:30 AM – 4:00 PM
  • Tuesday, December 11, 9:30 AM – 4:00 PM

Visit the booth anytime during exhibit hours to discuss any topic or just to say hello. We’re also offering specific times at the booth for focused conversations about using specific sets of NCBI resources in your research and teaching.

Discussion Sessions:

Sunday

  • 12:30 PM NCBI BLAST in research and teaching

Monday

  • 12:30 PM Jupyter notebooks to teach scripting and NCBI resources

Tuesday

  • 12:30 PM EDirect for command-line access to NCBI databases
  • 2:00 PM Jupyter notebooks to teach scripting and NCBI resources

To stay up-to-date about NCBI at ASCB or in general, follow us on Twitter at @NCBI.

Adapting flatfile parsers for GenBank’s new accession formats


As previously announced, GenBank and other INSDC members will expand the accession formats used for sequencing projects by the end of this year. We’re introducing these new formats to accommodate the growth of Whole Genome Shotgun (WGS), Transcriptome Shotgun Assembly (TSA), and Targeted Locus Study (TLS) sequencing projects. More details about those changes are available on NCBI Insights.

You may have to adjust your code and databases to accommodate the new formats’ longer length. In particular, the first line of the flatfile format, referred to as the LOCUS line, includes the “Locus Name” (usually identical to the accession number), which may now grow to as long as 20 characters. See section 3.4.4 of the GenBank release notes for examples of how the LOCUS line might change.

Since 2003, the GenBank release notes have recommended that flatfile parsers split each line into whitespace-separated tokens, rather than rely on fixed column positions, to accommodate changes like the one described in section 3.4.4. If your flatfile parsers rely solely on position, you may have to modify them. In our internal testing, BioPython and BioPerl properly handled most of the examples shown in section 3.4.4 and only had issues with the last, theoretical examples, in which the sequence length no longer ends at position 40. We recommend adjusting your code to accommodate those theoretical examples as future-proofing.
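A minimal sketch of the token-based approach for the LOCUS line follows. It assumes the tokens appear in the order name, length, unit, molecule type, optional topology, division, date. The `LocusLine` type and the `parse_locus_line` function are illustrative names, and the long locus name in the usage example is made up. This is not how BioPython or BioPerl parse the line.

```python
from typing import NamedTuple, Optional


class LocusLine(NamedTuple):
    name: str
    length: int
    unit: str                # "bp" or "aa"
    molecule: str            # e.g. "DNA", "mRNA"; may be empty
    topology: Optional[str]  # "linear", "circular", or None if absent
    division: str
    date: str


TOPOLOGIES = {"linear", "circular"}


def parse_locus_line(line):
    """Parse a LOCUS line by whitespace tokens instead of column positions."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS":
        raise ValueError(f"not a recognizable LOCUS line: {line!r}")

    name, length, unit = tokens[1], int(tokens[2]), tokens[3]
    # Division and date are read from the right so that an optional
    # topology token does not shift them.
    division, date = tokens[-2], tokens[-1]
    middle = tokens[4:-2]
    topology = middle.pop() if middle and middle[-1] in TOPOLOGIES else None
    return LocusLine(name, length, unit, " ".join(middle),
                     topology, division, date)


if __name__ == "__main__":
    short = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
    # Made-up 20-character locus name that pushes the length past column 40.
    long_name = "LOCUS       ABCDEFGH010000000001 1234567890 bp DNA linear PLN 21-JUN-1999"
    print(parse_locus_line(short))
    print(parse_locus_line(long_name))
```

Because nothing here depends on where a field starts or ends on the line, a longer locus name or sequence length does not change the result.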

Please write to the helpdesk with any questions about the new formats.