GEO Overview

  1. General
  2. Query and Browse
  3. Data Download and Format
  4. Data Analysis
  5. Deposit and Update

1. General Overview

GEO serves as a public repository for a wide range of high-throughput experimental data. These data include single- and dual-channel microarray-based experiments measuring mRNA, genomic DNA, and protein abundance, as well as non-array techniques such as serial analysis of gene expression (SAGE) and mass spectrometry proteomic data.

At its most basic level of organization, GEO has four entity types that users may supply:

Submitter

A submitter entity contains contact and authentication information about the submitter. This information is kept only so that the source of data in GEO can be properly referenced. A submitter entity may have relationships to many platforms, many samples, and many series.

Platform

A platform record describes the list of elements on the array (e.g., cDNAs, oligonucleotide probesets, ORFs, antibodies) or the list of elements that may be detected and quantified in that experiment (e.g., SAGE tags, peptides). Each platform record is assigned a unique and stable GEO accession number (GPLxxx). A platform may reference many samples that have been submitted by multiple submitters.

Sample

A sample record describes the conditions under which an individual sample was handled, the manipulations it underwent, and the abundance measurement of each element derived from it. Each sample record is assigned a unique and stable GEO accession number (GSMxxx). A sample entity must reference exactly one platform and may be included in multiple series.

Series

A series record defines a set of related samples considered to be part of a group, how the samples are related, and if and how they are ordered. A series provides a focal point and description of the experiment as a whole. Series records may also contain tables describing extracted data, summary conclusions, or analyses. Each series record is assigned a unique and stable GEO accession number (GSExxx).
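The relationships among platforms, samples, and series described above can be pictured with a small data model. The following is only an illustrative sketch: the class names, field names, and accession strings are hypothetical and do not reflect GEO's internal schema.

```python
from dataclasses import dataclass, field


@dataclass
class Platform:
    accession: str                      # e.g. a GPLxxx accession
    element_ids: list = field(default_factory=list)


@dataclass
class Sample:
    accession: str                      # e.g. a GSMxxx accession
    platform: Platform                  # a sample references exactly one platform
    values: dict = field(default_factory=dict)


@dataclass
class Series:
    accession: str                      # e.g. a GSExxx accession
    samples: list = field(default_factory=list)


def series_containing(sample, all_series):
    """Return the accessions of every series that includes the sample."""
    return [s.accession for s in all_series if sample in s.samples]


# Made-up accessions, for illustration only.
platform = Platform("GPL0001", ["1", "2", "3"])
sample = Sample("GSM0001", platform)
series_a = Series("GSE0001", [sample])
series_b = Series("GSE0002", [sample])
print(series_containing(sample, [series_a, series_b]))
```

The single `platform` field on `Sample` reflects the one-platform rule, while membership in a series is held by the series itself, so one sample can appear in several series.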

GEO DataSets

GEO DataSets (GDSxxx) are curated sets of GEO sample data. A GDS record represents a collection of biologically and statistically comparable GEO samples and forms the basis of GEO's suite of data display and analysis tools. Samples within a GDS refer to the same platform, that is, they share a common set of probe elements. Value measurements for each sample within a GDS are assumed to be calculated in an equivalent manner, that is, considerations such as background processing and normalization are consistent across the dataset. Information reflecting experimental design is provided through GDS subsets.
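Because each record type carries its own accession prefix (GPL, GSM, GSE, GDS), the type of a record can be read from its accession number. A minimal sketch follows; the function name and the example accession numbers are illustrative assumptions, not part of any GEO tool.

```python
import re

# Prefixes as given in the entity descriptions above.
RECORD_TYPES = {
    "GPL": "platform",
    "GSM": "sample",
    "GSE": "series",
    "GDS": "dataset",
}

_ACCESSION = re.compile(r"^(GPL|GSM|GSE|GDS)(\d+)$")


def classify_accession(accession):
    """Return the record type for an accession such as 'GSM123'."""
    match = _ACCESSION.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a recognized GEO accession: {accession!r}")
    return RECORD_TYPES[match.group(1)]


# Hypothetical accession numbers used only to exercise the function.
for acc in ("GPL10", "GSM20", "GSE30", "GDS40"):
    print(acc, classify_accession(acc))
```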


2. Query and Browse Overview

GEO data can be retrieved in several ways:

  • To look at a particular GEO record for which you have the accession number, use the Accession Display bar found at the foot of the GEO homepage and at the top of each GEO record. This tool has several options for selecting the format and amount of data to view (see the Data Download and Format overview below).
  • To query all GEO submissions in a specific field, or over all fields, use either the Entrez GDS or Entrez GEO interfaces. Entrez GDS queries all GEO DataSet annotation, allowing identification of experiments of interest. Entrez GEO queries precomputed gene expression/molecular abundance profiles, allowing identification of genes or sequences or profiles of interest. As with any other Entrez database, a simple Boolean phrase may be entered and restricted to any number of supported attribute fields, enabling effective query and mining of GEO data.
  • To browse lists of GEO data and experiments, use either the GDS browser or view the list of current GEO repository contents.
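A Boolean phrase restricted to attribute fields, as described above, can also be composed programmatically. The sketch below builds such a phrase and wraps it in a request URL. It assumes that queries are sent through NCBI's E-utilities `esearch` service with `db=gds`, and that `[Organism]` is a supported field tag; neither detail is stated in this overview, and the helper names are hypothetical. The code only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities esearch (not described in this overview).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(clauses, operator="AND"):
    """Join (text, field) pairs into one Boolean phrase.

    A field of None leaves that clause unrestricted (all fields).
    """
    parts = []
    for text, field in clauses:
        parts.append(f"{text}[{field}]" if field else text)
    return f" {operator} ".join(parts)


def esearch_url(term, db="gds", retmax=20):
    """Return a URL-encoded esearch request for the given phrase."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


term = build_term([
    ("heat shock", None),                        # searched over all fields
    ("Saccharomyces cerevisiae", "Organism"),    # assumed field tag
])
print(term)
print(esearch_url(term))
```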


3. Data Download and Format Overview

GEO data can be viewed and downloaded in several formats:

  • GEO records. Several options are provided on the Accession Display bar (found at the foot of the GEO homepage and at the top of each GEO record) for the retrieval and display of original GEO records. The Scope feature allows display of a single accession number (Self) or any (Platform, Sample, or Series) or all (Family) records related to that accession. Amount dictates the quantity of data displayed, with choices including metadata only, metadata and the first 20 rows of the data table, data table only, or full metadata/data table records. Format controls whether records are displayed in HTML or in SOFT format. SOFT (Simple Omnibus Format in Text) is an ASCII text format that was designed to be a machine-readable representation of data retrieved from, or submitted to, GEO. SOFT is also a line-based format, making it easy to parse using commonly available text processing and formatting languages. For a complete description of SOFT format, see the SOFT guide.
  • GDS records. Each GDS record has three options for the download of that dataset. The complete SOFT document contains all information for that dataset, including dataset description, type, organism, subset allocation, etc., as well as a data table containing identifiers and values. The data only option allows download of the data table only, whereas the quick view provides dataset descriptive information and the first 20 rows of the data table. The full text tab-delimited data tables provided with these downloads may prove suitable for upload into your favorite microarray analysis software package or database/spreadsheet application.
  • Both GDS and GEO data are available for bulk download via FTP. GEO DataSets may be downloaded in complete GDS SOFT format, whereas complete original GEO records, partitioned by GEO platform, may be downloaded in SOFT format.
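Because SOFT is line-based, a short script is usually enough to read it. The sketch below assumes the conventions described in the SOFT guide rather than in this overview: lines beginning with `^` start a new entity, lines beginning with `!` carry `label = value` attributes (or bare table begin/end markers), lines beginning with `#` describe data table columns, and all other lines are tab-delimited table rows. The example record and the function name are made up for illustration; consult the SOFT guide for the authoritative format.

```python
def parse_soft(text):
    """Parse SOFT-style text into a list of entity dictionaries."""
    entities = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("^"):
            # Entity indicator line, e.g. "^PLATFORM = name".
            kind, _, name = line[1:].partition("=")
            current = {"type": kind.strip(), "name": name.strip(),
                       "attributes": {}, "columns": {}, "rows": []}
            entities.append(current)
        elif current is None:
            continue  # ignore anything before the first entity line
        elif line.startswith("!"):
            key, sep, value = line[1:].partition("=")
            if sep:  # bare markers such as table begin/end lines are skipped
                current["attributes"].setdefault(key.strip(), []).append(value.strip())
        elif line.startswith("#"):
            # Column description line, e.g. "#ID = description".
            key, _, value = line[1:].partition("=")
            current["columns"][key.strip()] = value.strip()
        else:
            current["rows"].append(line.split("\t"))
    return entities


# A made-up record, not a real GEO submission.
EXAMPLE = "\n".join([
    "^PLATFORM = example_platform",
    "!Platform_title = Example array",
    "#ID = unique spot identifier",
    "#GB_ACC = GenBank accession",
    "!platform_table_begin",
    "ID\tGB_ACC",
    "1\tU83857",
    "2\tM61764",
    "!platform_table_end",
])

record = parse_soft(EXAMPLE)[0]
print(record["type"], record["attributes"], record["rows"])
```

The first row collected in `rows` is the column header row; the remaining rows are data.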


4. Data Analysis Overview

Several features are provided to assist with the exploration, visualization, and analysis of GEO data:

  • GEO data may be interrogated using Entrez GEO and Entrez GDS. Entrez GEO queries precomputed gene expression/molecular abundance profiles, whereas Entrez GDS queries all experimental annotation. As with any other NCBI Entrez database, a simple Boolean phrase may be entered and restricted to any number of supported attribute fields, enabling effective query and mining. Experiments of interest may be located using Entrez GDS with attributes such as keywords, platform type, author, organism, etc. Individual gene expression and molecular abundance profiles of interest may be located using Entrez GEO with attributes such as gene name, GenBank accession number, keywords, abundance, variability, etc. Data may also be located based on sequence similarity using the GEO BLAST feature.
  • Related data of interest may be located using the Profile neighbors and Sequence neighbors links found on Entrez GEO documents. Profile neighbors retrieves other genes/molecules that show a similar profile shape over that dataset, possibly implying some common function or regulatory elements. Sequence neighbors searches all GEO datasets for related genes based on nucleotide sequence similarity, and thus may be useful in identifying sequence homologs such as related gene family members, or for cross-species comparisons.
  • GDS records contain links to features that help describe and visualize an entire dataset. Sample Hierarchical Trees depict the relationship between the samples within a dataset, that is, which samples are more closely related to each other based on correlation coefficient clustering. The Value Distribution depicts a "Box and whiskers" plot for each sample within a dataset, allowing an overview of the distribution of values that may help determine the quality and comparability of a dataset.
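The two quantities behind these displays can be illustrated briefly. The sketch below computes a Pearson correlation coefficient between two samples (the kind of similarity measure on which correlation-based clustering relies) and the five values drawn in a box-and-whiskers plot. It is not GEO's implementation: the exact correlation measure, clustering method, and quartile convention GEO uses are not stated here, and the sample values are invented.

```python
import statistics
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length value lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Invented measurements for three samples over the same four elements.
sample_a = [1.0, 2.0, 3.0, 4.0]
sample_b = [2.0, 4.0, 6.0, 8.0]
sample_c = [4.0, 3.0, 2.0, 1.0]

print(pearson(sample_a, sample_b))   # same profile shape
print(pearson(sample_a, sample_c))   # opposite profile shape
print(five_number_summary([1, 2, 3, 4, 5]))
```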


5. Deposit and Update Overview

Once you have established your own private GEO account, there are two ways in which data may be submitted to GEO:

  • Interactive Web forms. GEO's Web deposit site provides a simple step-by-step procedure for deposit of individual records. A brief Web deposit guide is provided below. A detailed Web deposit guide for submitting data is also available.
  • Batch Direct Deposit in SOFT format. If your data are already in a database, or if you have many samples to submit, it is likely that submission of data via Direct Deposit in SOFT format is the most convenient deposit route. This process was designed for rapid batch submission of data; SOFT files may be readily produced from common spreadsheet and database applications. A detailed SOFT guide and examples of SOFT documents are available for review.

Web deposit brief

Step 1

Create a GEO account for yourself by entering your contact information. This is publicly accessible information that is necessary so that proper credit can be given for data.

Step 2

Check to see if your platform already exists in GEO. If your experiments were performed using commercial arrays (e.g., Affymetrix), it may not be necessary to submit a platform record. If you find the relevant platform already deposited in GEO (view all commercial nucleotide platforms), make a note of its GEO accession number and go directly to Step 4. (Note that no platform definition is required for SAGE data; again, go directly to Step 4.)

Step 3

Submit the platform definition. A platform record consists of a data table listing the elements (e.g., cDNAs, oligonucleotides, ORFs, antibodies) present on the array and accompanying array descriptive information (but NO hybridization measurements). You will first be asked to specify the platform category from a pull-down menu (e.g., non-commercial nucleotide). Next you must provide the platform data table in text, tab-delimited format. The first row of the data table must contain the column headers; thereafter each row identifies a single element, or "spot", on the array. Requirements for platform data tables are that an identifier column be named "ID" and that each ID be unique within the platform. Where possible, some form of sequence identifier should be included for each ID, e.g., a GenBank accession number, clone ID, ORF, or actual sequence; this information is extracted for Entrez GEO indexing purposes. Any number of additional descriptive columns may also be supplied. For example, for a non-commercial nucleotide platform, the data table might look like the following:

ID	GB_ACC	GENE_SYMBOL	GENE_NAME
1	U83857	API5	apoptosis inhibitor 5
2	M61764	TUBG1	tubulin, gamma 1
3	NM_012094	PRDX5	peroxiredoxin 5

After the data table has passed validation, you will be asked to supply the platform title, organism, description, authors, and keywords. The "Description" field may hold very large volumes of data, and submitters are encouraged to provide a thorough report of the platform manufacture and design.
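Before uploading, the two stated requirements for a platform data table (a column named "ID", and IDs that are unique within the platform) can be checked locally. The following is a minimal sketch of such a check; it is not GEO's validator, which may apply further rules, and the function name is hypothetical.

```python
def validate_platform_table(text):
    """Return a list of problems found in a tab-delimited platform table."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ["table is empty"]
    header = lines[0].split("\t")
    if "ID" not in header:
        return ['header row has no column named "ID"']
    id_col = header.index("ID")
    problems, seen = [], set()
    # Data rows are numbered from 1, not counting the header row.
    for row_no, line in enumerate(lines[1:], start=1):
        fields = line.split("\t")
        ident = fields[id_col].strip() if len(fields) > id_col else ""
        if not ident:
            problems.append(f"row {row_no}: missing ID")
        elif ident in seen:
            problems.append(f"row {row_no}: duplicate ID {ident!r}")
        else:
            seen.add(ident)
    return problems


# The example platform table from Step 3.
PLATFORM_TABLE = (
    "ID\tGB_ACC\tGENE_SYMBOL\tGENE_NAME\n"
    "1\tU83857\tAPI5\tapoptosis inhibitor 5\n"
    "2\tM61764\tTUBG1\ttubulin, gamma 1\n"
    "3\tNM_012094\tPRDX5\tperoxiredoxin 5\n"
)
print(validate_platform_table(PLATFORM_TABLE))
```

An empty list means both requirements are met.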

Step 4

Submit hybridization data (or SAGE tag count data) as sample records. A sample record references one platform and describes the abundance measurements of a single hybridization/experimental condition. You will first be asked to specify the experiment type from a pull-down menu (e.g., dual channel) and the reference platform GEO accession number. Next you must provide the sample data table in text, tab-delimited format. The first row of the data table must contain the column headers. Sample data tables require a column named "ID_REF", matching the "ID" column of the reference platform, and a "VALUE" column (or "TAG" and "COUNT" for SAGE data). For dual channel experiments, VALUES will reflect normalized log ratio measurements. For single channel experiments, VALUES will be normalized (scaled) signal count data (not log transformed).

GEO's data display and analysis tools are effective only when using normalized VALUES. If the median VALUES across samples in your datasets vary considerably, your dataset will be considered non-normalized and will not be incorporated into GEO's query and analysis tools.
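Submitters can get a rough sense of this before deposit by comparing the median VALUE of each sample. The sketch below does only that. The sample names and numbers are invented, and no cutoff is applied because this overview does not state how much variation GEO treats as "considerable".

```python
import statistics


def sample_medians(samples):
    """Map each sample name to the median of its VALUE column."""
    return {name: statistics.median(values) for name, values in samples.items()}


def median_spread(samples):
    """Difference between the largest and smallest sample medians."""
    medians = sample_medians(samples).values()
    return max(medians) - min(medians)


# Invented VALUE columns for two hypothetical samples.
samples = {
    "sample_A": [-0.2, 0.0, 0.1],
    "sample_B": [1.9, 2.0, 2.3],
}
print(sample_medians(samples))
print(median_spread(samples))
```

A large spread relative to the range of the data suggests the samples were not normalized in an equivalent manner.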

For a dual channel sample related to the example platform given above, the data table might look like the following:

ID_REF	VALUE	NormCH1	NormCH2
1	-0.18	1322	1994
2	2.74	5547	2025
3	0.17	489	334

In this example, the spot with ID_REF=2 (with VALUE=2.74) matches GenBank accession M61764 from the platform above. Again, any number of auxiliary columns may be supplied, e.g., supporting measurements and calculations, quality evaluation or flags. Special notes for Affymetrix data submitters are provided.
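The lookup described above, from a sample row to its platform annotation via ID_REF, can be sketched as a simple join of the two example tables. The helper names are hypothetical.

```python
def read_table(text):
    """Read a tab-delimited table into a list of row dictionaries."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    return [dict(zip(header, ln.split("\t"))) for ln in lines[1:]]


def annotate(sample_rows, platform_rows):
    """Attach the platform's GB_ACC to each sample row via ID_REF."""
    by_id = {row["ID"]: row for row in platform_rows}
    annotated = []
    for row in sample_rows:
        spot = by_id.get(row["ID_REF"])
        if spot is None:
            raise KeyError(f"ID_REF {row['ID_REF']!r} is not on the platform")
        annotated.append({**row, "GB_ACC": spot["GB_ACC"]})
    return annotated


# The example platform and sample tables from Steps 3 and 4.
PLATFORM = (
    "ID\tGB_ACC\tGENE_SYMBOL\tGENE_NAME\n"
    "1\tU83857\tAPI5\tapoptosis inhibitor 5\n"
    "2\tM61764\tTUBG1\ttubulin, gamma 1\n"
    "3\tNM_012094\tPRDX5\tperoxiredoxin 5\n"
)
SAMPLE = (
    "ID_REF\tVALUE\tNormCH1\tNormCH2\n"
    "1\t-0.18\t1322\t1994\n"
    "2\t2.74\t5547\t2025\n"
    "3\t0.17\t489\t334\n"
)

for row in annotate(read_table(SAMPLE), read_table(PLATFORM)):
    print(row["ID_REF"], row["VALUE"], row["GB_ACC"])
```

Raising an error for an unknown ID_REF is a design choice for this sketch; it makes a mismatch between sample and platform visible immediately.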

Sample records should be supplied as complete hybridization tables to GEO. This allows the scientific community to review and analyze the entire dataset and is the principal reason why many journals require microarray data deposit in a public database. The appropriate place to present an extracted table of significant differences is as a separate table in a series record that describes the overall experiment (see below).

After the data table has passed validation, you will be asked to supply the sample title, organism, description, authors, and keywords. The "Description" field may hold very large volumes of data, and submitters are encouraged to provide a thorough report of the sample, which may include a detailed description of the biological source, experimental conditions and treatments, labeling and hybridization protocols, spot quantification, and normalization schemes.

Step 5

After you have submitted all of your sample data, submit a series record. A series brings together a related group of samples and provides a focal point and description of the experiment as a whole. Information reflecting experimental sample subsets may also be specified. Submitters are encouraged to supply information regarding the overall experimental design, aim, summary results, and conclusions. Tables of extracted data, summary conclusions, or analyses may be included in series records. If you want to include such data, email the table to GEO staff at geo@ncbi.nlm.nih.gov, and they will attach it to your series record.

Each record you submit will receive a unique and stable GEO accession number that you may quote in manuscripts. Records may remain private for several months until the data are published. During this period, you may request a "read-only" password (email geo@ncbi.nlm.nih.gov) that allows collaborators or reviewers confidential access to your private data before publication.

Please visit our detailed guide to Web deposit for more information on submitting data via the Web.

Updates

Edits and updates to individual records may be performed at any time by submitters using the update section on the GEO Web deposit/update page. If global edits are required for multiple records, for example, bringing forward the release date or editing a data table header, simply email the details to GEO staff at geo@ncbi.nlm.nih.gov and they will process a batch edit on your behalf.

