GEO Overview
- General
- Query and Browse
- Data Download and Format
- Data Analysis
- Deposit and Update
1. General Overview
GEO serves as a public repository for a wide range of high-throughput experimental data.
These data include single- and dual-channel microarray-based experiments measuring mRNA, genomic DNA, and
protein abundance, as well as non-array techniques such as serial analysis of gene expression (SAGE)
and mass spectrometry proteomic data.
At the most basic level of organization of GEO, there are four entity types that may be
supplied by users:
Submitter
A submitter entity contains contact and authentication information about
the submitter. This information is kept only so that the source of data in GEO can be properly referenced.
A submitter entity may have relationships to many platforms, many samples, and many series.
Platform
A platform record describes the list of elements on the array (e.g., cDNAs,
oligonucleotide probesets, ORFs, antibodies) or the list of elements that may be detected and quantified
in that experiment (e.g., SAGE tags, peptides). Each platform record is assigned a unique and stable GEO
accession number (GPLxxx). A platform may reference many samples that have been submitted by multiple submitters.
Sample
A sample record describes the conditions under which an individual sample
was handled, the manipulations it underwent, and the abundance measurement of each element derived from it.
Each sample record is assigned a unique and stable GEO accession number (GSMxxx). A sample entity must reference
exactly one platform and may be included in multiple series.
Series
A series record defines a set of related samples considered to be part
of a group, how the samples are related, and if and how they are ordered. A series provides a focal point
and description of the experiment as a whole. Series records may also contain tables describing extracted
data, summary conclusions, or analyses. Each series record is assigned a unique and stable GEO
accession number (GSExxx).
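The relationships among these record types can be sketched as a small data model. The following is a minimal illustration in Python, not GEO's actual schema: the class names, field names, the `check_accession` helper, and the accession numbers used are all hypothetical. Only the accession prefixes (GPL, GSM, GSE) and the rule that a sample references one platform but may belong to several series come from the descriptions above.

```python
from dataclasses import dataclass, field


def check_accession(accession: str, prefix: str) -> str:
    """Return the accession if it is `prefix` followed by digits, else raise."""
    digits = accession[len(prefix):]
    if not (accession.startswith(prefix) and digits.isdigit()):
        raise ValueError(f"{accession!r} is not a {prefix} accession")
    return accession


@dataclass
class Sample:
    accession: str   # GSMxxx
    platform: str    # exactly one GPLxxx accession
    submitter: str   # hypothetical submitter identifier

    def __post_init__(self):
        check_accession(self.accession, "GSM")
        check_accession(self.platform, "GPL")


@dataclass
class Series:
    accession: str                                # GSExxx
    samples: list = field(default_factory=list)   # ordered GSM accessions

    def __post_init__(self):
        check_accession(self.accession, "GSE")

    def add(self, sample: Sample) -> None:
        self.samples.append(sample.accession)


# Made-up accessions: one sample, one platform, two series.
sample = Sample("GSM101", "GPL7", "submitter_a")
first, second = Series("GSE1"), Series("GSE2")
first.add(sample)
second.add(sample)
```

Keeping the platform as a single field on the sample, and the sample list on the series, mirrors the one-to-many and many-to-many relationships described above.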
GEO DataSets
GEO DataSets (GDSxxx) are curated sets of GEO sample data. A GDS record represents a collection of
biologically and statistically comparable GEO samples and forms the basis of GEO's suite of
data display and analysis tools.
Samples within a GDS refer to the same platform, that is, they share a common set of probe elements.
Value measurements for each sample within a GDS are assumed to be calculated in an equivalent manner, that is,
considerations such as background processing and normalization are consistent across the dataset.
Information reflecting experimental design is provided through GDS subsets.
2. Query and Browse Overview
GEO data can be retrieved in several ways:
- To look at a particular GEO record for which you have the accession number, use
the Accession Display bar found at
the foot of the GEO homepage and at the top of each GEO record. This tool has several options for selecting the format
and amount of data to view (see the Data Download and Format overview below).
- To query all GEO submissions in a specific field, or over all fields, use either the Entrez
GDS or Entrez GEO interfaces. Entrez GDS queries all GEO DataSet annotation, allowing
identification of experiments of interest. Entrez GEO queries precomputed gene
expression/molecular abundance profiles, allowing identification of genes or sequences or
profiles of interest. As with any other Entrez database, a simple Boolean phrase may be entered
and restricted to any number of supported attribute fields, enabling effective query and mining of GEO data.
- To browse lists of GEO data and experiments, either use the
GDS browser or view the list of current
GEO repository contents.
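A fielded Boolean query of the kind described above can also be composed programmatically. The sketch below only builds a query string and a URL; it does not contact NCBI. It assumes the NCBI E-utilities `esearch` endpoint, the database name `gds` for Entrez GDS, and the `Organism` field tag, none of which are stated on this page. The helper names `fielded` and `build_search_url` are hypothetical.

```python
from urllib.parse import urlencode

# Assumed E-utilities search endpoint (not given in this overview).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded(term: str, field_name: str = "") -> str:
    """Quote a multi-word term and optionally restrict it to one field."""
    text = f'"{term}"' if " " in term else term
    return f"{text}[{field_name}]" if field_name else text


def build_search_url(db: str, *clauses: str, operator: str = "AND") -> str:
    """Join clauses with a Boolean operator and URL-encode the request."""
    query = f" {operator} ".join(clauses)
    return f"{EUTILS_ESEARCH}?{urlencode({'db': db, 'term': query})}"


# An all-fields phrase combined with a clause restricted to one field.
clauses = [fielded("heat shock"), fielded("Homo sapiens", "Organism")]
url = build_search_url("gds", *clauses)
```

The same clauses can be typed directly into the Entrez search box; the URL form is useful only when queries need to be scripted.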
3. Data Download and Format Overview
GEO data can be viewed and downloaded in several formats:
- GEO records
Several options are provided on the Accession Display bar (found at
the foot of the GEO homepage and at the top of each GEO record) for the retrieval and display of original GEO records.
The Scope feature displays the record for a single accession number (Self), any one type of related record
(Platform, Sample, or Series), or all related records (Family). Amount dictates the quantity of data displayed, with choices
including metadata only,
metadata and the first 20 rows of the data table, data table only, or full metadata/data table records.
Format controls whether records are displayed in HTML or in SOFT format. SOFT (Simple Omnibus Format in Text) is an
ASCII text format designed to be a machine-readable representation of data retrieved from, or submitted to, GEO. SOFT
is also a line-based format, making it easy to parse using commonly available text processing and
formatting languages. For a complete description of SOFT format, see the SOFT guide.
- GDS records
Each GDS record has three options for the download of
that dataset. The complete SOFT document contains all information for that dataset, including
dataset description, type, organism, subset allocation, etc., as well as a data table containing
identifiers and values. The data only option allows download of the data table only, whereas the
quick view provides dataset descriptive information and the first 20 rows of the data table.
The full text tab-delimited data tables provided with these downloads may prove suitable for
upload into your favorite microarray analysis software package or database/spreadsheet application.
- Both GDS and GEO data are available for bulk download via
FTP. GEO DataSets may be downloaded
in complete GDS SOFT format, whereas complete original GEO records, partitioned by GEO platform, may
be downloaded in SOFT format.
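Because SOFT is line-based, a downloaded file can be read with a few lines of code. The sketch below is a simplified reader, not a complete implementation of the format. It assumes the line conventions described in the SOFT guide (`^` starts an entity, `!` marks an attribute, `#` describes a data table column, and other lines are tab-delimited table rows); these conventions are not spelled out on this page. The function name `parse_soft` and the sample text, including the accession `GPL_EXAMPLE`, are hypothetical.

```python
EXAMPLE_SOFT = (
    "^PLATFORM = GPL_EXAMPLE\n"
    "!Platform_title = Example array\n"
    "#ID = unique spot identifier\n"
    "#GB_ACC = GenBank accession\n"
    "!platform_table_begin\n"
    "ID\tGB_ACC\n"
    "1\tU83857\n"
    "2\tM61764\n"
    "!platform_table_end\n"
)


def parse_soft(text: str) -> list:
    """Split SOFT text into entities with attributes, columns, and rows."""
    entities = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("^"):
            # Entity indicator line, e.g. "^PLATFORM = GPL_EXAMPLE".
            kind, _, name = line[1:].partition("=")
            current = {"type": kind.strip(), "id": name.strip(),
                       "attributes": {}, "columns": {},
                       "header": None, "rows": []}
            entities.append(current)
        elif current is None:
            continue  # ignore anything before the first entity
        elif line.startswith("!"):
            key, sep, value = line[1:].partition("=")
            if sep:  # table begin/end markers carry no value and are skipped
                current["attributes"].setdefault(key.strip(), []).append(value.strip())
        elif line.startswith("#"):
            key, _, value = line[1:].partition("=")
            current["columns"][key.strip()] = value.strip()
        elif current["header"] is None:
            current["header"] = line.split("\t")
        else:
            current["rows"].append(dict(zip(current["header"], line.split("\t"))))
    return entities


entities = parse_soft(EXAMPLE_SOFT)
```

Attributes are stored as lists because the same attribute label may appear on more than one line.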
4. Data Analysis Overview
Several features are provided to assist with the exploration, visualization, and analysis
of GEO data:
- GEO data may be interrogated using Entrez GEO
and Entrez GDS. Entrez GEO queries
precomputed gene expression/molecular abundance profiles, whereas Entrez GDS queries all experimental annotation. As with any other NCBI Entrez
database, a simple Boolean phrase may be entered
and restricted to any number of supported attribute fields, enabling effective query and mining. Experiments
of interest may be located using Entrez GDS with
attributes such as keywords, platform type, author, organism, etc.
Individual gene expression and molecular abundance profiles of interest may be located using
Entrez GEO with attributes such as gene name,
GenBank accession number, keywords, abundance, variability, etc. Data may also be located based on sequence similarity using
the GEO BLAST feature.
- Related data of interest may be located using the Profile neighbors and Sequence
neighbors links found on Entrez GEO documents. Profile neighbors retrieves other genes/molecules
that show a similar profile shape over that dataset, possibly suggesting some common function
or regulatory elements. Sequence neighbors searches all GEO datasets for related genes based on
nucleotide sequence similarity, and thus may be useful in identifying sequence homologs such as
related gene family members, or for cross-species comparisons.
- GDS records contain links to features that help describe and visualize an entire dataset.
Sample Hierarchical Trees depict the relationship between the samples within a dataset, that is,
which samples are more closely related to each other based on correlation coefficient clustering.
The Value Distribution depicts a box-and-whisker plot for each sample within a dataset, allowing an
overview of the distribution of values that may help determine the quality and comparability of a
dataset.
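The idea of a "similar profile shape" can be illustrated with a correlation calculation. The sketch below ranks profiles by Pearson correlation with a target profile. This page does not say which similarity measure the Profile neighbors feature uses, so Pearson correlation, the 0.9 cutoff, the function names, and the gene names and values are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        raise ValueError("a flat profile has no defined correlation")
    return sum(a * b for a, b in zip(dx, dy)) / denom


def profile_neighbors(target, profiles, min_r=0.9):
    """Return (name, r) pairs with r >= min_r, most similar first."""
    hits = [(name, pearson(profiles[target], values))
            for name, values in profiles.items() if name != target]
    return sorted((h for h in hits if h[1] >= min_r), key=lambda h: -h[1])


# Made-up values for one dataset: one row per gene, one column per sample.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],   # same shape as geneA, different scale
    "geneC": [4.0, 3.0, 2.0, 1.0],   # opposite trend
}
neighbors = profile_neighbors("geneA", profiles)
```

Correlation compares shape rather than magnitude, which is why `geneB` counts as a neighbor of `geneA` even though its values are twice as large.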
5. Deposit and Update Overview
Once you have established your own private GEO account,
there are two ways in which data may be submitted to GEO:
- Interactive Web forms. GEO's Web deposit site provides
a simple step-by-step procedure for deposit of individual records. A brief Web deposit guide is provided
below. A detailed Web deposit guide for submitting data is also available.
- Batch Direct Deposit in SOFT format. If your data are already in a database, or if you have many
samples to submit, it is likely that submission of data via
Direct Deposit in SOFT format
is the most convenient deposit route. This process was designed for rapid batch submission of data;
SOFT files may be readily produced from common spreadsheet and database applications. A detailed
SOFT guide and examples of SOFT documents
are available for review.
Web deposit brief
Step 1
Create a GEO account for yourself by
entering your contact information. This is publicly accessible information that is necessary
so that proper credit can be given for data.
Step 2
Check to see if your platform already exists in GEO. If your experiments were performed using commercial
arrays (e.g., Affymetrix), it may not be necessary to submit a platform record. If you find the relevant
platform already deposited in GEO (view all commercial nucleotide platforms),
make a note of its GEO accession number and go directly to Step 4. (Note that no platform definition is
required for SAGE data; again, go directly to Step 4.)
Step 3
Submit the platform definition. A platform record consists of a data table listing the elements
(e.g., cDNAs, oligonucleotides, ORFs, antibodies) present on the array and accompanying array
descriptive information (but NO hybridization measurements). You will first be asked to specify
the platform category from a pull-down menu (e.g., non-commercial nucleotide). Next you must
provide the platform data table in text, tab-delimited format. The first row of the data table
must contain the column headers; thereafter each row identifies a single element, or "spot", on the array.
Requirements for platform data tables are that an identifier column be named "ID" and that each ID be unique
within the platform. Where possible, some form of sequence identifier should be included
for each ID, e.g., a GenBank accession number, clone ID, ORF, or actual sequence; this information is
extracted for Entrez GEO indexing purposes. Any number of additional descriptive columns may also
be supplied. For example, for a non-commercial nucleotide platform, the data table might look like the following:
ID  GB_ACC     GENE_SYMBOL  GENE_NAME
1   U83857     API5         apoptosis inhibitor 5
2   M61764     TUBG1        tubulin, gamma 1
3   NM_012094  PRDX5        peroxiredoxin 5
After the data table has passed validation, you will be asked to supply the platform title, organism,
description, authors, and keywords. The "Description" field may hold very large volumes of data, and
submitters are encouraged to provide a thorough report of the platform manufacture and design.
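Before uploading, a platform data table can be checked locally against the requirements listed in this step. The sketch below is an informal pre-check, not GEO's own validation, which may apply further rules. It tests only what is stated above: a header row, a column named "ID", and unique ID values. The function name `read_platform_table` is hypothetical.

```python
import csv
import io

# The example platform table from this step, written as tab-delimited text.
PLATFORM_TABLE = (
    "ID\tGB_ACC\tGENE_SYMBOL\tGENE_NAME\n"
    "1\tU83857\tAPI5\tapoptosis inhibitor 5\n"
    "2\tM61764\tTUBG1\ttubulin, gamma 1\n"
    "3\tNM_012094\tPRDX5\tperoxiredoxin 5\n"
)


def read_platform_table(text: str):
    """Parse a tab-delimited platform table and check the ID column."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader, None)
    if not header or "ID" not in header:
        raise ValueError('the first row must be a header with an "ID" column')
    rows, seen = [], set()
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(header):
            raise ValueError(f"row {fields!r} does not match the header width")
        row = dict(zip(header, fields))
        if row["ID"] in seen:
            raise ValueError(f"duplicate ID {row['ID']!r}")
        seen.add(row["ID"])
        rows.append(row)
    return header, rows


header, rows = read_platform_table(PLATFORM_TABLE)
```

Quoting is disabled so that commas and quotation marks inside descriptive columns such as GENE_NAME are read as ordinary text.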
Step 4
Submit hybridization data (or SAGE tag count data) as sample records. A sample record references one
platform and describes the abundance measurements of a single hybridization/experimental condition.
You will first be asked to specify the experiment type from a pull-down menu (e.g., dual channel) and
the reference platform GEO accession number. Next you must provide the sample data table in text,
tab-delimited format. The first row of the data table must contain the column headers. Sample data
tables require a column named "ID_REF", matching the "ID" column of the reference platform, and a "VALUE"
column (or "TAG" and "COUNT" for SAGE data). For dual channel experiments, VALUES will reflect
normalized log ratio measurements. For single channel experiments, VALUES will be normalized (scaled) signal
count data (not log transformed).
GEO's data display and analysis tools are effective only when using normalized VALUES.
If the median VALUES across samples in your datasets vary considerably, your dataset will be
considered non-normalized and will not be incorporated into GEO's query and analysis tools.
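A quick way to anticipate this check is to compare per-sample medians before submitting. The sketch below is only an illustration: this page does not define how much variation counts as "considerable", so the `tolerance` parameter, the function names, and the sample names and values are all hypothetical.

```python
from statistics import median


def sample_medians(samples: dict) -> dict:
    """Map each sample name to the median of its VALUE column."""
    return {name: median(values) for name, values in samples.items()}


def median_spread(samples: dict) -> float:
    """Difference between the largest and smallest sample median."""
    medians = sample_medians(samples).values()
    return max(medians) - min(medians)


def medians_comparable(samples: dict, tolerance: float) -> bool:
    """True if all sample medians lie within `tolerance` of each other."""
    return median_spread(samples) <= tolerance


# Made-up log ratio VALUES for three samples; the third is shifted.
samples = {
    "sample_a": [-0.2, 0.0, 0.1],
    "sample_b": [-0.1, 0.05, 0.2],
    "sample_c": [1.9, 2.1, 2.4],
}
all_ok = medians_comparable(samples, tolerance=0.5)
```

A large spread suggests that at least one sample was scaled differently from the others and should be renormalized before deposit.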
For a dual channel sample related to the example platform
given above, the data table might look like the following:
ID_REF  VALUE  NormCH1  NormCH2
1       -0.18  1322     1994
2       2.74   5547     2025
3       0.17   489      334
In this example, the spot with ID_REF=2 (with VALUE=2.74) matches
GenBank accession M61764 from the platform above. Again, any number of
auxiliary columns may be supplied, e.g., supporting measurements and
calculations, quality evaluation or flags. Special notes for Affymetrix data submitters are provided.
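The link between a sample table and its platform can be checked in the same way. The sketch below reads the example sample table, confirms the required "ID_REF" and "VALUE" columns, and verifies that every ID_REF exists in the platform. It is an informal pre-check rather than GEO's validation, it does not cover the "TAG" and "COUNT" columns used for SAGE data, and the function name `read_sample_table` is hypothetical.

```python
# ID -> GenBank accession, taken from the example platform table in Step 3.
PLATFORM_GB_ACC = {"1": "U83857", "2": "M61764", "3": "NM_012094"}

# The example dual channel sample table, written as tab-delimited text.
SAMPLE_TABLE = (
    "ID_REF\tVALUE\tNormCH1\tNormCH2\n"
    "1\t-0.18\t1322\t1994\n"
    "2\t2.74\t5547\t2025\n"
    "3\t0.17\t489\t334\n"
)


def read_sample_table(text: str, platform_ids) -> list:
    """Parse a sample table and check it against the platform IDs."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("the sample table is empty")
    header = lines[0].split("\t")
    for required in ("ID_REF", "VALUE"):
        if required not in header:
            raise ValueError(f'missing required column "{required}"')
    rows = []
    for line in lines[1:]:
        row = dict(zip(header, line.split("\t")))
        if row["ID_REF"] not in platform_ids:
            raise ValueError(f"ID_REF {row['ID_REF']!r} is not in the platform")
        row["VALUE"] = float(row["VALUE"])
        rows.append(row)
    return rows


sample_rows = read_sample_table(SAMPLE_TABLE, PLATFORM_GB_ACC)

# Attach the platform annotation to each measurement via ID_REF.
annotated = {row["ID_REF"]: (PLATFORM_GB_ACC[row["ID_REF"]], row["VALUE"])
             for row in sample_rows}
```

Here `annotated["2"]` pairs GenBank accession M61764 with VALUE 2.74, the same match described in the text above.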
Sample records should be supplied as complete hybridization
tables to GEO. This allows the scientific community to review and
analyze the entire dataset and is the principal reason why many
journals require microarray data deposit in a public database. The
appropriate place to present an extracted table of significant
differences is as a separate table in a series record that describes
the overall experiment (see below).
After the data table has passed validation, you will be asked to supply the sample title, organism, description,
authors, and keywords. The "Description" field may hold very large volumes of data, and
submitters are encouraged to provide a thorough report of the sample, which may include a detailed description of the biological
source, experimental conditions and treatments, labeling and hybridization protocols, spot quantification, and
normalization schemes.
Step 5
After you have submitted all of your sample data, submit a series record. A series brings together a related
group of samples and provides a focal point
and description of the experiment as a whole. Information reflecting experimental sample subsets may also be specified. Submitters are encouraged to supply information regarding
the overall experimental design, aim, summary results, and conclusions. Tables of extracted data, summary conclusions,
or analyses may be included in series records. If you want to include such data, email the table to GEO
staff at geo@ncbi.nlm.nih.gov,
and they will attach it to your series record.
Each record you submit will receive a unique and stable GEO accession number that you may quote in manuscripts.
Records may remain private for several months until the data are published. During this period, you may request a
"read-only" password (email geo@ncbi.nlm.nih.gov)
that allows collaborators or reviewers confidential access to your private data before publication.
Please visit our detailed guide to Web deposit for more information on
submitting data via the Web.
Updates
Edits and updates to individual records may be performed at any time by submitters using the update
section on the GEO Web deposit/update page. If global edits are required
for multiple records, for example, bringing forward the release date or editing a data table header,
simply email the details to GEO staff at
geo@ncbi.nlm.nih.gov and they will process a batch edit on your behalf.