Introduction

The DACC has been working with IGVF members to prepare for submission of your data to the IGVF Portal. If you have data ready to submit, please contact the DACC at igvf-portal-help@lists.stanford.edu to start the submission process. Once notified, the DACC wrangling team will reach out and help set up the process by providing each submitter with an API access key for the Portal, instructions on collecting metadata, and the tools available to help with data submission.

Submission workshop recording (IGVF Consortium meeting 2023): https://drive.google.com/file/d/1bXG186nJbHjZwAvzLieA5FrRvv1ghi2C/view?usp=sharing

The submission workshop video is a general guide for submitters; however, we still recommend going through the documentation for more detailed and specific help, examples, and guidelines.

Video timestamps:

  1. 0:00-14:35 Introduction
  2. 14:36-40:44 DACC data model
  3. 40:45-1:22:18 Submission overview

Submission Process

After the lab contacts the DACC wranglers, the lab and the wranglers will start discussing data modeling. This is an ongoing process and could take multiple Zoom meetings. The DACC wranglers will then use their understanding of the data to make updates to the system. The lab's submitters will start submitting data on our test server (sandbox). Any submissions to the sandbox are for practice only; however, the wranglers will still review them and offer feedback. If either side raises concerns at this point, the process starts over until the data model and submissions are finalized. Once the submissions look good, the data submitter can proceed to submit on our production server. Data submitted up to this point are not yet released to the public; they are available only to the consortium. There will be another review on production before the lab and the DACC wranglers both agree to release the data to the public.

[Figure: submission process]

API access key pairs

API access key pairs are used to authenticate a user before giving access to submit data. Please provide your wrangler with an email address associated with a Gmail or GitHub account. They will make sure you have the appropriate permissions to submit data for the appropriate lab.

To request a key pair, log in at the bottom left of the side toolbar. Once successfully logged in, click "Profile".

[Figure: access key pairs]

On your "User Profile" page, click "Create Access Key". Your Access Key ID and Access Key Secret will appear in a pop-up window. Please make a note of them, as they are shown only once. Once the pop-up window is closed, you will have no way to retrieve them again. However, a new key pair can be requested if the previous pair is lost.
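Because the key pair is shown only once, it helps to store it somewhere your scripts can read it without hardcoding it. The following is a minimal sketch, assuming the pair is kept in two environment variables; the variable names IGVF_API_KEY and IGVF_SECRET_KEY and the helper load_key_pair are illustrative assumptions, so use whatever names your submission tool expects.

```python
import os


def load_key_pair(env=None):
    """Return (key_id, key_secret) read from environment variables.

    The variable names below are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    key_id = env.get("IGVF_API_KEY", "").strip()
    key_secret = env.get("IGVF_SECRET_KEY", "").strip()
    if not key_id or not key_secret:
        # A lost pair cannot be retrieved; a new one has to be created.
        raise RuntimeError(
            "Access key pair not found; create a new pair on your User Profile page."
        )
    return key_id, key_secret


# Usage (after exporting both variables in your shell):
# key_id, key_secret = load_key_pair()
```

Keeping the pair out of spreadsheets and scripts also makes it easier to replace if a new pair has to be requested.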

Collecting Metadata for submission

Providing rich, reliable metadata is essential for maintaining the high standards set by the IGVF consortium and making the Portal a valuable resource for the scientific community. Our current data model includes multiple object types (for example, tissue, primary_cell, and human_donor). Each object type has its own set of metadata properties, designed to capture how the objects relate to each other. All metadata prepared will be reviewed by the wrangling team at DACC. Any data submitted to the portal becomes accessible to internal IGVF consortium members. However, it will not become publicly available (“released”) until the DACC has finished reviewing the submitted data and has received approval from the submitting lab.

The data model supports the submission of objects classified under the following general categories: samples, donors, file sets, files, ontology terms, and other.

Samples: in_vitro_system, primary_cell, tissue, whole_organism, technical_sample, multiplexed_sample
Donors: human_donor, rodent_donor
File Sets: analysis_set, curated_set, measurement_set, construct_library, auxiliary_set, model, prediction
Files: reference_file, sequence_file, configuration_file, signal_file, alignment_file
Ontology Terms: assay_term, phenotype_term, sample_term, platform_term
Other types: award, analysis_steps, biomarker, document, gene, image, lab, modification, page, phenotypic_feature, publication, technical_sample, software, software_version, source, treatment, users, human_genomic_variant, workflow

*Note: The data model is under active development; see the GitHub schemas for further detail.

Seqspec for single cell assays

If you are submitting data resulting from a single-cell assay (such as scRNA-seq, 10X multiome, SHARE-seq, MULTI-seq, etc.), you should generate a machine-readable YAML file describing your genomic library sequence and structure. The YAML file should be submitted as a configuration_file and linked from the metadata of the corresponding sequence_file(s) to allow processing of your data.

For additional help generating a YAML file, please contact Sina Booeshaghi and Lior Pachter. For any other questions about submitting a configuration_file, please contact your wrangler.

IGVF Schemas

The IGVF data model includes schemas organized by object type that list the different properties (metadata) describing the associated experimental artifact. These pages are key resources to refer to while preparing spreadsheets for submission.

Data Submission Tools

There are two tools available for submitters to use:

  1. Google Sheets (Apps Script) - a browser-based Google spreadsheet that uses an embedded script to facilitate submission.
  2. igvf_utils - prepackaged Python scripts that take tab-separated value (TSV) files as input.

Each object type (also known as a profile) needs its own spreadsheet, because each has its own set of metadata properties. Please note that although primary_cell, tissue, etc. are all categorized as biosamples, they are considered different object types in our system. The same applies to human_donor and rodent_donor. For that reason, multiple sheets will have to be prepared, depending on the number of object types being submitted.

Submission Examples

Let’s go through the human_donor and tissue schemas, assessing which properties are needed and what type each property is, and assigning an example value to each.

*Please remember that for any property that links to another object, an identifier of an existing object on the Portal will have to be provided for reference. If you are unsure of what identifier to use, please contact the wrangling team.

-human_donor Example-

Descriptions of both required and optional properties for human_donor are available in JSON format on its schema page. Required properties must be provided to successfully submit an object record. We recommend providing optional properties whenever they are available and applicable.

{
  "title": "Human Donor",
  "$id": "/profiles/human_donor.json",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "description": "Derived schema submitting human donors.",
  "type": "object",
  "required": [
    "award",
    "lab",
    "taxa"
  ]
}...
| Required Property | Type | Comments | Example Value |
| --- | --- | --- | --- |
| award | string | Link to an associated award or grant object. | /awards/HG012012 |
| lab | string | Link to an associated lab. | /labs/john-doe |
| taxa | string (enum) | Donor’s taxa. | Homo sapiens |

| Optional Property | Type | Comments | Example Value |
| --- | --- | --- | --- |
| phenotypic_features | array of strings | List of links to the associated phenotypic features of the donor. | ["HP:0000726", "MONDO:0004975"] |
| ethnicities | array of strings (enums) | HANCESTRO terms (http://bioportal.bioontology.org/ontologies/HANCESTRO) are used. | ["Hispanic", "Arab"] |

*Note: Not all optional properties are listed in this example; for more properties, see the schema page.
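Before submitting, it can help to check each record against the schema's required list. The following is a minimal sketch rather than the Portal's own validation: the helper name missing_required is hypothetical, and the list of required properties is copied by hand from the human_donor excerpt above.

```python
# Required properties copied from the "required" list in the schema excerpt.
HUMAN_DONOR_REQUIRED = ["award", "lab", "taxa"]


def missing_required(record, required):
    """Return the required property names that are absent or empty in record."""
    return [name for name in required if record.get(name) in (None, "", [])]


donor = {
    "aliases": ["john-doe:donor_01"],
    "award": "/awards/HG012012",
    "lab": "/labs/john-doe",
    # "taxa" is left out on purpose to show what the check reports.
}

print(missing_required(donor, HUMAN_DONOR_REQUIRED))  # prints ['taxa']
```

The same helper works for any other object type once its required list is copied from the corresponding schema.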

-Tissue Example-

{
  "title": "Tissue",
  "$id": "/profiles/tissue.json",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "description": "Schema for submitting a tissue sample",
  "type": "object",
  "required": [
    "award",
    "lab",
    "source",
    "donors",
    "biosample_term"
  ]
}...
| Required Property | Type | Description | Example Value |
| --- | --- | --- | --- |
| award | string | Grant associated with the submission. | /awards/HG012012 |
| lab | string | Lab associated with the submission. | /labs/john-doe |
| source | string | Sample provider lab or a vendor. | /sources/atcc |
| donors | array of strings | Donor(s) the sample was derived from. | ["IGVFDO1645ZWSY", "IGVFDO2416VXNA"] |
| biosample_term | string | Ontology term identifying a biosample. Links to Sample Term object (unique identifier). | /sample-terms/UBERON_0000955 |

| Optional Property | Type | Description | Example Value |
| --- | --- | --- | --- |
| pmi | integer | Post-mortem interval, the amount of time that elapsed since the death of the donor. | 3 |
| pmi_units | string | The unit in which the PMI time was reported. Enum list includes: second, minute, hour, day, week. | day |
| preservation_method | string | The method by which the tissue was preserved. Enum list includes cryopreservation, flash-freezing. | flash-freezing |
| date_obtained | string | Date harvested. Date should be submitted as YYYY-MM-DD. | 2022-04-02 |
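The type and format constraints on the optional tissue properties can be checked locally before submission. This is a minimal sketch under stated assumptions: the function name check_tissue_optional is hypothetical, and the pmi_units values are only those listed in the table above, so the live schema may differ.

```python
import re
from datetime import datetime

# Enum values as listed in the table above; the live schema may allow others.
PMI_UNITS = {"second", "minute", "hour", "day", "week"}


def check_tissue_optional(record):
    """Return a list of problems found in the optional tissue properties."""
    problems = []

    pmi = record.get("pmi")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if pmi is not None and (isinstance(pmi, bool) or not isinstance(pmi, int)):
        problems.append("pmi must be an integer")

    units = record.get("pmi_units")
    if units is not None and units not in PMI_UNITS:
        problems.append("pmi_units must be one of " + ", ".join(sorted(PMI_UNITS)))

    date = record.get("date_obtained")
    if date is not None:
        # The regex enforces zero-padded YYYY-MM-DD; strptime rejects impossible dates.
        valid = isinstance(date, str) and re.fullmatch(r"\d{4}-\d{2}-\d{2}", date) is not None
        if valid:
            try:
                datetime.strptime(date, "%Y-%m-%d")
            except ValueError:
                valid = False
        if not valid:
            problems.append("date_obtained must be a valid YYYY-MM-DD date")

    return problems


print(check_tissue_optional({"pmi": 3, "pmi_units": "day", "date_obtained": "2022-04-02"}))  # prints []
```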

Submitting Objects

In the sheets or tab-separated files, the first row is the header, containing the name of each property to be submitted. Each following row holds one object record. Multiple records can be submitted at once.

-human_donor Example Input Sheet-

| aliases | award | lab | taxa |
| --- | --- | --- | --- |
| john-doe:donor_01 | /awards/HG012012 | /labs/john-doe | Homo sapiens |
| john-doe:donor_02 | /awards/HG012012 | /labs/john-doe | Homo sapiens |
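If the records already exist in a script or a LIMS export, the same input sheet can be produced as a tab-separated file with Python's standard csv module. This is a sketch only: the helper name to_tsv is hypothetical, and it writes one alias per record as a plain string, as in the example sheet above.

```python
import csv
import io

records = [
    {"aliases": "john-doe:donor_01", "award": "/awards/HG012012",
     "lab": "/labs/john-doe", "taxa": "Homo sapiens"},
    {"aliases": "john-doe:donor_02", "award": "/awards/HG012012",
     "lab": "/labs/john-doe", "taxa": "Homo sapiens"},
]


def to_tsv(rows, columns):
    """Serialize rows to tab-separated text: a header row, then one row per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = to_tsv(records, ["aliases", "award", "lab", "taxa"])
print(tsv_text)

# To save the sheet for a file-based tool:
# with open("human_donor.tsv", "w", newline="") as handle:
#     handle.write(tsv_text)
```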

Understanding Identifiers and the Importance of the Alias Identifier

For every object that is submitted to the portal, the system automatically generates a unique identifier (uuid). For a subset of objects, an accession is generated in addition to the uuid, following the format IGVF[SM|DO][0-9]{4}[A-Z]{4}, where [SM|DO] refers to the object type. The example human_donor and tissue objects will have accessions automatically generated in the formats IGVFDO[0-9]{4}[A-Z]{4} and IGVFSM[0-9]{4}[A-Z]{4}, respectively.

IMPORTANT: While accessions and unique identifiers (UUIDs) are automatically generated and can be used to find your object of interest, we highly encourage using the aliases property, another form of unique identifier. Aliases are not assigned by the system; they give submitters an opportunity to assign an identifier that makes sense for internal records, such as the identifier from the lab's LIMS.

Aliases are to be formatted in the following way: ‘[lab name]:[chosen identifier]’ (e.g. john-doe:experiment_01).

*Note: These three types of IDs (uuid, accession, and aliases) can be used interchangeably to refer to an object in the spreadsheets used for object submission or modification.
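The three identifier forms can be told apart with simple pattern checks, which is handy for catching typos in a sheet. The sketch below makes several assumptions: the accession pattern covers only the SM and DO prefixes named in the text (other object types may use other prefixes), the alias pattern is a loose reading of the [lab name]:[chosen identifier] format, and the function name identifier_kind is hypothetical.

```python
import re
import uuid

# Only the two prefixes mentioned in the text; other object types may use others.
ACCESSION_RE = re.compile(r"IGVF(SM|DO)[0-9]{4}[A-Z]{4}")
# Loose reading of "[lab name]:[chosen identifier]": no whitespace, one separating colon.
ALIAS_RE = re.compile(r"[^:\s]+:\S+")


def identifier_kind(value):
    """Classify value as 'accession', 'uuid', 'alias', or 'unknown'."""
    if ACCESSION_RE.fullmatch(value):
        return "accession"
    try:
        uuid.UUID(value)
        return "uuid"
    except ValueError:
        pass
    if ALIAS_RE.fullmatch(value):
        return "alias"
    return "unknown"


print(identifier_kind("IGVFDO1645ZWSY"))          # prints accession
print(identifier_kind("john-doe:experiment_01"))  # prints alias
# The uuid below is a made-up value for illustration only.
print(identifier_kind("123e4567-e89b-12d3-a456-426614174000"))  # prints uuid
```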

Reviewing Submissions

Following successful submission, you can view your object by appending the object type, followed by an identifier of the object (uuid, accession, or alias), to the URL of the server.

[Figure: examples of appending an identifier to the URL]

Updating Submitted Objects

If your objects have metadata errors that you need to fix, you can patch the affected property values. The first column header in your spreadsheet should be either accession (for the Google Sheets submitter) or record_id (for igvf_utils). The properties to be updated should be specified in the following columns.

Example: for the tissue pmi and pmi_units properties, both records initially specified as 3 days will be changed to 5 weeks.

| accession | pmi | pmi_units |
| --- | --- | --- |
| john-doe:tissue_01 | 5 | week |
| john-doe:tissue_02 | 5 | week |
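An update sheet like this one can also be generated from a mapping of identifiers to changed values. The sketch below targets the igvf_utils layout (first column record_id) described above; the helper name patch_sheet is hypothetical, and how the tools treat a blank cell for an unchanged property is an assumption worth confirming with your wrangler.

```python
import csv
import io


def patch_sheet(updates, id_column="record_id"):
    """Build tab-separated text from {identifier: {property: new_value}}."""
    # Collect property names in first-seen order so columns are stable.
    properties = []
    for changes in updates.values():
        for name in changes:
            if name not in properties:
                properties.append(name)

    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([id_column] + properties)
    for identifier, changes in updates.items():
        # Properties not changed for this record are left blank (assumption).
        writer.writerow([identifier] + [changes.get(name, "") for name in properties])
    return buffer.getvalue()


updates = {
    "john-doe:tissue_01": {"pmi": 5, "pmi_units": "week"},
    "john-doe:tissue_02": {"pmi": 5, "pmi_units": "week"},
}
sheet = patch_sheet(updates)
print(sheet)
```

For the Google Sheets submitter, passing id_column="accession" produces the header shown in the example table.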

Order of Submission Matters

IMPORTANT: The order of submission by object type matters! Objects can be related or linked to each other, and creating these relationships depends on the proper order of submission. For example, a tissue object relates to a specific donor object (a unique identifier must be specified; see the example above). Therefore, the donor(s) need to be submitted first; otherwise, you will not be able to reference them upon submission, which causes an error if the donors property is required.

Current order of object types for submissions link
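One way to reason about submission order is to treat each link between object types as a dependency and sort the types topologically. The sketch below uses graphlib from the Python standard library (Python 3.9 or later). Only the tissue to human_donor link comes from this guide; the other links in the LINKS mapping are illustrative assumptions, and the published order of object types remains the authority.

```python
from graphlib import TopologicalSorter

# Each object type maps to the object types it links to (its prerequisites).
LINKS = {
    "human_donor": set(),
    "tissue": {"human_donor"},             # from this guide: tissue references donors
    "measurement_set": {"tissue"},         # assumption, for illustration only
    "sequence_file": {"measurement_set"},  # assumption, for illustration only
}

# static_order() yields every type after all of its prerequisites.
order = list(TopologicalSorter(LINKS).static_order())
print(order)
```

If two object types link to each other, TopologicalSorter raises graphlib.CycleError, which is a useful signal that the submission needs to be split into a create step and a later update step.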

Non-submittable Objects (admin-only)

A subset of object types cannot be submitted directly, for curation purposes (for example, to prevent duplicates and misuse of objects). They are:

  1. assay_term
  2. phenotype_term
  3. platform_term
  4. sample_term
  5. source
  6. gene
  7. award
  8. lab
  9. human_genomic_variant

Please contact your wrangler if you would like to submit new objects of the object types listed.
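When preparing several sheets at once, a quick check against this list can flag the ones that need to go through a wrangler instead. This is a minimal sketch; the function name split_by_submittable is hypothetical, and the set is copied by hand from the list above, so it may lag behind the Portal.

```python
# Copied from the admin-only list above; the Portal is the authority.
NON_SUBMITTABLE = frozenset({
    "assay_term", "phenotype_term", "platform_term", "sample_term",
    "source", "gene", "award", "lab", "human_genomic_variant",
})


def split_by_submittable(object_types):
    """Return (submittable, admin_only) lists, preserving input order."""
    submittable = [t for t in object_types if t not in NON_SUBMITTABLE]
    admin_only = [t for t in object_types if t in NON_SUBMITTABLE]
    return submittable, admin_only


planned = ["human_donor", "source", "tissue", "sample_term"]
print(split_by_submittable(planned))
# prints (['human_donor', 'tissue'], ['source', 'sample_term'])
```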