IGVF

Introduction

The DACC has been working with IGVF members to prepare for the submission of your data to the IGVF Portal. If you have data ready to submit, please contact the DACC at igvf-portal-help@lists.stanford.edu to start the submission process. Once notified, the DACC wrangling team will reach out and help set up the submission process by providing each submitter with an API access key for the portal, instructions on collecting metadata, and tools that help with data submission.

Submission workshop recording (IGVF Consortium meeting 2023): https://drive.google.com/file/d/1bXG186nJbHjZwAvzLieA5FrRvv1ghi2C/view?usp=sharing

The submission workshop video is a general guide for submitters; however, we still recommend going through the documentation for more detailed and specific help, examples, and guidelines.

Video timestamps:

0:00-14:35 Introduction
14:36-40:44 DACC data model
40:45-1:22:18 Submission overview

Submission Process

After the lab contacts the DACC wranglers, the lab and the wranglers will start discussing data modeling. This is an ongoing process and may take multiple Zoom meetings. The DACC wranglers will then use their understanding of the data to make updates to the system. The lab's submitters will start data submissions on our test server (sandbox). Submissions to the sandbox are for practice only; however, the wranglers will still review them and offer feedback. If either side raises concerns at this point, the process starts over until the data model and submissions are finalized. Once the submissions look good, the data submitter can proceed to submit on our production server. Data submitted up to this point are not yet released to the public; they are available only to the consortium. There will be another review on production before the lab and the DACC wranglers both agree to release the data to the public.


API access key pairs

API access key pairs are used to authenticate a user before granting access to submit data. Please provide your wrangler with an email address associated with a Gmail or GitHub account. They will make sure you have the appropriate permissions to submit data for the appropriate lab.

To request the key pairs, log in on the bottom left of the side toolbar. Once successfully logged in, click "Profile".


On your "User Profile" page, click "Create Access Key". Your Access Key ID and Access Key Secret will appear in a pop-up window. Please make a note of them, as they are shown only once. Once the pop-up window is closed, you will have no way to retrieve them again. However, a new key pair can be requested if the previous pair is lost.
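A minimal sketch of one way to keep the key pair out of scripts and spreadsheets is to read both values from environment variables. The variable names IGVF_API_KEY and IGVF_SECRET_KEY and the helper load_key_pair are assumptions made for this illustration; check the documentation of your submission tool for the names it actually expects.

```python
import os


def load_key_pair(env=None):
    """Return (access_key_id, access_key_secret) read from environment variables.

    The variable names used here are assumptions for this sketch, not
    confirmed portal or tool settings.
    """
    env = os.environ if env is None else env
    key_id = env.get("IGVF_API_KEY", "")
    secret = env.get("IGVF_SECRET_KEY", "")
    if not key_id or not secret:
        # The secret cannot be retrieved after the pop-up closes,
        # so a missing value means a new pair has to be created.
        raise RuntimeError(
            "Access key pair not found; create a new pair on your User Profile page."
        )
    return key_id, secret
```

Passing a plain dictionary as env makes the helper easy to try without touching the real environment.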

Collecting Metadata for submission

Providing rich, reliable metadata is essential for maintaining the high standards set by the IGVF consortium and for making the Portal a valuable resource for the scientific community. Our current data model includes multiple object types (for example, tissue/organ, primary_cell, and human_donor). Each object type has its own set of metadata properties, designed to capture how the objects relate to each other. All prepared metadata will be reviewed by the DACC wrangling team. Any data submitted to the portal becomes accessible to IGVF consortium members. However, it will not become publicly available (“released”) until the DACC has finished reviewing the submitted data and has received approval from the submitting lab.

The data model supports the submission of objects classified under the following general categories: samples, donors, file sets, files, ontology terms, and other.

Samples	Donors	File Sets	Files	Ontology Terms	Other
in_vitro_system, primary_cell, tissue/organ, whole_organism, technical_sample, multiplexed_sample	human_donor, rodent_donor	analysis_set, curated_set, measurement_set, construct_library, auxiliary_set, model, prediction	reference_file, sequence_file, configuration_file, signal_file, alignment_file	assay_term, phenotype_term, sample_term, platform_term	award, analysis_steps, biomarker, document, gene, image, lab, modification, page, phenotypic_feature, publication, technical_sample, software, software_version, source, treatment, users, human_genomic_variant, workflow

*Note: The data model is under active development; see the GitHub schemas for further detail.

Seqspec for sequencing data

If you are submitting raw sequencing data (for example, Illumina, PacBio, or Nanopore), the corresponding sequence specification (seqspec) YAML configuration files should be submitted along with the sequence files. Furthermore, the seqspec YAML files are required to uniformly process single-cell assays (such as scRNA-seq, 10X multiome, SHARE-seq, and MULTI-seq). Detailed submission help for seqspec is available at: https://data.igvf.org/help/data-submission/seqspec-submission/.

These machine-readable YAML files describe your genomic library sequence and structure for standardized data processing; see Bioinformatics, Volume 40, Issue 4, April 2024. The files should be submitted as configuration files and linked to the corresponding sequence files through the seqspec_of metadata property.

IGVF Schemas

The IGVF data model includes schemas organized by object type that list the different properties (metadata) describing the associated experimental artifact. These pages are key resources to refer to while preparing spreadsheets for submission.

Data Submission Tools

There are two tools available for submitters to use:

Google Sheets (Apps Script) - a browser-based Google spreadsheet with an embedded script that facilitates submission.
igvf_utils - prepackaged Python scripts that take tab-separated value (TSV) files as input.

Each object type (also known as a profile) needs its own spreadsheet because it has its own set of metadata properties. Please note that although primary_cell, tissue/organ, etc. are categorized as biosamples, they are treated as different object types in our system. The same applies to human_donor and rodent_donor. You will therefore need to prepare one sheet for each object type you submit.

Submission Examples

Let’s go through the human_donor and tissue/organ schemas, assess which properties are needed, note the type of each property, and assign an example value.

*Please remember that for any property that links to another object, an identifier of an existing object on the Portal will have to be provided for reference. If you are unsure of what identifier to use, please contact the wrangling team.

-Human_donor Example-

Descriptions of both required and optional properties for human_donor can be found here in JSON format. Required properties must be provided to successfully submit an object record. We recommend providing optional properties whenever they are available and applicable.

{
  "title": "Human Donor",
  "$id": "/profiles/human_donor.json",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "description": "Derived schema submitting human donors.",
  "type": "object",
  "required": [
    "award",
    "lab",
    "taxa"
  ]
}...

Required Property	Type	Comments	Example Value
award	string	Link to an associated award or grant object.	/awards/HG012012
lab	string	Link to an associated lab.	/labs/john-doe
taxa	string (enum)	Donor’s taxa.	Homo sapiens

Optional Property	Type	Comments	Example Value
phenotypic_features	array of strings	List of links to the associated phenotypic features of the donor.	["HP:0000726", "MONDO:0004975"]
ethnicities	array of strings (enums)	http://bioportal.bioontology.org/ontologies/HANCESTRO terms are used.	["Hispanic", "Arab"]

*Note: Not all optional properties are listed in this example; see the schema page for more properties.
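Before submitting, a record can be checked against the "required" list from the schema excerpt. The following is a minimal sketch; the helper name missing_required is hypothetical, and the portal performs its own validation on submission.

```python
# Required properties copied from the human_donor schema excerpt.
HUMAN_DONOR_REQUIRED = ["award", "lab", "taxa"]


def missing_required(record, required):
    """Return the required property names that are absent or empty in record."""
    return [name for name in required if record.get(name) in (None, "", [])]


donor = {
    "aliases": ["john-doe:donor_01"],
    "award": "/awards/HG012012",
    "lab": "/labs/john-doe",
}
print(missing_required(donor, HUMAN_DONOR_REQUIRED))  # ['taxa']
```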

-Tissue/Organ Example-

{
  "title": "Tissue/Organ",
  "$id": "/profiles/tissue.json",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "description": "Schema for submitting a tissue/organ sample",
  "type": "object",
  "required": [
    "award",
    "lab",
    "source",
    "donors",
    "biosample_term"
  ]
}...

Required Property	Type	Description	Example Value
award	string	Grant associated with the submission.	/awards/HG012012
lab	string	Lab associated with the submission.	/labs/john-doe
source	string	Sample provider lab or a vendor.	/sources/atcc
donors	array of strings	Donor(s) the sample was derived from.	["IGVFDO1645ZWSY", "IGVFDO2416VXNA"]
biosample_term	string	Ontology term identifying a biosample. Links to Sample Term object (unique identifier).	/sample-terms/UBERON_0000955

Optional Property	Type	Description	Example Value
pmi	integer	Post-mortem Interval, the amount of time that elapsed since the death of the donor.	3
pmi_units	string	The unit in which the PMI time was reported. Enum list includes: second, minute, hour, day, week.	day
preservation_method	string	The method by which the tissue/organ was preserved. Enum list includes cryopreservation, flash-freezing.	flash-freezing
date_obtained	string	Date harvested. Date should be submitted as YYYY-MM-DD.	2022-04-02
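The optional tissue/organ properties in the table carry format constraints (an integer, an enum, and a YYYY-MM-DD date) that are easy to check locally. This is a sketch only: the helper check_tissue_optional is hypothetical, and the enum values are limited to those quoted in the table, so the schema remains the authoritative list.

```python
import re
from datetime import datetime

# Enum values quoted in the table; the full list lives in the schema.
PMI_UNITS = {"second", "minute", "hour", "day", "week"}


def check_tissue_optional(record):
    """Return a list of problems found in a tissue/organ record's optional fields."""
    problems = []
    if "pmi" in record and not isinstance(record["pmi"], int):
        problems.append("pmi must be an integer")
    if "pmi_units" in record and record["pmi_units"] not in PMI_UNITS:
        problems.append("pmi_units must be one of: " + ", ".join(sorted(PMI_UNITS)))
    if "date_obtained" in record:
        value = str(record["date_obtained"])
        # Require zero-padded YYYY-MM-DD, then confirm it is a real calendar date.
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
            problems.append("date_obtained must be formatted as YYYY-MM-DD")
        else:
            try:
                datetime.strptime(value, "%Y-%m-%d")
            except ValueError:
                problems.append("date_obtained is not a valid calendar date")
    return problems
```

An empty list means none of the checked properties violates the constraints described in the table.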

Submitting Objects

In the sheets or tab-separated files, the first row is the header and contains the names of the properties to be submitted. Each subsequent row holds one object record. Multiple records can be submitted at once.

-human_donor Example Input Sheet-

aliases	award	lab	taxa
john-doe:donor_01	/awards/HG012012	/labs/john-doe	Homo sapiens
john-doe:donor_02	/awards/HG012012	/labs/john-doe	Homo sapiens
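An input sheet like this one can also be generated from a script. The sketch below uses Python's standard csv module to produce the same two records as tab-separated text; the helper name to_tsv is hypothetical and is not part of igvf_utils.

```python
import csv
import io

HEADER = ["aliases", "award", "lab", "taxa"]
RECORDS = [
    {"aliases": "john-doe:donor_01", "award": "/awards/HG012012",
     "lab": "/labs/john-doe", "taxa": "Homo sapiens"},
    {"aliases": "john-doe:donor_02", "award": "/awards/HG012012",
     "lab": "/labs/john-doe", "taxa": "Homo sapiens"},
]


def to_tsv(header, rows):
    """Render a header row followed by one tab-separated line per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=header, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sheet = to_tsv(HEADER, RECORDS)
print(sheet)
```

Writing the returned text to a .tsv file gives a file with the header in the first row and one record per following row.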

Understanding Identifiers and the Importance of the Alias Identifier

For every object submitted to the portal, the system automatically generates a unique identifier (uuid). For a subset of objects, an accession is generated in addition to the uuid, following the format IGVF[SM|DO][0-9]{4}[A-Z]{4}, where [SM|DO] refers to the object type. The example human_donor and tissue/organ objects have accessions generated automatically in the forms IGVFDO[0-9]{4}[A-Z]{4} and IGVFSM[0-9]{4}[A-Z]{4}, respectively.

IMPORTANT: While accessions and unique identifiers (UUIDs) are automatically generated and can be used to find your object of interest, we highly encourage the use of the aliases property, another form of unique identifier. Aliases are not assigned by the system; they give submitters the opportunity to assign an identifier that makes sense for internal records, such as an identifier from the lab's LIMS.

Aliases are to be formatted in the following way: ‘[lab name]:[chosen identifier]’ (e.g. john-doe:experiment_01).

*Note: These three types of IDs (uuid, accession, and aliases) can be used interchangeably to refer to an object in the spreadsheets used for object submission or modification.
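The three identifier forms can be told apart with simple pattern checks, as in the sketch below. The accession pattern follows the format given in this section, with the SM or DO type code written as a regular-expression group. The alias pattern only enforces the "lab name, colon, identifier" shape; the characters it allows are an assumption, and the helper name identifier_kind is hypothetical.

```python
import re
import uuid

# IGVF + type code (SM or DO) + 4 digits + 4 uppercase letters.
ACCESSION_RE = re.compile(r"IGVF(SM|DO)[0-9]{4}[A-Z]{4}")
# "[lab name]:[chosen identifier]"; the permitted characters are an assumption.
ALIAS_RE = re.compile(r"[^:\s]+:\S+")


def identifier_kind(value):
    """Classify a string as 'accession', 'uuid', 'alias', or 'unknown'."""
    if ACCESSION_RE.fullmatch(value):
        return "accession"
    try:
        uuid.UUID(value)
        return "uuid"
    except ValueError:
        pass
    if ALIAS_RE.fullmatch(value):
        return "alias"
    return "unknown"
```

For example, identifier_kind("IGVFDO1645ZWSY") classifies the donor accession from the tissue/organ table, and identifier_kind("john-doe:experiment_01") classifies the example alias.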

Reviewing Submissions

After a successful submission, you can view your object by appending the object type, followed by an identifier of the object (uuid, accession, or alias), to the URL of the server.

Examples: appending identifier to url
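As a rough sketch of this URL construction, the helper below joins a server URL, an object type path segment, and an identifier. The server address https://portal.example.org, the path segment "human-donors", the trailing slash, and the helper name object_url are all placeholders chosen for illustration; use the server and object type paths shown on the Portal itself.

```python
from urllib.parse import quote


def object_url(server, object_type_path, identifier):
    """Join the server URL, an object type path segment, and an identifier."""
    base = server.rstrip("/")
    segment = object_type_path.strip("/")
    # Keep the colon readable so that aliases such as lab:identifier stay intact.
    return f"{base}/{segment}/{quote(identifier, safe=':')}/"


# Placeholder server and path segment; the accession comes from the examples.
print(object_url("https://portal.example.org", "human-donors", "IGVFDO1645ZWSY"))
```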

Updating Submitted Objects

If your objects have metadata errors that you need to fix, you can patch the affected property values. The first column header in your spreadsheet should be either accession (for Google Sheets Submitter) or record_id (for igvf_utils). The properties to be updated should be specified in the following columns.

Example: the pmi and pmi_units properties of two tissue/organ records, both initially submitted as 3 days, are changed to 5 weeks.

accession	pmi	pmi_units
john-doe:tissue_01	5	week
john-doe:tissue_02	5	week
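To double-check a patch sheet before submitting it, the sheet can be read back into a mapping from record identifier to the properties that will change. This is a sketch using the standard csv module; read_patch_sheet is a hypothetical helper, not part of igvf_utils or the Google Sheets submitter, and values are kept as text exactly as they appear in the sheet.

```python
import csv
import io

PATCH_SHEET = (
    "accession\tpmi\tpmi_units\n"
    "john-doe:tissue_01\t5\tweek\n"
    "john-doe:tissue_02\t5\tweek\n"
)


def read_patch_sheet(text, id_column="accession"):
    """Map each record identifier to the property values listed for it."""
    updates = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        identifier = row.pop(id_column)
        # Leave out empty cells so this preview lists only properties with a value.
        updates[identifier] = {name: value for name, value in row.items() if value}
    return updates


for identifier, changes in read_patch_sheet(PATCH_SHEET).items():
    print(identifier, changes)
```

For an igvf_utils sheet, pass id_column="record_id" to match the header described in this section.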

Order of Submission Matters

IMPORTANT: The order of submission by object type matters! Objects can be related or linked to each other, and creating these relationships depends on submitting them in the proper order. For example, a tissue/organ object relates to a specific donor object (a unique identifier must be specified; see the example above). The donors therefore need to be submitted first. Otherwise you will not be able to reference them, and the submission will fail when the donors property is required.

Current order of object types for submissions link
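The ordering rule is a dependency ordering: an object type must be submitted after every object type it links to. The sketch below illustrates the idea with a topological sort from Python's standard graphlib module (Python 3.9 or later). Only the tissue-to-donor link is taken from this page; the mapping is illustrative and is not the Portal's full submission order.

```python
from graphlib import TopologicalSorter

# Each object type maps to the object types it links to. Only the donor link
# is taken from the text; this is not the Portal's complete dependency list.
LINKS = {
    "human_donor": set(),
    "tissue": {"human_donor"},
}


def submission_order(links):
    """Return object types ordered so that linked-to types come first."""
    return list(TopologicalSorter(links).static_order())


print(submission_order(LINKS))  # ['human_donor', 'tissue']
```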

Non-submittable Objects (admin-only)

For curation purposes (for example, preventing duplicates and misuse of objects), a subset of object types cannot be submitted by submitters:

assay_term
phenotype_term
platform_term
sample_term
source
gene
award
lab
human_genomic_variant

Please contact your wrangler if you would like to submit new objects of the object types listed.
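When planning a submission, it can help to separate the object types you can submit yourself from those that need a wrangler. The sketch below hard-codes the admin-only list from this section; the helper name split_by_submittability is hypothetical, and the list may change as the data model evolves.

```python
# Admin-only object types copied from the list in this section.
ADMIN_ONLY_TYPES = frozenset({
    "assay_term", "phenotype_term", "platform_term", "sample_term",
    "source", "gene", "award", "lab", "human_genomic_variant",
})


def split_by_submittability(object_types):
    """Split object types into (submittable, needs_wrangler), preserving order."""
    submittable = [t for t in object_types if t not in ADMIN_ONLY_TYPES]
    needs_wrangler = [t for t in object_types if t in ADMIN_ONLY_TYPES]
    return submittable, needs_wrangler


mine, ask_wrangler = split_by_submittability(["human_donor", "gene", "tissue", "source"])
print(mine)          # ['human_donor', 'tissue']
print(ask_wrangler)  # ['gene', 'source']
```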