IGVF

Welcome to a user’s guide to IGVF Utils Submission

IGVF_utils includes tools that are useful to any IGVF Consortium submitting group, as well as to the general community working with IGVF data. The library and scripts are written in Python.

See the wiki to get started.

API and script documentation are available on Read the Docs.

igvf_utils is a fork of encode_utils, a Python API client originally designed for the ENCODE project. The tool has been slightly updated to be compatible with the IGVF database.

IGVF Submission (IGVF_utils) repository

Download and Set Up of Tools

Follow these step-by-step instructions to download igvf_utils and its dependencies. This toolset was developed by Nathaniel Watson of the Snyder lab. Submissions and patches can be made with a prepackaged script from this toolkit: /MetaDataRegistration/iu_register.py. The installation instructions will help you set up a Python environment and place a copy of this script in your /bin directory for easy use.

Configuring iu_register.py

Next, add your API access keys (your data wrangler should have provided these). Access keys grant you permissions on the server. You can set several environment variables: your access keys, lab, and award. See the configuration page for more details. If the award and lab variables are set, their values are automatically included as properties in your objects.

You can do this each session:

$ export IGVF_API_KEY='xxxxxx'
$ export IGVF_SECRET_KEY='xxxxxxxx'

Or once globally:

Open the file where your shell stores environment variables. On macOS, for example, this is the .bash_profile or .zshrc file in your home directory.

$ cd ~/
$ open .bash_profile

Add the lines:

export IGVF_API_KEY='xxxxxx'
export IGVF_SECRET_KEY='xxxxxxxxx'

You may also set the IGVF_AWARD and IGVF_LAB environment variables in your .bash_profile to the identifiers of your award and lab objects, respectively. This lets you skip the award and lab fields when submitting object metadata with iu_register.py:

export IGVF_AWARD='xxxxx'
export IGVF_LAB='your-lab-name'

The Submission example later in this document is written assuming that the IGVF_AWARD and IGVF_LAB environment variables have been defined. If you choose not to do this, be aware that lab and award are required properties you'll have to include in your tab-separated files for gene, phenotypic_feature, and other objects.
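Before starting a submission, it can help to confirm that these variables are visible to Python. The following is a minimal sketch and not part of igvf_utils: the helper name missing_igvf_vars is hypothetical, and it only checks the four variable names mentioned above.

```python
import os

# Variable names taken from this guide. The split into "required" and
# "optional" mirrors the text above; it is not an igvf_utils API.
REQUIRED_VARS = ("IGVF_API_KEY", "IGVF_SECRET_KEY")
OPTIONAL_VARS = ("IGVF_AWARD", "IGVF_LAB")


def missing_igvf_vars(env=None, names=REQUIRED_VARS + OPTIONAL_VARS):
    """Return the names that are unset or empty in the given mapping."""
    if env is None:
        env = os.environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_igvf_vars()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All IGVF variables are set.")
```

If IGVF_AWARD or IGVF_LAB is reported as not set, remember to include the award and lab columns in your tab-separated files.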

Command and Arguments

igvf_utils has a prepackaged script that lets you submit to the Portal easily: iu_register.py. Proper installation should place the script in your /bin directory, so once you are in your Python environment, you'll be able to run it from your terminal.

There are several arguments you'll need to include when using this script.

Required arguments:

mode (-m)

Specifies the server you're submitting to. The main IGVF server is the production server (-m prod). A separate server that mirrors production for testing purposes is the Sandbox server (-m sandbox).

The DACC strongly encourages you to start with submission to the sandbox server, only proceeding to submission to the production server after validating correctness of the submission to the sandbox server with your data wrangler's help. Again, the two available modes are:

-m prod

-m sandbox

profile (-p)

Specification of the object type being submitted, for example:

-p in_vitro_system

-p human_donor

-p tabular_file

infile (-i)

Specification of the tab-separated values file containing the metadata values for the object(s) being submitted. For example:

-i tab_separated_file.txt

Full command:

$ iu_register.py -m sandbox -p human_donor -i tab_separated_file.txt

Optional arguments:

Dry run (-d)

Runs the submission in a special mode that does not create any objects on the Portal but validates, to some extent, the metadata properties specified in the tab-separated values file.

Submit object without aliases (--no-aliases)

By default, iu_register.py requires specification of aliases for any object that is being submitted. Use the flag --no-aliases if you are interested in submitting objects without an alias. The use of this flag is discouraged by the DACC.

Help (-h)

Log files with information relevant to your submissions will automatically be generated once you start to use the script.
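The arguments described above can also be assembled from Python, for example when submitting several files in a loop. This is a minimal sketch that only uses the flags listed in this guide (-m, -p, -i, -d, --no-aliases); the helper name build_register_command is hypothetical and is not part of igvf_utils.

```python
import shlex


def build_register_command(mode, profile, infile, dry_run=False, no_aliases=False):
    """Assemble an iu_register.py argument list from the flags in this guide."""
    if mode not in ("sandbox", "prod"):
        raise ValueError("mode must be 'sandbox' or 'prod'")
    cmd = ["iu_register.py", "-m", mode, "-p", profile, "-i", infile]
    if dry_run:
        cmd.append("-d")
    if no_aliases:
        cmd.append("--no-aliases")
    return cmd


cmd = build_register_command(
    "sandbox", "human_donor", "tab_separated_file.txt", dry_run=True
)
# Show the command as it would be typed in a terminal.
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list, rather than a single string, avoids quoting problems when a file path contains spaces.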

Please continue to the submission examples to see how to format the tab_separated_file.txt (metadata) for several objects, and example commands for submission of these objects using iu_register.py.

Submission Rules for most Objects

iu_register.py requires the aliases property for every object type that supports it. Each alias must be unique, which makes this property a good place to use identifiers from a system such as a LIMS.

Please refer to the Submission Overview for more details on object identifiers. Note: You can use any of three types of identifiers (uuid, accession, or aliases) to refer to an object.

POST: Submission Examples

Example 1: Submitting in_vitro_system (biosample)

This is the schema or profile description of the in_vitro_system object you will be submitting, along with metadata properties you will use to link or relate it to other relevant objects.

Further details on IGVF data model and object metadata can be found in the Submission Overview.

The metadata for our In Vitro System example is collected in a tab-separated file, example1.txt:

source	donors	taxa	biosample_term	classification	aliases
/sources/atcc/	/human-donors/IGVFDO2722UPGC/	Homo sapiens	/sample-terms/EFO_0007598/	cell line	john-doe:Hap1

Note: award or lab properties were not included because they were specified as environment variables!

Submission to the IGVF sandbox server is executed using this iu_register.py command:

$ iu_register.py -m sandbox -p in_vitro_system -i /local/path/to/example1.txt
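If your metadata already lives in Python, a file like example1.txt can be generated rather than typed by hand. The sketch below uses only the standard csv module with the column names and values from this example; the helper name rows_to_tsv is hypothetical.

```python
import csv
import io

# Column names and values copied from example1.txt above.
rows = [
    {
        "source": "/sources/atcc/",
        "donors": "/human-donors/IGVFDO2722UPGC/",
        "taxa": "Homo sapiens",
        "biosample_term": "/sample-terms/EFO_0007598/",
        "classification": "cell line",
        "aliases": "john-doe:Hap1",
    },
]


def rows_to_tsv(rows):
    """Serialize a list of dicts as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = rows_to_tsv(rows)
print(tsv_text, end="")
# To save it:
# with open("example1.txt", "w", newline="") as handle:
#     handle.write(tsv_text)
```

Adding more dictionaries to rows produces one line per object, all of the same profile.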

Example 2: Submitting Documents

This is the schema or profile description of the document object you will be submitting, along with metadata properties you will use to link or relate it to other relevant objects.

The metadata for our Document example is collected in a tab-separated file, example2.txt:

document_type	description	attachment	aliases
plasmid map	Plasmid map example.	{"path": "/local/path/to/document.pdf"}	john-doe:document_example

Submission to the IGVF sandbox server is executed using this iu_register.py command:

$ iu_register.py -m sandbox -p document -i /local/path/to/example2.txt

This uploads the metadata (example2.txt) to create an object for a document file on the server. The file itself is also automatically uploaded to an S3 IGVF bucket using the path you provided in attachment.
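The attachment column holds a small JSON object, which is easy to mistype by hand. As a sketch, the cell can be produced with the standard json module; the helper name attachment_cell is hypothetical, and the row values are the ones from example2.txt.

```python
import json


def attachment_cell(path):
    """Return the JSON text for the attachment column."""
    return json.dumps({"path": path})


header = ["document_type", "description", "attachment", "aliases"]
row = [
    "plasmid map",
    "Plasmid map example.",
    attachment_cell("/local/path/to/document.pdf"),
    "john-doe:document_example",
]

# Join with tabs directly so the JSON double quotes are written unchanged.
tsv_text = "\t".join(header) + "\n" + "\t".join(row) + "\n"
print(tsv_text, end="")
```

The cells are joined with plain tabs here because a CSV writer with default settings would wrap the JSON cell in extra quotes.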

Example 3: Submitting Files

We have several different file types (for details refer to: https://data.igvf.org/profiles/). Here, we will specifically look at an example to submit a sequence_file. This is the schema or profile description of the sequence_file object you will be submitting, along with metadata properties you will use to link or relate it to other relevant objects.

The metadata for our Sequence File example is collected in a tab-separated file, example3.txt:

md5sum	file_format	file_set	content_type	sequencing_run	submitted_file_name	aliases
71ffd7ed2bcd12ec3f01887606778db3	fastq	/measurement-sets/IGVFDS4213KCOQ/	reads	3	/local/path/to/fastq/file.fastq.gz	john-doe:sequence_file_example

Submission to the IGVF sandbox server is executed using this iu_register.py command:

$ iu_register.py -m sandbox -p sequence_file -i /local/path/to/example3.txt

This uploads the metadata (example3.txt) to create an object for a sequence_file on the server. The file itself is also automatically uploaded to an S3 IGVF bucket using the path you provided in submitted_file_name.
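The md5sum column can be computed with Python's standard hashlib module. The sketch below reads the file in chunks so that large FASTQ files do not need to fit in memory; the helper name md5_of_file is hypothetical, and it assumes the checksum should be taken over the file exactly as it will be uploaded.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example call with the placeholder path from example3.txt:
# print(md5_of_file("/local/path/to/fastq/file.fastq.gz"))
```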

Note: Some files (depending on their file format) need to be gzipped before being uploaded. As of July 2024, files with the following extensions need to be gzipped: .bed, .bedpe, .csv, .dat, .fasta, .fastq, .gaf, .gds, .gff, .gtf, .obo, .owl, .pairs, .sam, .tagAlign, .tar, .tsv, .txt, .vcf, .xml and .yaml. For the latest list of file formats that must be gzipped, please refer to https://github.com/IGVF-DACC/igvfd/blob/dev/src/igvfd/types/file.py#L41-L81
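Files that are not yet compressed can be gzipped from Python with the standard gzip and shutil modules. This is a minimal sketch; the helper name gzip_file is hypothetical, and it writes a compressed copy next to the original without deleting it.

```python
import gzip
import shutil


def gzip_file(src_path, dest_path=None):
    """Write a gzip-compressed copy of src_path and return the new path."""
    if dest_path is None:
        dest_path = src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dest_path, "wb") as dest:
        # Stream the data so large files are never loaded fully into memory.
        shutil.copyfileobj(src, dest)
    return dest_path


# Example call with a placeholder path:
# gzip_file("/local/path/to/fastq/file.fastq")
```

Remember that submitted_file_name should then point to the compressed file.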

Note: In practice, you will likely post many objects at once by listing multiple objects in the same metadata file. Because the iu_register.py command requires the profile (object type) to be specified in the command, only one type of object can be submitted at a time.

PATCH: Editing Submissions

If your objects posted successfully to the Portal, but there is an error that needs to be fixed, you can edit using a patch.

Supply the corrected metadata for your objects (one profile or object type at a time, as mentioned above) in a tab-separated file (input_file.txt). The first column in the input file must be labeled record_id and contain any valid identifier (uuid, accession, or aliases) of the record to be updated. The properties that you want to edit go in the columns that follow.

Use the --patch flag to add a new property value or modify existing properties. Combining it with -w (--patch -w) overwrites any existing values of the specified properties.

Changing the description property of the document object from Example 2 requires an input file (input_file.txt) like this:

record_id	description
john-doe:document_example	Plasmid map alternative description.

And you would overwrite the old description entry with this command:

$ iu_register.py -m sandbox -p document -i path/to/input_file.txt --patch -w

Using a Google Sheet (Apps Script) as an Input File

If you're using the Google Sheet (Apps Script) as an input file for igvf_utils, you'll have to make a few adjustments.

For POST:

Remove any columns whose names start with '#' (e.g., #response, #response_time, #skip).
Remove any columns that do not have a value. Your file should include only columns for properties that apply to the object.

For PATCH:

Remove the columns described in the POST section above.
Label the first column record_id; it must contain any valid identifier (uuid, accession, or aliases) of the record to be updated. The following column(s) should contain the properties that need patching.
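The clean-up steps above can be sketched in a few lines of Python, assuming the sheet has been exported as tab-separated text. The helper names clean_sheet_tsv and first_column_is_record_id are hypothetical and are not part of igvf_utils.

```python
def clean_sheet_tsv(tsv_text):
    """Drop '#' columns and columns with no values from tab-separated text."""
    table = [line.split("\t") for line in tsv_text.splitlines() if line.strip()]
    header, records = table[0], table[1:]
    # Pad short rows so every record has one cell per header column.
    records = [record + [""] * (len(header) - len(record)) for record in records]
    keep = [
        index
        for index, name in enumerate(header)
        if not name.startswith("#")
        and any(record[index].strip() for record in records)
    ]
    cleaned = [[row[index] for index in keep] for row in [header] + records]
    return "".join("\t".join(row) + "\n" for row in cleaned)


def first_column_is_record_id(tsv_text):
    """Check the PATCH requirement that the first column is record_id."""
    return tsv_text.split("\n", 1)[0].split("\t")[0] == "record_id"


# A made-up export with one '#' column and one empty column (attachment).
sheet = (
    "#response\tdocument_type\tdescription\tattachment\taliases\n"
    "201\tplasmid map\tPlasmid map example.\t\tjohn-doe:document_example\n"
)
print(clean_sheet_tsv(sheet), end="")
```

A column is treated as empty only if no row has a value in it, so a property that is filled in for some objects but not others is kept.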

Review and Troubleshoot Submissions

The Portal server returns a standard HTTP response status code after the command runs. For a successful submission, either 200 or 201 is returned. An unsuccessful submission returns a different code, such as 404, 422, or 409.

Common Errors:

ERROR 422: Unprocessable Entity. The submission refers to an object that doesn't exist, or a required property is missing.
ERROR 409: Conflict. A new object clashes with one that already exists on the Portal. For example, aliases cannot be reused.
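When reviewing many submissions, the status codes described above can be turned into short messages. This is a minimal sketch; the helper name describe_status is hypothetical, the messages paraphrase this guide, and any code not covered here is reported generically.

```python
# Meanings paraphrased from this guide; not an exhaustive list.
KNOWN_ERRORS = {
    409: "Conflict: the new object clashes with an existing one "
         "(for example, a reused alias).",
    422: "Unprocessable Entity: a referenced object does not exist "
         "or a required property is missing.",
}


def describe_status(code):
    """Summarize an HTTP status code returned by the Portal."""
    if code in (200, 201):
        return "success"
    return KNOWN_ERRORS.get(code, "unsuccessful (HTTP %d)" % code)


for code in (201, 409, 422, 404):
    print(code, "->", describe_status(code))
```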

For further troubleshooting, three log files are generated automatically when you run the script (shown here for the sandbox server):

log_iu_sandbox_error.txt -> Provides a brief record of failed actions (post, patch, etc) on the sandbox server.
log_iu_sandbox_debug.txt -> Provides detailed record of failed actions on the sandbox server, including error codes.
log_iu_sandbox_posted.txt -> Provides record of objects successfully posted to the sandbox server.

Other Tasks

With a few lines of Python code, you can use other tools that igvf_utils has to offer. Import the tools and connect to a server as shown below:

$ python
>>> import igvf_utils as iu
>>> from igvf_utils.connection import Connection
>>> conn = Connection("api.sandbox.igvf.org") # For the sandbox server
>>> conn = Connection("api.data.igvf.org") # For the production server (use instead of the line above)

Then, you can re-upload a file that was posted but not uploaded to AWS:

>>> conn.upload_file(file_id="accession", file_path="full/path/to/file")

Here, file_id is the accession (example: TSTFI63300087) of the file object you are re-uploading, and file_path is the corrected path to the (new) file.

NOTE: You CANNOT re-upload a file if its upload_status is "validated" or "validation exempted". In addition, only files whose status is "in progress" or "preview" can be re-uploaded. For all other scenarios, please contact your data wranglers for further information.
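The rule in this note can be written as a small check to run before attempting a re-upload. This is only a sketch of the rule as stated above: the helper name can_reupload is hypothetical, and it assumes that the "in progress" and "preview" values refer to the file object's status property.

```python
# Values taken from the note above.
BLOCKED_UPLOAD_STATUSES = {"validated", "validation exempted"}
ALLOWED_STATUSES = {"in progress", "preview"}


def can_reupload(upload_status, status):
    """Apply the re-upload rule from this guide to a file object's fields."""
    return (
        upload_status not in BLOCKED_UPLOAD_STATUSES
        and status in ALLOWED_STATUSES
    )


# "pending" and "released" are illustrative values, not taken from this guide.
print(can_reupload("pending", "in progress"))
print(can_reupload("validated", "in progress"))
print(can_reupload("pending", "released"))
```

If the check returns False, contact your data wrangler rather than retrying the upload.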

To remove properties completely from an object:

>>> conn.remove_props(rec_id="[accession]", props=["target"])

Documentation and Resources

Schema profiles, which list the properties available for each object.
IGVF-utils documentation and a GitHub wiki for the igvf_utils toolset.
More links to data organization, etc (coming soon).