IGVF_utils includes tools that are useful to any IGVF Consortium submitting group, as well as the general community working with IGVF data. The library and scripts are written in Python.
See the wiki to get started.
API and script documentation are available on Read the Docs.
igvf_utils is a fork of encode_utils, a Python API client originally designed for the ENCODE project. The tool has been slightly updated to be compatible with the IGVF database.
Follow these step-by-step instructions to download igvf_utils and its dependencies. This toolset was developed by Nathaniel Watson of Snyder lab. Submissions and patches can be done with a prepackaged script from this toolkit: /MetaDataRegistration/iu_register.py. The installation instructions will help you set up a Python environment and place a copy of this script in your /bin directory for easy use.
Next, you'll need to add your API access keys (these should have been provided to you by your data wrangler). Access keys grant you permissions on the server. There are a few environment variables you can set: lab, award, and your access keys. See the configuration page for more details. If you have the award and lab properties specified, they will automatically be included as properties in your objects.
You can do this each session:
$ export IGVF_API_KEY='xxxxxx'
$ export IGVF_SECRET_KEY='xxxxxxxx'
Or once globally:
Navigate to where environment variables are stored on your machine. For example, on macOS this would be the .bash_profile or .zshrc file in your home directory.
$ cd ~/
$ open .bash_profile
Add the lines:
export IGVF_API_KEY='xxxxxx'
export IGVF_SECRET_KEY='xxxxxxxxx'
You may also specify environment variables corresponding to your lab and award objects' identifiers in the IGVF_AWARD and IGVF_LAB variables respectively in your .bash_profile, which will allow you to skip the award and lab fields when submitting object metadata with iu_register.py:
export IGVF_AWARD='xxxxx'
export IGVF_LAB='your-lab-name'
The Submission example later in this document is written assuming that the IGVF_AWARD and IGVF_LAB environment variables have been defined. If you choose not to do this, be aware that lab and award are required properties you'll have to include in your tab-separated files for gene, phenotypic_feature, and other objects.
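Before submitting, it can help to confirm that the variables above are visible to Python. The following is a minimal sketch using only the standard library; the helper name missing_variables is hypothetical and not part of igvf_utils.

```python
import os

# Variable names as described above: the first two carry your access keys,
# the last two are optional conveniences for award and lab.
REQUIRED_VARS = ("IGVF_API_KEY", "IGVF_SECRET_KEY")
OPTIONAL_VARS = ("IGVF_AWARD", "IGVF_LAB")


def missing_variables(names, environ=None):
    """Return the names from `names` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    for name in missing_variables(REQUIRED_VARS):
        print(f"{name} is not set.")
    for name in missing_variables(OPTIONAL_VARS):
        print(f"{name} is not set; include that column in your input file instead.")
```

If nothing is printed, all four variables are defined in the current session.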
igvf_utils has a prepackaged script to allow you to submit to the Portal easily: iu_register.py. Proper installation should place the script in your /bin directory, so once you are in your Python environment, you'll be able to run it from your terminal.
There are several arguments you'll need to include when using this script.
Required arguments:
Specification of the server you're submitting to. The IGVF server is referred to as the production server (-m prod), and a special server mirroring production for testing purposes, Sandbox, is referred to as the sandbox server (-m sandbox).
The DACC strongly encourages you to start with submission to the sandbox server, only proceeding to submission to the production server after validating correctness of the submission to the sandbox server with your data wrangler's help. Again, the two available modes are:
-m prod
-m sandbox
Specification of the object type being submitted, for example:
-p in_vitro_system
-p human_donor
-p tabular_file
Specification of the tab-separated values file containing the metadata values for the object(s) being submitted. For example:
-i tab_separated_file.txt
$ iu_register.py -m sandbox -p human_donor -i tab_separated_file.txt
Optional arguments:
Specification of a special mode of submission execution that will not result in actual object(s) creation on the Portal, but will validate to some extent the metadata properties specified in the tab-separated values file.
By default, iu_register.py requires specification of aliases for any object that is being submitted. Use the flag --no-aliases if you are interested in submitting objects without an alias. The use of this flag is discouraged by the DACC.
Log files with information relevant to your submissions will automatically be generated once you start to use the script.
Please continue to the submission examples to see how to format the tab_separated_file.txt (metadata) for several objects, and example commands for submission of these objects using iu_register.py.
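As a rough sketch, the arguments described above could be assembled in Python before being handed to the script. The helper name build_register_command is hypothetical and not part of igvf_utils; it uses only the flags named so far in this document (-m, -p, -i, and --no-aliases).

```python
import shlex


def build_register_command(mode, profile, input_file, no_aliases=False):
    """Assemble the iu_register.py argument list from the options above."""
    if mode not in ("sandbox", "prod"):
        raise ValueError("mode must be 'sandbox' or 'prod'")
    command = ["iu_register.py", "-m", mode, "-p", profile, "-i", input_file]
    if no_aliases:
        # Discouraged by the DACC; only add it when explicitly requested.
        command.append("--no-aliases")
    return command


command = build_register_command("sandbox", "human_donor", "tab_separated_file.txt")
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) once iu_register.py is available on your PATH. Building the list in one place makes it harder to submit to the production server by accident.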
The tool iu_register.py requires the aliases property for every object type that supports it. Each alias must be unique, which makes it a good place to use identifiers from a system such as a LIMS.
Please refer to Submission Overview for more details on object identifiers. Note: You can use these three types of IDs (uuid, accession, and aliases) to refer to an object.
This is the schema or profile description of the in_vitro_system object you will be submitting, along with metadata properties you will use to link or relate it to other relevant objects.
Further details on IGVF data model and object metadata can be found in the Submission Overview.
The metadata for our In Vitro System example is collected in a tab-separated file, example1.txt:
source | donors | taxa | biosample_term | classification | aliases |
---|---|---|---|---|---|
/sources/atcc/ | /human-donors/IGVFDO2722UPGC/ | Homo sapiens | /sample-terms/EFO_0007598/ | cell line | john-doe:Hap1 |
Note: award or lab properties were not included because they were specified as environment variables!
Submission to the IGVF sandbox server is executed using this iu_register.py command:
$ iu_register.py -m sandbox -p in_vitro_system -i /local/path/to/example1.txt
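If the metadata lives in a script rather than a spreadsheet, a file like example1.txt could be generated with Python's csv module. This is a sketch under the assumption that one header line followed by one tab-separated line per object is all the script needs; write_metadata_tsv is a hypothetical helper name.

```python
import csv

COLUMNS = ["source", "donors", "taxa", "biosample_term", "classification", "aliases"]

# One dictionary per object to submit; values copied from the example above.
records = [
    {
        "source": "/sources/atcc/",
        "donors": "/human-donors/IGVFDO2722UPGC/",
        "taxa": "Homo sapiens",
        "biosample_term": "/sample-terms/EFO_0007598/",
        "classification": "cell line",
        "aliases": "john-doe:Hap1",
    },
]


def write_metadata_tsv(path, columns, rows):
    """Write a header line followed by one tab-separated line per record."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, delimiter="\t",
                                lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)


write_metadata_tsv("example1.txt", COLUMNS, records)
```

Adding more dictionaries to records produces one additional line per object of the same type.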
This is the schema or profile description of the document object you will be submitting, along with metadata properties you will use to link or relate it to other relevant objects.
The metadata for our Document example is collected in a tab-separated file, example2.txt:
document_type | description | attachment | aliases |
---|---|---|---|
plasmid map | Plasmid map example. | {"path": "/local/path/to/document.pdf"} | john-doe:document_example |
Submission to the IGVF sandbox server is executed using this iu_register.py command:
$ iu_register.py -m sandbox -p document -i /local/path/to/example2.txt
This uploads the metadata (example2.txt) to create an object for a document file on the server.
The file itself is also automatically uploaded to an S3 IGVF bucket using the path you provided in attachment.
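The attachment cell in the example is a small JSON object. A minimal sketch of producing that cell and the surrounding row in Python follows; attachment_cell is a hypothetical helper name, and the row is written with a plain tab join so that the JSON double quotes appear exactly as in the example.

```python
import json


def attachment_cell(local_path):
    """Return the JSON text for the attachment column, e.g. {"path": "..."}."""
    return json.dumps({"path": local_path})


header = ["document_type", "description", "attachment", "aliases"]
row = [
    "plasmid map",
    "Plasmid map example.",
    attachment_cell("/local/path/to/document.pdf"),
    "john-doe:document_example",
]

with open("example2.txt", "w", encoding="utf-8") as handle:
    handle.write("\t".join(header) + "\n")
    handle.write("\t".join(row) + "\n")
```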
We have several different file types (for details refer to: https://data.igvf.org/profiles/). Here, we will specifically look at an example to submit a sequence_file. This is the schema or profile description of the sequence_file object you will be submitting, along with metadata properties you will use to link or relate it to other relevant objects.
The metadata for our Sequence File example is collected in a tab-separated file, example3.txt:
md5sum | file_format | file_set | content_type | sequencing_run | submitted_file_name | aliases |
---|---|---|---|---|---|---|
71ffd7ed2bcd12ec3f01887606778db3 | fastq | /measurement-sets/IGVFDS4213KCOQ/ | reads | 3 | /local/path/to/fastq/file.fastq.gz | john-doe:sequence_file_example |
Submission to the IGVF sandbox server is executed using this iu_register.py command:
$ iu_register.py -m sandbox -p sequence_file -i /local/path/to/example3.txt
This uploads the metadata (example3.txt) to create an object for a sequence_file on the server.
The file itself is also automatically uploaded to an S3 IGVF bucket using the path you provided in submitted_file_name.
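The md5sum column has to be filled in before submission. One way to compute it with Python's standard library is sketched below. It assumes the checksum is taken over the file exactly as it will be uploaded (the path given in submitted_file_name); md5_hex is a hypothetical helper name.

```python
import hashlib


def md5_hex(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large FASTQ files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling md5_hex("/local/path/to/fastq/file.fastq.gz") returns a 32-character hexadecimal string suitable for the md5sum column.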
Note: Some files (depending on their file format) need to be gzipped before being uploaded. As of July 2024, the following files need to be gzipped: .bed, .bedpe, .csv, .dat, .fasta, .fastq, .gaf, .gds, .gff, .gtf, .obo, .owl, .pairs, .sam, .tagAlign, .tar, .tsv, .txt, .vcf, .xml and .yaml. To find the latest list of files please refer to https://github.com/IGVF-DACC/igvfd/blob/dev/src/igvfd/types/file.py#L41-L81
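A quick local check can catch uncompressed files before submission. The sketch below hard-codes the extension list from the note above, so it will go stale if the Portal's list changes; needs_gzip and gzip_file are hypothetical helper names.

```python
import gzip
import shutil
from pathlib import Path

# Extensions listed above as requiring gzip compression (as of July 2024).
GZIP_REQUIRED = {
    ".bed", ".bedpe", ".csv", ".dat", ".fasta", ".fastq", ".gaf", ".gds",
    ".gff", ".gtf", ".obo", ".owl", ".pairs", ".sam", ".tagalign", ".tar",
    ".tsv", ".txt", ".vcf", ".xml", ".yaml",
}


def needs_gzip(path):
    """True if the final extension is on the list, i.e. not yet gzipped."""
    return Path(path).suffix.lower() in GZIP_REQUIRED


def gzip_file(path):
    """Write a gzipped copy next to the original and return its path."""
    source = Path(path)
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

A path such as reads.fastq is flagged, while reads.fastq.gz is not, because only the final extension is inspected.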
Note: In reality, you will likely post many objects at once, listing multiple objects within the metadata example.txt file. Because the iu_register.py command requires the profile (object type) to be specified in the command, only one type of object can be submitted at a time.
If your objects posted successfully to the Portal, but there is an error that needs to be fixed, you can edit using a patch.
Supply the corrected metadata for your objects (one profile or object type at a time, as mentioned above) in a tab-separated file (input_file.txt). The first column in the input file should be labeled record_id, which provides any valid identifier (uuid, accession, or aliases) of the record to be updated. The properties that you want to edit should be specified in the following columns.
The --patch flag can be used to add a new property value or modify existing properties. --patch -w will overwrite any existing values specified.
Changing the description property for our document file object from earlier would require an input_file like this:
record_id | description |
---|---|
john-doe:document_example | Plasmid map alternative description. |
And you would overwrite the old description entry with this command:
$ iu_register.py -m sandbox -p document -i path/to/input_file.txt --patch -w
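Because a patch file must start with a record_id column, a small pre-flight check can catch a mislabeled header before anything is sent to the server. This is only a sketch of such a check; check_patch_header is a hypothetical helper and is not part of igvf_utils.

```python
import csv


def check_patch_header(path):
    """Return the property columns of a patch file, or raise if malformed."""
    with open(path, newline="", encoding="utf-8") as handle:
        # Only the header line is needed; an empty file yields an empty list.
        header = next(csv.reader(handle, delimiter="\t"), [])
    if not header or header[0] != "record_id":
        raise ValueError("the first column of a patch file must be record_id")
    if len(header) < 2:
        raise ValueError("a patch file needs at least one property column")
    return header[1:]
```

For the input file shown above, the function returns the single property column, description.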
If you're using a Google Sheet (Apps Script) as an input file for igvf_utils, you'll have to make a few adjustments.
For POST:
For PATCH:
The first column should be labeled record_id, which provides any valid identifier (uuid, accession, or aliases) of the record to be updated. The following column(s) should contain the properties that need patching.
The Portal server will return a standard HTTP response status code after running the command. For a successful submission, either 200 or 201 will be returned. If unsuccessful, another code such as 404, 422, or 409 will be returned.
422 Unprocessable Entity: an object that doesn't exist was referred to, or a required property is missing.
409 Conflict: a new object clashes with one that already exists on the Portal. For example, aliases cannot be reused.
For further troubleshooting, the interacting server will generate 3 files automatically:
With a few lines of Python code, you can use other tools that igvf_utils has to offer. Import the tools and connect to a server, using one of the two Connection lines (sandbox or production), as shown below:
$ python
>>> import igvf_utils as iu
>>> from igvf_utils.connection import Connection
>>> conn = Connection("api.sandbox.igvf.org") # For sandbox
>>> conn = Connection("api.data.igvf.org") # For production server
Then, you can re-upload a file that was posted but not uploaded to AWS:
>>> conn.upload_file(file_id="accession", file_path="full/path/to/file")
where file_id is the accession (example: TSTFI63300087) of the file object you are trying to re-upload and file_path is the corrected file path for the (new) file you're re-uploading.
Or remove properties completely from an object:
>>> conn.remove_props(rec_id="[accession]", props=["target"])