This page gives an overview of generating, validating, and submitting seqspec documents to the IGVF data portal.
If you are submitting raw sequencing data (such as Illumina, PacBio, NanoPore, etc.), the corresponding sequence specification YAML configuration files (seqspec) should be submitted along with the sequence files. Furthermore, the seqspec YAML files are required to uniformly process single-cell assays (such as scRNA-seq, 10X multiome, SHARE-seq, MULTI-seq, etc.).
These machine-readable YAML files describe the sequence and structure of your genomic library for standardized data processing (see Bioinformatics, Volume 40, Issue 4, April 2024). The files should be submitted as configuration files and linked to the corresponding sequence file metadata through seqspec_of.
Detailed documentation on seqspec installation, generation, validation, and usage is available in the seqspec docs. Additional tutorials are available from a 2024 seqspec Jamboree session, with all content on Google Drive.
The seqspec code repository (for YAML file generation and other functionality) is available in the seqspec repo on GitHub. Please also check the seqspec JSON schema for a list of accepted seqspec enums, such as assay terms, region types, and primer IDs.
For additional help generating a YAML file, please contact Sina Booeshaghi and Lior Pachter. For any other questions about submitting a configuration_file, please contact your wranglers.
The seqspec YAML files are required to uniformly process single-cell assays (see assay terms above). The files should be submitted as configuration files and linked to the corresponding sequence file metadata through seqspec_of. The seqspec YAML files need to be gzipped prior to submission.
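The gzip step can be done with any standard tool. As a minimal sketch, assuming a local file named by the caller (the function name gzip_seqspec is hypothetical and not part of seqspec), Python's standard library can write the compressed copy:

```python
import gzip
import shutil
from pathlib import Path


def gzip_seqspec(yaml_path: str) -> Path:
    """Write a gzipped copy of yaml_path next to it and return the new path."""
    src = Path(yaml_path)
    dst = src.with_name(src.name + ".gz")
    # Stream the bytes so the file is never loaded into memory all at once.
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

The original YAML file is left in place, so it can still be edited and re-compressed if validation reports problems.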
Please make sure that the file_set of a seqspec YAML file matches the file_set of the sequence files in the seqspec_of content.
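The file_set rule can be checked before submission. The sketch below assumes the metadata has already been loaded into plain dictionaries; the dictionary layout, the function name, and the accessions in the usage example are all hypothetical, and only the property names file_set and seqspec_of come from the text above.

```python
def file_set_mismatches(seqspec_meta: dict, sequence_files: dict) -> list:
    """Return accessions in seqspec_of whose file_set differs from the seqspec's."""
    expected = seqspec_meta["file_set"]
    mismatched = []
    for accession in seqspec_meta["seqspec_of"]:
        seq_file = sequence_files.get(accession)
        # A missing record is reported too, since its file_set cannot be confirmed.
        if seq_file is None or seq_file["file_set"] != expected:
            mismatched.append(accession)
    return mismatched


# Hypothetical metadata, for illustration only.
seqspec_meta = {
    "file_set": "IGVFDS0000AAAA",
    "seqspec_of": ["IGVFFI0000AAAA", "IGVFFI0000BBBB"],
}
sequence_files = {
    "IGVFFI0000AAAA": {"file_set": "IGVFDS0000AAAA"},
    "IGVFFI0000BBBB": {"file_set": "IGVFDS0000BBBB"},
}
print(file_set_mismatches(seqspec_meta, sequence_files))
```

An empty list means every linked sequence file shares the seqspec's file_set.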
If a seqspec has onlist file references in its library_spec sections that have the property region_type: barcode (see section above), the submitters must also add the same list of files to the linked Measurement Sets' onlist_files and onlist_method.
There are a few important seqspec YAML style guidelines to follow when generating and submitting seqspec YAML files.
If the library_spec portion will be different, a new seqspec describing this new library structure is needed.
Single-cell assay terms:
/assay-terms/OBI_0002762/, single-nucleus ATAC-seq
/assay-terms/OBI_0003109/, single-nucleus RNA sequencing assay
/assay-terms/OBI_0002631/, single-cell RNA sequencing assay
/assay-terms/OBI_0002764/, single-cell ATAC-seq
/assay-terms/OBI_0003660/, in vitro CRISPR screen using single-cell RNA-seq
The example seqspec used is IGVFFI1157AYPH.
This section describes the experimental setup of one or more sequencing runs. Much of the information listed in this section is also collected as metadata when submitting sequence files and file sets to the data portal. Therefore, the seqspec validation check skips this section except for seqspec_version.
seqspec_version in this section is expected to be v0.3.0 if a seqspec YAML file will be submitted to the IGVF data portal.
The modality section should only contain one modality. Please do not combine multiple modalities (e.g., RNA and ATAC) into the same seqspec YAML file.
The sequence_spec section describes the relevant read FASTQ files generated by sequencing runs. When referencing sequencing FASTQ files in this section, the files and the associated read_id must 1) use IGVF accessions, and 2) have valid IGVF file download URLs.
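The version, modality, and read_id rules can be pre-checked on a seqspec that has already been parsed into a plain dictionary. This is only a sketch under stated assumptions: the key names (seqspec_version, modalities, sequence_spec, read_id) follow the seqspec 0.3.0 layout as understood here, the accession pattern is inferred from examples such as IGVFFI1157AYPH, and the function name is hypothetical. It does not replace the portal's own validation.

```python
import re

# Assumption: IGVF file accessions look like IGVFFI + 4 digits + 4 capital
# letters, inferred from examples such as IGVFFI1157AYPH.
ACCESSION_RE = re.compile(r"^IGVFFI\d{4}[A-Z]{4}$")


def precheck_spec(spec: dict) -> list:
    """Return a list of human-readable problems found in a parsed seqspec."""
    problems = []
    # Accept the version with or without a leading "v" (assumption).
    version = str(spec.get("seqspec_version", "")).lstrip("v")
    if version != "0.3.0":
        problems.append("seqspec_version should be 0.3.0")
    # Only one modality is allowed per seqspec YAML file.
    if len(spec.get("modalities", [])) != 1:
        problems.append("exactly one modality is expected")
    # Every read should be identified by an IGVF accession.
    for read in spec.get("sequence_spec", []):
        read_id = str(read.get("read_id", ""))
        if not ACCESSION_RE.match(read_id):
            problems.append(f"read_id is not an IGVF accession: {read_id!r}")
    return problems
```

The download URLs are not checked here, because confirming that a URL is valid requires a request to the portal.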
The library_spec section describes in detail the structure and regions of sequencing libraries.
One important feature of the library_spec section is the onlist files. They may be used to generate barcode inclusion lists which will be used to correct barcodes during pipeline runs. Because of this usage, there are specific requirements to follow when including onlist information.
If a section has onlist: !Onlist, that section is expected to have a valid file name and URL reference to a Tabular File with content_type: barcode onlist on the IGVF portal.
[...] region_type: barcode, and 2) these onlist files must be single column without any header (e.g., tabular file IGVFFI0791WXDC).
If a section has region_type: index5 (or index7) (or any other region_type that is not barcode) and onlist: !Onlist, it is currently acceptable to include URL references to Tabular Files with multiple barcode columns and headers. If any future IGVF uniform pipeline plans to use information from these sections, this requirement is expected to be updated.
It is recommended that submitters self-validate their seqspec YAML files before submitting to the IGVF data portal. There are 2 levels of validation done on the IGVF data portal.
Level 1 validation: It is applied to all submitted seqspec YAML files. The process includes seqspec schema check, content check, and read FASTQ file URLs validation.
Level 2 validation: It is currently only applicable to seqspec YAML files used for single-cell assays (see the assay terms above). In addition to the Level 1 checks, the onlist file URLs are validated.
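Before relying on Level 2 validation, a barcode onlist file can be checked locally against the single-column, no-header requirement from the library_spec section. The sketch below is not the portal's check: the function name is hypothetical, and it assumes that barcodes contain only the letters A, C, G, T, and N, so any other line (a header, or a line with tabs, commas, or spaces) is reported.

```python
import gzip
import re

# Assumption: barcodes consist only of the letters A, C, G, T, or N.
BARCODE_RE = re.compile(r"^[ACGTN]+$")


def onlist_problems(path: str) -> list:
    """Return (line_number, text) pairs that break the single-column, no-header rule."""
    # Read gzipped and plain-text onlist files the same way.
    opener = gzip.open if path.endswith(".gz") else open
    problems = []
    with opener(path, "rt") as handle:
        for number, line in enumerate(handle, start=1):
            text = line.rstrip("\r\n")
            # Separators suggest extra columns; other text suggests a header.
            if not BARCODE_RE.match(text):
                problems.append((number, text))
    return problems
```

An empty result suggests the file is a plain list of barcodes. Blank lines are reported as problems as well.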
There are 2 options to validate seqspec files: the seqspec command-line tool (seqspec check) or a local installation of IGVF-DACC checkfiles.
# To validate on Level 1 (schema, content, and fastq file references)
seqspec check -s igvf_onlist_skip yaml.gz
# To validate on Level 2 (schema, content, fastq, and onlist file references)
seqspec check -s igvf yaml.gz
Install IGVF-DACC checkfiles from https://github.com/IGVF-DACC/checkfiles.git and follow the instructions on how to run local checkfiles. This option runs the same file validation system as the IGVF data portal on your local computer. If you only need to validate seqspec YAML files, simply install the requirements using the command in Step 1; no additional dependencies are needed. Then follow the command for local checkfiles validation of seqspec YAML files in the section titled "Validate seqspec yaml file while skip onlist files check".
# Clone the repo and enter it (the paths below are relative to the repo root)
git clone https://github.com/IGVF-DACC/checkfiles.git
cd checkfiles
# Install requirements after creating a virtual environment
pip install -r src/checkfiles/requirements.txt
# Run seqspec yaml validation on Level 1 (schema, content, and fastq file references)
python src/checkfiles/checkfiles_local.py --input_file_path src/tests/data/seqspec_valid.yaml.gz --file_format yaml --content_type seqspec --onlist_skip --md5sum f1859dd9d60554a8f8ab63b65b458267
# Run seqspec yaml validation on Level 2 (schema, content, fastq, and onlist file references)
python src/checkfiles/checkfiles_local.py --input_file_path src/tests/data/seqspec_valid.yaml.gz --file_format yaml --content_type seqspec --md5sum f1859dd9d60554a8f8ab63b65b458267
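The commands above pass an MD5 digest through --md5sum. Assuming this is the digest of the file given as --input_file_path (an assumption, not confirmed by the text), it can be computed with Python's standard library. The helper name md5sum is hypothetical.

```python
import hashlib


def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest should be computed on the gzipped file itself. Compressing the same YAML file again can change the digest, because gzip stores a timestamp in the file header.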