Sequence Specification (seqspec) overview

This submission help page gives a general overview of generating, submitting, and validating seqspec documents for the IGVF data portal.

General information

If you are submitting raw sequencing data (such as Illumina, PacBio, Nanopore, etc.), the corresponding sequence specification (seqspec) YAML configuration files should be submitted along with the sequence files. Furthermore, seqspec YAML files are required to uniformly process single-cell assays (such as scRNA-seq, 10x Multiome, SHARE-seq, MULTI-seq, etc.).

These machine-readable YAML files describe the sequence and structure of your genomic library for standardized data processing; see Bioinformatics, Volume 40, Issue 4, April 2024. The files should be submitted as configuration files and linked to the metadata of the corresponding sequence files via seqspec_of.

Detailed documentation on seqspec installation, generation, validation, and usage is available in the seqspec docs. Additional tutorials are available from a 2024 seqspec Jamboree session, with all content on Google Drive.

Seqspec YAML files generation and submission

Generation

The seqspec code repository (for YAML file generation and other functionality) is available in the seqspec repo on GitHub. Please also check the seqspec JSON schema for the list of accepted seqspec enums, such as assay terms, region types, primer IDs, etc.

For additional help generating a YAML file, please contact Sina Booeshaghi and Lior Pachter. For any other questions about submitting a configuration_file, please contact your wranglers.

Submission

The seqspec YAML files are required to uniformly process single-cell assays (see assay terms above). The files should be submitted as configuration files and linked to the corresponding sequence files metadata as seqspec_of. The seqspec YAML files need to be gzipped prior to submission.
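
The gzip step can be scripted. The following is a minimal sketch, assuming the seqspec YAML exists as a local uncompressed file; the function name gzip_seqspec and the file name in the usage note are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def gzip_seqspec(yaml_path: str) -> Path:
    """Write a gzipped copy of a seqspec YAML next to the original file."""
    src = Path(yaml_path)
    dst = src.with_name(src.name + ".gz")  # e.g. spec.yaml -> spec.yaml.gz
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, gzip_seqspec("my_library.yaml") would write my_library.yaml.gz and leave the original file in place.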

Please make sure that the file_set of a seqspec YAML file matches the file_set of the sequence files in the seqspec_of content.
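
The file_set requirement can be checked before submission. The sketch below assumes the metadata is held in plain Python dictionaries; the property names file_set and seqspec_of come from this guide, while the accession key, the function name, and all identifiers in the usage note are hypothetical placeholders.

```python
def file_set_mismatches(seqspec_meta: dict, sequence_files: list) -> list:
    """Return the seqspec_of entries whose file_set differs from the seqspec's.

    Entries that cannot be found in sequence_files are reported as well.
    """
    # Assumption: each sequence file record has an "accession" and a "file_set".
    by_accession = {record["accession"]: record for record in sequence_files}
    mismatched = []
    for accession in seqspec_meta["seqspec_of"]:
        record = by_accession.get(accession)
        if record is None or record["file_set"] != seqspec_meta["file_set"]:
            mismatched.append(accession)
    return mismatched
```

With a seqspec record {"file_set": "set-A", "seqspec_of": ["read-1", "read-2"]}, the function returns an empty list only when both read-1 and read-2 also belong to set-A.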

If a seqspec has onlist file references in its library_spec sections that have the property region_type:barcode (see section above), the submitters must also add the same list of files to the linked Measurement Sets' onlist_files and onlist_method.

Key seqspec YAML file information

There are a few important style guidelines to follow when generating and submitting seqspec YAML files.

  1. The seqspec YAML file must be v0.3.0.
  2. All references to FASTQ files or onlist files, if used in a seqspec, must use IGVF data portal accessions.
  3. Labs are still expected to submit one seqspec per library structure regardless of whether their experiments are single cell or not. This means if the library_spec portion will be different, a new seqspec describing this new library structure is needed.
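
Guidelines 1 and 2 can be pre-checked with a simple text scan. This is a rough sketch and not the portal's validator. It assumes that file references carry their identifier in a file_id key and that IGVF file accessions follow the pattern of the accessions quoted in this guide (IGVFFI, four digits, four uppercase letters). The function name and the placeholder accession IGVFFI0000AAAA are hypothetical.

```python
import re

# Assumption: accession shape inferred from examples such as IGVFFI1157AYPH.
ACCESSION = re.compile(r"^IGVFFI\d{4}[A-Z]{4}$")
VERSION_LINE = re.compile(
    r"^\s*seqspec_version:\s*['\"]?v?([\w.]+)['\"]?\s*$", re.MULTILINE
)
# Assumption: each file reference has a line of the form "file_id: <value>".
FILE_ID_LINE = re.compile(r"^\s*file_id:\s*(\S+)\s*$", re.MULTILINE)


def guideline_problems(yaml_text: str, expected_version: str = "0.3.0") -> list:
    """Return human-readable problems found in the uncompressed YAML text."""
    problems = []
    match = VERSION_LINE.search(yaml_text)
    if match is None:
        problems.append("no seqspec_version line found")
    elif match.group(1) != expected_version:
        problems.append(
            f"seqspec_version is {match.group(1)}, expected {expected_version}"
        )
    for file_id in FILE_ID_LINE.findall(yaml_text):
        file_id = file_id.strip("'\"")
        if not ACCESSION.match(file_id):
            problems.append(f"file_id {file_id!r} is not an IGVF accession")
    return problems
```

An empty list means the scan found nothing to flag; it does not replace the validation described later in this guide.
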

Submit seqspec for single cell assays
  1. If a seqspec YAML file is used in single cell assays, it must follow the style of one seqspec YAML file per sequencing run. The following assay terms currently fall under single cell experiments, and the associated data will be analyzed using the single cell uniform pipeline.
    • /assay-terms/OBI_0002762/, single-nucleus ATAC-seq
    • /assay-terms/OBI_0003109/, single-nucleus RNA sequencing assay
    • /assay-terms/OBI_0002631/, single-cell RNA sequencing assay
    • /assay-terms/OBI_0002764/, single-cell ATAC-seq
    • /assay-terms/OBI_0003660/, in vitro CRISPR screen using single-cell RNA-seq

Submit seqspec for NON-single cell assays
  1. If a seqspec YAML file is NOT used in single cell assays (see assay terms above), the currently suggested submission method is to create one valid seqspec YAML file from the results of one sequencing run. Upload the seqspec YAML file as a Configuration File linked to a Curated Set. All future raw sequencing files can use this seqspec YAML file, even though the YAML only contains one set of FASTQ files. [Examples TBD].

Example seqspec

The example seqspec used is IGVFFI1157AYPH.

Section 1: Assay information

This section describes the experimental setup of one or more sequencing runs. Much of the information listed in this section is also collected as metadata when submitting sequence files and file sets to the data portal. Therefore, the seqspec validation check skips this section, except for seqspec_version.

Important info for the !Assay section

  1. The seqspec_version in this section is expected to be v0.3.0 if a seqspec YAML file will be submitted to the IGVF data portal.
  2. The -modality section should only contain one modality. Please do not combine multiple modalities (e.g., RNA and ATAC) into the same seqspec YAML file.

[Figure: Seqspec version]

Section 2: sequence_spec

This section describes the relevant read FASTQ files generated by sequencing runs. When referencing sequencing FASTQ files in this section, the files and the associated read_id must 1) use IGVF accessions, and 2) have valid IGVF file download URLs.

[Figure: Seqspec FASTQ file reference]

Section 3: library_spec

The library_spec section describes in detail the structure and regions of the sequencing libraries.

[Figure: library_spec overview]

Onlist files

One important feature of the library_spec section is the onlist files. They may be used to generate barcode inclusion lists which will be used to correct barcodes during pipeline runs. Because of this usage, there are specific requirements to follow when including onlist information.

  1. If a seqspec YAML file has a library_spec region with the property onlist: !Onlist, that region is expected to have a valid file name and URL reference to a Tabular File with content_type: barcode onlist on the IGVF portal.
  2. If a seqspec YAML file is linked to single cell Measurement Sets, it is especially important that 1) the regions with region_type: barcode have valid onlist files, and 2) these onlist files are single column without any header (e.g., tabular file IGVFFI0791WXDC).

[Figure: seqspec barcode region onlist]

  3. If there are also regions with onlist: !Onlist whose region_type is not barcode (e.g., index5 or index7), it is currently acceptable to include URL references to Tabular Files with multiple barcode columns and headers. If a future IGVF uniform pipeline plans to use information from these regions, this requirement is expected to be updated.

[Figure: seqspec index region onlist]
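
The single-column, header-free requirement for onlist files in region_type: barcode regions can be checked locally before upload. The sketch below assumes a plain-text or gzipped file and assumes barcodes are written with the uppercase letters A, C, G, T, and N only; adjust the pattern if your barcodes are encoded differently. The function name is hypothetical.

```python
import gzip
import re

# Assumption: a valid entry is one run of uppercase nucleotide letters.
BARCODE = re.compile(r"^[ACGTN]+$")


def onlist_problems(path: str, max_report: int = 5) -> list:
    """Report lines that are not a single barcode (headers, extra columns, blanks)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    problems = []
    with opener(path, "rt") as handle:
        for lineno, line in enumerate(handle, start=1):
            entry = line.rstrip("\r\n")
            if not BARCODE.match(entry):
                problems.append(f"line {lineno}: {entry!r} is not a single barcode")
                if len(problems) >= max_report:
                    break
    return problems
```

A header row such as "barcode" or a tab-separated second column makes the corresponding line fail the pattern, so both violations show up in the returned list.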

Validating generated seqspec yaml files

It is recommended that submitters validate their seqspec YAML files themselves before submitting to the IGVF data portal. The IGVF data portal performs two levels of validation.

Level 1 validation: It is applied to all submitted seqspec YAML files. The process includes seqspec schema check, content check, and read FASTQ file URLs validation.

Level 2 validation: It is currently only applicable to seqspec YAML files used for single cell assays (see the assay terms above), in which the onlist file URLs will be validated.

There are 2 options to validate seqspec files.

Option 1: Using seqspec native tool

# To validate on Level 1 (schema, content, and fastq file references)
seqspec check -s igvf_onlist_skip yaml.gz

# To validate on Level 2 (schema, content, fastq, and onlist file references)
seqspec check -s igvf yaml.gz
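
When many files need checking, the two commands above can be wrapped in a small script. This is a sketch that assumes the seqspec command-line tool is installed and on the PATH; it only reuses the two -s values shown above. The function names are hypothetical, and nothing is assumed about the tool's output format or exit codes.

```python
import subprocess

# The -s values are copied from the two commands in this section.
SKIP_MODES = {1: "igvf_onlist_skip", 2: "igvf"}


def seqspec_check_command(spec_path: str, level: int) -> list:
    """Build the argument list for a Level 1 or Level 2 check."""
    return ["seqspec", "check", "-s", SKIP_MODES[level], spec_path]


def run_seqspec_check(spec_path: str, level: int) -> subprocess.CompletedProcess:
    """Run the check and return the completed process for inspection."""
    return subprocess.run(
        seqspec_check_command(spec_path, level),
        capture_output=True,
        text=True,
    )
```

The caller can then inspect the stdout and stderr attributes of the returned object to read whatever the tool reported.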

Option 2: Using IGVF local checkfiles

Install IGVF-DACC checkfiles from https://github.com/IGVF-DACC/checkfiles.git and follow the instructions at how to run local checkfiles. This option runs the same file validation system as the IGVF data portal on your local computer. If you only need to validate seqspec YAML files, simply install all the requirements using the command in Step 1; no additional dependencies are needed. Then follow the command for local checkfiles validation of seqspec YAML files in the section titled "Validate seqspec yaml file while skip onlist files check".

# Clone the repo and move into it (the paths below are relative to the repo root)
git clone https://github.com/IGVF-DACC/checkfiles.git
cd checkfiles

# Install requirements after creating a virtual environment
pip install -r src/checkfiles/requirements.txt

# Run seqspec yaml validation on Level 1 (schema, content, and fastq file references)
python src/checkfiles/checkfiles_local.py --input_file_path src/tests/data/seqspec_valid.yaml.gz --file_format yaml --content_type seqspec --onlist_skip --md5sum f1859dd9d60554a8f8ab63b65b458267

# Run seqspec yaml validation on Level 2 (schema, content, fastq, and onlist file references)
python src/checkfiles/checkfiles_local.py --input_file_path src/tests/data/seqspec_valid.yaml.gz --file_format yaml --content_type seqspec --md5sum f1859dd9d60554a8f8ab63b65b458267
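
For your own files, the value passed to --md5sum has to be computed first. The sketch below assumes the option expects the MD5 hex digest of the exact file given to --input_file_path (the gzipped YAML); confirm this against the checkfiles instructions. The function name is hypothetical.

```python
import hashlib


def md5_hex(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can be pasted after --md5sum in the commands above.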