The IGVF file schema is divided into 10 categories: (1) Alignment File, (2) Configuration File, (3) Genome Browser Annotation File, (4) Image File, (5) Matrix File, (6) Reference File, (7) Sequence File, (8) Signal File, (9) Tabular File, and (10) Model File.
Each category has its own set of required and optional metadata properties that describe a file. See the full schema at https://data.igvf.org/profiles under “Files”.
File Category | Schema Link | Use Cases | File Formats | Content Type |
---|---|---|---|---|
Alignment File | alignment_file schema | Results of sequenced reads mapped to reference sequence. | bam | alignments, transcriptome alignments |
Configuration File | configuration_file schema | A file containing configuration settings or information defining the structure of other data files' content. | yaml | seqspec |
Genome Browser Annotation File | genome_browser_annotation schema | A file containing genomic annotations formatted for display in a genome browser. | bigBed, tabix | peaks |
Image File | image_file schema | A file containing image data. | jpg, png | detected tissue, low resolution tissue, high resolution tissue, fiducial alignment |
Matrix File | matrix_file schema | A file containing quantification data in a multi-dimensional format. | h5ad, hdf5, mtx, tar | contact matrix, sparse gene count matrix, sparse peak count matrix, sparse transcript count matrix, transcriptome annotations |
Reference File | reference_file schema | A file containing diverse reference related information. | bed, csv, dat, fasta, gaf, gds, gtf, obo, owl, PWM, tar, tsv, txt, vcf, xml | exclusion list, inclusion list, genes, variants, proteins, etc. |
Sequence File | sequence_file schema | Raw sequence files received from the sequencer. | bam, fastq | reads, PacBio subreads, Nanopore reads |
Signal File | signal_file schema | A file containing analyzed sequencing data in signal form, stored in bigWig format. | bigWig | signal, signal of all reads, signal of unique reads, signal p-value, raw signal, read-depth signal, control signal, fold change over control |
Tabular File | tabular_file schema | A file containing textual data with a tabular structure. | bed, csv, gtf, tsv, txt, vcf | barcode to element mapping, barcode to sample mapping, element quantifications, elements reference, fold change over control, guide quantifications, guide RNA sequences, peaks, protein to protein interaction score, sequence barcodes, variant effects, variant to element mapping |
Model File | model_file schema | A file containing a trained model. | hdf5, json, tar, tsv | edge weights, graph structure, position weight matrix |
*See file format definitions below.
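The category-to-format relationships in the table can be expressed as a simple lookup. The following is a minimal sketch, not part of the IGVF tooling: the dictionary is transcribed from the table above, and the names `FORMATS_BY_CATEGORY` and `categories_for_format` are hypothetical. It shows that a single format (for example bam or tar) can belong to more than one category, so the format alone does not determine the category.

```python
# File formats accepted per file category, transcribed from the table above.
# The names used here are illustrative and are not taken from the IGVF schema.
FORMATS_BY_CATEGORY = {
    "Alignment File": {"bam"},
    "Configuration File": {"yaml"},
    "Genome Browser Annotation File": {"bigBed", "tabix"},
    "Image File": {"jpg", "png"},
    "Matrix File": {"h5ad", "hdf5", "mtx", "tar"},
    "Reference File": {
        "bed", "csv", "dat", "fasta", "gaf", "gds", "gtf", "obo",
        "owl", "PWM", "tar", "tsv", "txt", "vcf", "xml",
    },
    "Sequence File": {"bam", "fastq"},
    "Signal File": {"bigWig"},
    "Tabular File": {"bed", "csv", "gtf", "tsv", "txt", "vcf"},
    "Model File": {"hdf5", "json", "tar", "tsv"},
}


def categories_for_format(file_format: str) -> list[str]:
    """Return every category whose format list contains file_format.

    The comparison ignores case, so "bigwig" matches "bigWig".
    """
    wanted = file_format.lower()
    return [
        category
        for category, formats in FORMATS_BY_CATEGORY.items()
        if wanted in {fmt.lower() for fmt in formats}
    ]


if __name__ == "__main__":
    for fmt in ("bam", "tar", "bigwig"):
        print(fmt, "->", categories_for_format(fmt))
```

Because several formats map to multiple categories, the content type column is what distinguishes, for instance, a bam Alignment File from a bam Sequence File.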
.fasta - text-based file format that uses single-letter codes to represent nucleotide sequences of nucleic acids.
.fastq - text-based file format that stores nucleotide sequences with their corresponding quality scores.
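To illustrate how sequences and quality scores are paired in a FASTQ file, here is a minimal parsing sketch. It assumes the common layout of exactly four lines per record and Phred+33 quality encoding; both are assumptions about the input, and the names `FastqRecord` and `parse_fastq` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: list  # one integer Phred score per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming four lines per record."""
    # Drop blank lines and trailing newlines before grouping into records.
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input is truncated: expected 4 lines per record")
    for i in range(0, len(rows), 4):
        header, sequence, separator, quality = rows[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33.
        scores = [ord(char) - 33 for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


if __name__ == "__main__":
    example = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]  # made-up record
    for record in parse_fastq(example):
        print(record)
```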
.bam - compressed binary version of Sequence Alignment/Map (SAM) file format, a compact and indexable representation of nucleotide sequence alignments.
.gtf (Gene Transfer Format) - tab-delimited file format primarily used to describe gene and transcript features on a reference sequence. .gtf files contain 9 mandatory columns: (1) Reference sequence, (2) Source, (3) Feature, (4) Start, (5) End, (6) Score, (7) Strand, (8) Phase (called frame in the GTF specification), (9) Attribute/Group (this field includes a gene_id or transcript_id value).
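The nine-column layout can be read with a few lines of code. This is a minimal sketch rather than a full GTF parser: the column names and the function `parse_gtf_line` are hypothetical, the sample line is made up, and the attribute handling assumes values do not contain semicolons and that each key appears once.

```python
GTF_COLUMNS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "phase", "attribute",
]


def parse_gtf_line(line: str) -> dict:
    """Split one GTF data line into its 9 columns and parse the attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"Expected 9 tab-delimited columns, found {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    # GTF coordinates are 1-based and inclusive at both ends.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds 'key "value";' pairs such as gene_id and transcript_id.
    attributes = {}
    for item in record["attribute"].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    record["attribute"] = attributes
    return record


if __name__ == "__main__":
    # Made-up example line; identifiers are placeholders.
    line = 'chr1\texample_source\texon\t100\t200\t.\t+\t.\tgene_id "GENE0001"; transcript_id "TX0001";'
    print(parse_gtf_line(line))
```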
.bed (Browser Extensible Data) - tab-delimited file format that includes information on sequences that are visualizable in a genome browser. The three mandatory fields in a BED file are:
field name | definition |
---|---|
Chromosome | The name of the chromosome or scaffold that the feature is on. |
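Start | The start position of the feature, in 0-based genomic coordinates (the first base of a chromosome is numbered 0). |
End | The end position of the feature. The end is exclusive: the base at this position is not part of the feature, so the feature length is End minus Start. |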
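BED coordinates are a frequent source of off-by-one errors, so a short sketch may help. It assumes the standard BED convention of a 0-based start and an exclusive end; the names `BedInterval`, `parse_bed_line`, and `to_one_based_inclusive` are hypothetical, and the sample line is made up.

```python
from typing import NamedTuple


class BedInterval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive

    @property
    def length(self) -> int:
        # With a half-open interval the length is simply end - start.
        return self.end - self.start


def parse_bed_line(line: str) -> BedInterval:
    """Read the three mandatory fields from one tab-delimited BED line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("A BED line needs at least chrom, start, and end")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start < 0 or end < start:
        raise ValueError(f"Invalid BED interval: {start}-{end}")
    return BedInterval(chrom, start, end)


def to_one_based_inclusive(interval: BedInterval) -> tuple:
    """Convert to the 1-based, inclusive convention used by GTF and VCF."""
    return (interval.chrom, interval.start + 1, interval.end)


if __name__ == "__main__":
    interval = parse_bed_line("chr1\t0\t100\tfeature1")
    print(interval, interval.length)         # first 100 bases of chr1
    print(to_one_based_inclusive(interval))  # same bases as positions 1..100
```

Converting between conventions only shifts the start by one; the end value is numerically the same in both systems.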
.wig - text-based file format that allows for plotting quantitative data as either shades of color (dense mode) or bars of varying height (full and pack mode) on the genome.
.bigBed - indexed binary file format for rapid display of information encoded in .bed files.
.bigWig - indexed binary file format for rapid display of information encoded in .wig files.
.vcf (Variant Call Format) - text-based file format that stores information about genetic variants, such as the position of the variant, the type of variant, the reference allele, the alternative allele, and the quality score of the variant. The 8 mandatory fields in a VCF file are:
field name | definition |
---|---|
#CHROM | The chromosome or contig on which the variant is located. |
POS | The 1-based position of the variant on the chromosome. |
ID | A unique identifier for the variant. |
REF | The reference allele at the variant position. |
ALT | The alternative allele(s) at the variant position. |
QUAL | The quality score of the variant. |
FILTER | A filter that indicates whether or not the variant is considered to be a high-quality variant. |
INFO | A field that contains additional information about the variant, such as the impact of the variant on the protein sequence. |
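The eight mandatory VCF fields can be read from a data line as sketched below. This is a simplified illustration, not a replacement for a dedicated VCF library: the function name `parse_vcf_line` is hypothetical, the sample line is made up, and INFO values are kept as strings without consulting the header definitions.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line: str):
    """Parse the 8 mandatory fields of a VCF data line.

    Returns None for header lines, which start with '#'.
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"Expected at least 8 columns, found {len(fields)}")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])  # 1-based position
    # A '.' marks a missing value; ALT may list several alleles.
    record["ALT"] = [] if record["ALT"] == "." else record["ALT"].split(",")
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    # INFO is a ';'-separated list of key=value pairs or bare flags.
    info = {}
    if record["INFO"] != ".":
        for item in record["INFO"].split(";"):
            key, sep, value = item.partition("=")
            info[key] = value if sep else True
    record["INFO"] = info
    return record


if __name__ == "__main__":
    # Made-up example line for illustration only.
    print(parse_vcf_line("chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=10;DB"))
```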
.hic - a binary file format for storing contact matrices and annotations of chromatin structural features generated from Hi-C or other proximity mapping assays.
.yaml - a file used to store data in a structured manner using human-readable text, often employed for configuration files and data serialization.
.h5ad - an HDF5-based file format for storing annotated data matrices (AnnData objects), commonly used for single-cell RNA sequencing data.
.mtx - a sparse matrix file in Matrix Market format. It is commonly used for Cell Ranger outputs.
.dat - a generic data file.
.gaf - a tab-delimited plain text file capturing the association between gene products and GO terms.
.gds - a file that can efficiently store genomic data and provide fast random access to subsets of the data.
.obo - a text file used by OBO-Edit, the open-source, platform-independent application for viewing and editing ontologies.
.owl - a file that can represent rich knowledge about resources and the relationships between them.
.PWM - a position weight matrix file.
.xml - a file written in markup language often used to store hierarchical data or define configurations.
.tabix - an index file used with tab-delimited text files in genomics, enabling rapid retrieval of specific data records based on genomic coordinates.
.hdf5 - a hierarchical data format file used for storing and organizing large amounts of data in a structured manner, commonly used in scientific computing and data-intensive applications.
.tar - an archive format used to consolidate multiple files into a single file for efficient storage.