IGVF

File formats overview

The IGVF file schema is separated into 10 categories: (1) Alignment File, (2) Configuration File, (3) Genome Browser Annotation File, (4) Image File, (5) Matrix File, (6) Reference File, (7) Sequence File, (8) Signal File, (9) Tabular File, and (10) Model File.

Each category has its own set of required and optional metadata properties that describe a file. See the full schema at https://data.igvf.org/profiles under “Files”.

File Category	Schema Link	Use Cases	File Formats	Content Type
Alignment File	alignment_file schema	Results of sequenced reads mapped to reference sequence.	bam	alignments, transcriptome alignments
Configuration File	configuration_file schema	A file containing configuration settings or information defining the structure of other data files' content.	yaml	seqspec
Genome Browser Annotation File	genome_browser_annotation schema	A file containing annotations, such as peaks, in an indexed format for display in a genome browser.	bigBed, tabix	peaks
Image File	image_file schema	A file containing image data.	jpg, png	detected tissue/organ, low resolution tissue/organ, high resolution tissue/organ, fiducial alignment
Matrix File	matrix_file schema	A file containing quantification data in a multi-dimensional format.	h5ad, hdf5, mtx, tar	contact matrix, sparse gene count matrix, sparse peak count matrix, sparse transcript count matrix, transcriptome annotations
Reference File	reference_file schema	A file containing diverse reference related information.	bed, csv, dat, fasta, gaf, gds, gtf, obo, owl, PWM, tar, tsv, txt, vcf, xml	exclusion list, inclusion list, genes, variants, proteins, etc.
Sequence File	sequence_file schema	Raw sequence files received from the sequencer.	bam, fastq	reads, PacBio subreads, Nanopore reads
Signal File	signal_file schema	A file containing analyzed sequencing data in signal form, stored in bigWig format.	bigWig	signal, signal of all reads, signal of unique reads, signal p-value, raw signal, read-depth signal, control signal, fold change over control
Tabular File	tabular_file schema	A file containing textual data with a tabular structure.	bed, csv, gtf, tsv, txt, vcf	barcode to element mapping, barcode to sample mapping, element quantifications, elements reference, fold change over control, guide quantifications, guide RNA sequences, peaks, protein to protein interaction score, sequence barcodes, variant effects, variant to element mapping
Model File	model_file schema	A file containing a trained model.	hdf5, json, tar, tsv	edge weights, graph structure, position weight matrix

See file format definitions below.
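
The table above can be read as a lookup from file format to the categories that accept it. The following is a minimal sketch of that lookup; the dictionary is transcribed by hand from the table, and the names `FORMATS_BY_CATEGORY` and `categories_for_format` are hypothetical helpers for illustration, not part of the IGVF schema or any IGVF tooling.

```python
from __future__ import annotations

# File formats per category, transcribed from the table above.
# Illustrative copy only; the schema at data.igvf.org is authoritative.
FORMATS_BY_CATEGORY = {
    "Alignment File": {"bam"},
    "Configuration File": {"yaml"},
    "Genome Browser Annotation File": {"bigBed", "tabix"},
    "Image File": {"jpg", "png"},
    "Matrix File": {"h5ad", "hdf5", "mtx", "tar"},
    "Reference File": {
        "bed", "csv", "dat", "fasta", "gaf", "gds", "gtf", "obo",
        "owl", "PWM", "tar", "tsv", "txt", "vcf", "xml",
    },
    "Sequence File": {"bam", "fastq"},
    "Signal File": {"bigWig"},
    "Tabular File": {"bed", "csv", "gtf", "tsv", "txt", "vcf"},
    "Model File": {"hdf5", "json", "tar", "tsv"},
}


def categories_for_format(file_format: str) -> list[str]:
    """Return, in alphabetical order, the categories that list file_format.

    The comparison ignores case, so "bigwig" matches "bigWig".
    """
    wanted = file_format.lower()
    return sorted(
        category
        for category, formats in FORMATS_BY_CATEGORY.items()
        if wanted in {f.lower() for f in formats}
    )
```

Several formats appear in more than one category (for example, bam is listed for both Alignment File and Sequence File), so the format alone does not determine the category; the content type also matters.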

Commonly used file formats in IGVF

.fasta - text-based file format that uses single-letter codes to represent nucleotide sequences of nucleic acids.

.fastq - text-based file format that stores nucleotide sequences with their corresponding quality scores.
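
As a rough illustration of how sequences and quality scores are paired in a .fastq file, the sketch below reads records under two assumptions that are not stated above: each record occupies exactly four lines (header starting with "@", sequence, separator starting with "+", quality string), and quality characters use the Phred+33 encoding. The function names `read_fastq` and `phred_scores` are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line, expected to start with "+"
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of header, got: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert a quality string to integer scores (assumes Phred+33 by default)."""
    return [ord(char) - offset for char in quality]


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGT\n+\nIIII\n")  # made-up record
    for read_id, seq, qual in read_fastq(example):
        print(read_id, seq, phred_scores(qual))
```

Files with wrapped sequence lines or a different quality offset would need a more careful parser.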

.bam - compressed binary version of Sequence Alignment/Map (SAM) file format, a compact and indexable representation of nucleotide sequence alignments.

.gtf (Gene Transfer Format) - tab-delimited file format that is primarily used to describe gene and transcript features on a reference sequence. .gtf files contain 9 mandatory columns: (1) Reference sequence, (2) Source, (3) Feature, (4) Start, (5) End, (6) Score, (7) Strand, (8) Phase, (9) Attribute/Group (this field includes a gene_id or transcript_id value). Start and End are 1-based, and both are included in the feature.
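
The nine-column layout can be sketched as a small parser. This is an illustrative helper (`parse_gtf_line` is a hypothetical name), and it assumes the common attribute style of `key "value";` pairs with no semicolons inside quoted values. The example values in the usage lines are made up.

```python
def parse_gtf_line(line):
    """Split one GTF data line into its 9 columns and parse the attribute column."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, phase, attributes = fields

    # Attributes look like: gene_id "X"; transcript_id "Y";
    # Simplification: assumes no ';' occurs inside a quoted value.
    parsed = {}
    for item in attributes.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        parsed[key] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": parsed,
    }


if __name__ == "__main__":
    # Made-up line for illustration only.
    line = 'chrT\texample\texon\t101\t250\t.\t+\t.\tgene_id "GENE0001"; transcript_id "TX0001";'
    record = parse_gtf_line(line)
    print(record["attributes"]["gene_id"], record["end"] - record["start"] + 1)
```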

.bed (Browser Extensible Data) - tab-delimited file format that includes information on sequences that are visualizable in a genome browser. The three mandatory fields in a BED file are:

field name	definition
Chromosome	The name of the chromosome or scaffold that the feature is on.
Start	The start position of the feature, in 0-based genomic coordinates.
End	The end position of the feature. Coordinates are 0-based and half-open: the base at the End position is not included in the feature.
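
Because BED coordinates are 0-based with an exclusive end, the length of a feature is simply End minus Start. The sketch below parses the three mandatory fields and converts an interval to the 1-based, inclusive convention used by formats such as .gtf and .vcf. The function names are hypothetical, and the example interval is made up.

```python
def parse_bed_line(line):
    """Return (chromosome, start, end) from the first three BED columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least 3 tab-separated fields")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start < 0 or end < start:
        raise ValueError(f"invalid interval: {start}-{end}")
    return chrom, start, end


def feature_length(start, end):
    """Length in bases of a 0-based, half-open interval."""
    return end - start


def to_one_based_inclusive(start, end):
    """Convert a 0-based, half-open interval to 1-based, inclusive coordinates."""
    return start + 1, end


if __name__ == "__main__":
    chrom, start, end = parse_bed_line("chr1\t100\t200\tpeak1")  # made-up feature
    print(chrom, feature_length(start, end), to_one_based_inclusive(start, end))
```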

.wig - text-based file format that allows for plotting quantitative data as either shades of color (dense mode) or bars of varying height (full and pack mode) on the genome.

.bigBed - indexed binary file format for rapid display of information encoded in .bed files.

.bigWig - indexed binary file format for rapid display of information encoded in .wig files.

.vcf (Variant Call Format) - text-based file format that stores information about genetic variants, such as the position of the variant, the type of variant, the reference allele, the alternative allele, and the quality score of the variant. The 8 mandatory fields in a VCF file are:

field name	definition
#CHROM	The chromosome or contig on which the variant is located.
POS	The 1-based position of the variant on the chromosome.
ID	A unique identifier for the variant.
REF	The reference allele at the variant position.
ALT	The alternative allele(s) at the variant position.
QUAL	The quality score of the variant.
FILTER	A filter that indicates whether or not the variant is considered to be a high-quality variant.
INFO	A field that contains additional information about the variant, such as the impact of the variant on the protein sequence.
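
The eight mandatory fields can be read from a data line as sketched below. This is a simplified illustration, not a full VCF parser: `parse_vcf_line` is a hypothetical helper, it ignores any genotype columns after INFO, and it treats INFO entries without "=" as boolean flags. The example line in the usage section is made up.

```python
VCF_COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")


def parse_vcf_line(line):
    """Parse the 8 mandatory VCF fields; return None for header lines ('#...')."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 columns, got {len(fields)}")

    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])        # 1-based position
    record["ALT"] = record["ALT"].split(",")  # one or more alternative alleles
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])

    # INFO is a ';'-separated list of key=value pairs or bare flags.
    info = {}
    if record["INFO"] != ".":
        for entry in record["INFO"].split(";"):
            key, sep, value = entry.partition("=")
            info[key] = value if sep else True
    record["INFO"] = info
    return record


if __name__ == "__main__":
    # Made-up variant for illustration only.
    print(parse_vcf_line("chr1\t12345\tvar0001\tA\tG,T\t50\tPASS\tDP=10;DB"))
```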

.hic - a binary file format for storing contact matrices and annotations of chromatin structural features generated from Hi-C or other proximity mapping assays.

.yaml - a file used to store data in a structured manner using human-readable text, often employed for configuration files and data serialization.

.h5ad - an HDF5-based file format for AnnData objects, which store annotated data matrices. It is commonly used for single-cell RNA sequencing data in bioinformatics and computational biology research.

.mtx - a sparse matrix file in Matrix Market format. It is a commonly used file format for Cell Ranger outputs.
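
To show what a sparse matrix file looks like on disk, here is a minimal sketch that reads the Matrix Market coordinate layout: comment lines starting with "%", one size line (rows, columns, number of stored entries), then one "row column value" line per non-zero entry with 1-based indices. The function name `read_mtx_coordinates` is hypothetical, and real pipelines would more likely use an established reader; this sketch only illustrates the structure.

```python
def read_mtx_coordinates(lines):
    """Parse Matrix Market coordinate text into (n_rows, n_cols, entries).

    entries maps 0-based (row, col) tuples to float values.
    """
    # Drop blank lines and '%' comment/header lines.
    data_lines = [ln.strip() for ln in lines if ln.strip() and not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(x) for x in data_lines[0].split())

    entries = {}
    for ln in data_lines[1:]:
        row, col, value = ln.split()
        entries[(int(row) - 1, int(col) - 1)] = float(value)  # file indices are 1-based

    if len(entries) != n_entries:
        raise ValueError(f"expected {n_entries} entries, found {len(entries)}")
    return n_rows, n_cols, entries


if __name__ == "__main__":
    # Made-up 3 x 2 matrix with two non-zero counts.
    text = [
        "%%MatrixMarket matrix coordinate integer general",
        "3 2 2",
        "1 1 5",
        "3 2 7",
    ]
    print(read_mtx_coordinates(text))
```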

.dat - a generic data file.

.gaf - a tab-delimited plain text file capturing the association between gene products and GO terms.

.gds - a file that can efficiently store genomic data and provide fast random access to subsets of the data.

.obo - a text file used by OBO-Edit, the open-source, platform-independent application for viewing and editing ontologies.

.owl - a file that can represent rich knowledge about resources and the relationships between them.

.PWM - a position weight matrix file.

.xml - a file written in markup language often used to store hierarchical data or define configurations.

.tabix - an index file used with bgzip-compressed, tab-delimited text files in genomics, enabling rapid retrieval of specific data records based on genomic coordinates.

.hdf5 - a hierarchical data format file used for storing and organizing large amounts of data in a structured manner, commonly used in scientific computing and data-intensive applications.

.tar - an archive format used to consolidate multiple files into a single file for efficient storage.