File formats overview

The IGVF file schema is separated into 10 categories: (1) Alignment File, (2) Configuration File, (3) Genome Browser Annotation File, (4) Image File, (5) Matrix File, (6) Reference File, (7) Sequence File, (8) Signal File, (9) Tabular File, (10) Model File.

Each category has its own set of required and optional metadata properties to describe a file. See the full schemas at https://data.igvf.org/profiles under “Files”.

| File Category | Schema Link | Use Cases | File Formats* | Content Type |
| --- | --- | --- | --- | --- |
| Alignment File | alignment_file schema | Results of sequenced reads mapped to reference sequence. | bam | alignments, transcriptome alignments |
| Configuration File | configuration_file schema | A file containing configuration settings or information defining the structure of other data files' content. | yaml | seqspec |
| Genome Browser Annotation File | genome_browser_annotation schema | A file containing configuration settings or information defining the structure of other data files' content. | bigBed, tabix | peaks |
| Image File | image_file | A file containing image data. | jpg, png | detected tissue, low resolution tissue, high resolution tissue, fiducial alignment |
| Matrix File | matrix_file | A file containing quantification data in a multi-dimension format. | h5ad, hdf5, mtx, tar | contact matrix, sparse gene count matrix, sparse peak count matrix, sparse transcript count matrix, transcriptome annotations |
| Reference File | reference_file schema | A file containing diverse reference related information. | bed, csv, dat, fasta, gaf, gds, gtf, obo, owl, PWM, tar, tsv, txt, vcf, xml | exclusion list, inclusion list, genes, variants, proteins, etc. |
| Sequence File | sequence_file schema | Raw sequence files received from the sequencer. | bam, fastq | reads, PacBio subreads, Nanopore reads |
| Signal File | signal_file schema | A file containing analyzed sequencing data in signal form using a bigWig format. | bigWig | signal, signal of all reads, signal of unique reads, signal p-value, raw signal, read-depth signal, control signal, fold over change control |
| Tabular File | tabular_file schema | A file containing textual data with a tabular structure. | bed, csv, gtf, tsv, txt, vcf | barcode to element mapping, barcode to sample mapping, element quantifications, elements reference, fold over change control, guide quantifications, guide RNA sequences, peaks, protein to protein interaction score, sequence barcodes, variant effects, variant to element mapping |
| Model File | model_file schema | A file containing a trained model. | hdf5, json, tar, tsv | edge weights, graph structure, position weight matrix |

*See file format definitions below.
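As a rough illustration of how the category table can be used, the sketch below transcribes its "File Formats" column into a Python dictionary and looks up which categories accept a given format. The names `ALLOWED_FORMATS` and `categories_for_format` are hypothetical and are not part of any IGVF API; the schemas at the profiles link remain the authoritative source.

```python
# Illustrative lookup transcribed from the category table above.
# This is NOT an official IGVF API; names here are hypothetical.
ALLOWED_FORMATS = {
    "Alignment File": {"bam"},
    "Configuration File": {"yaml"},
    "Genome Browser Annotation File": {"bigBed", "tabix"},
    "Image File": {"jpg", "png"},
    "Matrix File": {"h5ad", "hdf5", "mtx", "tar"},
    "Reference File": {
        "bed", "csv", "dat", "fasta", "gaf", "gds", "gtf", "obo",
        "owl", "PWM", "tar", "tsv", "txt", "vcf", "xml",
    },
    "Sequence File": {"bam", "fastq"},
    "Signal File": {"bigWig"},
    "Tabular File": {"bed", "csv", "gtf", "tsv", "txt", "vcf"},
    "Model File": {"hdf5", "json", "tar", "tsv"},
}


def categories_for_format(file_format):
    """Return the sorted names of categories whose row lists file_format."""
    return sorted(
        category
        for category, formats in ALLOWED_FORMATS.items()
        if file_format in formats
    )


if __name__ == "__main__":
    # A bam file can be either an alignment file or a sequence file.
    print(categories_for_format("bam"))
```

A format that appears in several rows (bam, tar, tsv, and others) cannot be assigned to a category by extension alone; the content type decides.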

Commonly used file formats in IGVF

.fasta - text-based file format that uses single-letter codes to represent nucleotide sequences of nucleic acids.
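A minimal sketch of reading FASTA text, assuming the common layout in which each record starts with a ">" header line followed by one or more sequence lines. The function name `read_fasta` is illustrative only.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence may be wrapped over several lines; collect the pieces.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    example = [">seq1 demo", "ACGT", "TTGA", ">seq2", "GG"]
    for name, sequence in read_fasta(example):
        print(name, len(sequence))
```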

.fastq - text-based file format that stores nucleotide sequences with their corresponding quality scores.
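To make the pairing of sequence and quality concrete, here is a small sketch that reads four-line FASTQ records (header, sequence, "+" separator, quality string). It assumes unwrapped records and Phred+33 quality encoding, which is the usual modern convention but is an assumption here; `read_fastq` is a hypothetical helper name.

```python
def read_fastq(lines, offset=33):
    """Yield (read_id, sequence, quality_scores) from 4-line FASTQ records.

    Assumes Phred+offset encoding (offset=33 by default) and that
    sequence and quality are not wrapped across lines.
    """
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, "").strip()
        separator = next(it, "").strip()
        quality = next(it, "").strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Each quality character encodes one score as ASCII code minus offset.
        yield header[1:], sequence, [ord(ch) - offset for ch in quality]


if __name__ == "__main__":
    example = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
    for read_id, seq, scores in read_fastq(example):
        print(read_id, seq, scores)
```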

.bam - compressed binary version of Sequence Alignment/Map (SAM) file format, a compact and indexable representation of nucleotide sequence alignments.

.gtf (Gene Transfer Format) - tab-delimited file format primarily used to describe genes and transcripts annotated on a reference sequence. .gtf files contain 9 mandatory columns: (1) Reference sequence, (2) Source, (3) Feature, (4) Start, (5) End, (6) Score, (7) Strand, (8) Phase, (9) Attribute/Group (this field includes gene_id and transcript_id values).
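The nine-column layout can be sketched as a small parser. The dictionary keys below are illustrative labels for the columns listed above, and the attribute parsing only handles quoted values of the form key "value"; unquoted values are skipped in this simplified version.

```python
import re

# Illustrative keys for the 9 mandatory GTF columns, in order.
GTF_COLUMNS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "phase", "attribute",
]


def parse_gtf_line(line):
    """Parse one tab-delimited GTF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Collect quoted attributes such as gene_id "G1"; transcript_id "T1";
    record["attributes"] = dict(
        re.findall(r'(\w+)\s+"([^"]*)"', record["attribute"])
    )
    return record


if __name__ == "__main__":
    line = 'chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";'
    parsed = parse_gtf_line(line)
    print(parsed["feature"], parsed["attributes"]["gene_id"])
```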

.bed (Browser Extensible Data) - tab-delimited file format that includes information on sequences that are visualizable in a genome browser. The three mandatory fields in a BED file are:

| Field name | Definition |
| --- | --- |
| Chromosome | The name of the chromosome or scaffold that the feature is on. |
| Start | The start position of the feature, in 0-based genomic coordinates. |
| End | The end position of the feature. The interval is half-open, so the base at this position is not part of the feature. |
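A short sketch of reading the three mandatory BED fields follows. It relies on the standard BED convention that the start is 0-based and the end is exclusive (a half-open interval), so a feature's length is simply end minus start. The function names are hypothetical.

```python
def parse_bed_line(line):
    """Return (chromosome, start, end) from a tab-delimited BED line.

    Extra optional columns (name, score, strand, ...) are ignored.
    """
    chromosome, start, end = line.rstrip("\n").split("\t")[:3]
    start, end = int(start), int(end)
    if start < 0 or end < start:
        raise ValueError(f"invalid interval {start}-{end}")
    return chromosome, start, end


def feature_length(start, end):
    """Length in bases of a half-open [start, end) interval."""
    return end - start


if __name__ == "__main__":
    chrom, start, end = parse_bed_line("chr1\t0\t100\tpeak1\n")
    # The first 100 bases of chr1: positions 0 through 99.
    print(chrom, feature_length(start, end))
```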

.wig - text-based file format that allows for plotting quantitative data as either shades of color (dense mode) or bars of varying height (full and pack mode) on the genome.

.bigBed - indexed binary file format for rapid display of information encoded in .bed files.

.bigWig - indexed binary file format for rapid display of information encoded in .wig files.

.vcf (Variant Call Format) - text-based file format that stores information about genetic variants, such as the position of the variant, the type of variant, the reference allele, the alternative allele, and the quality score of the variant. The 8 mandatory fields in a VCF file are:

| Field name | Definition |
| --- | --- |
| #CHROM | The chromosome or contig on which the variant is located. |
| POS | The position of the variant on the chromosome. |
| ID | A unique identifier for the variant. |
| REF | The reference allele at the variant position. |
| ALT | The alternative allele(s) at the variant position. |
| QUAL | The quality score of the variant. |
| FILTER | A filter that indicates whether or not the variant is considered to be a high-quality variant. |
| INFO | A field that contains additional information about the variant, such as the impact of the variant on the protein sequence. |
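The eight mandatory fields can be read from a data line as sketched below. This is a simplified illustration, not a full VCF parser: it ignores header metadata and any genotype columns after INFO, and it keeps INFO values as strings. The name `parse_vcf_line` is hypothetical.

```python
VCF_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Parse the 8 mandatory fields of a VCF data line.

    Returns None for header lines (those starting with '#').
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 columns, found {len(fields)}")
    record = dict(zip(VCF_FIELDS, fields[:8]))
    record["POS"] = int(record["POS"])
    # Multiple alternative alleles are comma-separated.
    record["ALT"] = record["ALT"].split(",")
    # A '.' means the value is missing.
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    info = {}
    for item in record["INFO"].split(";"):
        if item and item != ".":
            key, _, value = item.partition("=")
            # Keys without '=' are flags; store them as True.
            info[key] = value if value else True
    record["INFO"] = info
    return record


if __name__ == "__main__":
    row = parse_vcf_line("chr1\t12345\trs1\tA\tG,T\t50\tPASS\tDP=10;DB")
    print(row["POS"], row["ALT"], row["INFO"])
```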

.hic - a binary file format for storing contact matrices and annotations of chromatin structural features generated from Hi-C or other proximity mapping assays.

.yaml - a file used to store data in a structured manner using human-readable text, often employed for configuration files and data serialization.

.h5ad - a file format used to store annotated single-cell RNA sequencing data, structured in HDF5 format, and commonly used in bioinformatics and computational biology research.

.mtx - a sparse matrix file in Matrix Market format. It is commonly used for Cell Ranger outputs.
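For orientation, the sketch below reads a sparse matrix stored in the Matrix Market coordinate layout: comment lines beginning with "%", one size line giving rows, columns, and the number of stored entries, then one "row column value" line per entry with 1-based indices. It is a simplified, standard-library-only illustration with a hypothetical function name, and it does not handle symmetric or pattern-only matrices.

```python
def read_mtx(lines):
    """Read a coordinate-format Matrix Market matrix.

    Returns ((n_rows, n_cols), entries) where entries maps
    0-based (row, column) tuples to float values.
    """
    data_lines = [
        ln.strip() for ln in lines
        if ln.strip() and not ln.lstrip().startswith("%")
    ]
    # First non-comment line: number of rows, columns, and stored entries.
    n_rows, n_cols, n_entries = (int(x) for x in data_lines[0].split())
    entries = {}
    for ln in data_lines[1:]:
        row, col, value = ln.split()
        # The file uses 1-based indices; convert to 0-based.
        entries[(int(row) - 1, int(col) - 1)] = float(value)
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return (n_rows, n_cols), entries


if __name__ == "__main__":
    example = [
        "%%MatrixMarket matrix coordinate integer general",
        "% genes x cells",
        "3 2 2",
        "1 1 5",
        "3 2 7",
    ]
    shape, entries = read_mtx(example)
    print(shape, entries)
```

In practice a library reader such as `scipy.io.mmread` would normally be preferred over hand-written parsing.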

.dat - a generic data file.

.gaf - a tab-delimited plain text file capturing the association between gene products and GO terms.

.gds - a file that can efficiently store genomic data and provide fast random access to subsets of the data.

.obo - a text file used by OBO-Edit, the open-source, platform-independent application for viewing and editing ontologies.

.owl - a file that can represent rich knowledge about resources and the relationships between them.

.PWM - a position weight matrix file.

.xml - a file written in markup language often used to store hierarchical data or define configurations.

.tabix - an index file used with tab-delimited text files in genomics, enabling rapid retrieval of specific data records based on genomic coordinates.

.hdf5 - a hierarchical data format file used for storing and organizing large amounts of data in a structured manner, commonly used in scientific computing and data-intensive applications.

.tar - an archive format used to consolidate multiple files into a single file for efficient storage.