RSAT - convert-variations manual

NAME

convert-variations

VERSION

$program_version

DESCRIPTION

Convert between different file formats used to describe polymorphic variations.

/!\ To convert to VCF format, the raw genomic sequence must be installed (download-ensembl-genome).

AUTHORS

Jeremy.Delerce@univ-amu.fr

CATEGORY

util

USAGE

 convert-variations -i filename -from format -to format [-species #] [-v #] [-o #]

SUPPORTED FORMATS

GVF, VCF, varBed

Genome Variant Format (GVF)

"The Genome Variant Format (GVF) is a type of GFF3 file with additional pragmas and attributes specified. The GVF format has the same nine column tab delimited format as GFF3 and all of the requirements and restrictions specified for GFF3 apply to the GVF specification as well." (quoted from the Sequence Ontology)

http://www.sequenceontology.org/resources/gvf_1.00.html

A GVF file starts with a header providing general information about the file content: format version, date, data source, length of the chromosomes / contigs covered by the variations.

 ##gff-version 3
 ##gvf-version 1.07
 ##file-date 2014-09-21
 ##genome-build ensembl GRCh38
 ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606
 ##feature-ontology http://song.cvs.sourceforge.net/viewvc/song/ontology/so.obo?revision=1.283
 ##data-source Source=ensembl;version=77;url=http://e77.ensembl.org/Homo_sapiens
 ##file-version 77
 ##sequence-region Y 1 57227415
 ##sequence-region 17 1 83257441
 ##sequence-region 6 1 170805979
 ##sequence-region 1 1 248956422
 ## [...]
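
As an illustration, the header pragmas shown above can be read with a few lines of Python. This is only a sketch and not part of RSAT: the function name parse_gvf_header is hypothetical, and it assumes that every header line starts with "##" and that sequence-region lines carry a name, a start and an end.

```python
def parse_gvf_header(lines):
    """Collect '##' pragmas; return (pragmas, region_lengths)."""
    pragmas = {}
    region_lengths = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("##"):
            break  # the header ends at the first non-pragma line
        fields = line[2:].split(None, 1)
        if not fields:
            continue
        key = fields[0]
        value = fields[1] if len(fields) > 1 else ""
        if key == "sequence-region":
            # e.g. "Y 1 57227415" -> name, start, end
            name, start, end = value.split()
            region_lengths[name] = int(end) - int(start) + 1
        else:
            pragmas[key] = value
    return pragmas, region_lengths


header = [
    "##gff-version 3",
    "##gvf-version 1.07",
    "##genome-build ensembl GRCh38",
    "##sequence-region Y 1 57227415",
    "##sequence-region 17 1 83257441",
]
pragmas, region_lengths = parse_gvf_header(header)
print(pragmas["gvf-version"], region_lengths["Y"])
```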

This header is followed by the actual description of the variations, in a column-delimited format complying with the GFF format.

 Y       dbSNP   SNV     10015   10015   .       +       .       ID=1;variation_id=23299259;Variant_seq=C,G;Dbxref=dbSNP_138:rs113469508;allele_string=A,C,G;evidence_values=Multiple_observations;Reference_seq=A
 Y       dbSNP   SNV     10146   10146   .       +       .       ID=2;variation_id=26647928;Reference_seq=C;Variant_seq=G;evidence_values=Multiple_observations,1000Genomes;allele_string=C,G;Dbxref=dbSNP_138:rs138058540;global_minor_allele_frequency=0|0.0151515|33
 Y       dbSNP   SNV     10153   10153   .       +       .       ID=3;variation_id=21171339;Reference_seq=C;Variant_seq=G;evidence_values=Multiple_observations,1000Genomes;allele_string=C,G;Dbxref=dbSNP_138:rs111264342;global_minor_allele_frequency=1|0.00229568|5
 Y       dbSNP   SNV     10181   10181   .       +       .       ID=4;variation_id=47159994;Reference_seq=C;Variant_seq=G;evidence_values=1000Genomes;allele_string=C,G;Dbxref=dbSNP_138:rs189980076;global_minor_allele_frequency=0|0.00137741|3

The last column contains a lot of relevant information, but is not very easy to read. We should keep in mind that this format was initially defined to describe generic genomic features, so all the specific attributes come in the last column (description).

For this reason, we developed a custom tab-delimited format to store variations, which we call varBed (see description below).
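
To make the parsing cost of the last GVF column concrete, here is a minimal Python sketch that splits one GVF line into its nine columns and then unpacks the key=value attributes. It is not the code used by convert-variations: the function name parse_gvf_line and the keys of the returned dictionary are hypothetical, and the sketch assumes that the columns are separated by real tab characters, as GFF3 requires.

```python
def parse_gvf_line(line):
    """Split a GVF data line and unpack its attribute column."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqid, source, so_type, start, end, score, strand, phase, attr_text = fields
    attributes = {}
    for pair in attr_text.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    return {
        "chrom": seqid,
        "source": source,
        "type": so_type,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


# One of the example lines above, rebuilt with explicit tabs
# (some attributes omitted for brevity).
example = "\t".join([
    "Y", "dbSNP", "SNV", "10146", "10146", ".", "+", ".",
    "ID=2;variation_id=26647928;Reference_seq=C;Variant_seq=G;"
    "allele_string=C,G;Dbxref=dbSNP_138:rs138058540",
])
record = parse_gvf_line(example)
# The rs identifier is nested inside the Dbxref attribute.
rs_id = record["attributes"]["Dbxref"].split(":")[1]
print(record["chrom"], record["start"], rs_id)
```

Every attribute needs a second round of splitting (on ";" and "=", then on "," or ":" for some values), which is the overhead a flat tab-delimited format avoids.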

Variant Call Format (VCF)

http://en.wikipedia.org/wiki/Variant_Call_Format

This format was defined for the 1000 Genomes Project. It is no longer maintained. The converter supports it merely for the sake of backwards compatibility.

RSAT variation format (varBed)

Tab-delimited format with a specific column order, used as input by retrieve-variation-seq.

This format presents several advantages for scanning variations with matrices.

Tab-delimited organization

Each field comes in a separate column, so there is no need for a second parsing step such as the one required by the last column of a GVF file.

Files separated per chromosome

This is a matter of organization rather than an intrinsic property of the format (we could just as well have used chromosome-separated GVF files, or whole-genome RSAT variant files), but it speeds up the search for variants.
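
The per-chromosome organization can be sketched as follows. This is only an illustration, not the procedure used by the RSAT installation scripts: the function name group_by_chromosome is hypothetical, and the sketch assumes that the chromosome name is the first tab-separated column, which holds for GVF and is merely assumed here for other formats.

```python
from collections import defaultdict


def group_by_chromosome(lines):
    """Group variation lines by the value of their first column."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, header pragmas and comments
        chrom = line.split("\t", 1)[0]
        groups[chrom].append(line)
    return dict(groups)


lines = [
    "##gff-version 3",
    "Y\tdbSNP\tSNV\t10015\t10015\t.\t+\t.\tID=1",
    "17\tdbSNP\tSNV\t500\t500\t.\t+\t.\tID=2",
    "Y\tdbSNP\tSNV\t10146\t10146\t.\t+\t.\tID=3",
]
groups = group_by_chromosome(lines)
print(sorted(groups), len(groups["Y"]))
```

Each group could then be written to its own file, so that a search restricted to one chromosome only has to read that file.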

Combined variations

When several variants are mutually overlapping, install-ensembl-variations can compute all possible combinations of variations. However, this option may require considerable computing resources (computing time, storage), so it is disabled by default. To support combinatory variants, install-ensembl-variations must be called with the option -task combine.
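
The following Python sketch gives a rough idea of why combining variants is costly. It does not reproduce the algorithm of install-ensembl-variations: it simply assumes that each overlapping variant contributes a list of alternative alleles and enumerates their Cartesian product, with hypothetical allele lists as input.

```python
from itertools import product
from math import prod


def allele_combinations(allele_lists):
    """Return every combination taking one allele per variant."""
    return list(product(*allele_lists))


# Hypothetical example: three overlapping variants with 3, 2 and 2 alleles.
overlapping = [["A", "C", "G"], ["C", "G"], ["C", "G"]]
combinations = allele_combinations(overlapping)

# The number of combinations is the product of the allele counts,
# so it grows multiplicatively with each additional variant.
expected = prod(len(alleles) for alleles in overlapping)
print(len(combinations), expected)
```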

OUTPUT FORMAT

A tab-delimited file in the selected output format.

SEE ALSO

retrieve-variation-seq

retrieve-variation-seq retrieves variant information and sequences using Ensembl variation files obtained with the program download-ensembl-variations.

WISH LIST

OPTIONS

-v #

Level of verbosity (detail in the warning messages during execution)

-h

Display full help message

-help

Same as -h

-i inputfile

Variation file in one of the supported tab-delimited formats.

If no input file is specified, the standard input is used. This allows the command to be used within a pipe.

This option is mutually exclusive with option -u.

-u input_file_URL

Use as input a file available on a remote Web server.

This option is mutually exclusive with option -i.

-from #

Format of the input file. Supported: vcf, gvf, varBed.

-to #

Format of the output file. Supported: vcf, gvf, varBed.

-species species_name

Species from which the variations originate (e.g. Homo_sapiens, Mus_musculus).

-release #

The Ensembl release number.

-assembly #

The assembly version of the species' genome.

-o outputfile

If no output file is specified, the standard output is used. This allows the command to be used within a pipe.
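
The options above can be assembled programmatically. The Python sketch below builds an argument list using only the flags documented in this manual and enforces the mutual exclusion of -i and -u. The helper name build_command and the file names are hypothetical, and the program name is assumed to be convert-variations.

```python
SUPPORTED_FORMATS = {"vcf", "gvf", "varBed"}


def build_command(from_format, to_format, input_file=None, input_url=None,
                  output_file=None, species=None, verbosity=None):
    """Return the argument list for one conversion."""
    if from_format not in SUPPORTED_FORMATS or to_format not in SUPPORTED_FORMATS:
        raise ValueError("format must be one of vcf, gvf, varBed")
    if input_file and input_url:
        raise ValueError("options -i and -u are mutually exclusive")
    command = ["convert-variations", "-from", from_format, "-to", to_format]
    if input_file:
        command += ["-i", input_file]
    if input_url:
        command += ["-u", input_url]
    if species:
        command += ["-species", species]
    if verbosity is not None:
        command += ["-v", str(verbosity)]
    if output_file:
        command += ["-o", output_file]
    return command


command = build_command("gvf", "varBed", input_file="in.gvf",
                        species="Homo_sapiens", output_file="out.varBed")
print(" ".join(command))
# With RSAT installed, the list could be run with
# subprocess.run(command, check=True).
```

When neither input_file nor output_file is given, the resulting command reads from the standard input and writes to the standard output, so it can be placed in a pipe.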