RSAT - variation-info manual

NAME

variation-info

VERSION

$program_version

DESCRIPTION

Taking as input variation IDs (rs numbers) or genomic regions of a given genome, variation-info retrieves the corresponding variant information in varBed format.

AUTHORS

Alejandra Medina Rivera <amedina@liigh.unam.mx>
Jacques van Helden <Jacques.van-Helden@univ-amu.fr>

CATEGORY

util

USAGE

variation-info [-i inputfile] [-o outputfile] [-v #] [-format variation_format] [-col ID_column] [-mml #] [...]

INPUT FORMAT

Genomic coordinate file

The option -i specifies a genomic coordinate file in bed format. The program only takes into account the first 3 columns of the bed file, which specify the genomic coordinates.

Note (from Jacques van Helden): the UCSC genome browser adopts a somewhat inconsistent convention for start and end coordinates: the start position is zero-based (the first nucleotide of a chromosome/scaffold has coordinate 0), but the end position is considered not included in the selection. This is equivalent to having a zero-based coordinate for the start and a one-based coordinate for the end.

Example of bed file

 chr1   3473041 3473370
 chr1   4380371 4380650
 chr1   4845581 4845781
 chr1   4845801 4846260

The definition of the BED format is provided on the UCSC Genome Browser web site (http://genome.ucsc.edu/FAQ/FAQformat#format1).

The three columns taken into account by the program are described below.

1. chrom

The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).

2. chromStart

The starting position of the feature in the chromosome or scaffold. For RSAT programs, the first base in a chromosome is numbered 1 (this differs from the UCSC-specific zero-based notation for the start).

Note from Jacques van Helden: we find the UCSC representation (zero-based start, end position not included in the selection) completely counter-intuitive, and we therefore decided to adopt a "normal" convention, where:

- start and end positions represent the first and last positions included in the region of interest.
- start and end positions are provided in one-based notation (first base of a chromosome or contig has coordinate 1).

3. chromEnd

The ending position of the feature in the chromosome or scaffold.
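
The difference between the two conventions described above comes down to shifting the start coordinate by one. The Python sketch below only illustrates that arithmetic; the function names are hypothetical (they are not part of RSAT), and nothing here is meant to describe how variation-info treats its input internally.

```python
def ucsc_bed_to_one_based(chrom, start, end):
    """Convert a UCSC-style interval (zero-based start, end not included)
    to one-based coordinates where start and end are both included."""
    if start < 0 or end <= start:
        # Zero-length intervals cannot be expressed with an included end.
        raise ValueError(f"invalid UCSC interval: {chrom}:{start}-{end}")
    # Only the start shifts: an excluded zero-based end has the same
    # value as an included one-based end.
    return chrom, start + 1, end


def one_based_to_ucsc_bed(chrom, start, end):
    """Inverse conversion: one-based inclusive to UCSC-style coordinates."""
    if start < 1 or end < start:
        raise ValueError(f"invalid one-based interval: {chrom}:{start}-{end}")
    return chrom, start - 1, end


if __name__ == "__main__":
    # The first 100 bases of chr1 in each convention.
    print(ucsc_bed_to_one_based("chr1", 0, 100))
    print(one_based_to_ucsc_bed("chr1", 1, 100))
```

In both conventions the length of the region can be recovered, as end - start for the UCSC form and end - start + 1 for the one-based inclusive form.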

Variation file

See the output format of download-ensembl-variations.

Variation ID list

A tab-delimited file with the variation IDs in one column (see the option -col).

OUTPUT FORMAT

The varBed format is a tab-delimited format that facilitates access to relevant variant information. The file includes the following columns:

1) chr

Chromosome number (without "chr")

2) start

Start position of the variation

3) end

End position of the variation

4) strand

Strand on which the variation was annotated

5) id

Variant ID (rs number)

6) ref

Reference allele

7) alt

Alternative allele

8) so_term validate

Validation status of the variant: 1 if it is supported by evidence

9) minor_allele_freq

Frequency of the alternative allele

10) is_supvar

1 if this variant was constructed from overlapping variants

11) in_supvar

1 if this variant overlaps with other annotated variants
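
As an illustration of the column layout listed above, here is a minimal Python sketch that reads one varBed line. It is not RSAT code: the names VarBedRecord and parse_varbed_line are hypothetical, the sample line is made up for the example, and only the first seven columns are interpreted. The remaining fields are kept as raw strings, and header or comment lines are not handled.

```python
from typing import List, NamedTuple


class VarBedRecord(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str
    var_id: str
    ref: str
    alt: str
    extra: List[str]  # columns after "alt", left uninterpreted


def parse_varbed_line(line):
    """Split one tab-delimited varBed line into its leading fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 7:
        raise ValueError(f"expected at least 7 columns, got {len(fields)}")
    chrom, start, end, strand, var_id, ref, alt = fields[:7]
    return VarBedRecord(chrom, int(start), int(end), strand,
                        var_id, ref, alt, fields[7:])


if __name__ == "__main__":
    # Made-up line for illustration only; not real variation data.
    sample = "1\t1000\t1000\t+\trs0000001\tA\tG\tx\t1\n"
    record = parse_varbed_line(sample)
    print(record.var_id, record.chrom, record.start, record.ref, record.alt)
```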

SEE ALSO

download-ensembl-genome

Install organisms from Ensembl genomes.

download-ensembl-variations

Get variation coordinates from Ensembl. The variant information obtained with this tool is retrieved by variation-info.

convert-variations

Convert between different variation data file types. variation-info retrieves variants in varBed format; convert-variations can be used to convert them to vcf and gvf formats.

retrieve-variation-seq

Given a set of regions, variant IDs (rs numbers) or variants in varBed format, retrieve-variation-seq will retrieve the corresponding genomic sequence surrounding the genetic variants.

variation-scan

Scan variation sequences with one or several position-specific scoring matrices.

OPTIONS

-v #

Level of verbosity (detail in the warning messages during execution)

-h

Display full help message

-help

Same as -h

-species species_name

Species name. This name must correspond to the species of the variation/bed/id file if provided.

-species_suffix

Species name. This name must correspond to the species of the variation/bed/id file if provided.

-e_version #

The version of the Ensembl database (e.g. 72).

Note: each Ensembl version contains a specific assembly version for each species. When the option -e_version is used, the option -assembly should thus in principle not be used.

-assembly #

Assembly version (e.g. GRCh37 for the assembly 37 of the Human genome).

Note: genome assemblies can cover several successive Ensembl versions. In case of ambiguity, the latest corresponding Ensembl version is used.

-i inputfile

If no input file is specified, the standard input is used. This allows the command to be used within a pipe.

-format variation_format

Format of the input file

Supported formats:

varBed

Format of variation files used by all RSAT scripts.

id

tab-delimited file with all variation IDs in a given column, which can be specified by the option -col.

bed

General format for the description of genomic features (see https://genome.ucsc.edu/FAQ/FAQformat.html#format1).

-col #

Column containing the variation IDs with the input format "id".

Default: 1
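
To make the role of the column number concrete, the Python sketch below extracts IDs from a tab-delimited file, counting columns from 1 as the option does. It only mirrors the behaviour described above and is not taken from the program; read_variation_ids is a hypothetical helper, and the IDs in the example are made up.

```python
import io


def read_variation_ids(handle, col=1):
    """Return the values found in column `col` (1-based) of a
    tab-delimited stream, skipping empty lines."""
    if col < 1:
        raise ValueError("column numbers start at 1")
    ids = []
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) < col:
            raise ValueError(f"line {line_number} has no column {col}")
        ids.append(fields[col - 1])
    return ids


if __name__ == "__main__":
    # Made-up IDs for illustration; the second column holds the IDs.
    text = "sample_a\trs0000001\nsample_b\trs0000002\n"
    print(read_variation_ids(io.StringIO(text), col=2))
```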

-o outputfile

If no output file is specified, the standard output is used. This allows the command to be used within a pipe.
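
The options documented above can be assembled into a command line from a script. The following Python sketch builds the argument list using only flags described in this manual; build_variation_info_command is a hypothetical helper, the file names and the species value are placeholders, and the sketch does not check which combinations of options the program actually accepts.

```python
def build_variation_info_command(species=None, e_version=None, assembly=None,
                                 input_file=None, input_format=None,
                                 id_column=None, output_file=None,
                                 verbosity=None):
    """Assemble an argument list for variation-info from documented options.
    Options left as None are omitted, so the program's defaults apply."""
    options = [
        ("-species", species),
        ("-e_version", e_version),
        ("-assembly", assembly),
        ("-i", input_file),
        ("-format", input_format),
        ("-col", id_column),
        ("-o", output_file),
        ("-v", verbosity),
    ]
    command = ["variation-info"]
    for flag, value in options:
        if value is not None:
            command.extend([flag, str(value)])
    return command


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(" ".join(build_variation_info_command(
        species="Homo_sapiens", assembly="GRCh37",
        input_file="ids.txt", input_format="id", id_column=2,
        output_file="variants.varBed")))
```

On a machine where RSAT is installed, the resulting list could be passed to subprocess.run(command, check=True). Leaving out input_file and output_file keeps the standard input and output, so the command can sit inside a pipe.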