RSAT - retrieve-variation-seq manual

NAME

retrieve-variation-seq

VERSION

$program_version

DESCRIPTION

Given a set of IDs for polymorphic variations, retrieve the corresponding variants and their flanking sequences, in order to scan them with the tool variation-scan.

AUTHORS

Jeremy Delerce (M2 thesis project 2013)
Alejandra Medina Rivera <amedina@lcg.unam.mx>
Jacques van Helden <Jacques.van-Helden@univ-amu.fr>

CATEGORY

util

USAGE

 retrieve-variation-seq -species species_name (-release # | -assembly assembly_version)  \
   [-i #inputfile] [-format variation_format] \
   [-col ID_column] [-mml #] [-o outputfile] [-v #] [...]

Example

  Get variation sequence of Homo_sapiens from a bed file
    retrieve-variation-seq -v 2 \
      -species Homo_sapiens -release 72 \
      -i $RSAT/public_html/demo_files/sample_regions_for_variations_hg19.bed \
      -mml 30 \
      -o variations.varBed

INPUT FORMAT

Genomic coordinate file

The option -i specifies a genomic coordinate file in bed format. The program only takes into account the first three columns of the bed file, which specify the genomic coordinates.

Example of bed file

 chr1   3473041 3473370
 chr1   4380371 4380650
 chr1   4845581 4845781
 chr1   4845801 4846260

The definition of the BED format is provided on the UCSC Genome Browser web site (http://genome.ucsc.edu/FAQ/FAQformat#format1).

The three columns taken into account by this program are described below.

1. chrom

The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).

2. chromStart

The starting position of the feature in the chromosome or scaffold. For RSAT programs, the first base in a chromosome is numbered 1 (this differs from the UCSC-specific zero-based notation for the start).

Note from Jacques van Helden: the UCSC genome browser adopts a somewhat inconsistent convention for start and end coordinates: the start position is zero-based (the first nucleotide of a chromosome/scaffold has coordinate 0), and the end position is not included in the selection. This is equivalent to having a zero-based coordinate for the start and a one-based coordinate for the end. We find this representation completely counter-intuitive, and we therefore decided to adopt a "normal" convention, where:

start and end positions represent the first and last positions included in the region of interest;

start and end positions are provided in one-based notation (the first base of a chromosome or contig has coordinate 1).

3. chromEnd

The ending position of the feature in the chromosome or scaffold.
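
The relation between the two coordinate conventions described above can be illustrated with a minimal Python sketch. It is not part of RSAT: the function name bed_to_one_based is hypothetical, and the fourth column in the sample line is made up to show that extra columns are ignored. The sketch reads only the first three columns of a UCSC-style line and expresses them in one-based, inclusive notation.

```python
def bed_to_one_based(line):
    """Convert one UCSC-style BED line to one-based, inclusive coordinates.

    Only the first three columns (chrom, chromStart, chromEnd) are read;
    any further columns are ignored.
    """
    fields = line.split()
    chrom, start0, end_excl = fields[0], int(fields[1]), int(fields[2])
    # A zero-based start becomes one-based by adding 1. The excluded
    # zero-based end is numerically equal to the included one-based end.
    return chrom, start0 + 1, end_excl


# Hypothetical helper applied to a line with an extra (ignored) column.
region = bed_to_one_based("chr1\t3473041\t3473370\tpeak_1")
print(region)  # ('chr1', 3473042, 3473370)
```

With this conversion the region length is end - start + 1 in one-based notation, which equals chromEnd - chromStart in the UCSC notation.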

Variation file

See download-ensembl-variation output format.

Variation ID list

A tab-delimited file with the variation IDs in one column (see option -col).

OUTPUT FORMAT

A tab-delimited file with the following columns.

1. chrom

The name of the chromosome (e.g. 1, X, 8...)

2. chromStart

The starting position of the feature in the chromosome

3. chromEnd

The ending position of the feature in the chromosome

4. chromStrand

The strand of the feature in the chromosome

5. variation id

ID of the variation

6. SO term

SO term of the variation

7. ref variant

Allele of the variation in the reference sequence

8. variant

Allele of the variation in the sequence

9. allele_frequency

Allele frequency

10. sequence

Sequence of length L centered on the variation
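
As an illustration of the ten-column layout listed above, the following Python sketch splits one output row into named fields. It is only a reading aid under stated assumptions: the field identifiers and the function parse_row are hypothetical (they are not defined by RSAT), and the sample row is made up rather than taken from real program output.

```python
from collections import namedtuple

# Field names follow the column list above; the identifiers themselves
# are illustrative and are not defined by RSAT.
Variation = namedtuple(
    "Variation",
    "chrom start end strand var_id so_term ref_variant variant allele_freq sequence",
)


def parse_row(line):
    """Split one tab-delimited output row into its ten named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, found {len(fields)}")
    return Variation(*fields)


# Made-up row, for illustration only (not real program output).
row = "1\t1000\t1000\t+\tvar_0001\tSNV\tA\tG\t0.25\tCCTAGCA"
record = parse_row(row)
print(record.var_id, record.ref_variant, record.variant)  # var_0001 A G
```

All fields are kept as strings; converting positions or frequencies to numbers is left to the caller.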

SEE ALSO

download-ensembl-genome

retrieve-variation-seq uses the sequences downloaded from Ensembl using the tool download-ensembl-genome.

download-ensembl-variations

retrieve-variation-seq uses variation coordinates downloaded from Ensembl using the tool download-ensembl-variations.

variation-scan

Scan variation sequences with one or several position-specific scoring matrices.

WISH LIST

OPTIONS

-v #

Level of verbosity (detail in the warning messages during execution)

-h

Display full help message

-help

Same as -h

-species species_name

Species name. This name must correspond to the species of the variation/bed/id file if provided.

-species_suffix

Species name. This name must correspond to the species of the variation/bed/id file if provided.

-release #

The version of the Ensembl database (e.g. 72).

Note: each Ensembl version contains a specific assembly version for each species. When the option -release is used, the option -assembly should thus in principle not be used.

-assembly #

Assembly version (e.g. GRCh37 for the assembly 37 of the Human genome).

Note: genome assemblies can cover several successive Ensembl versions. In case of ambiguity, the latest corresponding Ensembl version is used.

-i input_file

Input File.

The input file specifies a list of query variations. Each row corresponds to one query.

The variations can be provided in various formats (see option -format below).

-format variation_format

Format of the input file

Supported formats:

varBed

Format of variation files used by all RSAT scripts.

id

Tab-delimited file with all variation IDs in a given column, which can be specified by the option -col.

bed

General format for the description of genomic features (see https://genome.ucsc.edu/FAQ/FAQformat.html#format1).

-mml #

Length of the longest matrix that will be used to scan the variations.

-col #

Column containing the variation IDs with the input format "id".

Default: 1
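
The way a one-based column number selects the variation IDs in a tab-delimited file can be sketched in Python as follows. This is an illustration of the input layout only, not the program's own code: the function read_ids and the sample IDs are hypothetical.

```python
import io


def read_ids(handle, col=1):
    """Return the values found in one-based column `col` of a tab-delimited file."""
    ids = []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        fields = line.split("\t")
        if len(fields) < col:
            raise ValueError(f"row has fewer than {col} columns: {line!r}")
        ids.append(fields[col - 1])  # one-based column -> zero-based index
    return ids


# Made-up content standing in for an ID file with the IDs in column 2.
demo = io.StringIO("gene_a\tvar_0001\ngene_b\tvar_0002\n")
print(read_ids(demo, col=2))  # ['var_0001', 'var_0002']
```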

-o outputfile

The output file is a tab-delimited file (see OUTPUT FORMAT above).

If no output file is specified, the standard output is used. This allows the command to be used within a pipe.