retrieve-variation-seq
$program_version
Given a set of IDs for polymorphic variations, retrieve the corresponding variants and their flanking sequences, in order to scan them with the tool variation-scan.
retrieve-variation-seq -species species_name (-release # | -assembly assembly_version) \
[-i #inputfile] [-format variation_format] \
[-col ID_column] [-mml #] [-o outputfile] [-v #] [...]
Retrieve variation sequences for Homo_sapiens from a bed file
retrieve-variation-seq -v 2 \
-species Homo_sapiens -release 72 \
-i $RSAT/public_html/demo_files/sample_regions_for_variations_hg19.bed \
-mml 30 \
-o variations.varBed
The option -i specifies a genomic coordinate file in bed format. The program only takes into account the first three columns of the bed file, which specify the genomic coordinates.
Note (from Jacques van Helden): the UCSC genome browser adopts a somewhat inconsistent convention for start and end coordinates: the start position is zero-based (the first nucleotide of a chromosome/scaffold has coordinate 0), but the end position is not included in the selection. This is equivalent to having a zero-based coordinate for the start and a one-based coordinate for the end.
chr1 3473041 3473370
chr1 4380371 4380650
chr1 4845581 4845781
chr1 4845801 4846260
The definition of the BED format is provided on the UCSC Genome Browser web site (http://genome.ucsc.edu/FAQ/FAQformat#format1).
This program only takes into account the first three columns, which specify the genomic coordinates.
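As an illustration of what "only the first three columns" means in practice, here is a minimal Python sketch that reads BED lines and keeps chromosome, start and end while ignoring any further columns. The names BedRegion and read_bed_regions are hypothetical and are not part of RSAT; skipping header lines that start with "#", "track" or "browser" is an assumption based on the UCSC format description, not on the behaviour of retrieve-variation-seq.

```python
from typing import Iterable, Iterator, NamedTuple


class BedRegion(NamedTuple):
    """Hypothetical container for the three columns the program reads."""
    chrom: str
    start: int
    end: int


def read_bed_regions(lines: Iterable[str]) -> Iterator[BedRegion]:
    """Yield (chrom, start, end) from BED lines, ignoring extra columns."""
    for line in lines:
        line = line.strip()
        # Assumption: skip blank lines and optional header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"BED line needs at least 3 columns: {line!r}")
        yield BedRegion(fields[0], int(fields[1]), int(fields[2]))


sample = [
    "chr1\t3473041\t3473370",
    # The extra columns on this line are made up, to show they are ignored.
    "chr1\t4380371\t4380650\tsome_name\t0\t+",
]
regions = list(read_bed_regions(sample))
for region in regions:
    print(region.chrom, region.start, region.end)
```

Coordinates are returned exactly as written in the file; no conversion between coordinate conventions is attempted here.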
The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
The starting position of the feature in the chromosome or scaffold. For RSAT programs, the first base in a chromosome is numbered 1 (this differs from the UCSC-specific zero-based notation for the start).
Note from Jacques van Helden: the UCSC genome browser adopts a somewhat inconsistent convention for start and end coordinates: the start position is zero-based (the first nucleotide of a chromosome/scaffold has coordinate 0), and the end position is not included in the selection. This is equivalent to having a zero-based coordinate for the start and a one-based coordinate for the end. We find this representation completely counter-intuitive, and we therefore decided to adopt a "normal" convention, where the first base of a chromosome or scaffold is numbered 1.
The ending position of the feature in the chromosome or scaffold.
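The difference between the two coordinate conventions discussed above can be made concrete with a small Python sketch. It converts a UCSC-style interval (zero-based start, end excluded) into a one-based interval with both ends included, and back. The function names are hypothetical; whether and where retrieve-variation-seq performs such a conversion internally is not stated here, so this only illustrates the arithmetic.

```python
from typing import Tuple


def ucsc_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Zero-based, end-excluded interval -> one-based, end-included interval."""
    if start < 0 or end <= start:
        raise ValueError(f"invalid UCSC interval: {start}-{end}")
    # Only the start shifts: an excluded zero-based end has the same
    # number as the included one-based end.
    return start + 1, end


def one_based_to_ucsc(start: int, end: int) -> Tuple[int, int]:
    """One-based, end-included interval -> zero-based, end-excluded interval."""
    if start < 1 or end < start:
        raise ValueError(f"invalid one-based interval: {start}-{end}")
    return start - 1, end


# The first ten nucleotides of a chromosome in both conventions.
first_ten = ucsc_to_one_based(0, 10)
round_trip = one_based_to_ucsc(*first_ten)
print(first_ten, round_trip)
```

The region length is end - start in the UCSC convention and end - start + 1 in the one-based convention; both give the same number of nucleotides.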
See the output format of download-ensembl-variations.
A tab-delimited file with the variation IDs in one column.
A tab-delimited file with the following columns:
The name of the chromosome (e.g. 1, X, 8...)
The starting position of the feature in the chromosome
The ending position of the feature in the chromosome
The strand of the feature in the chromosome
ID of the variation
SO (Sequence Ontology) term of the variation
Allele of the variation in the reference sequence
Allele of the variation in the sequence
Allele frequency
Sequence of length L centered on the variation
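The ten columns listed above could be read into named fields roughly as follows. This is a sketch under assumptions: the field names are hypothetical labels chosen for this example, the row is synthetic (not real program output), and any header or comment lines the program may write are not handled. The allele frequency is kept as text because its exact formatting is not specified here.

```python
from typing import NamedTuple


class VariationRow(NamedTuple):
    """Hypothetical field names for the ten columns described above."""
    chrom: str
    start: int
    end: int
    strand: str
    var_id: str
    so_term: str
    ref_allele: str
    allele: str
    allele_freq: str
    sequence: str


def parse_variation_row(line: str) -> VariationRow:
    """Split one tab-delimited output row into its ten columns."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, found {len(fields)}")
    chrom, start, end, strand, var_id, so_term, ref, alt, freq, seq = fields
    return VariationRow(chrom, int(start), int(end), strand,
                        var_id, so_term, ref, alt, freq, seq)


# Synthetic row for illustration only; values are placeholders.
example_line = "1\t100\t100\t+\texample_id\tSNV\tA\tG\t0.5\tccGtt\n"
row = parse_variation_row(example_line)
print(row.var_id, row.ref_allele, row.allele, len(row.sequence))
```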
retrieve-variation-seq uses the sequences downloaded from Ensembl using the tool download-ensembl-genome.
retrieve-variation-seq uses variation coordinates downloaded from Ensembl using the tool download-ensembl-variations.
Scan variation sequences with one or several position-specific scoring matrices.
Level of verbosity (detail in the warning messages during execution)
Display full help message
Same as -h
Species name. This name must correspond to the species of the variation/bed/id file if provided.
The version of the Ensembl database (e.g. 72).
Note: each Ensembl version contains a specific assembly version for each species. When the option -release is used, the option -assembly should thus in principle not be used.
Assembly version (e.g. GRCh37 for the assembly 37 of the Human genome).
Note: genome assemblies can cover several successive Ensembl versions. In case of ambiguity, the latest corresponding Ensembl version is used.
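The rule that -release and -assembly are alternatives can be captured when a command line is assembled from a script. The Python sketch below only builds the argument list and does not run anything. The helper name build_command is hypothetical, and the strict "exactly one of the two" check is an assumption taken from the usage line; the program itself may be more lenient.

```python
from typing import List, Optional, Sequence


def build_command(species: str,
                  release: Optional[int] = None,
                  assembly: Optional[str] = None,
                  extra: Sequence[str] = ()) -> List[str]:
    """Build an argument list using either -release or -assembly, not both."""
    # Assumption: require exactly one of the two, as in the usage line.
    if (release is None) == (assembly is None):
        raise ValueError("give exactly one of release or assembly")
    cmd = ["retrieve-variation-seq", "-species", species]
    if release is not None:
        cmd += ["-release", str(release)]
    else:
        cmd += ["-assembly", str(assembly)]
    cmd += list(extra)  # other options, e.g. ["-mml", "30"]
    return cmd


by_release = build_command("Homo_sapiens", release=72, extra=["-mml", "30"])
by_assembly = build_command("Homo_sapiens", assembly="GRCh37")
print(" ".join(by_release))
print(" ".join(by_assembly))
```

The resulting list can be passed to subprocess.run on a machine where the tool is installed.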
Input File.
The input file specifies a list of query variations. Each row corresponds to one query.
The variations can be provided in various formats (see option -format below).
Format of the input file
Supported formats:
Format of variation files used by all RSAT scripts.
Tab-delimited file with all variation IDs in a given column, which can be specified with the option -col.
General format for the description of genomic features (see https://genome.ucsc.edu/FAQ/FAQformat.html#format1).
Length of the longest matrix.
Column containing the variation IDs with the input format "id".
Default : 1
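To show how a one-based column number such as the one given with -col selects the IDs, here is a minimal Python sketch. The function name read_variation_ids is hypothetical, the IDs in the sample are placeholders, and treating lines starting with "#" or ";" as comments is an assumption rather than documented behaviour of the program.

```python
from typing import Iterable, List


def read_variation_ids(lines: Iterable[str], col: int = 1) -> List[str]:
    """Collect the IDs found in the given column (numbered from 1)."""
    if col < 1:
        raise ValueError("column numbers start at 1")
    ids = []
    for line in lines:
        line = line.rstrip("\r\n")
        # Assumption: blank lines and lines starting with '#' or ';' are skipped.
        if not line or line.startswith(("#", ";")):
            continue
        fields = line.split("\t")
        if len(fields) < col:
            raise ValueError(f"line has fewer than {col} columns: {line!r}")
        ids.append(fields[col - 1])
    return ids


# Placeholder content: the IDs sit in the second column.
sample_lines = [
    "# note\tid",
    "first\tid_A",
    "second\tid_B",
]
ids_default = read_variation_ids(["id_A", "id_B"])
ids_col2 = read_variation_ids(sample_lines, col=2)
print(ids_default, ids_col2)
```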
The output file is in fasta format.
If no output file is specified, the standard output is used. This allows the command to be used within a pipe.