Creating a working directory

In your home directory, create a working directory named projects. In this folder, create a directory named t-cell_analysis, then a directory named samples containing a subfolder named raw. Download the following FASTQ files and move them to the raw folder.

  • Control (first read of the pair): control_R1.fq
  • Control (second read of the pair): control_R2.fq
  • Treated (first read of the pair): treated_R1.fq
  • Treated (second read of the pair): treated_R2.fq
cd    ~/git/abd/practical/snakemake           # go to your working directory (please adapt)
mkdir -p projects/t-cell_analysis/samples/raw # create a directory containing projects
cd projects/t-cell_analysis/samples/raw       # change directory

# Download the file of interest (here using a loop)
# Note that it can be interesting to store the STDERR of wget.
for i in control_R1 control_R2 treated_R1 treated_R2;
  do wget --no-clobber http://pedagogix-tagc.univ-mrs.fr/courses/data/ngs/abd/${i}.fq.gz 2>/dev/null;
done

Creating an index for bowtie2

Here we will work on a subset of the initial read files. The selected reads have been pre-processed and should map to mouse chromosome 19. We will thus align these reads onto this chromosome only. Although bowtie2 indexes for the full genome are available on the bowtie2 web site, we will here prepare a dedicated index for chromosome 19.

You will need to download the sequence of chromosome 19 (Mus musculus genome, version mm10) and to index it using bowtie2-build. The index should be placed in the projects/indexes/bowtie2/mm10 folder and its prefix should be chr19. The chr19 sequence can be downloaded from the UCSC web site or from Ensembl.

cd    ~/git/abd/practical/snakemake # please adapt
mkdir -p projects/indexes/bowtie2/mm10 # Create a directory to store the index
cd projects/indexes/bowtie2/mm10       # go to projects/indexes/bowtie2/mm10
# Download chr19 sequence (mm10 version)
wget --no-clobber http://hgdownload.soe.ucsc.edu/goldenPath/mm10/chromosomes/chr19.fa.gz 2> chr19.fa_wget.log
gunzip chr19.fa.gz # uncompress the file
# To get help about bowtie2-build, type:
# bowtie2-build -h
bowtie2-build chr19.fa chr19 &> chr19.fa_bowtie2-build.log
ls -rtlh

Downloading a gtf file (Mus musculus genome version mm10)

We will download the GTF file from ensembl (version mm10/GRCm38).

cd    ~/git/abd/practical/snakemake/projects/ # adapt to your needs
mkdir -p annotations/gtf
cd annotations/gtf
curl ftp://ftp.ensembl.org/pub/release-83/gtf/mus_musculus/Mus_musculus.GRCm38.83.chr.gtf.gz | gunzip -c | grep "^19" | sed 's/^19/chr19/' > GRCm38.83.chr19.gtf
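The grep and sed steps keep only chromosome 19 records and convert the Ensembl chromosome name ("19") to the UCSC convention ("chr19") used by the index. A minimal illustration on two fabricated GTF-like lines (not real annotation data):

```shell
# Two fake records: only the first one is on chromosome 19.
printf '19\thavana\tgene\t1\t100\n1\thavana\tgene\t5\t50\n' \
  | grep "^19" | sed 's/^19/chr19/'
# prints only the chromosome 19 record, with "19" rewritten to "chr19"
```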

Downloading information about chromosomes

In order to perform some of the tasks below, we will need to download a file containing the sizes of the chromosomes.

Using your internet browser, go to the UCSC web site. In the Tools menu, select Table Browser. Fill in the form as follows:

  • Clade: Mammal.
  • Genome: Mouse.
  • Assembly: mm10.
  • Group: All Tables.
  • Table: chromInfo.
  • Select all fields from selected table for output. Use the string "chromInfo_mm10.txt" as the output file name.
  • Click on get output
  • Create a directory projects/annotations/txt and place the downloaded file in that directory.

Creating a Snakefile

In the projects/t-cell_analysis directory, create a file named Snakefile using the touch command. Open this file with emacs or any editor of your choice.

cd ~/projects/t-cell_analysis
touch Snakefile

Setup your Snakefile

First, indicate using comments (strings prefixed with # or a docstring """...""") the name of the author, their affiliation, the aim of the program and the date. Add another comment describing how this workflow should be run. Add another line recording the date of the latest modifications.

"""
Author: D. Puthier
Affiliation: AMU
Aim: A simple Snakemake workflow to process paired-end stranded RNA-Seq.
Date: Mon Nov  2 14:03:11 CET 2015
Run: snakemake   -s Snakefile   
Latest modification: 
  - todo
"""

Declaring a set of functions

Python code can easily be included outside the rules. For example, we can define a function message() to print processing messages on STDERR.


##-----------------------------------------------##
## A set of functions
##-----------------------------------------------##
import sys

def message(mes):
  sys.stderr.write("|--- " + mes + "\n")

Set the reference directory for Snakemake.

We will set the directory projects/t-cell_analysis as the reference directory for the Snakefile. Here we create a Python variable named WDIR and pass it to the workdir directive.

# This should be placed in the Snakefile.

##-----------------------------------------------##
## Working directory                             ##
## Adapt to your needs                           ##
##-----------------------------------------------##

BASE_DIR = "/Users/puthier/git/abd/practical/snakemake/projects"
WDIR = BASE_DIR + "/t-cell_analysis"
workdir: WDIR
message("The current working directory is " + WDIR)

Declaring some variables required for the subsequent steps

Create a section with variables (INDEX, GTF, CHR, FASTA) that point to the appropriate locations (the index name should be provided without extension). We can print the chosen working directory using the simple message() function declared above. Note that in Python, strings can be concatenated using the + operator.

##--------------------------------------------------------------------------------------##
## Variables declaration                          
## Declaring some variables used by topHat and other tools... 
## (GTF file, INDEX, chromosome length)
##--------------------------------------------------------------------------------------##
# Adapt the path to your needs
INDEX = BASE_DIR + "/indexes/bowtie2/mm10/chr19"
GTF   = BASE_DIR + "/annotations/gtf/GRCm38.83.chr19.gtf"
CHR   = BASE_DIR + "/annotations/txt/chromInfo_mm10.txt"
FASTA = BASE_DIR + "/indexes/bowtie2/mm10/chr19.fa"

Getting the list of files to be processed

To get the list of sample files to be processed, we can use the glob_wildcards() function. We can print the list of samples to be processed using the simple message() function declared above.

##--------------------------------------------------------------------------------------##
## The list of samples to be processed
##--------------------------------------------------------------------------------------##
SAMPLES, = glob_wildcards("samples/raw/{smp}_R1.fq.gz")
NB_SAMPLES = len(SAMPLES)


for smp in SAMPLES:
  message("Sample " + smp + " will be processed")
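glob_wildcards() fills the {smp} wildcard from every file name matching the pattern. A rough standard-library sketch of what it computes here (temporary files stand in for the real samples/raw content):

```python
import glob
import os
import re
import tempfile

# Fake the samples/raw layout in a temporary directory.
tmp = tempfile.mkdtemp()
for name in ("control_R1.fq.gz", "control_R2.fq.gz",
             "treated_R1.fq.gz", "treated_R2.fq.gz"):
    open(os.path.join(tmp, name), "w").close()

# Rough equivalent of: SAMPLES, = glob_wildcards("samples/raw/{smp}_R1.fq.gz")
samples = sorted(
    re.match(r"(.+)_R1\.fq\.gz$", os.path.basename(p)).group(1)
    for p in glob.glob(os.path.join(tmp, "*_R1.fq.gz"))
)
print(samples)  # ['control', 'treated']
```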

Our first rule

Each task that should be processed by Snakemake is called a rule (rules are somewhat analogous to the targets of a Makefile). As a first rule, we will ask Snakemake to trim the reads. The code to perform this task is the following:


rule trimming:
  input:  fwd="samples/raw/{smp}_R1.fq.gz", rev="samples/raw/{smp}_R2.fq.gz"
  output: fwd="samples/trimmed/{smp}_R1_t.fq", 
          rev="samples/trimmed/{smp}_R2_t.fq", 
          single="samples/trimmed/{smp}_R1_singletons.fq"
  message: """--- Trimming."""
  shell: """
        sickle pe -f {input.fwd} -r {input.rev}  -l 40 -q 20 -t sanger  -o {output.fwd} -p {output.rev} -s {output.single} &> {input.fwd}.log                                                                               
  """

You can try to run the code using the following commands. Snakemake should complain and throw an error telling us that it is not able to resolve {smp} at the moment.

# Adapt the path to your needs
cd /Users/puthier/git/abd/practical/snakemake/projects/t-cell_analysis
/Library/Frameworks/Python.framework/Versions/3.4/bin/snakemake 

To be able to run the code, snakemake needs to know which files {smp} refers to. We will provide it by modifying the last chunk of code:

rule final: 
  input: expand("samples/trimmed/{smp}_R1_t.fq", smp=SAMPLES)


rule trimming:
  input:  fwd="samples/raw/{smp}_R1.fq.gz", rev="samples/raw/{smp}_R2.fq.gz"
  output: fwd="samples/trimmed/{smp}_R1_t.fq", 
          rev="samples/trimmed/{smp}_R2_t.fq", 
          single="samples/trimmed/{smp}_R1_singletons.fq"
  message: """--- Trimming."""
  shell: """
        sickle pe -f {input.fwd} -r {input.rev}  -l 40 -q 20 -t sanger  -o {output.fwd} -p {output.rev} -s {output.single} &> {input.fwd}.log
  """

You can try to run the code using the following commands. Snakemake should perform the first task.

# Adapt the path to your needs
cd /Users/puthier/git/abd/practical/snakemake/projects/t-cell_analysis
/Library/Frameworks/Python.framework/Versions/3.4/bin/snakemake

NB: if you run the same command again at this step, Snakemake knows that all the requested tasks have already been performed and will do nothing:

# Adapt the path to your needs
cd /Users/puthier/git/abd/practical/snakemake/projects/t-cell_analysis
/Library/Frameworks/Python.framework/Versions/3.4/bin/snakemake

Adding another task

After read trimming, we will add a FastQC analysis. The set of rules should now look like below:

rule final: 
  input: expand("samples/fastqc/{smp}/{smp}_R1_t.fq_fastqc.zip", smp=SAMPLES)

rule trimming:
  input:  fwd="samples/raw/{smp}_R1.fq.gz", rev="samples/raw/{smp}_R2.fq.gz"
  output: fwd="samples/trimmed/{smp}_R1_t.fq", 
          rev="samples/trimmed/{smp}_R2_t.fq", 
          single="samples/trimmed/{smp}_R1_singletons.fq"
  message: """--- Trimming reads."""
  shell: """                                                                                                                                                                                                                                                                                                                                                            
        sickle pe -f {input.fwd} -r {input.rev}  -l 40 -q 20 -t sanger  -o {output.fwd} -p {output.rev} -s {output.single} &> {input.fwd}.log                                                                               
  """

rule fastqc:
        input:  fwd="samples/trimmed/{smp}_R1_t.fq", 
                rev="samples/trimmed/{smp}_R2_t.fq"
        output: fwd="samples/fastqc/{smp}/{smp}_R1_t.fq_fastqc.zip", rev="samples/fastqc/{smp}/{smp}_R2_t.fq_fastqc.zip"
        message: """--- Quality check of trimmed data with FastQC."""
        shell: """                                                                                                                                                                                        
              /opt/FastQC/fastqc --outdir  samples/fastqc/{wildcards.smp} --extract  -f fastq {input.fwd} {input.rev}                                                                                                                                                                                                                                                                                    
     """

Performing mapping with tophat (and indexing BAM file)

For tophat mapping, we must take care to indicate the proper library type (fr-firststrand). We will also reject reads that map to several locations on the genome (here we won't consider the mapping quality).

rule tophat:
        input:  fwd="samples/trimmed/{smp}_R1_t.fq",
                rev="samples/trimmed/{smp}_R2_t.fq"
        params: gtf=GTF, index=INDEX
        output: "samples/bam/{smp}.bam"
        shell: """
                mkdir -p samples/tophat/{wildcards.smp}
                tophat2                                         \
                        -o samples/tophat/{wildcards.smp}       \
                        -g 1                                    \
                        --library-type fr-firststrand           \
                        -G {params.gtf}                         \
                        -x 1                                    \
                        -p 5                                    \
                        {params.index}                          \
                        {input.fwd} {input.rev} &> samples/tophat/{wildcards.smp}/run_tophat.log
                cd samples/tophat/{wildcards.smp}
                mv accepted_hits.bam ../../bam/{wildcards.smp}.bam
                """

Selecting reads from positive or negative strand

To select reads from fragments that correspond to the positive or negative strand, we need to use the bitwise FLAG field of the SAM/BAM file. You can use a "SAM flags explained" web site to get information about each flag value. Here, what we are looking for is the following:

The flags that can be used to select the reads of interest:
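These flag values can be double-checked by decoding them bit by bit (a quick sketch; the bit names follow the SAM specification). In the rule below, flags 99 and 147 go to the *_min.bam output while 83 and 163 go to *_plus.bam:

```python
# SAM FLAG bits (from the SAM specification).
PAIRED         = 0x1   # read paired
PROPER_PAIR    = 0x2   # read mapped in proper pair
READ_REVERSE   = 0x10  # read is reverse-complemented
MATE_REVERSE   = 0x20  # mate is reverse-complemented
FIRST_IN_PAIR  = 0x40  # first read of the pair
SECOND_IN_PAIR = 0x80  # second read of the pair

expected = {
    99:  PAIRED | PROPER_PAIR | MATE_REVERSE | FIRST_IN_PAIR,
    147: PAIRED | PROPER_PAIR | READ_REVERSE | SECOND_IN_PAIR,
    83:  PAIRED | PROPER_PAIR | READ_REVERSE | FIRST_IN_PAIR,
    163: PAIRED | PROPER_PAIR | MATE_REVERSE | SECOND_IN_PAIR,
}
for flag, bits in expected.items():
    assert flag == bits
print("99, 147, 83 and 163 decode as expected")
```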

rule select_reads_by_strand:
        input: "samples/bam/{smp}.bam"
        output: min="samples/bam/{smp}_min.bam", plus="samples/bam/{smp}_plus.bam"
        shell: """
              samtools  view -f99 -hb {input} > {output.min}_99.bam
              samtools  view -f147 -hb {input} > {output.min}_147.bam
              samtools  view -f83 -hb {input} > {output.plus}_83.bam
              samtools  view -f163 -hb {input} > {output.plus}_163.bam
      
              samtools merge -h {output.min}_99.bam  {output.min} {output.min}_99.bam {output.min}_147.bam
              samtools merge -h {output.plus}_83.bam {output.plus} {output.plus}_83.bam {output.plus}_163.bam
      
              rm -f {output.min}_99.bam  {output.min}_147.bam {output.plus}_83.bam {output.plus}_163.bam
        """

Indexing BAM files

The BAM files need to be indexed to be loaded into a genome browser. This step will be performed using the samtools index command. Don't forget to add the .bai files to the final rule.
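For example, the final rule could request the .bai files (a sketch; extend the expand() targets to whatever else you want built):

```
rule final:
  input: expand("samples/bam/{smp}.bam.bai", smp=SAMPLES)
```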

rule index_bam:
        input: full="samples/bam/{smp}.bam", min="samples/bam/{smp}_min.bam", plus="samples/bam/{smp}_plus.bam"
        output: full="samples/bam/{smp}.bam.bai", min="samples/bam/{smp}_min.bam.bai", plus="samples/bam/{smp}_plus.bam.bai"
        shell: """
              samtools index {input.full}
              samtools index {input.min}
              samtools index {input.plus}
        """

Creating bigwigs for both strands

To create bigwigs for both strands, we will use the bam2wig.py command (RSeQC Python package). We will then convert the resulting wig files using wigToBigWig from the Kent utilities.

rule create_bigwig:
        input: min="samples/bam/{smp}_min.bam", plus="samples/bam/{smp}_plus.bam"
        output: min="samples/bwig/{smp}_min.bw", plus="samples/bwig/{smp}_plus.bw"
        params: chr=CHR
        shell: """
          bam2wig.py  -i {input.plus} -s {params.chr} -o samples/bwig/{wildcards.smp}_plus &> {output.plus}.log
          wigToBigWig -clip samples/bwig/{wildcards.smp}_plus.wig {params.chr} {output.plus} >> {output.plus}.log 2>&1
          rm -f samples/bwig/{wildcards.smp}_plus.wig

          bam2wig.py  -i {input.min} -s {params.chr} -o samples/bwig/{wildcards.smp}_min  &> {output.min}.log
          wigToBigWig -clip samples/bwig/{wildcards.smp}_min.wig {params.chr} {output.min} >> {output.min}.log 2>&1
          rm -f samples/bwig/{wildcards.smp}_min.wig
      """

Searching for novel transcripts

We can now use the cufflinks software to try to discover novel transcripts in the dataset. We will also provide cufflinks with the known transcripts (-g option).

rule cufflinks:
        input: bam="samples/bam/{smp}.bam"
        output: gtf="samples/cufflinks/{smp}/transcripts.gtf"
        params:  gtf=GTF
        message: "--- Searching novel transcript with cufflinks."
        shell: """
          cufflinks -g {params.gtf} -p 5 --library-type fr-firststrand  -o samples/cufflinks/{wildcards.smp}  {input.bam} &> {output}.log
        """

Comparing discovered transcripts with known transcripts

This step can be performed using cuffmerge. We need to create a file containing the list of gtf files to merge.

rule cuffmerge:
        input: expand("samples/cufflinks/{smp}/transcripts.gtf", smp=SAMPLES)
        output: "samples/cuffmerge/merged.gtf"
        params: gtf=GTF, fa=FASTA
        message: "--- Comparing transcript to the reference."
        shell:  """
        ls -1 samples/cufflinks/*/transcripts.gtf > samples/cuffmerge/assembly.txt
        cuffmerge  -o  samples/cuffmerge -g {params.gtf} --keep-tmp -s {params.fa} -p 5 samples/cuffmerge/assembly.txt &> {output}.log
        """

Selecting novel transcripts

Now that we have compared the transcript models to the reference, we will select a subset of these transcripts based on their class_code. The transcripts have been classified by cuffmerge into the following classes:

Code  Transcript type (relative to the reference)
=     Complete match of intron chain
j     Potentially novel isoform (at least one splice junction shared with a reference transcript)
o     Generic exonic overlap with the reference
s     An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors)
u     Unknown, intergenic transcript
x     Exonic overlap with the reference on the opposite strand

We will select transcripts with class code u, x, o, s, or j.
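No command is shown for this selection step; a possible sketch of the logic is given below (the same effect can be obtained with a pattern match such as grep -E 'class_code "[uxosj]"' on merged.gtf, assuming cuffmerge wrote a class_code attribute). The GTF lines here are fabricated for illustration:

```python
import re

NOVEL_CLASSES = set("uxosj")

def is_novel(gtf_line):
    """Return True if the line carries a class_code in u, x, o, s or j."""
    m = re.search(r'class_code "(.)"', gtf_line)
    return bool(m) and m.group(1) in NOVEL_CLASSES

lines = [
    'chr19\t.\ttranscript\t1\t2\t.\t+\t.\tgene_id "G1"; class_code "u";',
    'chr19\t.\ttranscript\t1\t2\t.\t+\t.\tgene_id "G2"; class_code "=";',
]
print([is_novel(l) for l in lines])  # [True, False]
```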

Adding a gene name to novel transcripts

The novel transcripts are devoid of the gene_name attribute in the GTF file. We will use their gene_id as the value for this attribute.

rule add_gene_name_to_unknown:
        input: "samples/cuffmerge/novel_transcript.gtf"
        output: "samples/cuffmerge/novel_transcript_gn.gtf"
        params: gtf=GTF, fa=FASTA
        message: "--- Adding gene name to novel transcript."
        run:
          import re
          fh_in = open(input[0], "r")
          fh_out = open(output[0], "w")
          for line in fh_in:
            line = line.rstrip("\n")
            if not re.search("gene_name", line):
              # No gene_name: reuse the gene_id as gene_name.
              gene_id = re.match('.*gene_id "(.*?)"', line).group(1)
              fh_out.write(line + ' gene_name "' + gene_id + '";\n')
            else:
              fh_out.write(line + "\n")
          fh_in.close()
          fh_out.close()
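The gene_id extraction used in this rule can be checked on a fabricated sample line:

```python
import re

# A cuffmerge-style transcript line without gene_name (fabricated example).
line = ('chr19\tcuffmerge\ttranscript\t1\t100\t.\t+\t.\t'
        'gene_id "XLOC_000001"; transcript_id "TCONS_00000001";')
gene_id = re.match('.*gene_id "(.*?)"', line).group(1)
print(gene_id)  # XLOC_000001
```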

Merge novel and known transcripts

Now we can merge the novel and known transcripts using the cat command.

rule merge_novel_and_known:
        input: novel="samples/cuffmerge/novel_transcript_gn.gtf", known=GTF
        output: "samples/new_annotation/all_transcripts.gtf"
        params: gtf=GTF, fa=FASTA
        message: "--- Merging known and novel transcripts."
        shell:  """
        cat {input.novel} {input.known} > {output}
        """ 

Quantification

Finally, we will quantify the abundance of all genes. The gene_id attribute in the GTF file will be used to perform this task (-g option of featureCounts).

rule quantification_with_featureCounts:
        input: novel="samples/new_annotation/all_transcripts.gtf", bam=expand("samples/bam/{smp}.bam", smp=SAMPLES)
        output: "results/counts/gene_counts.txt",  "results/counts/gene_counts_mini.txt"
        shell: """
        featureCounts -p -s 2 -T 15 -t exon -g gene_id -a {input.novel} -o {output[0]} {input.bam} &> {output[0]}.log
        cut -f 1,7- {output[0]}| awk 'NR > 1' | awk '{{gsub("samples/bam/","",$0); print}}'  > {output[1]}
        """
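The cut/awk post-processing keeps the gene id and count columns and strips the samples/bam/ prefix from the header. A minimal illustration of the gsub step on a fake header line:

```shell
# Strip the samples/bam/ prefix from every field of a fake featureCounts header.
printf 'Geneid\tsamples/bam/control.bam\tsamples/bam/treated.bam\n' \
  | awk '{gsub("samples/bam/","",$0); print}'
# prints the header with bare file names (control.bam, treated.bam)
```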

Some (quick) diagnostic plots with R

rule diagnostic_plot:
        input: "results/counts/gene_counts_mini.txt"
        output: "results/diagnostic_plot/diagnostic.pdf"
        run:
          from snakemake.utils import R  # R() runs R code from within a rule
          R("""
            dir.create("results/diagnostic_plot")
            data <- read.table("{input}", 
                                sep="\t", 
                                header=T, 
                                row.names=1)
            data <- data[rowSums(data) > 0, ]
            data <- log2(data + 1)
            pdf("{output}")
            dev.null <- apply(data, 2, hist, border="white", col="blue")
            boxplot(data, col="blue", pch=16)
            pairs(data, pch=".", col="blue")
            dev.off()
            cat("etc...")
      
        """)

Displaying the workflow

One can easily get an overview of the produced workflow by providing snakemake with the --dag (or --rulegraph) argument. This will produce a graph description in Graphviz format. A layout for this graph can be produced using the dot command.

We can produce this graph with one set of jobs per sample.

# Adapt the path as needed
cd /Users/puthier/git/abd/practical/snakemake/projects/t-cell_analysis
/Library/Frameworks/Python.framework/Versions/3.4/bin/snakemake --dag 2> /dev/null | dot -T png > ../../img/workflow_bysample.png
A graph illustrating the different steps of the workflow (one job per sample)

Or as a single, global, output.

# Adapt the path as needed
cd /Users/puthier/git/abd/practical/snakemake/projects/t-cell_analysis
/Library/Frameworks/Python.framework/Versions/3.4/bin/snakemake --rulegraph 2> /dev/null | dot -T png > ../../img/workflow.png
A graph illustrating the different steps of the workflow (rule graph)

Indexes, annotations and code (if required)

Download links: annotations, indexes, code

File tree

projects/
|-- annotations
|   |-- gtf
|   |   `-- GRCm38.83.chr19.gtf
|   `-- txt
|       `-- chromInfo_mm10.txt
|-- indexes
|   `-- bowtie2
|       `-- mm10
|           |-- chr19.1.bt2
|           |-- ...
`-- t-cell_analysis
    |-- samples
    |   |-- bam
    |   |-- bwig
    ...

Adding the workflow jobs to a queue

The following parameter can be added to each rule:

params: ppn="nodes=1:ppn=1" # Adapt the values to your needs.

The workflow tasks can then be added to the Torque queue using:

snakemake -c 'qsub -V  -q nom_de_la_queue -l {params.ppn}' -j