
Methods


Table of Contents

Region scoring
Machine learning approaches
Choose the best model: Prevent overfitting using cross-validation
Model performance
Controls
Access to R scripts (functions, main script), basic bash commands to parse the files generated with the R scripts, and Python scripts (build the training set matrix from matrix-scan results here, select genes in a list that are annotated in GO here)


Region scoring

We scanned regions with 259 Position Weight Matrices (PWMs) of Drosophila melanogaster collected from Fly Factor Survey (254 matrices) and from peak-motifs analysis of ChIP-seq experiments on Tin, Mad, dTCF, Doc and Pnr. We used matrix-scan (command) to predict sites (p-value ≤ 10^-3). For each region and each matrix, we then computed the sum of the -log10(p-value) over the predicted sites. We normalized these scores by the length of the region (influence of sequence length: example of scanning with the Eip78C matrix).
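A minimal R sketch of this scoring step is given below. It is not the original script: the file names and the column names of the matrix-scan output (seq_id, ft_name, Pval) and of the region length table are assumptions.

  # Aggregate matrix-scan hits into one length-normalized score per region and per matrix.
  # Assumed inputs: a tab-delimited site table with columns 'seq_id' (region),
  # 'ft_name' (PWM) and 'Pval' (site p-value), and a table of region lengths.
  hits    <- read.delim("matrix_scan_hits.tab")
  lengths <- read.delim("region_lengths.tab")   # assumed columns: seq_id, length

  # Sum of -log10(p-value) over all predicted sites of a given matrix in a region
  hits$score <- -log10(hits$Pval)
  raw <- aggregate(score ~ seq_id + ft_name, data = hits, FUN = sum)

  # Normalize by region length to remove the influence of sequence length
  raw$length     <- lengths$length[match(raw$seq_id, lengths$seq_id)]
  raw$norm_score <- raw$score / raw$length

  # Reshape into a regions x matrices score table (one row per region)
  score_matrix <- xtabs(norm_score ~ seq_id + ft_name, data = raw)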

Machine learning approaches

Machine learning approaches are used to build models able to discriminate between a positive and a negative set by combining and weighting the information carried by a set of predictors. The model is then used to predict new positive regions. The negative and positive regions are labeled by a binary response vector, and the resulting matrix is named the training set. Whatever the chosen method, the goal is to define a function as close as possible to the response vector. The methods assign to each descriptive feature a weight corresponding to the impact of that feature in the discrimination between positive and negative regions. These weights are named regressor parameters, or β. The choice of these parameters is a critical step and is performed by different optimization strategies depending on the method. Here, we used three different approaches (a minimal R sketch of the corresponding calls is given after this list):
  1. Linear regression: We first based our approach on the study of Narlikar et al. 2010, in which the authors identified over 40000 novel human heart enhancers using linear regression (R package LARS, Efron et al. 2004). Briefly, they collected 77 in vivo validated vertebrate heart enhancers. They combined de novo binding site specificities obtained by motif discovery, Markov sequence features (different orders of Markov models) and binding sites predicted with 41 known binding motifs. The LARS algorithm, with the LASSO constraint, allows predictor selection to be performed.
    The hypothesis defined by the linear regression approach is a linear function (Figure 2A). The β parameters are chosen by minimizing the residual sum of squared errors between the hypothesis and the response vector. This is computed by the cost function, whose contours are elliptic (Figure 2B).

    Figure 2: Schematic description of the linear regression method. A. Linear function h(x), where x are the descriptive features and β are the regressor parameters to optimize. The graph represents the distribution of the values of the descriptive feature x1 (x-axis) for the training set as a function of their labels (y-axis). The dashed arrows indicate the error between h(x) and the response vector; the colors of the circles correspond to the different populations. B. Cost function. The graph represents the cost function in the case of one descriptive feature. Some combinations of β0 (intercept) and β1 produce the same cost (defined by the ellipses). The best solution is indicated by β̂.



    LARS allows the LASSO constraint to be applied (Figure 3). This constraint restricts the parameter space to a rotated square, which sets some regressors to 0 and thus selects the β parameters. The constraint forces the sum of the absolute values of the β parameters to be lower than a parameter t: the more t decreases, the higher the probability of setting some regressors to 0. Note that linear regression is not a classification method, as the hypothesis returns values lower than 0 or higher than 1, which makes the interpretation of the results more difficult.

    Figure 3: Schematic representation of the lasso constraint. t is the shrinking parameter.

  2. Logistic regression: We then tested logistic regression (R package glmnet, Friedman et al. 2008), which is based on the sigmoid function and returns, for a given sequence, the probability of being a heart enhancer or not given the β parameters and the values of the descriptive features (Figure 4A). In the glmnet algorithm, the β parameters are optimized by maximizing the log-likelihood of the β parameters given the descriptive feature values and the response vector (Figure 4B). The β parameter space is also constrained by the LASSO (Figure 3), and the variance of the β parameters is restricted using the regularisation parameter λ.

    Figure 4: Schematic description of the logistic regression method. A. Sigmoid function g(x), where βTx is the linear function defined in Figure 2. The graph represents the sigmoid function: the x-axis represents the values of the βTx function and the y-axis the probability for a given region to be positive or not, given the β parameters and the values of its descriptive features. B. Cost function.


  3. Support Vector Machine (SVM): R package e1071, Meyer et al. 2012
    • Linear kernel: The SVM with a linear kernel assumes that there exists a linear hyperplane separating the positive and negative data. As an infinity of such planes exists (Figure 5A), the SVM maximizes the margin between the positive/negative sets and the plane (Figure 5B). The variance of the β parameters is constrained by the regularisation parameter C (1/λ). This parameter allows overfitting to be avoided (see next section) by relaxing the constraint (Figure 5C-D).

      Figure 5: Schematic description of the SVM with linear kernel. Here, an example with two descriptive features x1 (x-axis) and x2 (y-axis) is shown. Each dot corresponds to a region (positive or negative) of the training set. A. Infinity of separating hyperplanes. B. The hyperplane that maximizes the margin. C-D. By varying the C parameter (large C in C, low C in D), the model is more or less sensitive to noise in the data.


    • Radial kernel: The kernel function is a similarity function that transforms the descriptive feature space into a space of higher dimension. It assumes that no linear hyperplane is able to separate the positive and negative regions (Figure 6). Using the e1071 functions, one can vary the C parameter as well as the γ parameter (γ = 1/σ²), which smooths the similarity function in order to be more or less stringent. The similarity function returns a value between 0 and 1: the smoother the similarity function (larger σ, i.e. smaller γ), the more bias is introduced.

      Figure 6: Schematic representation of the radial kernel transformation. Here, an example with one descriptive feature x1 (1D) transformed into a 2D space by the kernel function K(x,y) is shown. Each dot corresponds to a region (positive or negative) of the training set.
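As a rough illustration (not the original scripts), the sketch below shows how the three models could be fitted in R on the training set. score_matrix refers to the region-scoring sketch above, and positives is a hypothetical character vector with the identifiers of the positive regions.

  library(lars)    # Efron et al. 2004
  library(glmnet)  # Friedman et al. 2008
  library(e1071)   # Meyer et al. 2012

  x <- as.matrix(score_matrix)                  # descriptive features (normalized motif scores)
  y <- as.numeric(rownames(x) %in% positives)   # binary response vector (1 = positive region)

  ## 1. Linear regression with the LASSO constraint (LARS)
  fit_lars <- lars(x, y, type = "lasso")
  # coef(fit_lars) returns the beta parameters at each step of the path;
  # many of them are exactly 0, which performs the predictor selection.

  ## 2. Logistic regression with the LASSO constraint (glmnet)
  fit_glmnet <- glmnet(x, y, family = "binomial", alpha = 1)
  # predict(fit_glmnet, newx = x, type = "response") returns, for each region,
  # the probability of being a heart enhancer at each value of lambda.

  ## 3. Support Vector Machines (e1071), linear and radial kernels
  y_factor   <- factor(y, levels = c(0, 1))
  fit_linear <- svm(x, y_factor, kernel = "linear", cost = 1)
  fit_radial <- svm(x, y_factor, kernel = "radial", cost = 1, gamma = 0.01)
  # cost is the C regularisation parameter; gamma controls the smoothness
  # of the radial similarity function (the values chosen here are arbitrary).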


Choose the best model: Prevent overfitting using cross-validation

We talk about overfitting when the built model is not able to correctly predict new positive/negative regions because it fits the training set too closely: it is not able to generalize. This phenomenon is due to a too large number of descriptive features, which defines a model that is too specific. In Figure 7B, the purple curve shows an example of the evolution of the residual sum of squared (RSS) errors while relaxing the LASSO constraint (allowing more features in the model) with the LARS algorithm. The RSS decreases as descriptive features are added to the model. Some parameters (defined above) must therefore be tuned in order to find the best balance between overfitting and cost function optimization. In order to quantify the overfitting of a model, a cross-validation step is necessary. Basically, the original training set is split into a new training set, from which a model is built, and a test set (Figure 7A). By applying the model to the test set, we evaluate its capacity to correctly predict positive and negative regions. The black curve in Figure 7B represents the evolution of the mean squared error (MSE) while relaxing the LASSO constraint (LARS). From this analysis, we choose the model that produces the minimal MSE.

Figure 7: Cross-validation and overfitting. A. Cartoon representing the principle of cross-validation. B. The x-axis indicates the number of steps of the LARS algorithm; at each step, the LASSO constraint is relaxed. Purple: residual sum of squared (RSS) errors; black: mean squared error (MSE).
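As an illustration of this step, the sketch below (assuming the x and y objects defined in the model-fitting sketch above) runs a 10-fold cross-validation along the LASSO path and keeps the constraint level that minimizes the MSE, as in Figure 7B.

  library(lars)

  cv_lars <- cv.lars(x, y, K = 10, type = "lasso")   # 10-fold cross-validation
  # cv_lars$cv holds the mean squared error on the held-out folds at each
  # level of the LASSO constraint (black curve of Figure 7B).
  best_fraction <- cv_lars$index[which.min(cv_lars$cv)]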



The error made by the model is computed in different ways (see the sketch after this list):
  • LARS: mean squared error (Figure 7B)
  • glmnet: deviance ratio, which corresponds to the ratio between the likelihood of the β parameters in the built model and the saturated (maximum) likelihood
  • SVM: classification error estimated on the held-out cross-validation folds
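The corresponding calls could look like the sketch below (same assumed x, y and y_factor objects as above; the λ, C and γ grids are arbitrary).

  library(glmnet)
  library(e1071)

  # glmnet: cross-validated binomial deviance, then deviance ratio of the retained model
  cv_glmnet <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                         type.measure = "deviance", nfolds = 10)
  fit <- glmnet(x, y, family = "binomial", alpha = 1,
                lambda = cv_glmnet$lambda.min)
  fit$dev.ratio             # fraction of the (null) deviance explained by the model

  # SVM: classification error on the held-out folds for a grid of C and gamma
  cv_svm <- tune.svm(x, y_factor, kernel = "radial",
                     cost = 10^(0:3), gamma = 10^(-4:-1))
  cv_svm$best.parameters
  cv_svm$best.performance   # error of the best parameter combination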

Model performance

Once the optimal parameters are defined, we can evaluate the selected model by plotting the precision/recall, precision/cut-off, recall/cut-off and ROC curves. This is performed using the cross-validation results. We used the R package ROCR (Sing et al. 2005).
TP: True Positive; FP: False Positive; FN: False Negative; N: Negative samples
  • precision = TP / (TP + FP)
  • recall = true positive rate = TP / (TP + FN)
  • false positive rate = FP / N

The precision indicates the proportion of true positives among all the regions predicted as positive. The recall indicates the proportion of true positives detected among all the true positives. Thanks to these curves, we can choose the most appropriate cut-off on the fitted values according to the question. Here, we want to select regions in order to test them experimentally, so we favor a high precision to avoid false positives.

Figure 8: Performance curves (logistic regression). Top left: precision/recall; top right: precision/cut-off; bottom left: recall/cut-off; bottom right: true positive rate/false positive rate (ROC) curve.
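These curves can be produced with ROCR along the lines of the sketch below, where cv_scores and cv_labels are hypothetical names for the cross-validated predictions and the corresponding 0/1 labels.

  library(ROCR)  # Sing et al. 2005

  pred <- prediction(cv_scores, cv_labels)

  plot(performance(pred, "prec", "rec"))   # precision / recall
  plot(performance(pred, "prec"))          # precision / cut-off
  plot(performance(pred, "rec"))           # recall / cut-off
  plot(performance(pred, "tpr", "fpr"))    # ROC curve (true vs false positive rate)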



Controls

We permuted the response vector 100 times and, for each permutation, we computed a model and evaluated it by plotting the performance curves (a sketch is given below).
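A minimal sketch of such a control, here with the logistic regression model and the assumed x and y objects from the sketches above, could look like this.

  library(glmnet)

  perm_curves <- lapply(1:100, function(i) {
      y_perm <- sample(y)                       # permuted response vector
      cv <- cv.glmnet(x, y_perm, family = "binomial", alpha = 1,
                      type.measure = "deviance", nfolds = 10,
                      keep = TRUE)              # keep the held-out fold predictions
      # cross-validated scores at the best lambda, to be fed to ROCR as above
      list(scores = cv$fit.preval[, which.min(cv$cvm)], labels = y_perm)
  })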

