McQSAR

Frequently asked questions

Is there a version available for operating system X?
McQSAR was developed on Linux, Mac OS X, and MS Windows. Executables for other operating systems can be arranged upon request. Don't be shy; contact the author.
How can I control the number of descriptors in the resulting models?
As of version 1.2.0, McQSAR provides the options 'maxNDescriptors' and 'minNDescriptors', which can be used to constrain the number of descriptors used in the equations. Note that the length of the equations is also affected by the cross-validation parameters: the fewer compounds you leave out, the longer the equations tend to get. Please remember, however, that leave-one-out (LOO) is a dubious method.
McQSAR reports the Q2 value, but I also want to know the R2.
As of version 1.2.0, McQSAR reports the correlation index in verbose output mode. All users are recommended to upgrade to version 1.2.0+.
The predictive R² (the square of Pearson's product-moment correlation coefficient between the experimental and the predicted activity values) for the best-so-far equations, evaluated against the validation set (if provided), is reported on the command line in verbose mode, and in the output file for the last generation upon program termination. The validation set is provided using the switch '-v your_dataset_filename'. The validation set can coincide with the calibration set.
The output for the last generation of a completed run will look like:
// ******* Generation XXX ******* Y = ... score: 0.36 R2: 0.58
where 'score: 0.36' is the Q2 value, and 'R2: 0.58' is the R2 value for the validation data set.
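For readers who want to check a reported R2 by hand, the definition above (squared Pearson correlation between experimental and predicted activities) can be sketched in a few lines of Python. The function names and the toy numbers are illustrative assumptions, not part of McQSAR.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)

def predictive_r2(experimental, predicted):
    """Square of the Pearson correlation between experimental and predicted values."""
    return pearson_r(experimental, predicted) ** 2

# Toy validation set (made-up numbers, for illustration only).
experimental = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 3.0, 2.0, 4.0]
print(f"R2 = {predictive_r2(experimental, predicted):.2f}")  # prints "R2 = 0.64"
```

Because this is a squared correlation, it is insensitive to a constant offset or scaling of the predictions, which is one reason the R2_0 values discussed in the next question are useful as a complement.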
What are validation R2_0 values?
R2_0 is the coefficient of determination of a linear regression of the actual against the predicted values (or vice versa) through the origin, that is, with the intercept of the linear regression forced to zero. For an ideal model, the slope of such a fit is exactly one. Figure 1 and the related text in Golbraikh & Tropsha (2002) Beware of q2! explain it well.
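As a rough illustration, a regression through the origin and its coefficient of determination can be computed as follows. This is a minimal sketch of one common formulation (slope k = Σxy / Σx², R2_0 = 1 - SS_res / SS_tot); whether McQSAR computes the value in exactly this way is an assumption, and the function name is hypothetical.

```python
def r2_through_origin(x, y):
    """Fit y = k * x (no intercept) and return (k, R2_0).

    R2_0 is computed as 1 - SS_res / SS_tot, where SS_tot is taken
    around the mean of y.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    sum_xx = sum(v * v for v in x)
    if sum_xx == 0:
        raise ValueError("x must contain a non-zero value")
    k = sum(a * b for a, b in zip(x, y)) / sum_xx  # least-squares slope through origin
    mean_y = sum(y) / len(y)
    ss_res = sum((b - k * a) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    if ss_tot == 0:
        raise ValueError("R2_0 is undefined for constant y")
    return k, 1.0 - ss_res / ss_tot

# Toy data (made-up): predictions that are offset by a constant from the actual values.
predicted = [1.0, 2.0, 3.0]
actual = [2.0, 3.0, 4.0]
k, r2_0 = r2_through_origin(predicted, actual)
print(f"slope = {k:.3f}, R2_0 = {r2_0:.3f}")  # prints "slope = 1.429, R2_0 = 0.786"
```

Swapping the two arguments gives the "vice versa" variant. In the toy data the ordinary squared correlation is exactly 1, yet the slope is far from one and R2_0 is clearly below 1, which is the kind of discrepancy these values are meant to expose.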
The McQSAR publication talks about "alignments" and selecting the best one with respect to compounds. Does that mean conformation?
In the paper, the word "conformation" refers to any instance of a compound, including different protonation and tautomeric states that might show variance even in 2D descriptors. These "conformers" are recognized based on the name of the data row only (which has nothing to do with the actual chemical representation). An "alignment" then is a selection of "conformers", one for each compound.
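The idea of an alignment as "one conformer per compound" can be sketched as below. The sketch assumes that data rows sharing an identical name belong to the same compound; the row layout and function names are hypothetical and do not reflect the McQSAR input format or its search strategy. Exhaustive enumeration is used purely for illustration.

```python
from collections import defaultdict
from itertools import product

def group_by_name(rows):
    """Group descriptor rows by row name; each group holds one compound's conformers."""
    groups = defaultdict(list)
    for name, descriptors in rows:
        groups[name].append(descriptors)
    return dict(groups)

def enumerate_alignments(groups):
    """Yield every alignment as a dict {compound name: index of chosen conformer}."""
    names = sorted(groups)
    index_ranges = (range(len(groups[name])) for name in names)
    for choice in product(*index_ranges):
        yield dict(zip(names, choice))

# Toy data set (made-up): compound A has 2 instances, B has 1, C has 3.
rows = [
    ("A", [0.1, 1.2]), ("A", [0.2, 1.1]),
    ("B", [0.5, 0.9]),
    ("C", [1.0, 0.3]), ("C", [1.1, 0.2]), ("C", [0.9, 0.4]),
]
alignments = list(enumerate_alignments(group_by_name(rows)))
print(len(alignments))  # prints 6, i.e. 2 * 1 * 3 possible selections
```

The number of alignments is the product of the conformer counts, so it grows quickly with the size of the data set.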
What does the "leverage" represent in the program output for the last generation?
Leverage is a distance-to-model measure that can be used to assess whether a prediction for a new compound is within the applicability domain of a (linear) QSAR model. The computation of leverage requires that the number of compounds in the training set is known and the inverse of the covariance matrix of the training set descriptors is available; see for example Eriksson et al. (2003) Methods for Reliability and Uncertainty Assessment and for Applicability Evaluations of Classification- and Regression-Based QSARs. Environ. Health Perspect. 111, 1361-1375, page 1366.
The leverage data is used in prediction mode, but it must be printed out when generating equations.
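To show why both the training set size and the inverse covariance matrix are needed, here is a minimal sketch of one standard formulation of leverage for a linear model with an intercept: h = 1/n + (x - mean)ᵀ S⁻¹ (x - mean) / (n - 1), where S is the sample covariance matrix of the training descriptors. That McQSAR uses exactly this expression is an assumption; all names below are hypothetical.

```python
def mean_vector(rows):
    """Column means of a list of descriptor rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance_matrix(rows):
    """Sample covariance matrix (denominator n - 1) of the descriptor columns."""
    n = len(rows)
    means = mean_vector(rows)
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    p = len(means)
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def leverage(x, means, inv_cov, n):
    """Leverage of descriptor vector x relative to a training set of n compounds."""
    d = [a - m for a, m in zip(x, means)]
    p = len(d)
    quad = sum(d[i] * inv_cov[i][j] * d[j] for i in range(p) for j in range(p))
    return 1.0 / n + quad / (n - 1)

# Toy training set with a single descriptor (made-up numbers).
training = [[1.0], [2.0], [3.0], [4.0], [5.0]]
means = mean_vector(training)
inv_cov = invert(covariance_matrix(training))
print(f"{leverage([5.0], means, inv_cov, len(training)):.2f}")  # prints "0.60"
```

Only the descriptor means, the inverse covariance matrix, and n need to be stored with an equation to evaluate the leverage of a new compound later, which fits the note above about printing this data when equations are generated. A useful sanity check is that the leverages of the training compounds sum to the number of descriptors plus one.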
Should I get a y-intercept value in the returned linear model?
Not necessarily. A mutation can delete the constant term from a linear equation, and the resulting equation is still mathematically valid and predictive.