Features
All user-configurable parameters are stored in an ASCII text file. A default parameter file (a template for you to modify) is obtained by executing McQSAR with the '--writeparams' option (see tutorial).
-
Supported input file types are
- Plain ASCII text files containing a matrix of numerical values. Column headers should be the descriptor names. Row headers (the text in the first column) are taken to be the molecule names. Supported delimiters include space, tab, comma, semicolon and '|'; the delimiter is detected automatically.
- MDL SD file format, versions V2000 and V3000 (the latter is less extensively tested). The structures are not currently handled, only the descriptor values are read from the data fields.
- GOLPE ASCII text file format (as output by ALMOND).
- Dragon text file format.
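As an illustration of the plain-text layout described above, the following is a minimal sketch of delimiter detection and parsing. It is not McQSAR's own reader: the function names, the candidate order, and the assumption that the header row starts with a label cell for the molecule-name column are all hypothetical.

```python
def detect_delimiter(lines, candidates=("\t", ",", ";", "|", " ")):
    """Return the first candidate that occurs equally often (and at least once) on every line."""
    for delim in candidates:
        counts = {line.count(delim) for line in lines}
        if len(counts) == 1 and counts.pop() > 0:
            return delim
    raise ValueError("could not detect a delimiter")


def read_descriptor_table(text):
    """Parse a delimited descriptor matrix.

    Returns (descriptor_names, rows), where rows maps each molecule name
    to its list of float descriptor values.
    Assumption: the first header cell labels the molecule-name column.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    delim = detect_delimiter(lines)
    header = lines[0].split(delim)
    names = header[1:]
    rows = {}
    for line in lines[1:]:
        cells = line.split(delim)
        rows[cells[0]] = [float(value) for value in cells[1:]]
    return names, rows


example = "name;logP;MW\nmol1;1.5;100\nmol2;2.0;150\n"
descriptor_names, table = read_descriptor_table(example)
```

This simple counting approach breaks down for space-aligned columns or molecule names that contain the delimiter; a production reader would need to handle those cases.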
-
Data preprocessing
- Remove descriptors with missing values and non-informative descriptors (i.e. columns with zero standard deviation).
- Remove inter-correlated descriptors (i.e. remove redundancy).
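The two preprocessing steps above could look roughly like the sketch below. The greedy filtering order and the correlation threshold (`max_abs_corr=0.95`) are assumptions made for illustration; the source does not state which criterion McQSAR uses.

```python
import numpy as np


def preprocess(X, max_abs_corr=0.95):
    """Return the indices of the descriptor columns kept after filtering."""
    X = np.asarray(X, dtype=float)
    # Step 1: drop columns with missing values (NaN) or zero standard deviation.
    keep = [j for j in range(X.shape[1])
            if not np.isnan(X[:, j]).any() and X[:, j].std() > 0.0]
    if not keep:
        return []
    # Step 2: greedy redundancy filter. A column is kept only if its absolute
    # Pearson correlation with every previously kept column is below the threshold.
    corr = np.abs(np.atleast_2d(np.corrcoef(X[:, keep], rowvar=False)))
    selected = []
    for a in range(len(keep)):
        if all(corr[a, b] < max_abs_corr for b in selected):
            selected.append(a)
    return [keep[a] for a in selected]


X = np.array([
    [1.0, 2.0, 7.0, 3.0, np.nan],
    [2.0, 4.0, 7.0, 1.0, 1.0],
    [3.0, 6.0, 7.0, 2.0, 2.0],
    [4.0, 8.0, 7.0, 5.0, 3.0],
])
kept_columns = preprocess(X)
```

In this example column 1 is an exact multiple of column 0, column 2 is constant, and column 4 has a missing value, so only columns 0 and 3 remain.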
-
QSAR model generation
- User-configurable pool of low-level building blocks of mathematical terms for the generated models.
- The objective function is a leave-d%-out cross-validation procedure averaged over B repetitions, where d and B can be set by the user. This has several advantages over traditional leave-one-out cross-validation; see Golbraikh A and Tropsha A (2002) Beware of q2! J. Mol. Graph. Model. 20, 269-276.
- A validation data set can be used to monitor the evolution of the models. The predictive R2 (the square of the Pearson product-moment correlation coefficient between the experimental and the predicted activity values) is reported for the best QSAR models found during the run. The validation data is not used to guide the evolution of the QSAR models in any way; the validation feature serves solely to monitor the progress of the GA.
- The population of models generated in a previous run can be used as the initial population for a new run, i.e. runs can be restarted (with altered run parameters if you so wish).
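To make the cross-validation objective and the predictive R2 more concrete, here is a minimal sketch for an ordinary least-squares model. The exact scoring formula used by McQSAR is not given in this section, so the q2 definition below (one minus the ratio of the held-out prediction error to the held-out variance around the training mean), the random splitting scheme, and all function and parameter names are assumptions.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with an intercept; returns [intercept, coefficients...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    return coef[0] + X @ coef[1:]


def repeated_leave_d_out_q2(X, y, d_percent=20.0, repetitions=50, seed=0):
    """Average q2 over `repetitions` random splits that hold out d% of the compounds."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_out = max(1, int(round(n * d_percent / 100.0)))
    scores = []
    for _ in range(repetitions):
        order = rng.permutation(n)
        test, train = order[:n_out], order[n_out:]
        coef = fit_linear(X[train], y[train])
        press = np.sum((y[test] - predict_linear(coef, X[test])) ** 2)
        total = np.sum((y[test] - y[train].mean()) ** 2)
        scores.append(1.0 - press / total)
    return float(np.mean(scores))


def predictive_r2(y_observed, y_predicted):
    """Square of the Pearson correlation between observed and predicted activities."""
    return float(np.corrcoef(y_observed, y_predicted)[0, 1] ** 2)


rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = 1.0 + 2.0 * X[:, 0] - X[:, 1]          # noise-free toy activities
q2 = repeated_leave_d_out_q2(X, y, d_percent=20.0, repetitions=25)
```

Because the toy data are noise-free and exactly linear, the averaged q2 is essentially 1; real data will give lower values, and the spread across repetitions is itself informative.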
-
Prediction of activities
- For compounds that have multiple conformers or representations, the average predicted activity is reported. If the energies of the conformers are known, the Boltzmann-weighted ensemble average is reported.
- If the input contains multiple QSAR models that predict the same activity (resolved by variable name), the average (and standard deviation) of the predictions of all those models is reported.
- Leverage values are computed. For an activity value predicted with a linear equation, the leverage value indicates whether the compound (more precisely, its descriptor vector) lies within the applicability domain of the model. The reported value is the leverage value minus the leverage warning level, 3*k/n, where k is the number of descriptors used in the equation plus one and n is the number of calibration set compounds. Thus, a negative relative leverage value indicates that the compound is within the applicability domain of the model.
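The relative leverage and the Boltzmann-weighted average described above can be sketched as follows. The leverage uses the standard hat-matrix form h = x'(X'X)^-1 x with an intercept column, which matches the "descriptors plus one" definition of k; the energy units (kcal/mol), the temperature of 298.15 K, and the function names are assumptions, not documented McQSAR settings.

```python
import numpy as np


def relative_leverage(X_cal, x_new):
    """Leverage of x_new minus the warning level 3*k/n (negative: inside the domain)."""
    X = np.column_stack([np.ones(len(X_cal)), X_cal])   # intercept + descriptors
    n, k = X.shape                                      # k = descriptors + 1
    x = np.concatenate([[1.0], np.atleast_1d(x_new)])
    leverage = x @ np.linalg.pinv(X.T @ X) @ x
    return float(leverage - 3.0 * k / n)


def boltzmann_average(predictions, energies, temperature=298.15,
                      gas_constant=0.0019872041):
    """Boltzmann-weighted mean of per-conformer predictions.

    Assumption: energies are in kcal/mol, so gas_constant is in kcal/(mol*K).
    """
    p = np.asarray(predictions, dtype=float)
    e = np.asarray(energies, dtype=float)
    # Shift by the minimum energy for numerical stability; the weights' ratios are unchanged.
    w = np.exp(-(e - e.min()) / (gas_constant * temperature))
    return float(np.sum(w * p) / np.sum(w))


X_cal = np.arange(10.0).reshape(-1, 1)       # ten calibration compounds, one descriptor
inside = relative_leverage(X_cal, [4.5])     # at the calibration mean
outside = relative_leverage(X_cal, [30.0])   # far outside the calibration range
```

Here n = 10 and k = 2, so the warning level is 0.6; the compound at the calibration mean has a leverage of 0.1 and hence a relative leverage of -0.5, while the distant compound gets a positive value.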