Linear Regression Models
General remarks
Moloc can generate linear regression
models on the basis of
topological similarity descriptors
in a largely automated way. The prerequisites for generating such
models are:
- Structures of the model (training) compounds, e.g. in
the form of an .sd file. The structures should be appropriately
protonated.
- A table-file containing the names of the model structures
as well as (experimental) values of the quantity which is to
be modeled, preferably with additional error values.
- Alternatively, the model data can be directly contained
as data fields in the .sd file.
- The model values should be on a reasonable scale, such that
error bars are of comparable size over the whole range of
experimental values. Thus, equilibrium constants, IC50 values,
etc. should be taken on a logarithmic scale.
- Measurements often include cases in which no value could
be obtained (e.g. below measuring sensitivity), and these may
constitute a sizable fraction of all items. To include these
cases, a simplified indicator model may be considered with just
two data values: -1 (inactive) and +1 (active). The prediction
is then 'active' for a positive return value.
- To obtain more specific predictions from indicator models,
a set of models with varying thresholds
between 'active' and 'inactive' may be envisaged.
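The data preparation described in the list above can be sketched in a few lines of Python. This fragment is only an illustration and is not part of Moloc: the function names, the nanomolar unit, the compound names, and the threshold values are assumptions chosen for the example, as is the decision to treat unmeasurable compounds as inactive.

```python
import math

def to_pic50(ic50_nm):
    """Convert an IC50 in nM (assumed unit) to pIC50 = -log10(IC50 in mol/l)."""
    return -math.log10(ic50_nm * 1e-9)

def indicator(value, threshold):
    """Return +1 (active) or -1 (inactive); None means no value was measured."""
    if value is None:
        # Assumption: "no value obtained" is encoded as inactive.
        return -1
    return 1 if value >= threshold else -1

# Hypothetical measurements; None marks a compound below measuring sensitivity.
ic50_nm = {"cpd_a": 10.0, "cpd_b": 1000.0, "cpd_c": None}
pic50 = {name: (None if v is None else to_pic50(v)) for name, v in ic50_nm.items()}

# One indicator data set per threshold gives a graded picture of activity.
thresholds = [5.5, 7.0]
indicator_sets = {
    t: {name: indicator(v, t) for name, v in pic50.items()} for t in thresholds
}
```

Each entry of `indicator_sets` could then serve as the data column of one indicator model, so that a test compound predicted 'active' at the lower threshold but 'inactive' at the higher one is placed between the two.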
The most straightforward type of model is the minimal
model, which minimizes the loading defect of a test
compound using a similarity model built from a minimal number
of training compounds. The loading defect is the
fraction of the pharmacophoric content of the test compound that
cannot be described by the set of training compounds.
Alternatively, a partitioned model can be
generated. In this case, a similarity analysis of the model
compounds is used to partition the compounds into
several submodels. A test compound is then assigned a value
for each submodel, together with the corresponding confidence
limits and loading defects.
In this tutorial we again make use of the files hdrqn.sd
and hdrqn.lst, which can be found in the moloc/dat
directory. They contain the model structures and plasma protein
binding data, respectively.
Generation of Models
- Start Moloc and read in the structures of file hdrqn.sd,
most straightforwardly by the command
Moloc -d hdrqn.sd.
- Enter the library menu 'lib'.
- Read in the experimental data
(option 'e/n', back to 'lib').
- Enter the 'linear models' submenu 'l'.
- If your library of structures contains duplicates, they may
be removed from the library with the prune option
'p'.
- To produce a topological pharmacophore similarity model, now
choose option 't':
- You will be asked for the library of structures to yield
the model.
- Then specify the experimental data (#1) and,
optionally, the error values (#2, missing for our data) and
a file name which is used for the .tpr (topological
pharmacophore description) and .sml (similarity) files, which
will be produced in the process. Finally, the program asks for
a generic model name (we choose 'hsa'), which will be used to
write an .mdl model file.
- The next choice affects the model type. The default choice
(minimal topological pharmacophore) is the most straightforward and
is recommended.
- Now, the program asks for a generic error value.
This value is asked for in our case, because our data did not
contain error values. It should represent a reasonable measure
of the accuracy of the data (e.g. 0.4, in our case).
- Finally, one can modify the parameters for the pharmacophore
and similarity calculations, and specify the maximum number of
agons a structure may contain (we choose 9).
- The program now calculates topological pharmacophore
descriptions for each model compound as well as the corresponding
similarity matrix. The results are stored in two files with the
same name and extensions .tpr and .sml, respectively.
- Minimal Topological Pharmacophore Model
- This type of model tries to minimize the loading defect of
the test compound by a model built with a minimum number of
training compounds.
- The program asks for a model name and writes two files
with that name: a list of experimental values (extension .xvl),
and a model file (extension .mdl).
- For this type of model a validation run can be performed with
option 'c'.
- Partitioned Models: Selection of partitioning
- In this case, Moloc generates and displays a
hierarchical tree, representing the similarity matrix. The green
bonds represent possible (sub)models. The left-hand vertices
are labeled with a partition label and, after the comma, with a
number giving the unexplained part of the model's variance. The
right-hand vertices carry the similarity level of branch
splitting.
- With option 'l' a similarity level can be chosen that governs
partitioning of the model. A value below 0.29 yields a single
model. For values between 0.29 and 0.468 two models are generated,
and so on.
- We select a value of 0.50 to obtain three submodels. In this
rather artificial model, this choice is arbitrary and just for
demonstration purposes.
- Upon exit from this menu the three model partitions hsa_2,
hsa_3, and hsa_4 are generated. For each of those, two files
(with extensions .cfd and .mtx) are generated. Optionally, .tab
files can be produced in addition. These may serve to determine
model parameters by third party software. Finally a .mdl file is
written.
- Models are kept in memory and can also be reloaded later from
file with option 'g'.
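The relation between the chosen similarity level and the number of submodels can be illustrated with a small Python sketch. It is not taken from Moloc: the function name is hypothetical, and reading "and so on" as one additional submodel per branch-splitting level at or below the chosen level is an assumption. Only the two split levels quoted in this tutorial (0.29 and 0.468) are used.

```python
from bisect import bisect_right

def count_submodels(level, split_levels):
    """Number of submodels for a chosen similarity level.

    Assumes each branch-splitting level at or below `level` adds one submodel.
    """
    return 1 + bisect_right(sorted(split_levels), level)

# Split levels quoted in this tutorial; any higher levels of the tree are omitted,
# so the count is only meaningful for levels below the next (unlisted) split.
hsa_split_levels = [0.29, 0.468]

for level in (0.20, 0.40, 0.50):
    print(level, count_submodels(level, hsa_split_levels))
```

Under these assumptions the three levels give one, two, and three submodels, which agrees with the choice of 0.50 for three submodels made above.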
Model Results for Structures
- Read the structures into Moloc (we just take the training set
which is still in memory).
- Go to the 'models' menu 'lib/l' (where we
previously generated our 'hsa' model with option
't').
- Our 'hsa' model is still in memory (otherwise it could be read
in from file with option 'g').
- Set model options 'o'.
- From the output-option selector we choose
all and optimal model fractions. This yields
results for every model fraction (three numbers for each) together
with the (repeated) fraction which is rated optimal by the
program.
- With the weighting exponent the algorithm for the selection
of the optimal fraction can be modified.
- If the agon limit is set, the program omits structures that
have more agons than the maximum in the training set.
- Now, choose the evaluation option 'v', and
select structures (all).
- For larger sets, give a name for the output file [.tab], which
then contains a TAB-separated table.
- The screen output is in rather condensed form: vv+x,er;ld,
where vv+x is the value vv with decimal exponent x, er is the error
in the same units as vv, and ld is the loading defect in %.
Thus, -27-1,11;12 signifies a value of -2.7 with an error of 1.1
and a loading defect of 12%.
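For post-processing, the condensed screen format can be decoded with a short script. The following Python sketch is an assumption-based illustration, not a Moloc tool: the names `ModelResult` and `parse_result` are hypothetical, and the pattern assumes integer digits for vv, er, and ld, a signed exponent, and that the error shares the decimal exponent of the value.

```python
import re
from typing import NamedTuple

class ModelResult(NamedTuple):
    value: float
    error: float
    loading_defect_pct: int

# Mantissa vv, signed decimal exponent x, error er, loading defect ld in %.
_PATTERN = re.compile(r"^([+-]?\d+)([+-]\d+),(\d+);(\d+)$")

def parse_result(token: str) -> ModelResult:
    """Parse one condensed screen-output token such as '-27-1,11;12'."""
    match = _PATTERN.match(token.strip())
    if match is None:
        raise ValueError(f"not in vv+x,er;ld form: {token!r}")
    vv, x, er, ld = match.groups()
    # The error is given in the same units as vv, so it is scaled by the same exponent.
    return ModelResult(float(f"{vv}e{x}"), float(f"{er}e{x}"), int(ld))

result = parse_result("-27-1,11;12")
print(result)
```

For the example token from the text this yields a value of -2.7, an error of 1.1, and a loading defect of 12%. For larger sets the TAB-separated output file is likely the more convenient source.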
Evaluation can also be performed within the 'dTp'
menu (option 'v') to facilitate structure
evaluation during design. In this case, options are taken as
set in the present menu.
For batch mode operation use the program
Mtprmp, where results can also be added as
additional data fields in the .sd file.
The Program 'Mdls'
Minimal linear models can be located in a subdirectory, called 'mdl',
of the installation directory 'moloc'. If the .mdl files are equipped
with additional key lines, these models can be directly called with
the program 'Mdls'. Tag lines must be at the beginning of the .mdl file.
The following tags, located at the beginning of a line, are understood:
- MODEL_TYPE minimal_topological_pharmacophore
This tag line is automatically written when the model is generated.
- MODEL_DESC m My Pet Model
This tag indicates a short description of the model. The letter 'm'
(followed by a blank) is used to specify the model in the Mdls call.
Make sure that no two models have the same letter. Without this tag
line the model cannot be addressed by 'Mdls'.
- MODEL_HELP Help text
This help text will be presented when calling 'Mdls ?'. There may be
several help tag lines.
- MODEL_TAG MyModel
Model results are listed under columns with this header. If this tag
is missing, the name of the .mdl file is used as column header.
- MODEL_DIR datadir
In case the model-data files (.tpr, .sml, .xvl) are not located in the
same 'mdl' directory, this tag serves to specify the corresponding
directory path.
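When several models are installed, it can be useful to check the tag lines before calling 'Mdls'. The following Python sketch reads the leading tag lines of a model file and reports model letters that are used more than once. It is only an illustration: the helper names are hypothetical, and it assumes that the tag block ends at the first line that does not start with one of the tags listed above, since the remaining layout of an .mdl file is not described here.

```python
from collections import defaultdict

KNOWN_TAGS = ("MODEL_TYPE", "MODEL_DESC", "MODEL_HELP", "MODEL_TAG", "MODEL_DIR")

def read_tag_lines(lines):
    """Collect leading tag lines; stop at the first line that is not a known tag."""
    tags = defaultdict(list)
    for line in lines:
        key, _, rest = line.rstrip("\n").partition(" ")
        if key not in KNOWN_TAGS:
            break
        tags[key].append(rest)
    return dict(tags)

def model_letter(tags):
    """Return the letter from MODEL_DESC, or None if the model is not addressable."""
    desc = tags.get("MODEL_DESC")
    return desc[0].split(" ", 1)[0] if desc else None

def duplicate_letters(models):
    """Map each letter used by more than one model file to those file names."""
    by_letter = defaultdict(list)
    for name, tags in models.items():
        letter = model_letter(tags)
        if letter is not None:
            by_letter[letter].append(name)
    return {k: v for k, v in by_letter.items() if len(v) > 1}

# Hypothetical header built from the tag lines described above.
header = [
    "MODEL_TYPE minimal_topological_pharmacophore",
    "MODEL_DESC m My Pet Model",
    "MODEL_HELP Help text",
    "MODEL_TAG MyModel",
    "(model data follows; its layout is not described here)",
]
tags = read_tag_lines(header)
print(model_letter(tags), tags.get("MODEL_TAG"))
```

In practice `read_tag_lines` would be applied to each open .mdl file in the 'mdl' directory, and a non-empty result from `duplicate_letters` would point to models whose MODEL_DESC letters need to be changed.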
The command 'Mdls ?' lists the available models with associated help
text (if available), in our case:
...
-m<number> My Pet Model [0]
Help text
...
For a set of structures contained in an .sd file 'strct.sd' model
predictions can be calculated with the command:
Mdls -m0 strct.sd
which produces a result file 'strct.txt'.