Descriptors for Pharmacophoric Properties
General remarks
Pharmacophoric properties of molecules are described in Moloc
mainly by topological pharmacophores
(-> theory). From these,
two types of descriptors have been derived:
moment descriptors and similarity descriptors.
This tutorial illustrates how to calculate them for the example
of a set of substituted hydroxyquinolone compounds for which
plasma protein binding data are known (J.Med.Chem.40,4053,1997).
The files containing the structures (hdrqn.sd) and the
experimental data (hdrqn.lst) can be found in the
moloc/dat directory.
Generation of Topological Moment Descriptors
- In batch mode, simply issue the command
Msrfvl -m3 hdrqn.sd which will produce a file
hdrqn.srf.
- This file contains all 69 available topological pharmacophore
moments (up to 3rd order).
- Choosing the parameter '-m2', '-m1', or '-m0' will produce
34, 14, or 4 descriptors, respectively, of progressively lower
order.
- Further descriptors can be obtained by setting further
parameters. Issuing the command Msrfvl without
parameters will produce a list of options.
- Interactive mode:
- Start Moloc with the parameter hdrqn.sd,
Moloc hdrqn.sd.
- Go to the pharmacophore menu 'php'.
- Select 'w' (write topological
pharmacophore file for all atom structures).
- From the pop-up selector choose the structures (e.g.
hit the 'all' button).
- Now choose 'tab' from the format
selector.
- The next selector will ask you up to which order you
need the descriptors (see above).
- Then, upon specifying a file name, the program will
calculate the requested moments.
These descriptors can be used to derive statistical models
for various molecular properties which relate to pharmacophoric
molecular features.
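For scripted use, the batch call described above can be wrapped in a small helper. The following Python sketch only assembles the command line. The function name msrfvl_command and the lookup table are illustrative; the descriptor counts are the ones quoted in this tutorial, and the sketch assumes Msrfvl is on the path when the command is actually run.

```python
# Number of moment descriptors per maximum order, as quoted in this tutorial.
DESCRIPTORS_PER_ORDER = {0: 4, 1: 14, 2: 34, 3: 69}


def msrfvl_command(structure_file, order=3):
    """Return the Msrfvl argument list for the given maximum moment order."""
    if order not in DESCRIPTORS_PER_ORDER:
        raise ValueError("order must be 0, 1, 2 or 3")
    return ["Msrfvl", "-m%d" % order, structure_file]


if __name__ == "__main__":
    cmd = msrfvl_command("hdrqn.sd", order=2)
    print(" ".join(cmd), "->", DESCRIPTORS_PER_ORDER[2], "descriptors expected")
    # To run it (assumes Moloc's bin directory is on the path):
    #   import subprocess; subprocess.run(cmd, check=True)
```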
Topological Similarity Descriptors
In order to calculate similarity descriptors a similarity matrix
for a set of molecules has to be calculated beforehand.
One particular example of generating such a matrix based on the
concept of topological pharmacophores
is provided in the companion tutorial on
Diversity Analysis. It is assumed here that the user is
already familiar with that tutorial.
Keep in mind that the set of molecules providing
the similarity matrix is an integral part of the similarity
descriptors. The descriptors change with the set and can account
for the variations that occur within it.
As of February 2005, the next sections are obsolete (see
recommended new version).
The last section of this page describes how to use old
model files.
Generation of Files with Similarity Descriptors
- Run the batch program Mtprgn, which generates a topological
pharmacophore description of the structures, by issuing the command
Mtprgn -s hdrqn.sd. This generates the file
hdrqn.tpr of topological pharmacophores.
To get a usage description for most Moloc batch
programs, just issue the program name (in our case
Mtprgn). To run a batch program, either Moloc's bin
directory (e.g. 'c:\apps\moloc\bin') must be
in the path, or the user must type the full path
(e.g. c:\apps\moloc\bin\Mtprgn).
- To generate the similarity matrix issue the command
Mtprsml -t9 hdrqn.tpr which generates the
similarity file hdrqn.sml. However, the following procedures can
be followed with any similarity file (e.g. generated with the
Tanimoto similarity algorithm on DAYLIGHT fingerprint descriptors),
as long as it adheres to the same file format.
- Now start Moloc and read in the structure file hdrqn.sd
('.../g/m'),
preferably after first setting read-in entries invisible
('.../g/o'). Return to the main menu.
- Read in the experimental data from the file hdrqn.lst
in menu 'lib/n'. Provide an arbitrary non-blank
library name when the program asks for one. The program then
collects the entries whose names it finds in the list into a
library and (optionally) assigns the data found in the file to the
corresponding entries.
- Go to the diversity menu 'mch/d'.
- Select the option 'r' (repeat cluster analysis)
and specify hdrqn.sml as similarity file to be reexamined
(use complete linkage as before). This will put you into the
cluster analysis menu clan.
- To obtain an impression of the experimental data you may color
the tree (menu option 'c'), which causes red
coloring for entries with small values and blue coloring for
entries with large values. The chances of obtaining a reasonable
model are good when the color does not vary significantly within a branch.
- Click menu option '!' (which may also read
'd' or 'D') and set the mode
slider to 2 which causes 'D' to appear on the
menu bar.
- The extended help (obtained on Moloc's text port by choosing
'D' with the Ctrl-key pressed down) will describe
the calculations performed.
- Pick the main-stem bond (basis) of the tree to initiate the
calculation, which may take some time for high-dimensional systems.
Our 59-entry case takes just a few seconds. However, since the
eigenvalues of the similarity matrix are calculated, larger systems
require considerably more time.
- At the end of the calculation Moloc shows a graph of the
eigenvalues of the similarity matrix and indicates that it has
written two files: dsc.tab and dsc.mtx.
- It is advisable to rename these files (e.g. to
hdrqn_dsc.tab and hdrqn_dsc.mtx) in order to
avoid overwriting by future calculations.
- In addition, Moloc indicates that it took 15 descriptors
(dsc's) and that a weight wgt = 45.68 was found. This last
number is the value of the zeroth eigenvalue, which is
omitted from the graph because it usually is far off scale from
the rest. It indicates that on average every structure
has a 77% portion (45.68 of 59) contributed by this zeroth
eigenvector.
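The 77% figure can be checked with a short calculation. The eigenvalues of a matrix sum to its trace, so if every structure has a self-similarity of 1 (an assumption here; the tutorial does not state the diagonal explicitly), the trace of a 59 x 59 similarity matrix is 59 and the share of the leading eigenvalue is 45.68/59. The Python sketch below verifies the arithmetic and illustrates the same ratio on a made-up 3 x 3 matrix using a plain power iteration. It is not Moloc's algorithm, and the toy matrix is invented.

```python
def leading_eigenvalue(matrix, iterations=200):
    """Estimate the largest eigenvalue of a symmetric matrix with
    positive entries by power iteration."""
    n = len(matrix)
    vec = [1.0] * n
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in nxt) ** 0.5
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the (unit-length) converged vector
    product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    return sum(v * p for v, p in zip(vec, product))


def leading_share(eigenvalue, n_structures):
    """Share of the trace carried by one eigenvalue, assuming a unit diagonal."""
    return eigenvalue / n_structures


if __name__ == "__main__":
    # Values quoted in the tutorial: wgt = 45.68 for 59 structures.
    print("%.3f" % leading_share(45.68, 59))  # prints 0.774

    # Invented toy similarity matrix with unit diagonal.
    toy = [[1.0, 0.8, 0.6],
           [0.8, 1.0, 0.7],
           [0.6, 0.7, 1.0]]
    lam = leading_eigenvalue(toy)
    print("toy leading eigenvalue %.3f, share %.3f" % (lam, lam / len(toy)))
```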
The two new files, dsc.tab and dsc.mtx,
provide the basis for the generation of statistical models based
on the molecular description underlying the calculation of the
used similarity matrix (in our case the topological pharmacophore
description).
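As noted in the steps above, any similarity file in the right format can be used, for instance one derived from Tanimoto similarities on fingerprints. The Python sketch below shows the Tanimoto coefficient itself on made-up bit sets. The fingerprints and molecule names are invented, and writing the result in Moloc's similarity file format is not covered because that format is not described here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def similarity_matrix(fingerprints):
    """Return (names, matrix) with pairwise Tanimoto similarities, names sorted."""
    names = sorted(fingerprints)
    matrix = [[tanimoto(fingerprints[r], fingerprints[c]) for c in names]
              for r in names]
    return names, matrix


if __name__ == "__main__":
    # Invented fingerprints, for illustration only.
    toy = {"mol_a": {1, 4, 7, 9}, "mol_b": {1, 4, 8}, "mol_c": {2, 3}}
    names, matrix = similarity_matrix(toy)
    for name, row in zip(names, matrix):
        print(name, " ".join("%.2f" % value for value in row))
```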
Calculate Model Parameters
Given the file of similarity descriptors, hdrqn_dsc.tab,
and a set of corresponding target values, hdrqn.lst,
for which we want to obtain a predictive model, we may utilize any
of the standard tools, such as PCR (principal component regression),
PLS (partial least squares), or a neural net algorithm to produce
such a model.
In most cases ordinary least squares will not be particularly
suited because the large number of descriptors may lead to
collinearity problems. Once a model has been obtained and its
predictive capabilities have been established by cross-validation
methods, the next step is to apply it to compounds not contained
in the set that was used to generate the model. For this purpose
the matrix hdrqn_dsc.mtx is useful. Furthermore, a few
companion programs are provided with Moloc to facilitate prediction.
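Cross-validation results are commonly summarised by q**2, the leave-one-out analogue of r**2. The following Python sketch computes both for a single-descriptor least-squares fit on made-up numbers, using one common definition of q**2 (1 minus the predictive residual sum of squares over the total sum of squares). A real model would use PCR or PLS on the full descriptor table; the data and function names here are purely illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def total_sum_of_squares(ys):
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)


def r_squared(xs, ys):
    """Fraction of variance explained when fitting on all data."""
    slope, intercept = fit_line(xs, ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / total_sum_of_squares(ys)


def q_squared(xs, ys):
    """Leave-one-out cross-validated analogue of r_squared."""
    press = 0.0
    for i in range(len(xs)):
        # Refit without entry i, then predict entry i.
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    return 1.0 - press / total_sum_of_squares(ys)


if __name__ == "__main__":
    # Invented descriptor and target values.
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [1.1, 1.9, 3.2, 3.8, 5.1]
    print("r**2 = %.3f, q**2 = %.3f" % (r_squared(xs, ys), q_squared(xs, ys)))
```

For an ordinary least-squares fit, q**2 computed this way cannot exceed r**2, which is why a small gap between the two is usually read as a sign of a stable model.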
Calculate Predictions for a Linear Model
For our hydroxyquinolone example a PLS analysis yielded
r**2 = 0.784 and q**2 = 0.704. The corresponding coefficients are
given in file hdrqn.cfd (one coefficient per line,
constant at the end). This file can be found in the data directory
moloc/dat.
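The coefficient file layout described above (one coefficient per line, constant at the end) maps directly onto a linear prediction: the sum of coefficient times descriptor, plus the constant. The Python sketch below parses such text and applies it to one descriptor vector. The numbers are invented, and the parsing assumes plain numeric lines with no header, which should be checked against an actual hdrqn.cfd.

```python
def parse_coefficients(text):
    """Split coefficient-file text into (coefficients, constant).

    Assumes one number per non-blank line with the constant last.
    """
    values = [float(line) for line in text.splitlines() if line.strip()]
    if not values:
        raise ValueError("no coefficients found")
    return values[:-1], values[-1]


def predict(descriptors, coefficients, constant):
    """Linear model value for one structure."""
    if len(descriptors) != len(coefficients):
        raise ValueError("descriptor and coefficient counts differ")
    return constant + sum(c * d for c, d in zip(coefficients, descriptors))


if __name__ == "__main__":
    # Invented example: two coefficients followed by the constant.
    example_text = "0.5\n-1.0\n2.0\n"
    coefficients, constant = parse_coefficients(example_text)
    print(predict([2.0, 1.0], coefficients, constant))
```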
In order to calculate predictions for a set of structures, these
must be provided in mol-format. In addition to this file the
associated Moloc program Mtprmp requires:
- the pharmacophore description of the training set,
hdrqn.tpr
- the descriptor matrix file hdrqn_dsc.mtx
- the coefficient file hdrqn.cfd.
The calculations can be performed in two different formats
(for simplicity we just calculate the values for the molecules
of the original training set):
- Simply generate a file with the results
Issue the command
Mtprmp -u hdrqn.tpr -m hdrqn_dsc.mtx -g hdrqn.cfd -o hdrqn.plt hdrqn.sd
or, relying on default extensions and output file-name, the
abbreviated command
Mtprmp -u hdrqn -m hdrqn_dsc -g hdrqn hdrqn.
The output file hdrqn.plt has two columns:
the names of the structures and the model prediction
values.
- Modify structure file
The second mode consists of producing a new structure file which,
in addition to the original data, is augmented by a data field
for each compound which contains the calculated model value.
The corresponding command reads
Mtprmp -u hdrqn -m hdrqn_dsc -g hdrqn -r DATA1 < hdrqn.sd > hdrqn1.sd
- With the parameter '-r DATA1' the user specifies that this
(redirection) mode should be activated. The data field will
carry the label 'DATA1'.
- Using the standard input to feed the structure file to the
program, and the standard output to produce the result offers the
possibility to relay the structure file through different models,
using the piping mechanism. Thus, at the end, one obtains the
original structure file, augmented by several data fields,
one for each model.
This mode is useful for the generation of web tools in which
users may specify the models they are interested in. The
web program then simply has to ensure that the submitted structure
file is piped through the selected model programs, and that the
appropriate output is generated.
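The piping idea can be sketched as a helper that chains one Mtprmp call per selected model. The Python sketch below only builds the command string and uses only the flags shown above (-u, -m, -g, -r). It assumes a POSIX-style shell and, as the text implies, that each call in redirection mode reads a structure file on standard input and writes the augmented file to standard output. The second model name and the convention that every model's descriptor matrix is called <model>_dsc are hypothetical.

```python
import shlex


def mtprmp_pipeline(models, infile, outfile):
    """Build a shell pipeline relaying infile through one Mtprmp call per model.

    models is a list of (model_base_name, data_label) pairs.
    """
    if not models:
        raise ValueError("at least one model is required")
    stages = []
    for base, label in models:
        args = ["Mtprmp", "-u", base, "-m", base + "_dsc", "-g", base, "-r", label]
        stages.append(" ".join(shlex.quote(arg) for arg in args))
    # Input redirection belongs to the first stage, output to the last one.
    stages[0] += " < " + shlex.quote(infile)
    stages[-1] += " > " + shlex.quote(outfile)
    return " | ".join(stages)


if __name__ == "__main__":
    # "othermodel" is a hypothetical second model, for illustration only.
    selected = [("hdrqn", "DATA1"), ("othermodel", "DATA2")]
    print(mtprmp_pipeline(selected, "hdrqn.sd", "hdrqn_all.sd"))
```

With a single model the helper reproduces the command given above, so the web program only needs to vary the list of selected models.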
Running the new version of Mtprmp with old model files
- Write a model file, hdrqn.mdl, with the following 4 lines:
- Line 1: MODEL_TYPE topological_pharmacophore
- Line 2: hdrqn.tpr
- Line 3: hdrqn.sml
- Line 4: hdrqn
- Rename hdrqn_dsc.mtx to hdrqn.mtx (.mtx and .cfd
files must carry the same name).
- Generate a file, hdrqn.sml, containing the single line:
1 components: topological, 9 1.0 1.0 1.0 2.0 2.0 0.0 0.0 1.0
This line transfers the parameters for the similarity
calculation.
- In the call to Mtprmp replace the parameters:
-u hdrqn -m hdrqn_dsc -g hdrqn
by
-m hdrqn
e.g. Mtprmp -m hdrqn -r DATA1 < hdrqn.sd > hdrqn1.sd
(The new version of Mtprmp just looks for the .mdl file.)
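The file preparation listed above can be scripted. The Python sketch below writes the four-line .mdl file and the one-line .sml file exactly as given, and renames the descriptor matrix if it is present. The function name and the target-directory argument are illustrative. Note that the sketch overwrites any existing .sml file of the same name in the target directory, so a similarity matrix file called hdrqn.sml should be kept elsewhere.

```python
import os

# Parameter line for the similarity calculation, copied from the tutorial.
SML_LINE = "1 components: topological, 9 1.0 1.0 1.0 2.0 2.0 0.0 0.0 1.0"


def write_model_files(base, directory="."):
    """Write <base>.mdl and <base>.sml; rename <base>_dsc.mtx to <base>.mtx if found."""
    mdl_path = os.path.join(directory, base + ".mdl")
    sml_path = os.path.join(directory, base + ".sml")
    mdl_lines = [
        "MODEL_TYPE topological_pharmacophore",
        base + ".tpr",
        base + ".sml",
        base,
    ]
    with open(mdl_path, "w") as handle:
        handle.write("\n".join(mdl_lines) + "\n")
    with open(sml_path, "w") as handle:
        handle.write(SML_LINE + "\n")

    # The .mtx and .cfd files must carry the same base name.
    old_mtx = os.path.join(directory, base + "_dsc.mtx")
    new_mtx = os.path.join(directory, base + ".mtx")
    if os.path.exists(old_mtx) and not os.path.exists(new_mtx):
        os.rename(old_mtx, new_mtx)
    return mdl_path, sml_path


if __name__ == "__main__":
    print(write_model_files("hdrqn"))
```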