Topological Pharmacophore Description of Chemical Structures using MAB-Force-Field Derived Data and Corresponding Similarity Measures

by Paul R. Gerber

Pharmaceutical Research and Development, F. Hoffmann-La Roche, Basel, Switzerland

 

Abstract

Hydrogen bonding strengths and reference bond lengths, as derived from data generated by the force field MAB, are used to produce pharmacophore descriptions of chemical structures in terms of pharmacophoric units (agons) of two types, hydrogen binders and hydrophobics. Within each type a more detailed characterization is achieved by assigning strength values on a continuous scale. Furthermore, separate characteristic distances for these two agon types determine their size in terms of topological distances, or equivalently the average number of atoms participating in an agon. The actual agons derive from clustering procedures applied to inter-atomic topological distance matrices. This procedure determines the total number of agons in a pharmacophore. In addition to the agon strengths, a pharmacophore is characterized by the topological distance matrix between its agons. Furthermore, a similarity measure is presented, which assigns to each pair of pharmacophores a similarity value between zero and one. This value is obtained by identifying like agons and comparing corresponding strength values as well as the resulting partial distance matrices. The measure can be used to assess the pharmacophoric diversity content of a library of structures. Alternatively, the structures of a library can be rated with respect to their similarity to a given (lead-) pharmacophore. Examples for both types of applications are given.

 

Introduction

In drug discovery the characterization of chemical structures in terms of their potential pharmacophoric action is a central issue in the computer-aided search for new leads. The number of proposed schemes is large [1-4], and the difficulty of finding relevant measures to judge the merits of any method makes it hard to discriminate among them. A second problem is the huge size of many databases that one would wish to mine for possible drug candidates. This size problem imposes significant restrictions on the complexity of possible search algorithms [5]. Correspondingly, there is a range of models that strike the trade-off between computational speed and accuracy (or relevance to a particular purpose) in different ways.

The characterization of pharmacophoric properties of chemical structures consists of two components: a description in terms of a relevant scheme of any single structure, and a prescription of how to compare any pair of structures. The description attributes to each structure a set of values, such as counts of substructure elements, physical parameters, geometrical quantities etc. The similarity measure combines the two value sets of a pair of structures to yield a value (normally between zero and one) which describes the similarity of the pair. A value of zero indicates that the two structures have nothing in common, while the value one tells us that the two structures are identical with respect to the chosen description.

An example of a bit-vector type description is the concept of fingerprints as used in the DAYLIGHT software [6]. Fingerprints essentially monitor the occurrence of linear structural elements of various types. For such a bit-vector description a natural similarity measure is the Tanimoto index. This type of characterization is specially designed to handle large datasets at high speed and is neither restricted to nor optimally suited for pharmacophoric applications.

A second example is the pharmacophoric similarity concept [7] as implemented in the software Moloc [8]. Here full atomic coordinates and simple pharmacophore properties of the atoms are used as descriptors. The similarity measure is a rather involved function of these quantities. It compares the position and possibly directionality of polar atoms, and also accounts for volume overlap. In addition, this similarity function is maximized by varying the rigid body superposition of the pair of structures. Clearly, this type of comparison is considerably more demanding computationally than logical operations on bit-vectors, and, correspondingly, such an analysis is restricted to small sets of structures (up to a few thousand in the most favorable cases), but is highly relevant for pharmacophoric purposes.

In this note, a new type of characterization is proposed, which attempts to retain as much as possible of the pharmacophoric accuracy but avoids three-dimensional description with its rather expensive computational treatment [9].

Topological Pharmacophore Description (tpr)

The new type of description uses a set of centers with pharmacophoric properties, which we will call agons, and a topological distance matrix between the agons. Agons are classified into hydrogen binders and hydrophobics.

In the most detailed case each atom is an agon by itself. The basis for assigning pharmacophoric properties is the analysis of the structure within the force field MAB [10,11] of the modeling program Moloc [8]. An atom is a hydrogen binder when its hydrogen-bond donor or acceptor strength lies above the corresponding threshold, q d or q a; otherwise it is classified as hydrophobic. In addition, we introduce a parameter n h with the effect that a hydrophobic atom with at least n h hydrogen binders as neighbors is discarded as hydrophobic (n h = 0 switches this exclusion off).
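The classification rule can be sketched as follows. This is only an illustration, not the Moloc implementation: the `Atom` record, the function name and the toy strength values are assumptions, and the donor and acceptor strengths are taken as already computed (for example by MAB). The default thresholds are the ones quoted in the parameter list of this note.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    donor: float       # H-bond donor strength (assumed precomputed, e.g. by MAB)
    acceptor: float    # H-bond acceptor strength
    neighbors: list = field(default_factory=list)  # indices of bonded atoms


def classify_atoms(atoms, q_d=1.3, q_a=1.3, n_h=2):
    """Label each atom 'H' (hydrogen binder), 'P' (hydrophobic) or None (discarded)."""
    is_binder = [a.donor > q_d or a.acceptor > q_a for a in atoms]
    labels = []
    for i, atom in enumerate(atoms):
        if is_binder[i]:
            labels.append("H")
            continue
        binder_neighbors = sum(1 for j in atom.neighbors if is_binder[j])
        # n_h = 0 switches the exclusion off, as stated in the text.
        if n_h > 0 and binder_neighbors >= n_h:
            labels.append(None)
        else:
            labels.append("P")
    return labels


# Hypothetical four-atom chain C-N-C-O with made-up strength values.
chain = [
    Atom(0.0, 0.0, [1]),
    Atom(2.0, 0.0, [0, 2]),
    Atom(0.0, 0.0, [1, 3]),
    Atom(0.0, 2.0, [2]),
]
labels_default = classify_atoms(chain)            # middle carbon is discarded
labels_no_exclusion = classify_atoms(chain, n_h=0)  # middle carbon stays hydrophobic
```

In the toy chain the carbon between the two hydrogen binders has two binder neighbors and is therefore dropped with the default n h = 2.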

Topological distances are also derived from MAB by taking the reference bond distance between bonded atoms, and the sum of these for the shortest path between two atoms that are not directly connected. This single-atom description is in general much too detailed for a speedy comparison.
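A minimal sketch of such a topological distance matrix is given below. It assumes that "shortest path" means the path with the smallest sum of reference bond lengths, and it uses the Floyd-Warshall algorithm as a generic all-pairs method; the bond lengths in the example are invented, not MAB values.

```python
import math


def topological_distances(n_atoms, bonds):
    """All-pairs topological distances.

    bonds: iterable of (i, j, reference_length) for bonded atom pairs.
    Unconnected pairs keep the distance math.inf.
    """
    d = [[0.0 if i == j else math.inf for j in range(n_atoms)]
         for i in range(n_atoms)]
    for i, j, length in bonds:
        d[i][j] = d[j][i] = min(d[i][j], length)
    # Floyd-Warshall: allow every atom k in turn as an intermediate stop.
    for k in range(n_atoms):
        for i in range(n_atoms):
            dik = d[i][k]
            for j in range(n_atoms):
                if dik + d[k][j] < d[i][j]:
                    d[i][j] = dik + d[k][j]
    return d


# Hypothetical four-membered ring with made-up reference bond lengths.
ring_bonds = [(0, 1, 1.5), (1, 2, 1.4), (2, 3, 1.2), (0, 3, 1.5)]
ring_dist = topological_distances(4, ring_bonds)
```

For atoms 0 and 2 the two ring paths have lengths 2.9 and 2.7, so the shorter one defines the distance.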

In order to reduce the complexity of description the hydrophobic atoms are first grouped together by complete-linkage clustering with respect to the topological distance matrix as obtained by removing the Hydrogen binders and possibly, for nonzero n h, neighbors to them from the structure. By choosing an appropriate clustering level, as specified by a critical distance r p, the number of hydrophobic agons can be drastically reduced. An agon is then characterized by its extension as determined by the average mutual distance between all pairs in a cluster. Figure 1 illustrates how the number of clusters decreases with increasing value of r p for the example structure of valium. Increasing the r p–values through 3, 5, 7, to 12 leads to 9, 5, 3, and 2 clusters respectively.

 

Figure 1: Dependence of hydrophobic cluster size on the range parameter r p. Going from left to right through the pictures of the molecule valium, the parameter assumes the values 3, 5, 7, and 12, leading to 9, 5, 3, and 2 clusters, respectively. Clusters are indicated in red (dark). In the leftmost structure the two atoms carrying a label are clusters on their own.

Furthermore, a minimal cluster size s p may be specified, which excludes hydrophobic clusters with at most s p atoms.
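The clustering step can be illustrated with a small complete-linkage routine. This is a sketch under assumptions: clusters are merged as long as the largest distance between their members does not exceed the critical distance (whether the comparison is strict is not stated in the text), the distance matrix is taken as given, and the function names are hypothetical.

```python
def complete_linkage(dist, members, r_crit):
    """Agglomerative complete-linkage clustering cut at the critical distance."""
    clusters = [[m] for m in members]

    def linkage(a, b):
        # Complete linkage: the largest member-to-member distance.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                value = linkage(clusters[x], clusters[y])
                if best is None or value < best[0]:
                    best = (value, x, y)
        if best[0] > r_crit:
            break  # no pair of clusters is close enough any more
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def drop_small(clusters, s_p):
    """Exclude clusters with at most s_p atoms."""
    return [c for c in clusters if len(c) > s_p]


# Toy example: five atoms on a line at made-up positions.
positions = [0.0, 1.0, 2.0, 10.0, 11.0]
dist = [[abs(a - b) for b in positions] for a in positions]
coarse = complete_linkage(dist, range(5), r_crit=3.0)
fine = complete_linkage(dist, range(5), r_crit=1.5)
fine_kept = drop_small(fine, s_p=1)
```

Raising the critical distance from 1.5 to 3.0 reduces the toy example from three clusters to two, which mirrors the trend shown for valium in Figure 1.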

A further reduction in the number of agons can be achieved by clustering the Hydrogen-binding atoms, this time using the full distance matrix and a distinct clustering level. This level is also specified by a critical distance, r h. Donor- and acceptor strengths of such a cluster are obtained by taking the square root of the sum of squares of the corresponding atomic properties within the cluster. The extension of such clusters is calculated as in the case of hydrophobic agons.

The distance matrix for the clustered agons now consists of distance values between pairs of clusters. The distance between two clusters is obtained simply by averaging the distance values over all pairs of atoms that can be formed by taking one atom from each cluster.
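The three cluster quantities described above (strength, extension, and inter-cluster distance) can be written down directly. The function names are hypothetical, and the value 0 for the extension of a single-atom cluster is an assumption, since the text does not cover that case.

```python
import math
from itertools import combinations


def cluster_strength(atomic_values):
    """Square root of the sum of squares of the atomic donor (or acceptor) strengths."""
    return math.sqrt(sum(v * v for v in atomic_values))


def extension(dist, cluster):
    """Average mutual distance between all pairs of atoms in a cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0  # assumption: a single atom has zero extension
    return sum(dist[i][j] for i, j in pairs) / len(pairs)


def cluster_distance(dist, a, b):
    """Average distance over all pairs with one atom from each cluster."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))


# Toy distances: five atoms on a line at made-up positions.
positions = [0.0, 1.0, 2.0, 10.0, 11.0]
dist = [[abs(x - y) for y in positions] for x in positions]

strength = cluster_strength([3.0, 4.0])
ext = extension(dist, [0, 1, 2])
between = cluster_distance(dist, [0, 1, 2], [3, 4])
```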

By going to increasingly coarse clustering the level of description decreases, i.e. the amount of data per structure decreases. This reduction can be quite substantial, although the computational effort to produce the data remains essentially unchanged.

Parameters for Pharmacophore Specification

Six parameters determine the level of topological description, namely

q d = H-bond donor threshold (1.3), h

q a = H-bond acceptor threshold (1.3), a

r h = H-binder critical cluster distance (3.5), b

n h = minimal number of H-binder neighbors for exclusion as hydrophob (2), q

r p = hydrophob critical cluster distance (7), d

s p = upper size-limit for exclusion of hydrophobic clusters (1), c

Numbers in brackets give the default values in the in-house program Mtprgn, which calculates topological pharmacophores for a list of structures given as SMILES codes. The letters at the end are the qualifiers needed to modify the values in this program. Thus we may characterize a set of tpr’s by the six numbers of the above list. A default-set is characterized by (1.3, 1.3, 3.5, 2, 7, 1).

Similarity Measures

The calculation of the similarity between two topological pharmacophores consists of two essential steps. The first is to pair up the agons of the two pharmacophores; the second is to calculate a similarity value for the given pairing. Each pairing consists of the same number of agon pairs and is subject to the condition that only agons of equal type (hydrogen binders or hydrophobs) can form a pair. Thus, the number of H-binder (hydrophob) pairs is equal to the smaller of the two numbers of H-binder (hydrophob) agons of the two pharmacophores. Because the number of different pairings increases factorially with the number of agons, a limit on this number is usually unavoidable. Combining, for example, a (4,3)-pharmacophore (4 hydrogen binders, 3 hydrophobs) with a (2,4)-pharmacophore yields 12*24 = 288 different pairings: the two H-binders of the second pharmacophore can be assigned to the four of the first in 4*3 = 12 ways, and the three hydrophobs of the first to the four of the second in 4*3*2 = 24 ways. The reduction by clustering described above therefore often becomes a necessity if one wants to avoid having to skip many structures because of excessive computational effort. Alternatively, one could think of using approximate graph-theoretical methods to speed up the similarity calculation [12].
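The pairing count quoted above can be checked with a few lines of code. The sketch assumes that a pairing assigns every agon of the smaller set to a distinct agon of the larger set of the same type; the function names are hypothetical.

```python
from itertools import permutations
from math import perm


def n_pairings(h1, p1, h2, p2):
    """Number of pairings between an (h1, p1)- and an (h2, p2)-pharmacophore."""
    h_ways = perm(max(h1, h2), min(h1, h2))
    p_ways = perm(max(p1, p2), min(p1, p2))
    return h_ways * p_ways


def pairings_of_type(n1, n2):
    """Enumerate pairings of n1 agons with n2 agons of the same type.

    Each pairing is a list of (index in pharmacophore 1, index in pharmacophore 2).
    """
    if n1 <= n2:
        for image in permutations(range(n2), n1):
            yield list(zip(range(n1), image))
    else:
        for image in permutations(range(n1), n2):
            yield list(zip(image, range(n2)))


example_count = n_pairings(4, 3, 2, 4)                     # the (4,3) vs (2,4) case
h_binder_pairings = sum(1 for _ in pairings_of_type(4, 2))  # H-binder part only
eight_agon_count = n_pairings(4, 4, 4, 4)                  # two (4,4)-pharmacophores
```

Two (4,4)-pharmacophores already give 576 pairings, which illustrates why a limit on the number of agons is imposed.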

For a given pairing the similarity value is calculated as follows. For each pair of agons a weight is calculated. For a hydrophobic pair this weight has the form

where e is the extension of an agon. For H-binders the weight reads

where D and A are donor and acceptor strengths respectively of an agon, and wd and wa fix the relative weight of donor and acceptor properties with respect to hydrophobic ones. Finally, the weights P are combined with distance data to yield the similarity value

The second (double) sum runs over all combinations of distinct agon pairs, while the weight function, g, takes into account that the distance values dkl have in general non-equal values in pharmacophores 1 and 2. For this function we took the form

, if , else .

This represents a finite-range bell-shaped function of the difference of the two arguments. The maximum value of one is assumed for equal argument values, and the width is given by the square root of the parameter Wg. This quantity is the sum of squares of agon-type dependent width parameters d. For H-binders d has a constant value d h, while for hydrophobs we take

where Pp is the hydrophobic agon weight defined above, and D p is a bare width for hydrophobs. This combination makes sense because Pp can be seen as a quadratic extension of the agon.

Finally, when a positive width-increase parameter, d d, is specified, the definition of Wg by the sum of squares of the d’s is augmented by multiplication with a factor of the form

This factor provides the possibility to set less rigorous bounds for agons at large (exceeding d d) topological distances.

For each pairing of agons a value S is calculated. The pairing with the maximum value of S yields the best superposition of agons. The corresponding S-value yields, after normalization, the final similarity value for the two pharmacophores.

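Normalization is achieved by dividing the maximum S-value by the geometric mean of the self-similarity values of the two pharmacophores k and l. Writing S_{kl}^{max} for the maximum S-value of the pair and S_{kk}, S_{ll} for the two self-similarity values, this reads

$$ s_{kl} = \frac{S_{kl}^{\max}}{\sqrt{S_{kk}\,S_{ll}}} $$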

Clearly, the self-similarity value is directly obtained for the identical agon pairing. All other pairings yield smaller S-values.

Alternative normalization rules may be adequate depending on the nature of the problem. Instead of taking the total self-similarity value of either pharmacophore, one can consider only the self-similarity value of the minimal subpharmacophore made up of the agons occurring in the actual optimal pairing. This leads to a total of four similarity values, any of which may be useful in a particular problem. We call them:

sf = full similarity = total self-similarity taken for both pharmacophores,

sb = sub-similarity = minimal self-similarity taken for first pharmacophore,

ss = super-similarity = minimal self-similarity taken for second pharmacophore,

sp = partial similarity = minimal self-similarity taken for both pharmacophores.

sb and ss make sense only if pharmacophores k and l occur in a non-symmetric context, e.g. when l is a target pharmacophore and k runs through a database. For the diversity assessment of a database these non-symmetric similarity measures are of little use.
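The four normalizations differ only in which self-similarity values enter the geometric mean. The sketch below assumes that the maximum S-value of the pair, the total self-similarity values, and the self-similarity values of the matched subpharmacophores are already available; the function name, the argument names and the numbers are hypothetical.

```python
import math


def normalized_similarities(s_max, self_k, self_l, sub_k, sub_l):
    """Return the four normalized similarity values for a pair (k, l).

    s_max   maximum S-value over all pairings of k and l
    self_*  total self-similarity of each pharmacophore
    sub_*   self-similarity of the minimal subpharmacophore built from the
            agons that occur in the optimal pairing
    """
    return {
        "sf": s_max / math.sqrt(self_k * self_l),  # full similarity
        "sb": s_max / math.sqrt(sub_k * self_l),   # minimal for the first
        "ss": s_max / math.sqrt(self_k * sub_l),   # minimal for the second
        "sp": s_max / math.sqrt(sub_k * sub_l),    # minimal for both
    }


# Made-up numbers, chosen only so that the square roots come out even.
values = normalized_similarities(s_max=6.0, self_k=16.0, self_l=9.0,
                                 sub_k=9.0, sub_l=4.0)
```

With these made-up inputs the values are sf = 0.5, sb = 2/3, ss = 0.75 and sp = 1.0.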

Parameters for Similarity Calculations

The parameters specifying a similarity calculation are:

wd = relative weight of donor property with respect to hydrophobic (1), d

wa = relative weight of acceptor property with respect to hydrophobic (1), a

d h = width for H-binders (2), h

D p = basic width for hydrophobs (2), p

d d = critical distance for width increase (0), w

Numbers in brackets are default values of the in-house program Mtprsml, while the letters are the qualifiers used to modify these values.

Pharmacophore Representation of a Database

For the structures in a database it may be useful to calculate and store topological pharmacophore representations for several reasons:

The degree of diversity among the structures, as determined with the given similarity function, may be of interest in order to select minimal diverse subsets, i.e. small subsets of structures representing a high degree of the inherent diversity.

The similarity measure may also help to obtain an indication, whether two databases differ in pharmacophoric content, and if so, which selection of one database would optimally complement the other.

Furthermore, the structures of the database may be ranked with respect to their similarity to a given structure, or more generally with respect to the average similarity to several pharmacophores. The given structure may be a known active pharmacophore or possibly a substrate. The ranking indicates which structures of the database are alternative candidates for a desired pharmacophoric effect. Such a ranking provides a prioritization of structures for screening databases of available compounds.

As a technical remark we may add that, while it is in general possible to represent most of the structures in a database as topological pharmacophores, some of these may turn out to have a large number of agons. In subsequent similarity calculations a threshold on this number must often be imposed for efficiency reasons, so that for practical purposes there is always some loss of structures.

As an example we have calculated pharmacophore representations of the available compounds from the Roche in-house database. We obtained some 180’000 structures. The calculation of topological pharmacophore data took some 12 hours of CPU time on MIPS-R10000 processors, which corresponds to a speed of about four structures per second. Using several processors in parallel, a database of one million entries can be converted overnight.
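The throughput figures can be reproduced with simple arithmetic. The number of parallel processors in the example (eight) is an assumption chosen for illustration; the text only says "several".

```python
structures = 180_000      # structures converted
cpu_hours = 12            # CPU time quoted in the text

# Structures per second on one processor.
rate = structures / (cpu_hours * 3600)


def wall_hours(n_structures, n_processors, rate_per_cpu=rate):
    """Wall-clock hours, assuming perfect scaling over the processors."""
    return n_structures / (rate_per_cpu * n_processors) / 3600


hours_for_million = wall_hours(1_000_000, n_processors=8)
```

At about 4.2 structures per second, one million entries need roughly 67 CPU hours, or a little over eight hours of wall-clock time on eight processors under the perfect-scaling assumption.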

A characteristic of such a representation of structures as pharmacophores is the distribution of agon numbers for the structures. The following table shows this distribution for the (1.3, 1.3, 3.5, 2, 10, 1)-representation of the above mentioned Roche structures.

h\p         0      1      2      3      4      5      6      7      8      9   more   total
0           0    378   1071    687    292    134     41     13      6      1      5    2628
1          38   5228  13730   7922   2001    472    125     29     11      7      3   29566
2         100   9771  26733  17489   5774   1825    441    148     39     24      7   62351
3         134   5560  18398  14755   6116   2339    873    277     83     34     25   48594
4          76   1946   6660   6627   4145   1911    559    203     75     64     36   22302
5          14    447   1803   2437   1854    999    417    104     28     23     13    8139
6           3    136    626   1105   1058    802    339    142     46     24     18    4299
7           3     24    261    615    647    451    287    116     42     13     11    2470
8           0      9     61    202    413    276    147     84     53     14     13    1272
9           0      2     28     83    186    143    143     62     70     32     12     761
more        0      0     18     79    235    394    219    231    187    142    708    2213
total     368  23501  69389  52001  22721   9746   3591   1409    640    378    851  184595

Rows give the number of H-binder agons (h), columns the number of hydrophobic agons (p).

This table shows e.g. that (2, 2) pharmacophores occur most frequently, namely 26733 times.
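For readers who want to work with the distribution, the counts can be held as a simple matrix. The short sketch below only re-derives the marginal totals and the most frequent combination from the tabulated counts; the variable names are hypothetical, and the last index in each direction stands for "more".

```python
# counts[h][p]: structures with h H-binder agons and p hydrophobic agons;
# index 10 stands for "more" than nine agons of that type.
counts = [
    [0, 378, 1071, 687, 292, 134, 41, 13, 6, 1, 5],
    [38, 5228, 13730, 7922, 2001, 472, 125, 29, 11, 7, 3],
    [100, 9771, 26733, 17489, 5774, 1825, 441, 148, 39, 24, 7],
    [134, 5560, 18398, 14755, 6116, 2339, 873, 277, 83, 34, 25],
    [76, 1946, 6660, 6627, 4145, 1911, 559, 203, 75, 64, 36],
    [14, 447, 1803, 2437, 1854, 999, 417, 104, 28, 23, 13],
    [3, 136, 626, 1105, 1058, 802, 339, 142, 46, 24, 18],
    [3, 24, 261, 615, 647, 451, 287, 116, 42, 13, 11],
    [0, 9, 61, 202, 413, 276, 147, 84, 53, 14, 13],
    [0, 2, 28, 83, 186, 143, 143, 62, 70, 32, 12],
    [0, 0, 18, 79, 235, 394, 219, 231, 187, 142, 708],
]

row_totals = [sum(row) for row in counts]        # per number of H-binders
col_totals = [sum(col) for col in zip(*counts)]  # per number of hydrophobs
grand_total = sum(row_totals)

# Most frequent (h, p) combination and its count.
most_frequent = max(
    (counts[h][p], h, p) for h in range(len(counts)) for p in range(len(counts[h]))
)
```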

The following shorthand representation indicates which fraction of all structures is taken into account when a limit on the number of agons per pharmacophore is imposed for similarity calculations. For the above case this representation reads:

Limit               5     6     7     8     9    10
% of database    58.1  75.2  84.6  90.2  93.6  95.5

The more fine-grained (1.3, 1.3, 2, 2, 7, 1)-representation of the same database, in which the critical clustering distances for H-binders and hydrophobs are reduced to 2 and 7, respectively, yields the following characterization:

Limit               5     6     7     8     9    10
% of database    22.9  42.5  59.7  72.0  80.5  85.9

It is obvious that a restriction to a maximum of eight agons per pharmacophore, as is usually advisable for reasonable speed, covers only 72% of the database in this latter representation, in contrast to the previous case, where over 90% would be treated. The difference in level of detail between the two descriptions is illustrated by amide units, which are taken as a single H-binder agon in the first description, while carbonyl and nitrogen are separate agons in the second one.

Ionizable Groups

The hydrogen-binder properties, D and A, of an agon containing basic or acidic groups, depend very strongly on the state of protonation of that group. This state, in turn, is determined by the pKa of the group and by the actual pH. Since pKa values are not generally known for the molecules of a whole database, we utilized the program pkalc [13] to estimate the pKa values of the various groups. This program gives calculated values as well as an indication whether a group is basic or acidic, and which atom takes or releases a proton. Before calculating pharmacophore data, basic groups were protonated whenever their calculated pKa value was higher than the assumed pH (taken to be 7), and acidic groups with pKa smaller than pH were deprotonated.

The computational resources needed for this protonation calculation are an order of magnitude larger than for calculating the pharmacophore data, and the CPU-times quoted above do not include this part. Thus, it saves much CPU-time if protonation of the various groups is already known.
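The protonation rule described in this section amounts to a simple comparison of the estimated pKa with the assumed pH. The sketch below is an illustration only: the function name and the example pKa values are hypothetical, and the pKa estimation itself is not reproduced.

```python
def protonation_state(kind, pka, ph=7.0):
    """Apply the pH rule of the text to one ionizable group.

    kind: 'base' or 'acid', as indicated by the pKa estimation.
    """
    if kind == "base":
        # Basic groups are protonated when their pKa exceeds the pH.
        return "protonated" if pka > ph else "neutral"
    if kind == "acid":
        # Acidic groups are deprotonated when their pKa is below the pH.
        return "deprotonated" if pka < ph else "neutral"
    raise ValueError(f"unknown group kind: {kind!r}")


# Made-up pKa values for illustration.
states = [
    protonation_state("base", 10.6),
    protonation_state("base", 5.2),
    protonation_state("acid", 4.8),
    protonation_state("acid", 9.9),
]
```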

Lead Pharmacophores and Ranking of a Database

Pharmacophore representations of structures have lead-character in projects where the underlying structures are leads or active compounds. An advantage of a lead pharmacophore is that it is not tied to chemical entities but rather to pharmacophoric properties. Thus, a lead pharmacophore may be able to uncover structural leads with novel chemical groups. Furthermore, a pharmacophoric lead may also be generated from a substrate or even from a transition state analog of a substrate.

As an illustration for ranking we have taken the (1.3, 1.3, 3.5, 2, 10, 1)-representation of the above-mentioned database of available Roche structures. The lead pharmacophore was generated from the peptide sequence dPHE-PRO-ARG, fpr for short, which is well known as a thrombin inhibitor and may also be considered a substrate model. The terminal amine and carboxylate groups of this peptide were replaced by methyl groups in order to avoid spurious agons in the lead pharmacophore. From this structure the (1.3, 1.3, 3.5, 2, 10, 1)-pharmacophore was generated, which turned out to carry three H-binders and three hydrophobs. Ranking was made with the default similarity measure.

In order to judge this ranking, we made use of an in-house database of thrombin inhibitors. For simplicity, every structure that was a member of this thrombin database was considered a hit. Figures 2 and 3 show the percentages of hits among all thrombin inhibitors versus the ranked enumeration of the full database.

 

Figure 2, 3: The percentages of hits among all thrombin inhibitors are shown versus the ranked enumeration of the full database. The four curves correspond to the different types of normalization. From top to bottom the curves result from sb-, sf-, sp-, and ss-normalization, respectively. The arrangement of database structures is according to decreasing similarity values, separately so for each normalization, and thus, the same abscissa value corresponds in general to different structures for the four curves. Figure 3 is a close-up of Figure 2 for the region near the origin.

It is evident that the hit rate for the best-ranked structures is substantially enhanced over the rate expected from random sampling. Using sub-pharmacophore similarity, sb, the 500 best-ranked structures contain 123 hits. By comparison, the full set comprises 166359 structures of suitable tpr representation, of which 1115 belong to the thrombin database. This corresponds to a roughly 37-fold enhancement of the hit rate over random selection.
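The enhancement factor follows directly from the four numbers quoted above, as the ratio of the hit rate in the selection to the hit rate in the whole set. The function name is hypothetical.

```python
def enrichment(hits_in_selection, selection_size, total_hits, total_size):
    """Hit rate in the selection divided by the hit rate in the whole set."""
    return (hits_in_selection / selection_size) / (total_hits / total_size)


# Numbers quoted in the text for sb-normalization.
factor = enrichment(hits_in_selection=123, selection_size=500,
                    total_hits=1115, total_size=166359)
```

The hit rate among the 500 best-ranked structures is 24.6%, against about 0.67% for the whole set, which gives a factor of about 36.7.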

It must be kept in mind that some of the thrombin inhibitors have a binding mode that makes use of additional features of the protein not occupied by dPHE-PRO-ARG, so that not all of the 1115 structures would be properly represented by our lead pharmacophore. This may also account for the rather steady increase in hit number for larger ranking values, as seen in the left-hand graph.

Furthermore, one can envisage ranking virtual structural databases for which pharmacophore representations can be generated with justifiable computational effort. Such virtual databases originate e.g. from an enumeration of products in combinatorial chemistry. However, if large sets of substituents are considered for each possible substitution site, the number of product structures may be too large for a pharmacophoric evaluation. In such cases the educt sets may be subjected to a prior diversity analysis in order to select a subset of maximal pharmacophoric diversity and manageable size. Such a diversity analysis can, in addition to the methods mentioned in the Introduction, also be performed within the framework of topological pharmacophores, as we will show later on.

As a further possible application one can look for binding motifs in a database. For the case of ATP-binding proteins, for example, inhibitor motifs to substitute the adenine may be extracted from a database by prioritizing it with the topological pharmacophore of N-methyl-adenine itself. For this case it may be advisable to use a rather fine-grained description level in order to separate all H-binder groups. With a (1.3, 1.3, 2, 2, 7, 1)-representation we obtained a five-agon (4, 1) pharmacophore. Clearly, the database should be described at a similar level of detail. Although this leads to many pharmacophores with high agon numbers, this is not really a problem for the prioritization, because pharmacophores with many agons will yield a small similarity value with the adenine pharmacophore simply because of the non-matching size. Quite generally, it is a feature of the full-similarity measure used here that a pair of structures of different size obtains a low similarity value.

The rating has been made for the structures of the MEDCHEM97 database as provided by the DAYLIGHT software company. Of the 33000 structures in this database, 27000 remained after pharmacophores with more than eight agons were omitted.

Figure 4: Similarity values of the structures of the Medchem94 database to N-methyl-adenine. All pharmacophores were generated in a (1.3, 1.3, 2, 2, 7, 1)-representation. The structures of the database are arranged according to decreasing similarity value to N-methyl-adenine. The four curves from bottom to top correspond to sf-, sb-, ss-, and sp-normalization, respectively. The selection of 99 structures corresponds to the sf-normalization of the bottom curve.

Figure 4 shows the similarity values of the structures in the sequence obtained from the similarity calculation against the N-methyl-adenine pharmacophore. All four normalizations are shown. However, we restrict the discussion to the full similarity case, sf. It is obvious that only a narrow selection of structures reaches similarity values of 0.8 or higher. This value is also a limit where a very sharp decrease over some hundred structures crosses over to a range with a somewhat more moderate slope, comprising some 2000 structures. After that the similarity value decreases more slowly over almost the whole database. A final range of again steeper decrease is found for the most hydrophobic structures. Of course, there is no consideration of chemical or pharmacokinetic aspects included in such a selection. The 99 most similar structures are displayed in Figure 5.

Figure 5: The 99 structures out of the MEDCHEM94 database which are most similar to N-methyl-adenine. They have been aligned by a standard rigid-body matching procedure, which superimposes the centroid coordinates of corresponding agons as well as possible. The structures have then been spread apart in their flattest plane such that similarity values decrease within each row from left to right and then among rows from top to bottom. The target N-methyl-adenine itself is also a member of this database (similarity value one) and is displayed in the top left-hand corner.

Since adenine itself is generally too small to provide good binding to the proteins, the purpose of such an investigation is mainly to obtain a set of pre-rated proposals for replacing adenine by alternative groups.

Diversity Analysis

The proposed similarity measure can also be used to assess the diversity with respect to pharmacophoric properties of a database of structures. However, the speed with which similarity values can be calculated sets a technical limit of a few thousand entries per database. Because the computational effort grows with the square of the number of entries, this limit is not likely to improve quickly. One can either calculate a full similarity matrix, which can be subjected to hierarchical clustering methods, or just a limited set of nearest neighbours per structure, suited for partitional clustering methods.

As an illustration we analyse the set of 99 adenine analogs shown above. Figure 6 shows the clustering tree of the similarity matrix and illustrates that the set consists of rather similar molecules (similarities above 0.5), as one would expect from its mode of generation.

 

Figure 6, left: Hierarchical tree for complete-linkage clustering of the similarity matrix of the set of 99 adenine analogs shown above. While the similarity values against N-methyl-adenine are all above 0.77, similarities among the set reach lower values. Nevertheless, the set still consists of rather similar molecules (similarities above 0.5), as one would expect from its mode of generation.

Figure 7, right: Eigenvalues of the distance matrix, the elements of which are dij, which are related to the similarity matrix elements sij through dij = 1 – sij. The single large negative eigenvalue (trace(dij) = 0) has been omitted. Every eigenvalue of about one or larger represents a separate additional diversity component. Evidently, only some twelve independent components occur in this set. This means there is a redundancy factor of about 8 in the set of 99 structures.

While clustering is a quite useful way of grouping structures together, it is more problematic to use the diversity matrix as a tool to find absolute measures for characterizing the diversity of a database. Given a pharmacophore description and a similarity measure, a diversity assessment can readily be made. Figure 7 shows the eigenvalues of the distance matrix (dij = 1 – sij). This representation allows an estimate of the minimal number of compounds necessary to represent the full diversity content of the set, since every eigenvalue of about one or larger represents a diversity component. In the present case a subset of about a dozen structures would be sufficient to represent the diversity of the whole set, or equivalently, there is a redundancy factor of about 8 in the set of 99 structures. However, it must be kept in mind that the most difficult question remains whether the description and the similarity measure are adequate for the required purpose of judging the set or database.
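The eigenvalue count can be sketched in a few lines. Two points are assumptions rather than statements of the text. First, the sign convention: a matrix with zero diagonal and non-negative entries has a positive dominant eigenvalue, so the sketch diagonalizes the matrix with elements sij - 1 = -dij, for which the dominant eigenvalue is the single large negative one mentioned in the caption of Figure 7. Second, the threshold 0.99 stands in for "about one". The Jacobi rotation method is used only to keep the example free of external libraries.

```python
import math


def symmetric_eigenvalues(matrix, max_sweeps=50, tol=1e-20):
    """Eigenvalues of a real symmetric matrix by cyclic Jacobi rotations."""
    a = [list(row) for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the element (p, q).
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):      # A <- A J (columns p and q)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):      # A <- J^T A (rows p and q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))


def count_diversity_eigenvalues(similarity, threshold=0.99):
    """Count eigenvalues of (s_ij - 1) at or above the threshold.

    The most negative eigenvalue is omitted, following the figure caption.
    """
    n = len(similarity)
    neg_d = [[similarity[i][j] - 1.0 for j in range(n)] for i in range(n)]
    eigenvalues = symmetric_eigenvalues(neg_d)
    return sum(1 for value in eigenvalues[1:] if value >= threshold)


# Toy similarity matrices (made up): four mutually dissimilar structures,
# and two groups of three identical structures each.
dissimilar = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
two_groups = [[1.0 if i // 3 == j // 3 else 0.0 for j in range(6)] for i in range(6)]
identical = [[1.0] * 3 for _ in range(3)]

n_dissimilar = count_diversity_eigenvalues(dissimilar)
n_two_groups = count_diversity_eigenvalues(two_groups)
n_identical = count_diversity_eigenvalues(identical)
```

In these toy cases the number of distinct groups equals the count plus one (3 + 1, 1 + 1 and 0 + 1), which is consistent with the word "additional" in the caption of Figure 7.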

Software Implementation

The topological pharmacophore concept, as described above, has been implemented in the modeling software package Moloc. The auxiliary program Mtprgn generates the topological pharmacophore description for a whole database in batch mode. The program Mtprsml performs various types of similarity calculations for given tpr-representations. Furthermore, some of the mentioned features, such as examination of atom assignments to agons, generation of topological pharmacophores from single structures or from 3-dimensional pharmacophores, or superposition of structures onto a target, can be addressed directly from various menus within Moloc itself.

References

1) Weininger, D. J. Chem. Inf. Comput. Sci. 1988, 28, 31-36.

2) Greene, J.; Kahn, S.; Savoj, H.; Sprague, P.; Teig, S. J. Chem. Inf. Comput. Sci. 1994, 34, 1297-1308.

3) Boehm, H. J. J. Comput.-Aided Mol. Design 1992, 6, 61-78.

4) Bersuker, I. B.; Bahceci, S.; Boggs, J. E.; Pearlman, B. S. J. Comput.-Aided Mol. Design 1999, 13, 419-434.

5) Downs, G. M.; Barnard, J. M. J. Chem. Inf. Comput. Sci. 1997, 37, 59-61.

6) Daylight Chemical Information Systems Inc. www.Daylight.com.

7) Gerber, P. R. Roche Internal Report B174895 1995.

8) Mueller, K.; Ammann, H. J.; Doran, D. M.; Gerber, P. R.; Gubernator, K.; Schrepfer, G. Bull. Soc. Chim. Belg. 1988, 97, 655-667.

9) Willett, P. J. Mol. Recog. 1995, 8, 290-303.

10) Gerber, P. R.; Muller, K. J. Comput.-Aided Mol. Design 1995, 9, 251-268.

11) Gerber, P. R. J. Comput.-Aided Mol. Design 1998, 12, 37-51.

12) Rarey, M.; Dixon, J. S. J. Comput.-Aided Mol. Design 1998, 12, 471-490.

13) CompuDrug www.Compudrug.hu.