Similarity Descriptors, Topological Pharmacophores, and Protein Binding

by Paul R. Gerber

Pharmaceutical Research and Development, F. Hoffmann-La Roche, Basel, Switzerland

 

Key words: descriptors, similarity, regression models, topological pharmacophores, protein binding

 

Abstract

A method to produce descriptors from a given similarity function between pairs of chemical structures is presented. This is useful in cases where the similarity measure cannot easily be derived from a descriptor scheme. The descriptors are able to encompass the full diversity contained in the set of structures with respect to the given similarity measure. The method is illustrated on several cases of protein binding, with a topological pharmacophore description providing the similarity measure.

Introduction

A common way to model molecular properties is to start off with a set of descriptors, which take on molecule-specific values [1]. These descriptors may be measured quantities or data derived from molecular topology or conformation. Common examples of descriptors are occurrences or counts of substructural elements, molecular surface areas of various types, and size and shape parameters. Having gathered such descriptors for a set of molecules for which the quantity in question is also known, one proceeds by establishing a functional relation between descriptor values and this quantity with the help of statistical regression methods. This relation then enables the interpolation or extrapolation of values of the desired quantity for molecules with given descriptor values.

On the other hand, a set of descriptors may be used to define a similarity measure, which assigns to each pair of molecules a value between zero and one, indicating how similar the two molecules are with respect to their descriptor values. A well-known example is the Tanimoto index (see e.g. [2]), which allows for comparison of two sets of bit descriptors.
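As an illustration of such a descriptor-derived similarity measure, the Tanimoto index of two bit descriptor sets can be sketched in Python as follows. The function name, the representation of a fingerprint as the set of its on-bit positions, and the convention of returning 1.0 for two empty fingerprints are choices made here for illustration only.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto index of two bit fingerprints.

    Each fingerprint is given as the set of positions of its on-bits
    (an assumed representation). The result lies between 0 and 1.
    """
    union = len(bits_a | bits_b)
    if union == 0:
        # Assumed convention: two empty fingerprints count as identical.
        return 1.0
    common = len(bits_a & bits_b)
    return common / union


# Two hypothetical fingerprints sharing three of six distinct on-bits.
print(tanimoto({1, 2, 3, 5}, {2, 3, 4, 5, 8}))
```

The value is one only if both fingerprints set exactly the same bits, and zero if they share none.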

Now, one can imagine cases in which a similarity relation is given that cannot be derived straightforwardly from a set of molecular descriptors, but that may be well suited to characterize structures by comparison [3,4]. In such cases it is desirable to build models on the basis of the given similarity relation. In this paper we propose such a method and illustrate its use in several examples of quantities useful for pharmacological evaluation purposes.

Similarity Descriptors

Given a set of N chemical structures (training set) and a similarity measure between them, one can calculate the corresponding symmetric similarity matrix, S. Its elements, Skj, which describe the similarity between molecules j and k, take on values between zero, when two structures have nothing in common, and one, for structures of identical properties. One may now attempt to consider this matrix as a metric tensor in a high-dimensional vector space, in which the structures are represented by unit vectors (as suggested by Sjj = 1). Let us assume for the moment that S fulfills the conditions of a metric (triangle inequality). The condition Sjk > 0 then leads to the geometrical picture that all structure vectors lie within an at most N-dimensional cone with an opening angle of 90°. We may now try to find an orthonormal basis for this vector space which will provide a component representation of our structural unit vectors. A particular choice would be to put the first basis vector along the cone axis, such that all structures have a positive first coordinate. With respect to all further basis vectors negative coordinates are then possible. This implies that only the first basis vector may represent a possible chemical structure; the rest are non-structural basis vectors. Although such a choice may not be directly obtainable, the aim of the present method is to consider the coordinates of structures with respect to any such basis as descriptors for regression models.

In the following sections we will derive explicit procedures and formulae for this program. However, in order to make things more concrete we first introduce a specific similarity measure and a specific example.

Topological Pharmacophores

The pharmacophore description of chemical structures, which we will consider here, has been introduced previously (for details see [4]). A connected chemical structure is partitioned into a set of pharmacophoric units (called agons) which typically contain several atoms each. These agons fall into two classes: H-binders and hydrophobics. An H-binder agon is characterized by H-bond donor and acceptor strengths as derived from the corresponding values attributed to the polar atoms of the agon by the force field MAB [5,6]. Hydrophobic agons are characterized by their size (topological extension of the atoms belonging to an agon). In addition to these values, which characterize its agons, a pharmacophore has attributed to it the distance matrix that contains as elements the topological distances for each pair of agons.

A similarity value between two pharmacophores is evaluated by pairing up agons of the same type in all possible ways and by rating the agon-strength products for each pair with a weighting function that depends on differences in corresponding distance matrix elements. The maximum value of this function among all possible pairings yields, properly normalized, the corresponding pharmacophoric similarity value of the pair of structures.

The use of the full combinatorial possibility of agon pairing sets limitations on the speed of these similarity calculations. This can either be circumvented by a restriction to structures that yield agon numbers below a given threshold, or by increasing the coarseness of agon selection within the structures, which leads to more atoms per agon and, hence, to smaller numbers of agons per pharmacophore.

The inclusion of the topological distance matrix in the pharmacophore description emphasizes geometric aspects, which are of primary importance in influencing biological processes, as illustrated by the lock and key concept [7]. However, since the parametrization of geometric data by a set of descriptors is notoriously difficult, the present topological pharmacophore description provides a good example for the use of similarity descriptors.

Example: Protein Binding

For a set of molecules for which plasma protein-binding percentages are known from the literature [8-10] we perform a similarity analysis within the framework of our topological pharmacophore description. We restrict ourselves to structures that yield a maximum of ten agons per pharmacophore and arrive at a set of 602 structures. Percentages b of plasma protein binding are transformed to approximate binding constants Kb with the assumption that (1) binding occurs to albumin, (2) the bound form is a binary complex, and (3) albumin is much more concentrated than the drug in the test solution. With these assumptions we calculate approximate values for log Kb using

$$\log\frac{K_b}{p} = \log\frac{100 - b}{b} , \qquad \text{(Eq. 1)}$$

where p is the albumin concentration, assumed to be the same in all experiments. One might argue that, considering the complexity of the serum system, these assumptions are far too simplistic, but then the whole dataset may just be considered too heterogeneous to justify any modelling at all. However, if one is determined to model the available data, Eq. 1 can be considered a way to stretch out the data crammed together into the two regions near values of 0% and, in particular, of 100% of protein binding. At the ends of the b-scale we set thresholds of 2% and 99.5%, respectively, to avoid divergences of the logarithm function, i.e. b-values below 2% or above 99.5% are replaced by the corresponding threshold values.
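The transformation with its two thresholds can be sketched as follows. The sketch assumes that Eq. 1 amounts to log(Kb/p) = log[(100 - b)/b] with a base-10 logarithm, which reproduces the scale limits of 1.69 and -2.30 quoted in the caption of Fig. 1; the constant albumin concentration p is left out, and the function name is hypothetical.

```python
import math


def log_kb(b_percent, low=2.0, high=99.5):
    """Approximate log10(Kb / p) from a protein-binding percentage b.

    Assumes log(Kb/p) = log((100 - b) / b); the constant albumin
    concentration p is not included. Values of b outside [low, high]
    are replaced by the threshold to keep the logarithm finite.
    """
    b = min(max(b_percent, low), high)
    return math.log10((100.0 - b) / b)


# Weak, intermediate and nearly complete binding.
for b in (0.0, 50.0, 100.0):
    print(b, round(log_kb(b), 2))
```

With the thresholds in place, the transformed values are confined to the range between about -2.30 and 1.69.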

Before calculating the topological pharmacophore data of the structures, they are submitted to a pKa calculation [11]. Subsequently, acidic and basic groups are (de)protonated to produce the protonation state appropriate for pH = 7. It is very important to have the proper state of protonation, because donor- and acceptor strengths of polar atoms depend very heavily on the number of proton ligands. Of course this calculation is also subject to some degree of uncertainty.

Figure 1 Hierarchical tree representation of the similarity matrix for the reduced set of 100 structures with known protein binding data. The stem originates at a similarity value of zero while the leaves are at similarity value one. The branches are colored according to log Kb values (Eq. 1), with the scale ranging from blue, for a value of 1.69 (corresponding to 0% binding) over white to red, for a value of –2.30 (corresponding to 100% protein binding). In general, branches are of consistent color shading and unite with branches of different shade only at low similarity values.

The similarity matrix derived from these pharmacophore representations is best represented as a hierarchical tree, in which like structures are grouped together as branches. Fig. 1 shows this tree in a color-coded form, with a scale going from red, for low log Kb values, over white to blue for high values. The appearance of more or less uniform color shades within branches indicates that our pharmacophore description together with the similarity measure is able to group together structures with similar protein binding values. In order not to overload the graphs, we present them for a reduced set of 100 randomly chosen structures. Whenever numerical data are mentioned we give them for the reduced set and (in brackets) for the whole set.

Calculating Descriptors

First we diagonalize the symmetric similarity matrix. For our example the spectrum of eigenvalues is displayed in Fig. 2. The first eigenvalue (labeled 0) is not included in this graph because of its high value of 41.4 (245.1). The spectrum is typical for such similarity matrices. The first few eigenvalues decrease rather quickly from high values to values around one. Then a slower decay towards zero sets in, and finally a fairly insignificant tail of near-zero negative values is seen. Ideally, if the similarity function were perfectly suited to provide a metric in pharmacophore space, no negative values would show up. Thus, we consider the negative tail as an indication that our similarity function is only an approximate metric.

Figure 2 Eigenvalue (loading) spectrum of the similarity matrix for the reduced set of 100 structures. The highest eigenvalue of 41.1 (with index 0) has been omitted in order to have a reasonable scale on the vertical loading axis. High loading values indicate that many structures have contributions from the corresponding component in pharmacophore space. On the other hand the number of significant loading values (of about one or higher) is a measure for the diversity of the set of structures with respect to its pharmacophoric content. The negative eigenvalues are an indication of the degree of inadequacy of the similarity measure to serve as a metric in pharmacophore space. They amount to a total of 1.13 which must be brought into relation to 100, the number of structures. To compensate for this defect, only the subspace of the first 60 eigenvalues is taken as descriptor space.

The eigenvectors of the similarity matrix are already orthogonal, and thus suited to provide a basis in our pharmacophore space. Since the trace of the similarity matrix is equal to the number of pharmacophores, 100 (602), one could in the ideal case, where no negative eigenvalues occur, consider the eigenvalues as loadings of the directions in pharmacophore space with the actual pharmacophores. A zero eigenvalue indicates that this direction is of no relevance, and consequently, only the subspace spanned by the eigenvectors with positive eigenvalues must be considered. However, in actual cases one always encounters some negative loading, in our example a total of 1.13 (39.1). In order to compensate for this defect we omit not only the negative eigenvalue directions but also some of the smallest positive ones, such that the sum of the remaining large positive eigenvalues, i.e. the total loading, just does not exceed the number of pharmacophores. In our example the number of eigenvalues, K, needed to achieve full loading is 60 (105). That these two values are far closer than suggested by the number of structures in the corresponding sets illustrates that the reduced set of 100 structures describes a significant portion of the diversity present among all 602 compounds.
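The selection of K can be sketched as follows, given the eigenvalues of the similarity matrix (obtainable from any symmetric eigensolver). The function name, the small numerical tolerance and the toy spectrum are assumptions made for illustration; they are not data from the present study.

```python
def count_relevant_eigenvalues(eigenvalues, n_structures, tol=1e-9):
    """Number K of largest positive eigenvalues whose sum stays <= N.

    Negative eigenvalues are always discarded; the smallest positive
    ones are discarded as well once the accumulated loading would
    exceed the number of structures N.
    """
    total = 0.0
    k = 0
    for value in sorted(eigenvalues, reverse=True):
        if value <= 0.0 or total + value > n_structures + tol:
            break
        total += value
        k += 1
    return k


# Toy spectrum for N = 5 (trace 5.0) with a negative loading of 0.2.
spectrum = [3.0, 1.2, 0.7, 0.3, -0.2]
print(count_relevant_eigenvalues(spectrum, 5))
```

In this toy case the three largest eigenvalues sum to 4.9; adding the fourth would exceed N = 5, so it is dropped together with the negative one.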

Now, the eigenvectors belonging to these K remaining relevant positive eigenvalues span the subspace within which every pharmacophore lies (apart from imperfections of the metric), and thus form a basis. In order to normalize these basis vectors we recall that we consider S as the metric, which implies that the eigenvectors need to be divided by the square root of their eigenvalue in order to become unit vectors with respect to the metric S.

The similarity descriptors, σk (k = 1,…,K), of a given structure are now simply the coordinates of its pharmacophore with respect to this normalized basis. They are obtained from the set of similarity values, sl (l = 1,…,N), with respect to the original (training) set of pharmacophores through

$$\sigma_k = \frac{1}{\sqrt{E_k}} \sum_{l=1}^{N} e_{kl}\, s_l , \qquad \text{(Eq. 2)}$$

where Ek are the eigenvalues of the similarity matrix, S, and ekl are the components of the corresponding eigenvectors.
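A minimal sketch of this projection is given below. It assumes that Eq. 2 has the form σk = Ek^(-1/2) Σl ekl sl and that the eigenvectors are normalized to unit length in the ordinary Euclidean sense. The function name and the toy training set of two structures with mutual similarity 0.6 (eigenvalues 1.6 and 0.4) are illustrative assumptions.

```python
import math


def similarity_descriptors(similarities, eigenvalues, eigenvectors):
    """Coordinates of a structure in the S-normalized eigenvector basis.

    similarities: values s_l to the N training structures
    eigenvalues:  the K retained positive eigenvalues E_k
    eigenvectors: the K eigenvectors, each a list of N components e_kl
    """
    descriptors = []
    for e_k, vector in zip(eigenvalues, eigenvectors):
        projection = sum(c * s for c, s in zip(vector, similarities))
        descriptors.append(projection / math.sqrt(e_k))
    return descriptors


# Toy similarity matrix [[1, 0.6], [0.6, 1]] and its eigen-decomposition.
r = 1.0 / math.sqrt(2.0)
eigenvalues = [1.6, 0.4]
eigenvectors = [[r, r], [r, -r]]

# A training structure (a column of S) and a hypothetical new structure.
train = similarity_descriptors([1.0, 0.6], eigenvalues, eigenvectors)
probe = similarity_descriptors([0.5, 0.5], eigenvalues, eigenvectors)
print(sum(x * x for x in train), sum(x * x for x in probe))
```

For the training structure the squared descriptors sum to one, whereas the new structure, being only moderately similar to both training structures, reaches a smaller sum.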

Statistical Analysis and Regression Models

The descriptor values for the structures can be submitted to a statistical analysis, for which we used the program package TSAR [12]. It is important that the raw data are used in this analysis without resorting to any kind of variance scaling. Since descriptors originating from small eigenvalues have little importance, scaling up their data ranges would unduly emphasize them.

A principal component analysis of the variance matrix reveals the interdependence of the descriptors. As one might have expected from the way they have been generated, there is typically little interdependence found. In our example there is an average fraction of 1.67% (0.95%) of variance per descriptor. For the principal components the fraction of variance explained starts at 26% (24%) for the first component and quickly decreases for the following components. The fourteenth component still explains 1.61% (1.39%) of the variance, the average value per descriptor, and the 28th component still explains 0.84% (0.72%), half the average. In fact the spectrum of the covariance matrix (multiplied by N) looks very similar to the spectrum of the similarity matrix S, except for the first eigenvalue, not shown in Fig. 2. That this is approximately true can be understood from the observation that the average descriptor values are close to zero (except for σ0, which belongs to the first eigenvalue, labeled 0). For this reason and from Eq. 2 the near equality of the covariance matrix and of S/N is evident, when restricted to the space spanned by the second to K-th eigenvectors of S.
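The last statement can be made explicit with a short derivation, assuming that Eq. 2 has the form σk = Ek^(-1/2) Σl ekl sl and that the eigenvectors of S are orthonormal in the ordinary Euclidean sense. The superscript (j), labeling the training structure, is notation introduced here for illustration. For training structure j the similarity values are sl = Slj, so that

```latex
\begin{align}
\sigma_k^{(j)} &= \frac{1}{\sqrt{E_k}} \sum_{l=1}^{N} e_{kl}\, S_{lj}
                = \frac{E_k\, e_{kj}}{\sqrt{E_k}}
                = \sqrt{E_k}\; e_{kj} , \\
\frac{1}{N} \sum_{j=1}^{N} \sigma_k^{(j)} \sigma_{k'}^{(j)}
               &= \frac{\sqrt{E_k E_{k'}}}{N} \sum_{j=1}^{N} e_{kj}\, e_{k'j}
                = \frac{E_k}{N}\, \delta_{kk'} .
\end{align}
```

To the extent that the descriptor means vanish, the left-hand side of the second line is the covariance matrix, which is therefore diagonal with entries Ek/N.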

All this illustrates that there is little redundancy present when using similarity descriptors; they account properly for the pharmacophoric content of the set of structures.

Before a regression analysis is applied, one should note that the final number of descriptors, K, is typically smaller than the number, N, of experimental data values, although it is possible, in principle, that K is equal to N. Thus, straightforward multiple linear regression could be applied to obtain a statistical model. However, the large number of relevant descriptors suggests the use of the partial least squares (PLS) method [13]. For a stable model (low values for the predictive sum of squares) four (five) components must be taken, which leads to 77% (52%) of variance explained. For the reduced set the result of this analysis is shown in Fig. 3.

Figure 3 Correlation plot of calculated versus experimental log Kb values (Eq. 1) for the reduced set of 100 structures.

 

Model Prediction

With the regression model for the training set at hand, a prediction of the target quantity for an additional compound proceeds as follows. From its pharmacophore representation one calculates all similarity values sl (l = 1,…,N) with respect to all structures of the training set. From these values the similarity descriptors σk are calculated through Eq. 2. Since any pharmacophore is a unit vector in pharmacophore space we expect, within the accuracy of the similarity metric, the following inequality to hold:

$$\sum_{k=1}^{K} \sigma_k^{2} \le 1 . \qquad \text{(Eq. 3)}$$

For structures of the training set the equality should apply. This is actually observed up to a few percent, e.g. in our example we have an average defect value (deviation from one) of 0.04 ± 1.6% (0.07 ± 4.1%). For a test structure we expect the inequality to hold, because it may have components outside the space spanned by the training-set pharmacophores. Thus, the defect from the value of one is an indication of the inability of the model to account for the pharmacophoric content of the test structure, and thus to fully predict the requested quantity.
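The defect can be sketched as a one-line computation, assuming that Eq. 3 bounds the sum of squared similarity descriptors by one. The function names are hypothetical, and the default limit of 10% is merely an illustrative choice, not a prescribed cutoff.

```python
def loading_defect(descriptors):
    """Deviation of the squared descriptor norm from one.

    Assumes Eq. 3 states that the sum of squared descriptors is at
    most one; a value near zero means the structure lies almost fully
    within the space spanned by the training-set pharmacophores.
    """
    return 1.0 - sum(x * x for x in descriptors)


def within_model_scope(descriptors, max_defect=0.10):
    """True if the defect is below an (illustrative) acceptance limit."""
    return loading_defect(descriptors) < max_defect


# Hypothetical descriptor vectors of two test structures.
print(loading_defect([0.9, 0.4]), within_model_scope([0.9, 0.4]))
print(loading_defect([0.5, 0.5]), within_model_scope([0.5, 0.5]))
```

Predictions for structures that fail such a check would be treated with corresponding caution.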

Figure 4 Percentage of structures (of all 602) that have a defect percentage larger than the value given on the horizontal axis. In this case only some 50% of all compounds have a defect of less than 10%. The highest defect value encountered is 62.5%.

As an illustration we attempt to predict protein binding values for the full set of 602 compounds from the model derived for the reduced set of 100 compounds. Calculating for each structure the set of descriptor values and using the coefficients of the linear regression equation, model predictions are immediately obtained, as well as loading defect values. Before examining the predictions we have to look at the defect values. The average defect of all structures within the reduced model amounts to 10.8 ± 11.9%. The defect distribution, shown in Fig. 4, illustrates that many structures have sizeable defects. Thus, it is useful to analyze the predictive power of the model as a function of the defect. This is illustrated in Figs. 5 and 6, which display the predicted protein binding values versus the experimental ones for structures with a defect value of less than 10% and for all structures, respectively. Clearly, for defects above 10% the predictive power of the model drops significantly.

Figure 5 (left) Correlation plot of calculated versus experimental log Kb values (Eq. 1) for the 326 structures having less than 10% loading defect with respect to the model obtained from the reduced set of 100 structures. The quality of the model is reduced as compared to the correlation of the original model (Fig. 3).

Figure 6 (right) Same as Fig. 5 but for all 602 structures, having loading defect values up to 62.5%. The quality of the model is now markedly reduced.

 

Specific Protein Binding

In this section we present three examples for which the target protein is well defined in the sense that a single active site is occupied by the inhibitor molecules. All examples make use of inhibition data (IC50) gathered in the course of drug design projects. We will not make any reference to structural details, but simply apply the procedure just described in the preceding sections:

- protonate the structures corresponding to pH 7 ([11])

- calculate topological pharmacophores and the corresponding similarity matrix ([4])

- perform the similarity analysis and calculate the similarity descriptors

- apply the PLS method and produce the corresponding linear regression model

This program was applied to the three projects of Human Dihydrofolate Reductase (DHFR) inhibitors [14], Thrombin inhibitors [15-17], and Renin inhibitors [18]. Table 1 summarizes the data. The first row gives the number of compounds with well-defined experimental data (IC50) we had access to. Compounds for which only a lower limit for the IC50 value was given (i.e. unmeasurably weak binders) were omitted beforehand, because they cannot properly be included in the analysis. From this starting set a fraction of the compounds was removed, because we restricted the calculation to pharmacophores of at most eight agons for reasons of computational speed. The second row contains the number of remaining compounds used in the calculation. In the third row the number of resulting similarity descriptors is given. The fourth row contains the number of relevant components obtained from the PLS analysis by the criterion of a minimal predictive sum of squares. The fifth row gives the percentage of variance explained by the model, and the final row the cross-validated r². For comparison the numbers for the Albumin binding case of the previous sections are also included in the table.

Table 1 Data for the statistical models of the three systems of specific protein binding. For comparison data of the plasma binding models are also given.

| Protein | DHFR | Thrombin | Renin | Albumin | Albumin red. |
|---|---|---|---|---|---|
| total number of compounds | 1089 | 2443 | 1842 | 691 | 691 |
| number of compounds selected | 964 | 1567 | 996 | 602 | 100 |
| number of descriptors | 51 | 86 | 33 | 105 | 60 |
| number of PLS components | 7 | 8 | 5 | 5 | 4 |
| % variance explained | 64.5 | 65.3 | 61 | 52.3 | 77.5 |
| cross-validation r² | 0.582 | 0.610 | 0.579 | 0.374 | 0.367 |

 

If we consider the number of descriptors as an indication of the diversity of a set of structures, it is interesting to note that the cases of specific binding show up as less diverse than the plasma binding case, in accordance with expectations. This is because Albumin is known to carry various binding pockets of different and partly variable shape and size (see e.g. [19]).

Figs. 7 to 9 show the predicted against the actually measured log IC50 values for the three cases.

Figure 7 Correlation of calculated versus experimental log IC50 values for the model of the inhibition of human dihydrofolate reductase. The set contains 964 compounds.

Figure 8 Same as Fig. 7 but for the inhibition of human thrombin by a set of 1567 compounds.

Figure 9 Same as Fig. 7 but for the inhibition of human renin by a set of 996 compounds.

Topological pharmacophores represent some kind of reduced geometric description of the actual binding conformation. For this reason one cannot expect to obtain full predictive power, but has to consider the calculated IC50 values from our models as lower thresholds to possible actual binding data. Nevertheless, such conditional predictions are still valuable for a pre-selection of possible leads or drug candidates.

Discussion

We have presented a method to construct descriptors for statistical models in cases where only a similarity relation between the objects to be modeled is given. The method has been applied to the case of a topological pharmacophore description of chemical structures, for which a similarity measure is given that makes no reference to any kind of descriptor set. The analysis of the similarity matrix led to a natural assessment of the diversity of the set of structures, and allowed us to derive directly a set of similarity descriptors that account for the pharmacophoric content of the set. Conventional statistical methods were applied to generate regression models from these descriptors. Owing to the large number of descriptors (though usually not redundant) we considered partial least squares the method of choice. Four cases of protein binding were treated with fair success.

The question arises whether one can gain some intuitive understanding of the relevant mechanism leading to specific values of the target quantity by examining the nature of descriptors that are of outstanding importance. In the case of similarity descriptors such intuitive understanding can hardly be expected, since the descriptors are of a rather abstract nature. However, a look at a color-coded tree as shown in Fig. 1 may provide some insight, by examining the structures belonging to prominently colored branches for common features of pharmacophoric nature.

The choice of the similarity measure is, of course, a central ingredient for a possible success, and determines the sophistication of the model. In our case of topological pharmacophores, geometric information is built into the model as far as the topological distance matrix is able to account for.

A convenient feature of the whole method is its ability to indicate loading defects. On the one hand this gives a reliability indication; on the other hand it directly suggests possibilities to improve models by inclusion of high-defect structures into the training set.

For similarity measures directly derived from a set of conventional descriptors, such as e.g. the Tanimoto similarity, we expect no improvement of the present method over models derived from the original descriptors.

It is quite possible that rather complex phenomena, which originate from several different processes, may yield to description by a single model, as partly illustrated by the reasonable success in the case of albumin binding, in which several binding sites are present, some of them of variable geometry.

Acknowledgements

We are very much indebted to Krystyna Kratzat and Manfred Kansy for making pharmacokinetic data from the literature freely available in electronic form throughout our whole company. Discussions with Nicole Kratochwil on the serum albumin system are much appreciated.

References

1. van de Waterbeemd, H., B. Testa, and G. Folkers, eds. Computer-Assisted Lead Finding and Optimization. 1997, Wiley: Weinheim.

2. Daylight Chemical Information Systems Inc., www.Daylight.com.

3. Rarey, M. and J.S. Dixon, Feature Trees: A New Molecular Similarity Measure Based on Tree Matching. Journal of Computer Aided Molecular Design, 1998. 12: p. 471-490.

4. Gerber, P.R., Topological Pharmacophore Description of Chemical Structures using MAB-Force-Field Derived Data and Corresponding Similarity Measures. Proc. 4th Girona Seminar on Molecular Similarity, 1999.

5. Gerber, P.R. and K. Muller, MAB, a Generally Applicable Molecular Force Field for Structure Modelling in Medicinal Chemistry. Journal of Computer Aided Molecular Design, 1995. 9: p. 251-268.

6. Gerber, P.R., Charge Distribution from a Simple Molecular Orbital Type Calculation and Non-Bonding Interaction Terms in the Force Field MAB. Journal of Computer Aided Molecular Design, 1998. 12: p. 37-51.

7. Folkers, G. Lock and Key - A Hundred Years After. in Emil Fischer Commemorate Symposium. 1995: Pharmaceutica Acta Helvetiae.

8. Goodman Gilman, A., The Pharmacological Basis of Therapeutics. 1996, New York: McGraw-Hill.

9. Dinnendahl, V. and V. Fricke, eds. Arzneistoff-Profile. 1997, Govi Pharmazeutischer Verlag GmbH: Eschborn.

10. von Bruchhausen, F., et al., eds. Hagers Handbuch der Pharmazeutischen Praxis. 1993, Springer: Berlin.

11. CompuDrug, www.Compudrug.hu.

12. Oxford Molecular, TSAR. 1998, The Medawar Center: Oxford.

13. Eriksson, L., et al., Introduction to Multi- and Megavariate Data Analysis using Projection Methods (PCA & PLS). 1999, Umea, Sweden: Umetrics.

14. Dale, G.E., et al., A Single Amino Acid Substitution in Staphylococcus Aureus Dihydrofolate Reductase Determines Trimethoprim Resistance. Journal of Molecular Biology, 1997. 266: p. 23-30.

15. Hilpert, K., et al., Design and Synthesis of Potent and Highly Selective Thrombin Inhibitors. Journal of Medicinal Chemistry, 1994. 37: p. 3889-3901.

16. Obst, U., D.W. Banner, and F. Diederich, Molecular Recognition at the Thrombin Active Site: Structure-Based Design and Synthesis of Potent and Selective Thrombin Inhibitors and the X-ray Crystal Structure of Two Thrombin-Inhibitor Complexes. Chemistry & Biology, 1997. 4: p. 287-295.

17. Banner, D.W. and P. Hadvary, Crystallographic Analysis at 3.0-A Resolution of the Binding to Human Thrombin of Four Site-Directed Inhibitors. The Journal of Biological Chemistry, 1991. 266: p. 20085-20093.

18. Oefner, C., et al., Renin Inhibition by Substituted Piperidines: A Novel Paradigm for the Inhibition of Monomeric Aspartic Proteinases? Chemistry & Biology, 1999. 6: p. 127-131.

19. Curry, S., P. Brick, and N.P. Franks, Fatty Acid Binding to Human Serum Albumin: New Insights from Crystallographic Studies. Biochimica et Biophysica Acta, 1999. 1441: p. 131-140.