FITTING THE GAUSSIAN CURVE TO ECOLOGICAL DATA1

HUGH G. GAUCH, JR.

Ecology and Systematics, Cornell University, Ithaca, New York 14850

AND

GENE B. CHASE2

Education, Cornell University, Ithaca, New York 14850

Reprinted from ECOLOGY Vol. 55, No. 6, Autumn 1974 pp. 1377-1381

Abstract. An increasing number of models in ecology involve Gaussian curves (or the lognormal, which is the Gaussian with logarithms of the abscissa values). These include models of species distributions along gradients, habitat and niche width, overlap, and packing, and the lognormal distribution of species importances. However, these models cannot be investigated quantitatively unless data can be fitted to Gaussian curves and appropriate statistics determined.

Further, the Gaussian data collected by ecologists typically have special properties affecting the fitting process, notably irregular spacing of samples and truncation of the curves, thereby making ours a specialized case (as contrasted with Gaussian data in general). We conclude the best approach for our application to be a variation of parameters algorithm, and present it with a computer program in FORTRAN IV.

Applications are given which test field data quantitatively against hypotheses for species distributions along environmental gradients and for dominance-diversity relationships.

Key words: Gaussian curve; lognormal; probit analysis; variation of parameters.

INTRODUCTION

The Gaussian or normal distribution is one of the most common distributions in nature, and it appears in all the physical, biological, and social sciences. The need to fit experimental data to Gaussian curves is ubiquitous.

The Gaussian curve appears in various ecological studies and hypotheses. For example, species distributions along environmental gradients are generally considered to be Gaussian curves (Gauch and Whittaker 1972). It is this model of vegetation structure, and its consequent implications for designing ordination techniques (Gauch et al. 1974), which has motivated our work on fitting the Gaussian curve. This model in turn immediately relates to issues of current interest: species niche and habitat distributions, overlap, and packing. Also, the log-normal distribution of species importances (Preston 1948) can be investigated (the lognormal curve simply being the Gaussian curve with a logarithmic scale on the abscissa). However, none of these models can be investigated or tested quantitatively unless one is able to fit experimental data to Gaussian curves.

Four properties of the Gaussian data which ecologists tend to use (as in the above examples) are distinctive for their data, as contrasted with Gaussian data in general. Two are welcome. First, the base line (horizontal asymptote) is at zero. Second, the curve is level with respect to the abscissa and not tilted. These two simplifying properties may be used to advantage in our ecological applications. However, ecological data often have two unwelcome properties. First, the experimental data points may not be spaced uniformly along the abscissa. Second, the sampling interval may not include the entire Gaussian curve; one or both ends of the curve may be unsampled, and the sampling interval may not even include the mode of the curve.

Our review of literature and computer programs on fitting the Gaussian curve did not find techniques adequate for ecological applications. Hence we designed our own algorithm and computer program.

The problem of fitting the Gaussian curve in ecological applications may be stated as follows. Because the base line is at zero and the curve not tilted, the Gaussian equation involves only three parameters:

where Y0 is the maximum, the mean, and the standard deviation. We are given a set of N data points (x, y) whose positions along the curve may be uneven and which may not sample the entire curve. From this, we first desire to find the three parameters Y0, , and of the Gaussian curve that best fit these data points in the sense that the sum of squared deviations (between observed and fitted y values) is minimized. Second, we want to present statistics measuring goodness of fit. For this, a convenient statistic is the variance accounted for by the fitted curve, expressed as a percentage of the original or total variance. Another lucid statistic is the average percentage displacement, obtained by summing the displacements of observed and fitted values divided by the summed observed values, and multiplied by 100 to give a percentage. Our program calculates these statistics, and also graphs the data points and fitted curve together so rapid visual comparison is possible. Depending on the exact nature of the data and hypothesis involved in an individual application, other statistical tests will also be applicable, such as the chi-squared or the F-test, and they can be easily computed from the output given.

REVIEW OF LITERATURE

Five approaches have been taken to fitting the Gaussian curve: general non-linear least-squares fitting, polynomial approximation, probit analysis, graphical approximation, and variation of parameters.

General non-linear least-squares fitting is available in "canned" programs. These programs are very general and must have supplied, for a given case, the function wanted and its partial derivatives with respect to each parameter. The algorithm makes heavy use of matrix inversions, which becomes costly for large matrices. Hence this generalized approach is inefficient and costly for the Gaussian function.

Polynomial approximation is possible, the method being worked out for exponential functions (as ours), even with unequally spaced data points (Scheid 1968). For the Gaussian, only even-degree polynomials will occur. Reasonable accuracy would require at least three terms, which is as many parameters as we have in the original Gaussian equation (or more). However, the best method for finding these polynomial coefficients is a variation of parameters of these polynomial coefficients. Therefore, if this method were used, it would be at least as much work as a straightforward variation of parameters anyway, and additional effort would be required afterwards to go back to the original parameters of the Gaussian equation. Hence we reject this approach. Furthermore, the solutions obtained may be tilted curves.

Probit analysis is used routinely for fitting Gaussian curves (Finney 1952). The associated cumulative function of to linearize data after which a simple least-squares straight line used. These results are then "un-linearized" get back parameters equation. problem us that our may be truncated not sampled on one or other both ends. It becomes necessary know where truncation points (before values can computed) and cannot usually even guessed at in applications. If sides literature only offers encouragement equations this should exist. Furthermore pro-bit transformed equivalent original data: sum squared deviations (SSD) preserved underprobit transform. As consequence SSD parameter estimates dubious meaning particular maximum likelihood estimators. short does fit estimated estimators invalid go calculate chi-squared statistic.

Before computers were available, Preston (1948) fitted Gaussian curves by what appear to be several approximate graphical techniques; this was good for its time through necessarily somewhat subjective. However, Patrick et al. (1954) have preserved a good account of one of these methods, that developed by R. Singleton and used by Preston. We pass by these methods because computers are now generally available.

Variation of parameters requires that each parameter depends on all the others in a continuous way. The Gaussian function happens to have this property. von Hoerner (1967) has worked out an excellent variation of parameters algorithm for the Gaussian function. By detailed theoretical considerations, and after fitting thousands of sets of data, he determined the best sequence of steps and optimum values for certain somewhat arbitrary numbers. He achieved excellent efficiency. Our method draws heavily upon his work. However, von Hoerner was working with data from radiotelescopes where the base of the Gaussian curve could be at other than zero and the curve could be tilted. Consequently, he spent effort solving for two additional parameters not pertaining to our application. Also, his data always included the mode, whereas ours may not, and we need to take extra measures to make some provision for this additional problem. Thus, variation of parameters, starting with von Hoerner's results, is the approach that we have chosen.

This choice of variation of parameters has several other advantages. First, many measures of goodness of fit are possible. But that used in variation of parameters is the sum of squared deviations (SSD), a statistic of very general applicability. Second, from statistical theory it can be shown that when the sampling errors are random (as in most of our applications) there is no better measure to optimize than SSD. When SSD is used, random errors will not change the Gaussian parameters of a curve. Third, least-squares fits produce maximum likelihood estimates. Fourth, because the best least-squares fit is obtained, in individual circumstances various other statistical tests may be valid and desirable. And fifth, the percentage of variance accounted for and related statistics may be used to check the hypothesis that the data are indeed Gaussian. That is, the hypothesis one begins with itself becomes testable after one has done the computation.

Details of our variation of parameters algorithm are presented in Appendix I.

APPLICATIONS

Our computer program for fitting the Gaussian may be used to evaluate quantitatively hypotheses whenever the Gaussian or lognormal is involved. This includes a wide variety of ecological problems, as discussed in the Introduction. We give several examples:

1) Whittaker (1967; Fig. 8, lower panel) presents the population densities for 12 species in the Santa Catalina Mountains, Arizona, along an environmental moisture gradient. This graph shows mainly Gaussian curves and a few bimodal or skewed curves; yet even the latter are also approximately Gaussian in form. He makes the hypothesis that the species' distributions are generally Gaussian in form, and shows the smoothed curves as such. By use of our program, his original data (Whittaker pers. comm.) were fitted to Gaussian curves, and they accounted for 87% of the variance. In view of sampling errors in field data, this seems to substantiate well his hypothesis for this Santa Catalina data. Furthermore, replacing the original moisture indices with ordination values from a Gaussian ordination yields 94% variance accounted for (Gauch et al. 1974).

2) The species in the just-discussed example are now described quantitatively with respect to their mode, maximum, and standard deviation along the gradient. This makes possible quantitative study of such problems as habitat width and overlap, possible differences between coniferous and deciduous species, and so on.

3) Preston (1948) presents Dirk's data from capture of 56,131 moths in Orono, Maine. He graphs the number of species per octave, and hypothesizes that the result is a lognormal distribution (that is, Gaussian with logarithms of the x values). He gives a fitted curve with parameters, expressed in our terms, of mode 3, maximum 48, and standard deviation 3.42 (in units of octaves). When fitted by our program, we obtain the parameter values respectively of 2.811, 45.86, and 3.701. Our SDD is 174.5; his SSD we computed to be 196.8. Our fitted curve accounts for 96% of the variance. Allowing some margin for sampling errors, this substantiates his hypothesis well.

4) Accepting Preston's model for the moth population at Orono, Maine for reasons given in the previous paragraph, we can compute the theoretical number of species which would be found in a complete sample. Dirk actually obtained 349 species in his sample; Preston gives the appropriate equation and computes the theoretical total number of species to be 410, so Dirk's sample represents approximately 85% of the possible total. Using our Gaussian parameters for this data, which produce a lower SSD and which we therefore consider more accurate, we obtain for Dirk's moth sample a theoretical total of 425 species, so his sample is approximately 82% of the possible total.

This method for fitting Gaussian curves is very sparing of the investigator's time, especially as contrasted with some other possibilities. Also, because SSD was minimized, the resulting parameters are maximum likelihood estimators, and consequently various statistical tests may be applied. Finally, the algorithm has not imposed any constraints or unattainable requirements, but can handle typical ecological data.

APPENDIX I

Our variation of parameters algorithm and computer program for fitting the Gaussian are described in this appendix.

We designed the program so that an input threshold value may be specified, and data points with y values at or below this threshold will be ignored. This saves computer time for any application for which such values happen to be insignificant or meaningless.

Because the Gaussian equation has three parameters, a minimum of three positive data points are required to define a Gaussian curve. The program detects any data sets with under three positive data points, prints a message that the curve has under three positive values, and goes on to the next data set.

First guesses of Y0, , and must be supplied before beginning variation of parameters; the more accurate these are, the faster the solution will be reached. For first guesses we use a very simple algorithm. The (x, y) data points are searched to find the highest value of y; the height of this point y is taken as the first guess of the maximum Y0, and its corresponding position x on the abscissa is the first guess of the mode . Solving the Gaussian equation for , one obtains

Data values (x, y) are put into this equation to solve for , and a weighted average is used for the first guess of the standard deviation u. In theory, the weights should be assigned according to the value of the derivative in the neighborhood of the data point and according to the sampling error distribution. In practice, we have found this overly complex and wasteful of computer time. We have therefore replaced this by a similar but very simple weighting function: if y is between 100% to 95% of the largest y value, the weight is 0; if between 95% and 50%, the weight is 1; if y is 50% to 0% of the largest y value, the weight varies linearly from 1 to 0.

The variations of parameters to follow have in common the same basic structure. The basic point is to hold one parameter constant, vary a second, and observe the change produced in the third and in SSD. For a variation, three points are constructed: the current value of the parameter being varied, plus and minus an amount, 2 ; and the corresponding three SSD values. These three parameter and SSD values give the coordinates for three points that define a parabola. The position of the minimum value of this parabola is then computed. If the parabola bottom is within plus or minus of the old value, this becomes the new value for the parameter being varied; if beyond these limits, the new value is changed the maximum amount of in the right direction. This cautious moving is necessary to prevent an oscillation that would preclude converging to a solution. The parabola is constructed using 2 rather than in order to smooth over local irregularities. At the start, is set to /4; at each iteration when the variations of Y0 and are both within plus or minus , is narrowed by a factor of 3 for the next iteration.

For the Gaussian, we use the following variation of parameters. First, vary , holding constant, and calculate the corresponding three Y0 and SSD values; use the three and SSD values to define a parabola and solve for the new . Second, vary , holding constant, and calculate the corresponding three Y0 and SSD values; use the three and SSD values to define a parabola and solve for the new . Third, using these new and values, calculate the new Y0 and the new SSD. This completes a variation of parameters. This order of variation, , , and then Y0, is in order of the criticalness of the parameters, and produces the most rapid convergence to a solution.

This variation is iterated; at each iteration a decision must be made to terminate or to continue iterating. Three criteria are used. First, if either variation of improved SSD, or the overall SSD improved, there is mathematically the prospect of getting a better solution, so iterating may continue; otherwise iterating is terminated. A second test is practical rather than mathematical: unless Y0, , or change by at least 0.5%, we consider the current solutions already accurate enough for our purposes and terminate. (Obviously, other users could set this value at whatever is appropriate for their purposes.) Otherwise, iteration of the entire process is continued. The third criterion is an absolute limit on the number of iterations, based on von Hoerner's results. He found from 2,000 observations that never more than six iterations were needed to reduce the remaining improvement below one-fifth of the mean error. However, he used more data points than we tend to have, so as a slight modification of his conclusions, we use an absolute iterations limit of 10 and terminate if this limit is reached.

One problem remains, namely, the case where the mode is outside the sampling interval. After several attempts to solve this problem, we finally found the following reasonably satisfactory approach. If, in making the first guesses, we find the mode to be at or within 5% of an end of the sampling interval, we make two changes in initializing the variation of parameters. The is narrowed by a factor of 3 at the very outset, and the number of iterations allowed is increased to 25. Since is now I/(4 X 3) or 1/12 of the 25 iterations will allow searching for the mode out to approximately 2 from the ends of the sampling interval. Experience rapidly shows that, given any finite amount of sampling error, the shape of the Gaussian curve is not effectively defined if there is no more than one tail beginning more than approximately 2o- from the mode. That is, when the mode is beyond the sampling interval by a standard deviation or two, a wide variety of Gaussian curves can go through the data points with very nearly equal precision. Generally, several solutions could be given for which the SSD is always better than 99.99% of its lowest possible value; yet the Gaussian parameters could vary on the order of 20% or more. Therefore we suggest that fitted curves in this situation can be readily used for the purpose of accounting for variance, but should not be taken too seriously for the purpose of knowing the Gaussian parameters themselves. This is not a peculiar weakness of this method; the problem is inherent in the mathematics of the situation. The only exception would be data with no sampling errors for which one was willing to do all computations in double or extended precision and increase the number of iterations enormously; however, this is not the typical situation confronting biologists. Therefore, when the mode is more than a couple of standard deviations away from the sampling interval, we consider the curve not defined at all well and thus the data are intractable and the solution useful only for accounting for variance.

The input data deck for the program consists of the x values, an alphanumeric field for purpose of identification, and one or several sets of y values to fit as Gaussian curves. Whenever several sets of y values are used with a single set of x values, the computer also presents the performance statistics (on variance and displacement) for the data set as a whole. Thus, a given x vector may be fitted to a single y vector, or one by one to an entire matrix of y vectors. This latter feature is very convenient in our particular application.

The program's output consists of identification of the input data, a graph of data points with fitted curve (unless suppressed by an input option); the graph scales (unless graphing suppressed), the three parameter values; and SSD, percentage of variance accounted for, and average percentage displacement of the y values; the threshold value and number of data points and iterations used; and finally, the reason for terminating iteration.

We have compiled the program in FORTRAN G on Cornell University's IBM System 360/65 computer. The source program itself uses only 11K of memory (where 1K is 1,024 bytes, and one byte is 4 bits), and complied together with system additions only 40K. Curves of 30 data points are fitted in 1 sec. Hence the program is quite economical.

After fitting many sets of data of a variety of characteristics, we find that the criteria presented for terminating iteration result generally in an accuracy of accounting for 99.9% of the variance possible. Obviously, a user desiring more or less accuracy could modify these criteria; the present values are suitable for our uses.

This program is in the Cornell Ecology Programs series as "Cornell Ecology Program 12. Fit Gaussian or Normal Curve," is written in FORTRAN IV, and is 595 cards long. A documentation, including a listing of the program and results with a set of sample data, is available from the first-named author at cost (currently US $5.00 payable to Cornell University; also a copy of the punched program deck currently costs $8.00 plus postage for 3 pounds and 5 ounces if airmail is requested).

1This research was supported by a National Science Foundation grant. We appreciate helpful comments on the manuscript from R. H. Whittaker. Manuscript received April 11, 1973; accepted April 22, 1974.

2Present address: Mathematics Department, Messiah College, Grantham, PA 17027.

LITERATURE CITED

Finney, D. J. 1952. Probit analysis: A statistical treatment of the sigmoid response curve. 2nd Ed. Cambridge Univ. Press, London. 318 p.

Gauch, H. G., G. B. Chase, and R. H. Whittaker. 1974. Ordination of vegetation samples by Gaussian species distributions. Ecology 55: 1382-1390.

Gauch, H. G., and R. H. Whittaker. 1972. Coenocline simulation. Ecology 53: 446-451.

Hoerner, S. von. 1967. Least-squares fit of a Gaussian to radio sources. Astrophys. J. 147: 467-470.

Patrick, R., M. H. Hohn, and J. H. Wallace. 1954. A new method for determining the pattern of the diatom flora. Not. Nat. of the Acad. of Nat. Sei. of Philadelphia 259: 1-12.

Preston, F. W. 1948. The commonness, and rarity, of species. Ecology 29: 254-283.

Scheid, F. 1968. Schaum's outline of theory and problems of numerical analysis. McGraw-Hill, New York. 422 p.

Whittaker, R. H. 1967. Gradient analysis of vegetation. Biol. Rev. 42: 207-264.