HUGH G. GAUCH, JR.
Ecology and Systematics, Cornell University, Ithaca, New York 14850
AND
GENE B. CHASE2
Education, Cornell University, Ithaca, New York 14850
Reprinted from ECOLOGY Vol. 55, No. 6, Autumn 1974 pp. 1377-1381
Abstract. An increasing number of models
in ecology involve Gaussian curves (or the lognormal, which is the Gaussian
with logarithms of the abscissa values). These include models of species distributions
along gradients, habitat and niche width, overlap, and packing, and the lognormal
distribution of species importances. However, these models cannot be investigated
quantitatively unless data can be fitted to Gaussian curves and appropriate
statistics determined.
Further, the Gaussian data collected by ecologists
typically have special properties affecting the fitting process, notably irregular
spacing of samples and truncation of the curves, thereby making ours a specialized
case (as contrasted with Gaussian data in general). We conclude the best approach
for our application to be a variation of parameters algorithm, and present it
with a computer program in FORTRAN IV.
Applications are given which test field data
quantitatively against hypotheses for species distributions along environmental
gradients and for dominance-diversity relationships.
Key words: Gaussian curve; lognormal;
probit analysis; variation of parameters.
INTRODUCTION
The Gaussian or normal distribution is one of
the most common distributions in nature, and it appears in all the physical,
biological, and social sciences. The need to fit experimental data to Gaussian
curves is ubiquitous.
The Gaussian curve appears in various ecological
studies and hypotheses. For example, species distributions along environmental
gradients are generally considered to be Gaussian curves (Gauch and Whittaker
1972). It is this model of vegetation structure, and its consequent implications
for designing ordination techniques (Gauch et al. 1974), which has motivated
our work on fitting the Gaussian curve. This model in turn immediately relates
to issues of current interest: species niche and habitat distributions, overlap,
and packing. Also, the log-normal distribution of species importances (Preston
1948) can be investigated (the lognormal curve simply being the Gaussian curve
with a logarithmic scale on the abscissa). However, none of these models can
be investigated or tested quantitatively unless one is able to fit experimental
data to Gaussian curves.
Four properties of the Gaussian data which ecologists
tend to use (as in the above examples) are
distinctive for their data, as contrasted with
Gaussian data in general. Two are welcome. First, the base line (horizontal
asymptote) is at zero. Second, the curve is level with respect to the abscissa
and not tilted. These two simplifying properties may be used to advantage in
our ecological applications. However, ecological data often have two unwelcome
properties. First, the experimental data points may not be spaced uniformly
along the abscissa. Second, the sampling interval may not include the entire
Gaussian curve; one or both ends of the curve may be unsampled, and the sampling
interval may not even include the mode of the curve.
Our review of literature and computer programs
on fitting the Gaussian curve did not find techniques adequate for ecological
applications. Hence we designed our own algorithm and computer program.
The problem of fitting the Gaussian curve in ecological applications may be
stated as follows. Because the base line is at zero and the curve not tilted,
the Gaussian equation involves only three parameters:
where Y0 is the maximum,
the mean, and
the standard deviation.
We are given a set of N data points (x, y) whose positions along
the curve may be uneven and which may not sample the entire curve. From this,
we first desire to find the three parameters Y0,
,
and
of the Gaussian curve that best fit these data points
in the sense that the sum of squared deviations (between observed and fitted
y values) is minimized. Second, we want to present statistics measuring
goodness of fit. For this, a convenient statistic is the variance accounted
for by the fitted curve, expressed as a percentage of the original or total
variance. Another lucid statistic is the average percentage displacement, obtained
by summing the displacements of observed and fitted values divided by the summed
observed values, and multiplied by 100 to give a percentage. Our program calculates
these statistics, and also graphs the data points and fitted curve together
so rapid visual comparison is possible. Depending on the exact nature of the
data and hypothesis involved in an individual application, other statistical
tests will also be applicable, such as the chi-squared or the F-test,
and they can be easily computed from the output given.
REVIEW OF LITERATURE
Five approaches have been taken to fitting
the Gaussian curve: general non-linear least-squares fitting, polynomial approximation,
probit analysis, graphical approximation, and variation of parameters.
General non-linear least-squares fitting is available
in "canned" programs. These programs are very general and must have
supplied, for a given case, the function wanted and its partial derivatives
with respect to each parameter. The algorithm makes heavy use of matrix inversions,
which becomes costly for large matrices. Hence this generalized approach is
inefficient and costly for the Gaussian function.
Polynomial approximation is possible, the method
being worked out for exponential functions (as ours), even with unequally spaced
data points (Scheid 1968). For the Gaussian, only even-degree polynomials will
occur. Reasonable accuracy would require at least three terms, which is as many
parameters as we have in the original Gaussian equation (or more). However,
the best method for finding these polynomial coefficients is a variation of
parameters of these polynomial coefficients. Therefore, if this method were
used, it would be at least as much work as a straightforward variation of parameters
anyway, and additional effort would be required afterwards to go back to the
original parameters of the Gaussian equation. Hence we reject this approach.
Furthermore, the solutions obtained may be tilted curves.
Probit analysis is used routinely for fitting Gaussian curves (Finney 1952).
The associated cumulative function of to linearize data after which a simple
least-squares straight line used. These results are then "un-linearized"
get back parameters equation. problem us that our may be truncated not sampled
on one or other both ends. It becomes necessary know where truncation points
(before values can computed) and cannot usually even guessed at in applications.
If sides literature only offers encouragement equations this should exist. Furthermore
pro-bit transformed equivalent original data: sum squared deviations (SSD) preserved
underprobit transform. As consequence SSD parameter estimates dubious meaning
particular maximum likelihood estimators. short does fit estimated estimators
invalid go calculate chi-squared statistic.
Before computers were available, Preston (1948)
fitted Gaussian curves by what appear to be several approximate graphical techniques;
this was good for its time through necessarily somewhat subjective. However,
Patrick et al. (1954) have preserved a good account of one of these methods,
that developed by R. Singleton and used by Preston. We pass by these methods
because computers are now generally available.
Variation of parameters requires that each parameter depends on all the others in a continuous way. The Gaussian function happens to have this property. von Hoerner (1967) has worked out an excellent variation of parameters algorithm for the Gaussian function. By detailed theoretical considerations, and after fitting thousands of sets of data, he determined the best sequence of steps and optimum values for certain somewhat arbitrary numbers. He achieved excellent efficiency. Our method draws heavily upon his work. However, von Hoerner was working with data from radiotelescopes where the base of the Gaussian curve could be at other than zero and the curve could be tilted. Consequently, he spent effort solving for two additional parameters not pertaining to our application. Also, his data always included the mode, whereas ours may not, and we need to take extra measures to make some provision for this additional problem. Thus, variation of parameters, starting with von Hoerner's results, is the approach that we have chosen.
This choice of variation of parameters has several other advantages. First, many measures of goodness of fit are possible. But that used in variation of parameters is the sum of squared deviations (SSD), a statistic of very general applicability. Second, from statistical theory it can be shown that when the sampling errors are random (as in most of our applications) there is no better measure to optimize than SSD. When SSD is used, random errors will not change the Gaussian parameters of a curve. Third, least-squares fits produce maximum likelihood estimates. Fourth, because the best least-squares fit is obtained, in individual circumstances various other statistical tests may be valid and desirable. And fifth, the percentage of variance accounted for and related statistics may be used to check the hypothesis that the data are indeed Gaussian. That is, the hypothesis one begins with itself becomes testable after one has done the computation.
Details of our variation of parameters algorithm are presented in Appendix I.
APPLICATIONS
Our computer program for fitting the Gaussian may be used to evaluate quantitatively hypotheses whenever the Gaussian or lognormal is involved. This includes a wide variety of ecological problems, as discussed in the Introduction. We give several examples:
1) Whittaker (1967; Fig. 8, lower panel) presents the population densities for 12 species in the Santa Catalina Mountains, Arizona, along an environmental moisture gradient. This graph shows mainly Gaussian curves and a few bimodal or skewed curves; yet even the latter are also approximately Gaussian in form. He makes the hypothesis that the species' distributions are generally Gaussian in form, and shows the smoothed curves as such. By use of our program, his original data (Whittaker pers. comm.) were fitted to Gaussian curves, and they accounted for 87% of the variance. In view of sampling errors in field data, this seems to substantiate well his hypothesis for this Santa Catalina data. Furthermore, replacing the original moisture indices with ordination values from a Gaussian ordination yields 94% variance accounted for (Gauch et al. 1974).
2) The species in the just-discussed example are now described quantitatively with respect to their mode, maximum, and standard deviation along the gradient. This makes possible quantitative study of such problems as habitat width and overlap, possible differences between coniferous and deciduous species, and so on.
3) Preston (1948) presents Dirk's data from capture of 56,131 moths in Orono, Maine. He graphs the number of species per octave, and hypothesizes that the result is a lognormal distribution (that is, Gaussian with logarithms of the x values). He gives a fitted curve with parameters, expressed in our terms, of mode 3, maximum 48, and standard deviation 3.42 (in units of octaves). When fitted by our program, we obtain the parameter values respectively of 2.811, 45.86, and 3.701. Our SDD is 174.5; his SSD we computed to be 196.8. Our fitted curve accounts for 96% of the variance. Allowing some margin for sampling errors, this substantiates his hypothesis well.
4) Accepting Preston's model for the moth population at Orono, Maine for reasons given in the previous paragraph, we can compute the theoretical number of species which would be found in a complete sample. Dirk actually obtained 349 species in his sample; Preston gives the appropriate equation and computes the theoretical total number of species to be 410, so Dirk's sample represents approximately 85% of the possible total. Using our Gaussian parameters for this data, which produce a lower SSD and which we therefore consider more accurate, we obtain for Dirk's moth sample a theoretical total of 425 species, so his sample is approximately 82% of the possible total.
This method for fitting Gaussian curves is very sparing of the investigator's time, especially as contrasted with some other possibilities. Also, because SSD was minimized, the resulting parameters are maximum likelihood estimators, and consequently various statistical tests may be applied. Finally, the algorithm has not imposed any constraints or unattainable requirements, but can handle typical ecological data.
APPENDIX I
Our variation of parameters algorithm and computer program for fitting the Gaussian are described in this appendix.
We designed the program so that an input threshold value may be specified, and data points with y values at or below this threshold will be ignored. This saves computer time for any application for which such values happen to be insignificant or meaningless.
Because the Gaussian equation has three parameters, a minimum of three positive data points are required to define a Gaussian curve. The program detects any data sets with under three positive data points, prints a message that the curve has under three positive values, and goes on to the next data set.
First guesses of Y0,
, and
must be supplied
before beginning variation of parameters; the more accurate these are, the faster
the solution will be reached. For first guesses we use a very simple algorithm.
The (x, y) data points are searched to find the highest value of y; the
height of this point y is taken as the first guess of the maximum Y0,
and its corresponding position x on the abscissa is the first guess of
the mode
. Solving the Gaussian equation for
, one obtains
Data values (x, y) are put into this equation
to solve for
, and a weighted average is used for the first guess of the standard
deviation u. In theory, the weights should be assigned according to the value
of the derivative in the neighborhood of the data point and according to the
sampling error distribution. In practice, we have found this overly complex
and wasteful of computer time. We have therefore replaced this by a similar
but very simple weighting function: if y is between 100% to 95% of the largest
y value, the weight is 0; if between 95% and 50%, the weight is 1; if
y is 50% to 0% of the largest y value, the weight varies linearly from 1 to
0.
The variations of parameters to follow have in
common the same basic structure. The basic point is to hold one parameter constant,
vary a second, and observe the change produced in the third and in SSD. For
a variation, three points are constructed: the
current value of the parameter being varied, plus and minus an amount, 2
;
and the corresponding three SSD values. These three parameter and SSD values
give the coordinates for three points that define a parabola. The position of
the minimum value of this parabola is then computed. If the parabola bottom
is within plus or minus
of the old value, this becomes
the new value for the parameter being varied; if beyond these limits, the new
value is changed the maximum amount of
in the right direction.
This cautious moving is necessary to prevent an oscillation that would preclude
converging to a solution. The parabola is constructed using 2
rather than
in order to smooth
over local irregularities. At the start,
is set to
/4;
at each iteration when the variations of Y0 and
are both within plus or minus
,
is
narrowed by a factor of 3 for the next iteration.
For the Gaussian, we use the following variation
of parameters. First, vary
, holding
constant,
and calculate the
corresponding three Y0 and SSD values; use
the three
and SSD values to define a parabola and solve for the new
. Second, vary
, holding
constant,
and calculate the corresponding
three Y0 and SSD values; use the three
and SSD values to define
a parabola and solve for the new
. Third, using these new
and
values, calculate the new Y0 and the new SSD. This completes a variation of
parameters. This order of variation,
,
, and then
Y0, is in order
of the criticalness of the parameters, and produces the most rapid convergence
to a solution.
This variation is iterated; at each iteration
a decision must be made to terminate or to continue iterating. Three criteria
are used. First, if either variation of
improved SSD, or the overall
SSD improved, there is mathematically the prospect of getting a better solution,
so iterating may continue; otherwise iterating
is terminated. A second test is practical
rather than mathematical: unless Y0,
, or
change by at least 0.5%, we consider
the current solutions already accurate enough for our purposes and terminate.
(Obviously, other users could set this value at whatever is appropriate for
their purposes.) Otherwise, iteration of the entire process is continued. The
third criterion is an absolute limit on the number of iterations, based on von
Hoerner's results. He found from 2,000 observations that never more than six
iterations were needed to reduce the remaining improvement below one-fifth of
the mean error. However, he used more data points than we tend to have,
so as a slight modification of his conclusions, we use an absolute iterations
limit of 10 and terminate if this limit is reached.
One problem remains, namely, the case where the
mode is outside the sampling interval. After several attempts to solve this
problem, we finally found the following reasonably satisfactory approach. If,
in making the first guesses, we find the mode
to be at or within 5%
of an end of the sampling interval, we make two changes in initializing the
variation of parameters. The
is narrowed by a factor of 3 at the very outset,
and the number of iterations allowed is increased to 25. Since
is now I/(4
X 3) or 1/12 of
the 25 iterations will allow searching for the mode
out to approximately 2
from the ends of the sampling interval. Experience
rapidly shows that, given any finite amount of sampling error, the shape of
the Gaussian curve is not effectively defined if there is no more than one tail
beginning more than approximately 2o- from the mode. That is, when the mode
is beyond the sampling interval by a standard deviation or two, a wide variety
of Gaussian curves can go through the data points with very nearly equal precision.
Generally, several solutions could be given for which the SSD is always better
than 99.99% of its lowest possible value; yet the Gaussian parameters could
vary on the order of 20% or more. Therefore we suggest that fitted curves in
this situation can be readily used for the purpose of accounting for variance,
but should not be taken too seriously for the purpose of knowing the Gaussian
parameters themselves. This is not a peculiar weakness of this method; the problem
is inherent in the mathematics of the situation. The only exception would be
data with no sampling errors for which one was willing to do all computations
in double or extended precision and increase the number of iterations enormously;
however, this is not the typical situation confronting biologists. Therefore,
when the mode is more than a couple of standard deviations away from the sampling
interval, we consider the curve not defined at all well and thus the data are
intractable and the solution useful only for accounting for variance.
The input data deck for the program consists of the x values, an alphanumeric field for purpose of identification, and one or several sets of y values to fit as Gaussian curves. Whenever several sets of y values are used with a single set of x values, the computer also presents the performance statistics (on variance and displacement) for the data set as a whole. Thus, a given x vector may be fitted to a single y vector, or one by one to an entire matrix of y vectors. This latter feature is very convenient in our particular application.
The program's output consists of identification of the input data, a graph of data points with fitted curve (unless suppressed by an input option); the graph scales (unless graphing suppressed), the three parameter values; and SSD, percentage of variance accounted for, and average percentage displacement of the y values; the threshold value and number of data points and iterations used; and finally, the reason for terminating iteration.
We have compiled the program in FORTRAN G on Cornell University's IBM System 360/65 computer. The source program itself uses only 11K of memory (where 1K is 1,024 bytes, and one byte is 4 bits), and complied together with system additions only 40K. Curves of 30 data points are fitted in 1 sec. Hence the program is quite economical.
After fitting many sets of data of a variety of characteristics, we find that the criteria presented for terminating iteration result generally in an accuracy of accounting for 99.9% of the variance possible. Obviously, a user desiring more or less accuracy could modify these criteria; the present values are suitable for our uses.
This program is in the Cornell Ecology Programs series as "Cornell Ecology Program 12. Fit Gaussian or Normal Curve," is written in FORTRAN IV, and is 595 cards long. A documentation, including a listing of the program and results with a set of sample data, is available from the first-named author at cost (currently US $5.00 payable to Cornell University; also a copy of the punched program deck currently costs $8.00 plus postage for 3 pounds and 5 ounces if airmail is requested).
1This research was supported by a National Science Foundation grant. We appreciate helpful comments on the manuscript from R. H. Whittaker. Manuscript received April 11, 1973; accepted April 22, 1974.
2Present address: Mathematics Department, Messiah College, Grantham, PA 17027.
LITERATURE CITED
Finney, D. J. 1952. Probit analysis: A statistical treatment of the sigmoid response curve. 2nd Ed. Cambridge Univ. Press, London. 318 p.
Gauch, H. G., G. B. Chase, and R. H. Whittaker. 1974. Ordination of vegetation samples by Gaussian species distributions. Ecology 55: 1382-1390.
Gauch, H. G., and R. H. Whittaker. 1972. Coenocline simulation. Ecology 53: 446-451.
Hoerner, S. von. 1967. Least-squares fit of a Gaussian to radio sources. Astrophys. J. 147: 467-470.
Patrick, R., M. H. Hohn, and J. H. Wallace. 1954. A new method for determining the pattern of the diatom flora. Not. Nat. of the Acad. of Nat. Sei. of Philadelphia 259: 1-12.
Preston, F. W. 1948. The commonness, and rarity, of species. Ecology 29: 254-283.
Scheid, F. 1968. Schaum's outline of theory and problems of numerical analysis. McGraw-Hill, New York. 422 p.
Whittaker, R. H. 1967. Gradient analysis of vegetation. Biol. Rev. 42: 207-264.