base correspond to a value M of the variable. Professor Pearson has proposed to call this value the "mode," i.e., it is the "fashionable" value.

If the distribution be symmetrical, mode, median, and mean will coincide, but not otherwise. If the distribution be skew, the mode becomes more important than the mean; it becomes, in fact, the type, the value that strikes the eye by its constant repetition when one looks down a column of values of the variable.

The median value has been largely used by Mr. Francis Galton on account of its practical simplicity. Thus, to get the median height of, say, 100 men, they have simply to be ranged in order of their heights, and the fiftieth man from the end measured.

An approximate method of obtaining the mode when the mean and the median are given, based on an empirical formula of Professor Pearson's, is described at the end of this paper.
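Galton's ranking procedure for the median, and the empirical mode formula mode ≈ mean − 3(mean − median) commonly attributed to Pearson, can be put into a short modern sketch. No such code appears in the original paper, and the heights used below are invented purely for illustration.

```python
# Sketch of Galton's ranking method for the median, and of the empirical
# approximation mode ≈ mean − 3 × (mean − median).
# The sample of heights is invented for illustration only.

def median(values):
    """Range the values in order and take the middle one (Galton's method);
    with an even count, average the two central values."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2

def approximate_mode(values):
    """Empirical formula: mode ≈ mean − 3 × (mean − median)."""
    mean = sum(values) / len(values)
    return mean - 3 * (mean - median(values))

heights = [64, 65, 66, 66, 67, 67, 67, 68, 69, 71]  # inches, invented
print(median(heights))            # 67.0
print(approximate_mode(heights))  # 67.0 (sample happens to be balanced)
```

For a symmetrical sample the mean and median coincide, so the formula returns the mean, as the text's remark on coinciding mode, median, and mean requires.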

Sec. 8. Standard Deviation of a Distribution.

Consider the second moment-coefficient, the quantity denoted in sec. 4 by μ2. μ2 is the mean square deviation [i.e., mean (deviation)²] of the curve round its centroid vertical; it is therefore of the order of the square of a length. Let

σ = √μ2.

σ is then a length, and a very important one, for it measures the mean distance of the curve from the centroid vertical, i.e., the degree of "scatter" of the distribution from the mean. This quantity has been variously called "standard deviation," "mean error," "error of mean square," and so on; in dynamics it occurs as "swing radius" and "radius of gyration." The first term is the one we shall use.
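The definition just given computes directly. The following Python sketch (a modern addition, not part of the original) takes μ2 about the mean in its population form:

```python
import math

def standard_deviation(values):
    """sigma = square root of the second moment-coefficient mu2, the mean
    square deviation of the values about their mean (the centroid vertical)."""
    mean = sum(values) / len(values)
    mu2 = sum((x - mean) ** 2 for x in values) / len(values)
    return math.sqrt(mu2)

print(standard_deviation([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0
```

The sample has mean 5 and mean square deviation 4, so σ = 2, confirming that σ is of the dimensions of a length.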

In the case of a skew-curve it might seem best to refer the curve to the mode instead of the mean, and state separately the standard deviations in excess and defect of the mode; unluckily however these quantities do not lend themselves to easy calculation. In any case a rough and handy rule holds good, that a total range of six times the standard deviation contains nearly the whole frequency; it is a crude but convenient measure of the practical extension of the curve.
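The rough rule that a range of six times the standard deviation contains nearly the whole frequency can be checked for the normal curve with the error function; this sketch is a modern addition, not part of the original text:

```python
import math

def normal_fraction_within(k):
    """Fraction of a normal distribution's frequency lying within
    plus or minus k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

# A total range of 6 sigma (i.e. within 3 sigma of the mean) covers
# about 99.73 per cent. of the frequency.
print(round(normal_fraction_within(3), 4))  # 0.9973
```

For skew curves the fraction differs somewhat, which is why the text offers the rule only as a crude and convenient measure.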

The range that occurs in the curve of type (4) is not, it must be noted, in any way a measure of scatter or variation. In a curve of that type you may hold the range fixed, and then proceed to vary the standard deviation as you please. Thus the two curves in fig. 3 have been drawn with the same range; it would evidently

4 Supplementary note p. 343.

[Fig. 3: two symmetrical curves of equal area and equal range, drawn with different standard deviations.]

be useless to take as a measure of variation any quantity which would be the same for both. The two curves have equal areas and the same range, viz., 20 centimetres, but their standard deviations differ.

It should be noticed that in the normal curve the constant in the exponent is the standard deviation.

Sec. 9. Skewness.

The two curves of Fig. 3 are examples of the symmetrical form of the limited range distribution; this case is however exceptional. As a rule distributions will be more or less skew, and it will be convenient to have some numerical measure of their departure from

5 Each of the large squares represents 5 cms. on the original scale.

symmetry. Whatever be taken as this measure, it must be a pure number, and must be easily calculable. Professor Pearson defines skewness in his paper as the ratio of the distance between the mode and mean to the standard deviation.

This ratio is zero for a symmetrical curve, where mode and mean coincide, and is never greater than unity. In all the skew-curves (2) to (4) the skewness may be expressed directly in terms of the constants.
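Pearson's measure, (mean − mode)/σ, can be sketched in modern Python. Pearson obtains the mode by fitting the curve itself; for a self-contained illustration the sketch below instead estimates the mode from the empirical relation mode ≈ 3·median − 2·mean, which is an assumption of this example, not the paper's procedure.

```python
import math

def skewness(values):
    """Pearson's skewness: (mean - mode) / standard deviation.
    The mode here is *estimated* by the empirical relation
    mode = 3*median - 2*mean (an illustrative shortcut, not a fit)."""
    n = len(values)
    mean = sum(values) / n
    ranked = sorted(values)
    med = ranked[n // 2] if n % 2 else (ranked[n // 2 - 1] + ranked[n // 2]) / 2
    mode = 3 * med - 2 * mean
    sigma = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return (mean - mode) / sigma

print(skewness([1, 2, 3, 4, 5]))  # 0.0 — symmetrical, mode and mean coincide
```

A sample with a long tail to the right, e.g. `[1, 1, 1, 2, 3, 10]`, gives a positive skewness, as the definition requires; the measure is a pure number, independent of the unit of measurement.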

Sec. 10. Probable Errors.

Now these are all numerical measures descriptive of the original distribution: but to what extent can we rely on the numbers when obtained?

Thus, suppose we make a series of physical observations of any kind, group the results into a frequency-curve and determine its mean, standard deviation, and so on. However steady the methods and material may be in reality, we shall never, on repeating the observations, arrive at exactly the same values of the constants. To take a concrete case, return to dice. Repeat a series like that of example (4) in the table r times, and work out the r resulting values of the chance p. Now form a new frequency-curve of the values of p. This derived curve will have some standard deviation σ′, and, evidently, the smaller σ′ is, the more can we rely on the determination of p from a single set of observations. The standard deviation σ′ can in general be estimated à priori for any constant of a chance distribution, and gives us a most valuable measure of the accuracy with which that constant can be determined. In general not σ′ but a number

0.674489... σ′

is used as the actual measure, and called the Probable Error of the constant; it might be better called, in Mr. Galton's nomenclature, the Quartile Deviation (or error) as it is not probable in any sense. Deviations greater and less than the quartile occur with equal frequency in the long run; i.e., the quartile may be defined as that point on the axis of measurement the vertical through which cuts off a quarter of the area of the frequency polygon on one side.
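The quartile property of the factor 0.674489 can be verified numerically for the normal curve: deviations within ±0.674489 σ of the mean account for exactly half the frequency. This check is a modern addition, not part of the original text.

```python
import math

QUARTILE_FACTOR = 0.674489  # the constant quoted in the text

def fraction_within_probable_error():
    """For a normal curve, the fraction of the frequency lying within
    plus or minus 0.674489 sigma of the mean; it comes out as one half,
    which is why Galton's name 'quartile deviation' is apt."""
    return math.erf(QUARTILE_FACTOR / math.sqrt(2))

print(round(fraction_within_probable_error(), 4))  # 0.5
```

Deviations greater and less than this amount thus occur with equal frequency in the long run, exactly as the text states.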

The distributions of skew curve constants may in general be taken as very closely normal, otherwise some more complex measure would have to be adopted, or the probable errors in excess and defect given separately.

The "probable errors" are quoted after the constants in several of the tables of the appendix.

I am indebted to Professor Pearson for permission to use his unpublished results for the calculation of these "probable errors" of skew curve constants.

Sec. 11. Contributory Causes and Component Cause-Groups.

Besides definitions of these various measures, there are one or two points of nomenclature on which I wish to be clear.

Consider again the case of generation of a frequency polygon by dice throwing. Each die as it falls and rolls has its final position determined by various causes, which cannot be precisely specified, and may (somewhat loosely) be described as "infinite" in number. Each of these causes might be termed a "contributory cause" in the formation of the chance distribution. Professor Pearson, however, uses the phrase in a different sense, in the sense, namely, of each die being a "contributory cause." The distinction is important, for while the number of contributory causes in the first sense is very large, the number of component cause-groups (considered as one group to each die) is not only finite, but often very small.

Now when we pass from fitting a point-binomial to a known discontinuous number of observations to fitting it to a series known to be really continuous, our numbers n, p, and q no longer admit of anything but a metaphorical interpretation, if indeed they can be considered as anything further than constants describing the form of the distribution. To avoid the slight confusion of terms as above, we shall refer to n in every case as the number of "component cause-groups." Just as n in the case of dice throwing is, as it were, the number of distinct channels through which the multitude of small contributory causes affect the form of the polygon, so we assume as a tentative working hypothesis that it bears the same significance in the case of continuous variation; there is no need for n to be large, however great we may conceive the number of contributory causes.

Sec. 12. Homogeneity of Material.

To pass on to another point of terminology, what is to be understood when we speak of statistical material as "homogeneous" or otherwise?

All that we shall imply by the term is this, that the distribution of frequency arising from measurements on the given material is capable of adequate representation by one single theoretical frequency-curve. This definition is very loose, but it scarcely seems possible to go further. Two points about it should be noticed; first, that material may be homogeneous with regard to one measurement while non-homogeneous with regard to another: the measurement should always be specified if not evident from the context; secondly, that the material is not necessarily non-homogeneous because it can, by some method of sorting, be split up into two or more superposed distributions. Common sense or experience may have to be called in to tell us whether the representation of the material is adequate or no, or the theory of probability may have to be again employed to find whether the odds against divergences of fit are greater than seems permissible. The definition of homogeneity might appear more naturally based on the condition that an increase in the number of observations should tend to bring the statistical polygon closer and closer to some theoretical form. I have purposely excluded this condition. Useful though it might be in theory, or in a physical case where the number of observations could be easily increased, in economics it would frequently be, from the very nature of the case, simply impossible to apply. The material we are handling is often limited, and the very introduction of fresh data may create the non-homogeneity for which we are seeking a test.

1 E.g., loc. cit., p. 358.

These remarks and some of the preceding were suggested by some of Professor Edgeworth's criticisms in the Journal for September, 1895, p. 510.

Sec. 13. Test of Fit.

Professor Pearson has proposed in his paper a test for the closeness of fit of a theoretical curve to a statistical polygon, namely, the ratio

area between curve and polygon

area of whole curve or polygon,

the area between curve and polygon being all counted positive. The better the fit the smaller this ratio. The test is of course very rough; it lays the same stress on very probable divergences near the mode, and very improbable divergences in the tail of the curve, but it is convenient to have some such ready means of comparison. In a large series of normal curve fits containing some 1,000 observations each, the ratio ranged between 6 per cent. and 9 per cent. With skew-curves the results are generally better, averaging about 5 per cent. "misfit."
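With observed and fitted frequencies tabulated over cells of equal width, the two areas in Pearson's ratio reduce to sums of frequencies, so the test may be sketched as below. The frequencies are invented for illustration; the assumption of equal-width cells is this example's, not the paper's.

```python
def misfit_ratio(observed, fitted):
    """Pearson's rough test of fit: the area between curve and polygon
    (all counted positive) divided by the area of the whole polygon.
    Assumes equal-width cells, so areas reduce to sums of frequencies."""
    between = sum(abs(o - f) for o, f in zip(observed, fitted))
    whole = sum(observed)
    return between / whole

# Invented observed frequencies and fitted-curve ordinates:
obs = [2, 10, 25, 30, 20, 10, 3]
fit = [3, 11, 24, 29, 21, 9, 3]
print(round(misfit_ratio(obs, fit), 3))  # 0.06, i.e. 6 per cent. misfit
```

A ratio of 6 per cent. falls at the better end of the range quoted above for normal-curve fits; the test, as the text warns, weights probable and improbable divergences alike.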

We may now pass on to consider the special case of "pauperism" with the aid of such frequency-distributions and numerical measures as we have here very briefly described.

II.—On Pauperism in England and Wales.

Sec. 14. Material used.

The raw material on which the present study of pauperism is based is contained in the series of returns that generally go by the name of "Pell's Returns," viz., No. 214, 1876; No. 339, 1882; and No. 266, 1892. The return published in 1876 is retrospective, containing the figures for 1850, 1860, and 1870. The returns give,
