We were looking at the date of onset of the south-west monsoon over Kerala for 104 years (1901-2004) to find a suitable probability model for the onset date. The date varies from 11 May (in 1918) to 18 June (in 1972), with a mean close to 30 May. The plotted curve looks like a mixture of three distributions having modes at 1 June, 8 June and 17 May. It is not easy to fit the data by any probability model unless one takes a mixture of three suitable distributions.
Suppose we want to model the number of passengers on a particular United Airlines flight from New York to Los Angeles in September 2001. Quite understandably, the data for the last 20 days of that month will be very unusual and will never match those of the first 10 days. Which parametric curve would fit it?
Take another recent example. The number of hate crimes involving racial and religious discrimination in the 11 months after the Brexit referendum was 49,921, a 23% rise from 40,741 in the same period of the previous year. If we combine these two periods to frame a model, it might become very difficult to explain the data by any standard parametric model. What would be a good model in this situation?
Of course, the last two are extreme situations, but such events do occur in different contexts. The monsoon onset date, by contrast, is a simple and standard event. Yet a parametric model would fail disastrously in many such real situations. Practitioners often struggle to choose between parametric and nonparametric statistical methods for modelling and decision making. Nonparametric procedures are already used in many basic statistical applications: a ‘histogram’, for example, is nothing but a nonparametric way of representing a distribution. Let’s discuss the important issue of ‘parametric versus nonparametric’ in the context of the simplest yet most daunting task of any type of statistical analysis, namely ‘model fitting’.
The idea of model fitting is to find a suitable statistical model (graph or curve) to explain the data, keeping the nature of the data in mind. The objective is to find the curve which represents the data in the best possible manner. In a parametric setup, the model or curve usually has some inherent features from which one cannot deviate. For example, if a Gaussian (or normal) curve is fitted, it will be symmetric about a central value, will have a peak at the centre, and will taper off at both ends with very light tails. On the other hand, if we opt for an exponential curve, the curve will start at zero, where it has its peak, and will decay steadily towards the right tail. It will be a skewed curve. Usually any such model has a few parameters, which are unknown constants and dictate the nature of the curve to some extent without disturbing the inherent features of the underlying curve. For example, the Gaussian model has two parameters: one fixes the central value, and the other dictates the spread of the curve. The exponential curve has only one parameter, which determines both the central value and the variability.
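A minimal Python sketch of the two curves mentioned above, evaluated at a few points. The function names and the parameter values (`mu`, `sigma`, `rate`) are illustrative assumptions, not values taken from any dataset in this article.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian curve with central value mu and spread sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def exponential_pdf(x, rate):
    """Density of an exponential curve with the given rate (mean = 1 / rate)."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

# Illustrative parameter values (assumed, not estimated from data).
mu, sigma, rate = 5.0, 2.0, 0.5

# The Gaussian density is the same at equal distances on either side of mu,
# and it is highest at mu itself.
gaussian_values = [gaussian_pdf(x, mu, sigma) for x in (3.0, 5.0, 7.0)]

# The exponential density is largest at x = 0 and only decreases from there.
exponential_values = [exponential_pdf(x, rate) for x in (0.0, 1.0, 2.0, 4.0)]
```

Changing `mu` only slides the Gaussian curve along the axis and changing `sigma` only widens or narrows it; neither can make it skewed. Likewise, no choice of `rate` can make the exponential curve symmetric.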
Unknown parameters can be (and should be) estimated from the data. Our objective is to find the unknown parameters of the model in such a way that the curve is stretched or shrunk suitably, horizontally or vertically or both, and is dragged into place over the data points so that the observed data points are as close as possible to the fitted curve. We can do that in an ad-hoc manner, by hand, simply by eye. However, instead of such ad-hoc fitting we often use some statistical methodology, in which we usually minimize some sort of distance between the data points and the fitted curve.
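As a rough illustration of fitting by minimizing a distance, the sketch below chooses the single parameter of an exponential model by a simple grid search. The distance used here (the sum of squared gaps between the empirical and the model cumulative curves), the grid range and the demo data are all assumptions made for this example; in practice other criteria, such as maximum likelihood, are at least as common.

```python
import math

def ecdf_distance(rate, data):
    """Sum of squared gaps between the empirical CDF and the exponential CDF."""
    xs = sorted(data)
    n = len(xs)
    total = 0.0
    for i, x in enumerate(xs, start=1):
        empirical = (i - 0.5) / n            # plotting position of the i-th smallest value
        model = 1.0 - math.exp(-rate * x)    # exponential CDF at the same point
        total += (empirical - model) ** 2
    return total

def fit_exponential(data, grid_size=2000):
    """Return the rate on a grid that brings the model curve closest to the data."""
    mean = sum(data) / len(data)
    # Search rates between 0.1 / mean and 10 / mean (an assumed, generous range).
    candidates = [(0.1 + 9.9 * k / grid_size) / mean for k in range(grid_size + 1)]
    return min(candidates, key=lambda r: ecdf_distance(r, data))

# Hypothetical demo data: exact quantiles of an exponential curve with rate 0.5.
demo = [-math.log(1.0 - (i - 0.5) / 50) / 0.5 for i in range(1, 51)]
fitted_rate = fit_exponential(demo)
```

The search only ever moves along the one-parameter family of exponential curves. If the data were bimodal or symmetric, the best rate on the grid would still describe an exponential shape.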
However, in a parametric method, we cannot deviate from the inherent nature of the model. Even if the nature of the data is different, say the data are highly skewed or asymmetric, the model cannot capture such features once we decide to fit a symmetric model like the Gaussian to the data. Of course, there are several models that capture asymmetry, skewness or other features. But once we choose any of them, we are bound to accept the inherent (and in-built) features of that model. All we can do is find the few (one, two, three or so) parameters of the model, to fix (estimate) the features in a data-driven way.
Real life is full of surprises. Quite often real life is much more complicated than an explanation by a few parameters allows. Data from real life can have multiple modes. Quite often the data, when plotted, are seen to be asymmetric, possibly with high skewness. The two tails of the data curve can have different thicknesses. Moreover, real life is always afflicted by unforeseen dramas, which are impossible to predict. The so-called ‘Black Swan’ events and the immediate carry-over effects of such an event might make the data plot really abrupt. Quite often a parametric model, for which the nature of the curve is preset except for some parameters, would fail to capture this. The choice of parameters in a parametric model can help only to shrink, magnify and drag the curve over the data plot. Multi-modality, skewness, abrupt tail behaviour, and sudden and dramatic change points will usually all be beyond the capacity of a parametric model.
We need a nonparametric model for such a purpose. The curve will be free-flowing, without many preset constraints on its shape, and will take a data-driven shape: it makes fewer and much less stringent assumptions than its parametric counterparts. In the nonparametric method, we estimate the curve directly from the data. Characteristics such as the ranks of the observations and the differences between successive observations in the ordered dataset might provide quality information about the model and its fit.
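One common way of letting the data shape the curve is a kernel density estimate, which places a small bump on every observation and adds the bumps up. The article does not name this particular estimator, so the sketch below is only one possible illustration; the sample values and the bandwidth of 1.5 are hypothetical choices.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a density function built from one Gaussian bump per observation."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

# Hypothetical sample with two clusters (not real monsoon data).
sample = [16, 17, 17, 18, 19, 30, 31, 32, 32, 33, 34]
kde = gaussian_kde(sample, bandwidth=1.5)

# The estimated curve is high near each cluster and low in the gap between them,
# a two-peaked shape that no single Gaussian curve can take.
peak_left, valley, peak_right = kde(17.5), kde(25.0), kde(32.0)
```

The bandwidth still has to be chosen, and it controls how smooth the curve looks, but it does not impose symmetry or a single peak on the result.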
Also, a parametric model is very sensitive to outliers. For example, suppose we are about to fit a parametric model to a dataset of size 100. Suppose 99 observations fall within 0 to 10, and one particular observation is 1 billion (due to erroneous recording), which is clearly an outlier. A parametric model will invariably fit an absolutely wrong curve in order to accommodate the ‘outlier’. A nonparametric model depending only on the ranks of the observations, however, will provide a reasonably good fit. Similarly, if the true relationship in a regression analysis is nonlinear, and typically unknown, a nonparametric regression model with fewer assumptions will provide a robust solution. Even so, assumptions like randomness and independence of the data are often needed for carrying out nonparametric methods.
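The outlier example above can be reproduced in a few lines. The 99 evenly spaced values are a hypothetical stand-in for observations between 0 and 10. The mean (which a Gaussian fit would use as its central value) is compared with the median, which depends only on the ordering of the observations.

```python
import statistics

# 99 evenly spaced values between 0 and 10 (hypothetical), plus one
# erroneous recording of 1 billion.
clean = [10 * i / 98 for i in range(99)]
data = clean + [1e9]

# The mean is dragged far outside the range of the 99 genuine observations.
mean_clean = statistics.mean(clean)
mean_all = statistics.mean(data)

# The median barely moves, because only the rank of the outlier matters,
# not its size.
median_clean = statistics.median(clean)
median_all = statistics.median(data)
```

Replacing 1 billion by 1 trillion would change `mean_all` a thousandfold and leave `median_all` exactly as it is.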
Decision-making procedures in a nonparametric setup would use ranks and scores. If we are quite certain about a valid parametric model (say Gaussian, gamma or exponential) for a dataset, parametric modelling or parametric inference would be the better choice, as it will make the procedure more efficient. However, if the analyst is uncertain about the underlying model, it is always safer to use a nonparametric procedure. Although the efficiency might be a bit lower than under the ‘true’ (but unknown) parametric model, the procedure is much more robust, in the sense that there is no possibility of the disaster which might otherwise arise from a wrong model specification in the parametric scenario.
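The trade-off between efficiency and robustness can be seen in a small simulation. This is only a sketch under assumed settings: samples of size 25, 2000 replications, a fixed random seed, and a hypothetical ‘contaminated’ scenario in which 5% of the observations come from a much wider Gaussian. The sample mean plays the role of the parametric (Gaussian) estimate of the centre, and the sample median the rank-based one.

```python
import random
import statistics

def sampling_variance(estimator, draw, n=25, replications=2000, seed=1):
    """Variance of an estimator over repeated samples of size n."""
    rng = random.Random(seed)
    estimates = [
        estimator([draw(rng) for _ in range(n)]) for _ in range(replications)
    ]
    return statistics.pvariance(estimates)

def gaussian_draw(rng):
    # The Gaussian model is exactly right in this scenario.
    return rng.gauss(0.0, 1.0)

def contaminated_draw(rng):
    # Hypothetical misspecification: 5% of values come from a far wider curve.
    scale = 50.0 if rng.random() < 0.05 else 1.0
    return rng.gauss(0.0, scale)

# When the model is right, the mean is the more efficient estimate of the centre.
var_mean = sampling_variance(statistics.fmean, gaussian_draw)
var_median = sampling_variance(statistics.median, gaussian_draw)

# When the model is wrong, the mean becomes erratic while the median stays stable.
var_mean_bad = sampling_variance(statistics.fmean, contaminated_draw)
var_median_bad = sampling_variance(statistics.median, contaminated_draw)
```

Under the correct Gaussian model the median should come out somewhat more variable than the mean, which is the modest price of not assuming the model. Under contamination the ordering should reverse, and by a much wider margin.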
Certainly, real life is quite often too difficult to be explained by only a few parameters; the world seems predominantly more suitable for nonparametrics.
Writers: Atanu Biswas and Bimal Roy
(Professors at the Indian Statistical Institute, Kolkata)