Constructing an Algorithm for Selecting the Number of Histogram Bins in Statistical Hypothesis Testing for Normal Distribution of Sample Data

Authors: Ivelina Zlateva; Nikola Nikolov; Mariela Alexandrova; Violin Raykov
DIN: IJOER-OCT-2018-6
Abstract

In practice, extensive use is made of a wide range of assumptions and conjectures regarding the type of frequency distribution in statistical samples; deviations from these assumptions can significantly affect the quality of the model and the estimation accuracy of its parameters. Regrettably, a reliable and clearly defined criterion for their permissibility is absent.

For instance, the fish stock assessment procedure is initially based on the assumption that the frequencies in the length-frequency samples used for estimation of fish growth parameters and analysis of the stock status are normally distributed or approximately follow the normal distribution [15,17].

The purpose of the present study is to construct an algorithm for identifying the statistical distribution of a random variable, focusing on the proper selection of the number of histogram bins and on the assessment of its impact on the resulting stochastic models. To that end, simulation studies have been carried out to compensate for the lack of concrete evidence on how the number of histogram bins and the overall data accuracy affect the results of applying a statistical criterion for verifying the law of distribution. The direct statistical method for determining the law of distribution, the chi-square criterion, has been applied along with some indirect methods. Machine-generated data sets were used for the simulation studies, and the simulations were carried out in the MATLAB programming environment.

Keywords
histogram bins, length-frequency samples, normal distribution, stochastic modeling, stock assessment.
Introduction

Exploring the distribution law of a random variable is the first fundamental step a researcher takes towards obtaining specific, targeted information about the object of study. Analyzing experimental data and displaying it graphically in a histogram offers better insight into the pattern of statistical regularity, which in turn helps in drawing the relevant inferences about the events and processes under study. The information thus obtained is nevertheless often insufficient and requires further refinement through more rigorous methods that improve objectivity and decision quality.

Indeed, thorough knowledge of the distribution law, along with its underlying parameters, makes it possible for the parameters of the object under exploration to be modeled with sufficient accuracy and validated as unbiased estimates of the general population (herein, a class of biological objects), and thus provides an effective means of solving various prediction problems.

In probability theory and mathematical statistics, the normal (Gaussian) distribution is a continuous distribution that gives a good approximate description of samples whose values are tightly grouped around the mean and distributed symmetrically, forming a bell-shaped density curve.

It is also widely applicable to the mathematical description of real-world phenomena and processes. This is ascribed to the central limit theorem, according to which the sum of a large number of independent random variables with arbitrary laws of distribution is, under mild conditions, approximately normally distributed.
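The statement of the central limit theorem can be illustrated numerically. The following is a minimal sketch in Python, using only the standard library, not code from the study: the choice of uniform summands, the number of terms (30), the number of repetitions (20000), and the names `standardized_sum` and `within_one_sigma` are all assumptions made for illustration.

```python
import random
import statistics

def standardized_sum(n_terms, rng):
    """Sum n_terms Uniform(0, 1) draws and rescale to zero mean and unit variance."""
    total = sum(rng.random() for _ in range(n_terms))
    mean = n_terms * 0.5                 # mean of a sum of n Uniform(0, 1) variables
    std = (n_terms / 12.0) ** 0.5        # each Uniform(0, 1) term has variance 1/12
    return (total - mean) / std

rng = random.Random(0)                   # fixed seed (assumption) for repeatability
sums = [standardized_sum(30, rng) for _ in range(20000)]

# Share of standardized sums falling within one standard deviation of zero.
# For a standard normal variable this share is about 0.683.
within_one_sigma = sum(abs(s) <= 1.0 for s in sums) / len(sums)
print(f"mean = {statistics.fmean(sums):.3f}, share within 1 sigma = {within_one_sigma:.3f}")
```

Although each summand is uniformly distributed, the share of standardized sums within one standard deviation should come out close to the normal value.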

Conclusion

Through the adoption of an experimental and statistical approach, a passive experiment was carried out to collect relevant information about the growth parameters of biological objects (BO) on the Bulgarian Black Sea coast in the area of “Trakata” in the vicinity of the town of Varna. The aim is to determine the law of statistical distribution of the length of the two types of BO. The lack of specific information on how the accuracy of the data affects the results of applying a statistical criterion for validating the law of distribution necessitated additional simulation studies. Accordingly, further studies were conducted to clarify the number of bin intervals into which the sample is split when forming the empirical distribution, which directly affects the selection of the theoretical law of distribution. A direct approach, the chi-square criterion, is used to determine the law of distribution in combination with indirect methods. The simulations employed computer-generated, normally distributed data produced with the MATLAB function randn.

The following primary conclusions have been reached:

1) In the study of the distribution law, combining the direct method (chi-square), the recommendation concerning intervals with n·pᵢ < 5, and indirect methods improves the quality of the end solution. The considerable computational work involved in combining them does not pose a problem for present-day computer technology.

2) The number of bin intervals k into which the data used to construct the histogram are split has a profound effect on the results obtained in determining the law of the random variable distribution. Selecting only one specific value of k is found to be insufficient for reaching a reasonable conclusion. Computer equipment with appropriate software makes it possible to include multiple values in the study and thus reach a more informed decision.

3) The use of k from 5 to 13 is considered sufficient to reveal the stochastic regularity. With substantial noise contamination of the data, the smaller values of k (5, 6) produce reliable results: the chi-square criterion is able to detect the normal distribution in spite of the reduced degrees of freedom.

4) With both uncontaminated and contaminated data, increasing k is likely to produce intervals with n·pᵢ < 5. This indicator increases as the number of contaminated intervals increases. The χ² criterion readily recognizes the normal distribution when intervals with n·pᵢ < 5 are present and, in both cases, after these intervals have been merged.

5) The recommended number of sample splitting intervals is k = 5–13, with n > 200. Modern computer technology makes it possible to explore the distribution of the data with multiple numbers of intervals, rather than with only one selected value of k, with the decision made afterwards. Information about the level of data contamination, when available, is very helpful.

6) The lengths of BO1 and BO2 follow the normal distribution. The accuracy of 0.1 cm with which the experimental data were obtained is seen as sufficient.

7) The obtained models of the probability distribution laws, with their underlying parameters, are viewed as adequate and can be used for solving both research and practical tasks.
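The idea of testing normality over a range of k rather than a single value, with intervals of expected count below 5 merged, can be sketched as follows. This is an illustrative Python outline using only the standard library and is not the authors' MATLAB procedure. The following are assumptions: equal-width bins between the sample minimum and maximum, outer bins extended to infinity when computing expected counts, left-to-right merging of adjacent bins, degrees of freedom taken as the number of merged bins minus 3 (two estimated parameters), a 5% significance level, the sample parameters (n = 500, mean 10, standard deviation 2), and all function names. `random.gauss` stands in for MATLAB's `randn`.

```python
import random
from statistics import NormalDist, fmean, stdev

# Upper 5% critical values of the chi-square distribution by degrees of freedom.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070,
                6: 12.592, 7: 14.067, 8: 15.507, 9: 16.919, 10: 18.307}

def merge_small_bins(observed, expected, min_expected=5.0):
    """Merge adjacent bins left to right until each has expected count >= min_expected."""
    merged_obs, merged_exp = [], []
    acc_o, acc_e = 0, 0.0
    for o, e in zip(observed, expected):
        acc_o += o
        acc_e += e
        if acc_e >= min_expected:
            merged_obs.append(acc_o)
            merged_exp.append(acc_e)
            acc_o, acc_e = 0, 0.0
    if acc_o > 0 or acc_e > 0:           # leftover tail that never reached the threshold
        if merged_exp:                   # fold it into the last complete group
            merged_obs[-1] += acc_o
            merged_exp[-1] += acc_e
        else:
            merged_obs.append(acc_o)
            merged_exp.append(acc_e)
    return merged_obs, merged_exp

def chi_square_normality(data, k):
    """Return (chi-square statistic, degrees of freedom) for k equal-width bins."""
    n = len(data)
    fitted = NormalDist(fmean(data), stdev(data))   # parameters estimated from the sample
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    observed = [0] * k
    for x in data:
        observed[min(int((x - lo) / width), k - 1)] += 1
    # Outer bins are extended to -inf / +inf so the expected counts sum to n.
    cdf = [0.0] + [fitted.cdf(lo + j * width) for j in range(1, k)] + [1.0]
    expected = [n * (cdf[j + 1] - cdf[j]) for j in range(k)]
    obs, exp = merge_small_bins(observed, expected)
    stat = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    return stat, len(obs) - 3            # bins - 1 - 2 estimated parameters

rng = random.Random(42)                  # fixed seed (assumption) for repeatability
sample = [rng.gauss(10.0, 2.0) for _ in range(500)]

results = {}
for k in range(5, 14):                   # k = 5 .. 13, the range discussed above
    stat, dof = chi_square_normality(sample, k)
    accepted = dof >= 1 and stat <= CHI2_CRIT_05[dof]
    results[k] = (stat, dof, accepted)
    print(f"k={k:2d}  chi2={stat:7.3f}  dof={dof}  normality not rejected: {accepted}")
```

Looking at the outcome for every k side by side, instead of a single value, is the point of the sketch; a decision rule over the nine results (for example, a majority vote) would be a further assumption and is deliberately left out.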
