Another approach on elaborating frequency tables using Gauss's normal distribution theory
www.nucleodoconhecimento.com.br

The correct organization of data is of extreme importance for analysis in the process of statistical inference, which aims at drawing valid conclusions and general recommendations on a subject under study. The specialized scientific literature describes several procedures for organizing data in frequency distribution tables. However, inaccuracies are often made in the use of data organization tools. These inaccuracies have resulted in distorted analyses and incongruous conclusions. This paper presents a critical analysis of the methods and procedures of data organization described in the literature, specifically the distribution of data in frequency distribution tables, and proposes a different approach to the elaboration of class frequency distribution tables. For all variables with predefined categories, these categories and their respective reference values serve as the body of the table. For variables that do not have predefined categories, the use of Gauss's normal distribution theory as a criterion for categorization is proposed and justified. The categories generated serve as the body of the table. It is concluded that frequency distribution tables can be distinguished into two types, namely frequency distribution tables of predefined classes or categories and frequency distribution tables of classes or categories to be defined, and their adoption as a concept is suggested.


INTRODUCTION
Statistics, as a science, comprises a set of methods and procedures for the collection, organization and presentation of data for analysis, in order to obtain valid conclusions and draw up general recommendations on a given population under study.
Statistics finds application in the most diverse areas of knowledge, such as economics, psychology, sociology, agriculture and health, supporting decision-making based on data analysis. For statistics to fulfill this role, it is imperative that its methods and procedures are properly applied. Data collection involves the correct application of sampling methods, which culminate in the selection of representative samples and ensure the validity of the data and, at the same time, of the studies. The organization and presentation of data aim at the simplification and compression of data into frequency distribution tables, whether single frequency tables or class frequency tables. Data summarized in tables are generally, and where appropriate, visualized in graphs, diagrams and other statistical tools. Their correct organization is of extreme importance for data analysis in the process of statistical inference.
The specialized scientific literature describes a wide range of procedures for organizing data in frequency distribution tables. Inaccuracies are often committed in the use of data organization tools, resulting in distorted interpretations and incongruent conclusions. This article presents a critical analysis of the methods and procedures of data organization described in the literature, specifically the distribution of data in frequency tables, and proposes a different approach to the elaboration of frequency distribution tables.

THEORETICAL FRAMEWORK ON FREQUENCY DISTRIBUTION TABLES
Statistics is divided into two main branches: descriptive statistics and inferential statistics.
Descriptive statistics is the part that deals with the description, classification, organization and presentation of the data of a variable. It allows data to be summarized and helps to describe the attributes of a given data group or population by calculating descriptive measures such as the mean and the standard deviation. Tabular and graphic descriptive techniques, supported by the graphical capabilities of modern computers and the various software packages available, make this type of summary more feasible and more understandable.
Inferential statistics, on the other hand, is responsible for analyzing data in order to obtain valid conclusions and the elaboration of generic recommendations about the population being studied.

TABULAR AND GRAPHIC DESCRIPTIVE TECHNIQUES
The raw data of a variable obtained in a sampling process do not allow any characteristic of the sample, let alone of the population under study, to be visualized. Therefore, tables and graphs are essential statistical resources in data processing. The class frequency distribution table, like the single frequency distribution table, contains two columns: the first for the class intervals of the variable and the second for the number of individuals belonging to each class, that is, the absolute frequency of the class.
Single frequency distribution tables have the major disadvantage of being too long, because all sampled values have to be listed, and, above all, they do not allow the characteristics of the data to be read. This type of table is entirely inappropriate when the purpose is to summarize large amounts of data. For these cases, class frequency distribution tables are more suitable, as they compact large amounts of data into a few classes and link the quantitative attributes of the variable to the actual meaning of its characteristics.
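The contrast between the two kinds of table can be illustrated with a minimal Python sketch. The ages and the class limits below are hypothetical values invented for illustration, and the function names are not taken from the article; classes are assumed to be closed on the left and open on the right.

```python
from collections import Counter


def single_frequency_table(values):
    """Return sorted (value, absolute frequency) rows, one row per distinct value."""
    return sorted(Counter(values).items())


def class_frequency_table(values, limits):
    """Return ((lower, upper), absolute frequency) rows.

    `limits` are ascending class edges; each class is [lower, upper),
    which is an assumption made for this sketch.
    """
    rows = []
    for lower, upper in zip(limits, limits[1:]):
        count = sum(1 for v in values if lower <= v < upper)
        rows.append(((lower, upper), count))
    return rows


# Hypothetical ages, invented for illustration only.
ages = [17, 18, 18, 19, 19, 19, 20, 21, 21, 22, 23, 25, 27, 30, 34]

single = single_frequency_table(ages)                         # one row per distinct age
classes = class_frequency_table(ages, [15, 20, 25, 30, 35])   # four rows in total

for row in classes:
    print(row)
```

Even with only 15 observations, the single frequency table needs one row for every distinct age, while the class table stays at four rows regardless of how many observations are added.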
The fundamental aspect in the elaboration of class frequency distribution tables is the determination of the number of classes into which the data will be grouped. For this purpose, several criteria are described in the literature, most of which are strictly mathematical. Milton and Tsokos (1991) indicate the use of the following formulas to determine the number of classes:
2 - Number of classes = [formula missing]
Beiguelman (2002) indicates that the number of classes is between 8 and 20, depending on the data. For Dawson and Trapp (2003), between 6 and 14 classes are adequate to provide sufficient information without excessive detail. Triola and Triola (2006) recommend that the number of classes should be between 5 and 20, while Kuzma and Bohnenblust (2001) consider that it should be between 5 and 15. Reis (2005) states that the number of classes should be between 4 and 14 and also suggests the use of one of the following solutions:
1 - Number of classes equal to 5 for samples of size less than 25, and number of classes = [formula missing] for samples of size equal to or greater than 25.
2 - Use the Sturges formula: number of classes = 1 + 3.322 log10 n
The above shows the variety of criteria described in the literature for the construction of frequency distribution tables. However, these criteria share a disadvantage that makes them unsuitable as a basis for determining the number of classes for certain variables, such as those in biology and the health sciences, economics and industry, among others, which have predefined, that is, standardized categories. Some of these variables are described below, together with the respective reference values of their predefined categories.
In the field of biology:
Example 1: Reference values of total cholesterol in humans
The BMI helps to define the degree of obesity of a person according to the World Health Organization. By calculating and interpreting the BMI, it is possible to know whether a person is above or below the weight parameters recommended for their physical structure. To calculate the BMI, the weight measured in kg is divided by the square of the height measured in m (BMI = kg/m²).
Example 3: Classification of arterial hypertension.
Reference values are in general 40 to 60 mg/L for men, 30 to 50 mg/L for women and 25 to 40 mg/L for children (CAQUET, 2011).
In the field of economics:
When a variable does not have predefined categories, and the search for clues or indicators that could help to create them by non-mathematical criteria has been exhausted, the use of Gauss's normal distribution theory is suggested here for categorizing the variable. The aim is to create categories that link the quantitative to the qualitative and make the interpretation of the data clearer.
The Gauss normal distribution theory, widely referenced in the literature (RICE and SCOTT, 2005), states that for a set of normal (symmetric, unimodal) data, approximately 68% of the data lie within one standard deviation of the mean, approximately 95% within two standard deviations of the mean and approximately 99.7% within three standard deviations of the mean.
Thus, the 68% range of variation, that is, the interval of one standard deviation on either side of the mean, is established as the criterion for defining categories based on the theory of normal distribution. This categorization is based on the calculation of a measure of central tendency, the mean, and a measure of variability, the standard deviation. Values within one standard deviation of the mean are considered close to the mean, so the category designation "within the mean" or "around the mean" is suggested. Values outside this range are considered distant from the mean, and the designations "above the mean" and "below the mean" are suggested, respectively. In this way, categories are obtained that give meaning to the values of the variable and make the interpretation of the data clearer.
However, the proposed methodology must always be applied with attention to the specific conditions of each case. Three classes are always obtained and, if the data are in fact drawn from a normal population, the second class will have a relative frequency of around 68% and the other two of around 16% each, which may not be desirable in some cases. Moreover, if the data contain some or many atypical observations, or if the underlying distribution is skewed or multimodal, the method will not produce the desired results. In any case, it is a criterion that, when its application is carefully considered, can help to improve the interpretation of the data and thus the understanding of a given phenomenon.
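The categorization criterion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample standard deviation is used, values falling exactly on a limit are counted as "around the mean", the function names are hypothetical, and the data are invented. The second function evaluates the theoretical share of a normal population within k standard deviations of the mean.

```python
import math
import statistics


def gauss_categories(values):
    """Split values into three categories using mean +/- one standard deviation.

    Assumptions of this sketch: sample standard deviation (n - 1 denominator),
    and values exactly on a limit are counted as "around the mean".
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    lower, upper = mean - sd, mean + sd
    counts = {"below the mean": 0, "around the mean": 0, "above the mean": 0}
    for v in values:
        if v < lower:
            counts["below the mean"] += 1
        elif v > upper:
            counts["above the mean"] += 1
        else:
            counts["around the mean"] += 1
    return (lower, upper), counts


def normal_share_within(k=1.0):
    """Theoretical P(|Z| <= k) for a standard normal variable."""
    return math.erf(k / math.sqrt(2))


# Hypothetical values, invented for illustration only (mean is exactly 22).
sample = [16, 19, 21, 22, 22, 23, 25, 28]
(lower, upper), counts = gauss_categories(sample)

central = normal_share_within(1.0)   # share expected "around the mean"
tail = (1.0 - central) / 2.0         # share expected in each outer category
print(counts, round(central, 4), round(tail, 4))
```

With a mean of 22 years and a standard deviation of 4 years, the values the article reports for its sub-sample of 182 candidates, the same rule gives limits of 18 and 26 years.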

SIMPLE FREQUENCY DISTRIBUTION
A simple frequency distribution table constructed from age data is depicted below.
Source: the author
Table 6, a simple frequency distribution, shows the inadequacy of this type of table for summarizing large amounts of data. The table is too long vertically and, above all, does not allow the characteristics of the data to be read.

FREQUENCY DISTRIBUTION TABLE OF CLASSES TO BE DEFINED
Age is a variable that, for purposes of demographic analysis, has predefined categories. However, for other purposes, this definition of categories may not be adequate, as is the case when analyzing the age of the candidates for the university entrance exam. In this particular case, the aim is to evaluate the variability of the candidates' ages and to identify the oldest and the youngest candidate. It is important to know in which intervals the ages of most candidates fall in order to draw conclusions about the profile of students who finish secondary school and are therefore potential candidates for higher education.
For demonstration purposes, a sub-sample of 182 candidates was randomly drawn from the global sample to show how the variable has to be categorized before the frequency distribution is constructed.

Source: Author
For these 182 observations, the mean is 22 years and the standard deviation is 4 years. Using Gauss's normal distribution theory to define the categories, as described in section 2.1, results in the following categorization of the variable and its respective frequency distribution:

Once the class range is determined, the limits of each of the 6 categories can finally be calculated. The lower limit of the 1st category is the minimum value, and the upper limit of this category is obtained by adding the class range to it. The same is done for the other categories, and the following result is obtained:

Source: Author
The first finding is that the calculated categories do not have their own designation, that is, they do not have attributes that qualify the condition of the individual. The second finding is that, despite the same number of categories, the limits of the calculated categories do not coincide with those of the predefined categories. By the calculated categories, individuals with a BMI between 14.9 and 21.7 would be mistaken for having normal weight, when in fact some of them are in the "underweight" condition.
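The mismatch described above can be reproduced with a short Python sketch. The minimum of 14.9 and the class range of 6.8 are derived from the first calculated category quoted in the text (14.9 to 21.7). The WHO cutoffs are the commonly published adult BMI categories and are an assumption here, since they are not reproduced in the text. The BMI values tested and the function names are hypothetical.

```python
def equal_width_limits(minimum, class_range, n_classes):
    """Class limits built by starting at the minimum and repeatedly adding the class range."""
    return [round(minimum + i * class_range, 1) for i in range(n_classes + 1)]


# Commonly published WHO adult BMI categories as (label, lower limit in kg/m^2).
# Assumption: these cutoffs are not taken from the article itself.
WHO_BMI = [
    ("underweight", 0.0),
    ("normal weight", 18.5),
    ("overweight", 25.0),
    ("obesity class I", 30.0),
    ("obesity class II", 35.0),
    ("obesity class III", 40.0),
]


def who_category(bmi):
    """Return the label of the last category whose lower limit does not exceed bmi."""
    label = WHO_BMI[0][0]
    for name, lower in WHO_BMI:
        if bmi >= lower:
            label = name
    return label


limits = equal_width_limits(14.9, 6.8, 6)
first_class = (limits[0], limits[1])

# Hypothetical BMI values that all fall inside the first calculated class.
inside_first_class = [15.0, 18.0, 19.0, 21.0]
labels = sorted({who_category(b) for b in inside_first_class})
print(first_class, labels)
```

All four hypothetical values fall into the same calculated class, yet the predefined categories separate them into "underweight" and "normal weight", which is the loss of meaning the text points out.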

FINAL CONSIDERATIONS
Frequency distribution tables are very useful tools in data processing. Their construction must, however, comply with certain scientifically valid criteria. It is essential to ensure that, irrespective of the criterion adopted, the interpretative reading of the data leads to logical and understandable conclusions about the behavior of the variable. In this regard, the link between quantitative and qualitative attributes is crucial. Thus, the suggested criterion, based on the use of Gauss's Normal Distribution Theory for the categorization of variables, is quite feasible.
In view of all the above, there are sufficient grounds to differentiate class frequency distribution tables into two types, namely frequency tables of predefined classes or categories and frequency tables of classes or categories to be defined, and their adoption as a concept is therefore suggested.