Inferential statistics is the use of sample statistics to make inferences about the population parameters. A parameter is a characteristic of the population.

Suppose you sampled 100 records out of the entire data set or population. Assuming that the sample of 100 records is representative of the total population, you can take the results from those 100 records and infer the results over the entire population or data set.

Let us say that the 100 records were from the sales data set of a retail store, and you wish to pull and examine the sales invoices to determine whether each customer was female or male. This information would help the store better target its advertising. If the results were that there were 35 sales to males and 65 sales to females, you would infer that females made about 65 percent of all purchases, as you cannot review every sales invoice. In this case, the statistic is the 65 percent observed in the sample, and the parameter is the actual percentage of all purchases in the population that were made by females, which the statistic estimates.
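The retail example can be sketched in a few lines of Python. This is only an illustration: the list of "F" and "M" codes is a hypothetical stand-in for the 100 sampled invoices, built from the counts given above, and the variable names are assumptions rather than anything from the store's data.

```python
from collections import Counter

# Hypothetical sample: 100 invoices coded by customer gender,
# using the counts from the text (65 female, 35 male).
sample = ["F"] * 65 + ["M"] * 35

counts = Counter(sample)          # tally each category
n = len(sample)                   # sample size (lowercase n)
female_share = counts["F"] / n    # the sample statistic

print(f"Sample size: {n}")
print(f"Share of sampled sales to females: {female_share:.0%}")
```

The printed share is the sample statistic. The population figure it estimates stays unknown unless every invoice is reviewed.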

With data analytic software, you can perform calculations on the entire data set or population. Therefore, we will focus on descriptive statistics whether the data fields are categorical or numerical. Inferential statistics are discussed more in the sampling segment of this chapter.

MEASURES OF CENTER

In a data set, you want to know where the middle of the data is and what the typical or frequent value is. The most common way to summarize numerical data is to determine its center by describing the mean or average and the median.

The mean is merely a term for the average of all the numbers. In terms of the data set, it is the total of all the numbers in a particular field divided by the number of records. The mean amount may not even appear in a transaction in the data set, as it is a calculated amount. For a PAID_AMOUNT field of 1,000 records, where the total of those records is $250,000, the mean (or average) is $250. IDEA's field statistics show averages for each numeric field. When you summarize a file in IDEA, you may also choose to output the average for each key or group in the newly summarized file.

You need to take care when considering the mean. It is very sensitive to extremes or outliers. A few very large or very small amounts may make the mean unrepresentative of the data. If the previous example included a single transaction of $200,000 and you excluded this outlier, the mean of the remaining 999 records would be about $50 rather than $250.
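The effect of a single outlier on the mean can be checked with a short Python sketch. The individual amounts below are invented purely so that the totals match the PAID_AMOUNT example (1,000 records totaling $250,000, one of which is $200,000). They are not real data.

```python
from statistics import mean

# Hypothetical PAID_AMOUNT values chosen to match the totals in the text:
# 999 ordinary payments totaling $50,000 plus one $200,000 outlier.
ordinary = [50.0] * 998 + [100.0]
outlier = 200_000.0
amounts = ordinary + [outlier]

mean_all = mean(amounts)        # total of the field / number of records
mean_without = mean(ordinary)   # same calculation with the outlier excluded

print(f"Records: {len(amounts)}, total: {sum(amounts):,.2f}")
print(f"Mean with outlier:    {mean_all:,.2f}")
print(f"Mean without outlier: {mean_without:,.2f}")
```

One record out of 1,000 moves the mean from roughly $50 to $250, which is why the mean alone can misrepresent the typical amount.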

The median measurement is not sensitive to outliers. The median is the midpoint in the distribution of the data. It is the point that divides the distribution into two, with one half being equal to or less than the median and the other half being equal to or greater than the median. Again, there may not be an actual record amount that corresponds to the median amount. It is merely a positional value.

Data must be indexed or sorted, in either descending or ascending order, before you can successfully apply the formula. Once the data is ordered, determine whether there is an odd or even number of records.

If the data contains an odd number of records, then the median is the one exactly in the middle of the ordered records.

In the example of numbers 1, 2, 4, 5, and 5, the middle or median position number is the third number, which has a value of 4.

If the data contains an even number of records, the median is the average of the two numbers appearing in the middle.

In the example of numbers 1, 2, 4, 7, 8, and 8, the two middle numbers are 4 and 7. The median position is between 4 and 7, which is the 3.5 spot. By adding those two middle numbers and dividing by 2, the result of 5.5 is the median value.

To calculate the position of the median, you may use this formula:

Median position = (N + 1)/2

The letter N (in uppercase) represents the number of records in the field or population. A lowercase letter n represents the number of cases in a sample. This is used when the median position is needed in a sample.

In applying the formula for the odd number of the five records above, the median position calculation is (5 + 1)/2 = 3 with a value of 4 in that position.

For the even number example of six records, the median position calculation is (6 + 1)/2 = 3.5, with a value of 5.5 in that position.
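The position formula and the odd and even cases can be combined into a small Python sketch. The function names median_position and median_value are hypothetical, chosen for this illustration. In practice Python's built-in statistics.median gives the same result.

```python
def median_position(n):
    """Return the 1-based position of the median among n ordered records."""
    return (n + 1) / 2


def median_value(values):
    """Return the median of a list of numbers using the position formula."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)            # data must be ordered first
    n = len(ordered)
    pos = median_position(n)
    if n % 2 == 1:
        # Odd count: the median is the record exactly in the middle.
        return ordered[int(pos) - 1]    # convert 1-based position to 0-based index
    # Even count: the position falls between two records, so average them.
    lower = ordered[int(pos) - 1]
    upper = ordered[int(pos)]
    return (lower + upper) / 2


odd_example = [1, 2, 4, 5, 5]
even_example = [1, 2, 4, 7, 8, 8]

print(median_position(len(odd_example)), median_value(odd_example))
print(median_position(len(even_example)), median_value(even_example))
```

For the two examples above, this reports position 3 with a value of 4, and position 3.5 with a value of 5.5.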

When the mean and median values are far from each other, it is good to be aware of both values. A large gap between them suggests that the data is skewed or contains outliers that need to be examined.
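As a rough illustration, the comparison of mean and median can be done in Python with the standard statistics module. The six amounts are hypothetical and were chosen only to make the gap obvious.

```python
from statistics import mean, median

# Hypothetical amounts: five ordinary values and one extreme value.
sample_amounts = [40, 45, 50, 55, 60, 200_000]

avg = mean(sample_amounts)
mid = median(sample_amounts)

print(f"Mean:   {avg:,.2f}")
print(f"Median: {mid:,.2f}")
print(f"Gap:    {avg - mid:,.2f}")
```

Here the median stays near the ordinary amounts while the mean is pulled far above them, so the gap between the two is a prompt to look at the largest records.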

Along with the mean and median, the mode is frequently mentioned as a measure of center. The mode refers to the most frequently occurring value in a distribution or data set. It is determined by counting the frequency of each result. In the discussion of Benford's Law in Chapter 5, it can be seen that the leading digit 1 for the first-digit test is the mode or the number that most frequently appears in data sets. While we will not be using the mode for any calculations, you should be aware of what it represents.
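Counting the frequency of each value is simple to sketch in Python. The data reuses the six numbers from the even-count median example, and the variable names are assumptions made for this illustration.

```python
from collections import Counter

# The six numbers from the even-count median example.
numbers = [1, 2, 4, 7, 8, 8]

frequencies = Counter(numbers)                      # count how often each value occurs
mode_value, mode_count = frequencies.most_common(1)[0]

print(f"Frequencies: {dict(frequencies)}")
print(f"Mode: {mode_value} (appears {mode_count} times)")
```

If several values tie for the highest count, most_common(1) reports only one of them, so a data set with ties needs a closer look at the full frequency table.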
