Wednesday, March 13, 2019

MAT 209 - Statistics Chapter 3 - Statistical Descriptions

* The following notes contain mathematical equations that may not have converted correctly into HTML format. Apologies for the issue.

Chapter 3 Summarizing Data: Statistical Descriptions

Measures of location - Summarize a list of numbers by a "typical" value. The three most common measures of location are the mean, the median, and the mode.
            - Measures of central location = Similar to averages; they indicate the center of the data set.
            - Measures of variation = Spread/dispersion of data.
Parameters - Any numerical quantity that characterizes a given population or some aspect of it. This means the parameter tells us something about the whole population.

Measures of central location:
·      Mean = The “average” in the everyday sense: add up all the values and divide by the number of values. Often denoted with the symbol “x̄”. Sample mean of a set of data of size n → $\bar{x} = \frac{\sum x}{n}$, where the numerator is the sum of the x values in the sample and the denominator is the total number of values. The mean of a population is computed the same way, but uses the symbol “µ”.
This is a popular measure of a data set because of its pros:
§  Can be calculated for any numerical data set.
§  For any numerical data set, it is a unique, unambiguous value.
§  Can be further statistically treated.
§  If each value were replaced by the mean, the total of the values would not change.
§  Takes into account value of each item in the data set.
§  Relatively reliable.
            However, it has its own cons:
§  Affected by very small or very large values.
§  For grouped data, it cannot be determined when there is an open class.
·      Trimmed Mean = Same as the mean, but computed after the extreme values (for example, the upper and lower 5%) are removed.
·      Weighted Mean = The mean obtained when weights (measures of relative importance) are attached to the values. The weighted mean of a set of numbers x1, x2, x3, …, and xn, whose relative importance is expressed numerically by a corresponding set of numbers w1, w2, …, and wn, is given by the equation below.
$$\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum w \cdot x}{\sum w}$$
Numerator = sum of the products obtained by multiplying each x by its corresponding weight. Denominator = sum of the weights.

Example of weighted mean in action:
Three grains produced by farmers in the US in the year 2000 brought per-bushel prices of $2.65 (wheat), $1.85 (corn), and $1.05 (oats). Production in the billions of bushels was 2.22, 9.97, and 0.15, respectively. What is the overall average price that farmers received per bushel for the three grains?
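
The example above can be worked in a few lines of Python. This is only a sketch: the function name `weighted_mean` is my own, the prices play the role of the x values, and the production figures (in billions of bushels) play the role of the weights.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Per-bushel prices (x) and production in billions of bushels (w)
prices = [2.65, 1.85, 1.05]       # wheat, corn, oats
production = [2.22, 9.97, 0.15]   # wheat, corn, oats

average_price = weighted_mean(prices, production)
print(f"Overall average price: ${average_price:.2f} per bushel")
```

The weighted mean comes out to about $1.98 per bushel. The unweighted mean of the three prices would be $1.85, which ignores how much of each grain was actually produced.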

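·      Grand Mean = Overall mean of k sets of data having the means x̄1, x̄2, x̄3, …, and x̄k and consisting of n1, n2, n3, …, and nk measurements/observations. The formula has the same setup as the weighted mean, with the sample sizes acting as the weights: $\bar{\bar{x}} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \cdots + n_k \bar{x}_k}{n_1 + n_2 + \cdots + n_k} = \frac{\sum n \cdot \bar{x}}{\sum n}$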

·      Median = Value of the item that is in the middle, or the average of the two items that are nearest the middle, once the data are ordered. Often denoted with the symbol “x̃”. There is NO formula for the value of the median, but there is a formula for the median position → the median is the value of the $\frac{n+1}{2}$th item. For an odd number of values this position is a whole number and is taken literally; for an even number of values the median is the average of the two values on either side of that position.
Can be presented in the form of a box-and-whiskers plot/box plot. The box extends from Q1 to Q3 with whiskers extending to the smallest and largest value. Vertical line in the box is the median value.
Second most popular measure of a data set because of its pros:
§  Can be calculated for any numerical data set.
§  For any numerical data set, it is a unique value.
§  Simple to find once the data is ordered.
§  Not easily affected by extreme values.
§  Can be used to rank data by dividing it into fractiles such as quartiles:
            - 1st Quartile (Q1) = median of all values to the left of the median position for the whole set of data.
            - 2nd Quartile (Q2) = the median.
            - 3rd Quartile (Q3) = median of all values to the right of the median position for the whole set of data.
            Also has its own cons:
§  Ordering data can be tedious for large data sets.
§  The medians of several data sets CANNOT be combined into an overall median.
§  Generally not as reliable as the mean.
·      Mode = Value that occurs with the highest frequency.
Used as a measure of a data set because of its pros:
§  Requires no calculations.
§  Can be determined for qualitative + quantitative data.
            However, it has its own cons:
§  Poor measure of central location in statistical inference.
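
The three measures of central location can be compared side by side with Python's standard `statistics` module. The sketch below uses a made-up sample containing one very large value; `median_by_position` is a hypothetical helper written to mirror the (n + 1)/2 position rule, not a library function.

```python
import statistics


def median_by_position(values):
    """Median via the (n + 1) / 2 position rule applied to the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2 is a whole number, i.e. index n // 2
        return ordered[middle]
    # Even n: the position falls between two items, so average them
    return (ordered[middle - 1] + ordered[middle]) / 2


# Hypothetical sample with one very large value (41)
data = [3, 5, 5, 6, 8, 9, 41]

print("mean  :", statistics.mean(data))        # sum of values / number of values
print("median:", median_by_position(data))     # middle item of the sorted data
print("modes :", statistics.multimode(data))   # most frequent value(s)

# Dropping the extreme value moves the mean much more than the median
without_extreme = [x for x in data if x != 41]
print("mean without 41  :", statistics.mean(without_extreme))
print("median without 41:", median_by_position(without_extreme))
```

With this sample, removing the value 41 drops the mean from 11 to 6 while the median only moves from 6 to 5.5, which illustrates why the mean is described as sensitive to extreme values and the median is not.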

Measures of variation:
·      Range = Largest value minus the smallest.
Used as a measure of variability because of its pros:
§  Easy to calculate and understand.
§  Natural curiosity about the minimum and maximum.
            However, it has its own cons:
§  Does NOT tell anything about the dispersion of values between the extremes.
·      Standard deviation = A measure of the amount by which the individual values differ from the mean. The first step towards it is finding the deviations from the mean: the mean, µ, is subtracted from every value. Example: x1-µ, x2-µ, x3-µ, …, xn-µ.

§  Mean deviation = Add the absolute values of the deviations from the mean and divide by N, the number of elements: $\frac{\sum |x - \mu|}{N}$ → It raises theoretical difficulties in inference and is thus rarely used.

§  Population standard deviation (aka root-mean-square deviation) = Average the squared deviations from the mean and take the square root of the result. Represented by the equation below:
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
Where σ = measure of variation, N = # of elements, µ = mean
§  Population variance = σ², which removes the square root, giving the equation $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$.
§  Sample standard deviation is slightly different from the population standard deviation. For the sample standard deviation, represented by s, the sum of the squared deviations is divided by n - 1 instead of n, which makes the sample variance an unbiased estimate of the population variance.
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
Where s = sample measure of variation (sample standard deviation), x̄ = sample mean. In actual practice, this formula is rarely used.
§  Sample variance = s², as indicated by the formula $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$.
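§  For easier computation of the standard deviation, use the following formula:
$$s = \sqrt{\frac{s_{xx}}{n - 1}} \quad \text{where} \quad s_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$$
This formula allows us to get the standard deviation without first computing the mean.
sxx = Alternative calculation for the sum of the squared differences between each element and the mean, $\sum (x - \bar{x})^2$.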

Example: On 6 consecutive days, a tow truck operator received 9, 7, 11, 10, 13, and 7 calls. Calculate the sample standard deviation.

          x      x²
          9      81
          7      49
         11     121
         10     100
         13     169
          7      49
Total    57     569
Based on ∑x = 57 and ∑x² = 569, together with n = 6, we find that sxx = 569 – (57)²/6 = 27.50. Then, plug that sxx value into the larger equation for s:
$$s = \sqrt{\frac{27.50}{6 - 1}} = \sqrt{5.5} \approx 2.3$$

The same result is achieved with the sample standard deviation formula that uses the mean.
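
The agreement between the two routes can be checked numerically. The following sketch recomputes the example both ways, once from the deviations about the mean and once from ∑x and ∑x² only; the variable names are my own.

```python
import math

values = [9, 7, 11, 10, 13, 7]   # data from the example above
n = len(values)

# Route 1: deviations from the sample mean, divided by n - 1
mean = sum(values) / n
s_definition = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

# Route 2: only the sum of x and the sum of x squared are needed
sum_x = sum(values)
sum_x2 = sum(x * x for x in values)
s_xx = sum_x2 - sum_x ** 2 / n
s_shortcut = math.sqrt(s_xx / (n - 1))

print("sum of x:", sum_x, " sum of x squared:", sum_x2, " sxx:", s_xx)
print("s from deviations:", round(s_definition, 3))
print("s from shortcut  :", round(s_shortcut, 3))
```

Both routes give s ≈ 2.345, which rounds to the 2.3 found by hand.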

3.7 Some Applications for Standard Deviation

·      Chebyshev’s theorem: For any set of data (population or sample) and any constant k greater than 1, at least $1 - \frac{1}{k^2}$ of the data must lie within k standard deviations on either side of the mean. → Therefore, we can be sure that at least 1 – 1/2² = 1 – (1/4) = ¾ or 75% of the values in the data set must lie within 2 standard deviations on either side of the mean.
And at least 1 – 1/3² = 1 – (1/9) = 8/9 or about 89% of the values must lie within 3 standard deviations on either side of the mean.
***For most sets of data, the actual percentage of data lying between the limits is greater than specified by Chebyshev’s theorem.
·      For distributions having the general shape of the cross-section of a bell curve (empirical rule):
1.     About 68% of values within 1 standard deviation of the mean.
2.     About 95% of values within 2 standard deviations of the mean.
3.     About 99.7% of values within 3 standard deviations of the mean.
·      Converting data into standard units or z-scores: If x is a measurement belonging to a set of data having the mean x̄ (or µ) and standard deviation s (or σ), then its value in standard units, denoted by z, is $z = \frac{x - \bar{x}}{s}$ or $z = \frac{x - \mu}{\sigma}$.
*** Tells how many standard deviations a value lies above/below the mean.
·      Measure of relative variation (ex: coefficient of variation as shown below):
$V = \frac{s}{\bar{x}} \cdot 100\%$ or $V = \frac{\sigma}{\mu} \cdot 100\%$ → Expresses the standard deviation as a percentage of the mean.
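
The applications in this section can be tried on a small sample. The sketch below assumes the usual definitions, z = (x − mean) / standard deviation and coefficient of variation = (standard deviation / mean) · 100%, and reuses the six values from the standard deviation example; all function names are hypothetical.

```python
import statistics


def chebyshev_lower_bound(k):
    """Minimum fraction of data within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2


def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


def coefficient_of_variation(mean, std_dev):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100


data = [9, 7, 11, 10, 13, 7]      # sample from the earlier example
mean = statistics.mean(data)
s = statistics.stdev(data)        # sample standard deviation (n - 1 divisor)

k = 2
within = sum(1 for x in data if abs(x - mean) <= k * s) / len(data)
print(f"Chebyshev guarantees at least {chebyshev_lower_bound(k):.0%} within {k} s")
print(f"Observed fraction within {k} s: {within:.0%}")
print(f"z-score of 13: {z_score(13, mean, s):.2f}")
print(f"Coefficient of variation: {coefficient_of_variation(mean, s):.1f}%")
```

For this sample every value falls within 2 standard deviations of the mean, comfortably above the 75% that Chebyshev's theorem guarantees, and the value 13 lies about 1.49 standard deviations above the mean.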

3.8 Description of Grouped Data

In the past, grouping data was vital because it lessened the workload of calculation. Nowadays, with computers, it is possible to work with the raw data directly. This section focuses on how to work with grouped data. In the case of grouped data:
·      We assign each item falling into a class the value of the class mark.
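·      If number of values = n
sum of all measurements = ∑x·f
sum of their squares = ∑x²·f
(x = class mark, f = class frequency), then the formulas are $\bar{x} = \frac{\sum x \cdot f}{n}$ and $s = \sqrt{\frac{s_{xx}}{n - 1}}$, where $s_{xx} = \sum x^2 \cdot f - \frac{(\sum x \cdot f)^2}{n}$.
·      For a population, replace n with N: the population mean is $\mu = \frac{\sum x \cdot f}{N}$ and the population variance is $\sigma^2 = \frac{s_{xx}}{N}$.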
·      Percentiles = A measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall.
Median = 50th percentile
Q1 = 25th percentile
Q3 = 75th percentile
Find the percentile of a group using these steps:
1.     Order all values from smallest to largest.
2.     Multiply k percent (k = the percentile we are looking for) by the total number of values → this gives the index (if it is not a whole number, round up).
3.     Count values in your data set from smallest to largest until you reach the number you got in step 2.
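
Both procedures in this section can be sketched in Python. The grouped-data function assumes the formulas x̄ = ∑x·f / n and s = √(sxx / (n − 1)) with sxx = ∑x²·f − (∑x·f)²/n. The percentile function follows the three steps literally; many textbooks average two neighbouring values when the index in step 2 is a whole number, but the steps above do not say so, so that refinement is left out. The data and function names are made up for illustration.

```python
import math


def grouped_mean_and_std(class_marks, frequencies):
    """Sample mean and standard deviation from class marks x and frequencies f."""
    n = sum(frequencies)
    if n < 2:
        raise ValueError("need at least two observations")
    sum_xf = sum(x * f for x, f in zip(class_marks, frequencies))
    sum_x2f = sum(x * x * f for x, f in zip(class_marks, frequencies))
    s_xx = sum_x2f - sum_xf ** 2 / n
    return sum_xf / n, math.sqrt(s_xx / (n - 1))


def percentile_by_steps(values, k):
    """k-th percentile found by following the three counting steps literally."""
    if not values or not 0 < k <= 100:
        raise ValueError("need non-empty data and 0 < k <= 100")
    ordered = sorted(values)                    # step 1: order the values
    index = math.ceil(k * len(ordered) / 100)   # step 2: round up if fractional
    return ordered[index - 1]                   # step 3: count to that position


# Hypothetical grouped data: class marks and their frequencies
marks = [5, 15, 25, 35]
freqs = [2, 5, 8, 5]
mean, s = grouped_mean_and_std(marks, freqs)
print(f"grouped mean = {mean}, grouped s = {s:.2f}")

# Hypothetical raw data for the percentile steps
raw = [12, 15, 17, 20, 21, 24, 26, 30, 33, 40]
print("25th percentile:", percentile_by_steps(raw, 25))
print("75th percentile:", percentile_by_steps(raw, 75))
```

For the raw data, 25% of 10 values gives an index of 2.5, which rounds up to 3, so the 25th percentile is the third ordered value, 17.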

3.9 Further Statistical Descriptions

Varied forms of frequency distribution:
·      Symmetrical bell-shaped distribution
·      Positively skewed bell-shaped distribution (tail on right)
·      Negatively skewed bell-shaped distribution (tail on left)
·      Reversed J-shaped distribution
·      U-shaped distribution
To determine amount of skewness in a frequency distribution:
Pearsonian coefficient of skewness: $SK = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$
Perfectly symmetrical → SK = 0; SK values fall in the range -3 ≤ SK ≤ 3.
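
As a quick illustration, the coefficient can be computed with the standard `statistics` module. This sketch assumes the usual form SK = 3(mean − median) / standard deviation and uses the sample standard deviation; the three data sets are invented to show a zero, a positive, and a negative result.

```python
import statistics


def pearson_skewness(values):
    """Pearsonian coefficient of skewness: 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    s = statistics.stdev(values)   # sample standard deviation (n - 1 divisor)
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return 3 * (mean - median) / s


symmetric = [1, 2, 3, 4, 5]            # mean equals median
right_tail = [3, 5, 5, 6, 8, 9, 41]    # one large value pulls the mean up
left_tail = [1, 33, 35, 36, 37]        # one small value pulls the mean down

for name, sample in [("symmetric", symmetric),
                     ("right tail", right_tail),
                     ("left tail", left_tail)]:
    print(f"{name:>10}: SK = {pearson_skewness(sample):+.2f}")
```

A positive SK goes with a tail on the right (mean above the median) and a negative SK with a tail on the left.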

Determining Outliers Using the 5 number Summary Presentation:
Minimum, Q1, Median (Q2), Q3, Maximum.

With IQR = Q3 – Q1 (the interquartile range):
Upper outliers are values above the upper fence = Q3 + 1.5(IQR)
Lower outliers are values below the lower fence = Q1 – 1.5(IQR)
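
A minimal sketch of this outlier check follows. It takes Q1 and Q3 to be the medians of the values to the left and right of the median position, as defined earlier in these notes, and IQR = Q3 − Q1; other quartile conventions exist and can give slightly different limits. The sample and the function names are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = ordered[:half]              # values left of the median position
    upper = ordered[half + (n % 2):]    # values right of the median position
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])


def outlier_limits(q1, q3):
    """Return (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


data = [3, 5, 5, 6, 8, 9, 41]   # hypothetical sample
minimum, q1, median, q3, maximum = five_number_summary(data)
low_limit, high_limit = outlier_limits(q1, q3)
outliers = [x for x in data if x < low_limit or x > high_limit]

print("five-number summary:", (minimum, q1, median, q3, maximum))
print("limits:", (low_limit, high_limit))
print("outliers:", outliers)
```

Here Q1 = 5 and Q3 = 9, so IQR = 4 and the limits are −1 and 15; the value 41 lies above the upper limit and is flagged as an outlier.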

3.10 Technical Note: Summations

∑x is a very general notation → the more specific notation is $\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$.
With two subscripts it is also possible to evaluate double summations such as $\sum_{j=1}^{m} \sum_{i=1}^{n} x_{ij}$.

Three basic rules for summations:
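·      Rule A: $\sum_{i=1}^{n} (x_i \pm y_i) = \sum_{i=1}^{n} x_i \pm \sum_{i=1}^{n} y_i$
·      Rule B: $\sum_{i=1}^{n} k \cdot x_i = k \cdot \sum_{i=1}^{n} x_i$
·      Rule C: $\sum_{i=1}^{n} k = k \cdot n$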
