

Probability & Statistics I

Robb Sinn

Math 3350 / 6350 NGCSU, Fall 2008

Data Analysis

Introduction to Mean and Standard Deviation

A data set is a set of data points drawn from a domain set, called the sample space or population. Given a data set, we can describe it with numerics or graphics (or both). Variables can be either quantitative (numeric) or qualitative (categorical). Some variables can be represented either as a numeric variable (my GPA is 3.25) or as a categorical one (I'm a B+ student). Most statistical tests are used with numeric data sets. For numeric data sets, we have a sample of $n$ data points: $x_1, x_2, \ldots, x_n \in X$. The $x_i$'s are assumed to be values of a simple random variable drawn from a domain set $X$.

Definition 1 Mean, $\bar{x}$

▸ The arithmetic average of the $n$ data points: $\bar{x} = \frac{1}{n} \sum x_i$

If we claim to know the population mean (and this is no small claim), we denote it $\mu$. When we have only one sample, $n$ denotes the sample size. If we have $k > 1$ samples, we typically indicate the sample size of group 1 as $n_1$, group 2 as $n_2$, and so forth. The overall sample size is usually denoted $N = n_1 + n_2 + \cdots + n_k$. In the multigroup context, $\bar{x}_1$ is the mean for group 1, etc., and $\bar{x}$ is the grand mean. A quick formula for the grand mean, given that we know the group means, is

$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \cdots + n_k \bar{x}_k}{n_1 + n_2 + \cdots + n_k} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \cdots + n_k \bar{x}_k}{N}$$
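As a quick numerical check of the grand-mean formula, here is a minimal Python sketch; the group sizes and group means are made-up values for illustration.

```python
# Grand mean from group means: weight each group mean by its group size.
group_sizes = [10, 25, 15]     # n_1, n_2, n_3 (hypothetical)
group_means = [3.2, 4.1, 3.8]  # group means (hypothetical)

N = sum(group_sizes)
grand_mean = sum(n * m for n, m in zip(group_sizes, group_means)) / N
print(N, grand_mean)  # 50, 3.83
```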

A similar convention exists for the standard deviation where, given $k$ groups, $s_m$ is the standard deviation of group $m$ (where $1 \le m \le k$). We hardly ever calculate a "grand" or overall standard deviation.

Definition 2 Population Variance, $\sigma^2$

▸ Given $n$ data points, assuming we know $\mu$: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{n}$

We rarely use this formula since, in practice, we rarely know $\mu$. Most often, we estimate $\mu$ with $\bar{x}$.

Definition 3 Sample Variance, $s^2$

▸ Given $n$ data points: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

Note that, if we view the data set $X = \{x_1, x_2, \ldots, x_n\}$ as a uniform probability function, that is, with $P(x_i) = \frac{1}{n}$ for all $x_i$, then

$$\bar{x} = E(X) \quad \text{and} \quad \sigma^2 = V(X)$$

An advantage of the standard deviation compared to the variance is that the standard deviation has the same units as the data set. For example, if the data are annual incomes, the standard deviation has units of dollars earned per year, whereas the variance would have units of (dollars earned per year)$^2$. The standard deviation can be interpreted as the Euclidean (or standard) distance metric for any set of data. Variance, despite its drawbacks, is a fundamental probability concept that underpins statistical theory. In statistics, we typically use a machine to determine the mean and standard deviation, never computing or even noting the variance.
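Definitions 1-3 are easy to verify by machine. Here is a minimal Python sketch using the data set $\{1, 2, 3, 6\}$ that appears in the examples below.

```python
import math

# Mean, sample variance, and sample standard deviation (Definitions 1-3).
data = [1, 2, 3, 6]
n = len(data)

mean = sum(data) / n                                       # x-bar
sample_var = sum((x - mean) ** 2 for x in data) / (n - 1)  # s^2, note n - 1
sample_sd = math.sqrt(sample_var)                          # same units as data

print(mean, sample_var, sample_sd)  # 3.0  4.666...  2.160...
```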

Vector Review

Consider the idea of distance in Euclidean space. In $n$ dimensions, each point in the space is an $n$-dimensional vector.

Example 1 Let two $n$-dimensional vectors be given by

$$\vec{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \quad \text{and} \quad \vec{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$

respectively. The two vectors are said to be points in $n$-space, or notationally: $\vec{x}, \vec{y} \in \mathbb{R}^n$. In physics we often work in 3-space or $\mathbb{R}^3$, where the 3 dimensions are left-right, forward-backward, and up-down.

Definition 4 Euclidean Distance between Two Vectors $\vec{x}, \vec{y}$, $D(\vec{x}, \vec{y})$

▸ $\vec{x} \cdot \vec{y} = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n$

Comparing this definition with the magnitude formula above, we can see that $\|\vec{x}\|^2 = \vec{x} \cdot \vec{x}$. Recall that the dot product of two vectors is zero if and only if the vectors are perpendicular. By the way, mathematicians always like "metric spaces," mathematical settings where a distance metric exists. Often in abstract settings, an inner product can be established and then turned into a distance metric. The approach shown below for developing the standard deviation as a distance measurement in a data set is an example of the typical way mathematicians think about distance. They are much more concerned with relative distances than absolute ones.
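The dot product and the magnitude identity $\|\vec{x}\|^2 = \vec{x} \cdot \vec{x}$ are easy to sketch in Python; the vectors here are made-up examples.

```python
import math

def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

x = [1, 2, 2]
print(dot(x, x))             # ||x||^2 = x . x = 9
print(math.sqrt(dot(x, x)))  # ||x|| = 3.0
print(dot([1, 0], [0, 1]))   # 0: the vectors are perpendicular
```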

Standard Deviation as a Distance Metric

The idea of distance in $n$-dimensional vector space demonstrates why the standard deviation was created. Assume a sample is drawn from a normal distribution. The following example will illustrate the key ideas.

Example 3 In a certain data set, the data points are $\{1, 2, 3, 6\}$, so $n = 4$ and $\bar{x} = 3$. For any data point, say, $x = 1$, we can compute the directional distance (or deviation) from the mean:

$$d_i = x_i - \bar{x} = 1 - 3 = -2$$

This is the absolute (directed) distance between $x_i$ and $\bar{x}$. But it says nothing about "relative" distances. It tells us nothing about whether "2 units below the mean" is a large distance (like walking 2 miles) or a small one (walking 2 yards). Nor does it tell us how near the left tail of the distribution this data point might be. Statisticians seeking a more general idea of distance decided to use the typical Euclidean distance metric for $n$-space to determine the distance in a data set of size $n$. The first attempt used the idea of a deviation vector. For the data in Example 3, we would create the 4-dimensional deviation vector by subtracting the mean from each data point in turn.

Example 3 continued: defining a deviation vector.

$$\vec{d} = \begin{pmatrix} x_1 - \bar{x} \\ x_2 - \bar{x} \\ \vdots \\ x_n - \bar{x} \end{pmatrix} = \begin{pmatrix} 1 - 3 \\ 2 - 3 \\ 3 - 3 \\ 6 - 3 \end{pmatrix} = \begin{pmatrix} -2 \\ -1 \\ 0 \\ 3 \end{pmatrix}$$

A deviation $d_i$ of zero means that the data point $x_i = \bar{x}$, i.e., that it does not deviate at all from the mean. Next we calculate the magnitude of the deviation vector.

$$\left\| \vec{d} \right\| = \left\| \begin{pmatrix} -2 \\ -1 \\ 0 \\ 3 \end{pmatrix} \right\| = \sqrt{(-2)^2 + (-1)^2 + 0^2 + 3^2} = \sqrt{14} \approx 3.7417$$

This naive idea would suffice if all data sets were the exact same size. But consider the following two data sets:

Example 4
Data Set 1: $\{1, 2, 3, 6\}$
Data Set 2: $\{1, 2, 3, 6, 1, 2, 3, 6\}$

We've already computed the deviation vector and its magnitude for Data Set 1: $\left\| \vec{d}_1 \right\| = \sqrt{14} \approx 3.74$. Doing the same for Data Set 2, we would have $\left\| \vec{d}_2 \right\| = \sqrt{28} \approx 5.29$.

Data Set 2 uses only data points from Data Set 1 and in many ways is not very different from it. The magnitude of the deviation in Data Set 2 is larger, not because the data have spread out more, but simply because there is more data. To better "standardize" distance, statisticians tried dividing by $n$ (or $n - 1$) to get rid of the influence of simply adding more data when the "spread" of the data is unchanged.

Example 4 continued
For Data Set 1: $s_1 = \frac{\left\| \vec{d}_1 \right\|}{\sqrt{n - 1}} = \frac{\sqrt{14}}{\sqrt{4 - 1}} \approx 2.1602$

For Data Set 2: $s_2 = \frac{\left\| \vec{d}_2 \right\|}{\sqrt{n - 1}} = \frac{\sqrt{28}}{\sqrt{8 - 1}} = 2.0$

This makes more statistical sense. If we have 8 data points grouped within the same absolute distance as the original 4 were, the statistical distance metric will need to shrink slightly, not grow. Dividing by $n - 1$ keeps the distance metric from growing just because $n$ is large. The reason for using $n - 1$ for the denominator (rather than $n$) will be covered later when we discuss "unbiased estimators." For now, note that the sample standard deviation is undefined for data sets where $n = 1$, which does rather make sense. How could a one-number data set have a distance metric?
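Example 4 is simple to reproduce by machine. A minimal Python sketch, using the two data sets above, shows that the raw magnitude grows with $n$ while the standardized metric does not.

```python
import math

def deviation_magnitude(data):
    """Magnitude of the deviation vector: sqrt of the sum of squared deviations."""
    mean = sum(data) / len(data)
    return math.sqrt(sum((x - mean) ** 2 for x in data))

def sample_sd(data):
    """Standardize the magnitude by dividing by sqrt(n - 1)."""
    return deviation_magnitude(data) / math.sqrt(len(data) - 1)

set1 = [1, 2, 3, 6]
set2 = set1 * 2  # the same values, repeated: more data, same "spread"

print(deviation_magnitude(set1), deviation_magnitude(set2))  # ~3.742, ~5.292
print(sample_sd(set1), sample_sd(set2))                      # ~2.160, 2.0
```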

Numeric Summaries of Data

There are three main ideas in numerical summaries of data: central tendency, dispersion, and position. The first two describe the properties of the data set and, hopefully, the underlying distribution. Position describes where in the overall distribution a single data point lies. The key idea for position is determining when a data point qualifies as an outlier. Central tendency describes a typical data point in the distribution. This is our notion of "average." The most common descriptions of the central tendency of a data set are the (arithmetic) mean and the median. The mode

are above a certain score, roughly 84% are below. The competent statistician performs these mental estimates quickly. Outliers are data points that differ greatly from the rest of the data set. Because outliers strongly influence the mean but not the median, these data points skew the distribution. If outliers exist to the left, the mean jumps left and the distribution skews left. If outliers exist to the right, the mean jumps right and the distribution skews right. If outliers exist on both sides in approximately equal measure, the distribution may not be skewed at all. Detecting outliers is an art, not a science. Two introductory-level methods exist, the first using a box plot with fences adjusted by the IQR, and the second using z-scores. Suppose our sample size is a million. Based on probability, we expect about 0.15% of all standardized scores to have $z_i > 3$. This would be $0.0015(1{,}000{,}000) = 1500$ of the one million scores. Similarly, we would expect about the same number of scores with $z_i < -3$. It seems odd to call several thousand data points outliers when they are conforming to what probability theory suggests will happen. The more trials of an experiment we conduct, the more likely we are to experience low-probability outcomes. This is why, when using z-scores to detect outliers, we tend to use a sliding scale. The one below is mine, but others might argue for something different. We might both be correct. That's stats.

Detecting Outliers with z-Scores

Sample    n                Criteria
Small     n < 50           |z| > 2
Medium    100 < n < 500    |z| > 3
Large     n > 1000         |z| > 4
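A minimal Python sketch of this sliding scale; the cutoffs follow the table above, the gaps in the table (e.g., $50 \le n \le 100$) are bridged here as an assumption, and the data set is made up.

```python
def z_threshold(n):
    """Sliding-scale |z| cutoff; gaps in the table are interpolated by assumption."""
    if n < 50:
        return 2.0
    elif n < 1000:
        return 3.0
    else:
        return 4.0

def z_outliers(data):
    n = len(data)
    mean = sum(data) / n
    sd = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5
    cut = z_threshold(n)
    return [x for x in data if abs((x - mean) / sd) > cut]

print(z_outliers([1, 2, 3, 6, 2, 3, 2, 40]))  # [40] flags as an outlier
```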

The above criteria do not address every eventuality. Outliers are contextual, and putting forth a single "rule" for finding them is impossible. Most modern statistical runs utilize multivariate settings, for example, when each individual is measured on five variables. An individual with no single variable score more than 2 standard deviations away from the mean may be an outlier when we look at all five of her scores together. Also, she could be an outlier on any combination of 4, 3, or 2 variables. Detecting outliers in multivariate settings is a complex process that defies any simple-minded or robotic approach. In introductory statistics, a very simple-minded approach uses "fences" and the IQR (inter-quartile range) to detect outliers. The advantage is that this allows a handy version of the box plot to be used to graphically check for outliers. Recall that

$$\text{IQR} = Q_3 - Q_1$$

A box plot (see below) is a graph with vertical lines at $x_{\min}$, $Q_1$, the median, $Q_3$, and $x_{\max}$. When displayed numerically, these five numbers are called the Five Number Summary. The typical box plot shows the Five Number Summary. However, instead of drawing the fences of the box plot at $x_{\min}$ and $x_{\max}$, we can draw them at $Q_1 - 1.5 \times \text{IQR}$ and $Q_3 + 1.5 \times \text{IQR}$. Then, we mark with an asterisk any data points lying either below the lower fence or above the upper fence and call them outliers. This method has visual appeal: outliers are shown in relationship to the data set. Most basic statistical machines (TI graphing calculators, etc.) will draw this type of box plot. This method for detecting outliers suffers from the fact that, in any medium-sized or larger data set, we are almost guaranteed to have outliers, and no sliding scale (like the one used for z-scores) exists to adjust the fences. However, when we care about outliers affecting statistical tests, the test is generally being performed on a small data set ($n < 30$), and in these settings, the box plot with IQR fences is a reasonable method for assessing the outliers.

It should be noted that certain folks in NGCSU's math department consider it a failing for Math 3350 instructors to omit discussions of IQRs for, in spite of their general lack of statistical relevance, they are covered in detail in AP Statistics and other introductory-level statistics courses. Thus (the theory goes), "all secondary mathematics preservice teachers should cover them in a class." In defense of this notion, each statistical test should be run only after verifying that the underlying assumptions validating its use are satisfied. For z-tests, we need a nearly symmetric distribution with no outliers to satisfy the normality assumption, so neophyte statistics students are taught to check histograms and box plots for a near-perfect bell shape and a complete lack of outliers. The process of checking assumptions is a useful one to encourage, but the assumptions necessary to use z-tests are so strict that they are quite rarely met in real-world applications. In fact, the brittle nature of the z-test with regard to the normality assumption is the reason Student's t-test was invented by a quality assurance specialist in a brewery. So have a beer, and settle down about fences, IQRs, and box plots. If you can't calm your nerves, have another beer. More about checking the assumptions of statistical tests later.
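A minimal Python sketch of the fences-and-IQR method. Note that quartile conventions vary between textbooks and calculators, so the exact fence values can differ slightly from a TI calculator's.

```python
import statistics

def iqr_outliers(data):
    """Flag points beyond the 1.5 * IQR fences."""
    q1, _, q3 = statistics.quantiles(data, n=4)    # Q1, median, Q3
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the fences
    return [x for x in data if x < lower or x > upper]

print(iqr_outliers([1, 2, 3, 6, 2, 3, 2, 40]))  # [40]
```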

Graphical Summaries of Data

The main statistical graph we use is the histogram. Histograms are specialized bar graphs built upon frequency tables. Categories are chosen and then, for each category, a count of the data points falling within that category is listed. The bins or categories are chosen arbitrarily with an eye toward being able to pick out the "shape" of the distribution. If too few categories exist, there will be several tall bars and not much else to see. If too many categories are chosen, each data point ends up having its own bar, and once again the pattern is impossible to see. So, Goldilocks, what number of categories is "just right"? As usual with statistics, that depends. In small data sets ($n < 30$), 7 to 10 bars generally allow the histogram to show the pattern or shape of the data; a frequency-table sketch appears below. Another interesting graph is the stem plot. The stem plot has similar advantages to the histogram (showing the shape of a data set). But the nu-
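As referenced above, here is a minimal Python sketch of a frequency table with a chosen number of equal-width bins; the data and the bin count are made-up illustrations.

```python
def frequency_table(data, bins=7):
    """Count the data points falling in each of `bins` equal-width categories."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins or 1  # guard against all-equal data
    counts = [0] * bins
    for x in data:
        i = min(int((x - lo) / width), bins - 1)  # clamp the max into the last bin
        counts[i] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]

for left, right, count in frequency_table([1, 2, 2, 3, 3, 3, 4, 5, 6, 8, 9, 12]):
    print(f"[{left:5.2f}, {right:5.2f}): {'*' * count}")  # a text histogram
```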

To check that assumption 1 is valid, we produce a histogram and check for outliers. There is no real way to test the other two assumptions. One approach (the t-test vs. the z-test) is to develop a test that holds up well (is robust) when assumption 3 is violated.

The independence assumption is problematic in that few statistical tests can violate it with impunity. The worst violators of this assumption are educational researchers. Consider a standardized math test given to a district's third graders. Suppose there are 8 classes of 25 kids with 8 different teachers. The independence assumption means that the 200 students' scores are all assumed to have no bearing upon one another. This is ridiculous. If one of the classes has 10 students whose behavior is terrible, the class will have less learning time than the other 7. This means that all 25 scores in this class will be lower than they "should be," or that they depend upon one another. Another class may have the "teacher of the year," a person so effective and so inspiring that everyone scores much higher than they should. People who call for medicine's version of "research informed practice" to be transferred into educational settings forget this basic tenet of statistics-based research. There are ways around the dependence problem in educational research, but the routes are difficult, time-consuming, expensive, and tedious. These limitations mean that many key questions about educational efficacy are impossible to answer definitively.

The z-test is poor because it is a brittle test, which means that even minor violations of the above assumptions cause severe deterioration in its accuracy. Worse, even for rather large data sets, the Law of Large Numbers and the Central Limit Theorem cannot be invoked. In most cases, we cannot assume normality until $n \geq 1000$.

The t-test has the exact same assumptions and applications as the z-test, but it is a robust statistical test. Even rather severe violations of assumptions 1 and 3 do not have much adverse effect upon its accuracy. This is why I despise introductory statistics books and classes that overemphasize the z-test. It is almost never used in real-world research studies, and it would seem it is introduced mostly because it is the easiest of statistical computations. Statistics instructors who value computational fluency use the z-test to introduce students to the painful process of "by-hand" statistics. My question is: who cares if students can perform "by-hand" statistics? No researchers actually compute their own statistics anyway, and few modern applications of statistics are simple enough to be calculated by hand without several weeks of tedious work. A reliance upon by-hand statistics by instructors forces students to encounter only tiny data sets, a limited number of examples, and a limited variety of tests.

I take a slightly different approach. A modern researcher uses computers and statistical software for all computations, and different skills are valued. Instead of computational fluency, I hope my students will develop an understanding of when each different type of test should be used and what the reams of output from a software package like SPSS might mean in terms of the real world.
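In that spirit, here is a minimal sketch of letting software do the computation: a one-sample t-test in Python (assuming SciPy is available; the data and the hypothesized mean are made up). The point is to interpret the output, not to compute it by hand.

```python
from scipy import stats  # assumes SciPy is installed

# One-sample t-test of H0: mu = 3 on a small, made-up data set.
data = [1, 2, 3, 6, 2, 4, 3, 5]
result = stats.ttest_1samp(data, popmean=3)
print(result.statistic, result.pvalue)  # read the p-value, not the arithmetic
```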