






An excerpt from a university course on Probability & Statistics I, taught by Robb Sinn at NGCSU in Fall 2008. It covers the mean, standard deviation, and data analysis, including Euclidean distance, the dot product, and the empirical rule, and discusses the roles of central tendency, dispersion, and position in numerical summaries of data.
Math 3350 / 6350 NGCSU, Fall 2008
Introduction to Mean and Standard Deviation
A data set is a set of data points drawn from a domain set, called the sample space or population. Given a data set, we can describe it with numerics or graphics (or both). Variables can be either quantitative (numeric) or qualitative (categorical). Some variables can be represented either as a numeric variable (my GPA is 3.25) or as a categorical one (I'm a B+ student). Most statistical tests are used with numeric data sets. For numeric data sets, we have a sample of $n$ data points: $\{x_1, x_2, \ldots, x_n\} \subseteq X$. The $x_i$'s are assumed to be values of a simple random variable drawn from a domain set $X$.
Definition 1 Mean, $\bar{x}$

The arithmetic average of the $n$ data points: $\bar{x} = \dfrac{1}{n} \sum_{i=1}^{n} x_i$
If we claim to know the population mean (and this is no small claim), we denote it $\mu$. When we have only one sample, $n$ denotes the sample size. If we have $k > 1$ samples, we typically indicate the sample size of group 1 as $n_1$, group 2 as $n_2$, and so forth. The overall sample size is usually denoted $N = n_1 + n_2 + \cdots + n_k$. In the multigroup context, $\bar{x}_1$ is the mean for group 1, etc., and $\bar{x}$ is the grand mean. A quick formula for the grand mean, given that we know the group means, is
$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \cdots + n_k \bar{x}_k}{n_1 + n_2 + \cdots + n_k} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \cdots + n_k \bar{x}_k}{N}$$
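As a quick sanity check, the grand-mean formula can be sketched in Python. The group sizes and means below are made-up values for illustration, not data from the notes:

```python
# Sketch of the grand-mean formula: xbar = (n1*x1 + ... + nk*xk) / N.
def grand_mean(means, sizes):
    """Weighted average of the k group means, weighted by group sizes."""
    N = sum(sizes)  # overall sample size N = n1 + n2 + ... + nk
    return sum(n * xbar for n, xbar in zip(sizes, means)) / N

# Hypothetical groups: group 1 has n1 = 4, mean 3.0; group 2 has n2 = 6, mean 5.0.
print(grand_mean([3.0, 5.0], [4, 6]))  # (4*3 + 6*5) / 10 = 4.2
```

Note that the grand mean is a weighted average, not the simple average of the group means (which here would be 4.0), unless all groups have the same size.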
A similar convention exists for the standard deviation where, given $k$ groups, $s_m$ is the standard deviation of group $m$ (where $1 \le m \le k$). We hardly ever calculate a "grand" or overall standard deviation.
Definition 2 Population Variance, $\sigma^2$

Given $n$ data points, assuming we know $\mu$: $\sigma^2 = \dfrac{\sum (x_i - \mu)^2}{n}$
We rarely use this formula since, in practice, we rarely know $\mu$. Most often, we estimate $\mu$ with $\bar{x}$.
Definition 3 Sample Variance, $s^2$

Given $n$ data points: $s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$
Note that, if we view the data set $X = \{x_1, x_2, \ldots, x_n\}$ as a uniform probability function, that is, with $P(x_i) = \frac{1}{n}$ for all $x_i$, then

$$\bar{x} = E(X) \quad \text{and} \quad \sigma^2 = V(X)$$
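A minimal Python sketch of Definitions 2 and 3, using a small made-up data set, makes the $n$ versus $n-1$ divisor concrete:

```python
# Population variance (Definition 2): divide by n, requires a known mu.
def pop_variance(xs, mu):
    return sum((x - mu) ** 2 for x in xs) / len(xs)

# Sample variance (Definition 3): estimate mu with xbar, divide by n - 1.
def sample_variance(xs):
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

data = [1, 2, 3, 6]                  # illustrative data; xbar = 3
print(sample_variance(data))         # 14 / 3 = 4.666...
print(pop_variance(data, mu=3.0))    # 14 / 4 = 3.5
```

Python's standard library offers the same pair as `statistics.variance` (sample, $n-1$) and `statistics.pvariance` (population, $n$).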
An advantage of standard deviation compared to variance is that the standard deviation has the same units as the data set. For example, if the data are annual incomes, the standard deviation has units of dollars earned per year whereas the variance has units of (dollars earned per year)$^2$. The standard deviation can be interpreted as the Euclidean (or standard) distance metric for any set of data. Variance, despite its drawbacks, is a fundamental probability concept that underpins statistical theory. In statistics, we typically use a machine to determine the mean and standard deviation, never computing or even noting the variance.
Vector Review
Consider the idea of distance in Euclidean space. In $n$ dimensions, each point in the space is an $n$-dimensional vector.
Example 1

Let two $n$-dimensional vectors be given by $\vec{x} = (x_1, x_2, \ldots, x_n)^T$ and $\vec{y} = (y_1, y_2, \ldots, y_n)^T$, respectively. The two vectors are said to be points in $n$-space, or notationally: $\vec{x}, \vec{y} \in \mathbb{R}^n$. In physics we often work in 3-space or $\mathbb{R}^3$, where the 3 dimensions are left-right, forward-backward, and up-down.
Definition 4 Euclidean Distance between Two Vectors $\vec{x}, \vec{y}$, $D(\vec{x}, \vec{y})$

$$D(\vec{x}, \vec{y}) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$$

The related magnitude of a vector is $\|\vec{x}\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$, and the dot product of two vectors is $\vec{x} \cdot \vec{y} = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n$.
Comparing this definition with the magnitude formula above, we can see that $\|\vec{x}\|^2 = \vec{x} \cdot \vec{x}$. Recall that the dot product of two vectors is zero if and only if the vectors are perpendicular. By the way, mathematicians always like "metric spaces," mathematical settings where a distance metric exists. Often in abstract settings, an inner product can be established and then turned into a distance metric. The approach shown below for developing the standard deviation as a distance measurement in a data set is an example of the typical way mathematicians think about distance. They are much more concerned with relative distances than absolute ones.
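These vector facts can be sketched directly in Python. The vectors used are arbitrary examples, not data from the notes:

```python
import math

# Dot product: x . y = x1*y1 + x2*y2 + ... + xn*yn
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

# Magnitude, using the identity ||x||^2 = x . x
def magnitude(x):
    return math.sqrt(dot(x, x))

# Euclidean distance: the magnitude of the difference vector x - y
def distance(x, y):
    return magnitude([a - b for a, b in zip(x, y)])

x, y = [1.0, 2.0, 2.0], [1.0, 0.0, 0.0]
print(dot(x, y))             # 1.0
print(magnitude(x))          # sqrt(1 + 4 + 4) = 3.0
print(distance(x, y))        # sqrt(0 + 4 + 4) = sqrt(8) ≈ 2.8284
print(dot([1, 0], [0, 1]))   # 0: perpendicular vectors have zero dot product
```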
Standard Deviation as a Distance Metric
The idea of distance in $n$-dimensional vector space demonstrates why standard deviation was created. Assume a sample is drawn from a normal distribution. The following example will illustrate the key ideas.
Example 3 In a certain data set, the data points are $\{1, 2, 3, 6\}$, so $n = 4$ and $\bar{x} = 3$. For any data point, say, $x = 1$, we can compute the directional distance (or deviation) from the mean:
$$d_i = x_i - \bar{x} = 1 - 3 = -2$$
This is the absolute (directed) distance between $x_i$ and $\bar{x}$. But it says nothing about "relative" distances. It tells us nothing about whether "2 units below the mean" is a large distance (like walking 2 miles) or a small one (like walking 2 yards). Nor does it tell us how near the left tail of the distribution this data point might be. Statisticians seeking a more general idea of distance decided to use the typical Euclidean distance metric for $n$-space to determine the distance in a data set of size $n$. The first attempt used the idea of a deviation vector. For the data in Example 3, we would create the 4-dimensional deviation vector by subtracting the mean from each data point in turn.
Example 3 continued: defining a deviation vector.

$$\vec{d} = (x_1 - \bar{x},\; x_2 - \bar{x},\; \ldots,\; x_n - \bar{x})^T$$
A deviation $d_i$ of zero means that the data point $x_i = \bar{x}$, i.e., that it does not deviate at all from the mean. Next we calculate the magnitude of the deviation vector.
$$\|\vec{d}\| = \sqrt{(-2)^2 + (-1)^2 + 0^2 + 3^2} = \sqrt{14} \approx 3.7417$$
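The computation in Example 3 is short enough to verify with a few lines of Python:

```python
import math

# Example 3: deviation vector and its magnitude for the data set {1, 2, 3, 6}.
data = [1, 2, 3, 6]
xbar = sum(data) / len(data)            # mean: 3.0
dev = [x - xbar for x in data]          # deviation vector: [-2.0, -1.0, 0.0, 3.0]
mag = math.sqrt(sum(d * d for d in dev))  # magnitude: sqrt(4 + 1 + 0 + 9)
print(dev)   # [-2.0, -1.0, 0.0, 3.0]
print(mag)   # sqrt(14) ≈ 3.7417
```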
This naive idea would suffice if all data sets were the exact same size. But consider the following two data sets:
Example 4 Data Set 1: $\{1, 2, 3, 6\}$; Data Set 2: $\{1, 2, 3, 6, 1, 2, 3, 6\}$. We've already computed the deviation vector and its magnitude for Data Set 1: $\|\vec{d}_1\| = \sqrt{14} \approx 3.7417$. For Data Set 2, the same four deviations simply appear twice, so $\|\vec{d}_2\| = \sqrt{28} \approx 5.2915$.
Data Set 2 uses only data points from Data Set 1 and in many ways is not very different from it. The magnitude of the deviation in Data Set 2 is larger, not because the data have spread out more, but simply because there are more data. To better "standardize" distance, statisticians tried dividing by $n$ (or $n - 1$) to remove the influence of simply adding more data when the "spread" of the data is unchanged.
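A short Python sketch of Example 4 shows both effects: the raw deviation magnitude grows when the data are duplicated, while the sample standard deviation (Definition 3, dividing by $n - 1$) stays essentially the same. The small residual difference between the two standard deviations comes from the $n - 1$ correction, not from any change in spread:

```python
import math

# Raw magnitude of the deviation vector: grows with sample size.
def dev_magnitude(xs):
    xbar = sum(xs) / len(xs)
    return math.sqrt(sum((x - xbar) ** 2 for x in xs))

# Sample standard deviation (Definition 3): divide by n - 1 before the root.
def sample_sd(xs):
    xbar = sum(xs) / len(xs)
    return math.sqrt(sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1))

set1 = [1, 2, 3, 6]
set2 = [1, 2, 3, 6, 1, 2, 3, 6]
print(dev_magnitude(set1), dev_magnitude(set2))  # sqrt(14) ≈ 3.7417 vs sqrt(28) ≈ 5.2915
print(sample_sd(set1), sample_sd(set2))          # ≈ 2.1602 vs 2.0
```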
Example 4 continued For Data Set 1: $s_1 = \sqrt{\dfrac{14}{3}} \approx 2.1602$