Sample mean and covariance

  1. Sample mean

  2. Sample covariance

  3. Unbiasedness

  4. Variance of the sampling distribution of the sample mean

  5. Weighted samples

  6. Criticism

  7. See also

  8. References


The sample mean or empirical mean and the sample covariance are statistics computed from a collection (the sample) of data on one or more random variables.

The sample mean and sample covariance are estimators of the population mean and population covariance, where the term population refers to the set from which the sample was taken.

The sample mean is a vector each of whose elements is the sample mean of one of the random variables{{spaced ndash}}that is, each of whose elements is the arithmetic average of the observed values of one of the variables. The sample covariance matrix is a square matrix whose i, j element is the sample covariance (an estimate of the population covariance) between the sets of observed values of two of the variables and whose i, i element is the sample variance of the observed values of one of the variables. If only one variable has had values observed, then the sample mean is a single number (the arithmetic average of the observed values of that variable) and the sample covariance matrix is also simply a single value (a 1×1 matrix containing a single number, the sample variance of the observed values of that variable).

Due to their ease of calculation and other desirable characteristics, the sample mean and sample covariance are widely used in statistics and applications to numerically represent the location and dispersion, respectively, of a distribution.

Sample mean

{{main|Arithmetic mean}}

Let <math>x_{ij}</math> be the ith independently drawn observation (i=1,...,N) on the jth random variable (j=1,...,K). These observations can be arranged into N column vectors, each with K entries, with the K×1 column vector giving the ith observations of all variables being denoted <math>\mathbf{x}_i</math> (i=1,...,N).

The sample mean vector <math>\mathbf{\bar{x}}</math> is a column vector whose jth element <math>\bar{x}_j</math> is the average value of the N observations of the jth variable:

<math>\bar{x}_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}, \quad j = 1, \ldots, K.</math>

Thus, the sample mean vector contains the average of the observations for each variable, and is written

<math>\mathbf{\bar{x}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i.</math>
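
As an illustration, the following Python/NumPy sketch computes the sample mean vector exactly this way; the 4×3 data array and its numbers are made up for the example (observations are stored as rows, a common software convention), and the variable names are not from the article.

<syntaxhighlight lang="python">
import numpy as np

# N = 4 observations (rows) on K = 3 variables (columns); illustrative data only.
X = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.5],
    [3.0, 4.0, 2.5],
    [4.0, 3.0, 3.5],
])
N, K = X.shape

# The jth entry is the average of the N observations of the jth variable:
# x_bar_j = (1/N) * sum_i x_ij
x_bar = X.sum(axis=0) / N

assert np.allclose(x_bar, X.mean(axis=0))  # same as NumPy's built-in column mean
print(x_bar)  # [2.5 2.5 2. ]
</syntaxhighlight>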

Sample covariance

{{See also|Sample variance}}

The sample covariance matrix is a K-by-K matrix <math>\mathbf{Q} = \left[ q_{jk} \right]</math> with entries

<math>q_{jk} = \frac{1}{N-1} \sum_{i=1}^{N} \left( x_{ij} - \bar{x}_j \right) \left( x_{ik} - \bar{x}_k \right),</math>

where <math>q_{jk}</math> is an estimate of the covariance between the {{math|j}}th variable and the {{math|k}}th variable of the population underlying the data.

In terms of the observation vectors, the sample covariance is

<math>\mathbf{Q} = \frac{1}{N-1} \sum_{i=1}^{N} (\mathbf{x}_i - \mathbf{\bar{x}}) (\mathbf{x}_i - \mathbf{\bar{x}})^{\mathrm{T}}.</math>
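
These two equivalent formulas can be sketched in Python/NumPy as follows (random, purely illustrative data; the names X, x_bar and Q are not from the article), with a check against NumPy's np.cov, which uses the same N − 1 denominator.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 3
X = rng.normal(size=(N, K))      # N observations (rows) on K variables
x_bar = X.mean(axis=0)           # sample mean vector

# Observation-vector form: Q = 1/(N-1) * sum_i (x_i - x_bar)(x_i - x_bar)^T
Q = sum(np.outer(x - x_bar, x - x_bar) for x in X) / (N - 1)

# Entry (j, k) agrees with the scalar formula for q_jk.
j, k = 0, 1
q_jk = ((X[:, j] - x_bar[j]) * (X[:, k] - x_bar[k])).sum() / (N - 1)
assert np.isclose(Q[j, k], q_jk)

# np.cov with rowvar=False treats rows as observations and also divides by N - 1.
assert np.allclose(Q, np.cov(X, rowvar=False))
</syntaxhighlight>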

Alternatively, arranging the observation vectors as the columns of a matrix, so that

<math>\mathbf{F} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_N \end{bmatrix},</math>

which is a matrix of K rows and N columns, the sample covariance matrix can be computed as

<math>\mathbf{Q} = \frac{1}{N-1} \left( \mathbf{F} - \mathbf{\bar{x}} \, \mathbf{1}_N^{\mathrm{T}} \right) \left( \mathbf{F} - \mathbf{\bar{x}} \, \mathbf{1}_N^{\mathrm{T}} \right)^{\mathrm{T}},</math>

where <math>\mathbf{1}_N</math> is an N by {{math|1}} vector of ones.
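
The matrix form can be sketched the same way (again with random, illustrative data); forming the centered matrix once and taking one matrix product is usually preferable to summing N outer products.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 3
F = rng.normal(size=(K, N))              # K rows (variables), N columns (observations)
x_bar = F.mean(axis=1, keepdims=True)    # K x 1 sample mean vector
ones = np.ones((N, 1))                   # N x 1 vector of ones

# Q = 1/(N-1) * (F - x_bar 1^T)(F - x_bar 1^T)^T
centered = F - x_bar @ ones.T            # subtract the mean from every column
Q = centered @ centered.T / (N - 1)

assert np.allclose(Q, np.cov(F))         # np.cov's default treats rows as variables
</syntaxhighlight>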

If the observations are arranged as rows instead of columns, so <math>\mathbf{\bar{x}}</math> is now a 1×K row vector and <math>\mathbf{M} = \mathbf{F}^{\mathrm{T}}</math> is an N×K matrix whose column j is the vector of N observations on variable j, then applying transposes in the appropriate places yields

<math>\mathbf{Q} = \frac{1}{N-1} \left( \mathbf{M} - \mathbf{1}_N \mathbf{\bar{x}} \right)^{\mathrm{T}} \left( \mathbf{M} - \mathbf{1}_N \mathbf{\bar{x}} \right).</math>

Like covariance matrices for random vectors, sample covariance matrices are positive semi-definite. To prove it, note that for any matrix <math>\mathbf{A}</math> the matrix <math>\mathbf{A}^{\mathrm{T}} \mathbf{A}</math> is positive semi-definite. Furthermore, a covariance matrix is positive definite if and only if the rank of the vectors <math>\mathbf{x}_i - \mathbf{\bar{x}}</math> is K.
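
A quick numerical illustration of the positive semi-definiteness claim (random, illustrative data):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
Q = np.cov(rng.normal(size=(50, 4)), rowvar=False)   # 4 x 4 sample covariance

# Q has the form A^T A / (N-1) with A the centered data, so its eigenvalues
# are non-negative (up to floating-point round-off).
assert np.all(np.linalg.eigvalsh(Q) >= -1e-12)

# With fewer observations than variables, the centered vectors cannot have
# rank K, so Q is singular: positive semi-definite but not positive definite.
Q_small = np.cov(rng.normal(size=(3, 4)), rowvar=False)   # N = 3, K = 4
print(np.linalg.matrix_rank(Q_small))                     # at most N - 1 = 2
</syntaxhighlight>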

Unbiasedness

The sample mean and the sample covariance matrix are unbiased estimates of the mean and the covariance matrix of the random vector <math>\mathbf{X}</math>, a row vector whose jth element (j = 1, ..., K) is one of the random variables.[1] The sample covariance matrix has <math>N-1</math> in the denominator rather than <math>N</math> due to a variant of Bessel's correction: In short, the sample covariance relies on the difference between each observation and the sample mean, but the sample mean is slightly correlated with each observation since it is defined in terms of all observations. If the population mean <math>\operatorname{E}(\mathbf{X})</math> is known, the analogous unbiased estimate

<math>\mathbf{Q} = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}_i - \operatorname{E}(\mathbf{X})) (\mathbf{x}_i - \operatorname{E}(\mathbf{X}))^{\mathrm{T}},</math>

using the population mean, has <math>N</math> in the denominator. This is an example of why in probability and statistics it is essential to distinguish between random variables (upper case letters) and realizations of the random variables (lower case letters).

The maximum likelihood estimate of the covariance

<math>\mathbf{Q}_{\mathrm{ML}} = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}_i - \mathbf{\bar{x}}) (\mathbf{x}_i - \mathbf{\bar{x}})^{\mathrm{T}}</math>

for the Gaussian distribution case has N in the denominator as well. The ratio of 1/N to 1/(N − 1) approaches 1 for large N, so the maximum likelihood estimate approximately equals the unbiased estimate when the sample is large.
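
A small Monte Carlo sketch of the N − 1 versus N point; the normal distribution, sample size and number of trials below are arbitrary choices for illustration.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
true_var = 4.0                   # population variance of N(0, 2^2)
N, trials = 5, 200_000

samples = rng.normal(0.0, 2.0, size=(trials, N))
x_bar = samples.mean(axis=1, keepdims=True)
ss = ((samples - x_bar) ** 2).sum(axis=1)    # sum of squared deviations per sample

unbiased = ss / (N - 1)          # Bessel-corrected estimator
mle = ss / N                     # maximum likelihood estimator

print(unbiased.mean())           # close to 4.0 on average (unbiased)
print(mle.mean())                # close to 4.0 * (N-1)/N = 3.2 (biased low)
</syntaxhighlight>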

Variance of the sampling distribution of the sample mean

{{main|Standard error of the mean}}

For each random variable, the sample mean is a good estimator of the population mean, where a "good" estimator is defined as being efficient and unbiased. Of course the estimator will likely not be the true value of the population mean since different samples drawn from the same distribution will give different sample means and hence different estimates of the true mean. Thus the sample mean is a random variable, not a constant, and consequently has its own distribution. For a random sample of N observations on the jth random variable, the sample mean's distribution itself has mean equal to the population mean <math>\operatorname{E}(X_j)</math> and variance equal to <math>\sigma_j^2 / N</math>, where <math>\sigma_j^2</math> is the population variance.
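
This can be checked numerically; the distribution, sample size and trial count in the sketch below are arbitrary illustrative choices.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 9.0                     # population variance of N(0, 3^2)
N, trials = 25, 200_000

# Draw many independent samples of size N and record each sample mean.
means = rng.normal(0.0, 3.0, size=(trials, N)).mean(axis=1)

print(means.mean())              # close to the population mean 0
print(means.var())               # close to sigma2 / N = 0.36
</syntaxhighlight>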

Weighted samples

{{main|Weighted mean}}

In a weighted sample, each vector <math>\mathbf{x}_i</math> (each set of single observations on each of the K random variables) is assigned a weight <math>w_i \geq 0</math>. Without loss of generality, assume that the weights are normalized:

<math>\sum_{i=1}^{N} w_i = 1.</math>

(If they are not, divide the weights by their sum.) Then the weighted mean vector <math>\mathbf{\bar{x}}</math> is given by

<math>\mathbf{\bar{x}} = \sum_{i=1}^{N} w_i \mathbf{x}_i,</math>

and the elements <math>q_{jk}</math> of the weighted covariance matrix <math>\mathbf{Q}</math> are

<math>q_{jk} = \frac{1}{1 - \sum_{i=1}^{N} w_i^2} \sum_{i=1}^{N} w_i \left( x_{ij} - \bar{x}_j \right) \left( x_{ik} - \bar{x}_k \right).</math>[2]

If all weights are the same, <math>w_i = 1/N</math>, the weighted mean and covariance reduce to the sample mean and covariance mentioned above.
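
A Python/NumPy sketch of the weighted formulas (random, illustrative data and weights), including a check that equal weights recover the unweighted estimators:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(4)
N, K = 8, 2
X = rng.normal(size=(N, K))              # N observations (rows) on K variables

w = rng.random(N)
w = w / w.sum()                          # normalize the weights to sum to 1

x_bar_w = w @ X                          # weighted mean: sum_i w_i x_i

# Weighted covariance with the 1 / (1 - sum_i w_i^2) prefactor.
centered = X - x_bar_w
Q_w = (centered.T * w) @ centered / (1.0 - np.sum(w ** 2))

# Equal weights w_i = 1/N reduce to the ordinary sample mean and covariance.
w_eq = np.full(N, 1.0 / N)
x_bar_eq = w_eq @ X
centered_eq = X - x_bar_eq
Q_eq = (centered_eq.T * w_eq) @ centered_eq / (1.0 - np.sum(w_eq ** 2))

assert np.allclose(x_bar_eq, X.mean(axis=0))
assert np.allclose(Q_eq, np.cov(X, rowvar=False))
</syntaxhighlight>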

Criticism

The sample mean and sample covariance are not robust statistics, meaning that they are sensitive to outliers. As robustness is often a desired trait, particularly in real-world applications, robust alternatives may prove desirable, notably quantile-based statistics such as the sample median for location,[3] and interquartile range (IQR) for dispersion. Other alternatives include trimming and Winsorising, as in the trimmed mean and the Winsorized mean.
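
As a brief numerical illustration of this sensitivity (made-up data with a single gross outlier), the sketch below contrasts the sample mean with the robust alternatives mentioned above, using SciPy's stats.iqr and stats.trim_mean:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

data = np.array([2.1, 2.3, 1.9, 2.2, 2.0, 2.4, 2.1, 100.0])  # one gross outlier

print(data.mean())                   # about 14.4: dragged far from the bulk of the data
print(np.median(data))               # 2.15: essentially unaffected by the outlier
print(stats.iqr(data))               # spread of the central 50%, robust to the outlier
print(stats.trim_mean(data, 0.125))  # mean after trimming 12.5% from each tail
</syntaxhighlight>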

See also

  • Estimation of covariance matrices
  • Scatter matrix
  • Unbiased estimation of standard deviation

References

1. Richard Arnold Johnson; Dean W. Wichern (2007). Applied Multivariate Statistical Analysis. Pearson Prentice Hall. ISBN 978-0-13-187715-3. https://books.google.com/books?id=gFWcQgAACAAJ (retrieved 10 August 2012).
2. Mark Galassi, Jim Davies, James Theiler, Brian Gough, Gerard Jungman, Michael Booth, and Fabrice Rossi (2011). GNU Scientific Library - Reference Manual, Version 1.15 (https://www.gnu.org/software/gsl/manual), Sec. 21.7 "Weighted Samples": https://www.gnu.org/software/gsl/manual/html_node/Weighted-Samples.html
3. Bart Kosko, "The Sample Mean", The World Question Center 2006.

Categories: Covariance and correlation | Estimation methods | Summary statistics | Matrices | U-statistics
