2-3. Measures of Variability (Statistical Dispersion)

Statistics that measure the degree of dispersion

Consider the following two data sets. A dot plot represents each data set as a row of points.

Data Set 1

40 38 42 40 39 39 43 40 39 40

Data Set 2

46 37 40 33 42 36 40 47 34 45

library(ggplot2)

# The two data sets, stored as columns of one data frame
x1 <- c(40, 38, 42, 40, 39, 39, 43, 40, 39, 40)
x2 <- c(46, 37, 40, 33, 42, 36, 40, 47, 34, 45)
x <- data.frame(x1, x2)

# One dot plot per data set
ggplot(x, aes(x = x1)) + geom_dotplot()
ggplot(x, aes(x = x2)) + geom_dotplot()

Reference: https://ggplot2.tidyverse.org/reference/geom_dotplot.html
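To compare the two spreads on one shared axis, a minimal sketch using base R's stripchart() could look like the following. This is a swapped-in alternative to geom_dotplot(), not the method used above, and the list names "Data Set 1" and "Data Set 2" are labels chosen here for illustration.

```r
x1 <- c(40, 38, 42, 40, 39, 39, 43, 40, 39, 40)
x2 <- c(46, 37, 40, 33, 42, 36, 40, 47, 34, 45)

# Put both data sets in a named list so they share one horizontal axis
sets <- list("Data Set 1" = x1, "Data Set 2" = x2)

# method = "stack" piles repeated values on top of each other, like a dot plot
stripchart(sets, method = "stack", pch = 16, xlab = "Measurement")
```

Both data sets have mean 40, so any difference visible in the chart is a difference in spread rather than in center.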

1. The Range

The range of a data set is the number $R$ defined by the formula $R = x_{max} - x_{min}$

where $x_{max}$ is the largest measurement in the data set and $x_{min}$ is the smallest.

EXAMPLE 23. Find the range of each of the two data sets above.

[Solution]

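x1 <- c(40, 38, 42, 40, 39, 39, 43, 40, 39, 40)
x2 <- c(46, 37, 40, 33, 42, 36, 40, 47, 34, 45)

# Range of x1
# 1) Directly from the definition: largest value minus smallest value
range_x1 <- max(x1) - min(x1); range_x1

# 2) With built-in functions
range(x1)         # returns c(min, max), not the number R itself
diff(range(x1))   # max - min

# Range of x2
# 1) Directly from the definition
range_x2 <- max(x2) - min(x2); range_x2

# 2) With built-in functions
range(x2)
diff(range(x2))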

Note: The R function range() returns the minimum value and the maximum value of the data set, not their difference. Use diff(range(x)) to get the range R.
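Worked by hand from the definition, the two ranges come out as follows. The subscripts 1 and 2 on $R$ are labels added here only to tell the data sets apart.

```latex
R_1 = x_{max} - x_{min} = 43 - 38 = 5
\qquad
R_2 = x_{max} - x_{min} = 47 - 33 = 14
```

Data Set 2 has the larger range, which matches its wider dot plot.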

2. The Variance and The Standard Deviation

EXAMPLE 24. Find the sample variance and the sample standard deviation of Data Set 2 from the example above.

[Solution]

x <- c(46, 37, 40, 33, 42, 36, 40, 47, 34, 45)

# 1. Variance
n <- length(x); n
y <- (x - mean(x)); y
var_x <- sum(y^2)/(n-1); var_x

# 2. R Function for Variance : var()
var(x)

# 3. R Function for Standard Deviation : sd()
sd(x)

Note: In R, var() returns the sample variance, i.e. it divides by (n-1) rather than n.

The sample variance of a set of $n$ sample data is the number $s^2$ defined by the formula

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

which by algebra is equivalent to the formula

$$s^2 = \frac{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2}{n - 1}$$

The sample standard deviation of a set of $n$ sample data is the square root of the sample variance, hence is the number $s$ given by the formulas

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2}{n - 1}}$$
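The equivalence of the two formulas can be checked numerically. The sketch below applies both to Data Set 2 and compares them with the built-in var() and sd(); the variable names s2_definition and s2_shortcut are chosen here for illustration.

```r
x <- c(46, 37, 40, 33, 42, 36, 40, 47, 34, 45)   # Data Set 2
n <- length(x)

# Definition: squared deviations from the mean, divided by n - 1
s2_definition <- sum((x - mean(x))^2) / (n - 1)

# Shortcut: uses only the sum of x and the sum of x^2
s2_shortcut <- (sum(x^2) - sum(x)^2 / n) / (n - 1)

# Both agree with the built-in var(), and sd() is the square root of var()
all.equal(s2_definition, s2_shortcut)
all.equal(s2_definition, var(x))
all.equal(sqrt(s2_definition), sd(x))
```

For this data set the sum of squared deviations is 224, so the sample variance is 224/9, about 24.89, and the sample standard deviation is about 4.99.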

EXAMPLE 25. The grade point averages of ten randomly selected students are listed below. Find the sample variance and the sample standard deviation.

1.90 3.00 2.53 3.71 2.12 1.76 2.71 1.39 4.00 3.33

[Solution]

x <- c(1.90, 3.00, 2.53, 3.71, 2.12, 1.76, 2.71, 1.39, 4.00, 3.33)
var(x)
sd(x)
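As a check on the R output, the same quantities can be worked by hand with the shortcut formula for the sample variance. The sums below were computed from the ten values listed in the example, with $n = 10$.

```latex
\sum x = 26.45, \qquad \sum x^2 = 76.7321

s^2 = \frac{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2}{n - 1}
    = \frac{76.7321 - \frac{1}{10}(26.45)^2}{9}
    = \frac{76.7321 - 69.96025}{9}
    = \frac{6.77185}{9}
    \approx 0.7524

s = \sqrt{s^2} \approx 0.867
```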

The population variance and population standard deviation of a set of $N$ population data are the numbers $\sigma^2$ and $\sigma$ defined by the formulas

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} \quad \text{and} \quad \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
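Because var() and sd() divide by n - 1, the population versions have to be computed by hand or obtained by rescaling. A minimal sketch, assuming Data Set 2 is treated as a complete population purely for illustration; pop_var is a hypothetical helper defined here, not a built-in R function.

```r
x <- c(46, 37, 40, 33, 42, 36, 40, 47, 34, 45)   # treated as a whole population
N <- length(x)

# Hypothetical helper: population variance divides by N, not N - 1
pop_var <- function(x) mean((x - mean(x))^2)

sigma2 <- pop_var(x)    # population variance
sigma  <- sqrt(sigma2)  # population standard deviation

# var() divides by N - 1, so rescaling it by (N - 1) / N gives the same value
all.equal(sigma2, var(x) * (N - 1) / N)
```

Here the population variance is 224/10 = 22.4, smaller than the sample variance of the same numbers, because the divisor is N instead of N - 1.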

[ Difference between Two Data Sets ]

3. Coefficient of Variation

In probability theory and statistics, the coefficient of variation (CV), also known as relative standard deviation (RSD), is a standardized measure of dispersion of a probability distribution or frequency distribution. It is often expressed as a percentage, and is defined as the ratio of the standard deviation $\sigma$ to the mean $\mu$ (or its absolute value, $|\mu|$).

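The coefficient of variation is used to describe the size of the variability relative to the mean.

$$CV = \frac{s}{\bar{x}}$$

It is expressed as the ratio of the standard deviation to the mean.

The larger the coefficient of variation, that is, the larger the standard deviation relative to the sample mean, the more spread out the data can be said to be.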
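To illustrate this with the two data sets from the start of the section, the sketch below computes the ratio for each one. The helper name cv_ratio is hypothetical, chosen so that it does not mask any packaged cv() function.

```r
x1 <- c(40, 38, 42, 40, 39, 39, 43, 40, 39, 40)   # Data Set 1
x2 <- c(46, 37, 40, 33, 42, 36, 40, 47, 34, 45)   # Data Set 2

# Hypothetical helper: sample standard deviation divided by the sample mean
cv_ratio <- function(x) sd(x) / mean(x)

cv1 <- cv_ratio(x1)
cv2 <- cv_ratio(x2)

# Both means are 40, so the larger CV of x2 reflects its larger standard deviation
c(cv1 = cv1, cv2 = cv2)
round(100 * c(cv1, cv2), 1)   # the same ratios expressed as percentages
```

Because the two means are equal, comparing the CVs here amounts to comparing the standard deviations. The CV becomes more informative when the data sets being compared have different means or units.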

EXAMPLE 26. Find the CV of the data in Example 25.

[ Solution ]

# install.packages("goeveg")
library(goeveg)

x <- c(1.90, 3.00, 2.53, 3.71, 2.12, 1.76, 2.71, 1.39, 4.00, 3.33)

# 1. Calculation of CV
cv_x <- sd(x) / mean(x); cv_x

# 2. R Function : cv() in 'goeveg' package
cv(x)
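The manual calculation can also be followed by hand. Using the sample mean and sample standard deviation of the ten Example 25 values (the sum of the values is 26.45 and $s \approx 0.867$):

```latex
\bar{x} = \frac{26.45}{10} = 2.645

CV = \frac{s}{\bar{x}} \approx \frac{0.867}{2.645} \approx 0.328
```

Expressed as a percentage, the standard deviation is roughly 32.8% of the mean.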

  1. Measures of dispersion

    1) Range

    2) Standard deviation

    3) Variance

    4) Coefficient of variation
