2-3. Measures of Variability (Statistical Dispersion)

离散程度的统计量

다음의 두 데이터 세트가 있다. 각각의 데이터 세트를 점으로 표현한 것이 dot plot이다.

Data Set 1

Data Set 2

library(ggplot2)
x1 <- c(40, 38, 42, 40, 39, 39, 43, 40, 39, 40)
x2 <- c(46, 37, 40, 33, 42, 36, 40, 47, 34, 45)
x <- data.frame(x1, x2)

ggplot(x, aes(x = x1)) + geom_dotplot()
ggplot(x, aes(x = x2)) + geom_dotplot()

참고사이트 : https://ggplot2.tidyverse.org/reference/geom_dotplot.html

1. The Range

The range of a data set is the number R defined by the formula $R=x_{max}−x_{min}$
where $x_{max}$ is the largest measurement in the data set and $x_{min}$ is the smallest.

EXAMPLE 23. 앞의 2개의 데이터 세트의 range를 구하라.

[Solution]

x1 <- c(40, 38, 42, 40, 39, 39, 43, 40, 39, 40)
x2 <- c(46, 37, 40, 33, 42, 36, 40, 47, 34, 45)

# range of x1
# 1)
range_x1 <- max(x1) - min(x1); range_x1

# 2)
range(x1)   # R Function : range()
diff(range(x1))

# range of x2
# 1)
range_x2 <- max(x2) - min(x2); range_x2

# 2)
range(x2)
diff(range(x2))

> # range of x1
> # 1)
> range_x1 <- max(x1) - min(x1); range_x1
## [1] 5
> 
> # 2)
> range(x1)   # R Function : range()
## [1] 38 43
> diff(range(x1))
## [1] 5
> 
> # range of x2
> # 1)
> range_x2 <- max(x2) - min(x2); range_x2
## [1] 14
>
> # 2)
> range(x2)
## [1] 33 47
> diff(range(x2))
## [1] 14

Note : range( )function of R returns the minimum value and the maxim value of the data set.

2. The Variance and The Standard Deviation

EXAMPLE 24. 위의 예에서 Data Set 2의 sample variance와 sample standard deviation을 구하라.

[Solution]

x <- c(46, 37, 40, 33, 42, 36, 40, 47, 34, 45)

# 1. Variance
n <- length(x); n
y <- (x - mean(x)); y
var_x <- sum(y^2)/(n-1); var_x

# 2. R Function for Variance : var()
var(x)

# 3. R Function for Standard Deviation : sd()
sd(x)

> # 1. Variance
> n <- length(x); n
## [1] 10
> y <- (x - mean(x)); y
##  [1]  6 -3  0 -7  2 -4  0  7 -6  5
> var_x <- sum(y^2)/(n-1); var_x
## [1] 24.88889
> 
> # 2. R Function for Variance : var()
> var(x)
## [1] 24.88889

> # 3. R Function for Standard Deviation : sd()
> sd(x)
## [1] 4.988877

Note : In R, var() returns the sample variance, i.e. the denominator used in var() function is (n-1).

The sample variance of a set of $n$ sample data is the number $s^2$ defined by the formula
$s^2= \frac{Σ(x−\bar{x})^2} {n−1}$
which by algebra is equivalent to the formula
$s^2= \frac{Σx^2−\frac{1}{n}(Σx)^2} {n−1}$
The sample standard deviation of a set of $n$ sample data is the square root of the sample variance, hence is the number $s$ given by the formulas
$s= \sqrt { \frac{Σ(x−\bar{x})^2} {n−1} } = \sqrt{ \frac{Σx^2−\frac{1}{n}(Σx)^2} {n−1} }$

EXAMPLE 25. 무작위로 선발한 10명의 학생의 평균 평점은 다음과 같다. sample variance와 sample standard deviation을 구하라.

1.90 3.00 2.53 3.71 2.12 1.76 2.71 1.39 4.00 3.33

[풀이]

x <- c(1.90, 3.00, 2.53, 3.71, 2.12, 1.76, 2.71, 1.39, 4.00, 3.33)
var(x)
sd(x)

> var(x)
## [1] 0.7524278
> sd(x)
## [1] 0.8674259

The population variance and population standard deviation of a set of $N$ population data are the numbers $σ^2$ and $σ$ defined by the formulas
$σ^2=\frac{Σ(x−μ)^2}{N}$ and $σ= \sqrt { \frac{Σ(x−μ)^2}{N} }$

[ Difference between Two Data Sets ]

See : Degree of Dispersion

3. Coefficient of Variation

In probability theory and statistics, the coefficient of variation (CV), also known as relative standard deviation (RSD), is a standardized measure of dispersion of a probability distribution or frequency distribution. It is often expressed as a percentage, and is defined as the ratio of the standard deviation $\sigma$ to the mean $\mu$ (or its absolute value, $|\mu|$ ).

평균에 대한 상대적인 변동성의 크기를 설명할 때에 변동계수(Coefficient of Variation)을 사용한다.
$CV = \frac {s} {\bar{x}}$
평균에 대한 표준편차의 비율로 표현된다.
변동계수가 클수록, 즉 표준편차가 표본평균에 비해 클수록 자료의 퍼짐진 정도가 더 크다고 할 수 있다.

EXAMPLE 26. Example 25.의 CV를 구하라.

[ Solution ]

# install.packages("goeveg")
library(goeveg)

x <- c(1.90, 3.00, 2.53, 3.71, 2.12, 1.76, 2.71, 1.39, 4.00, 3.33)

# 1. Calculation of CV
cv_x <- sd(x) / mean(x); cv_x

# 2. R Function : cv() in 'goeveg' package
cv(x)

> # 1. Calculation of CV
> cv_x <- sd(x) / mean(x); cv_x
## [1] 0.3279493
> 
> # 2. R Function : cv() in 'goeveg' package
> cv(x)
## [1] 0.3279493

离散程度的统计量
1) 全距(range)
2) 标准差(standard deviation)
3) 方差(variance)
4) 变异系数(coefficient of variation)

Previous2-2. Exercises Next2-3. Exercises

Last updated 5 years ago

Was this helpful?