The sample mean of a set of $n$ sample data is the number $\bar{x}$ defined by the formula
$$\bar{x} = \frac{\sum x}{n}$$
EXAMPLE 11. Find the mean of the following sample data.
2 -1 0 2
[Solution]
x <- c( 2, -1, 0, 2)
# method 1
n <- length(x) # number of elements of x
mean_x <- sum(x * 1/n) # or mean_x <- sum(x) / n
# method 2 : using function mean()
mean(x) # function : mean()
> mean_x <- sum(x * 1/n); mean_x # or mean_x <- sum(x) / n
## [1] 0.75
>
> mean(x) # function : mean()
## [1] 0.75
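EXAMPLE 12. The grade point averages of 10 randomly selected students are as follows. Find the sample mean.
Judging from the calculation above, the value 192.4 appears to be an outlier. Therefore, use the median rather than the mean as the measure of center for these data.
[Solution 2] Median after removing the outlier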
x <- c(24.8, 22.8, 24.6, 25.2, 18.5, 23.7)
# 1. sort the data in ascending order
y <- sort(x) ; y
# 2. Number of data points
n <- length(y)
# 3. Compute the median
if ( n %% 2 == 0 ) { # the modulo (%% operator)
( y[(n %/% 2)] + y[(n %/% 2) + 1])/2 # integer division (%/% operator)
} else {
y[(n %/% 2 + 1)]
}
# R function for median : median()
median(x)
> # 3. Compute the median
> if ( n %% 2 == 0 ) { # the modulo (%% operator)
+ ( y[(n %/% 2)] + y[(n %/% 2) + 1])/2 # integer division (%/% operator)
+ } else {
+ y[(n %/% 2 + 1)]
+ }
## [1] 24.15
>
> # R function for median : median()
> median(x)
## [1] 24.15
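The median of a set of sample data is computed as follows.
Sort the sample data in ascending order: y <- sort(x)
If the number of data points $n$ is odd, $x_{\mathrm{median}} = y_{(n+1)/2}$ (the $\frac{n+1}{2}$-th value).
If the number of data points is even, $x_{\mathrm{median}} = \frac{y_{n/2} + y_{n/2+1}}{2}$ (that is, the mean of the $\frac{n}{2}$-th and $(\frac{n}{2}+1)$-th values is the median of the sample data).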
The sample median $\tilde{x}$ of a set of sample data for which there are an odd number of measurements is the middle measurement (the $\frac{n+1}{2}$-th value) when the data are arranged in numerical order. The sample median $\tilde{x}$ of a set of sample data for which there are an even number of measurements is the mean of the two middle measurements (the $\frac{n}{2}$-th and $(\frac{n}{2}+1)$-th values) when the data are arranged in numerical order.
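The odd and even cases of this definition can be wrapped in one small helper. The following is a minimal sketch in base R; the function name `sample_median` and the two test vectors are hypothetical (they are not part of R or of the examples here), and the input is assumed to be a non-empty numeric vector without missing values.

```r
# Hypothetical helper (not part of base R): sample median by the sorting rule.
sample_median <- function(x) {
  y <- sort(x)          # arrange the data in ascending order
  n <- length(y)        # number of measurements
  if (n %% 2 == 0) {
    # even n: mean of the (n/2)-th and (n/2 + 1)-th values
    (y[n %/% 2] + y[n %/% 2 + 1]) / 2
  } else {
    # odd n: the ((n + 1)/2)-th value
    y[(n + 1) %/% 2]
  }
}

odd_x  <- c(5, 1, 3)        # illustrative data, odd n
even_x <- c(5, 1, 3, 7)     # illustrative data, even n

# Compare against the built-in median()
sample_median(odd_x)  == median(odd_x)
sample_median(even_x) == median(even_x)
```

Both comparisons should agree with `median()`, which is the function to prefer in practice.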
EXAMPLE 17. Find the median of the following data set.
132 162 133 145 148 139 147 160 150 153
[ Solution ]
x <- c(132, 162, 133, 145, 148, 139, 147, 160, 150, 153)
# 1. sort the data in ascending order
y <- sort(x) ; y
# 2. Number of data points
n <- length(y)
# 3. Compute the median
if ( n %% 2 == 0 ) {
(y[(n %/% 2)] + y[(n %/% 2) + 1])/2
} else { # `else` must follow the closing brace on the same line at top level
y[(n %/% 2) + 1]
}
# R function for median : median()
median(x)
The relationship between the mean and the median for several common shapes of distributions is shown in Figure "Skewness of Relative Frequency Histograms". The distributions in panels (a) and (b) are said to be symmetric because of the symmetry that they exhibit. The distributions in the remaining two panels are said to be skewed. In each distribution we have drawn a vertical line that divides the area under the curve in half, which in accordance with Figure "The Median" is located at the median. The following facts are true in general:
When the distribution is symmetric, as in panels (a) and (b) of Figure "Skewness of Relative Frequency Histograms", the mean and the median are equal.
When the distribution is as shown in panel (c) of Figure "Skewness of Relative Frequency Histograms", it is said to be skewed right. The mean has been pulled to the right of the median by the long “right tail” of the distribution, the few relatively large data values.
When the distribution is as shown in panel (d) of Figure "Skewness of Relative Frequency Histograms", it is said to be skewed left. The mean has been pulled to the left of the median by the long “left tail” of the distribution, the few relatively small data values.
2.1 Skewness of Relative Frequency Histogram
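The relationship between the mean and the median can also be checked numerically. The sketch below is an illustration under stated assumptions: the exponential sample, the seed, and the variable names are arbitrary choices made here, not data from the figure.

```r
set.seed(1)                         # arbitrary seed, only for reproducibility
right_skewed <- rexp(1000)          # exponential data: long right tail
left_skewed  <- -right_skewed       # mirror image: long left tail
symmetric    <- c(-2, -1, 0, 1, 2)  # exactly symmetric about 0

# Skewed right: the mean is pulled to the right of the median
mean(right_skewed) > median(right_skewed)

# Skewed left: the mean is pulled to the left of the median
mean(left_skewed) < median(left_skewed)

# Symmetric: the mean and the median coincide (both are 0 here)
mean(symmetric) == median(symmetric)
```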
3. Mode
The sample mode of a set of sample data is the most frequently occurring value.
EXAMPLE 21. Find the mode of the following data set.
-1 0 2 0
[Solution]
x <- c(-1, 0, 2, 0)
y <- table(x)
names(which.max(y))
R has no built-in function for the statistical mode (the base function mode() reports an object's storage mode instead), so we define a user function. It takes a vector as input and returns the mode as output.
# Create the user function : getmode().
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Create the vector with numbers.
v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)
# Calculate the mode using the user function.
result <- getmode(v)
print(result)
# Create the vector with characters.
charv <- c("o","it","the","it","it")
# Calculate the mode using the user function.
result <- getmode(charv)
print(result)
> # Create the vector with numbers.
> v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)
>
> # Calculate the mode using the user function.
> result <- getmode(v)
> print(result)
## [1] 2
>
> # Create the vector with characters.
> charv <- c("o","it","the","it","it")
>
> # Calculate the mode using the user function.
> result <- getmode(charv)
> print(result)
## [1] "it"
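Because `which.max()` returns only the first maximum, `getmode()` reports a single value even when several values are tied for the highest frequency. A possible variant that returns every tied value is sketched below; the name `getmodes` is hypothetical and the test vectors are illustrative.

```r
# Hypothetical helper: return all values tied for the highest frequency.
getmodes <- function(v) {
  uniqv  <- unique(v)                   # distinct values, in order of first appearance
  counts <- tabulate(match(v, uniqv))   # frequency of each distinct value
  uniqv[counts == max(counts)]          # keep every value that attains the maximum
}

getmodes(c(1, 1, 2, 2, 3))            # two modes: 1 and 2
getmodes(c("o", "it", "the", "it"))   # single mode: "it"
```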
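The mode is often used for qualitative data, such as the political party a person supports or a favorite number, where the values are not numeric or where they are numeric but their ordering carries no meaning.
The median is often used for data in which rank matters, such as income or test scores, where the values can be ordered.
The mean is the most widely used representative value, but it is sensitive to outliers and therefore needs more care than one might expect. Cases in which the mean describes the data poorly are not limited to small samples. In practice, economic indicators, especially national-scale data, are often reported with quantiles, such as the top 10% of incomes or the low-income bracket. As income inequality grows, the mean becomes less meaningful, and it becomes necessary to distinguish clearly between the median and the mean.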
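The sensitivity of the mean to a single extreme value can be seen in a small sketch. The income figures below are hypothetical numbers chosen for illustration, not real data.

```r
# Hypothetical incomes (arbitrary units); the last value is an extreme outlier.
income <- c(30, 32, 35, 38, 40, 42, 45, 48, 50, 1000)

mean(income)                  # 136: pulled far above the typical value
median(income)                # 41: barely affected by the outlier

# Quantiles describe the distribution without being dominated by one value
quantile(income, probs = c(0.1, 0.5, 0.9))

mean(income[income < 1000])   # 40: the mean once the outlier is removed
```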
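EXAMPLE 22. Using the Cars93 data set from the MASS library, draw histograms of the distribution of price by car type (Price by Type), and compute the skewness (the degree of left-right asymmetry) and the kurtosis (the height of the peak relative to the normal distribution) of price by car type.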
library(MASS)
str(Cars93)
# 1. Histogram, Price by Car Type
library(ggplot2)
ggplot(Cars93, aes(x=Price)) +
geom_histogram(binwidth=3, fill = "blue", colour = "black") +
ggtitle("Histogram, Price by Type") +
facet_grid(Type ~ .)
library(fBasics)
# 2. skewness : skewness()
skewness(Cars93$Price)
with(Cars93, tapply(Price, Type, skewness))
# 3. kurtosis : kurtosis()
kurtosis(Cars93$Price)
with(Cars93, tapply(Price, Type, kurtosis))
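To see what these two statistics measure, they can be computed by hand from central moments. The following is a minimal sketch, assuming the population (divide-by-n) moment definitions; the helper names and the data vector are hypothetical, and the fBasics functions may use a different denominator convention, so their values can differ slightly from these.

```r
# Hypothetical helpers using population (divide-by-n) central moments.
moment_skewness <- function(x) {
  d <- x - mean(x)                 # deviations from the mean
  mean(d^3) / mean(d^2)^1.5        # m3 / m2^(3/2)
}
moment_excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3      # m4 / m2^2 - 3 (0 for a normal distribution)
}

x <- c(1, 2, 3, 4, 5)              # symmetric illustrative data
moment_skewness(x)                 # 0 for symmetric data
moment_excess_kurtosis(x)          # -1.3: flatter than a normal distribution
```

The helpers can be applied per group in the same way as above, for example with `tapply()`.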