多変量解析手法の簡易メモなど

Rによるデータサイエンスって本のデータ解析メモですね.
この本はわりかし一般的な手法の網羅的な解説 + 参考文献豊富で良い感じ.

主成分分析(Principal Component Analysis)

目的
- 多変量データを少ない変数で表現できるようにする
- 通常は2~3変数に縮約する場合が多い。
- 分散を最大化する手法

多変量解析としては最も有名な手法の一つ

分散共分散行列の固有値問題とみなす
- 最も固有値が大きいのが第一主成分、次に固有値が大きいのが第ニ主成分... と求める
- 主成分の解析には寄与率、累積寄与率に関する情報を利用.

主成分得点、予測、biplot...

関係の強い手法として、カーネル主成分分析(kpca, 非線形主成分分析)、独立成分分析などもある。

R
- princompやprcompなどが利用可能
- princompを普通利用する
- prcompはデータのスケールなど利用可能

サンプル

require(graphics)

## The variances of the variables in the
## USArrests data vary by orders of magnitude, so scaling is appropriate
(pc.cr <- princomp(USArrests))  # inappropriate
princomp(USArrests, cor = TRUE) # =^= prcomp(USArrests, scale=TRUE)
## Similar, but different:
## The standard deviations differ by a factor of sqrt(49/50)

summary(pc.cr <- princomp(USArrests, cor = TRUE))
loadings(pc.cr)  ## note that blank entries are small but not zero
plot(pc.cr) # shows a screeplot.
biplot(pc.cr)

## Formula interface
princomp(~ ., data = USArrests, cor = TRUE)
# NA-handling
USArrests[1, 2] <- NA
pc.cr <- princomp(~ Murder + Assault + UrbanPop,
                  data = USArrests, na.action=na.exclude, cor = TRUE)
pc.cr$scores

因子分析

目的
- 主に心理学、社会学などで外的基準がない量的データから共通因子を見つけ出す探索的データ解析時に利用
- 変数間の相関関係から共通因子を求める

歴史
- 1904 Spearmanによって提唱された

利用の際の注意点
- 多義的解釈が可能なので、自分に都合の良い解釈が可能なので注意が必要
- 客観的意味付けをできるように使う事が重要
- 多義性が少ない主成分分析、対応分析などを兼用すると吉

アルゴリズム
- 主因子法、最尤法などなど
  - 主因子法: 安定した結果が得られる
  - 最尤法　: データが正規分布に従うときに利用すると吉。但し初心者向きではない

因子の回転
- 解釈の便宜のため,高い相関をもつ項目を共通因子として空間上の軸を決める操作を行う。これが因子の回転である。
- 種類
  - 直行回転、斜交回転
  - 直行回転: バリマックス(よく使われる)、バイコーティマックス、コーティマックス、エクィマックス
  - 斜交回転: プロマックス(よく使われる)、コバリミン、バイコーティミン、コーティミン

R
- パッケージ: stats
- 関数: factanal

サンプル

# A little demonstration, v2 is just v1 with noise,
# and same for v4 vs. v3 and v6 vs. v5
# Last four cases are there to add noise
# and introduce a positive manifold (g factor)
v1 <- c(1,1,1,1,1,1,1,1,1,1,3,3,3,3,3,4,5,6)
v2 <- c(1,2,1,1,1,1,2,1,2,1,3,4,3,3,3,4,6,5)
v3 <- c(3,3,3,3,3,1,1,1,1,1,1,1,1,1,1,5,4,6)
v4 <- c(3,3,4,3,3,1,1,2,1,1,1,1,2,1,1,5,6,4)
v5 <- c(1,1,1,1,1,3,3,3,3,3,1,1,1,1,1,6,4,5)
v6 <- c(1,1,1,2,1,3,3,3,4,3,1,1,1,2,1,6,5,4)
m1 <- cbind(v1,v2,v3,v4,v5,v6)
cor(m1)
factanal(m1, factors=3) # varimax is the default
factanal(m1, factors=3, rotation="promax")
# The following shows the g factor as PC1
prcomp(m1)

## formula interface
factanal(~v1+v2+v3+v4+v5+v6, factors = 3,
         scores = "Bartlett")$scores

## a realistic example from Bartholomew (1987, pp. 61-65)
utils::example(ability.cov)

対応分析

対応分析とは
- 頻度データ、質的データの個体と変数との関連性、パターン分析を行う手法
- コレスポンデンス分析とも呼ばれる
- 数量化III類と似ている

主にフランスでよく使われているとのこと。フランスフランス

R
- パッケージ: MASS
- 関数: mca

サンプル

farms.mca <- mca(farms, abbrev=TRUE)
farms.mca
plot(farms.mca)

多次元尺度法(MDS: Multi-Dimensional Scaling)

目標
- データの個体間の類似度や距離を求めてそれを2~3次元にプロットしてデータの構造やパターン形成などを把握する手法

分類
- 計量、非計量の二種類

計量的MDS
- 解析の流れ
  - 距離を求める
  - 座標値を求める
  - 2~3次元上で個体を配置する(散布図作成)
  - 信頼性などの考察

Rでの求め方
- cmdscaleを利用

サンプル(ヨーロッパの都市の距離データ)

require(graphics)

loc <- cmdscale(eurodist)
x <- loc[,1]
y <- -loc[,2]
plot(x, y, type="n", xlab="", ylab="", main="cmdscale(eurodist)")
text(x, y, rownames(loc), cex=0.8)

cmdsE <- cmdscale(eurodist, k=20, add = TRUE, eig = TRUE, x.ret = TRUE)
utils::str(cmdsE)

非計量的MDS
- 計量的MDSは直接距離などを求められる事を前提てしているが、心理データなどの親近性データは距離の性質を満たさない
- 距離の性質を満たさない類似性データも利用可能にしたのが非計量的MDS

キーワード
- カルスカスのストレス
  - ストレスを最小にする手法

R
- パッケージ: MASS, mlbench, e1071, vegenなど
- 関数: isoMDS, sammon, metaMDS

isoMDS

swiss.x <- as.matrix(swiss[, -1])
swiss.dist <- dist(swiss.x)
swiss.mds <- isoMDS(swiss.dist)
plot(swiss.mds$points, type = "n")
text(swiss.mds$points, labels = as.character(1:nrow(swiss.x)))
swiss.sh <- Shepard(swiss.dist, swiss.mds$points)
plot(swiss.sh, pch = ".")
lines(swiss.sh$x, swiss.sh$yf, type = "S")

クラスタ分析

大きく分けて大きく三種類ある
- 階層、非階層(グループ数指定)、モデルに基づく手法

階層的クラスタリング

注意事項
- 個体数が大きいと計算量が膨大になるので大規模データには不向き
- 小規模データでも適用は計画的に

解析の流れ
- データから距離(or 類似度)を求める
- クラスタ分析手法の適用
- コーフェン行列を求める
- コーフェン行列から樹形図を作る
- 結果についての検討を行う

手法
- 最近隣法, 最遠隣法, 群平均法, メディアン法, 重心法, ウォード法

R
- 関数: hclust

サンプル

require(graphics)

hc <- hclust(dist(USArrests), "ave")
plot(hc)
plot(hc, hang = -1)

## Do the same with centroid clustering and squared Euclidean distance,
## cut the tree into ten clusters and reconstruct the upper part of the
## tree from the cluster centers.
hc <- hclust(dist(USArrests)^2, "cen")
memb <- cutree(hc, k = 10)
cent <- NULL
for(k in 1:10){
  cent <- rbind(cent, colMeans(USArrests[memb == k, , drop = FALSE]))
}
hc1 <- hclust(dist(cent)^2, method = "cen", members = table(memb))
opar <- par(mfrow = c(1, 2))
plot(hc,  labels = FALSE, hang = -1, main = "Original Tree")
plot(hc1, labels = FALSE, hang = -1, main = "Re-start from 10 clusters")
par(opar)

非階層的クラスタリング

解析の流れ
- 1.K個のクラスタ中心を適当に決める
- 2.全てのデータのK個のクラスタ中心との距離を求め最も近いクラスタの分類
- 3.形成されたクラスタの中心を求める
- 2,3の繰り返し

R
- 関数: kmeans

サンプル

require(graphics)

# a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 2))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex=2)

## random starts do help here with too many clusters
(cl <- kmeans(x, 5, nstart = 25))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8)

モデルに基づく手法

model-based clustering
- 混合分布によるクラスタリング、潜在クラスタリングとも呼ばれる
- 確率分布の利用
- 最大尤度推測法を用いたEMアルゴリズムなど

R
- パッケージ: mclust
- 関数: EMclust, mclustBICなど

サンプル

irisBIC <- mclustBIC(iris[,-5])
irisBIC
plot(irisBIC)

subset <- sample(1:nrow(iris), 100)
irisBIC <- mclustBIC(iris[,-5], initialization=list(subset =subset))
irisBIC
plot(irisBIC)

irisBIC1 <- mclustBIC(iris[,-5], G=seq(from=1,to=9,by=2), 
                    modelNames=c("EII", "EEI", "EEE"))
irisBIC1
plot(irisBIC1)
irisBIC2  <- mclustBIC(iris[,-5], G=seq(from=2,to=8,by=2), 
                       modelNames=c("VII", "VVI", "VVV"), x= irisBIC1)
irisBIC2
plot(irisBIC2)

nNoise <- 450
set.seed(0)
poissonNoise <- apply(apply( iris[,-5], 2, range), 2, function(x, n) 
                      runif(n, min = x[1]-.1, max = x[2]+.1), n = nNoise)
set.seed(0)
noiseInit <- sample(c(TRUE,FALSE),size=nrow(iris)+nNoise,replace=TRUE,
                    prob=c(3,1))
irisNdata <- rbind(iris[,-5], poissonNoise)
irisNbic <- mclustBIC(data = irisNdata,
                      initialization = list(noise = noiseInit))
irisNbic
plot(irisNbic)

自己組織化マップ(SOM)

概要
- 教師データをもたないニューラルネットワークを利用したパターン分類手法。
- 高次元データを二次元平面へ非線形射影する
- 入力層と出力層により構成された二層のニューラルネットワーク

R
- パッケージ: kohonen, som
- 関数: som, somgrid, plot.kohonenなどなど

サンプル

data(wines)
set.seed(7)

training <- sample(nrow(wines), 120)
Xtraining <- scale(wines[training, ])
Xtest <- scale(wines[-training, ],
               center = attr(Xtraining, "scaled:center"),
               scale = attr(Xtraining, "scaled:scale"))

som.wines <- som(Xtraining, grid = somgrid(5, 5, "hexagonal"))

som.prediction <- predict(som.wines, newdata = Xtest,
          trainX = Xtraining,
          trainY = factor(wine.classes[training]))
table(wine.classes[-training], som.prediction$prediction)

線形回帰分析

回帰分析は最も一般的な手法且つ話題が豊富な手法

線形回帰分析とは
- 量的データを目的変数とした最も基本的な解析手法(って)
- 一つの説明変数の場合: 単回帰分析
- 複数の説明変数の場合: 重回帰分析

R
- 関数: lm

サンプル

require(graphics)

## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2,10,20, labels=c("Ctl","Trt"))
weight <- c(ctl, trt)
anova(lm.D9 <- lm(weight ~ group))
summary(lm.D90 <- lm(weight ~ group - 1))# omitting intercept
summary(resid(lm.D9) - resid(lm.D90)) #- residuals almost identical

opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
plot(lm.D9, las = 1)      # Residuals, Fitted, ...
par(opar)

## model frame :
stopifnot(identical(lm(weight ~ group, method = "model.frame"),
                    model.frame(lm.D9)))

### less simple examples in "See Also" above

非線形回帰分析

線形回帰分析とは
- 量的データを目的変数とした解析手法. 非線形

手法
- ロジスティクス回帰、多項式回帰、一般化線形モデル, 平滑化回帰と加法モデルなどなど

R
- パッケージ: stats, mgcv
- 関数: nls, glm(一般化線形モデル), gam(平滑化回帰)

線形判別分析

回帰分析との比較
- 回帰分析: 外的基準が量的データ
- 判別分析: 外的基準が質的データ

フィッシャーの線形判別関数
- irisの判別などなど
- 古典的手法

線形判別の注意事項
- 等分散の制約条件が必要
- 大量の変数には向かない

R
- 関数:lda
- predict, cross-validationなど併用する

サンプル

Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
                   Sp = rep(c("s","c","v"), rep(50,3)))
train <- sample(1:150, 75)
table(Iris$Sp[train])
## your answer may differ
##  c  s  v
## 22 23 30
z <- lda(Sp ~ ., Iris, prior = c(1,1,1)/3, subset = train)
predict(z, Iris[-train, ])$class
##  [1] s s s s s s s s s s s s s s s s s s s s s s s s s s s c c c
## [31] c c c c c c c v c c c c v c c c c c c c c c c c c v v v v v
## [61] v v v v v v v v v v v v v v v
(z1 <- update(z, . ~ . - Petal.W.))

非線形と比べるとあまり使われない

非線形判別分析

非線形判別分析とは
- 線形判別以外の全部(ぉ)の手法
- なので、非線形的手法以外にも距離に基づいた判別手法、多数決の判別方法、ベイズ判別方法、機械学習による判別方法も含む。

判別関数による判別分析
- 二次式に依る判別関数としてRだとqdaがある

R
- パッケージ: MASS
- 関数: qda

サンプル

tr <- sample(1:50, 25)
train <- rbind(iris3[tr,,1], iris3[tr,,2], iris3[tr,,3])
test <- rbind(iris3[-tr,,1], iris3[-tr,,2], iris3[-tr,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
z <- qda(train, cl)
predict(z,test)$class

距離による判別分析
- マハラノビス距離など利用

R
- 関数: mahalanobis

サンプル

require(graphics)

ma <- cbind(1:6, 1:3)
(S <-  var(ma))
mahalanobis(c(0,0), 1:2, S)

x <- matrix(rnorm(100*3), ncol = 3)
stopifnot(mahalanobis(x, 0, diag(ncol(x))) == rowSums(x*x))
        ##- Here, D^2 = usual squared Euclidean distances

Sx <- cov(x)
D2 <- mahalanobis(x, colMeans(x), Sx)
plot(density(D2, bw=.5),
     main="Squared Mahalanobis distances, n=100, p=3") ; rug(D2)
qqplot(qchisq(ppoints(100), df=3), D2,
       main = expression("Q-Q plot of Mahalanobis" * ~D^2 *
                         " vs. quantiles of" * ~ chi[3]^2))
abline(0, 1, col = 'gray')

多数決による判別分析
- k-NNなどがある

R
- パッケージ: class
- 関数: knn

サンプル

train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
knn(train, test, cl, k = 3, prob=TRUE)
attributes(.Last.value)

ベイズによる判別手法
- ベイズモデル利用

R
- パッケージ: e1071, klaRなど
- 関数: NaiveBayes

サンプル

data(iris)
mN <- NaiveBayes(Species ~ ., data = iris)
plot(mN)

mK <- NaiveBayes(Species ~ ., data = iris, usekernel = TRUE)
plot(mK)

生存分析

生存分析とは
- イベントが起きるまでの時間とイベントとのあいだの関係の関係に着目した手法

適用範囲
- 工学: 機械システムや製品の故障
- 医学分野: 疾患の病気の再発や死亡など
- 生存分析では故障、破壊、倒産、死亡などのイベントを広義で死亡とみなす

R
- パッケージ: survival
- 関数: survfitなど

サンプル

leukemia.surv <- survfit(Surv(time, status) ~ x, data = aml) 
plot(leukemia.surv, lty = 2:3) 
legend(100, .9, c("Maintenance", "No Maintenance"), lty = 2:3) 
title("Kaplan-Meier Curves\nfor AML Maintenance Study") 
lsurv2 <- survfit(Surv(time, status) ~ x, aml, type='fleming') 
plot(lsurv2, lty=2:3, fun="cumhaz", 
        xlab="Months", ylab="Cumulative Hazard")

時系列

目的
- 時系列データの変動の特徴を捉え、現象の解明と将来の予測、制御をするために利用

モデル
- AR, ARMA, ARIMA, ARFIMA, GARCH, VARなどなど

予測モデルを作る際にはacf, pacf, AIC, スペクトル分析, 場合によっては(複数時系列モデルなど)ccfを事前に確認する必要あり
後は単位根検定

AR
- 自己回帰モデル
ARMA
- 移動平均自己回帰モデル
ARIMA
- 移動平均自己回帰モデル(差分)
ARFIMA
- 自己回帰実数和分移動平均モデル
GARCH
- 自己回帰条件付き分散不均一モデル, ノーベル経済学賞
- 派生したものにTGARCH, APARCHなどをあり
VAR
- 多変量自己回帰モデル
- arで求められる

R
- パッケージ: tseries, fracdiff, fseries
- ar, arma, arima, fracdiff, garch などなど

サンプル

arima(lh, order = c(1,0,0))
arima(lh, order = c(3,0,0))
arima(lh, order = c(1,0,1))

arima(lh, order = c(3,0,0), method = "CSS")

arima(USAccDeaths, order = c(0,1,1), seasonal = list(order=c(0,1,1)))
arima(USAccDeaths, order = c(0,1,1), seasonal = list(order=c(0,1,1)),
      method = "CSS") # drops first 13 observations.
# for a model with as few years as this, we want full ML

arima(LakeHuron, order = c(2,0,0), xreg = time(LakeHuron)-1920)

## presidents contains NAs
## graphs in example(acf) suggest order 1 or 3
require(graphics)
(fit1 <- arima(presidents, c(1, 0, 0)))
tsdiag(fit1)
(fit3 <- arima(presidents, c(3, 0, 0)))  # smaller AIC
tsdiag(fit3)

カオス時系列
- 不規則に変動する時系列データを非線形的に解析する手法

R
- パッケージ: tseriesChaos
- 関数: embedd

サンプル

library(scatterplot3d)
x <- window(rossler.ts, start=90)
xyz <- embedd(x, m=3, d=8)
scatterplot3d(xyz, type="l")

決定木

IF-THENで分岐するtreeを作り描画
- わりかしよく使われる

CARTが有名
- 分類基準としてエントロピーとジニ変数を利用

分類基準
- F統計量, χ二乗統計量なども利用される

R
- パッケージ: mvpart
- 関数: rpart

サンプル

data(car.test.frame)
z.auto <- rpart(Mileage ~ Weight, car.test.frame)
summary(z.auto)
plot(z.auto); text(z.auto)

data(spider)
fit1 <- rpart(data.matrix(spider[,1:12])~water+twigs+reft+herbs+moss+sand,spider,method="mrt")
plot(fit1); text(fit1)
fit2 <- rpart(data.matrix(spider[,1:12])~water+twigs+reft+herbs+moss+sand,spider,method="mrt",dissim="man")
plot(fit2); text(fit2)
fit3 <- rpart(gdist(spider[,1:12],meth="bray",full=TRUE,sq=TRUE)~water+twigs+reft+herbs+moss+sand,spider,method="dist")
plot(fit3); text(fit3)

ニューラルネットワーク

非線形回帰分析、非線形判別分析(パターン認識)の有力な手法

誤差逆伝播法が有名でアレ
人間には140億個のニューロンがある

モデル
- 階層型ネットワーク、非階層型ネットワーク、単一中間層ネットワークなどなど

注意事項
- overfittingしやすい
- 学習方法によって結果が変わるなどなど

R
- パッケージ: nnet

サンプル

# use half the iris data
ir <- rbind(iris3[,,1],iris3[,,2],iris3[,,3])
targets <- class.ind( c(rep("s", 50), rep("c", 50), rep("v", 50)) )
samp <- c(sample(1:50,25), sample(51:100,25), sample(101:150,25))
ir1 <- nnet(ir[samp,], targets[samp,], size = 2, rang = 0.1,
            decay = 5e-4, maxit = 200)
test.cl <- function(true, pred) {
    true <- max.col(true)
    cres <- max.col(pred)
    table(true, cres)
}
test.cl(targets[-samp,], predict(ir1, ir[-samp,]))


# or
ird <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
        species = factor(c(rep("s",50), rep("c", 50), rep("v", 50))))
ir.nn2 <- nnet(species ~ ., data = ird, subset = samp, size = 2, rang = 0.1,
               decay = 5e-4, maxit = 200)
table(ird$species[-samp], predict(ir.nn2, ird[-samp,], type = "class"))

カーネル法とサポートベクターマシン

カーネルとは
- 非線形構造を線形構造に出来ると便利なのでカーネル関数でそうしたい
- K(x,x)的な感じの関数

適用範囲
- 密度関数の推定、主成分分析、正準相関分析、クラスタ分析、判別分析などなど

カーネル主成分分析
- KPCA(Kernel Principal COmponent Analysis)

解析の流れ
- カーネル関数をK(x,z)を求める
- データから写像行列Kn×nを求める
- Kn×nの固有値と固有ベクトルを求める
- 固有値と固有ベクトルの正規化

R
- パッケージ: kernlab
- 関数: kpca

サンプル

# another example using the iris
data(iris)
test <- sample(1:150,20)

kpc <- kpca(~.,data=iris[-test,-5],kernel="rbfdot",kpar=list(sigma=0.2),features=2)

#print the principal component vectors
pcv(kpc)

#plot the data projection on the components
plot(rotated(kpc),col=as.integer(iris[-test,5]),xlab="1st Principal Component",ylab="2nd Principal Component")

#embed remaining points 
emb <- predict(kpc,iris[test,-5])
points(emb,col=as.integer(iris[test,5]))

サポートベクターマシン
- 分類と回帰問題を主としたデータ解析手法
- マージン最大化問題に帰着

R
- パッケージ: kernlab
- 関数: ksvm (他にも色々ある)

サンプル

## simple example using the spam data set
data(spam)

## create test and training set
index <- sample(1:dim(spam)[1])
spamtrain <- spam[index[1:floor(2 * dim(spam)[1]/3)], ]
spamtest <- spam[index[((2 * ceiling(dim(spam)[1]/3)) + 1):dim(spam)[1]], ]

## train a support vector machine
filter <- ksvm(type~.,data=spamtrain,kernel="rbfdot",kpar=list(sigma=0.05),C=5,cross=3)
filter

## predict mail type on the test set
mailtype <- predict(filter,spamtest[,-58])

## Check results
table(mailtype,spamtest[,58])


## Another example with the famous iris data
data(iris)

## Create a kernel function using the build in rbfdot function
rbf <- rbfdot(sigma=0.1)
rbf

## train a bound constraint support vector machine
irismodel <- ksvm(Species~.,data=iris,type="C-bsvc",kernel=rbf,C=10,prob.model=TRUE)

irismodel

## get fitted values
fitted(irismodel)

## Test on the training set with probabilities as output
predict(irismodel, iris[,-5], type="probabilities")


## Demo of the plot function
x <- rbind(matrix(rnorm(120),,2),matrix(rnorm(120,mean=3),,2))
y <- matrix(c(rep(1,60),rep(-1,60)))

svp <- ksvm(x,y,type="C-svc")
plot(svp,data=x)


### Use kernelMatrix
K <- as.kernelMatrix(crossprod(t(x)))

svp2 <- ksvm(K, y, type="C-svc")

svp2


#### Use custom kernel 

k <- function(x,y) {(sum(x*y) +1)*exp(-0.001*sum((x-y)^2))}
class(k) <- "kernel"

data(promotergene)

## train svm using custom kernel
gene <- ksvm(Class~.,data=promotergene,kernel=k,C=10,cross=5)

gene


#### Use text with string kernels
data(reuters)
is(reuters)
tsv <- ksvm(reuters,rlabels,kernel="stringdot",kpar=list(length=5),cross=3,C=10)
tsv


## regression
# create data
x <- seq(-20,20,0.1)
y <- sin(x)/x + rnorm(401,sd=0.03)

# train support vector machine
regm <- ksvm(x,y,epsilon=0.01,kpar=list(sigma=16),cross=3)
plot(x,y,type="l")
lines(x,predict(regm,x),col="red")

集団学習

集団学習とは
- アンサンブル学習とも言われる
- 決して精度が高いとは言えない分類器の結果から学習を行ない制度が高い分類器の構築を行う

手法
- バギング、ブースティング、ランダムフォレスト
- ランダムフォレストは特に新しい。

バギング
- bagging(bootstrap aggregating), 1996年ブライマンによって提案(L.Breiman)
- 与えられたデータセットをブートストラップというリサンプリング法によって複数の学習データを作成
- 作成したデータセットに回帰・分析結果を統合、組み合わせることで精度をあげる
- ブートストラップ、サンプルはそれぞれ独立、学習は並列実行可能.

R
- パッケージ: adabag
- 関数: bagging

サンプル

## rpart library should be loaded
library(rpart)
data(iris)
names(iris)<-c("LS","AS","LP","AP","Especies")
lirios.bagging <- bagging(Especies~LS +AS +LP+ AP, data=iris, mfinal=10)

## rpart and mlbench libraries should be loaded
library(rpart)
library(mlbench)
data(BreastCancer)
l <- length(BreastCancer[,1])
sub <- sample(1:l,2*l/3)
BC.bagging <- bagging(Class ~.,data=BreastCancer[,-1],mfinal=25, maxdepth=3)
BC.bagging.pred <- predict.bagging(BC.bagging,newdata=BreastCancer[-sub,-1])
BC.bagging.pred[-1]

# Data Vehicle (four classes)
library(rpart)
library(mlbench)
data(Vehicle)
l <- length(Vehicle[,1])
sub <- sample(1:l,2*l/3)
Vehicle.bagging <- bagging(Class ~.,data=Vehicle[sub, ],mfinal=50, maxdepth=5)
Vehicle.bagging.pred <- predict.bagging(Vehicle.bagging,newdata=Vehicle[-sub, ])
Vehicle.bagging.pred[-1]

ブースティング
- 教師付きデータを用いて学習を行ない、学習結果を踏まえ重みの調整を繰り返す
- 複数の学習結果を求め、結果を統合・組み合わせをすることで精度をあげる
- AdaBoost(1996)が有名

R
- パッケージ: adabag
- 関数: adaboost.M1

サンプル

## rpart library should be loaded
library(rpart)
data(iris)
names(iris)<-c("LS","AS","LP","AP","Especies")
iris.adaboost <- adaboost.M1(Especies~LS +AS +LP+ AP, data=iris, boos=TRUE, 
        mfinal=10)

## rpart and mlbench libraries should be loaded
## Comparing the test error of rpart and adaboost.M1
library(rpart)
library(mlbench)
data(BreastCancer)
l <- length(BreastCancer[,1])
sub <- sample(1:l,2*l/3)

BC.rpart <- rpart(Class~.,data=BreastCancer[sub,-1], maxdepth=3)
BC.rpart.pred <- predict(BC.rpart,newdata=BreastCancer[-sub,-1],type="class")
tb <-table(BC.rpart.pred,BreastCancer$Class[-sub])
error.rpart <- 1-(sum(diag(tb))/sum(tb))
tb
error.rpart

BC.adaboost <- adaboost.M1(Class ~.,data=BreastCancer[,-1],mfinal=25, maxdepth=3)
BC.adaboost.pred <- predict.boosting(BC.adaboost,newdata=BreastCancer[-sub,-1])
BC.adaboost.pred[-1]

## Data Vehicle (four classes) 
library(rpart)
library(mlbench)
data(Vehicle)
l <- length(Vehicle[,1])
sub <- sample(1:l,2*l/3)
mfinal <- 25
maxdepth <- 5

Vehicle.rpart <- rpart(Class~.,data=Vehicle[sub,],maxdepth=maxdepth)
Vehicle.rpart.pred <- predict(Vehicle.rpart,newdata=Vehicle[-sub, ],type="class")
tb <- table(Vehicle.rpart.pred,Vehicle$Class[-sub])
error.rpart <- 1-(sum(diag(tb))/sum(tb))
tb
error.rpart

Vehicle.adaboost <- adaboost.M1(Class ~.,data=Vehicle[sub, ],mfinal=mfinal, 
        maxdepth=maxdepth)
Vehicle.adaboost.pred <- predict.boosting(Vehicle.adaboost,newdata=Vehicle[-sub, ])
Vehicle.adaboost.pred[-1]

ランダムフォレスト
- Random Forest(RF)はバギングの提唱者Breimalによって提案された新しい手法
- 精度、PCの資源節約の面でバギング、ブースティングより優秀

R
- パッケージ: randomForest
- 関数: randomForest

サンプル

## Classification:
##data(iris)
set.seed(71)
iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE,
                        proximity=TRUE)
print(iris.rf)
## Look at variable importance:
round(importance(iris.rf), 2)
## Do MDS on 1 - proximity:
iris.mds <- cmdscale(1 - iris.rf$proximity, eig=TRUE)
op <- par(pty="s")
pairs(cbind(iris[,1:4], iris.mds$points), cex=0.6, gap=0,
      col=c("red", "green", "blue")[as.numeric(iris$Species)],
      main="Iris Data: Predictors and MDS of Proximity Based on RandomForest")
par(op)
print(iris.mds$GOF)

## The `unsupervised' case:
set.seed(17)
iris.urf <- randomForest(iris[, -5])
MDSplot(iris.urf, iris$Species)

## Regression:
## data(airquality)
set.seed(131)
ozone.rf <- randomForest(Ozone ~ ., data=airquality, mtry=3,
                         importance=TRUE, na.action=na.omit)
print(ozone.rf)
## Show "importance" of variables: higher value mean more important:
round(importance(ozone.rf), 2)

## "x" can be a matrix instead of a data frame:
set.seed(17)
x <- matrix(runif(5e2), 100)
y <- gl(2, 50)
(myrf <- randomForest(x, y))
(predict(myrf, x))

## "complicated" formula:
(swiss.rf <- randomForest(sqrt(Fertility) ~ . - Catholic + I(Catholic < 50),
                          data=swiss))
(predict(swiss.rf, swiss))
## Test use of 32-level factor as a predictor:
set.seed(1)
x <- data.frame(x1=gl(32, 5), x2=runif(160), y=rnorm(160))
(rf1 <- randomForest(x[-3], x[[3]], ntree=10))

## Grow no more than 4 nodes per tree:
(treesize(randomForest(Species ~ ., data=iris, maxnodes=4, ntree=30)))

アソシエーション分析

紙おむつとビールのアレ

アソシエーション分析とは
- POSデータ等から有益な情報を見つける際に活用される
  - POSデータはトランザクション、バスケットと呼ばれる(手法を用いる際)
- 代表的な手法に相関ルール、頻出アイテムなどの手法がある

相関ルール
- トランザクションデータベースに頻繁に出てくるアイテム間の君合わせの規則の事
- IBMで開発されたAprioriが有名

R
- パッケージ: arules
- 関数: apriori

サンプル

data("Adult")
## Mine association rules.
rules <- apriori(Adult, 
                 parameter = list(supp = 0.5, conf = 0.9,
                                  target = "rules"))
summary(rules)

頻出アイテム
- 木構造の解析
- Eclatが有名

R
- パッケージ: arules
- 関数: eclat

サンプル

data("Adult")
## Mine itemsets with minimum support of 0.1.
itemsets <- eclat(Adult,
                  parameter = list(supp = 0.1, maxlen = 15))