R

[PR]

×
[PR]上記の広告は3ヶ月以上新規記事投稿のないブログに表示されています。新しい記事を書く事で広告が消えます。

ggplot2で色付き等高線グラフを描くには、geom_point() で大きめのマーカーを並べる方法と、geom_tile()で長方形を並べる方法の2通りがある。公式のマニュアルではgeom_tile()が紹介されているが、geom_point()の方が多少動作が軽いし、結果が安定する気がする。
東急目黒線沿いの賃料マップ（codeIQ Magazine参照）を再現した。

上がgeom_point(), 下がgeom_tile()で書いたものだが、見た目をそう変わらない。左のほうが全体に滲むような色合いなのに対し、右はタイルを敷き詰めているので白い線が見える。

library(ggplot2)
library(reshape2)
x <- read.csv('http://blog.cnobi.jp/v1/blog/user/ca2e456143c0d20195537cc5daa5fd14/1399760770', 
              sep='\t', as.is=T)
z <- read.csv('http://blog.cnobi.jp/v1/blog/user/ca2e456143c0d20195537cc5daa5fd14/1399761108', 
              sep='\t', as.is=T, header=F)

fit <- loess(rent ~ lon + lat, data=x, span=0.5)
gd <- 100
lon <- seq(min(x$lon), max(x$lon), length=gd)
lat <- seq(min(x$lat), max(x$lat), length=gd)
D <- expand.grid(lon=lon, lat=lat)
v <- melt(predict(fit, D), value.name='rent')
v$lon <- as.numeric(gsub('lon=', '', v$lon))
v$lat <- as.numeric(gsub('lat=', '', v$lat))

p <- ggplot(v, aes(lon, lat, z=rent)) 
p + geom_point(aes(col=rent), linetype='blank', size=5, shape=15) +
    stat_contour(bins=10, col='white', size=1.2) + 
    stat_contour(bins=10) +
    geom_line(data=z, aes(V4, V3, z=V3), size=2, col='lightgreen') + 
    geom_line(data=z, aes(V4, V3, z=V3), size=1.2, color='coral4') + 
    geom_text(data=z, aes(V4, V3, z=V3, label=V1), 
              size=5.12, col='white', fontface=2) + 
    geom_text(data=z, aes(V4, V3, z=V3, label=V1), 
              fontface=2, size=5) + 
    scale_colour_gradientn(colours=rich.colors(10))

p + geom_tile(aes(fill=rent)) + 
    stat_contour(bins=10, col='white', size=1.2) + 
    stat_contour(bins=10) + 
    geom_line(data=z, aes(V4, V3, z=V3), size=2, col='lightgreen') + 
    geom_line(data=z, aes(V4, V3, z=V3), size=1.2, color='coral4') + 
    geom_text(data=z, aes(V4, V3, z=V3, label=V1), 
              size=5.12, col='white', fontface=2) +
    geom_text(data=z, aes(V4, V3, z=V3, label=V1), 
              fontface=2, size=5) + 
    scale_fill_gradientn(colours=rich.colors(10))

ggplot2（２）：ヒストグラム・箱ひげ図

ggplot2で、ヒストグラムと箱ひげ図を描く方法。

いずれも１変数の分布を表現する方法だが、ヒストグラムは分布の形状をより詳しく伝え、箱ひげ図はカテゴリ間の分布の差を比べるのに適している。

例として、松坂大輔投手の投球データを用いた（出典, もともとのソースは（株）データスタジアム）。

> head(x)
     game.id picher.id   type speed batter.name runner picher.name
1 2005061405     11715 カーブ   111    赤星　憲広      0   松坂　大輔
2 2005082405     11715 カーブ   111      石本　努      0   松坂　大輔
3 2005032601     11715 カーブ   112    北川　博敏      0   松坂　大輔
4 2005050504     11715 カーブ   112    セギノール      0   松坂　大輔
5 2005053102     11715 カーブ   112        ウッズ     10   松坂　大輔
6 2005053102     11715 カーブ   112    アレックス      0   松坂　大輔
  visitor.id home.id      hv
1          5       7    home
2          8       7    home
3         11       7    home
4          8       7    home
5          7       4 visitor
6          7       4 visitor

(1)
* aes(x)に、分布を調べる変数を指定する。
* colourで外枠、fillで塗りつぶす色を指定する。　

library(ggplot2)
p <- ggplot(x, aes(x=speed)) + 
     ggtitle('松坂大輔投手の球速 2005') + xlab('球速 (km/h)')
p + geom_histogram(binwidth=1, colour='grey5', fill='grey35')

(2) 密度推計を重ねる。
* ..density..というのは定数か何か。こう書くものと覚える。
* sizeで線の太さを変える。
* 白い太めの先に黒い線を重ねて見やすくしている。

p + geom_histogram(binwidth=1, aes(y=..density..), 
                   colour='grey5', fill='grey35') +
    geom_density(colour='azure', size=2) + 
    geom_density(colour='black', size=1.5)

(3) 球種ごとの分布を比べるには・・・
* グループ別の密度推計だけを重ねる。alphaを1より小さい値に指定する（薄く塗りつぶす）と重なった時に綺麗。
* またはfacet_grid()を用いて、球種ごとに描画する。
* 帯グラフっぽく、グループで色分けして重ねる手もある。

p + geom_density(aes(colour=type, fill=type), alpha=.3, size=1) 
p + geom_histogram(binwidth=2, aes(y=..density..), 
                   colour='grey5', fill='grey35') + 
    geom_density() + facet_grid(type ~. )
p + geom_histogram(aes(fill=type), 
                   binwidth=1, stackgroup=T, colour='grey')

(4) カテゴリ変数を増やす。
* facet_grid()の指定を変更する。
* facet_grid()は、欠損値を拾ってしまう、それらを除いたデータを用いて描画する。

p <- ggplot(subset(x, !is.na(hv)), aes(x=speed)) +
     ggtitle('松坂大輔投手の球速 2005') + xlab('球速 (km/h)')
p + geom_histogram(binwidth=2, aes(y=..density..), 
                   colour='grey5', fill='grey35') +
    geom_density() + facet_grid(type ~ hv)

(5) 比較するカテゴリの数が多いときは、箱ひげ図が有効。

p <- ggplot(subset(x, !is.na(hv)), aes(type, speed)) +
     ggtitle('松坂大輔投手の球速 2005') + xlab('球種') + ylab('球速 (km/h)')
p + geom_boxplot(aes(fill=hv), alpha=0.5) +
    theme(legend.position='bottom', legend.title=element_blank())

x <- read.table('http://blog.cnobi.jp/v1/blog/user/ca2e456143c0d20195537cc5daa5fd14/1396483677', 
                as.is=T, header=T)
names(x) <- c('game.id', 'picher.id', 'type', 'speed', 'batter.name', 






            'runner', 'picher.name', 'visitor.id', 'home.id')
x$hv <- NA
x$hv[x$home.id %in% 7] <- 'home'
x$hv[x$visitor.id %in% 7] <- 'visitor'

library(ggplot2)
p <- ggplot(x, aes(x=speed)) + 
     ggtitle('松坂大輔投手の球速 2005') + xlab('球速 (km/h)')
p + geom_histogram(binwidth=1, colour='grey5', fill='grey35') # ...(1)
p + geom_histogram(binwidth=1, aes(y=..density..), 
                   colour='grey5', fill='grey35') +
    geom_density(colour='azure', size=2) +
    geom_density(colour='black', size=1.5) # ...(2)
p + geom_density(aes(colour=type, fill=type), alpha=.3, size=1)
    # ...(3a)
p + geom_histogram(binwidth=2, aes(y=..density..), 
                   colour='grey5', fill='grey35') +
    geom_density() + facet_grid(type ~. ) # ...(3b)
p + geom_histogram(aes(fill=type), 
                   binwidth=1, stackgroup=T, colour='grey') # ...(3c)

p <- ggplot(subset(x, !is.na(hv)), aes(x=speed)) + 
     ggtitle('松坂大輔投手の球速 2005') + xlab('球速 (km/h)')
p + geom_histogram(binwidth=2, aes(y=..density..), 
                   colour='grey5', fill='grey35') +
    geom_density() + facet_grid(type ~ hv) # ...(4)

p <- ggplot(subset(x, !is.na(hv)), aes(type, speed)) +
     ggtitle('松坂大輔投手の球速 2005') + xlab('球種') + ylab('球速 (km/h)')
p + geom_boxplot(aes(fill=hv), alpha=0.5) +
    theme(legend.position='bottom', legend.title=element_blank()) 
    # ...(5)

ggplot2（１）：棒グラフ

ggplot2はRのグラッフィク・ライブラリの１つだ。あまり詳しい設定をいじらなくても、結構きれいな画像を描画してくれる。しかも、しっかりと仕組みを理解すればかなり自由にカスタマイズすることも可能らしい。一念発起して、ggplot2の習得を目指す。

今回は、棒グラフの描き方。例として、2013年の日本棋院賞金ランキングを使う（出所）。

基本公式

p <- ggplot(data, aes(...))  # オブジェクトの作成
p + geom_bar() ...           # 描画

ポイント

* ggplot関数は、必要なデータが１つのデータフレームに入っているようなときに便利。よりシンプルなqplot関数も用意されているが、より汎用性の高いggplot2を覚えれば、qplotでできることは全てできる。

* ggplotオブジェクトをまず作り、そこにどんどん描画（レイヤー）を足していくイメージ。

* aes (エステティック)は、グラフの構成要素と考えればよい。xは横軸変数、fillは塗りつぶしに使う色など。

* geom_bar関数は、カテゴリごとの個体数を表示するようにデザインされた。

このカテゴリに対応する変数を、

aes(x)として指定する。

* 棒の長さに対応する値はデータから個体数を数えて求めるものであって、今回のように値がすでに与えられているようなケースでは少し工夫が必要：横軸変数（順位）をaes(x)に指定し、値変数（賞金額）をaes(weight)に指定する。

* aes変数は、あとから追加できるし、指定を変更することもできる。

* 上から描画を足すことができる。aes変数は原則としてそのまま使われるが、別のものを指定してもいい。

x <- read.table('http://blog.cnobi.jp/v1/blog/user/ca2e456143c0d20195537cc5daa5fd14/1396338447',
                header=F, as.is=T)
names(x) <- c('sei', 'mei', 'shokin', 'shozoku')
x$shokin <- x$shokin/10000
x$rank <- as.factor(1:nrow(x)) # use as category
x$sei <- factor(x$sei, levels=x$sei) # use as category
barwidth <- .3 

library(ggplot2)
library(scales)  # comma() function
p <- ggplot(x, aes(rank, weight=shokin))
p + geom_bar(width=barwidth) + xlab('') + ylab('賞金額（万円）') +
    ggtitle('日本棋院賞金ランキング 2013') +
    scale_y_continuous(labels=comma) # ...(1)

p + geom_bar(aes(fill=shozoku), width=barwidth) +
    xlab('') + ylab('賞金額（万円）') +
    ggtitle('日本棋院賞金ランキング 2013') +
    scale_y_continuous(labels=comma) +
    theme(legend.position='bottom', legend.title=element_blank())
    # ...(2)

p + geom_bar(aes(x=sei), width=barwidth) +
    xlab('') + ylab('賞金額（万円）') +
    ggtitle('日本棋院賞金ランキング 2013') +
    scale_y_continuous(labels=comma)  # ...(3)

p <- p + geom_bar(width=barwidth) +
        xlab('') + ylab('賞金額（万円）') +
        ggtitle('日本棋院賞金ランキング 2013') +
        scale_y_continuous(labels=comma) # 中間地点
p + geom_text(aes(y=shokin, label=sei), 
              col='white', size=5.2, fontface=2) +
    geom_text(aes(y=shokin, label=sei)) # ...(4)

p + geom_bar(width=barwidth) +
    xlab('') + ylab('賞金額（万円）') +
    ggtitle('日本棋院賞金ランキング 2013') +
    scale_y_continuous(labels=comma) +
    geom_text(aes(y=shokin, label=sei), vjust=-0.1) # ...(5)

p + geom_bar(width=barwidth) + xlab('') + ylab('賞金額（万円）') +
    ggtitle('日本棋院賞金ランキング 2013') +
    scale_y_continuous(labels=comma) +
    geom_text(aes(y=shokin, label=sei), hjust=-0.3) # ...(6)

進捗報告を出力（Rの場合）
Pythonで進捗報告を画面に出力する方法（記事へのリンク）は、そのままRにも応用できる。

ポイント
* 出力にはcatを使う。
* flush.consoleは現在の環境では必要なかったが、おまじないとして書いておく。
* 一定時間止めるには、Sys.sleep。
* Pythonでの '='*10 のような、シンプルに繰り返し文字列をつくるコマンドはないようだ。
* Built-inの関数をつかいたい場合は、txtProgressBarというものが用意されている。
```
N <- 500
for (n in 1:N) {
    cat(sprintf('\r[%-20s %3.0f%%]', 
               paste(rep('=', n/N*20), collapse=''), n/N*100))
    flush.console()
    Sys.sleep(.01)
}
```
```
N <- 500
for (n in 1:N) {
    ch <- substring('-<>', (n %% 3) + 1, (n %% 3) + 1)
    ch <- paste(rep(ch, 10), collapse='')
    cat(sprintf('\r%s %3d/%3d %s', ch, n, N, ch))
    flush.console()
    Sys.sleep(.05)
}
```
- R
- 2014/04/01 01:46
- 0