2013-05-03

Topic ModelとかLDAとかのメモ書き。

データ分析

ちょっと野暮用があり、Topic Modelの実験。

説明とか

私のブックマーク: Latent Topic Model (潜在的トピックモデル)
- 人工知能学会の私のブックマークから。
latent Dirichlet allocation - 機械学習の「朱鷺の杜Wiki」
- LDAについて。
(pdf)Probabilistic topic models
- Blei先生の解説記事。
Topic Models - videolectures.net
- Blei先生の講義ビデオ。

Software

GibbsLDA++: A C/C++ Implementation of Latent Dirichlet Allocation (LDA) using Gibbs Sampling for Parameter Estimation and Inference
- とりあえず使ってみたのはこれ。

plda - A parallel C++ implementation of fast Gibbs sampling of Latent Dirichlet Allocation - Google Project Hosting
- MPI環境で並列で動かせるC++の実装。

その他
- David M.Blei先生のサイトの、"Topic modeling software"というところに、複数のtopic modeling用のopen sourceが紹介されている。

2013-05-03

データ分析者不足について雑感。

データ分析雑文

ここ数年、最近は広義の意味でのデータ分析全体が熱く、分析者に対する需要も高くなっているけれども、正直、ここまでの今の状態は一過性なのかな、と。

俯瞰してみるならば、企業でのデータ分析活用のためのバリューチェーンのなかで、現在は属人的で高付加価値となる箇所が多く、また分析者の需要過多であるため、ボトルネックとなってしまっているというのが今の状態。　しかし、企業がアナリティクスを活用できる状態になっている頃には、長期的には分析と活用の標準化によって解消されている、そもそもとして、標準化がなされている状況でなければ、職人芸を超えて業界として活用できる状態にはなっていないのではないかと思う。

例えば、様々なビジネスの中では高付加価値ビジネスに位置づけられるコンサルティングサービスにしても、今や収益の圧力からか標準化された方法論に移行をしようと手探りしている状況を目の当たりにして、コンサルティングビジネスでもそうであるならば、分析業界もこれからの発展の先には同じ流れが待っているな、と感じた次第。

ここでいう属人的という背景には、多様なスキルセットが必要という意味で、現時点で広く言われているデータサイエンティストというポジションの名前の背景には、統計学、人工知能や機械学習、コンピュータサイエンスの知識とエンジニアリングの経験、EnterpriseITの理解、ビジネスサイドに繋げるためのチェンジマネジメントに必要なスキルとしてのコンサルティング能力などなどを包括して総称しているような感じではあるが、そこに統一的な定義が無い理由は、それは組織側はデータ分析を活用するためのcapabilityをそのまま単一のポジションの単一のロールに直結させてしまっていることが理由で、結果、企業の種類だけ要件はあるというある意味納得がしやすい状況なのだと思う。

これに対して、分析業界からは「分析者のスキルセットを定義しよう」という流れがあるけれども、本来は「企業活動での分析のバリューチェーンとプロセスを定義しよう」も同時にあるべきものであり、そこに対してポジションが決まり、スキルセットが決まって人材マーケットでの要件が決まる、という流れが本来の流れではあるのではないかと思う。
現状のこの取り組みのレベル感は、企業によってかなりの温度差があるのだけれども、全体の底上げというのは、特定の目的に特化したソリューションとそこでの標準化されたオペレーションプロセス、によって実現されるのではなかろうか、と個人的には思っている。

既存のプロセスがないところで、人がいないけれども、最新の技術を手に入れたいとするならば、そういったソリューションを導入することは自然な流れ。　完璧でなくても、お手軽にROIの高いところを狙うならば、レコメンデーションにしたって、ターゲッティングであるとか、コンテンツの出し別けとか、将来的には動的なプライシングとか、そういったソリューションをさくっとブラックボックスとして利用するのがより一般的になり、結果として、それらブラックボックスが入り込んだところから分析のプロセスが定義されていくのではなかろうかな、と。

人材マーケットに目を向けると、そういったソリューションが使われ始めると、極端な話、それらパッケージのパラメータ調整をするだけで、特定の目的はそこそこ達成出来てしまう。　アナリストが不足するという例のレポートがあるけれども、そんな頃にはそれらパラメータ調整をするだけのJobタイトルも「アナリスト」となっており、組織としても分析を活用しており、結果として産業全体でのアナリスト不足という問題は、もしかしたらそこまでの問題にはなっていないのかもしれない。

その時に、今現在でいうところのデータサイエンティスト的な人が必要な場所というのは、企業ごとにカスタマイズしてthe bestを狙うという場所か、若しくはブラックボックスの中身を作る人という位置づけになり、それはアナリスト全体でみるとごく一部という業界図になっているのかも。

個人的には職人芸の余地とそこの価値が残っていてほしいけれども、「マイスター制」を超えたプロセスを確立できなければ、産業としてはスケールしていないということで。
データサイエンティストが不足なのだー、という言葉を聞くと、テイラー先生の「熟練などというものは存在しない」という言葉が頭をリフレインするわけですよ。

2013-04-28

O'REILLY / "R Graphics Cookbook"

R Graphics Cookbookを買ったり。主にggplot2まわりの説明がされている感じ。
とりあえずさくっと検索して使うといった、ちょっとしたリファレンス用途によい感じ。

R Graphics Cookbook

作者: Winston Chang
出版社/メーカー: Oreilly & Associates Inc
発売日: 2013/01/03
メディア: ペーパーバック
この商品を含むブログを見る

amazonのリンクを張っておきながらアレですが、O'REILLYから直接かった方が、DRM freeなのでよさげ。
R Graphics Cookbook - O'Reilly Media

ちなみにEbook版は$31.99ですが、discount code="AUTHD"で50% offの約$16に。

2013-04-07

Rでオセロの木探索とか。

雑文 R

統計処理のための処理系であるRで、あえてオセロ。
ちゃんとコンピュータと対戦できます。
さくっJokeのつもりで書きはじめたのですが、細かいところにこだわりはじめたら意外と長くなってしまいました。

操作画面

マウスで操作できる人間にやさしい作り。
locator()で座標を取得して入力という無理矢理感。
でも、置ける候補の場所も表示してくれる、人間にやさしい親切設計。

ロジック

って言う程のものもないのですが…。
深さ優先探索をして、MinMax法で探索しています。
ミニマックス法 - Wikipedia

盤に対する評価値は、ヒューリスティック:)な固定値です。
気が向いたら強化学習の仕組みでも入れてみます。TD(λ)とか。(たぶん、気が向かなさそうですが。。)

具体的なアルゴリズムについては、こちらを。

ボードゲームの思考アルゴリズムの入門 - Thinking out loud

使い方

下のコードを読み込んだあと、

> othello()

で実行。
othello(TRUE)とすると、ターミナルに探索の過程が表示されます。

1st player( WHITE ) is 
1: Human
2: Computer

みたいな画面が出てくるので、白、黒ごとに人間が打つか、コンピュータに打たせるかを選んでください。
両方とも2(Computer)を選択すると、コンピュータ同士でほげります。

Rのコード

CONST.TIME.LIMIT <- 6  # threshold of time(sec) for searching.
CONST.MAX.DEPTH <- 3  # max depth of depth-first searching.
##' main
othello <- function(verbose = FALSE) {
    ## initialize.
    board <- as.numeric(rep(0,8^2))
    turn <- 1
    board[conv.xy2index(4,4)] <- board[conv.xy2index(5,5)] <- 1
    board[conv.xy2index(4,5)] <- board[conv.xy2index(5,4)] <- -1

    ## determine player types.
    turn.type <- c()
    for (i in c(3,1)) {
        cat(ifelse(i == 3,"1st","2nd"),"player(",ifelse(i == 3,"WHITE","BLACK"),") is ")
        turn.type[i] <- c("Human","Computer")[menu(c("Human","Computer"))]
    }
    
    while (TRUE) {
        if ( length(get.index.available(turn, board)) == 0 ) {
            if ( length(get.index.available(-turn, board)) == 0)
                break  # game over, no one can put disc.
            turn <- turn * -1
            next  # pass
        }
        if ( turn.type[turn + 2] == "Human")
            board <- set.disc(display.board.plot(board, input=TRUE, turn=turn), turn=turn, board=board)  # input by player
        if ( turn.type[turn + 2] == "Computer") {
            display.board.plot(board,turn=turn)
            index <- search.tree(depth = 0, depth.max=CONST.MAX.DEPTH, turn, board, verbose)  # computer search the next.
            board <- set.disc(index, turn=turn, board=board)
        }
        turn <- turn * -1
    }
    display.board.plot(board,turn=turn)
}
##' evaluating board as score
eval.board <- function(board) {
    score <- c( 8,-2, 3, 3, 3, 3,-2, 8,
               -2, 1, 2, 2, 2, 2, 1,-2,
                3, 2, 1, 1, 1, 1, 2, 3,
                3, 2, 1, 1, 1, 1, 2, 3,
                3, 2, 1, 1, 1, 1, 2, 3,
                3, 2, 1, 1, 1, 1, 2, 3,
               -2, 1, 2, 2, 2, 2, 1,-2,
                8,-2, 3, 3, 3, 3,-2, 8)
    return (sum(score * board) + sum(board))
}
##' depth-first search, root must be depth = 0.
search.tree <- function(depth = 0, depth.max, turn, board, verbose=FALSE) {

    ## for verbose
    outputlog <- function(str="") {
        cat(paste(paste(rep(".",depth),collapse=""),"f(",index,")=",minmax$score, str, collapse=""),"\n")
    }

    if (depth == 0) {
        search.tree.time <<- proc.time()  # for checking condition on interruption.
        search.tree.terminate <<- FALSE  # for interruption event.
    }
    if (depth == depth.max) {
        if (CONST.TIME.LIMIT <= (proc.time() - search.tree.time)[3])
            search.tree.terminate <<- TRUE
        return (eval.board(board))
    }
    minmax <- c()
    minmax$score <- -1e9 * turn
    minmax$index <- NA
    for (index in get.index.available(turn, board)) {
        board.tmp <- set.disc(index=index, turn=turn, board=board)
        ## depth first search by recursive function.
        score <- search.tree(depth = depth + 1, depth.max = depth.max, turn= -1 * turn, board=board.tmp, verbose=verbose)

        ## Min Max Algorithm
        ## - for white turn
        if (turn == 1 && minmax$score <= score) {
            minmax$score <- score
            minmax$index <- index
            if (verbose)
                outputlog()
        }
        ## - for black turn
        if (turn == -1 && score <= minmax$score ) {
            minmax$score <- score
            minmax$index <- index
            if (verbose)
                outputlog()
        }
        ## Check exceeding the threshold time(sec) in searching.
        if (search.tree.terminate) {
            minmax$score <- eval.board(board)
            if (verbose)
                outputlog("* Termination")
            break
        }
    }
    ## root on searching
    if (depth == 0)
        return (minmax$index)
    else
        return (minmax$score)
}
##' Display board (Or input)
##' (input = TRUE) : input by user.
display.board.plot <- function(board, input = FALSE, turn = NA) {
    image(matrix(rep(c(seq(8) %% 2,(seq(8) + 1) %% 2), 4), nrow=8) ,col=c("green","darkgreen"), axes=FALSE)
    title(paste(ifelse(turn== 1,"[White]"," White ")," = ",sum(board==1),"  vs  ",
                ifelse(turn==-1,"[Black]"," Black ")," = ",sum(board==-1),sep=""))
    grid(8,8,lty=1,col="green")
    for (i.turn in c(1,-1)) {  # plot disc
        place <- sapply(which(board == i.turn), conv.index2xy)
        points(as.integer(place / 10 - 1) * 1/7, (place %% 10 - 1) * 1/7,
               cex=5/7*min(dev.size()), col=ifelse(i.turn == 1,"white","black"),pch=16)
    }
    if (!is.na(turn)) {  # plot availability
        place.xy <- sapply(place.index <- get.index.available(turn, board), conv.index2xy)
        if (0 < length(place.index))
            points(as.integer(place.xy / 10 - 1) * 1/7, (place.xy %% 10 - 1) * 1/7,
                   cex=5/7*min(dev.size()), col=ifelse(turn == 1,"white","black") ,pch=1,lty=4)
    }
    while (input) {  # input using locator()
        input.location <- locator(1)
        if ((input.place <- conv.xy2index(min(which(input.location$x < seq(1/7/2,1+1/7/2,by=1/7) )),
                                          min(which(input.location$y < seq(1/7/2,1+1/7/2,by=1/7) )))) %in% place.index) {
            return ( input.place )
        }
    }
}
##' convert b/w coordinate and index.
conv.xy2index <- function(x, y) {
    (y-1) * 8 + x
}
conv.index2xy <- function(index) {
    tmp <- index %% 8
    ifelse(tmp == 0, 8, tmp) * 10 + ifelse(tmp == 0, as.integer(index / 8), as.integer(index / 8) + 1)
}
##' get all index available.
get.index.available <- function(turn, board) {
    which(sapply(seq(1,8),
                  function(y) { sapply(seq(1,8),
                                       function(x) {
                                           set.disc(conv.xy2index(x, y) ,turn=turn, board=board, check=TRUE)
                                       })}))
}
##' (check=FALSE) set turn's disc on the specified.
##' (check=TRUE)  check to set turn disc on the specified index.
set.disc <- function(index=NA, turn, board, check = FALSE) {
    if (check && board[index] != 0)
        return(FALSE)
    if (!check)
        board.new <- board
    for (x in seq(-1,1)) {
        for (y in seq(-1,1)) {
            if (x == 0 && y == 0)
                next
            x.new <- as.integer(conv.index2xy(index) / 10) + cumsum(rep(x,7))
            y.new <- conv.index2xy(index) %% 10 + cumsum(rep(y,7))
            index.new <- 1 <= x.new & x.new <= 8 & 1 <= y.new & y.new <= 8  # boundary validation.
            move.list <- unlist(mapply(conv.xy2index, x.new[index.new], y.new[index.new]))
            tmp <- unique(board[move.list])
            if (2 <= length(tmp) && tmp[1] == -turn && tmp[2] == turn) {
                if (check) return (TRUE)  # possible to put.
                board.new[ c(index, move.list[seq(1, min(which(board[move.list] == turn))-1)]) ] <- turn  # reverse discs.
            }
        }
    }
    if (check)
        return (FALSE)  # impossible to put.
    return (board.new)
}
## play
othello()

また、無駄な物を書いてしまった。。

2013-03-25

Rでべき乗分布(power-law)分布の乱数生成

R データ分析

お仕事で、べき乗分布(power-law)に従った乱数発生をする必要が生じたので、さくっと。

べき乗分布の乱数

検索したら出てきたのがこれ。
Random Number -- from Wolfram MathWorld
べき乗分布に従った乱数 $X$ は、 $y$ が $[0,1]$ の一様乱数、 $n$ がべき乗係数、 $X \in [x_0, x_1]$ のとき、以下となる。
$X = [ (x_1^{n+1} - x_0^{n+1})y + {x_0}^{n+1}]^{\frac{1}{(n+1)}}$

とりあえず、べき乗分布乱数を発生させる関数、rpowerlaw() をほげってみる。

rpowerlaw <- function (n, dist.power = 0, x0 = 0, x1 = 1) {
    sapply(runif(n),
           function (x.unif) (((x1 ^(dist.power + 1) - x0 ^(dist.power + 1))
                               * x.unif + x0 ^(dist.power + 1))
                              ^(1 / (dist.power + 1)))
           )
}

使用例

10000個の、0..1の範囲で分布に従った乱数を生成する場合、とか。

hist(rpowerlaw(10000,dist.power = 7, 0, 1))

f:id:alcuin:20130325042739p:plain

うむ、ろんぐてーるー。

参考

Power-law Distributions

This page is a companion for the SIAM Review paper on power-law distributions in empirical data, written by Aaron Clauset (me), Cosma R. Shalizi and M.E.J. Newman. This page hosts implementations of the methods we describe in the article, including several by authors other than us. Our goal is for the methods to be widely accessible to the community.