goinger's diary

Notes from the SIG-KBS (Special Interest Group on Knowledge-Based Systems) talk on computer science and statistics at Google Marketing

There was a Google talk at SIG-KBS and I took notes, so I am posting them here. I have heard about three Google-related talks before, but never one on statistics, so this one was very interesting.

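The mood was very much "please do not stream this on Ustream or photograph the slides and post them on the web", but writing a blog post is apparently fine, so here it is. These are my notes as taken, so there is no extra explanation.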

The explanations of the language optimized for in-house use and of the statistical models used for analysis were interesting.

CS and statistics

Computer Science and Statistics at Google Marketing

Joined Google after graduating from the University of Tokyo

2009 Quantitative Marketing Manager @ Tokyo

PhD in Engineering from the University of Tokyo (focused on computational science)

modeling, computational and simulation analysis of complex networks

Visualization of large-scale complex graphs!

There is only one person

Member

Web search

Statistics

1 Let others speak for you
(2 and 3 are the important ones)
2 Data not hype
3 Results must be trackable
4 Promote trial
5 You're smart and your time matters.
6 We're serious. Except when we're not.
7 Big ideas move us.

Seven principles of Google Marketing

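It is not the case that every employee has broad access to data

Intranet search engine

Source code access is limited to software engineers, but apparently the statistics people can see it too

Restrictions are strict

How should non-engineers look at log data?
Tools for reporting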

analysis skills
engineering skills
product and market-specific knowledge and expertise
extensive analytical and statistical skills.

analyses to help inform marketing strategies for key products

We have the same data/logs access privileges as software engineers

We are supposed to be data analysis professionals

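Two case studies

Display Ad Experiments

Goes up from 100 to 120
- this alone does not tell you anything

Effect

Test & control ads
- measuring the advertising effect of video is difficult

Comparative experiments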
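As a rough illustration of why "100 to 120" says little without a control group, here is a minimal Python sketch. The control-group numbers and the function name `relative_change` are my own assumptions for illustration; they were not in the talk.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after` (0.2 means +20%)."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before


# Index value from the notes: the group shown the ad goes from 100 to 120.
test_before, test_after = 100.0, 120.0
# Hypothetical control group that never saw the ad but also rose.
control_before, control_after = 100.0, 115.0

naive_change = relative_change(test_before, test_after)
control_change = relative_change(control_before, control_after)

# Only the part of the change not seen in the control group is
# attributed to the ad in this simple comparison.
ad_effect = naive_change - control_change

print(f"naive: {naive_change:.0%}, control: {control_change:.0%}, "
      f"attributed to ad: {ad_effect:.0%}")
```

With these made-up numbers, most of the 20% rise would have happened anyway, which is the point of running test and control ads side by side.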

Media Mix

Statistical models

How does data analysis work at Google

Data analysis Visualization(R, SQL, visualization SQL)
Python
SQL
Statistical models

Google Technology
MapReduce
Google File System
Bigtable
Visualization API

Programming Languages
- Sawzall
- Python
- JavaScript

Statistical Analysis
- R
- SQL

What they do resembles university research

Quite a few people inside Google write R libraries.

Data analysis procedure

datapull from various logs
datapull from other data sources
aggregate and process
statistical analysis -- apply statistical models on the data
visualize and publish as presentation or report

Logs(70%)

access log
client download/update log

We don't use an RDBMS at this stage

The data is simply too large
It requires distributed computing with many machines
Often no complex data manipulation is needed

The goal of data analysis is often rather simple
- sum, histogram, max, min, top-N, filtering
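To give a feel for how simple these goals are, here is a small single-machine Python sketch of the operations listed above. The record layout (request URL, response size) and the function name `summarize` are assumptions of mine, not something shown in the talk, and the real thing runs distributed over far larger logs.

```python
from collections import Counter

# Hypothetical parsed log records: (request URL, response size in bytes).
records = [
    ("/search", 1200),
    ("/maps", 5400),
    ("/search", 900),
    ("/mail", 300),
    ("/search", 1500),
    ("/maps", 4100),
]


def summarize(records, n=2, min_bytes=1000):
    """Compute sum, max, min, a histogram, top-N and a filtered list."""
    sizes = [size for _, size in records]
    hits = Counter(url for url, _ in records)  # histogram of hits per URL
    return {
        "sum": sum(sizes),
        "max": max(sizes),
        "min": min(sizes),
        "histogram": dict(hits),
        "top_n": hits.most_common(n),
        "filtered": [r for r in records if r[1] >= min_bytes],
    }


summary = summarize(records)
print(summary["sum"], summary["top_n"])
```

None of this needs a join, which fits the remark that an RDBMS is not required at this stage.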

The analysis methods are often simple

Terabytes get processed in one go

No RDBMS is needed
- the hard part is joins
- there is no need to do that kind of thing

Log structure
- request URL

http://code.google.com/p/protocolbuffers

A programming language called Sawzall is used to make MapReduce easy to use.

Sawzall

Query Geo Distribution

proto "querylog.proto"
datum: table sum[t: time][lat: int][lon: int] of int;

log_record: QueryLogProto

A language you can learn in a day
It hides the MapReduce processing so you can use it casually.
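The Sawzall fragment in these notes declares a sum table indexed by time, latitude and longitude. As a rough analogue only, the same idea can be sketched in plain Python as a map step that emits a key per record and a reduce step that sums per key. The record layout, the hourly bucketing and the function names are my own assumptions; this is not Sawzall and not how Google runs it.

```python
from collections import defaultdict

# Hypothetical query log records: (unix time in seconds, latitude, longitude).
log_records = [
    (1_262_304_000, 35.68, 139.69),
    (1_262_304_030, 35.01, 139.20),
    (1_262_307_700, 37.42, -122.08),
    (1_262_304_100, 37.77, -122.41),
]


def map_record(record):
    """Emit ((hour bucket, lat cell, lon cell), 1) for one record."""
    t, lat, lon = record
    # int() truncates toward zero, giving one-degree cells.
    return (t // 3600, int(lat), int(lon)), 1


def reduce_counts(pairs):
    """Sum the emitted values per key, in the spirit of a sum table."""
    table = defaultdict(int)
    for key, value in pairs:
        table[key] += value
    return dict(table)


summary = reduce_counts(map_record(r) for r in log_records)
for key, count in sorted(summary.items()):
    print(key, count)
```

The analyst only writes the per-record logic; splitting the input across machines and merging the partial sums is the part the framework hides.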

Heavily optimized for in-house use

Ninety-something percent

Fast

Failure-obvious
- discard and recalculate the record with an error rather than stall the whole computation.
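A minimal Python sketch of that attitude toward bad records, assuming a tab-separated line format that I made up for illustration. It only covers the "discard" half: a record that fails to parse is counted and skipped instead of aborting the run.

```python
def parse_line(line):
    """Parse a hypothetical 'url<TAB>bytes' log line into (url, int)."""
    url, size = line.split("\t")  # raises ValueError if there is no tab
    return url, int(size)         # raises ValueError if size is not a number


def total_bytes(lines):
    """Sum the byte counts, skipping lines that fail to parse."""
    total, skipped = 0, 0
    for line in lines:
        try:
            _, size = parse_line(line)
        except ValueError:
            skipped += 1  # discard the bad record and keep going
            continue
        total += size
    return total, skipped


lines = ["/search\t1200", "corrupted line", "/maps\t5400", "/mail\tNaN"]
total, skipped = total_bytes(lines)
print(f"total={total} bytes, skipped={skipped} records")
```

Reporting the skipped count alongside the result keeps the loss visible, which matters when the discarded fraction is supposed to stay small.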

MySQL-like database engine

CSV/text files in GFS
Bigtable
local MySQL

Need to parse and aggregate different data sources
- usually write a Python script
- sometimes use a local MySQL database for aggregation
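A small sketch of what such a script might look like. I am using Python's built-in `sqlite3` as a stand-in for the local MySQL database so the example runs anywhere; the CSV contents, the second data source and the table name are all invented for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical inputs: a CSV export and rows pulled from another source.
csv_text = "country,downloads\nJP,120\nUS,300\nJP,80\n"
other_rows = [("JP", 40), ("DE", 10)]


def aggregate(csv_text, other_rows):
    """Load both sources into one local table and sum per country."""
    conn = sqlite3.connect(":memory:")  # stand-in for a local MySQL database
    conn.execute("CREATE TABLE downloads (country TEXT, n INTEGER)")
    reader = csv.DictReader(io.StringIO(csv_text))
    conn.executemany(
        "INSERT INTO downloads VALUES (?, ?)",
        [(row["country"], int(row["downloads"])) for row in reader],
    )
    conn.executemany("INSERT INTO downloads VALUES (?, ?)", other_rows)
    rows = conn.execute(
        "SELECT country, SUM(n) FROM downloads "
        "GROUP BY country ORDER BY country"
    ).fetchall()
    conn.close()
    return rows


totals = aggregate(csv_text, other_rows)
print(totals)
```

At this stage the data has already been reduced to something small, so a local database is enough for the final aggregation.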

Building statistical models

Apply appropriate statistical methods for given problems. Some examples:

Time-series (seasonal ARIMA) model
Linear mixed effects (LME)
Random forest models
DhD propensity scoring
Experimental design

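20+ statisticians and quant analysts on the team.

R is most commonly used

Analysis of clients that are likely to grow in the future: decision tree

Something about forecasting the next several thousand years with autocorrelation and moving averages
Time-series analysis!!!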
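For readers unfamiliar with the two terms in that note, here is a plain-Python sketch of a sample autocorrelation and a rolling mean. It is not a seasonal ARIMA fit (the team reportedly uses R for that), and a rolling mean is a smoothing tool rather than the moving-average term of an ARIMA model. The series is invented.

```python
def moving_average(series, window):
    """Rolling mean over consecutive windows of the given size."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum(
        (series[i] - mean) * (series[i + lag] - mean)
        for i in range(n - lag)
    )
    return cov / var


# Hypothetical series with a period of 2 (for example, a strong alternation).
series = [10, 20, 10, 20, 10, 20, 10, 20]

print(moving_average(series, 2))
print(autocorrelation(series, 1), autocorrelation(series, 2))
```

A strongly positive autocorrelation at the seasonal lag and a negative one in between is the kind of pattern that motivates a seasonal model in the first place.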

Visualization and Presentation

Uses the Google Visualization API

Animation of time-series correlations

Adhering to Engineering standards

Sharing all source code with all other software engineers
check code into a single repository for the whole company
- your code may be used or edited by someone in the future
all code has to follow the coding style
all code has to be reviewed by peers before check-in

Sharing computing resources with all other engineers.
- K distributed machines.
same infrastructure as production.

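Even throwaway code gets reviewed

The scope of work is broad

The problems are not defined in advance

Could YouTube traffic be used for something?

Projects start from thinking about what kind of approach is needed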

Challenges we are facing

Complex questions without simple solutions

Large volumes of data
- can't achieve w/o sophisticated computing infrastructure
- analysts need to have necessary technical / engineering and quantitative skills

Limited resources (hiring & training)

Privacy

There are hardly any statisticians for in-house work.

A CS + statistics + math background is difficult to find.

They get out statistics textbooks and study.

There are not many statistics majors in Japan.

Google has many data analyst teams, including us QM

We are NOT software engineers but are equipped with either engineering or statistics backgrounds and adhere to engineering standards at Google

We undertake complex research and modeling projects that involve large-scale data processing and intensive statistical analysis.

We are hoping

Datacenters
MapReduce
Sawzall
other Google technologies(GFS, Bigtable)

Google-related papers
http://labs.google.com/papers

QM also presents some papers at the JSM (Joint Statistical Meetings) conference
http://www.amstat.org/meetings/jsm/2010/

The approach is to use whatever is available
Raw data is best
It is a trade-off against cost