This page reproduces the content of http://www.slideshare.net/shimonotoshiyuki/ellipsoidal-representations-about-correlations-201111-tsukuba-kakenhisymposium

by Toshiyuki Shimono (下野 寿之)

Uploaded on 2012/02/01 in the Learning category

A fundamental theory in statistics, possibly applicable to data mining, machine learning, as well as epistemology. The principia mathematica of mine, 2nd version.

- Ellipsoidal representations about correlations

(Towards general correlation theory)

Toshiyuki Shimono

tshimono@05.alumni.u-tokyo.ac.jp

KAKENHI* Symposium

*Grant-in-Aid for Scientific Research

University of Tsukuba

2011-11-8

- My profile

• My jobs are mainly building algorithms using

data in large amounts such as:

o web access log

o newspaper articles

o POS(Point of Sales) data

o tags of millions of pictures

o links among billions of pages

o psychology test results of a human resource company

o data produced for recommendation engines

o data produced by an original search engine

• This presentation touches on those above.

- Background

1. Paradoxes of real-world data:

o Even an elaborate regression analysis mostly gives ρ < 0.7.

(This is when the observation is not very accurate; the threshold 0.7 is arbitrary.)

-> So how should we deal with them?

o Data accuracy seems unimportant for seeing ρ when ρ < 0.7.

-> Details are shown later.

2. My tentative answer:

o The correlations are very important, so we need interpretation methods.

o The ellipsoids will give you insights.

3. Then we will:

o understand the real world dominated by weak correlations.

o hopefully find new rules and findings across broad areas of science.

- Main contents

§1. What is ρ?

o Shape of ellipse/ellipsoid

o Mysterious robustness

§2. Geometry of regression

o Similarity ratio of ellipses

o Graduated rulers

o Linear scalar fields

- §1. What is ρ ?

(ρ : the correlation coefficient)

It was developed by Karl Pearson from a similar but slightly different idea introduced by Francis Galton in the 1880s.

(quoted from en.wikipedia.org)

- The shapes of correlation ellipses (1)

Each panel of the figure on the left shows a 2-dimensional Gaussian distribution, with ρ changing from -1 to +1 in steps of 0.1.

(5000 points are plotted for each.)

- The shapes of correlation ellipses (2)

The density function of the standardized 2-dimensional Gaussian distribution.
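For reference, a standardized 2-dimensional Gaussian (zero means, unit variances, correlation ρ) has the standard density below. The level chosen for the correlation ellipse is an assumption of this note: x² - 2ρxy + y² = 1 - ρ² is the contour that fits exactly inside the square [-1, 1]².

```latex
f(x, y) = \frac{1}{2\pi\sqrt{1-\rho^{2}}}
          \exp\!\left( -\,\frac{x^{2} - 2\rho x y + y^{2}}{2\,(1-\rho^{2})} \right),
\qquad
\text{correlation ellipse: } x^{2} - 2\rho x y + y^{2} = 1 - \rho^{2}.
```

Setting x = 1 in the ellipse equation gives (y - ρ)² = 0, so the line x = 1 meets the ellipse only at the single point (1, ρ).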

Note: for higher dimensions,

The ellipse is inscribed in the unit square [-1, 1]², touching it at the 4 points ±(1, ρ) and ±(ρ, 1).

- The shapes of correlation ellipses (3)

• Displacement and axial rescaling are allowed.

(Rotation, or rescaling along any other direction, is prohibited.)

To draw the ellipses above:

1. draw an ellipse whose height and width are √(1±ρ),

2. rotate it by 45 degrees,

3. apply a parallel shift and axial rescaling.

- The shapes of correlation ellipses (4)

[Baseball example] The 6 teams of the Central League played 130 games in each of the past 31 years. Each dot below corresponds to one team in one year (N = 186 = 6 × 31).

[Figure: four scatter plots]

• x : total score lost (L), y : - rank, ρ = -0.471

• x : total score gained (G), y : - rank, ρ = 0.419

• x : total score gained, y : total score lost, ρ = 0.423

• x : - rank prediction from both G & L, y : - rank, ρ = -0.828

(The prediction is through the multiple regression analysis.)

- The shapes of correlations (5) SKIP
- Correlation ellipsoid (higher dimension)

[Figure: correlation ellipsoid drawn in (x, y, z) axes, touching the cube at ( 1 , 0.3 , 0.5 ), ( 0.3 , 1 , 0.7 ), ( 0.5 , 0.7 , 1 ), (-1 ,-0.3 ,-0.5 ), (-0.3 ,-1 ,-0.7 ) and (-0.5 ,-0.7 ,-1 )]

ρ-matrix herein is,

1 0.3 0.5

0.3 1 0.7

0.5 0.7 1

For the 3-dim case, the probability ellipsoid touches the unit cube at the 6 points ±( ρ_i1 , ρ_i2 , ρ_i3 ), where i = 1,2,3.
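The 3-dim statement can be checked numerically. The sketch below assumes that the probability ellipsoid is the surface x^T R^{-1} x = 1 for the ρ-matrix R shown on this slide; `solve`, `quad_form` and `touch_point` are hypothetical helper names introduced only for this illustration.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# rho-matrix copied from the slide
R = [[1.0, 0.3, 0.5],
     [0.3, 1.0, 0.7],
     [0.5, 0.7, 1.0]]

def quad_form(point):
    """Return point^T R^{-1} point, which equals 1 on the assumed ellipsoid."""
    y = solve(R, point)
    return sum(p * q for p, q in zip(point, y))

def touch_point(j):
    """j-th column of R: the claimed touching point on the cube face x_j = +1."""
    return [R[i][j] for i in range(3)]

for j in range(3):
    p = touch_point(j)
    normal = solve(R, p)  # half the gradient of the quadratic form at p
    print(p, round(quad_form(p), 9), [round(v, 9) for v in normal])
```

Each column of R lies on the surface, and the surface normal there points along the j-th axis, which is what touching the face x_j = +1 means. The three mirrored points follow by symmetry.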

(For k dimensions, the hyper-ellipsoid touches the unit hyper-cube at the 2×k points ±( ρ_i1 , ρ_i2 ,.., ρ_ik ), where i = 1,2,..,k.)

- The mysterious robustness (1)

ρ[X:Y] and ρ[ f(X) : g(Y) ] seem to differ only slightly from each other

• when f and g are both increasing functions,

• unless X, Y, f(X) or g(Y) contains `outlier(s)'.

(Sampling fluctuations of ρ are much larger than the effect caused by non-linearity or by the error ε.)
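A small simulation can illustrate this claim. The sketch below assumes a toy model (x, y) = (u, 0.5*u + 0.707*v) with u and v uniform on (0, 1], the construction this deck uses for its scatter plots; the function names, sample sizes and seeds are arbitrary choices made for this illustration. Squaring is increasing here only because x and y are positive.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def sample_xy(n, rng):
    """Assumed toy model: (x, y) = (u, 0.5*u + 0.707*v), u and v uniform on (0, 1]."""
    xs, ys = [], []
    for _ in range(n):
        u, v = 1.0 - rng.random(), 1.0 - rng.random()
        xs.append(u)
        ys.append(0.5 * u + 0.707 * v)
    return xs, ys

def deformation_effects(n_big=50000, seed=1):
    """Correlations before and after increasing deformations, on one large sample."""
    xs, ys = sample_xy(n_big, random.Random(seed))
    return {
        "X:Y": pearson(xs, ys),
        "X^2:Y": pearson([x * x for x in xs], ys),
        "X:Y^2": pearson(xs, [y * y for y in ys]),
        "X:log(Y)": pearson(xs, [math.log(y) for y in ys]),
    }

def sampling_sd(n=200, repeats=200, seed=2):
    """Standard deviation of the sample correlation over repeated samples of size n."""
    rng = random.Random(seed)
    rhos = [pearson(*sample_xy(n, rng)) for _ in range(repeats)]
    mean = sum(rhos) / repeats
    return math.sqrt(sum((r - mean) ** 2 for r in rhos) / (repeats - 1))

for name, rho in deformation_effects().items():
    print(f"rho[{name}] = {rho:.3f}")
print(f"sd of sample rho at N=200: {sampling_sd():.3f}")
```

Comparing the shifts between the first line and the deformed variants with the printed standard deviation shows how the two effects compare in this particular model; other models may behave differently.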

* A function f(・) is increasing iff f(x) ≦ f(y) holds for any x ≦ y.

- The mysterious robustness (2)

[Figure: scatter plots of (x,y)=(u,0.5*u+0.707*v), with (u,v) drawn from a uniform square, and of its deformed versions]

• ρ[X:Y]=0.557 (original)

• ρ[X²:Y]=0.519 (X squared)

• ρ[X:Y²]=0.536 (Y squared)

• ρ[X:log(Y)]=0.539 (logarithm of Y)

• ρ[Xrank:Yrank]=0.537 (X, Y replaced by ranks)

• ρ[X(7):Y(7)]=0.524 (X, Y quantized to 7 levels)

• ρ[X(5):Y(5)]=0.507 (X, Y quantized to 5 levels)

• The deformations have less effect on ρ.

• N=200 ≫ 1 causes bigger fluctuations in ρ.

Even with N=200 the sampled correlations fluctuate rather widely, whereas the X marks from the experiments are rather concentrated.

- The mysterious robustness (3)

The sampled ρ values fluctuate according to the sample size, N=30 (blue) or N=300 (red). The deformation effect of f( ) is smaller.

- Where does the champion come from?

The champion of a game is often not the true champion.

[Figure: axis labelled "potential ability"]

If ρ of the game is not close to 1, the truly strongest player often cannot win.

The winner is approximately ρ times as strong as the truly strongest player.
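One way to read this claim is that the winner's average ability is about ρ times the average ability of the truly strongest player. The sketch below tests that reading under the Gaussian assumption stated on this slide; the tournament model (independent players, one noisy result per player) and the name `champion_ratio` are assumptions made for this illustration.

```python
import math
import random

def champion_ratio(rho, n_players=20, n_games=5000, seed=0):
    """Mean ability of the winner divided by mean ability of the strongest player.

    Ability and result of each player form a standardized 2-dim Gaussian pair
    with correlation rho; the winner is the player with the best result.
    """
    rng = random.Random(seed)
    noise = math.sqrt(1.0 - rho * rho)
    winner_total = 0.0
    strongest_total = 0.0
    for _ in range(n_games):
        abilities = [rng.gauss(0.0, 1.0) for _ in range(n_players)]
        results = [rho * a + noise * rng.gauss(0.0, 1.0) for a in abilities]
        winner = max(range(n_players), key=results.__getitem__)
        winner_total += abilities[winner]
        strongest_total += max(abilities)
    return winner_total / strongest_total

for rho in (0.3, 0.5, 0.9):
    print(rho, round(champion_ratio(rho), 3))
```

The ratio comes out close to ρ because the expected ability given a result s is ρ·s, while the best result and the best ability have the same distribution.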

(If the results and abilities form a 2-dim zero-centered Gaussian.)

- Summary of `§1. What is ρ? '

• ρ is recognizable as an ellipse.

• ρ-matrix is recognizable as an ellipsoid.

• ρ seems robust against axial deformations unless outliers exist.

• ρ of a game is suggested by the champions.

- §2. Geometry of Regression

The figures herein show the possible region where (x,y,z)=(ρ[Y:Z],ρ[Z:X],ρ[X:Y]) can exist.

- Multiple-ρ is the similarity ratio of ellipses

[ Formulation of MRA ]

[ Multiple - ρ ]

The multiple-ρ (≦ 1) is the similarity ratio of the ellipses.
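The similarity-ratio reading can be tried on the baseball numbers quoted in this deck. The sketch assumes that the ratio is the scale factor s of the concentric ellipse, similar to the correlation ellipse x² - 2ρxy + y² = 1 - ρ², that passes through the inner point (r1, r2); the function names are hypothetical.

```python
import math

def multiple_rho(r1, r2, rho12):
    """Scale factor s such that (r1, r2) lies on the correlation ellipse scaled by s."""
    q = (r1 * r1 - 2.0 * rho12 * r1 * r2 + r2 * r2) / (1.0 - rho12 * rho12)
    return math.sqrt(q)

def on_correlation_ellipse(x, y, rho12, tol=1e-12):
    """True if (x, y) satisfies x^2 - 2*rho*x*y + y^2 = 1 - rho^2 within tol."""
    return abs(x * x - 2.0 * rho12 * x * y + y * y - (1.0 - rho12 * rho12)) < tol

# Baseball example: r1 = rho[Y:X1], r2 = rho[Y:X2], rho12 = rho[X1:X2]
r1, r2, rho12 = 0.419, -0.471, 0.423
s = multiple_rho(r1, r2, rho12)
print(round(s, 3))                                   # about 0.83
print(on_correlation_ellipse(r1 / s, r2 / s, rho12))  # rescaled point is on the ellipse
```

With these rounded inputs the result agrees with the multiple-ρ of 0.828 quoted for the baseball example to within rounding.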

(When X is k-dimensional, the hyper-ellipsoid is determined by the k×k matrix whose elements are ρ[ X_i : X_j ], and the inner point is at the k-dimensional vector whose elements are ρ[ X_i : Y ].)

- Examples : Multiple-ρ from the ellipses

Many interesting phenomena would be systematically explained.

- Partial-ρ is read by a ruler in the ellipse

The partial correlation r_1' comes from the idea of the correlation between X1 and Y with X2 held fixed.

The red ruler,

• parallel to the corresponding axis,

• passing through (r_1, r_2),

• fully extended inside the ellipse,

• graduated linearly over the range ±1,

reads the partial-ρ.

r_1' = 0.75 for this case.
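The ruler construction can be compared with the textbook partial-correlation formula. The sketch below assumes the correlation ellipse is x² - 2ρxy + y² = 1 - ρ² with ρ = ρ[X1:X2], and uses the baseball numbers from this deck because the inputs behind the 0.75 example are not given; all function names are hypothetical.

```python
import math

def chord_ends(y0, rho12):
    """x-coordinates where the line y = y0 meets x^2 - 2*rho*x*y + y^2 = 1 - rho^2."""
    b = -2.0 * rho12 * y0
    c = y0 * y0 - (1.0 - rho12 * rho12)
    root = math.sqrt(b * b - 4.0 * c)
    return (-b - root) / 2.0, (-b + root) / 2.0

def ruler_reading(r1, r2, rho12):
    """Position of (r1, r2) on the horizontal chord, graduated linearly from -1 to +1."""
    lo, hi = chord_ends(r2, rho12)
    return 2.0 * (r1 - lo) / (hi - lo) - 1.0

def partial_rho(r1, r2, rho12):
    """Textbook partial correlation of X1 and Y with X2 held fixed."""
    return (r1 - rho12 * r2) / math.sqrt((1.0 - r2 * r2) * (1.0 - rho12 * rho12))

r1, r2, rho12 = 0.419, -0.471, 0.423
print(round(ruler_reading(r1, r2, rho12), 4))
print(round(partial_rho(r1, r2, rho12), 4))
```

The two printed values coincide. Swapping the roles of the coordinates, as in `ruler_reading(r2, r1, rho12)`, gives the reading of the ruler along the other axis.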

r_2' is also read by turning the ruler to the vertical direction.

- Standardized partial regression coefficients

• a_i are called the partial regression coefficients.

• Assume X_1, X_2, Y are standardized.

Make a scalar field inside the ellipse that is

• 1 on the plus-side boundary of the k-th axis,

• 0 on the boundary of the other axis,

• linearly interpolated in between.

Then a_k is read as the value at (r_1, r_2).
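The scalar-field reading can be compared with the usual normal equations. The sketch below assumes that the boundary points meant here are the points where the ellipse touches the square, (1, ρ) for axis 1 and (ρ, 1) for axis 2 with ρ = ρ[X1:X2], and that the field is the linear function through the origin fixed by those two values; the function names are hypothetical.

```python
def field_coefficients(rho12, k):
    """Coefficients (alpha, beta) of the linear field alpha*x + beta*y that equals 1
    at the touching point of the k-th axis and 0 at that of the other axis."""
    det = 1.0 - rho12 * rho12
    if k == 1:                       # 1 at (1, rho12), 0 at (rho12, 1)
        return 1.0 / det, -rho12 / det
    return -rho12 / det, 1.0 / det   # k == 2: 0 at (1, rho12), 1 at (rho12, 1)

def field_value(x, y, rho12, k):
    """Value of the k-th scalar field at the point (x, y)."""
    alpha, beta = field_coefficients(rho12, k)
    return alpha * x + beta * y

# Baseball example: r1 = rho[Y:X1], r2 = rho[Y:X2], rho12 = rho[X1:X2]
r1, r2, rho12 = 0.419, -0.471, 0.423
a1 = field_value(r1, r2, rho12, 1)
a2 = field_value(r1, r2, rho12, 2)
print(round(a1, 3), round(a2, 3))

# Check against the normal equations for standardized variables:
#   a1 + rho12 * a2 = r1   and   rho12 * a1 + a2 = r2
print(abs(a1 + rho12 * a2 - r1) < 1e-12, abs(rho12 * a1 + a2 - r2) < 1e-12)
```

Under these assumptions the field values at (r_1, r_2) satisfy the normal equations, so they are the standardized partial regression coefficients.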

Note:

• Extension to higher dimensions is easy.

• There is a single boundary point on each facet.

• This pictorialization may be useful for SEM (Structural Equation Modeling).

- The elliptical depiction for the baseball example

This page is added after the symposium

Red : for the multiple-ρ (0.828),

Blue : for the two partial-ρ

Magenta : for the partial regression coefficients.

Each value corresponds to the length ratio of the

bold part to the whole same-colored line section.

X1 : annual total score gained

X2: annual total score lost

Y: zero minus annual ranking

( ρ[Y:X1] , ρ[Y:X2] ) = (0.419,-0.471) is plotted inside the ellipse slanted with ρ[X1:X2]=0.423.

-> The meaning of the numbers becomes clearer.

- Summary and findings of §2 Geometry of regression

• Multiple-ρ is the similarity ratio of two ellipses/ellipsoids.

• Partial-ρ is read by a graduated ruler in the ellipse/ellipsoids.

• Each regression coefficient is given by the scalar field.

So far, the numbers derived from MRA (Multiple Regression Analysis) have often been said to be hard to interpret. But this situation can be changed.

- Summary as a whole

[ Main results ]

Using the ellipse or hyper-ellipsoid,

• any correlation matrix is wholly pictorialized.

• multiple regression is translated into geometric quotients.

[ Sub results ]

• ρ seems quite robust against axial deformations unless outliers exist.

• (Spherical trigonometry may give you insights). <- Not referred today.

[ Next steps ]

• treat the parameter/sampling perturbations

• systematize interesting statistical phenomena

• produce new theories further on

• give new twists to other research areas

• make useful applications to the real world cases

• organize a new logic system for this ambiguous world.

- Refs

1. 岩波数学辞典 (Encyclopedic Dictionary of Mathematics), The Mathematical Society of Japan

2. R, http://www.r-project.org/

3. 共分散構造分析 [事例編] (Covariance Structure Analysis [Case Studies])

The author sincerely welcomes any related literature.

- Background of this presentation SKIP

1. We make judgements from related things in daily or social life, but this real world is noisy and filled with exceptions.

e.g. "Do better posture and mental concentration cause better performance?"

2. Real-world data causes paradoxes:

o Even an elaborate regression analysis mostly gives ρ < 0.7; how should we deal with this?

o Data accuracy is not important when ρ < 0.7; details are shown later.

o Why does subjective sense work in the real world?

3. Geometric interpretations of multiple regression analysis may be useful

o that wholly take in any correlation matrix,

o that are geometric, using ellipsoids,

to observe and analyze the background phenomena in detail.

4. Then we will understand the weak correlations that dominate our world.

- A primitive question SKIP

Question

Why (and how) is data analysis important?

My Answer

It gives you inspiration and updates your recognition of the real world.

Knowing the numbers μ, σ, ρ, ranking, VaR * from phenomena you have met is crucially important for deciding your next action in your daily, social or business life!!

* average, standard deviation, correlation coefficient, rank order, Value at Risk

And so, the interpretation of the numbers is necessary.

(And I provide you with that of ρ today!)

- Main ideas in more detail SKIP

Using the ellipse or hyper-ellipsoid,

• 2nd-order moments can be completely visualized in a picture.

• the numbers from multiple regression can also be visualized.

1. (Pearson's) Correlation Coefficient

• basic to statistics (as you know)

• may change considerably when outliers are contained

• however, changes only a little under a `monotone' map

• depicted as a `correlation ellipse'

2. Multiple Regression Analysis

• (Spherical Surface Interpretation)

• Ellipse Interpretation

- Main ideas SKIP

1. What is the correlation coefficient after all?

2. Geometric interpretations of Multiple Regression Analysis.

- The mysterious robustness (3) SKIP

Front figures: x is the original sample correlation; y is the correlation calculated after 3-level quantization. Back figures: samples of 100.

- Summary of `§1. What is ρ?' REDUNDANT

• A correlation ρ is recognizable as an ellipse.

• A correlation matrix is also recognizable as an ellipsoid.

• ρ seems robust against axial deformations unless outliers exist.

• You can guess `ρ' of a game by the champion.

- When partial-ρ is zero. (SKIP)

The condition partial-ρ = 0 ⇔

• The inner angle of the spherical triangle is 90 degrees.

• The two `hyper-planes' cross at 90 degrees at the `hyper-axis'. The axis corresponds to the fixed variables, and each of the planes contains one of the two variables.

• On the ellipse/ellipsoid, the characteristic point is at the midpoint of the ruler.

- Multiple-ρ is the similarity ratio of ellipses

REDUNDANT

[ Formulation of MRA ]

[ Multiple - ρ ]

The multiple-ρ (≦ 1) is the similarity ratio of the ellipses.

For an arbitrary number of variables, calculate: the inverse of the correlation matrix → the reciprocal of each of the diagonal elements → 1 minus each of them → the square root of each. Each result is the multiple-ρ of the corresponding variable from the remaining variables.

(When X is k-dimensional, the hyper-ellipsoid is determined by the k×k matrix whose elements are ρ[ X_i : X_j ], and the inner point is at the k-dimensional vector whose elements are ρ[ X_i : Y ].)

- Summary and findings of §2 Geometry of regression REDUNDANT

• Multiple-ρ is the similarity ratio of two ellipses/ellipsoids.

• Partial-ρ is read by a graduated ruler in the ellipse/ellipsoids.

• Each regression coefficient is given by the scalar field.

• (Spherical trigonometry)

So far, the numbers derived from MRA have often been said to be hard to interpret. But this situation can be changed.

- Introduction

This page was added after the symposium.

This page may need intensive proofreading by the author.

There is a Japanese word `kaizen', which means improvement. The real world is, however, so ambiguous that it is often hard to know whether a kaizen action will have a positive effect or not. Sometimes your action may have a negative effect or zero effect on average, even if you believe your action is a good one.

Assume a situation in which you can control a variable to have some effect on the outcome variable (the number of control variables will increase in what follows).

The author's hypothetical proposition is that the correlation coefficient indeed plays an important role. One reason is that when the correlation is positive, your rational action is simply to increase the value of the control variable. And it seems very reasonable that you should select a variable strongly correlated with the output variable.

The problems still existing today are as follows:

- The meaning of the correlation value is not yet well known.

- The meaning of multiple regression analysis is also not yet well known (although, when the correlation is weak, the reasonable choice of analysis is multiple regression analysis or its elaborate derivatives).

The author found that correlation is very robust against any `axial deformations' unless the variables contain outliers. Rather, the sample correlation coefficient fluctuates much more in many cases when N is less than 1000. The author also found the geometrical background of the correlations in multiple regression analysis, which is producing many insights. (Perhaps R.A.Fisher already knew that, but nobody around me did.)

(The robustness is not well analyzed at this moment; there are only some pieces of analysis and numerical examples. The geometrical background is analyzed in its basic points, so the author is considering investigating parameter perturbations further.)