このページは http://www.slideshare.net/gill1109/heiser-symposium の内容を掲載しています。

掲載を希望されないスライド著者の方は、こちらよりご連絡下さい。

2年以上前 (2014/05/01)にアップロードin学び

Scientific integrity, scientific fraud, questionable research practices, Smeesters affair, Geraer...

Scientific integrity, scientific fraud, questionable research practices, Smeesters affair, Geraerts affair, Förster affair: lecture at Willem Heiser farewell symposium.

- 初學R語言的60分鐘1年以上前 by Chen-Pan Liao
- Booz Allen Field Guide to Data Science 9ヶ月前 by Booz Allen Hamilton
- PDF4ヶ月前 by Dev Admin

- Worst Practices in

Statistical Data Analysis

Best Practices in Statistical Data Analysis

Willem Heiser farewell symposium 30 January 2014

Richard Gill

http://www.math.leidenuniv.nl/~gill - A talk within a talk,

& some reflections on scientific integrity

• Flashback: one year ago in Tilburg

• Meeting of the social science section of the

Dutch statistical society: Statistiek uit de bocht

(Statistics round the bend)

Ideas stolen from Marcel van Assen, Jelte Wicherts,

Frank van Kolfschooten, Han van der Maas, Howard Wainer … - Integrity or fraud …

or just

questionable research practices

?

… or … ?

Richard Gill

Tilburg, December 2012

http://www.math.leidenuniv.nl/~gill - • Smeesters affair

• Geraerts affair

Smeesters: closed

Geraerts: open,

controversial - Smeesters

• August 2011: a friend draws attention of

Uri Simonsohn (Wharton School, Univ.

Penn.) to “The effect of color ... ”

by D. Smeesters and J. Liu - • Simonsohn does preliminary statistical

analysis indicating results are “too good to

be true”

Hint: text mentions a number of within group SD’s; group sizes ≈ 14 - 3×2×2 design,

≈14 subjects per group

• Outcome: # correct answers in 20 item

multiple choice general knowledge

quiz

• Three treatments:

• Colour: red, white, blue

• Stereotype or exemplar

• Intelligent or unintelligent - Unintelligent Intelligent

Exemplar Kate Moss

Albert Einstein

Stereotyp A

A professor

e

supermodel - Priming

• Red makes one see differences

• Blue makes one see similarities

• White is neutral

• Seeing an intelligent person makes you feel intelligent

if you are in a “blue” mood

• Seeing an intelligent person makes you feel dumb

if you are in a “red” mood

• Effect depends on whether you see exemplar or

stereotype - • The theory predicts something very like the

picture (an important three way

interaction!) - • August 2011: a friend draws attention of Uri Simonsohn

(Wharton School, Univ. Penn.) to “The effect of color”

by D. Smeesters and J. Liu.

• Simonsohn does preliminary statistical analysis indicating

“too good to be true”

• September 2011: Simonsohn corresponds with Smeesters,

obtains data, distribution-free analysis confirms earlier

findings

• Simonsohn discovers same anomalies in more papers by

Smeesters, more anomalies

• Smeesters’ hard disk crashes, all original data sets lost.

None of his coauthors have copies.

All original sources (paper documents) lost when moving

office

• Smeesters and Simonsohn report to authorities

• June 2012: Erasmus CWI report published, Smeesters resigns,

denies fraud, admits data-massage “which everyone

does” - http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2114

571

What did Simonsohn actually

do?

• Erasmus report is censored, authors refuse to answer questions,

Smeesters and Liu data is unobtainable, identity Simonsohn

unknown

• Some months later: identity Simonsohn revealed, uncensored

version of report published

• November 2012: Uri Simonsohn posts “Just Post it: The Lesson

from Two Cases of Fabricated Data Detected by Statistics

Alone”

Two cases? Smeesters, Sanna; third case,

inconclusive (original data not available)

• December 2012: original data still unavailable, questions to

Erasmus CWI still unanswered

• March 2013: Simonsohn paper published, data posted - • Theory predicts that the 12 experimental

groups can be split into two sets of 6

• Within each set, groups should be quite similar

• Smeesters & Liu report some of the group

averages and some of the group SD’s

• Theory:

variance of group average = within group

variance divided by group size!

• The differences between group averages are

too small compared to the within group

variances! - • Simonsohn proposes ad-hoc(*) test-statistic

(comparing between group to within group

variance), null distribution evaluated using

parametric bootstrap

• When original data is made available, can

repeat with non-parametric bootstrap

• Alternative: permutation tests

• Note: to do this, he pools each set of six

groups. “Assumption” that there is no

difference between the groups within each of

the two sets of six groups is conservative

(*) Fisher: use left tail-probability of F- test for testing too good to be true - A picture tells 1000 words

sigma <- 2.9

pattern <- c(rep(c(1, 0), 3), rep(c(0, 1), 3))

means <- pattern

means[pattern == 1] <- 11.75

means[pattern == 0] <- 9.5

set.seed(2013)

par(mfrow = c(3,4), bty = "n", xaxt = "n", yaxt = "n")

for (i in 1:12) { averages <- rnorm(12, mean = means, sd = sigma/sqrt(14))

dim(averages)<- c(2, 6)

averages <- rbind(averages-6, 0)

plot(c(0, 20), c(0, 7), xlab = "", ylab = "", type = "n")

abline(h = 0:6)

barplot(as.vector(averages), col = rep(c("black", "white", "white"), n = 6),

add = TRUE)

} - Spot the odd one out!
- Spot the odd one out!
- Further analyses
- Further analyses

• Simonsohn’s test-statistic is actually

equivalent to standard ANOVA F-test of

hypothesis “each of two groups of six

conditions have the same mean” – except

that we want to reject if the statistic is too

small - data <- data.frame(score = scores, colour = colour, prime = prime,

dimension=dimension, pattern=pattern.long)

result.aov.full <- aov(score~colour*prime*dimension, data = data)

result.aov.null <- aov(score~(colour+prime+dimension)^2, data = data)

result.anova <- anova(result.aov.null, result.aov.full)

result.anova

result.aov.zero <- aov(score ~ pattern, data=data)

result.anova.zero <- anova(result.aov.zero, result.aov.full)

result.anova.zero$F[2]

pf(result.anova.zero$F[2], df1=10, df2=156)

RDG

Test of the 3-way interaction

> result.anova

Analysis of Variance Table

Model 1: score ~ (colour + prime + dimension)^2

Model 2: score ~ colour * prime * dimension

Res.Df RSS Df Sum of Sq F Pr(>F)

1 159 1350.8

2 157 1299.6 2 51.155 3.0898 0.04829 *

Smeesters and Liu

(OK, except d.f.) - data <- data.frame(score = scores, colour = colour, prime = prime,

dimension=dimension, pattern=pattern.long)

result.aov.full <- aov(score~colour*prime*dimension, data = data)

result.aov.null <- aov(score~(colour+prime+dimension)^2, data = data)

result.anova <- anova(result.aov.null, result.aov.full)

result.anova

result.aov.zero <- aov(score ~ pattern, data=data)

result.anova.zero <- anova(result.aov.zero, result.aov.full)

result.anova.zero$F[2]

pf(result.anova.zero$F[2], df1=10, df2=156)

Test of too good to be true

> result.anova.zero$F[2]

[1] 0.0941672

> pf(result.anova.zero$F[2], df1=10, df2=156)

[1] 0.0001445605 - Further analyses

• Scores (integers) appear too uniform

• Permutation test: p-value = 0.00002 - Geraerts

Paper 1:

“Memory” - Paper 2: “JAb

(submitted)” - Geraerts

• Senior author Merckelbach becomes

suspicious of data reported in papers 1 and 2

• He can’t find “Maastricht data” among

Geraerts combined “Maastricht + Harvard”

data set for paper 2 (JAb: Journal of Abnormal

Psychology) - > tapply(ProbeTotalNeg, group, mean)

14.666667 6.600000 6.233333 6.133333

> tapply(TotalNeg,group,mean)

21.76667 21.70000 21.73333 22.43333 Too

good to > tapply(ProbeTotalNeg, group, sd)

2.564120 3.864962 3.287210 3.598212

> tapply(TotalNeg,group,sd)

2.896887 4.094993 5.930246 6.770541 b

e true?

> tapply(SelfTotalNeg, group, mean)

7.1 15.1 15.5 16.3

> tapply(SelfTotalNeg, group, sd)

2.324532 3.457625 4.462487 4.587464

(JAb) - Curiouser and curiouser:

Self-rep + Probe-rep (Spontaneous) = idem

(Others)

Self-rep (Spontaneous) = Probe-rep (Others)

Samples matched (on sex, age education), analysis

does not reflect design

(JAb) - Geraerts

• Merckelbach reports Geraerts to Maastricht and

to Rotterdam authorities

• Conclusion: (Maastricht) some carelessness but

no fraud; (Rotterdam) no responsibility

• Merckelbach and McNally request editors of

“Memory” to retract their names from joint paper

• The journalists love it (NRC; van Kolfschooten ...) - Summary statistics

(Memory paper) - Picture is “too good

> resultsto be t

[,1] [,2]

> sum(-log(results))

[1,] 0.13599556 0.37733885

[1] 12.95321

[2,] 0.01409201 0.25327297

rue”

> pgamma(sum(-log(results)),

[3,] 0.15298798 0.08453114

6, lower.tail = FALSE)

[1] 0.01106587

• Parametric analysis of Memory tables confirms, esp. on

combining results from 3×2 analyses (Fisher comb.)

• For the JAb paper I received the data from van Kolfschooten

• Parametric analysis gives same result again (4×2)

• Distribution-free (permutation) analysis confirms!

(though: permutation p-value only 0.01 vs

normality+independence 0.0002)

> results

[,1] [,2]

[1,] 0.013627082 0.30996011

[2,] 0.083930301 0.24361439

[3,] 0.004041421 0.05290153

[4,] 0.057129222 0.31695753

> pgamma(sum(-log(results)), 8, lower.tail=FALSE)

[1] 0.0002238678 - The morals of the story (1)

• Scientific = Reproducible: Data preparation and

data analysis are integral parts of experiment

• Keeping proper log-books of all steps of data

preparation, manipulation, selection/exclusion

of cases, makes the experiment reproducible

• Sharing statistical analyses over several authors

is almost necessary in order to prevent errors

• These cases couldn’t have occurred if all this

had been standard practice - The morals of the story (2)

• Data collection protocol should be written down in

advance in detail and followed carefully

• Exploratory analyses, pilot studies … also science

• Replicating others’ experiments: also science

• It's easy to make mistakes doing statistical analyses:

the statistician needs a co-pilot

• Senior co-authors co-responsible for good scientific

practices of young scientists in their group

• These cases couldn’t have occurred if all this had

been standard practice - Memory affair

postscript

• Geraerts is forbidden to talk to Gill

• Erasmus University Psychology Institute asks

Han van der Maas (UvA) to investigate “too

good to be true” pattern in “Memory” paper

• Nonparametric analysis confirms my findings

• Recommendations: 1) the paper is retracted o;

2) report is made public o; 3) data-set

P

~

idem o

http://www.erasmusmagazine.nl/nieuws/detail/article/6265-geraerts-trekt-memory-artikel-terug/

Obtaining the data “for peer review”: send request to secretariaatpsychologie@fsw.eur.nl - Main findings

• No proof of fraud (fraud = intentional deception)

• Definite evidence of errors in data management

• Un-documented, unreproducible reduction

from 42 + 39 + 47 + 33 subjects to 30 + 30 +

30 + 30

Together, mega-opportunities for Questionable Research

Practice number 7: deciding whether or not to exclude

data after looking at the impact of doing so on the results

(Estimated prevalence near 100%, estimated acceptability rating near

100%) - Remarks

• A balanced design looks more scientific but is

an open invitation to QRP 7

• Identical “too good to be true” pattern is

apparent in an earlier published paper; the

data has been lost

E. Geraerts, H. Merckelbach, M. Jelicic, E. Smeets (2006),

Long term consequences of suppression of intrusive anxious thoughts and repressive

coping,

Behaviour Research and Therapy 44, 1451–1460 - The latest developments

• I finally got the data from Geraerts

(extraordinary confidentiality agreement)

• But you can read it off the pdf pictures in the

report!

• So let’s take a look… - The latest developments

• I cannot identify Maastricht subset in this data

• The JAb paper does not exhibit any of these

anomalies!

• All data of the third paper with same “too

good to be true” is lost

• A number of psychologists also saw “too good

to be true” w.r.t. content (not statistics) - The next case?

• Spot the difference between the following two slides - Exhibit A
- Exhibit B
- Notes

• Exhibit A is based on several studies in one paper

• Exhibit B is based on similar studies published in several

papers of other authors

• More papers of author of Exhibit A exhibit same pattern

• His/her graphical displays are bar charts, ordered

Treatment 1 (High) : Treatment 2 (Low) : Control

• Displays here based on the ordering

Low : Medium (control) : High - Remarks

• IMHO, “scientific anomalies” should in the first place be discussed

openly in the scientific community; not investigated by disciplinary

bodies (CWI, LOWI, …)

• Data should be published and shared openly, as much as possible

• Never forget Hanlon’s razor (“do not attribute to malice that which

can be adequately explained by incompetence”; or in plain

English: “cockup before conspiracy”)

• I never met a science journalist who knew the “1 over root n” law

• I never met a psychologist who knew the “1 over root n” law

• Corollary: statisticians must participate in social debate,

must work with the media - Exercises

• How/why could faked data exhibit this latest

“too good to be true” pattern?

• Develop a “model-free” statistical test for “excess

linearity”

[problem: no replicate groups, so no permutation test]

• How could Geraerts’ data ever get the way it is?

In three different papers, different anomalies,

same overall “too good to be true” pattern?

• What do you consider could be the moral of this story?