Understanding and Applying Effect Size in Procedural Learning

Effect sizes are crucial statistical outcomes in empirical studies, offering insights into the magnitude of observed effects beyond mere statistical significance. This article delves into the concept of effect sizes, particularly within the context of procedural learning, exploring their calculation, interpretation, and significance in educational research.

The Importance of Effect Sizes

Researchers are often reminded to report effect sizes because they are useful for three reasons. First, they communicate the magnitude of the reported effects, which allows you to reflect on the practical significance of the effects, in addition to their statistical significance. Second, they allow researchers to draw meta-analytic conclusions by comparing standardized effect sizes across studies. Third, effect sizes from earlier studies can be used when planning a new study, for example in a power analysis. In short, a measure of effect size is a quantitative description of the strength of a phenomenon, expressed as a number on a scale.

While a p-value indicates whether observed variation can plausibly be attributed to chance, an effect size quantifies the magnitude of the effect. This makes effect size estimates a vital complement to p-values in most studies.

Unstandardized Effect Sizes

For unstandardized effect sizes, the effect size is expressed on the scale that the measure was collected on. This is useful whenever people can intuitively interpret differences on that measurement scale. For example, children grow on average 6 centimeters a year between the age of 2 and puberty. We can interpret 6 centimeters a year as an effect size, and most people have an intuitive understanding of how large 6 cm is.

Standardized Effect Sizes: Bridging the Gap

To facilitate comparisons of effect sizes across situations where different measurement scales are used, researchers can report standardized effect sizes. A standardized effect size, such as Cohen’s d, is computed by dividing the difference on the raw scale by the standard deviation, and is thus scaled in terms of the variability of the sample from which it was taken. An effect of d = 0.5 means that the difference is the size of half a standard deviation of the measure. Standardized effect sizes are therefore determined both by the magnitude of the observed phenomenon and by the size of the standard deviation. Because a standardized effect size is the ratio of the mean difference to the standard deviation, two studies can report different standardized effect sizes because their mean differences differ, because their standard deviations differ, or both.


Standardized effect sizes are common when variables are not measured on a scale that people are familiar with, or are measured on different scales within the same research area. If you ask people how happy they are, an answer of ‘5’ will mean something very different if you ask them to answer on a scale from 1 to 5 versus a scale from 1 to 9. Standardized effect sizes can be understood and compared regardless of the scale that was used to measure the dependent variable.

Families of Standardized Effect Sizes

Standardized effect sizes can be grouped into two families (Rosnow & Rosenthal, 2009): the d family (consisting of standardized mean differences) and the r family (consisting of measures of strength of association). Conceptually, the d family effect sizes are based on the difference between observations, divided by the standard deviation of these observations, while the r family effect sizes describe the proportion of variance that is explained by group membership. For example, a correlation (r) of 0.5 indicates that 25% of the variance ((r^2) = 0.25) in the outcome variable is explained by the difference between groups.

Cohen's d: A Closer Look

Cohen’s d (the d is always italicized) is used to describe the standardized mean difference of an effect. This value can be used to compare effects across studies, even when the dependent variables are measured with different scales, for example when one study uses 7-point scales to measure the dependent variable, while the other study uses a 9-point scale. We can even compare effect sizes across completely different measures of the same construct, for example when one study uses a self-report measure, and another study uses a physiological measure.

Cohen’s d ranges from minus infinity to infinity (although in practice, the mean difference in the positive or negative direction that can be observed will never be infinite), with the value of 0 indicating that there is no effect. Cohen (1988) uses subscripts to distinguish different versions of d, a practice I will follow because it prevents confusion (without any specification, the term ‘Cohen’s d’ denotes the entire family of effect sizes). Cohen refers to the standardized mean difference between two groups of independent observations for the sample as (d_s).

Calculating Cohen's d

(d_s) can be calculated using the following formula:


(d_s = \frac{\overline{M}_1 - \overline{M}_2}{\text{SD}_{\text{pooled}}})

Where:

  • (\overline{M}_1 - \overline{M}_2) is the difference between the means.
  • (\text{SD}_{\text{pooled}}) is the pooled standard deviation (Lakens, 2013), computed from the standard deviations of the two groups, weighted by their sample sizes (n_1) and (n_2): (\text{SD}_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)\text{SD}_1^2 + (n_2 - 1)\text{SD}_2^2}{n_1 + n_2 - 2}}). A worked example follows below.
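
To make the formula concrete, here is a minimal sketch in Python; the function name and the numbers are illustrative, not taken from any study discussed here:

```python
import math

def cohens_d_s(m1, m2, sd1, sd2, n1, n2):
    """Cohen's d_s for two independent groups, using the pooled SD."""
    sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (m1 - m2) / sd_pooled

# A raw mean difference of 1 scale point with a pooled SD of 2
# corresponds to half a standard deviation: d_s = 0.5.
print(cohens_d_s(m1=5.0, m2=4.0, sd1=2.0, sd2=2.0, n1=50, n2=50))  # 0.5
```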

Cohen's d and the t-value

The t-value is used to determine whether the difference between two groups in a t-test is statistically significant (as explained in the chapter on p-values). The sample size in each group ((n_1) and (n_2)) is part of the formula for a t-value, but it is not part of the formula for Cohen’s d (the pooled standard deviation is computed by weighting the standard deviation in each group by its sample size, but this cancels out if the groups are of equal size). This distinction is useful to know, because it tells us that the t-value (and consequently, the p-value) is a function of the sample size, but Cohen’s d is independent of the sample size. If there is a true effect (i.e., a non-zero effect size in the population), the t-value for a null hypothesis test against an effect of zero will on average become larger (and the p-value smaller) as the sample size increases. The effect size estimate, however, will not systematically increase or decrease; it will simply become more accurate, as the standard error decreases when the sample size increases.
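
A short simulation makes this visible. This is a sketch under an assumed true standardized difference of 0.5 in the population; the use of numpy and scipy, and all specific numbers, are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

for n in (20, 80, 320):
    # Two groups whose population means differ by half a standard deviation
    g1 = rng.normal(loc=0.5, scale=1.0, size=n)
    g2 = rng.normal(loc=0.0, scale=1.0, size=n)
    t, p = stats.ttest_ind(g1, g2)
    # With equal group sizes, the pooled variance is the mean of the variances
    sd_pooled = np.sqrt((g1.var(ddof=1) + g2.var(ddof=1)) / 2)
    d = (g1.mean() - g2.mean()) / sd_pooled
    print(f"n per group = {n:3d}   t = {t:5.2f}   p = {p:.4f}   d = {d:.2f}")
```

As the sample size per group grows, t rises and p falls, while d merely fluctuates around 0.5 with shrinking error.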

Practical Significance vs. Statistical Significance

What is the most important outcome of an empirical study? You might be tempted to say it’s the p-value of the statistical test, given that it is almost always reported in articles and determines whether we call something ‘significant’ or not. But a p-value says nothing about how large or how consequential an effect is. One reason to report effect sizes is to facilitate future research: it is possible to perform a meta-analysis or a power analysis based on unstandardized effect sizes and their standard deviations, but it is easier to work with standardized effect sizes, especially when there is variation in the measures that researchers use. The main goal of reporting effect sizes, however, is to reflect on whether the observed effect size is meaningful. For example, we might be able to reliably measure that, on average, 19-year-olds will grow 1 centimeter in the next year. This difference would be statistically significant in a large enough sample, but if you go shopping for clothes when you are 19 years old, it is not something you need to care about.

The Importance of Context: An Example

In Figure 6.1 we see a graphical representation of the proportion of favorable parole decisions that real-life judges make as a function of the number of cases they process across the day. The study from which this plot is taken is mentioned in many popular science books as an example of a finding that shows that people do not always make rational decisions, but that “judicial rulings can be swayed by extraneous variables that should have no bearing on legal decisions” (Danziger et al., 2011). We see that early in the day, judges start by giving about 65% of people parole, which basically means, “All right, you can go back into society.” But then very quickly, the proportion of favorable decisions decreases to basically zero. After a quick break which, as the authors say, “may replenish mental resources by providing rest, improving mood, or by increasing glucose levels in the body”, the parole decisions are back up at 65%, and then again quickly drop down to basically zero. If we calculate the effect size for the drop after a break, and before the next break (Glöckner, 2016), the effect represents a Cohen’s d of approximately 2, which is incredibly large. There are hardly any effects in psychology this large, let alone effects of mood or rest on decision making. And this surprisingly large effect occurs not just once, but three times over the course of the day. If mental depletion actually had such a huge real-life impact, society would basically fall into complete chaos just before lunch break every day. Or at the very least, our society would have organized itself around this incredibly strong effect of mental depletion. Just like manufacturers take size differences between men and women into account when producing items such as golf clubs or watches, we would stop teaching in the time before lunch, doctors would not schedule surgery, and driving before lunch would be illegal.


We can look at a meta-meta-analysis (a paper that meta-analyzes a large number of meta-analyses in the literature) by Richard, Bond, & Stokes-Zoota (2003) to see which effect sizes in law psychology are close to a Cohen’s d of 2. They report two meta-analyzed effects that are slightly smaller. The first is the effect that a jury’s final verdict is likely to be the verdict a majority initially favored, which 13 studies show has an effect size of r = .63, or d = 1.62. The second is that when a jury is initially split on a verdict, its final verdict is likely to be lenient, which 13 studies show to have an effect size of r = .63 as well. In their entire database, some effect sizes that come close to d = 2 are the finding that personality traits are stable over time (r = .66, d = 1.76), people who deviate from a group are rejected from that group (r = .6, d = 1.5), or that leaders have charisma (r = .62, d = 1.58). You might notice the almost tautological nature of these effects. We see how examining the size of an effect can lead us to identify findings that cannot be caused by their proposed mechanisms. The effect reported in the hungry judges study must therefore be due to a confound. Indeed, such confounds have been identified, as it turns out the ordering of the cases is not random, and it is likely the cases that deserve parole are handled first, and the cases that do not deserve parole are handled later (Chatziathanasiou, 2022; Weinshall-Margel & Shapard, 2011).
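
The r-to-d conversions above follow from the standard formula (d = \frac{2r}{\sqrt{1 - r^2}}), which assumes two groups of equal size (an assumption I am adding here; it reproduces the values reported above). A minimal sketch:

```python
import math

def r_to_d(r):
    """Convert a point-biserial r to Cohen's d, assuming equal group sizes."""
    return 2 * r / math.sqrt(1 - r**2)

for r in (0.63, 0.66, 0.60, 0.62):
    print(f"r = {r:.2f}  ->  d = {r_to_d(r):.2f}")
# r = 0.63 -> d = 1.62; r = 0.66 -> d = 1.76
# r = 0.60 -> d = 1.50; r = 0.62 -> d = 1.58
```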

Identifying Implausible Effect Sizes

An additional use of effect sizes is to identify effect sizes that are too large to be plausible. Because d family effect sizes express the difference between groups in units of the variability of the observations, a Cohen’s d of 1 means the difference between the two groups is as large as one full standard deviation of the measure.
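
One way to build intuition for such magnitudes (a standard interpretation aid, not part of the original argument) is Cohen’s U3: the proportion of one group that scores above the mean of the other, assuming normal distributions with equal variances. A minimal sketch:

```python
from statistics import NormalDist

def u3(d):
    """Cohen's U3: proportion of group 1 above the mean of group 2,
    assuming normal distributions with equal variances."""
    return NormalDist().cdf(d)

for d in (0.5, 1.0, 2.0):
    print(f"d = {d:.1f}  ->  U3 = {u3(d):.1%}")
# d = 0.5 -> 69.1%; d = 1.0 -> 84.1%; d = 2.0 -> 97.7%
```

At d = 2, nearly 98% of one group would sit above the other group’s mean, which is why effects of that size from a short break strain belief.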

Effect Sizes in Educational Settings

In the last week you will probably have been in a staff development session where the presenter - be it a senior leader, school research lead or consultant - will have made some reference to effect size. Indeed, there is a very strong chance - particularly if the speaker is an advocate of the work of Professor John Hattie and Visible Learning - that they will make reference to a 0.4 SD effect size as the average expected effect size for one year of progress in school (Hattie, 2015). In other words, over the course of an academic year you should expect your pupils to make at least 0.4 SD of progress.

Unfortunately, although there is some appeal in having a simple numerical measure to represent a year’s worth of progress, it is not quite that simple, and the figure is potentially highly misleading. Wiliam (2016) states that when working out the standardised effect size in an experiment, this is quite simply the difference between the mean of the experimental group and the mean of the control group, divided by the standard deviation of the population. However, as the standard deviation of achievement tends to be greater for older pupils than for younger pupils, this means that, all other things being equal, you would expect smaller standardised effect sizes for experiments involving older pupils than for experiments with younger pupils.

Wiliam then goes on to cite the work of Bloom, Hill, Black, and Lipsey (2008), which looked at the annual progress made by pupils. Using a number of standardised assessments, Bloom et al. looked at the differences in scores achieved by pupils from one year to the next, and then divided these by the pooled standard deviations - which allowed them to calculate the effect size for a year’s worth of teaching. They found that for six-year-olds a year’s worth of growth is approximately 1.5 standard deviations, whereas for twelve-year-olds a year’s worth of growth is approximately 0.2 standard deviations. As such, although average growth for school pupils may be approximately 0.4 standard deviations, this average masks such large age differences that it has little or no value.

Elsewhere, the Sutton Trust and EEF’s Teaching and Learning Toolkit manual (EEF, 2018) uses the assumption that one year’s worth of progress is equivalent to one standard deviation. However, the EEF recognise that the notion of one standard deviation representing one year’s progress does not hold for all ages. For example, data from National Curriculum tests indicate annual progress of about 0.8 of a standard deviation at age 7, falling to 0.7 at age 11 and 0.4 at age 14.

In another study, Luyten, Merrell and Tymms (2017) looked at the impact of schooling on 3,500 pupils in Years 1 to 6 in 20 - predominantly private sector - English primary schools. They found that the year-to-year gains from schooling declined as pupils got older. For example, for the youngest pupils the effect size for progress in Reading was 1.15 standard deviations, whereas for the oldest pupils the effect size for year-to-year progress was 0.49 standard deviations. This declining trend in effect size was also seen for measures of Mental Maths and Developed Ability, although General Maths deviates from this pattern, with effect sizes being broadly consistent from year to year.

Implications for Educators

First, it’s important to remember that this analysis has focussed on the year-to-year progress made by groups of pupils; it has not looked at the effect sizes of specific interventions. As such, the 0.4 SD effect size for one year’s progress should not be confused or conflated with the 0.4 SD effect size put forward by Professor Hattie as the average effect size for factors influencing education.

Second, if the presenter of your staff development session is not aware of the issues raised by this post, you may want to - very professionally - point them in the direction of the references listed at the end of this post. This is not about embarrassing senior colleagues or guests by saying that they are wrong; rather, it’s about pointing out that the claims they are making are not uncontested.

Third, given how effect sizes vary with both the age and the diversity of the population, any attempt by teachers to judge the effectiveness of their teaching by calculating effect sizes may be seriously flawed. Primary school teachers risk overestimating their effectiveness, whereas for secondary school teachers the opposite is true. Indeed, given some critical issues with effect sizes - see Simpson (2017) and Simpson (2018) - it’s probably wise for individual teachers to steer clear of calculating them.

Finally, this blog raises all sorts of issues about when to trust the experts (Willingham, 2012). Here you have an edublogger challenging the claims made by a world-renowned educational researcher. It may be that I have misunderstood the claims of Professor Hattie. It may be that I have misunderstood the arguments of Dylan Wiliam, Hans Luyten, Steve Higgins, the EEF and others. However, what it does suggest is that it is unwise to rely upon a single expert; particularly in education, it’s worth making sure your evidence-informed practice is influenced by a range of experts.
