Children’s understanding of experimental contrast and
experimental control: an inventory for primary school

Christopher Osterhausa, Susanne Koerbera, Beate Sodianb

aFreiburg University of Education, Department of Psychology, Germany

bLudwig-Maximilians-University Munich, Department of Psychology, Germany

 

Article received 18 October / revised 29 November  / accepted 4 December / available online 19 January

Abstract

Experimentation skills are a central component of scientific thinking, and many studies have investigated whether and when primary-school children develop adequate experimentation strategies. However, the answers to these questions vary substantially depending on the type of task that is used: while discovery tasks, which require children to engage in unguided experimentation, typically do not reveal systematic skills in primary school, choice tasks suggest an early use of adequate experimentation strategies. To acquire a more accurate description of primary-school experimentation, this article proposes a novel multiple-select paper-and-pencil inventory that measures children’s understanding of experimental design. The two reported studies investigated the psychometric properties of this instrument and addressed the development of primary-school experimentation. Study 1 assessed the validity of the item format by comparing 2 items and an interview measure in a sample of 71 third- and fourth-graders (9- and 10‑year-olds), while Study 2 investigated the reliability and the convergent validity of the inventory by administering it to 411 second-, third- and fourth-graders (8-, 9- and 10‑year-olds) and by comparing children’s performance in the 11-item scale to 2 conventional experimentation tasks. The obtained results demonstrate the reliability and validity of the inventory and suggest that a solid understanding of experimental design first emerges at the end of primary school.

Keywords: experimentation skills; control-of-variable strategy (CVS); scientific thinking; primary (elementary) school; paper-and-pencil assessment

osterhaus@ph-freiburg.de

http://dx.doi.org/10.14786/flr.v3i4.220


1.      Introduction

Experimentation skills constitute a fundamental component of scientific thinking, and many developmental research studies have investigated children’s acquisition of adequate experimentation strategies (e.g., Case, 1974; Inhelder & Piaget, 1958; Kuhn & Phelps, 1982; Siegler & Liebert, 1975; for a review see Zimmerman, 2007). Two central research questions have been whether and when children begin to master the so-called control-of-variables strategy (CVS). This strategy, which is also referred to as the “vary-one-thing-at-a-time” strategy (Tschirgi, 1980), requires informative experiments to contrast a single (focal) variable while keeping all other (non-focal) variables constant. Most studies of experimentation skills probe children’s use of CVS in two different task settings: (1) discovery tasks with unrestricted variable configurations (production of CVS) and (2) choice tasks with restricted response options (choice of CVS).

Discovery tasks typically require children to explore the causal relations between different candidate causes and an outcome over a set of experimentation trials (e.g., Kuhn et al., 1995; Schauble, 1990). In each of these trials, mature reasoners form hypotheses about the system of independent variables (i.e., which variables are causal and which are non-causal), and based on these, they use CVS to isolate a single variable, which they contrast in an experiment that controls all non‑focal variables (controlled-contrastive experiment). By updating their hypotheses and repeating this procedure over multiple trials, mature reasoners arrive at a final theory of the causal standing of each variable. Discovery tasks hence involve reasoners in multiple phases of constructing scientific knowledge (i.e., the reiterative formation of hypotheses, experimentation and data interpretation). They therefore offer high ecological validity and are particularly well suited for microgenetic studies, which investigate strategy change in detail by administering the same task repeatedly, with a high density of observations, during the short period in which change is assumed to occur.

Choice tasks, in contrast, typically present children with a specific and directed hypothesis regarding one of the candidate causes. In addition, they include restricted answer options that represent distinct variable configurations from which children are allowed to choose (multiple choice [MC]; e.g., Croker & Buchanan, 2011; Koerber, Mayer, Osterhaus, Schwippert, & Sodian, 2015; Mayer, Sodian, Koerber, & Schwippert, 2014; Tschirgi, 1980). Mature reasoners understand that they need to compare conditions, and their command of CVS is demonstrated by selecting the answer option in which only the focal variable is varied. This requirement of a single choice makes choice tasks easier to administer and hence well suited for large-scale, paper-and-pencil-based assessments, which are important research tools for investigating the large interindividual differences that already exist in primary-school experimentation (Bullock, Sodian, & Koerber, 2009).

Despite their common measurement focus on CVS and the same basic task requirement (i.e., understanding that conditions need to be compared and an appropriate comparison needs to be made), discovery and choice tasks paint substantially different pictures of primary-school children’s experimentation skills. While discovery tasks tend to reveal that only a small proportion of students produce experiments designed in accordance with CVS (e.g., 17% initially used CVS in a sample of fifth- and sixth-graders; Schauble, 1990), choice tasks reveal that most primary-school children prefer controlled-contrastive experiments over confounded experiments when they are allowed to choose between different experimental designs (e.g., 54% correct CVS choices by fourth-graders in Koerber, Mayer, et al., 2015; around 60% in Bullock & Ziegler, 1999). Although differences between production and choice tasks (i.e., between open-answer and closed-response items) are a common empirical finding for knowledge scales (e.g., Nehm & Schonfeld, 2008), the large discrepancies between the two types of CVS tasks suggest that the performance differences are not solely attributable to the increased probability of correct guessing in choice tasks.

Performance differences between discovery and choice tasks might be attributable to their specific and discrepant task demands and solution strategies (see Table 1). Discovery tasks require the repeated generation of hypotheses, an ability that may develop later than experimentation skills (Piekny & Maehler, 2013) and that demands extensive memory and processing capacities. CVS choice tasks, on the other hand, might not assess children’s understanding of experimentation because they can be solved by lower-level heuristics, such as varying a single variable while keeping the others constant, without this choice being motivated by a full understanding of experimentation. Other evidence, such as children’s justifications, is therefore required before concluding that children fully understand the rationale for CVS.

 

Table 1

Student performance in and characteristics of discovery tasks, choice tasks and understanding of experimental design (UNEX).

 

 

Characteristic | Discovery tasks | Choice tasks | UNEX
Description | iterative experimentation and use of CVS to uncover causal effects | single choice of a (controlled-contrastive) experimental comparison | identification of design errors in (controlled-)contrastive experiments
Task format | open; often computerized | MC; sometimes other formats that do not favour guessing (e.g., Bullock & Ziegler, 1999) | MS
Use of CVS | 17% initial use at ages 11 and 12 (Schauble, 1990) | 8–68% (Croker & Buchanan, 2011), 54% (Koerber, Mayer, et al., 2015) and 60% (Bullock & Ziegler, 1999), all at age 10 | 29% at age 10 (cf. Study 2)
Validity | moderate: guessing is not a problem, but the task requires non-essential skills (tracking of variables [memory and processing skills], hypothesis generation) | low: guessing is a problem (especially for MC items); may be solved by lower-level heuristics | high: guessing is not a problem*; requires an understanding of the features of experimental contrast and experimental control
Reliability | high: performance estimates stable across problems | low: performance estimates differ across tasks and contexts | high: performance estimates stable across problems*
Assessment | time-consuming | rapid; allows for large-scale assessment | rapid; allows for large-scale assessment

Note. * Characteristics investigated in the two studies presented in this article; CVS = control-of-variables strategy; MC = multiple choice; MS = multiple select.

A further criticism of both discovery and choice tasks is that studies of scientific thinking, with their narrow focus on CVS, have not explored the broader context of children’s understanding of the experimental method. In addition to CVS, basic features of experimentation include the need for planned comparisons to test for differences between conditions, as well as randomization and local control, which reduce variation due to extraneous factors and which go beyond the control of non-focal variables.

Bullock et al. (2009) developed an interview that investigates the metaconceptual understanding of experimental design (UNEX) in participants aged 12–22 years. This instrument asks participants to review a set of fictitious experiments that contain diverse design errors violating the principles of local control or experimental contrast. Participants need to recognize that ill-designed controlled-contrastive experiments do not provide an adequate test of the hypothesis because variation due to extraneous factors is not reduced or non-focal variables are not controlled for (violation of CVS); they also must understand that hypothesis testing is impossible in ill-designed contrastive experiments because only a single observation is made (either due to the missing variation of the focal variable or the lack of an initial measurement in a pre-post design), so that the principle of experimental contrast is ignored. This latter experiment type provides no information about the control of non-focal variables, and the ‘violated’ design feature is simply the principle of experimental contrast, which should be easier to understand than ‘experimental control’.

Primary-school children’s UNEX has not been investigated previously. However, Bullock and her colleagues found that sixth-graders (the youngest age group interviewed with such an instrument) solved around 50% of the tasks correctly by identifying the experimental design error and justifying their opinion about whether the experiment was appropriately designed. It therefore seems plausible that even younger children show some UNEX when they are questioned in an age-appropriate format that offers contextual support, such as a graphic representation of the fictitious experiments and answer alternatives in a closed-response format.

The present studies investigated whether a paper-and-pencil version of UNEX can provide valid and reliable measurements in primary-school children, in order to determine whether early abilities can be identified in this age group. Our inventory of primary-school UNEX uses 11 closed-response, multiple-select (MS) items (examples are provided in Figure 1, and Appendix 1 provides the full item set). For each item, children have to indicate whether or not they consider a fictitious experiment to be well designed, and whether or not they agree with each of three separate justifications concerning the good or bad quality of the experiment (one of which identifies the design error in question).

Conventional MC items require a single choice, whereas our MS procedure requires three separate decisions, which has two important advantages: (1) the probability of correct guessing is substantially reduced (i.e., 12.5% instead of 33% in the case of three statements), and (2) the MS format reveals potential inconsistencies in children’s understanding of experimentation. MC items only allow the conclusion that children consider their chosen answer to be superior to the non-selected alternatives, whereas MS items provide additional information about their view on every response option (i.e., although children may recognize the design error, they may still hold naïve beliefs regarding the production of effects).
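For readers who want to verify the guessing figures, the short sketch below reproduces the underlying arithmetic; it is purely illustrative and not part of the reported analyses.

```python
# Chance of producing a fully correct response by random guessing,
# for a single item with three answer statements.

# Conventional MC: one pick among the three alternatives.
p_mc = 1 / 3              # ~33%

# MS: an independent agree/disagree decision for each of the three statements;
# all three decisions must be correct.
p_ms = (1 / 2) ** 3       # 12.5%

print(f"MC guessing probability: {p_mc:.1%}")   # 33.3%
print(f"MS guessing probability: {p_ms:.1%}")   # 12.5%
```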

The correct and incorrect answer options used in the present studies are based on children’s answers to open items in prestudies and they draw on a conceptual-development model of scientific thinking (Koerber, Mayer, et al., 2015). Specifically, each item includes naïve, intermediate and advanced-level answers. Naïve-level answers reveal no understanding of hypothesis testing, instead referring to the production of an effect (cf. answer option 3 in Figure 1), intermediate-level answers demonstrate a first understanding of the necessity of hypothesis testing and the existence of relevant design features (e.g., testing and sample size are important; cf. answer option 2), and advanced-level answers identify the specific design error and recognize that it restricts the information that can be drawn from the experiment (cf. answer option 1).

Figure 1. Sample item. (see pdf)

While research has shown that children’s (dis-)agreement with conceptually different levels on MS items matches the levels found in an interview conducted after exposure to the MS answer options (MS item then open interview [I-after]; cf. Koerber, Osterhaus, & Sodian, 2015), little is known about the relation between children’s MS choices and the beliefs they hold before being presented with the MS item (open interview then MS item [I-before]). An instrument’s reliability and validity depend on a close relation between these initial beliefs and the levels identified in the MS item.

The present work therefore first investigated (in Study 1) whether the MS item format is valid, resulting in ascriptions of levels that are significantly related to the levels found in I-before, and then (in Study 2) addressed whether the inventory results in a reliable scale with satisfactory content and convergent validity. Because primary-school children’s UNEX has not been investigated previously, Study 2 also addressed the abilities and development of primary-school children.

 

2.      Study 1

Study 1 investigated the validity of the MS item format by comparing children’s performance in two interview measures, which were conducted before (I-before) and after (I-after) the presentation of the closed-response answer options. Study 1 also investigated the relation between the novel MS and conventional MC formats, as well as the relation between MC and the interview, both before and after the presentation of the MS task (i.e., I-before and I-after), to obtain a better understanding of the relative performances of diverse testing formats.

2.1       Methods

2.1.1    Participants

The 71 included primary-school children comprised 56 third-graders (mean age 8 years, 8 months, SD=5 months; 31 girls) and 15 fourth-graders (mean age 10 years, 1 month; SD=4 months; 7 girls). Children were recruited from three predominantly middle-class schools in Germany. Parental informed consent was obtained for all children.

2.1.2.   Materials

The children were presented with and interviewed about (see Procedure) two items from our inventory: one contrastive and one controlled-contrastive experiment. Both experiments were set in an artificial context in order to reduce interference from children’s content knowledge with their design evaluations.

Item 1 (contrastive experiment) presented the children with a story about a scientist who wants to test the hypothesis that yellow fertilizer increases plant growth significantly more than blue fertilizer. The scientist administers yellow fertilizer to 100 plants. He observes that all plants develop bigger blossoms and therefore concludes that the yellow fertilizer works better than the blue one. Children were asked whether or not this was a good experiment, and they evaluated three explanations for their opinion (MS) representing the three hypothesized levels: (1) an explanation based on the production of effects (naïve level; “It was a good experiment because all plants that received the yellow fertilizer became bigger”), (2) an explanation that referred to a design feature that was not crucial to the experiment’s validity (intermediate level; “It was a good experiment because he tested the yellow fertilizer on many plants”) and (3) the correct explanation that identified the design error in question (advanced level; “It was not a good experiment because he only tested the yellow fertilizer”). After evaluating each of the three explanations (MS), the children also indicated which of the three they considered the best (MC).

Item 2 (controlled-contrastive experiment) presented the children with a story about two grandmothers who use lake or river water to water their plants. While the plants watered with lake water grow well, the plants watered with river water wither. Children had to evaluate whether the experimental data were sufficient to support the hypothesis that lake water causes plants to grow well, or whether more information would be needed (i.e., control of non-focal variables). Analogously to Item 1, the three explanations reflected the distinct levels; however, while the naïve level, as in Item 1, referred to the production of effects and the advanced level identified the design error in question, the intermediate level for this item included a reference to a potential causal mechanism rather than to a non-crucial design feature (i.e., “Lake water contains more minerals than river water. Therefore, lake water is better for plants”).

2.1.3    Procedure

Both items were read out loud to the children in a one-on-one interview. Children marked their answers in their own booklets. Interviews were conducted at two points during the presentation of each item: (1) after the children’s initial design evaluation and before the MS answer options were presented (I-before; “Why was the experiment good/bad?”), and (2) after the answer options had been presented and the children had chosen the best one (I-after; “Why did you consider this answer to be the best one?”). Both instances included follow-up questions such as “Why did this [the reason children named] make the experiment a good/bad one?” or “Would you have done anything differently?”

 

Table 2

Coding of MS and interviews

 

 

Coding of MS (final level), based on the levels children agreed to (x = agreed; --- = not agreed):

Naïve | Intermediate | Advanced | Final level (MS coding)
x | --- | --- | Naïve
x | x | --- | Naïve
x | x | x | Naïve
x | --- | x | Naïve
--- | --- | --- | Naïve
--- | x | --- | Intermediate
--- | x | x | Intermediate
--- | --- | x | Advanced

Coding of interviews (sample answers):

Naïve | “because the fertilizer made the plants look more beautiful” (production of effect)
Intermediate | “because he [the scientist] tried it on so many plants” (reference to non-relevant design feature); “because lake water is often dirty, whereas not much rubbish is thrown into river water” (reference to mechanism)
Advanced | “because he [the scientist] did not try both fertilizers” (correct identification of design error)

2.1.4    Transcription and coding of children’s answers

Interviews were audiotaped, transcribed verbatim and coded by two independent raters (see Table 2 for coding examples and the MS coding). Using a strict criterion, the lowest level that a child agreed with was taken as the final level in the coding of the MS item (e.g., if a child accepted the naïve and intermediate levels simultaneously, the MS item was coded as naïve). The interrater kappa reliability values for I-before and I-after were .94 and .88, respectively, for Item 1, and .73 and .90 for Item 2. Since some children gave invalid answers (especially on I-before; e.g., they refused to answer or only gave answers that were irrelevant to the question), some of the subsequent analyses involved a smaller sample.
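The strict MS coding criterion can also be stated algorithmically. The following sketch is a hypothetical helper (our own illustration, not the coding script used in the study) that maps a child’s agreement pattern onto the final level shown in Table 2.

```python
# Minimal sketch of the strict MS coding rule from Table 2 (hypothetical helper,
# not the study's coding script): the lowest level a child agrees with determines
# the final code; a child who endorses no statement at all is coded as naive.

LEVELS = ["naive", "intermediate", "advanced"]  # ordered from lowest to highest

def code_ms_response(agreed: dict) -> str:
    """agreed maps each level name to True (child agreed) or False (child disagreed)."""
    for level in LEVELS:               # scan from the lowest level upwards
        if agreed.get(level, False):
            return level               # lowest endorsed level wins (strict criterion)
    return "naive"                     # no endorsement -> naive (cf. Table 2)

# Example: a child endorsing both the naive and the advanced statement is coded as naive.
print(code_ms_response({"naive": True, "intermediate": False, "advanced": True}))
```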

2.2.      Results and discussion

2.2.1    Core performance

Table 3 presents the core performance data for Items 1 and 2. A Wilcoxon signed-rank test of I-before revealed that, as expected, more children recognized the design error in the contrastive (Mdn=1) than in the controlled-contrastive (Mdn=0) experiment, Z=2.92, p<.01, r=.30.

2.2.2    Comparison of the different formats

To investigate whether meaningful relations existed between the different formats in the two items (i.e., whether there was high agreement in the levels assigned), we computed Spearman correlations (rho) and, in addition, Wilcoxon signed-rank statistics (Z) when non-significant correlations suggested low convergence, which may have been a result of over- or underestimation in one of the formats.
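As an illustration of how such convergence statistics can be computed, the sketch below runs both tests in SciPy on made-up level codes (0 = naïve, 1 = intermediate, 2 = advanced); it is not the original analysis script, and the data are invented.

```python
# Illustrative convergence statistics for two item formats on hypothetical paired
# level codes (0 = naive, 1 = intermediate, 2 = advanced); not the original analysis.
from scipy.stats import spearmanr, wilcoxon

levels_a = [0, 2, 1, 1, 0, 1, 2, 0, 1, 2, 0, 1]   # e.g., I-before codes (made up)
levels_b = [1, 2, 2, 2, 1, 0, 1, 1, 2, 1, 1, 2]   # e.g., MC codes, same children (made up)

rho, p_rho = spearmanr(levels_a, levels_b)   # convergence between the two formats
stat, p_w = wilcoxon(levels_a, levels_b)     # systematic over-/underestimation

print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
print(f"Wilcoxon statistic = {stat:.1f} (p = {p_w:.3f})")
```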

For Item 1 (Table 4), correlations between formats were significant for all comparisons, except for I-before and MC, where the correlation did not reach significance. However, a Wilcoxon signed-rank test revealed no significant difference in difficulty between MC and I-before, and neither format led to a systematic over- or underestimation: 24% of all children performed better on MC than on I-before, while 27% were assigned a higher level in I-before than in MC (cf. Table 4).

For Item 2 (Table 5), only the correlation between I-before and the MS task was significant. A Wilcoxon signed-rank test revealed that the MC question (Mdn=1) significantly overestimated performance with respect to I-before. Similarly, MS was significantly more difficult than MC, just as I-before was more difficult than I-after, suggesting that presenting answer options significantly decreased the difficulty of identifying the design error in this second item with its more complex experimental design.

 

Table 3

Core performance data for Items 1 and 2 (contrastive and controlled-contrastive experiment)

 

Item / format | Naïve level | Intermediate level | Advanced level | Total

Item 1
I-before | 15 (33) | 11 (24) | 19 (42) | 45 (100)
MS | 62 (87) | 1 (1) | 8 (11) | 71 (100)
MC | 23 (32) | 14 (20) | 34 (48) | 71 (100)
I-after | 18 (26) | 29 (41) | 23 (33) | 70 (100)

Item 2
I-before | 25 (51) | 16 (33) | 8 (16) | 49 (100)
MS | 53 (75) | 13 (18) | 5 (7) | 71 (100)
MC | 10 (14) | 26 (37) | 35 (49) | 71 (100)
I-after | 21 (30) | 26 (37) | 23 (33) | 70 (100)

Notes. Data are n (%) values. I-before = open interview then MS item; I-after = MS item then open interview.


Table 4

Convergence between different item formats for Item 1 (contrastive experiment)

 

 

Test pair (a/b): I-before & I-after (n = 45; rho = .43**; Z = ---)
  Same level (a=b): naïve-naïve 4 (9); int.-int. 5 (11); adv.-adv. 14 (31); total 23 (51)
  Overestimation (b>a): naïve-int. 7 (16); naïve-adv. 4 (9); int.-adv. 6 (13); total 17 (38)
  Underestimation (b<a): adv.-int. 1 (2); adv.-naïve 4 (9); int.-naïve 0 (0); total 5 (11)

Test pair (a/b): I-before & MC (n = 45; rho = .24; Z = -.08)
  Same level (a=b): naïve-naïve 5 (11); int.-int. 3 (7); adv.-adv. 14 (31); total 22 (49)
  Overestimation (b>a): naïve-int. 4 (9); naïve-adv. 6 (13); int.-adv. 1 (2); total 11 (24)
  Underestimation (b<a): adv.-int. 0 (0); adv.-naïve 5 (11); int.-naïve 7 (16); total 12 (27)

Test pair (a/b): I-before & MS (n = 45; rho = .38*; Z = ---)
  Same level (a=b): naïve-naïve 15 (33); int.-int. 0 (0); adv.-adv. 5 (11); total 20 (44)
  Overestimation (b>a): naïve-int. 0 (0); naïve-adv. 0 (0); int.-adv. 0 (0); total 0 (0)
  Underestimation (b<a): adv.-int. 0 (0); adv.-naïve 14 (31); int.-naïve 11 (24); total 25 (55)

Test pair (a/b): MC & MS (n = 71; rho = .31**; Z = ---)
  Same level (a=b): naïve-naïve 23 (32); int.-int. 1 (1); adv.-adv. 8 (11); total 32 (44)
  Overestimation (b>a): naïve-int. 0 (0); naïve-adv. 0 (0); int.-adv. 0 (0); total 0 (0)
  Underestimation (b<a): adv.-int. 0 (0); adv.-naïve 26 (36); int.-naïve 13 (18); total 39 (54)

Notes. Data are n (%) values. Int. = intermediate; adv. = advanced; Z = Wilcoxon signed-rank test. Over- and underestimation refer to the second measure (b) relative to the first (a). Rho and Z are computed for children with complete observations on both measures of the respective test pair (i.e., children who did not give a valid answer on one of the measures were excluded from that analysis).
* p<.05. ** p<.01.

 

Table 5

Convergence between different item formats for Item 2 (controlled-contrastive experiment)

 

 

Test pair (a/b): I-before & I-after (n = 49; rho = .23; Z = 3.29**, r = .29)
  Same level (a=b): naïve-naïve 7 (14); int.-int. 6 (12); adv.-adv. 6 (12); total 19 (38)
  Overestimation (b>a): naïve-int. 10 (20); naïve-adv. 8 (16); int.-adv. 5 (10); total 23 (46)
  Underestimation (b<a): adv.-int. 2 (4); adv.-naïve 0 (0); int.-naïve 5 (10); total 7 (14)

Test pair (a/b): I-before & MC (n = 49; rho = .15; Z = 4.31***, r = .39)
  Same level (a=b): naïve-naïve 4 (8); int.-int. 7 (14); adv.-adv. 6 (12); total 17 (34)
  Overestimation (b>a): naïve-int. 9 (18); naïve-adv. 12 (25); int.-adv. 7 (14); total 28 (57)
  Underestimation (b<a): adv.-int. 2 (4); adv.-naïve 0 (0); int.-naïve 2 (4); total 4 (8)

Test pair (a/b): I-before & MS (n = 49; rho = .64***; Z = ---)
  Same level (a=b): naïve-naïve 22 (45); int.-int. 1 (2); adv.-adv. 2 (4); total 25 (51)
  Overestimation (b>a): naïve-int. 2 (4); naïve-adv. 1 (2); int.-adv. 2 (4); total 5 (10)
  Underestimation (b<a): adv.-int. 4 (8); adv.-naïve 2 (4); int.-naïve 13 (27); total 19 (39)

Test pair (a/b): MC & MS (n = 71; rho = .24; Z = 6.34***, r = .53)
  Same level (a=b): naïve-naïve 9 (13); int.-int. 5 (7); adv.-adv. 5 (7); total 19 (27)
  Overestimation (b>a): naïve-int. 1 (1); naïve-adv. 0 (0); int.-adv. 0 (0); total 1 (1)
  Underestimation (b<a): adv.-int. 7 (10); adv.-naïve 23 (32); int.-naïve 21 (30); total 51 (72)

Notes. Data are n (%) values. Int. = intermediate; adv. = advanced; Z = Wilcoxon signed-rank test; r = effect size.
** p<.01. *** p<.001.

Overall, MC items reliably identified advanced conceptions from I-before (14 out of 19 for Item 1, 6 out of 8 for Item 2), while MS identified naïve conceptions particularly well (15 out of 15 for Item 1, 22 out of 25 for Item 2). However, with respect to I-before, the MC item produced several false-positive advanced conceptions (i.e., many children were assigned the advanced level even though they still held naïve beliefs), while the MS pattern revealed a low sensitivity for intermediate- and advanced-level answers from I-before. Although this pattern may indicate a validity problem with MS, our interview data suggest that the low performance on MS is attributable to the difficulty children experience in overcoming naïve beliefs. For example, several children gave a correct answer in I-before, correctly identifying the design error in question (e.g., “It was not a good experiment because he didn’t try the blue fertilizer”), but they still agreed with the naïve answer when the MS answer options were presented (e.g., “It was a good experiment because the scientist found that all plants got bigger blossoms”). Although this indicates that these children understand that the missing experimental contrast means that the hypothesis cannot be tested, they do not yet understand that hypothesis testing is the only purpose of experimentation and still uphold naïve beliefs regarding the production of effects.

Together these results indicate that the presentation of answer options (in MC) may lead to an overestimation of performance when the item difficulty is high (controlled-contrastive experiment) and children are allowed to choose the best answer. However, making children indicate whether or not they agree with each answer option (in MS) reveals potentially conflicting views in their thinking, and means that their competence is only confirmed once they have formed coherent advanced conceptions. Therefore, MS is superior to MC when investigating profound—as opposed to beginning—competencies.

 

3.      Study 2

Study 2 investigated the reliability and the content and convergent validities of our 11-item inventory. The content validity was investigated by testing whether contrastive experiments (contrastive UNEX) are, as hypothesized, easier than controlled-contrastive experiments (controlled-contrastive UNEX), while convergent validity was assessed by comparing UNEX to the performance in two conventional CVS choice tasks. In addition, Study 2 assessed interindividual differences in the development of UNEX in primary school.

3.1       Methods

3.1.1    Participants

The 411 included primary-school children comprised 128 second-graders (mean age 8 years, 3 months; SD=7 months; 66 girls), 137 third-graders (mean age 9 years, 4 months; SD=7 months; 62 girls) and 146 fourth-graders (mean age 10 years, 2 months; SD=9 months; 78 girls). Children were recruited from 25 classrooms in predominantly middle-class schools in Germany. Parental informed consent was obtained for all children. Study 2 was part of a larger study of the cognitive development of primary-school children.

3.1.2    Materials

UNEX. The 11 UNEX items were designed analogously to the sample items presented in Study 1 and comprised 6 contrastive and 5 controlled-contrastive experiments (Appendix 1 provides the full item set). The contrastive experiments comprised two experiments each with one of three design errors: manipulation of only a single value of the focal variable, a missing control group, and a missing baseline measurement. The controlled-contrastive experiments did not control non-focal variables.

CVS choice tasks. We adopted two tasks from the literature (Bullock & Ziegler, 1999; Chen & Klahr, 1999) that both presented children with three variables (each with two possible values) and a hypothesis regarding one of them. Our adaptation of Bullock and Ziegler’s cars task involved the children testing the hypothesis that the speed of a car depends on the size of its tires (wide vs. narrow; non-focal variables: car and spoiler types). Our adaptation of Chen and Klahr’s slopes task involved the children testing the hypothesis that the steepness of a slope influences the distance a marble travels after rolling down it (steep vs. flat; non-focal variables: size and position of the marble). All 8 (= 2³) variable combinations in each item were presented graphically, and the children were asked to select the 2 cars or slopes that they wanted to compare.
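To make the structure of these choice tasks concrete, the following sketch enumerates the 2³ = 8 variable combinations and the pairs that differ only in the focal variable (i.e., valid CVS comparisons). The variable names are our own shorthand for the cars task, not the study materials.

```python
# Illustrative enumeration of the 2^3 = 8 variable combinations of a CVS choice task
# (hypothetical variable names modelled on the cars task; not the study materials).
from itertools import combinations, product

variables = {
    "tires":   ["wide", "narrow"],   # focal variable (the hypothesis concerns tire size)
    "car":     ["A", "B"],           # non-focal
    "spoiler": ["with", "without"],  # non-focal
}

cars = list(product(*variables.values()))   # all 8 possible cars
pairs = list(combinations(cars, 2))         # all 28 possible pairs of cars to compare

def is_cvs_pair(a, b, focal_index=0):
    """A controlled-contrastive (CVS) pair varies only the focal variable."""
    differing = [i for i in range(len(a)) if a[i] != b[i]]
    return differing == [focal_index]

cvs_pairs = [p for p in pairs if is_cvs_pair(*p)]
print(f"{len(cars)} cars, {len(pairs)} possible comparisons, {len(cvs_pairs)} valid CVS pairs")
# -> 8 cars, 28 possible comparisons, 4 valid CVS pairs
```

A CVS choice thus corresponds to selecting one of only 4 controlled-contrastive pairs out of the 28 possible comparisons, which is why correct guessing is unlikely in this format.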

3.1.3    Procedure

All items were illustrated and presented in booklets that the children worked on individually in a whole-class testing procedure (taking approximately 30 minutes). Tasks and (verbal) answer options were read out by an experimenter and displayed in a PowerPoint presentation. Test assistants ensured that the children worked on their own booklets only, and assisted children who had any questions regarding the procedure.

3.2       Results and discussion

3.2.1    Core performance

Core performance data for all items (see Table 6) indicate a high frequency of naïve MS answers (all >60%), and lower frequencies of intermediate and advanced conceptions (all <20% and <25%, respectively).

3.2.2    Scale analysis

The reliability of the scale was determined by fitting a partial credit model (Masters, 1982) to the children’s responses. This model assumes that developmental progression is unidirectional and that the categories (i.e., naïve, intermediate and advanced) are hierarchical, reflecting the assumption of our conceptual-development model. All but one item (Item U5, a contrastive experiment) had a good fit to the model (i.e., 0.85 < infit mean-square statistic [MNSQ] < 1.15; see Table 6). Removing this poorly fitting item yielded a scale with an expected a posteriori/plausible values (EAP/PV) reliability of .82 (weighted-likelihood estimator person separation reliability=.55, Cronbach’s α=.85; scale mean=5.35, SD=5.85, minimum=0, maximum=22). All items satisfied the partial credit model’s requirements of increasing point-biserial correlations and ability estimates per category. However, none of our items fulfilled a third criterion (ordered delta parameters). It is debated in the literature whether this is a mandatory requirement for model fit (Adams, Wu, & Wilson, 2012), but the unordered delta parameters are consistent with the low frequency with which children chose the intermediate level. We therefore dichotomized the data by collapsing the naïve and intermediate levels. A Rasch model fitted to the resulting binary data revealed a good fit for all items (0.85 < infit MNSQ < 1.15) and an EAP/PV reliability of .71 (weighted-likelihood estimator person separation reliability=.62, Cronbach’s α=.72). All subsequent analyses are based on these binary data.
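For readers who want to reproduce this kind of scale check, the sketch below computes Cronbach’s alpha for dichotomized (0/1) item scores using plain NumPy. It uses simulated data and does not reimplement the partial credit or Rasch analyses reported above.

```python
# Illustrative computation of Cronbach's alpha on dichotomized (0/1) item scores;
# simulated data, and no substitute for the partial credit / Rasch analyses above.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=(411, 10))   # hypothetical: 411 children x 10 binary items

def cronbach_alpha(x: np.ndarray) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")  # close to 0 for random data
```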

 

Table 6
Percentage of answers on the naïve, intermediate and advanced level for the 11 UNEX items of Study 2 overall, and the item difficulty, discrimination and item fit.


 

 

Item | Experiment type | Naïve (0) | Inter. (1) | Adv. (2) | Difficulty | Discrimination | Infit MNSQ
U1 | Contrastive | 62.4 | 19.9 | 17.7 | –.14 | .57 | 1.10
U2 | Contrastive | 67.8 | 9.3 | 22.9 | –.32 | .66 | 1.03
U3 | Contrastive | 74.2 | 7.6 | 18.2 | .20 | .69 | 0.92
U4 | Contrastive | 73.2 | 4.4 | 22.4 | –.14 | .66 | 1.03
U5 | Contrastive | 65.2 | 10.4 | 24.3 | –.48 | .80 | 0.74#
U6 | Contrastive | 73.2 | 5.6 | 21.2 | .01 | .45 | 0.96
U7 | Controlled-contrastive | 73.7 | 15.1 | 11.2 | .56 | .43 | 1.04
U8 | Controlled-contrastive | 72.7 | 4.6 | 22.7 | –.15 | .63 | 1.05
U9 | Controlled-contrastive | 72.3 | 11.5 | 16.2 | .69 | .49 | 0.90
U10 | Controlled-contrastive | 68.1 | 12.9 | 19.9 | –.04 | .69 | 0.91
U11 | Controlled-contrastive | 74.0 | 5.6 | 20.4 | –.06 | .72 | 0.89

Notes. Difficulty, discrimination and the infit mean-square statistic (MNSQ) are based on an analysis of the partial credit model; large negative item difficulties indicate easy items, while large positive values indicate difficult ones; infit MNSQ should lie between 0.85 and 1.15 for an item to show a good fit to the partial credit model; # indicates an item with a poor fit according to infit MNSQ. Item numbering applies to Study 2 and the full item set given in Appendix 1.

3.2.3    Developmental patterns and interindividual differences

A univariate analysis of variance revealed a significant main effect of grade on UNEX, F(2,402)=21.75, p<.001, partial η2=.10. Interestingly, while fourth-graders recognized significantly more design errors than did third-graders, F(1,402)=24.14, p<.001, partial η2=.06 (see Figure 2), there was no difference between grades 2 and 3, F(1,402)=1.89, p>.05. There was also no difference between boys and girls, F(1,402)=3.18, p>.05.

3.2.4    Content validity

An explanatory item response model (De Boeck & Wilson, 2004) with a fixed person effect (age) revealed that controlled-contrastive experiments were, as hypothesized, more difficult than contrastive experiments (see Table 7 for model fit, Table 8 for parameter estimates).

Figure 2. Percentage of correct answers for contrastive and controlled-contrastive experimental designs per grade. Error bars indicate 95% confidence intervals. (see pdf)


Table 7

Model comparisons for the explanatory item response model

 

Model | Effects (fixed) | Effects (random) | AIC | BIC | -2LL | df | LR test
M0 (1PL) | --- | Intercept | 3493.3 | 3506.1 | 3489.3 | --- | ---
M1 | Experiment type | Intercept | 3483.3 | 3502.6 | 3477.4 | 2 | 11.87***
M2 | Experiment type + age | Intercept | 3465.4 | 3491.0 | 3457.4 | 1 | 20.01***

 Notes. M0 = reference model (1-parameter logistic model); AIC = Akaike information criterion; BIC = Bayesian information criterion; -2LL = deviance; df = degrees of freedom; LR test = Likelihood ratio test.
*** p<.001.

 

Table 8

Coefficients for M2

Model | B | SE | p
Intercept | –7.20 | 1.14 | < .001
Age | .54 | .12 | < .001
Controlled-contrastive (reference: contrastive) | –.33 | .01 | < .001



3.2.5    Convergent validity

About one-third of the children (130) chose CVS in the cars task and in the slopes task, respectively; however, only 55 children (14%) chose CVS consistently across both items. The cars task was solved correctly by 20%, 33% and 44% of second-, third- and fourth-graders, respectively; the corresponding rates for the slopes task were 18%, 35% and 46%. The overall performance was thus slightly worse than that reported by Bullock and Ziegler (1999), where approximately 40% and 60% of third- and fourth-graders, respectively, correctly solved a CVS choice task. These differences are probably due to differences in sociodemographic characteristics between the previous urban sample and our more rural sample (cf. Koerber, Mayer, et al., 2015).

Interestingly, whereas the change-all strategy (vary all—including non-focal—variables) was the third most frequently chosen strategy for the cars task (20%), it was the least popular strategy for the slopes task (4%). In contrast to the cars task, where low performance resulted mostly from children disregarding the necessity of experimental control, errors in the slopes task were primarily due to children choosing an incorrect focal variable (i.e., size or position of the marble instead of slope steepness). While both these errors reflect unsuccessful coordination of hypothesis and evidence (cf. Kuhn, 2011), they suggest that the content domain of the specific task and children’s knowledge thereof influence the expression of this interference between children’s hypotheses and their construction of evidence.

A binomial regression revealed that performance in the cars task was significantly predicted by controlled-contrastive UNEX, χ2(1)=4.67, p=.03. Specifically, children with a more profound controlled-contrastive UNEX were more likely to use CVS than any other strategy, β=.13, t(1)=4.62, p=.03, odds ratio=1.13. However, contrastive UNEX did not predict strategy choice in the cars task, β=.02, t(1)=.11, p=.74. In contrast, the use of CVS in the slopes task was predicted by contrastive UNEX, χ2(1)=8.16, p=.004. Specifically, children with a high contrastive UNEX showed an increased use of CVS, β=.13, t(1)=8.02, p=.005, odds ratio=1.14, while there was no positive effect of controlled-contrastive UNEX, β=–.08, t(1)=2.03, p=.15. This finding is consistent with the descriptive data for the slopes task, which suggest that few children chose an experiment in which all non-focal variables were varied. A possible explanation is that children only contrast non-focal variables when their individual influences are unknown to them and they want to find out about them in a single test (cf. Schauble, 1990).
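The binomial regressions reported here can in principle be run as ordinary logistic regressions. The sketch below shows one possible statsmodels version with simulated data and hypothetical variable names (our naming, not the original analysis code).

```python
# Illustrative logistic (binomial) regression predicting CVS use in a choice task
# from contrastive and controlled-contrastive UNEX scores; simulated data and
# hypothetical variable names, not the original analysis.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 411
contrastive_unex = rng.integers(0, 7, size=n)   # 0-6 correctly solved contrastive items
controlled_unex = rng.integers(0, 6, size=n)    # 0-5 correctly solved controlled-contrastive items

# Simulate CVS use so that it depends on controlled-contrastive UNEX only.
p_cvs = 1 / (1 + np.exp(-(-1.5 + 0.3 * controlled_unex)))
cvs_used = rng.binomial(1, p_cvs)

X = sm.add_constant(np.column_stack([contrastive_unex, controlled_unex]))
result = sm.Logit(cvs_used, X).fit(disp=False)

print(result.summary(xname=["const", "contrastive_unex", "controlled_unex"]))
print("Odds ratios:", np.exp(result.params))    # change in odds per additional item solved
```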

While 68% and 56% of the children with mastery of controlled-contrastive UNEX (≥ 4 items correctly solved) solved the cars and slopes tasks correctly, 70% and 68% of the children without such mastery (< 4 items correctly solved) did not apply CVS to these tasks, χ2(1)=16.90 and χ2(1)=6.43, respectively; both p<.05. Conversely, around 30% of the children without mastery of controlled-contrastive UNEX solved the cars or slopes task, suggesting that choice tasks lead to an overestimation of performance, even when the probability of correct guessing is small due to the large number of answer alternatives (28 in Study 2).

To summarize, these results support the reliability and the content and convergent validities of our inventory. In addition, the data show that children’s UNEX is rudimentary at the beginning of primary school, but that it increases to a percentage of correct answers of around 30% by the end of primary school. This performance level is lower than that in choice tasks, which focus on children’s beginning understanding of experimentation (e.g., 54% used CVS in an MC task by Koerber, Mayer, et al., 2015), but higher than that in discovery tasks (e.g., 17% initial use in Schauble, 1990).

 

4.      General discussion

The findings of the two studies presented in this article suggest that our inventory of primary-school UNEX is a reliable measure with both content and convergent validity, and that MS is a well-suited and valid item format that yields a performance estimate comparable to that obtained with an interview measure. Our results further suggest that developmental progressions in UNEX take place in the late primary-school years, when a solid UNEX first emerges.

4.1       Validity of the item format

In contrast to conventional choice tasks in CVS, our inventory does not employ the more common MC item format, instead relying on MS, which requires children to independently agree or disagree with a set of statements (qualitatively different levels in the present studies). Study 1 revealed that this item format yields a reliable estimate of children’s abilities that is comparable to an open-interview measure that is presented before exposure to the answer options (I-before). While Koerber, Osterhaus, and Sodian (2015) demonstrated a significant relation between children’s MS choices in scientific thinking tasks and their performance in an interview measure that is conducted after exposure to the answer alternatives (I-after), the present studies are the first to validate the MS format against I-before. The significant relation between MS and I-before is an important result because our data indicate that revealing the answer alternatives of an item to children may lead to an overestimation of the performance for I-after, especially when the item content is difficult. As with the results of Koerber, Osterhaus, and Sodian (2015), Study 1 additionally corroborates the finding that the MS format is especially well suited for obtaining a reliable estimate of consolidated rather than beginning abilities.

4.2       Reliability and validity of the inventory

While CVS choice tasks are often strongly influenced by non-essential task characteristics, resulting in large performance discrepancies between items (Croker & Buchanan, 2011; see also the finding for the cars and slopes tasks in Study 2), the solving rates did not vary substantially between different items in our inventory, and a Rasch analysis indicated that the scale had a good reliability. This is an important finding since a solid and thoroughly tested instrument is needed for large-scale studies of the interindividual differences in scientific thinking and experimentation skills that already exist in primary school (Bullock et al., 2009). While inventories of primary-school children’s general scientific thinking have recently been proposed and tested (Koerber, Mayer, et al., 2015), our inventory of primary-school UNEX is the first instrument that exclusively focuses on experimentation skills in early primary school; for example, Hammann, Phan, Ehmer, and Grimm (2008) provided an instrument that was only suited to grade 5 and above.

In addition to the reliability of our inventory, Study 2 demonstrated its content and convergent validity. The content validity was supported by the finding that contrastive experiments were, as predicted by our hypothesis, easier than controlled-contrastive experiments, which, in addition to an understanding of experimental contrast, require an understanding of experimental control. The convergent validity was supported by our finding that UNEX predicted performance in the cars and slopes tasks. However, while choosing CVS in the cars task was significantly related to controlled-contrastive UNEX, in the slopes task this choice was predicted by contrastive UNEX. This result reveals a problem with conventional CVS choice tasks, and suggests that contrastive UNEX suffices for correctly solving tasks that include non-focal variables about whose effects children hold strong beliefs (e.g., how the size of a marble and its position on the slope influence the distance it travels). According to this interpretation, children only vary non-focal variables when their influences are unknown to them (e.g., the influences of the car and spoiler types on car speed) and they want to uncover all individual effects in a single experimental test (cf. Schauble, 1990). Therefore, controlled-contrastive UNEX seems to be especially important when children perform experiments that involve non-focal variables whose effects are unknown to them.

4.3       Development of UNEX in primary school

Our theoretical, conceptual-development model, on which the item construction was based, suggests that UNEX develops from naïve through intermediate to more advanced levels of understanding. However, while the partial credit model generally fitted the data, the descriptive statistics in Study 2 suggested that only a small percentage of primary-school children perform at an intermediate level. This predominance of naïve classifications is mostly due to many children agreeing with the naïve statement even when they also agreed with the intermediate one (cf. also Study 1). Therefore, although an important step towards a more mature UNEX is understanding that hypothesis testing is necessary and that there are features of experimental design that influence the quality of an experiment, this realization does not appear to be sufficient for overcoming naïve beliefs.

A beginning UNEX first emerges late in primary school, where we found a correct performance rate of about 30%. This estimate, which lies between those for discovery tasks (e.g., 17% initially using CVS in grades 5 and 6; Schauble, 1990) and choice tasks (e.g., 54% in grade 4; Koerber, Mayer, et al., 2015), differs significantly from performance in early primary school (grades 2 and 3), which is still low, typically at 10–15%. Because experimentation is not explicitly taught in German primary schools, it is important to identify the mechanisms underlying this substantial development in children’s UNEX during primary school. Potential mechanisms include children’s metacognitive development (Kuhn, 2000; Lockl & Schneider, 2002) and increases in their executive control, which may allow children to inhibit an experiential processing of the experimental designs in favour of analytical processing (cf. Amsel et al., 2008; Klaczynski, 2000).

Future research needs to address these issues and reveal how development in UNEX can be understood and promoted in primary school. Our development of a well-tested instrument that is psychometrically sound and can be used in large-scale studies is an important prerequisite for these studies, and it should promote future work on this important component of scientific thinking.

                                                          

Keypoints

*       A novel 11-item inventory of primary-school children’s understanding of experimental design yields a valid and reliable estimate of children’s experimentation skills.

*       The multiple-select item format serves as a strict criterion, revealing profound—rather than beginning—competencies.

*       A coherent, advanced understanding of experimentation first emerges late in primary school.

Acknowledgments

We are grateful to all research assistants for their help in the data collection, as well as to all teachers, children, and parents for their friendly collaboration and support of this research.

References

Adams, R. J., Wu, M. L., & Wilson, M. (2012). The Rasch rating model and the disordered threshold controversy. Educational and Psychological Measurement, 72, 547–573. doi:10.1177/00131644114321

Amsel, E., Klaczynski, P. A., Johnston, A., Bench, S., Close, J., Sadler, E., & Walker, R. (2008). A dual-process account of the development of scientific reasoning: The nature and development of metacognitive intercession skills. Cognitive Development, 23, 452–471. doi:10.1016/j.cogdev.2008.09.002

Bullock, M., Sodian, B., & Koerber, S. (2009). Doing experiments and understanding science: Development of scientific reasoning from childhood to adulthood. In W. Schneider & M. Bullock (Eds.), Human development from early childhood to early adulthood. Findings from the Munich Longitudinal Study (pp. 173–197). Mahwah, NJ: Erlbaum.

Bullock, M., & Ziegler, A. (1999). Scientific reasoning: Developmental and individual differences. In F. E. Weinert & W. Schneider (Eds.), Individual development from 3 to 12. Findings from the Munich Longitudinal Study (pp. 38–54). Cambridge, UK: Cambridge University Press.

Case, R. (1974). Structures and strictures: Some functional limitations on the course of cognitive growth. Cognitive Psychology, 6, 544–573. doi:10.1016/0010-0285(74)90025-5

Chen, Z., & Klahr, D. (1999). All other things being equal: Acquisition and transfer of the control of variables strategy. Child Development, 70, 1098–1120. doi:10.1111/1467-8624.00081

Croker, S., & Buchanan, H. (2011). Scientific reasoning in a real-world context: The effect of prior belief and outcome on children’s hypothesis-testing strategies. British Journal of Developmental Psychology, 29, 409–424. doi:10.1348/026151010X496906

De Boeck, P., & Wilson, M. (2004). A framework for item response models. New York, NY: Springer.

Hammann, M., Phan, T. T. H., Ehmer, M., & Grimm, T. (2008). Assessing pupils’ skills in experimentation. Journal of Biological Education, 42, 66–72. doi:10.1080/00219266.2008.9656113

Inhelder, B., & Piaget, J. (1958). The growth of logical thinking from childhood to adolescence. New York, NY: Basic Books.

Klaczynski, P. A. (2000). Motivated scientific reasoning biases, epistemological beliefs, and theory polarization: A two-process approach to adolescent cognition. Child Development, 71, 1347–1366. doi:10.1111/1467-8624.00232

Koerber, S., Mayer, D., Osterhaus, C., Schwippert, K., & Sodian, B. (2015). The development of scientific thinking in elementary school: A comprehensive inventory. Child Development, 86, 327–336. doi:10.1111/cdev.12298

Koerber, S., Osterhaus, C., & Sodian, B. (2015). Testing primary-school children’s understanding of the nature of science. British Journal of Developmental Psychology, 33, 57–72. doi:10.1111/bjdp.12067

Kuhn, D. (2000). Metacognitive development. Current Directions in Psychological Science, 9, 178–181. doi:10.1111/1467-8721.00088

Kuhn, D. (2011). What is scientific thinking and how does it develop? In U. Goswami (Ed.), The Wiley-Blackwell handbook of childhood cognitive development (2nd ed., pp. 472–523). Oxford, UK: Wiley.

Kuhn, D., Garcia-Mila, M., Zohar, A., Andersen, C., White, S. H., Klahr, D., & Carver, S. M. (1995). Strategies of knowledge acquisition. Monographs of the Society for Research in Child Development, 60(4), 1–128.

Kuhn, D., & Phelps, E. (1982). The development of problem-solving strategies. In H. Reese (Ed.), Advances in child development and behavior (Vol. 17, pp. 2–44). New York, NY: Academic Press. doi:10.1016/S0065-2407(08)60356-0

Lockl, K., & Schneider, W. (2002). Developmental trends in children’s feeling-of-knowing judgements. International Journal of Behavioral Development, 26, 327–333. doi:10.1080/01650250143000210

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174. doi: 10.1007/BF02296272

Mayer, D., Sodian, B., Koerber, S., & Schwippert, K. (2014). Scientific reasoning in elementary school children: Assessment and relations with cognitive abilities. Learning & Instruction, 29, 43–55. doi:10.1016/j.learninstruc.2013.07.005

Nehm, R. H., & Schonfeld, I. R. (2008). Measuring knowledge of natural selection: A comparison of the CINS, an open-response instrument, and an oral interview. Journal of Research in Science Teaching, 45, 1131–1160. doi:10.1002/tea.20251

Piekny, J., & Maehler, C. (2013). Scientific reasoning in early and middle childhood: The development of domain-general evidence evaluation, experimentation, and hypothesis generation skills. British Journal of Developmental Psychology, 31, 153–179. doi:10.1111/j.2044-835X.2012.02082.x

Schauble, L. (1990). Belief revision in children: The role of prior knowledge and strategies for generating evidence. Journal of Experimental Child Psychology, 49, 31–57. doi:10.1016/0022-0965(90)90048-D

Siegler, R. S., & Liebert, R. M. (1975). Acquisition of formal scientific reasoning by 10- and 13-year-olds: Designing a factorial experiment. Developmental Psychology, 11, 401–402. doi:10.1037/h0076579

Tschirgi, J. E. (1980). Sensible reasoning: A hypothesis about hypotheses. Child Development, 51, 1–10. doi:10.2307/1129583

Zimmerman, C. (2007). The development of scientific thinking skills in elementary and middle school. Developmental Review, 27, 172–223. doi:10.1016/j.dr.2006.12.001


Appendix 1

Full item set (see pdf)