Edited by: Karin Binder, University of Regensburg, Germany

Reviewed by: Niki Pfeifer, University of Regensburg, Germany; Catarina Dutilh Novaes, University of Groningen, Netherlands

This article was submitted to Educational Psychology, a section of the journal Frontiers in Education

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

We report on a study on syllogistic reasoning conceived with the idea that subjects' performance in experiments is highly dependent on the communicative situations in which the particular task is framed. From this perspective, we describe the results of Experiment 1 comparing the performance of undergraduate students in 5 different tasks. This between-subjects comparison inspires a within-subject intervention design (Experiment 2). The variations introduced on traditional experimental tasks and settings include two main dimensions. The first one focuses on reshaping the context (the pragmatics of the communication situations faced) along the dimension of cooperative vs. adversarial attitudes. The second one consists of rendering explicit the construction/representation of counterexamples, a crucial aspect in the definition of deduction (in the classical semantic sense). We obtain evidence on the possibility of a significant switch in students' performance and the strategies they follow. Syllogistic reasoning is seen here as a controlled microcosm informative enough to provide insights and we suggest strategies for wider contexts of reasoning, argumentation and proof.

The acquisition of reasoning proficiency according to logical standards is a central topic in the development of critical thinking competencies. It is also inherent in the development of mathematical argumentation and proof. Even so, experimental evidence from the psychology of reasoning and from mathematics education has widely documented deep-rooted difficulties in the reasoning skills of students (and humans, in general). In this context, syllogistic reasoning is a paradigmatic case which can provide pre-eminent insights for several reasons: first, the topic has accumulated more than 100 years of experimental study [starting with (Störring,

Given the relevance of the subject to the issues alluded to before, and the experimental evidence so far, two natural questions emerge: what can explain the fact that typical performance in the usual syllogistic tasks does not adhere to Classical Logic? Are there other situations or experimental settings which can elicit reasoning closer to this logical standard?

Following the path of Vargas et al. (submitted), the proposal of the present paper is to show how, even if we have chosen a tiny fragment of full first-order classical logic, in regard to syllogisms we can already see important changes in reasoning tied to the use of representations and the pragmatic situations from which particular reasoning mechanisms emerge. We report on two subsequent experiments. In the first one we compare how undergraduate subjects perform on 5 different tasks intended to understand how different thinking strategies are followed by subjects depending on the communicative situations. The second experiment, more educationally oriented, is based on the insights provided by the first experiment. It studies the trajectories of students in a sequence of tests and short interventions. These are intended to lead them to a shift in their performance based on their understanding of the kind of reasoning that they are expected to attain normatively.

In what follows, we first elaborate on the theoretical background just outlined focusing on the two fundamental aspects which support the design: the importance of communicative settings for reasoning, and the use of the construction of counterexamples in argumentation and proof (Sections Plurality of Goals in Communication and Reasoning and Construction of Counterexamples: Modeling and Countermodeling as Tools for Syllogistic Reasoning). We describe and report then on Experiments 1 and 2 (Sections Experiment 1: Recognizing the Diversity of Communication Contexts and Goals and Experiment 2: Integrating the Tasks as a Didactic Sequence). In the final discussion (Section Results) we develop connections with the psychology of reasoning and the implications for education, particularly in regard to the literature on argumentation and proof in mathematics education.

Experimental study of Aristotelian syllogisms has led to a very neat conclusion: answers of untrained subjects in the customary tasks are very far from being correct from the point of view of the intended, classical interpretation. There have been different kinds of explanations in the psychological literature that try to give an account of experimental data [see (Khemlani and Johnson-Laird, ^{1}

The previous discussion connects at different points with research in education. On one side, the general view that human cognition, and mathematical cognition, in particular, happens as a social and communicational phenomenon, challenges more traditional, internalistic views on thinking and learning which ignore, both experimentally and educationally, their essential character. Inspired largely by Vygotsky, this view has been stressed repeatedly in mathematics education research, since its “social turn” (Lerman,

On the other side, an important point of connection relates to the literature on argumentation and proof. Our view on the context/communicational dependency of reasoning is in fact in line with integrating the “social dimension” of proof (Balacheff, ^{2}

As noted before, in syllogistic reasoning the usual instructions do not prompt answers according to what “logically follows.” As with many other reasoning tasks, simple rephrasing or emphasis in the instructions does not lead to substantial changes in performance. Thus, asking for “necessary conclusions” or deductions “valid in general” does not usually lead to substantial improvement or disambiguation. We propose a change in the contextualization of the materials which may encourage an integration of what precisely “logically” or “necessary” means in this practice. Even if experimental evidence seems to indicate that we are not “naturally” capable of syllogistic reasoning in general, we are more inclined to see here what may be expressed using the competence vs. performance dichotomy, but with more than one competence possible. Low performance may arise because performance deviates from the competence that is normatively established. But performance has to be measured against the right competence, and performance aimed at other norms may be successfully elicited by appropriate contexts which invoke their ecological source (Simon,

What do we logically expect when asking if a conclusion “logically follows” from some premises? Even if in some traditions still influential in education “logical” is conceived in terms of the definition of a deductive system or a set of syntactic transformations (or inference rules), we believe that, in the context of “naive” untrained reasoners, a more accessible approach is semantic. In classical logic (Tarski, ^{3}

The problem of the exploration of possible counterexamples and their generation may not be finite or even decidable in general. This may be overcome in the particular case of syllogistic reasoning where we deal only with a vocabulary of three monadic predicates. In this case models are sets of a certain number of individuals, with interpretations for the predicates.^{4}^{5}

Despite this crucial role that the construction of counterexamples may play in regard to the analysis of syllogistic thinking, the topic has been almost completely absent from experimental testing in the psychology literature. One exception is Bucciarelli and Johnson-Laird (

Besides psychological experiments and theories, the use of examples (and counterexamples) in the learning and teaching of mathematics has been widely acknowledged (as well as in mathematicians' practices). The mathematics education literature has recently addressed the role of examples and counterexamples [see, e.g., (Watson and Mason,

“Examples can therefore usefully be seen as cultural mediating tools between learners and mathematical concepts, theorems, and techniques. They are a major means for ‘making contact' with abstract ideas and a major means of mathematical communication, whether ‘with oneself', or with others. Examples can also provide context, while the variation in examples can help learners distinguish essential from incidental features and, if well-selected, the range over which that variation is permitted.” (Goldenberg and Mason,

A change in disposition already occurs when we deal with the exploration of examples and counterexamples. This is reflected in our Experiment 1 results. Nevertheless, grasping the sense of counterexamples and adjusting the relevant conventions in the semiotic representation used in each particular situation is not something automatic or easy. This is the case in mathematical contexts, in general, but we will face the same obstacles in our study. Our Experiment 2 addresses these difficulties proposing strategies on how they can be dealt with.

The aim in Experiment 1 was to explore the effects that countermodeling in an adversarial setting produces in syllogistic reasoning. This is done comparing performance across 5 tasks described below. Most of the studies of syllogistic reasoning present pairs of premises and ask the subject for a conclusion of syllogistic form from a menu including the option “none of the above” either explicitly from a menu presented in each trial, or from instructions at the beginning about the constraints on the form of conclusions (the generation paradigm). In some cases experiments propose, besides the pair of premises, a conclusion whose validity, given the premises, is to be judged (the evaluation paradigm). We use both approaches in our tasks.

Each subject answered a booklet in just one of the conditions described next. Subjects were assigned conditions in a random order. So, these five conditions are essentially separate experiments with random subject sampling from the same population. They had 60 min to do this even if in practice many of the participants finished before, predominantly around 45 min. The booklets had 16 problems for all tasks, aside from the evaluation task which was substantially less demanding. For this task, participants had to answer the whole set of 32 problems studied. The order of presentation of the problems was also random, with three different such orders for each set of problems. The tasks studied are the following (for the exact phrasing of the instructions see the

Conventional (CV): The draw-a-conclusion task usually considered in the literature [see e.g., (Johnson-Laird and Steedman,

Evaluation task (EV): This task has been also extensively present in the literature (see, for example, Rips,

Countermodels Adversarial (CMA): This is essentially the “Syllogistic Dispute” task in Vargas et al. (submitted) which proposes the construction of countermodels in a betting situation against Harry-the-snake. Participants are presented a pair of premises and a proposed conclusion. They have to bet whether this conclusion is valid or not. They are thus in competition with Harry, the nefarious character who proposes the bets and who is trying to empty their wallets. We apply a small variation to the countermodel construction: 2-element countermodels were requested. Syllogism AI3 will serve as an example. Suppose the following premises are given:

Harry proposes the following bet:

Besides judging whether this follows or not, participants must, when it does not follow, provide a counterexample by ticking or crossing each course according to whether the student is taking it or not:

Student | Linguistics | Arabic | Geometry |

Student 1 | ☐ | ☐ | ☐ |

Student 2 | ☐ | ☐ | ☐ |

Countermodels Adversarial 2 (CMA2): This task has the same structure as CMA (a proposed conclusion from two given premises, and the construction of counterexamples when possible) but with another story/context. Instead of a betting situation and Harry-the-snake, participants are asked to play the role of a professor who must correct the answers (conclusions) offered by students as valid inferences in an exam. If the conclusion does not follow (i.e., if the exam script that they are correcting presents a mistake) participants must provide a counterexample as a didactic tool for their imaginary pupil in order to explain why it does not follow. This is a familiar, technically adversarial, situation: an examination.

Communication-conclusions task (COMM-C): This task is proposed with the idea that what participants actually do in CV is to play a cooperative game which the task is an attempt to mimic. Here subjects are introduced into a game: each participant has an imaginary team-mate who wants to communicate to her an assigned statement. Following the syllogistic structure (with b the middle term and a and c the other ones) this statement is about terms a and c. This communication cannot be done directly: the team-mate can only express something about a and b, and something about b and c. The participant is presented with two statements (which play the role of “premises”) which “come from her teammate.” The task is to decide, from a menu of nine possibilities, which sentence the team-mate is most likely trying to communicate (a possibility for “no favorite guess” is included). It is emphasized that this is a cooperative task in the precise sense that the subject should think of him or herself as working in a team with the source of the premises. The team-mate is trying to communicate a sentence, and our participant is trying to guess it. Both of them are scored as teams (pairs) according to how often they succeed in their mutual goal. The instructions assert that “If you can guess what sentence he has in mind from the pair of premises (s)he gives you, then your team win five points. If you guess wrong, then you both lose 1 point. There is also the option: ‘

It is important to emphasize, for comparison purposes, that, in their structure, CV and COMM-C tasks follow the generation paradigm, whereas EV, CMA, and CMA2 tasks follow the evaluation paradigm.

Tasks CMA and CMA2 require, besides an evaluation of validity, the construction of counterexamples. For the reason explained in Section Construction of Counterexamples: Modeling and Countermodeling as Tools for Syllogistic Reasoning (with two elements it is always possible to construct a counterexample, if one exists), we standardized the required countermodels to 2-element ones.
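The search for 2-element countermodels described here can be illustrated with a short brute-force sketch. It is not the procedure used in the study; it only shows what "a model of the premises which is not a model of the conclusion" amounts to computationally. The string encoding (`"Aab"` read as "All a are b", and likewise for E, I and O), the names `holds` and `find_countermodel`, and the `existential_import` flag (under which a universal statement with an empty subject term counts as false) are all assumptions introduced for this illustration.

```python
from itertools import combinations_with_replacement, product

TERMS = ("a", "b", "c")

# The 8 possible kinds of individual: each has or lacks each of the 3 properties.
INDIVIDUALS = [dict(zip(TERMS, bits)) for bits in product([False, True], repeat=3)]


def holds(statement, model, existential_import=True):
    """Truth of a statement such as "Aab" in the model taken as a whole."""
    mood, subject, predicate = statement
    subjects = [ind for ind in model if ind[subject]]
    if mood in ("A", "E") and existential_import and not subjects:
        return False  # assumed convention: empty subject falsifies universals
    if mood == "A":  # All <subject> are <predicate>
        return all(ind[predicate] for ind in subjects)
    if mood == "E":  # No <subject> is <predicate>
        return not any(ind[predicate] for ind in subjects)
    if mood == "I":  # Some <subject> is <predicate>
        return any(ind[predicate] for ind in subjects)
    if mood == "O":  # Some <subject> is not <predicate>
        return any(not ind[predicate] for ind in subjects)
    raise ValueError(f"unknown mood: {mood!r}")


def find_countermodel(premises, conclusion, existential_import=True):
    """Return a 2-element model making all premises true and the conclusion
    false, or None if no such 2-element model exists."""
    for model in combinations_with_replacement(INDIVIDUALS, 2):
        premises_true = all(holds(p, model, existential_import) for p in premises)
        if premises_true and not holds(conclusion, model, existential_import):
            return model
    return None


# AI3 with proposed conclusion Ica: a countermodel is found.
print(find_countermodel(["Aab", "Icb"], "Ica"))
# AE1 with proposed conclusion Eac: no countermodel, so None is returned.
print(find_countermodel(["Aab", "Ebc"], "Eac"))
```

Under these assumptions, the absence of a 2-element countermodel is taken as a sign of validity only because of the restriction to 2-element models stated in the text; the sketch does not itself establish that two elements always suffice.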

As indicated above, beyond purely historical interest, syllogisms constitute a microcosm complex enough to reveal wide variation in typical performance from subjects. The topic thus reveals a wide spectrum of misalignment with normative expectations. Studying the whole set of 512 possible pairs of premises and proposed conclusions was not feasible in the time available. We limited ourselves to a subset of 32 of these possibilities, presenting 16 to each of our participants. The selection of these problems was heavily biased toward the ones which could reveal the use or absence of classically valid reasoning, and therefore, those which turn out to be solved by other strategies. This is revealed by traditional performance in the CV task, already well-documented in the literature. Our choice was therefore focused on those problems which turn out to be “difficult” in the CV task. A prominent phenomenon in this task is a clear incapacity for detecting that the majority of the problems (out of 64 pairs of premises) have no valid conclusions. Those problems with no valid conclusions which are judged by subjects as having one reveal a tendency to reason cooperatively.

Validity rate: the same number of logically valid and non-valid problems in each of the two sets. This number is in proportion with the number of valid/non-valid problems among the 64 problems (seven valid and nine with no valid conclusion in each set, which reflects the fact that among the 64 possible pairs of premises there are 27 with valid and 37 with no valid conclusions). In the 4th column of

Difficulty: the main measure of this is given by the typical performance of subjects in the conventional task. We used for this the results from the meta-analysis in Khemlani and Johnson-Laird (^{6}^{7}

one existential and one universal premise and a valid conclusion with a positive quantifier from a premise;

two universal quantifiers and a valid universal conclusion;

one existential premise and one universal premise, and a conclusion with a negative quantifier from a premise;

one existential and one universal premise with a valid conclusion requiring a quantifier not in the premises; and

two universal premises, but only existential valid conclusions.^{8}

The 32 problems selected in the study: their premises, the existence or absence of valid conclusions (VC/NVC), the proposed conclusions in the tasks following the evaluation paradigm, the percentage of correct answers in the literature CV task, the ES classification, the matched vs. mismatched classification, and our subdivision into two sets.

Problem | Premise 1 | Premise 2 | VC/NVC | Proposed conclusion | % correct (CV, literature) | ES | Matched/mismatched | Set |

AA3 | Aab | Acb | NVC | Aac | 31 | 0 | Mat | 1 |

AE1 | Aab | Ebc | VC | Eac | 87 | 2 | Mat | 1 |

AE2 | Aba | Ecb | VC | Oac | 1 | 5 | Mis | 1 |

AE4 | Aba | Ebc | VC | Oac | 8 | 5 | Mat | 1 |

AI3 | Aab | Icb | NVC | Ica | 37 | 0 | Mat | 1 |

EI1 | Eab | Ibc | VC | Oca | 8 | 4 | Mis | 1 |

EI2 | Eba | Icb | VC | Oca | 37 | 4 | Mat | 1 |

EI3 | Eab | Icb | VC | Oca | 21 | 4 | Mis | 1 |

EI4 | Eba | Ibc | VC | Oca | 15 | 4 | Mat | 1 |

IA2 | Iba | Acb | NVC | Ica | 12 | 0 | Mat | 1 |

IO1 | Iab | Obc | NVC | Oac | 33 | 0 | Mat | 1 |

IO2 | Iba | Ocb | NVC | Oca | 49 | 0 | Mis | 1 |

OA1 | Oab | Abc | NVC | Oac | 20 | 0 | Mis | 1 |

OI3 | Oab | Icb | NVC | Oac | 49 | 0 | Mis | 1 |

OI4 | Oba | Ibc | NVC | Oca | 47 | 0 | Mat | 1 |

OO1 | Oab | Obc | NVC | Oac | 37 | 0 | Mis | 1 |

AE3 | Aab | Ecb | VC | Eca | 81 | 2 | Mis | 2 |

AI1 | Aab | Ibc | NVC | Eac | 16 | 0 | Mat | 2 |

AO1 | Aab | Obc | NVC | Oac | 14 | 0 | Mat | 2 |

AO2 | Aba | Ocb | NVC | Oca | 17 | 0 | Mis | 2 |

EA1 | Eab | Abc | VC | Oca | 3 | 5 | Mat | 2 |

EA2 | Eba | Acb | VC | Eca | 78 | 2 | Mat | 2 |

EA3 | Eab | Acb | VC | Eac | 80 | 2 | Mis | 2 |

EA4 | Eba | Abc | VC | Oca | 9 | 5 | Mat | 2 |

IA3 | Iab | Acb | NVC | Iac | 28 | 0 | Mat | 2 |

IE1 | Iab | Ebc | VC | Oac | 44 | 4 | Mat | 2 |

IE2 | Iba | Ecb | VC | Oac | 13 | 4 | Mis | 2 |

IO3 | Iab | Ocb | NVC | Oca | 53 | 0 | Mis | 2 |

IO4 | Iba | Obc | NVC | Oac | 54 | 0 | Mat | 2 |

OI1 | Oab | Ibc | NVC | Oac | 36 | 0 | Mis | 2 |

OI2 | Oba | Icb | NVC | Oca | 31 | 0 | Mat | 2 |

OO2 | Oba | Ocb | NVC | Oca | 42 | 0 | Mis | 2 |

Matched/mismatched rate: a pair of premises is matched if the middle term is either positive in both premises or negative in both premises. Otherwise it is mismatched. Problem AE2, for instance, is mismatched because in the premises ^{9}

The conclusions in the table (5th column) were used in tasks under the evaluation paradigm, namely, EV, CMA, and CMA2, where they are proposed after the two premises. Participants should either accept or reject that the conclusion necessarily follows from the premises. The conclusions presented were selected according to the following criteria: for VC problems the conclusion is chosen to be valid. If more than one conclusion is valid, we chose the most frequently selected in the CV task, according to the meta-analysis in Khemlani and Johnson-Laird (

A total of 244 undergraduate students (mean age = 22.4) from first- to third-year courses at the Ludwigsburg University of Education participated, distributed as follows: CV: 82, EV: 22, CMA: 54, CMA2: 44, COMM-C: 42. The difficulty of the two countermodeling tasks (CMA and CMA2) led in some cases either to non-comprehension of the task or to failure to comply with instructions. We excluded from all our analyses the answers of a total of 3 and 5 participants, respectively, in CMA and CMA2. These are subjects who did not provide any complete construction of countermodels. We did not consider their answers evaluating the validity of the conclusion because the counterexamples part was crucial in our experiment as an exploration of the effects obtained with this construction. This made these data uninterpretable for us. We take the systematic failure to provide counterexamples in these subjects as a clear indication that the task was far more demanding than the others, but also more difficult to grasp without further indications or explanations.^{10}

Universal statements can be interpreted in different ways and models can be considered to be adequate for them according to two well-known options. On one hand, since Aristotle, a long-established convention determines that universal statements are false when the antecedent property is empty in the domain because they are considered to have existential import. So, a universal statement does presuppose in this interpretation the existence of something to which the predicate is applicable. On the other hand, according to modern semantics, truth does not require existence for universal statements. Given, e.g., the syllogistic problem AA1 (

The traditional Aristotelian view is adopted in most of the psychological literature, notably in the criterion for scoring accuracy. We follow this convention even if it is not clear that either of the interpretations should be adopted from a psychological point of view, or that it should be absolutely mandatory in education from a normative stance. For this reason, we will consider the modern interpretation in some of our analyses and will emphasize that some of our subjects in Experiment 2 do follow it explicitly.^{11}
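The difference between the two readings of universal statements can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function name `all_are`, the representation of an individual as the set of properties it has, and the example model are hypothetical and not taken from the study materials.

```python
def all_are(model, subject, predicate, existential_import):
    """Truth of "All <subject> are <predicate>" in a model (a list of individuals)."""
    subjects = [ind for ind in model if subject in ind]
    if not subjects:
        # Empty antecedent: false on the Aristotelian reading (existential
        # import), vacuously true on the modern reading.
        return not existential_import
    return all(predicate in ind for ind in subjects)


# Hypothetical 2-element model in which nothing is an "a".
model = [{"b"}, {"b", "c"}]

aristotelian = all_are(model, "a", "b", existential_import=True)
modern = all_are(model, "a", "b", existential_import=False)
print(aristotelian, modern)  # prints: False True
```

When the antecedent property is not empty, the two readings coincide, so the choice of convention only matters for models of this kind.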

As a first consideration

Comparison of the CV task performance across the 32 problems between the literature and Experiment 1 groups. Colors follow the ES classification.

This task is also present in the literature (Rips,

Five tasks comparison in performance. Generation paradigm (left) and evaluation paradigm tasks (right).

Comparison between the CV and the EV tasks. Performance across the 32 problems. Colors follow the ES classification.

It is also worth noting that there is in EV a strong asymmetry between valid and non-valid problems, which is reflected in a percentage difference of almost 25 points in favor of the former.

As explained before, the CMA and CMA2 tasks share the same structure: a deduction evaluation, followed by counterexample construction when possible, in an adversarial setting. In them we obtained an overall improvement in accuracy and reduction of the imbalance in determining the validity vs. non-validity of conclusions, as seen in

We compare CMA and CMA2 with EV, respectively, in

Comparison between the EV and the CMA tasks. Performance across the 32 problems. Colors follow the ES classification.

Comparison between the EV and the CMA2 tasks. Performance across the 32 problems. Colors follow the ES classification.

CMA and CMA2 offer additional countermodel data which deserve a separate analysis that will not be carried out here. Nevertheless, it is worth mentioning that, despite the improvement in conclusion evaluation, the generation of countermodels is far from perfect: in these tasks the percentage of correct countermodels is 20 and 31%, respectively, of the possible ones (namely, for each participant, the 9 non-valid problems out of the 16 presented to her). Calculating the chance levels of correct countermodeling is complex. In principle there are 64 possible 2-element models, of which 28 remain once reorderings and repetitions of elements are discarded. For each problem there are different subsets which are correct. On the other hand, there are relatively simple properties of problems that will filter out possibilities. The psychological process of countermodel construction is also complex, the most direct evidence being that participants take around three to four times as long per problem. The analysis in Vargas et al. (submitted) provides strong evidence that its subjects are trying to do classical logical countermodeling despite their many errors. The construction of counterexamples poses difficulties due to high demands on executive functions (working memory in particular). Besides this, it poses a number of problems which are difficult to clarify by means of test instructions alone. This motivated a different approach in Experiment 2.
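The counts of 2-element models mentioned above can be checked by direct enumeration. The following is a minimal sketch, assuming that an individual is fully described by which of the three properties it has; the variable names are illustrative only.

```python
from itertools import combinations, product

# Each individual either has or lacks each of the three properties.
kinds = list(product([False, True], repeat=3))

# Ordered pairs of individuals, repetitions allowed.
ordered_models = list(product(kinds, repeat=2))

# Unordered pairs of two different kinds of individual
# (no reorderings, no repetitions).
distinct_models = list(combinations(kinds, 2))

print(len(kinds), len(ordered_models), len(distinct_models))  # prints: 8 64 28
```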

The purpose of this task was to substantiate the idea that what participants do in the CV task is essentially framed in a context of cooperative communication. If instructions ask subjects explicitly to do precisely this, we obtain in fact very similar results. Correlation between CV and COMM-C is 0.75 (Spearman coefficient,

Correlations (Spearman coefficients) between the 5 tasks in Experiment 1.

Task | CV | COMM-C | EV | CMA | CMA2 |

CV | 1 | 0.75 | 0.50 | 0.56 | 0.63 |

COMM-C | 0.75 | 1 | 0.67 | 0.46 | 0.54 |

EV | 0.50 | 0.67 | 1 | 0.66 | 0.70 |

CMA | 0.56 | 0.46 | 0.66 | 1 | 0.66 |

CMA2 | 0.63 | 0.54 | 0.70 | 0.66 | 1 |

Comparison between the CV and the COMM-C tasks. Performance across the 32 problems. Colors follow the ES classification.

Our comparison across tasks is guided by the idea that there is a change in disposition: CV, EV, and COMM-C tasks on the one hand (cooperative), and CMA and CMA2 on the other (adversarial). From the point of view of the answer format we have on the one hand the CV and COMM-C tasks (choose from a menu of conclusions), and on the other, the CMA, CMA2, and EV tasks (determine the validity given a proposed conclusion). Even if all tasks are positively correlated (

Under the evaluation paradigm the differences range from 24.7 (EV task) to 19 (CMA task) and 16.7 (CMA2 task). These effects of countermodeling are significant, as reported in subsection The CMA and CMA2 Tasks.

Back to the comparison between CV and COMM-C, we noticed in subsection The COMM-C Task how close they are. Participants in the conventional task do not answer following classical norms consistently, leading to an extremely irregular performance across problems (

The cooperative task of interpreting and understanding discourse can be approached through logical tools (van Lambalgen and Hamm, ^{12}

The effects obtained in Experiment 1 indicate clear tendencies when we take (as experimenters and educators usually do) classical logic as our benchmark. The results obtained comparing the spectrum of tasks suggest that there are good reasons why “naive” subjects deviate from this particular logic and suggest also in which direction we should move if our goal is to obtain results according to it. Again, the goals pursued matter. Experiment 2 explores what we can obtain from an intervention designed in this direction. We implement three successive tests (pretest, posttest 1, and posttest 2) with the idea of facilitating the transition from an initial (cooperative) point, toward an adversarial classical logic one.

We start from the observation that, as noticed in subsection The CMA and CMA2 Tasks, the countermodeling tasks are highly demanding and that even if we see a change in disposition and performance, correct countermodel production is generally not attained. Understanding the construction of counterexamples needs in general more than the bare written instructions of the usual experiments. We focus then on the clarification of this notion, crucial for us as an external tool supporting the definition of the (classical) inference relation, as already explained in Section Construction of Counterexamples: Modeling and Countermodeling as Tools for Syllogistic Reasoning.

We focused here on a within-subjects comparison of the tasks EV and CMA2 (the one that seemed most promising from Experiment 1 to obtain a shift toward classical reasoning). The instructions were the same as in Experiment 1 but in this case instructions (including the countermodeling explanation) were carefully explained and not just provided in the booklets: see the procedure.

The problem selection was the same as in Experiment 1. In the pretest and posttest 1 the problems were the 16 of set 1 (see

Participants were 36 first- and second-year mathematics students at University El Bosque in Bogotá. The mean age was 20.3. They were beginning their studies with introductory courses. From the point of view of logic, their knowledge was limited to a basic semi-formal logic course (partially or totally completed by the time of the experiments), mostly focused on propositional logic, truth tables and quantifier notation for mathematical statements. The experiment was conducted separately in a total of 5 small groups (from 5 to 9 students each) during class hours with students from different courses.

The sequence was designed with alternating tests and short interventions over three sessions based on the following stages:

The three sessions were held a week apart. At the end, all the results of the three tests were shown to the participants, with a reflection on the didactic effect obtained by them individually and as a group.

If we take the mean performance, we have a mean of 44, 59.2, and 85.3% for validity judgements, respectively, in the pretest, posttest 1 and posttest 2 (

Performance on the 3 tests of Experiment 2.

We interpret these results as a progressive attainment of our intended target. This can be seen also examining the distribution of individual scores (over 16 problems) attained by each of the participants on each of the tests (

Boxplots showing the subjects' performance distribution in the three stages of Experiment 2. Left: performance in problems evaluation. Right: performance in correct countermodel construction in the two last stages.

In the pretest the mean score was 7.04 and the median 6, and 22 out of 36 participants had scores not greater than 8. With 16 problems, this means chance level or below. There were extreme cases of seven students with 25% or less correct answers, reflecting how misleading intuition can be in this task (they were providing answers almost

In posttest 1 we obtain a large improvement in the evaluation of the conclusions (

As observed in Experiment 1, this is already an important change which reflects an adversarial context. Even so, there is still clearly room for improvement. Above all, countermodel constructions in posttest 1 are very frequently wrong. Seven participants provided two or fewer correct countermodels (out of nine possible); four did not construct even one. This alone confirms the difficulties involved in understanding and performing well with the notion of counterexample, as already observed in Experiment 1. This motivated a further stage for feedback and clarification, as addressed in our third session. The results obtained confirm this hypothesis and are close to being optimal. In posttest 2 we achieved another important improvement in evaluating the validity of the proposed conclusions, but more revealing than this, an improvement in the construction of the countermodels (mean score = 6.47 over nine possible countermodels with nine subjects having all of them correct; see also

Comparison between the pretest (EV task) and posttest 1 (CMA2 task) in Experiment 2. Performance across the 16 problems of set 1. Colors follow the ES classification.

Comparison between the pretest (

Correlations (Spearman coefficients) between the 3 stages in Experiment 2.

Stage | Pretest | Posttest 1 | Posttest 2 |
Pretest | 1 | 0.72 | 0.39 |
Posttest 1 | 0.72 | 1 | 0.61 |
Posttest 2 | 0.39 | 0.61 | 1 |

With few exceptions participants presented a sustained improvement in evaluating correctness of problems across the three trials (see the table in the

Experiment 2 also allowed us to obtain information beyond that provided by the test data. After each session, we took notes on the students' arguments and questions. We next present some of the more salient phenomena these notes revealed.

Among the notions introduced in the tests, probably the most difficult one to acquire fully is that of countermodeling and how it bears on validity: a deduction is not valid if there is a model of the premises which is not a model of the conclusion. The double negative character of this procedure places heavy demands on the attention subjects need for forcing the premises to be true, forcing the conclusion to be false, and integrating the existence of such a construction with a judgment of the invalidity of the deduction. In fact, two salient tendencies in countermodeling (Vargas et al., submitted) are either (1) to provide a model of the premises forgetting that the conclusion should

Another kind of misunderstanding observed here was about what a countermodel (or a counterexample) is. Given that we asked for universes with two elements, participants often considered that the validity or invalidity of the statements should be evaluated on each of the elements of the structure or the universe, and not globally. Typically, in their first encounter with having to construct counterexamples (in the intervention of our session 2) a conclusion such as

Student 1 | Linguistics ✗ | Arabic ✓ | Geometry ✓ |
Student 2 | Linguistics ✓ | Arabic ✓ | Geometry ✓ |

In this case, some participants understand that Student 1 constitutes a counterexample whereas Student 2 constitutes an example, leading to the belief that a counterexample is provided. This is incorrect because the particular affirmative statement is true in the model: there is some student taking both geometry and linguistics, namely, Student 2. In the vocabulary of model theory, they are confusing the notion of a structure not being a model for a statement, with the notion of there being an instance, within the structure, of the negation of the statement. An explanation emphasizing that truth in a structure must take it as a whole turns out to be very useful in clarifying such misconceptions.
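The distinction between a local non-instance and global falsity can be made concrete with a minimal sketch. It encodes the two-student structure discussed above and evaluates a particular affirmative statement about geometry and linguistics over the structure as a whole. The encoding of students as sets of courses and the helper name `some` are our own illustrative assumptions, not part of the test materials.

```python
# Two-element structure from the example: each student is the set of courses taken.
student_1 = {"arabic", "geometry"}                 # does not take linguistics
student_2 = {"linguistics", "arabic", "geometry"}  # takes all three courses
universe = [student_1, student_2]

def some(universe, a, b):
    """'Some A are B' is true in the structure if at least one element has both."""
    return any(a in s and b in s for s in universe)

# Local view: which individual students fail to be an instance of the statement?
local_non_instances = [i for i, s in enumerate(universe, start=1)
                       if not ("geometry" in s and "linguistics" in s)]

# Global view: truth of the statement in the structure taken as a whole.
statement_true = some(universe, "geometry", "linguistics")
is_countermodel_of_statement = not statement_true
```

Student 1 appears in `local_non_instances`, yet `statement_true` holds because Student 2 is a witness, so the structure is not a countermodel of the statement.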

Which algorithm do individuals follow for countermodel construction? Participant S20 was very conscious of what he did, and of the fact that he switched strategy during posttest 1. At first, he began by constructing a model of the premises and only then tried to provide a countermodel of the conclusion. Toward the end, he noticed that it was easier for him to begin by countermodeling the conclusion and then try to satisfy the premises. In fact, he improved over the course of the test: his only three mistakes were in the problems presented in positions 3, 5, and 12, with no mistakes in his last 4 problems. Also, in his final test, after making this explicit remark, he performed perfectly both in conclusion evaluation (16/16) and in countermodel construction (9/9 possible countermodels). He changed his strategy because, as he indicated, it was then easier to remember that the conclusion had to be false in order to obtain a countermodel. We point to this case because, even if we believe that conscious metalevel monitoring of the kind exhibited by S20 was not generally present, it indicates that countermodel construction may put into action clearly different algorithmic strategies even with models as simple as these.
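The two orders of construction described by S20 can be sketched as two exhaustive searches over two-element universes. This is only an illustration under stated assumptions: individuals are encoded as one of the eight possible types (subsets of three predicates), universals are read in the modern way (no existential import), and the predicate names `A`, `B`, `C` and the example syllogisms are hypothetical rather than taken from the problem sets.

```python
from itertools import combinations_with_replacement, product

PREDICATES = ("A", "B", "C")
# The eight possible types of individual: every subset of the three predicates.
TYPES = [frozenset(p for p, keep in zip(PREDICATES, bits) if keep)
         for bits in product((False, True), repeat=3)]

def holds(statement, universe):
    """Evaluate a categorical statement (mood, X, Y) in the whole structure.
    Assumption: universals carry no existential import."""
    mood, x, y = statement
    if mood == "all":
        return all(y in e for e in universe if x in e)
    if mood == "no":
        return not any(x in e and y in e for e in universe)
    if mood == "some":
        return any(x in e and y in e for e in universe)
    if mood == "some_not":
        return any(x in e and y not in e for e in universe)
    raise ValueError(f"unknown mood: {mood}")

def two_element_structures():
    """All unordered universes with two individuals (types may repeat)."""
    return list(combinations_with_replacement(TYPES, 2))

def premises_first(premises, conclusion):
    """First keep the models of the premises, then falsify the conclusion."""
    models = [u for u in two_element_structures()
              if all(holds(p, u) for p in premises)]
    return [u for u in models if not holds(conclusion, u)]

def conclusion_first(premises, conclusion):
    """First falsify the conclusion, then keep the models of the premises."""
    falsifiers = [u for u in two_element_structures()
                  if not holds(conclusion, u)]
    return [u for u in falsifiers if all(holds(p, u) for p in premises)]

# Hypothetical examples: one invalid and one valid syllogism.
invalid_case = ([("all", "A", "B"), ("some", "B", "C")], ("some", "A", "C"))
valid_case = ([("all", "A", "B"), ("all", "B", "C")], ("all", "A", "C"))
countermodels_pf = premises_first(*invalid_case)
countermodels_cf = conclusion_first(*invalid_case)
no_countermodels = premises_first(*valid_case)
```

Both orders return the same set of countermodels, since they apply the same two filters; what differs is which constraint the reasoner has to keep in mind last. The sketch only searches two-element universes, as in the tests.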

Two well-known concerns regarding the interpretation of the quantifiers involved in the statements were posed by our students.

The first was about the “conversational” use of the existential (or “particular”) statements. Student S33 said, during the feedback in session 3, that some of his “mistakes” in posttest 1 arose because he interpreted all existential assertions (Some A are B) as affirming also that Some A are

Student S29 made explicit the same interpretation during the feedback session. In fact, she did so as an explanation of the fact that in some cases she added a third element to the countermodels. Two elements, in fact, are not always enough when assuming such an interpretation.
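A small brute-force check illustrates why two elements may not suffice under such a reading. The sketch assumes that the conversational interpretation strengthens “Some X are Y” with “Some X are not Y” (our assumption about the reading reported by the students), and it uses two hypothetical premises, “Some A are B” and “Some B are C”, that are not taken from the problem sets.

```python
from itertools import combinations_with_replacement, product

PREDICATES = ("A", "B", "C")
# Every individual is one of eight types: a subset of the three predicates.
TYPES = [frozenset(p for p, keep in zip(PREDICATES, bits) if keep)
         for bits in product((False, True), repeat=3)]

def some_strengthened(universe, x, y):
    """Assumed conversational reading: 'Some X are Y' and also 'Some X are not Y'."""
    has_both = any(x in e and y in e for e in universe)
    has_x_without_y = any(x in e and y not in e for e in universe)
    return has_both and has_x_without_y

def has_model(size):
    """Is there a universe of the given size in which both premises hold?"""
    return any(some_strengthened(u, "A", "B") and some_strengthened(u, "B", "C")
               for u in combinations_with_replacement(TYPES, size))

two_elements_enough = has_model(2)
three_elements_enough = has_model(3)
```

With two individuals, only one of them can be a B, and it cannot both be and not be a C, so the premises cannot be satisfied together; a third individual removes the conflict.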

A second perplexity was about universal statements. For example, during the explanation of session 2, we used Syllogism AI3 as an example:

Participant S25 proposed the following counterexample:

Student 1 | Linguistics ✗ | Arabic ✓ | Geometry ✓ |
Student 2 | Linguistics ✗ | Arabic ✓ | Geometry ✓ |

She argued that the first premise is true in this case, because if there is no student taking linguistics, then the universal statement holds. This led to a debate in class. It is well known that this is the key feature that distinguishes the Aristotelian from the modern interpretation of the universal quantifier. As explained in section “Evaluation of Problems and Countermodels,” for Aristotle, universal statements have existential import, whereas modern interpretations do not require this. Was the premise true or not? We clarified the point by emphasizing the historical development just mentioned. We did not commit to either of these conventions as “the correct” one, explaining that the interest of their answers in the tests lay not in adherence to one or the other of these normative positions, but in analyzing how they reason. Educationally, it was an opportunity for us to emphasize the conventional and historical character of some logical rules. Therefore, they were “allowed” to construct counterexamples according to their choice. Interestingly, in both posttest 1 and posttest 2 student S25 showed a systematic tendency to model all the universal affirmative statements in the premises using “empty antecedents” (interpreting the universal as an implication). This was an extreme case, but seven other participants stated explicitly (when asked) that they had used this feature in at least some of the problems. In the table in the
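The two readings of the universal can be contrasted on S25's structure with a minimal sketch. Since the wording of the premise is not reproduced here, the statement “All students taking linguistics take geometry” is a hypothetical stand-in, and the function names are our own.

```python
# S25's proposed structure: neither student takes linguistics.
universe = [{"arabic", "geometry"}, {"arabic", "geometry"}]

def all_modern(universe, x, y):
    """'All X are Y' read as an implication: vacuously true when nothing is X."""
    return all(y in e for e in universe if x in e)

def all_aristotelian(universe, x, y):
    """'All X are Y' with existential import: additionally requires some X."""
    return any(x in e for e in universe) and all_modern(universe, x, y)

# Hypothetical universal premise about students taking linguistics.
modern = all_modern(universe, "linguistics", "geometry")
aristotelian = all_aristotelian(universe, "linguistics", "geometry")
```

Under the modern reading the premise holds in this structure (empty antecedent), whereas under the Aristotelian reading it fails, which is exactly the point debated in class.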

A final aspect that emerged during the discussions with participants, and that we want to emphasize, is that some of their questions and concerns reflected their conceptions about proof and mathematical procedures.

Student S09, for instance, was looking for an algorithmic mechanism for constructing counterexamples. He realized that at some point not everything was completely determined at each step of the construction about the two elements of the models. Some of the features were usually underdetermined by the premises. Part of the work was an exploration, sometimes hypothetical, which could eventually lead to a counterexample. The fact of having two or more possibilities and having to suppose something without knowing the final result produced a manifest anxiety in him. His conception about mathematics was procedural and he expected to reduce argumentation and proof to this level.

Two different students commented independently that a procedure for establishing validity of conclusions is needed. Counterexample construction is in fact a means which in principle leads only to showing invalidity.

As student S01 asked in session 3: “Professor: is there any way to be ^{13}

The fields of cognitive psychology and mathematics education meet at different points in their subject of study. Even if their particular aims do not always coincide and mutual communication is not straightforward, there is a recognized need for interaction between them [see e.g., (Gillard et al.,

The present study is an attempt at such an interaction. Its focus is at the crossroads of cognitive psychology (the topic of study, the design of the tests), educational psychology (class-based interventions, the learnability and teachability of a topic) and mathematics education (the role of counterexamples for mathematical reasoning, the emergence of the notion of proof and refutation). We see the two experiments presented as complementing each other, taking into account the strengths and weaknesses of each discipline.

We see such an interaction taking place at the fundamental levels that guided our study: the role of counterexamples in reasoning, and the communicative goals pursued at the base of this process.

On the one hand, as already indicated in Section Construction of Counterexamples: Modeling and Countermodeling as Tools for Syllogistic Reasoning, the theme of examples and counterexamples plays a role both in psychology and mathematics education and can be addressed from the logical point of view, where “models” and “countermodels” have a precise definition. We addressed the problem here, in a very constrained situation, with this level of precision. This allows us to conclude that the process of generating a preferred model^{12}^{14}

As we can infer from our second experiment, the generation of counterexamples requires in many respects a process of familiarization, disambiguation and mastery. We could see this process in a relatively simple situation (2-element models, 3 monadic predicates, a limited non-recursive syntax). It is even more necessary in the far more complex range of mathematical contexts.

On the other hand, context and communication determine the kind of reasoning that is elicited. The issue of context dependency has been widely documented in the psychological literature, and acknowledged in different ways from approaches such as ecological rationality or situated cognition. It is also present to a large extent in educational contexts, in particular in mathematics. The communicational situations may vary the goals pursued to the point of representing completely different “games” (Wittgenstein, ^{15}^{16}

Given this contextual character of communication and reasoning, and how the diversity of situations leads to different processes and outcomes, we want to stress that it inheres not only in the descriptive, but also in the normative aspect of logic and its role in psychology. We believe that both the cognitive psychology and the mathematics education literatures still lack, and require, pluralistic accounts of how we reason. These should go beyond the crude dichotomy between “correct” and “incorrect”^{17}^{18}^{19}

The datasets generated for this study are available on request to the corresponding author.

Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.

FV performed the interventions in Experiment 2, organized the databases, performed the statistical analysis, and wrote the first draft of the manuscript. FV and KS contributed to the conception and design of the study, contributed to manuscript revision, and read and approved the submitted version.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The authors thank the two reviewers for their help in improving the clarity of the paper. Special thanks to Dr. Laura Matignon for her collaboration in the data collection of Experiment 1 at Ludwigsburg University of Education.

The Supplementary Material for this article can be found online at:

^{1}Among these different accounts of performance in syllogistic reasoning there is probability, see e.g., (Chater and Oaksford,

^{2}See also, e.g., (Lloyd,

^{3}The notion has been further elaborated more recently through Etchemendy (

^{4}The fact that the identification of individuals with a particular one of eight possible types (corresponding to the assertions and negations of the three monadic predicates present in a syllogism) may be used to decide the validity of an argument is in fact already present in Aristotle's works, namely through the ekthesis technique of proof [(Kneale and Kneale,

^{5}See Section Problem Selection for this notion.

^{6}The name is explained by the fact that a first criterion for the classification of problems arises from the observation that existential presuppositions are traditionally assumed in the field and this leads to a clear performance divergence from problems that require this assumption in order to have a valid conclusion and problems that do not. This substantial difference occurs with double universal problems (our classes 2 and 5).

^{7}All problems without any valid conclusions are conventionally assigned the number “0.”

^{8}This simple classification is motivated by the fact that it correlates highly with the percentage of correct answers of the valid problems in the Conventional Task meta-analysis 0.94,

^{9}Logical models are here sets of elements, each element of which represents a

^{10}It is also worth clarifying that in EV we had only 22 participants, given that (as planned in the design) booklets included twice the number of problems in comparison to the other tasks. We used this design since EV was by far the least time-demanding task. Finally, the sample size in the CV task is remarkably larger because in this case we could include the data from a previous experiment. In that experiment we had a booklet generation mistake in the tasks other than CV. It was conducted a semester earlier in the same institution and in courses at the same university level (from first to third year).

^{11}Vargas et al. (submitted) present some evidence, based on counterexample analysis, that existential presuppositions are not compatible with the results of the CMA task (which are compatible instead with the modern interpretation of classical logic). Even so, they may well be present in the Conventional Task.

^{12}Here the term is used informally, but we mention that it has a technical counterpart in the preferential semantics (Shoham,

^{13}This was probably obtained owing to the fact that our participants were mathematics students and their particular involvement with mathematical proofs even at their early stage.

^{14}It is clear, as also confirmed here, that participants do not primarily follow classical logic in traditional syllogistic tasks. Actual performance on them may be approached more properly with non-monotonic logics (Stenning and van Lambalgen,

^{15}“Game” not only in Wittgenstein's sense, but also in the game-theoretic sense, in which a game can be cooperative or adversarial (zero-sum or non-zero-sum).

^{16}Our translation and emphasis.

^{17}Other labels such as “biases” or “fallacies” are equally contestable when understood in absolute terms.

^{18}These diverging norms may be approached through different logical systems. This presupposes logical pluralism: “the view that there is more than one genuine deductive consequence relation, and that this plurality arises not merely because there are different languages, but rather arises even

^{19}This is far from conceptions in math education which see, for example, an opposition between “Child's Logic” and “Math Logic” (O'Brien et al.,