
Edited by: Iván Manuel Jorrín Abellán, Kennesaw State University, United States

Reviewed by: Enrique Navarro, Complutense University of Madrid, Spain; Paul Seitlinger, Tallinn University, Estonia

This article was submitted to Educational Psychology, a section of the journal Frontiers in Psychology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The study of school effectiveness and the identification of factors associated with it are growing fields of research in the education sciences. Moreover, from the perspective of data mining, great progress has been made in the development of algorithms for the modeling and identification of non-trivial information from massive databases. This work, which falls within this context, proposes an innovative approach for the identification and characterization of educational and organizational factors associated with high school effectiveness. From a basic research perspective, our aim is to study the suitability of decision trees, techniques inherent to data mining, for establishing predictive models of school effectiveness. Based on the available Spanish sample of the PISA 2015 assessment, an indicator of school effectiveness was obtained from the application of multilevel models with predictor variables of a contextual nature. After selecting high- and low-effectiveness schools in this first phase, the second phase of the study consisted of the application of decision trees to identify school, teacher, and student factors associated with high and low effectiveness. The C4.5 algorithm was applied and, as a result, we obtained 120 different decision trees based on five determining factors (database used; stratification in the initial selection of schools; significance of the predictor variables of the models; use of items and/or scales; and use of the training or validated samples). The results show that the use of this kind of technique can be appropriate, mainly when used with correctly pre-processed data that combine the information available from all educational agents. This study represents a major step forward in the study of the factors associated with school effectiveness from a quantitative approach, since it proposes and provides a simple and appropriate procedure for modeling and establishing patterns.
In doing so, it contributes to the development of knowledge in the field of school effectiveness that can help in educational decision-making.

Identification of educational factors associated with academic performance is a key aspect in educational research into school effectiveness (

In contrast to traditionally used techniques (inferential and multivariate correlational statistics), data mining is not based on previous assumptions or theoretical distributions to obtain predictive models. In addition, these techniques are applied with minimal intervention by researchers, which, together with the aforementioned, represent a great advantage for the identification of valuable information in massive databases (

The main aim of this study, therefore, is the analysis of the fit and predictive power of data mining techniques, specifically decision trees, for the identification of factors associated with school effectiveness in secondary education.

Given this main objective, we can set the following specific objectives:

Analyze and identify school effectiveness based on cross-sectional data from large-scale assessments.

Promote methodological alternatives for the study of factors associated with school effectiveness based on mass data.

Analyze the effectiveness of decision trees (algorithm C4.5) in the study of the process factors associated with school effectiveness.

Present the possibilities of decision trees for the study of good educational practices in effective schools.

The publication of the

In response to the Coleman Report, the Effective School Movement (ESM) emerged during the 1980s (

In the 1990s, thanks mainly to improvement in the computing capacity of computer systems and to the widespread use of large-scale assessments, research into school effectiveness experienced strong growth and evolution (

That is why this work is interested in proposing a quantitative alternative for the study of process variables associated with school effectiveness that does not have the above-mentioned limitations. Specifically, based on the perspective of educational data mining (EDM), we apply decision trees to establish the predictive models of high- and low-effectiveness schools that have a better fit to the data, and we analyze under which determining factors these techniques achieve better performance.

The current calculation capacity of computers allows the development and application of appropriate statistical techniques for the analysis of massive data. In this regard, data mining emerges as a set of techniques that add value to large-scale data analysis (

Despite the potential that these statistical techniques may hold, their use in the establishment of performance prediction models in compulsory education is sporadic (

Given the characteristics of the statistical techniques of data mining, many of these works focused on non-university levels propose a dichotomous variable as a criterion variable in their models, referring to whether a student reaches a minimum performance (

As for the statistical technique applied, although numerous works are carried out with the aim of comparing classification algorithms (

In this regard,

Regarding the application of data mining in the prediction of performance at university levels, in a work by

In an analysis of large-scale assessments,

Although the use of data mining has extended significantly in educational research, no applied works have been found that include a comprehensive study of the stability of the models beyond the usual report of overfitting from cross-validation. Stability can be defined as the degree to which an algorithm returns constant results from different samples of the same population (

We should point out that all of the works cited in the state of the art use the gross performance of the student as the criterion variable for the predictive models, and only in a few cases include some contextual factors among their predictor variables. If we define school effectiveness as “the relation between the observed outcomes and the expected outcomes given the socio-economic context of education systems” (

Based on an analysis of secondary data from the PISA 2015 assessment (

Multilevel models with contextual variables applied to OECD countries based on PISA 2009 data show that Spain is one of the countries with the smallest difference between observed and estimated scores in both reading and mathematics (

The size of the Spanish sample in PISA 2015 was much larger than that of most of the sampled countries since each of its 16 autonomous communities is taken as a stratum.

Taking into account the above, our starting point was the population of Spanish students who were 15 years old at the time of the 2015 PISA assessment, their teachers, and the schools in which they studied. In Spain, students who had followed a standard schooling trajectory were at that time in the final year of compulsory secondary education.

From this population, the initial sample obtained was 32,330 students, 4286 teachers, and 976 schools. However, to obtain more stable estimates of the variables aggregated at school level (obtained by averaging the first-level variables), and to obtain better estimates of the multilevel model parameters, we removed from the sample all schools with fewer than 20 students (

As will be discussed later, the results of the multilevel models enabled the selection of high- and low-effectiveness schools.

The sample weights proposed in the PISA 2015 data were used to weight the data in both phases.

The instruments included in the 2015 PISA tests, which we used in this study, were obtained from two sources:

Performance tests in reading, mathematics, and science. The PISA tests used a sampling of items from which the ability of each student in the three areas was estimated using item response theory (IRT). Thus, the PISA assessment includes an estimate of 10 plausible values of the achievement of each student in the three main assessed areas.

Questionnaires used with management teams (school information), teachers, and students. These questionnaires included abundant information on socio-economic context, educational processes and organizational issues, cognitive and personal aspects of students, etc.

While the reliability and validity of the achievement tests included in PISA are evidenced extensively in the technical reports (

Social, cultural, and economic significance of the defined constructs: Cultural differences between countries make it difficult to compare the significance of these constructs and, therefore, to make cross-cultural comparisons (

Lack of stability in the definition of indicators, items, and constructs: Several items and scales change from one edition to another, others are discarded, and some others are included (

Poor translations of the questionnaires into languages other than English: The versions of these questionnaires (including in this case the achievement measurement tests) in the different languages make their comparability difficult (

Missing data: Contrary to what happens with the achievement measurements, which rarely include missing data in the student database, the measurements and constructs of the context questionnaires include missing data on a regular basis (

As a result, although the OECD is making significant efforts in the latest editions of PISA for the improvement of these aspects (

Regarding variables, the following were used:

In the application of the multilevel models, the criterion variables used were gross performance in the three areas assessed at student level (Level 1) and the average performance of the school (Level 2). Unfortunately, although a teacher database is included in PISA 2015, we could not include classroom-level variables in the models, since these data do not make it possible to associate teachers with the students in their classrooms.

The predictive variables used, which were exclusively contextual in nature, were the following:

Level 1: Gender; birth month; academic year; index of economic, social, and cultural status (ESCS); migratory status; repetition of academic year; number of school changes; mother tongue.

Level 2: Size of the school; classroom size; shortage of resources; shortage of teachers; school ownership; student/teacher ratio; average ESCS; repeater rate; immigrant student rate; proportion of girls.

The decision trees included as a criterion variable the identification of the school as high or low effectiveness (dichotomous variable). The predictor variables included in the decision trees were all non-contextual items and scales included in the PISA 2015 databases, both in schools and in teachers and students. In total, the decision trees used included 232 variables (39 of teachers, 139 of students, and 54 of school).

Selection of the variables included in the multilevel models draws from the focus of this research, which is based on the context-input-process-output (CIPO) model. This model (

In particular, selection of the context and input variables used in the multilevel models is based on the literature review carried out both theoretically (

HLM7 software was used to calculate the multilevel models, which were applied taking into account the 10 plausible values provided by PISA 2015 in each of the three areas assessed. HLM7 computes an independent model for each of the available plausible values and returns the averaged parameters. Since HLM7 does not allow the use of sample replicate weights, to minimize bias in error estimation the software employs robust estimators using the Huber-White adjustment. This adjustment compensates for the biases associated with the omission of replicate weights (
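The per-plausible-value procedure can be sketched in Python with statsmodels as an analogue of what HLM7 does internally. The toy data, the column names (`PV1MATH`, `PV2MATH`), and the two-plausible-value setup are illustrative assumptions, and neither sample weights nor robust standard errors are applied here:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fit_per_plausible_value(df, pv_cols, fixed_rhs, group_col="school_id"):
    """Fit one random-intercept model per plausible value and average
    the fixed-effect estimates, analogous to HLM7's internal procedure.
    `df`, `pv_cols`, and `fixed_rhs` are illustrative names."""
    params = []
    for pv in pv_cols:
        model = smf.mixedlm(f"{pv} ~ {fixed_rhs}", data=df,
                            groups=df[group_col])
        params.append(model.fit(reml=True).fe_params)
    return pd.concat(params, axis=1).mean(axis=1)

# Toy data: 40 schools x 25 students, 2 plausible values
rng = np.random.default_rng(0)
n_schools, n_students = 40, 25
school = np.repeat(np.arange(n_schools), n_students)
u = rng.normal(0, 5, n_schools)[school]          # school random effect
escs = rng.normal(0, 1, n_schools * n_students)  # level-1 predictor
base = 500 + 20 * escs + u
df = pd.DataFrame({
    "school_id": school, "ESCS": escs,
    "PV1MATH": base + rng.normal(0, 30, len(base)),
    "PV2MATH": base + rng.normal(0, 30, len(base)),
})
avg_params = fit_per_plausible_value(df, ["PV1MATH", "PV2MATH"], "ESCS")
print(avg_params)
```

In the actual study all 10 plausible values and the PISA sample weights would enter the estimation.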

Using a significance level of 5%, we included only the significant predictor variables in the multilevel models. Since the three models obtained from each achievement measurement were clearly different in terms of the predictor variables included (

Finally, we computed the significant models, with random intercepts and fixed slopes at school level, and calculated the school-level residuals using empirical Bayes estimation (

Y_{ij} = γ_{00} + u_{0j} + r_{ij} (Eq. 1)

where Y_{ij} refers to the performance achieved by student i in school j, which represents the sum of the overall average performance in the corresponding area (γ_{00}), the distance of the average school performance from that overall average (u_{0j}), and the distance of the student's performance from the school average (r_{ij}).

The final models obtained in each area are specified in Eq. 2:

Y_{ij} = γ_{00} + Σ_{s}γ_{0s}W_{sj} + Σ_{q}γ_{q0}X_{qij} + u_{0j} + r_{ij} (Eq. 2)

where γ_{0s} are the coefficients associated with the school-level variables W_{sj}, and γ_{q0} are the coefficients associated with the student-level variables X_{qij}.

After obtaining the residuals of the schools in the three final models, we selected the high- and low-effectiveness schools. To do so, we carried out a first selection (non-stratified selection), in which we selected the schools placed in the first quartile in all three computed models (negative-residual schools, low effectiveness) and the schools placed in the last quartile in all three models (positive-residual schools, high effectiveness).

Given the extensive educational competence of the autonomous communities in Spain, we made a second selection of schools (stratified selection). In this second case, we used the same criteria indicated above in each of Spain’s 16 autonomous communities, implementing 16 separate selective processes for high- and low-effectiveness schools. The residuals used in this selection were the same residuals used in the original selection. We opted for this procedure because ICC levels were below 10% in the null models of the specific samples in some autonomous communities. The decision to create a dichotomous variable from the residuals obtained in school level addresses two fundamental questions:

The use of dichotomous criterion variables in obtaining decision trees simplifies the interpretation of the rules obtained in the models.

The residuals used are indicators with estimation errors associated with them, so the use of their absolute values is not appropriate.
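The selection rule described above (first or last quartile in all three models) can be sketched as follows; the school identifiers and residual column names are illustrative:

```python
import pandas as pd

def label_effectiveness(residuals: pd.DataFrame) -> pd.Series:
    """Label schools as 'high'/'low'/'not selected' from level-2 residuals.
    `residuals` has one row per school and one column per model
    (science, reading, mathematics) -- illustrative column names."""
    q1 = residuals.quantile(0.25)
    q3 = residuals.quantile(0.75)
    low = residuals.le(q1).all(axis=1)   # first quartile in all three models
    high = residuals.ge(q3).all(axis=1)  # last quartile in all three models
    labels = pd.Series("not selected", index=residuals.index)
    labels[low] = "low"
    labels[high] = "high"
    return labels

# Toy residuals for five hypothetical schools
res = pd.DataFrame({
    "science":     [-9.0, -4.0, 0.5, 6.0, 8.0],
    "reading":     [-7.0, -1.0, 0.0, 5.0, 9.0],
    "mathematics": [-8.0, -2.0, 1.0, 4.0, 7.0],
}, index=["s1", "s2", "s3", "s4", "s5"])
print(label_effectiveness(res))
```

The stratified variant would simply apply the same function separately within each autonomous community.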

The decision trees were computed using the free software Weka 3.8.1. Given the results reported in the literature reviewed, we decided to use the C4.5 algorithm in the construction of the models. This algorithm is an extension of ID3 (
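As a rough sketch of this modeling step: Weka's J48 implements C4.5, whereas scikit-learn's trees implement CART, so the following is only an analogue; the synthetic data stand in for the study's predictor matrices, and `max_leaf_nodes=30` mimics the 30-branch limit used in the study:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a school-level predictor matrix
X, y = make_classification(n_samples=240, n_features=20, n_informative=6,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# CART (sklearn) rather than C4.5 (Weka J48); max_leaf_nodes caps the
# tree size, analogous to the 30-branch limit used in the study.
tree = DecisionTreeClassifier(max_leaf_nodes=30, random_state=42)
tree.fit(X_train, y_train)
print(f"training accuracy:  {tree.score(X_train, y_train):.3f}")
print(f"validated accuracy: {tree.score(X_test, y_test):.3f}")
```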

As an additional control measure, we included a study of the stability of the trees obtained (
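A minimal sketch of this stability measure, operationalized as the coefficient of variation of accuracy across cross-validation folds, assuming synthetic data and using 10 folds instead of the study's 100 for brevity:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one of the study's databases
X, y = make_classification(n_samples=300, n_features=15, n_informative=5,
                           random_state=0)
tree = DecisionTreeClassifier(max_leaf_nodes=30, random_state=0)

# Accuracy per fold; the study used 100-fold cross-validation,
# 10 folds are used here for brevity.
scores = cross_val_score(tree, X, y, cv=10, scoring="accuracy")
cv_pct = 100 * scores.std() / scores.mean()  # coefficient of variation
print(f"mean={scores.mean():.3f} sd={scores.std():.3f} cv={cv_pct:.2f}%")
```

A lower coefficient of variation indicates more stable results across samples.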

To provide a known point of reference against which the fit of the models obtained can be assessed, logistic regression models were also applied. Selection of the predictor variables included in these models was automated through the use of the LogitBoost algorithm (

It was necessary to generate a total of 120 different databases based on different determining factors:

Informant of predictor variables (

Type of predictor variables included (

Stratum by Spanish region (

Significance level of the predictor variables (

Type of sample to obtain the model (

This made it possible to compute a total of 120 different decision trees based on these five determining factors (for example, one of the 120 trees calculated included as predictor variables the significant scales of the student database and, as a criterion variable, the identification of high- and low-effectiveness schools taking into account stratification by region, estimating the fit indices from the training sample).

After this process, we were able to compare the fit of the trees obtained based on these five determining factors. To do so, we calculated the average scores and standard deviations, both overall and by groups of interest, and used the appropriate hypothesis tests to compare the groups. This procedure made it possible to identify the categories or groups with the best and worst fit in the predictive models.
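This comparison step can be sketched with an ANOVA on accuracy by one determining factor. The group means and accuracies below are made up for illustration; the study compared all five factors and their interactions:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Illustrative accuracies for trees grouped by one determining factor
# (the informant database); values are invented for the example.
rng = np.random.default_rng(1)
frames = []
for db, mu in [("schools", 0.50), ("teachers", 0.58), ("students", 0.59),
               ("combined", 0.75)]:
    frames.append(pd.DataFrame({"DDBB": db,
                                "accuracy": rng.normal(mu, 0.05, 30)}))
data = pd.concat(frames, ignore_index=True)

# One-way ANOVA: does accuracy differ by database?
model = smf.ols("accuracy ~ C(DDBB)", data=data).fit()
print(anova_lm(model, typ=2))
print(f"adjusted R2 = {model.rsquared_adj:.3f}")
```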

The initial ICC in the three models applied was acceptable (science = 12.41%; reading = 12.04%; mathematics = 12.26%). The ICC of the final models reached acceptable levels (science = 5.60%; reading = 5.07%; mathematics = 4.55%), since in the three models the variance explained at school level accounted for more than 50% of the total school-level variance. The most explanatory model was the one predicting competence in mathematics, in which the predictor variables accounted for 62.99% of the total variance of the second level.
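The two quantities reported here follow directly from the variance components of the multilevel models; a minimal sketch, where the variance components are illustrative values chosen only to produce percentages of the same order as those reported:

```python
def icc(tau00: float, sigma2: float) -> float:
    """Intraclass correlation: share of total variance lying between schools."""
    return tau00 / (tau00 + sigma2)

def level2_variance_explained(tau_null: float, tau_final: float) -> float:
    """Proportion of school-level variance explained by the predictors."""
    return (tau_null - tau_final) / tau_null

# Illustrative variance components (not the study's actual estimates)
tau_null, sigma2 = 12.41, 87.59
print(f"null ICC: {icc(tau_null, sigma2):.2%}")
print(f"explained at level 2: {level2_variance_explained(12.41, 4.59):.2%}")
```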

The breakdown of selected schools, based on the two procedures described in the methodology, can be seen in

Breakdown of high- and low-effectiveness schools according to the selection procedure.

| | High | Not selected | Low | Total |
|---|---|---|---|---|
| High | 75 | 34 | 0 | 109 |
| Not selected | 55 | 518 | 59 | 632 |
| Low | 0 | 62 | 68 | 130 |
| Total | 130 | 614 | 127 | 871 |

Breakdown of students and teachers according to selection procedure.

| | | High | Not selected | Low | Total |
|---|---|---|---|---|---|
| High | Students | 2569 | 1211 | 0 | 3780 |
| | Teachers | 294 | 58 | 0 | 352 |
| Not selected | Students | 2016 | 18,553 | 2230 | 22,799 |
| | Teachers | 241 | 2206 | 340 | 2787 |
| Low | Students | 0 | 2198 | 2328 | 4526 |
| | Teachers | 0 | 141 | 402 | 543 |
| Total | Students | 4585 | 21,962 | 5448 | 31,105 |
| | Teachers | 535 | 2405 | 742 | 3682 |

The average accuracy levels obtained according to each of the determining factors are shown in

Average accuracy of the decision trees according to determining factors (in parentheses, accuracy of the logistic regression models).

| Factor | Category | Training: overall | Training: high eff. | Training: low eff. | Validated: overall | Validated: high eff. | Validated: low eff. |
|---|---|---|---|---|---|---|---|
| DDBB | Schools | 0.803 (0.767) | 0.801 (0.767) | 0.812 (0.768) | 0.502 (0.526) | 0.483 (0.506) | 0.518 (0.539) |
| | Teachers | 0.665 (0.665) | 0.647 (0.612) | 0.676 (0.677) | 0.581 (0.630) | 0.453 (0.523) | 0.610 (0.652) |
| | Students | 0.626 (0.631) | 0.617 (0.617) | 0.633 (0.641) | 0.591 (0.604) | 0.578 (0.587) | 0.603 (0.618) |
| | Aggr. school + student | 0.934 (0.950) | 0.943 (0.951) | 0.927 (0.950) | 0.717 (0.821) | 0.718 (0.809) | 0.717 (0.833) |
| | Not aggr. school + student | 0.807 (0.834) | 0.786 (0.829) | 0.846 (0.838) | 0.786 (0.823) | 0.781 (0.819) | 0.797 (0.828) |
| Items-scales | Items + scales | 0.773 (0.811) | 0.759 (0.800) | 0.788 (0.817) | 0.638 (0.704) | 0.609 (0.680) | 0.650 (0.717) |
| | Items | 0.769 (0.802) | 0.758 (0.792) | 0.777 (0.808) | 0.632 (0.706) | 0.604 (0.682) | 0.642 (0.720) |
| | Scales | 0.760 (0.695) | 0.759 (0.675) | 0.772 (0.700) | 0.636 (0.633) | 0.594 (0.585) | 0.654 (0.645) |
| Significance | All | 0.772 (0.808) | 0.763 (0.795) | 0.782 (0.814) | 0.633 (0.704) | 0.597 (0.677) | 0.649 (0.716) |
| | Significant | 0.762 (0.731) | 0.754 (0.715) | 0.775 (0.736) | 0.637 (0.658) | 0.608 (0.621) | 0.648 (0.672) |
| Stratum | Stratified | 0.776 (0.777) | 0.768 (0.757) | 0.784 (0.783) | 0.649 (0.687) | 0.597 (0.637) | 0.673 (0.706) |
| | Not stratified | 0.758 (0.762) | 0.749 (0.754) | 0.773 (0.767) | 0.621 (0.675) | 0.608 (0.661) | 0.625 (0.682) |

Although the level of overall accuracy of the logistic regression models is slightly higher, mainly in the validated data, we should bear in mind that no maximum number of predictor variables is set in these models, whereas a maximum size of 30 branches is set in the decision trees. Thus, while the logistic regression models include an average of 47.12 predictor variables (reaching a maximum of 174 variables in one of the models), the decision trees feature, on average, 15.10 rules (with a maximum of 30). This difference in complexity significantly affects the fit levels of the models.

Stability of the decision trees obtained (100-fold cross-validation) according to determining factors.

| Factor | Category | Mean accuracy | SD | CV |
|---|---|---|---|---|
| DDBB | Schools | 0.502 | 0.363 | 69.76% |
| | Teachers | 0.581 | 0.156 | 26.61% |
| | Students | 0.591 | 0.077 | 12.83% |
| | Aggr. school + student | 0.717 | 0.315 | 44.18% |
| | Not aggr. school + student | 0.786 | 0.058 | 7.30% |
| Items-scales | Items + scales | 0.638 | 0.257 | 39.99% |
| | Items | 0.632 | 0.249 | 38.48% |
| | Scales | 0.636 | 0.247 | 38.57% |
| Significance | All | 0.633 | 0.251 | 38.63% |
| | Significant | 0.637 | 0.250 | 39.38% |
| Stratum | Stratified | 0.649 | 0.251 | 37.90% |
| | Not stratified | 0.621 | 0.250 | 39.99% |

In general, decision trees were obtained with highly variable levels of fit depending on the determining factors proposed (

Overall fit of the decision trees proposed.

| Index | Sample | High eff.: (min, max) | High eff.: mean (SD) | Low eff.: (min, max) | Low eff.: mean (SD) | Total: (min, max) | Total: mean (SD) |
|---|---|---|---|---|---|---|---|
| TP | Training | (0.18, 0.95) | 0.697 (0.208) | (0.59, 0.97) | 0.815 (0.103) | (0.59, 0.95) | 0.767 (0.114) |
| | Validated | (0.14, 0.88) | 0.552 (0.187) | (0.40, 0.91) | 0.698 (0.124) | (0.40, 0.87) | 0.635 (0.116) |
| Accu. | Training | (0.56, 0.97) | 0.756 (0.125) | (0.61, 0.95) | 0.779 (0.113) | (0.59, 0.95) | 0.767 (0.114) |
| | Validated | (0.32, 0.88) | 0.602 (0.146) | (0.40, 0.88) | 0.649 (0.116) | (0.40, 0.87) | 0.635 (0.116) |
| Kappa | Training | – | – | – | – | (0.18, 0.90) | 0.515 (0.246) |
| | Validated | – | – | – | – | (-0.20, 0.76) | 0.248 (0.247) |
| ROC | Training | – | – | – | – | (0.62, 0.98) | 0.802 (0.122) |
| | Validated | – | – | – | – | (0.38, 0.92) | 0.652 (0.135) |

An example of two of the very different decision trees obtained in this study is presented in

Example of two trees obtained in the study.

The rhombuses show the predictor variables included in the models and the ellipses the accuracy of each rule established by the tree. The arrows provide information on the range of scores of the preceding variable included in the rule, while each ellipse provides information on the accuracy of the rule (the letter represents the prediction of the schools that meet that rule as high or low effectiveness, the first number indicates the number of elements covered by the rule, and the second the number of elements whose level of effectiveness does not match the prediction). Thus, in the right-hand tree, it can be observed that very low levels of teacher experience and job satisfaction are the fundamental variables that predict low effectiveness. Meanwhile, in schools whose teachers have higher levels of experience and job satisfaction, it is also necessary to have a staff committed to decision-making in the school and to their own teaching development.
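Rule sets of this kind can also be inspected as text; a sketch with scikit-learn's export_text (a CART analogue of the J48 trees shown, and the feature names are illustrative stand-ins for the teacher variables mentioned):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative stand-ins for teacher-level predictors
feature_names = ["experience", "job_satisfaction", "participation",
                 "development"]
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, random_state=7)
tree = DecisionTreeClassifier(max_leaf_nodes=5, random_state=7).fit(X, y)

# Each printed path is one rule: thresholds on predictors ending in a
# class prediction, mirroring the rhombus-and-arrow reading above.
print(export_text(tree, feature_names=feature_names))
```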

Comparison of the fit of the models according to the determining factors was carried out in the validated samples through an ANOVA test, using accuracy as the dependent variable and the determining factors as fixed factors (adjusted R^{2} = 89.4%). Significant determining factors were observed in

Decision tree fit comparison ANOVA table—overall accuracy.

| Source | F | p | η^{2} |
|---|---|---|---|
| Intercept | 16,860.041 | <0.001 | 0.998 |
| DDBB | 108.829 | <0.001 | 0.912 |
| Items-scales | 0.154 | 0.857 | 0.007 |
| Significance | 0.140 | 0.710 | 0.003 |
| Stratum | 7.997 | 0.007 | 0.160 |
| DDBB^{∗}Significance | 8.902 | <0.001 | 0.459 |
| DDBB^{∗}Stratum | 6.979 | <0.001 | 0.399 |
| Significance^{∗}Stratum | 6.186 | 0.017 | 0.128 |

Interaction between determining factors—overall accuracy.

Good levels of fit were also achieved in the high-effectiveness model (adjusted R^{2} = 87.8%) and in the low-effectiveness model (adjusted R^{2} = 83.6%).

Decision tree fit comparison ANOVA table—high and low effectiveness.

| Source | High eff.: F | High eff.: p | High eff.: η^{2} | Low eff.: F | Low eff.: p | Low eff.: η^{2} |
|---|---|---|---|---|---|---|
| Intercept | 8372.433 | <0.001 | 0.995 | 11,456.530 | <0.001 | 0.996 |
| DDBB | 95.475 | <0.001 | 0.899 | 64.474 | <0.001 | 0.857 |
| Items-scales | 0.449 | 0.641 | 0.020 | 0.339 | 0.714 | 0.016 |
| Significance | 0.656 | 0.422 | 0.015 | 0.012 | 0.913 | <0.001 |
| Stratum | 0.741 | 0.394 | 0.017 | 15.683 | <0.001 | 0.267 |
| DDBB^{∗}Significance | 6.573 | <0.001 | 0.379 | 5.388 | 0.001 | 0.334 |
| DDBB^{∗}Stratum | 7.347 | <0.001 | 0.406 | 5.021 | 0.002 | 0.318 |

The significant interaction between Database and Significance is further analyzed in

Interaction between Database and Significance—high and low effectiveness.

Finally,

Interaction between Database and Stratum—high and low effectiveness.

The main aim of this work was to study the relevance of the use of decision trees for the study of educational factors associated with school effectiveness using data from large-scale assessments. We decided to use the C4.5 algorithm since it allows the use of variables of all kinds (

On the one hand, a descriptive study of the level of fit of the decision trees was carried out based on several important determining factors. The results seem to indicate that the factor that creates the most differences in the accuracy achieved and in the stability is the

Taking into account that in this study decision trees with a very small size (maximum 30 branches) were selected, the levels of average accuracy achieved, which were above 0.76 in the training sample, were satisfactory. Other previous works that did not limit the size of the decision trees achieved overall accuracy levels of between 0.7 and 0.8 (

Also notable in the results obtained was the superior fit of the models for the prediction of low-effectiveness schools. The calculated models achieved a TP rate almost 15% higher in these schools than in those of high effectiveness in the validated results (and 5% higher accuracy). Since other works seem to point to this trend based on more applied analysis (

In addition, an inferential study was carried out both on the significance of the main effects of the determining factors analyzed and on the interactions between these factors. The results reflected those indicated above, highlighting that the strongest determining factor in the accuracy of the models was

The interactions studied show interesting trends not assessed in previous works: on the one hand,

Several consistent strengths and contributions of our study are therefore confirmed, mainly in relation to the good fit shown by a good number of the models applied in general, and especially by one of them in particular. It seems that the use of decision trees on correctly pre-processed data, which include abundant and combined school, teacher, and student information, returns predictive models of school effectiveness with good fits both in the training and validated samples. Some weaknesses inherent to this work should, however, also be highlighted. On the one hand, we find the restrictive nature of the categorization performed on the criterion variable to obtain a dichotomous variable. This decision eliminates much of the variability of this variable, limiting the possibilities of pattern identification in the data. In this regard, we decided to prioritize obtaining easily interpretable decision trees over very tightly fitting trees that are difficult to apply to educational reality and decision-making. We believe that this is the most appropriate procedure to facilitate the transfer of the results obtained given the level of development and current possibilities of the techniques used. In this sense, we must also point out that the computation of small, easily interpretable trees could make it difficult to obtain trusted trees (

Many future research lines of great interest for the educational scientific community are therefore opened up, mainly in two areas. Regarding basic studies similar to this one, we need to increase the volume of evidence and contributions, since there are no similar studies that use school effectiveness as a criterion variable: works that compare the operation of various classification algorithms; systematic analyses of the implications of using combined aggregated and non-aggregated data; and studies similar to this one in which the gross school-effectiveness residual (scale variable) or a polytomous categorization is used. With respect to more applied studies, there is undeniable potential in the use and interpretation of specific decision trees in various databases to try to identify factors associated with effectiveness, thereby contributing to the educational characterization of high- and low-effectiveness schools.

The datasets for this study can be found in the OECD webpage:

Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent to participate in this study was provided by the participants’ legal guardian/next of kin. This study is based on the public databases of the PISA 2015 assessment (OECD). Data collection for OECD-PISA studies is under the responsibility of the governments from the participating countries.

This work was conducted and developed solely by FM-A.

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.