
Edited by: Rosalba Giugno, University of Verona, Italy

Reviewed by: Jie Zheng, Nanyang Technological University, Singapore; Xianwen Ren, Peking University, China

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Bioengineering and Biotechnology

†These authors have contributed equally to this work.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Cell classifier circuits are synthetic biological circuits capable of distinguishing between different cell states depending on specific cellular markers and engendering a state-specific response. Examples are classifiers for cancer cells that recognize whether a cell is healthy or diseased based on its miRNA fingerprint and trigger cell apoptosis in the latter case. Binarization of continuous miRNA expression levels makes it possible to formalize a classifier as a Boolean function whose output codes for the cell condition. In this framework, the classifier design problem consists of finding a Boolean function capable of reproducing correct labelings of miRNA profiles. The specifications of such a function can then be used as a blueprint for constructing a corresponding circuit in the lab. To find a classifier that is optimal both in terms of performance and reliability, however, accuracy, design simplicity and constraints derived from the availability of molecular building blocks for the classifiers all need to be taken into account. These complexities translate to computational difficulties, so currently available methods explore only part of the design space and consequently are only capable of calculating locally optimal designs. We present a computational approach for finding globally optimal classifier circuits based on binarized miRNA datasets, using Answer Set Programming for efficient scanning of the entire search space. Additionally, the method is capable of computing all optimal solutions, allowing for comparison between optimal classifier designs and identification of key features. Several case studies illustrate the applicability of the approach and highlight the quality of results in comparison with a state-of-the-art method. The method is fully implemented and a comprehensive performance analysis demonstrates its reliability and scalability.

With the ongoing development of sophisticated engineering methods for biological components, the benefits of synthetic biology for medical applications are discussed more and more (Kis et al.,

Rational design of synthetic biological systems is a complex task. Assembly of an

In this article, we show the potential of formal methods, in particular Answer Set Programming (ASP), in the context of classifier design. Although the underlying ideas are broadly applicable, we tailor our implementation to the task of processing miRNA profiles to distinguish between healthy and cancerous cells. In this context, a classifier is represented by a Boolean function that, given as input a discretized miRNA profile, outputs the binary cell state encoding healthy or diseased. A similar problem is considered in work by Mohammadi and colleagues (Xie et al.,

Exploiting the potential of ASP as a powerful solver for constraint satisfaction problems, we present an approach that makes it possible to compute globally optimal classifiers satisfying all given constraints. In the hierarchy of optimization criteria, the strongest emphasis is placed on classifier accuracy in terms of classification errors, followed by circuit simplicity in terms of the number of inputs and utilized gates. The computational power of the approach allows us to calculate all optimal solutions, which can be further distinguished using scores relating the discrete results to the continuous data. Not least, comparison of those optimal solutions can uncover key classifier features as well as highlight variability in design. After describing the formalization and strategies for solving the problems, we present our results for five breast cancer case studies and compare them with the output of the heuristic approach of Mohammadi et al. (

Similar to electronic circuits, synthetic gene circuits are designed in terms of logic gates (Singh,

Cell classifier circuits are synthetic logic circuits capable of sensing endogenous molecular signals (inputs) in a living cell, classifying them as type-specific signals and triggering a desired response (output) based on the classification result (Xie et al.,

Here, we focus on miRNA-based cell classifiers for cancer datasets where input signals are miRNA expression levels that are binarized into two qualitative levels:

If the data is consistent, that is, the same profile has not been observed to characterize both a healthy and a diseased cell, and does not cover all input combinations, then there will be more than one classifier. In practice, it is then interesting to choose from the set of mathematically feasible classifiers those that are biologically feasible, while minimizing a cost function that represents the actual cost of assembling the classifier in a laboratory.
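The consistency condition above can be checked directly on a binarized dataset. The following sketch (the function name and data layout are our own, not taken from the published scripts) flags profiles that occur with both labels:

```python
def find_conflicts(samples):
    """Return the set of binarized profiles that appear with both labels.

    `samples` is an iterable of (label, profile) pairs, where label is
    1 (positive/diseased) or 0 (negative/healthy) and profile is a
    tuple of 0/1 miRNA levels.
    """
    seen = {}  # profile -> set of labels observed for it
    for label, profile in samples:
        seen.setdefault(profile, set()).add(label)
    return {p for p, labels in seen.items() if len(labels) > 1}

# A dataset is consistent iff no profile carries both labels.
samples = [
    (0, (1, 0, 1)),  # healthy
    (0, (1, 1, 0)),  # healthy
    (1, (0, 0, 1)),  # diseased
]
assert find_conflicts(samples) == set()          # consistent
samples.append((1, (1, 0, 1)))                   # same profile, other label
assert find_conflicts(samples) == {(1, 0, 1)}    # now inconsistent
```

A dataset passing this check admits at least one perfect classifier in principle; a conflict means only imperfect classifiers are possible.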

To implement our approach to classifier design we made use of Answer Set Programming (Lifschitz,

The full approach is implemented in a set of Python scripts, available on GitHub with a detailed description and a manual. The script

The scheme for our ASP-based approach to synthetic gene circuit design.

For our purposes, a dataset is a table of binarized miRNA expression profiles for different samples. The first column referenced as

Example data set for 3 samples: 2 negative and 1 positive.

Naturally, data discretization has to be handled with care, since results will depend on the chosen thresholds. A variety of discretization methods are available; see, for example, the work by Gallo et al. (
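As a minimal illustration, threshold-based binarization can be sketched as follows; the helper below is hypothetical and stands in for whichever discretization method is chosen:

```python
def binarize(expression_levels, threshold):
    """Map continuous miRNA expression levels to Boolean levels:
    1 if the level reaches the threshold, 0 otherwise."""
    return tuple(1 if x >= threshold else 0 for x in expression_levels)

# With a threshold of 250 (the value used for the Breast Cancer All
# dataset), a raw profile becomes a 0-1 vector:
assert binarize([731.0, 12.5, 250.0, 98.1], 250) == (1, 0, 1, 0)
```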

In our context, a classifier is a Boolean function f: {0, 1}^{n} → {0, 1}, where n is the number of considered miRNA inputs.

An example of a perfect classifier for the dataset shown in Figure

where ∧, ∨ and ¬ represent logical conjunction, disjunction and negation, respectively. This classifier consists of two clauses, namely (

Differences in the structure of the Boolean function, for example, how the gates are formed, are relevant because they may result in classifiers that cannot be assembled in the laboratory or in classifiers that are very expensive to assemble. Based on the study of Mohammadi et al. (

A full set of core classifier constraints. Here, a classifier may consist of up to 6 gates with overall up to 8 inputs. Gates are either of Type 1 (OR gate) or Type 2 (NOT gate) where the first may have only non-negated inputs while the second may only be a single negated input.

The core constraints may in many cases be extended to specific needs without much effort. Here, we describe two that are implemented in our software.

First, it may easily happen that the chosen constraints are not satisfiable for a given dataset, that is, a perfect classifier does not exist. This may be caused by the diversity between cancer samples, but also by experimental artifacts or data preprocessing errors, for example, in the data discretization step. In such a case, we can search for an imperfect classifier, allowing misclassification of a certain number of samples. Thus, we introduced two additional constraints: upper bounds on false positive and false negative errors. The procedure for imperfect classifier optimization we call

It is worth considering which type of error we may accept or even neglect. If the desired response is to trigger cell apoptosis, a false negative error results in a wrong diagnosis and a cancerous cell survives. In case of a false positive error a healthy cell will be diagnosed as diseased and killed. By tolerating only false negative errors we may avoid killing healthy cells, while at the same time increasing the misclassification expectancy for the diseased cells. The first presented case study, Breast Cancer

Second, we formulated the

Once a feasible classifier is specified in terms of gate types and bounds on inputs and gates, it is usually of interest to find one that is optimal with respect to a given cost function. Putting the focus on finding structurally simple classifiers to facilitate construction, we propose the following optimization problems for finding a perfect classifier:

(Opt1) Minimize the number of inputs.

(Opt2) Minimize the number of gates.

(Opt3) Minimize the number of inputs followed by the number of gates.

(Opt4) Minimize the number of gates followed by the number of inputs.

The two problems (Opt3) and (Opt4) are bi-level optimization problems where the upper-level problem is solved first followed by the lower-level problem. Each of the four strategies may lead to a different classifier even for the same dataset and classifier constraints. Here, it might also be interesting to run both (Opt3) and (Opt4). Results can be evaluated and the final design chosen accordingly.
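The difference between the four strategies can be illustrated on toy candidates described only by their sizes. This sketch uses plain lexicographic comparison and is not part of the actual ASP encoding, where the same effect is achieved with prioritized minimization:

```python
# Candidate classifiers, described here only by their size:
# (number_of_inputs, number_of_gates).
candidates = [(3, 1), (2, 2), (4, 1), (2, 3)]

opt1 = min(c[0] for c in candidates)                # fewest inputs
opt2 = min(c[1] for c in candidates)                # fewest gates
opt3 = min(candidates, key=lambda c: (c[0], c[1]))  # inputs, then gates
opt4 = min(candidates, key=lambda c: (c[1], c[0]))  # gates, then inputs

assert opt1 == 2 and opt2 == 1
assert opt3 == (2, 2)   # smallest input count wins, gates break ties
assert opt4 == (3, 1)   # smallest gate count wins, inputs break ties
```

Note that (Opt3) and (Opt4) pick different designs from the same candidate pool, which is why running both and comparing the results can be worthwhile.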

In the context of the ASP programming environment (see the

Note that we optimize globally, that is, any solution found is among the best of all feasible solutions for a given dataset. In other words, if a given classifier is a solution of optimizing with strategy (Opt1) and it consists of 7 inputs, then a classifier with fewer than 7 inputs does not exist for the given dataset under the given constraints. In addition, the ASP-based method can list all optimal solutions, if more than one exists. Analysis of commonalities and differences of the optimal classifiers can then provide additional insight into the importance of specific inputs or circuit modules for the classification.

However, for some datasets it is impossible to find an optimal solution satisfying all constraints. As mentioned before, to tackle that issue we incorporated an optimization procedure for finding imperfect classifiers (but respecting the

Usually, optimal solutions are not unique. There may be several optimal designs that differ in the miRNAs that are used or in the way inputs are assigned to gates. In those cases it might be insightful to enumerate all optimal solutions, for example to investigate common structural features, or simply to ask which miRNAs do or do not appear in optimal classifiers. However, when interested in this feature we need to take care of symmetries generated in the process of finding the classifiers. ASP allows isomorphic classifiers to be counted as different solutions: gates, for example, are assigned an integer identifier, which we need in order to determine which inputs belong to which gate, and any permutation of assigning IDs to gates will therefore be counted as a separate solution. Breaking symmetries is an involved topic of its own. Its importance lies in the fact that the number of symmetric solutions may explode and seriously hamper the calculation of all solutions. In our applications, however, we can still easily solve this problem through a post-processing step. Within a set of optimal solutions returned by the ASP solver we first sort the gates of each classifier by the IDs of their inputs. Then we rewrite all the solutions by assigning to each gate an ID in ascending order, preserving the original gate-to-input relation. The input and gate IDs are then ordered identically for all the isomorphic solutions in each class, which makes them indistinguishable. An arbitrary solution from such a subset can be picked as a representative. This procedure is illustrated in the first case study presented in section 3.2
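The post-processing step can be sketched as follows. For simplicity the sketch represents a classifier only by its gate-to-input assignment (real solutions also carry gate types and input signs), and the function name is ours:

```python
def canonicalize(classifier):
    """Rewrite a classifier into a canonical form so that isomorphic
    solutions become identical.

    `classifier` maps gate IDs to a tuple of input (miRNA) IDs, e.g.
    {2: (5, 7), 1: (3,)}. Gates are sorted by their sorted input tuples
    and then renumbered 1, 2, ... in that order, preserving the
    gate-to-input relation.
    """
    gates = sorted(tuple(sorted(inputs)) for inputs in classifier.values())
    return {gate_id: inputs for gate_id, inputs in enumerate(gates, start=1)}

# Two isomorphic solutions (same gates, permuted IDs) collapse to one:
a = {1: (3,), 2: (5, 7)}
b = {1: (7, 5), 2: (3,)}
assert canonicalize(a) == canonicalize(b) == {1: (3,), 2: (5, 7)}
```

After canonicalization, duplicates can be removed with a simple set or dictionary, and any remaining element serves as the class representative.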

To assess the classifiers resulting from the optimization procedure we incorporated in our workflow an evaluation step. We distinguish two settings for the classifier assessment: Boolean and continuous. In the Boolean setting we consider discretized datasets and classifiers, that is, Boolean functions, to evaluate how well the function separates samples for a given dataset. In a continuous setting we estimate how well a classifier will perform in a setting closer to reality. Here, we adopt an approach by Mohammadi et al. (

To assess classifiers in the Boolean setting we calculate the false positive and false negative rates. Both rates allow us to estimate the probability that a sample is misclassified and how well a classifier performs in classifying the data.
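In the Boolean setting the two rates reduce to simple counting. The following sketch (data layout assumed, not taken from the published scripts) computes them for a classifier given as a Python predicate:

```python
def error_rates(classifier, samples):
    """Compute (false_negative_rate, false_positive_rate) of a Boolean
    classifier over a binarized dataset.

    `classifier` maps a profile (tuple of 0/1) to the predicted class;
    `samples` is an iterable of (label, profile) pairs with label 1 for
    positive (diseased) and 0 for negative (healthy) samples.
    """
    fn = fp = pos = neg = 0
    for label, profile in samples:
        predicted = classifier(profile)
        if label == 1:
            pos += 1
            fn += predicted == 0   # diseased sample classified as healthy
        else:
            neg += 1
            fp += predicted == 1   # healthy sample classified as diseased
    return fn / pos if pos else 0.0, fp / neg if neg else 0.0

# A toy one-gate classifier: NOT miRNA_0 (high expression of the first
# miRNA marks a healthy cell).
classify = lambda profile: 0 if profile[0] == 1 else 1
samples = [(1, (0, 1)), (1, (0, 0)), (0, (1, 0)), (0, (0, 1))]
assert error_rates(classify, samples) == (0.0, 0.5)
```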

To evaluate classifiers in the continuous setting we calculate, as mentioned, the classifier scores proposed by Mohammadi et al. (

The first score (S_{AUC}), computed with the help of the continuous model, is the area under the ROC curve (AUC); it predicts how well the classifier responds based on different thresholds for the circuit output concentration. Additionally, Mohammadi et al. consider the margins of the circuit response, namely the average margin M_{a} and the worst margin M_{w}; the margin score (S_{m}) is represented by a weighted sum of these margins:

S_{m} = λ M_{a} + (1 − λ) M_{w},

where λ ∈ [0, 1] is a weight that specifies the contribution of the particular margins. For the breast cancer datasets we used λ = 0.5 (assuming that both margins are equally relevant), as applied also by Mohammadi et al. Among the computed classifiers we first select those with the highest S_{AUC} and then the one with the highest S_{m}. We preserve all details of the strategy proposed by Mohammadi et al. (

As an additional evaluation of our approach, we perform, if the data permits, cross-validation to test the predictive power of the calculated optimal classifiers when facing entirely new samples. We illustrate that for the largest case study dataset below, where we performed a 3-fold cross validation.

To have a broad picture of the performance of our method we tested our approach on simulated datasets. Here, we describe how the datasets were prepared.

We generated random 0-1 matrices, with each entry independently taking the value 0 or 1 with equal probability, for all dimensions from 10 × 10 up to 500 × 500, where the step-sizes for increasing rows and columns are both 10. The benchmark therefore consists of 50 × 50 = 2,500 binary matrices representing miRNA expression datasets. For each matrix we generated
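A benchmark instance of this kind can be generated with a few lines of Python; the function below is an illustrative sketch, not the actual generation script:

```python
import random

def random_dataset(n_samples, n_mirnas, seed=None):
    """Generate one benchmark instance: an n_samples x n_mirnas 0-1
    matrix with each entry independently 0 or 1 with equal probability."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n_mirnas)]
            for _ in range(n_samples)]

# Dimensions scan the grid 10x10 up to 500x500 in steps of 10,
# giving 50 * 50 = 2,500 matrices in total.
sizes = [(rows, cols)
         for rows in range(10, 501, 10)
         for cols in range(10, 501, 10)]
assert len(sizes) == 2500

matrix = random_dataset(10, 10, seed=42)
assert len(matrix) == 10 and all(len(row) == 10 for row in matrix)
assert all(entry in (0, 1) for row in matrix for entry in row)
```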

Here, we propose two setups for the data generation. In

In

In addition to the run-time analysis for our ASP-based implementation we also assessed the quality of the predictions of the classifiers. We decided to record the generalization error of a 10-fold cross-validation for Setup 1 (guaranteed existence of solution), for finding feasible and optimal solutions. The cross-validation was conservative in that we treated time-outs as false predictions.

For the cross-validation the rows of each matrix of each data point were divided into 10 parts of equal size. For each tenth we built a classifier based on the 9 remaining parts only. All mismatches between the resulting classifier and the given
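The splitting scheme can be sketched as follows, assuming the rows of a matrix are given as a list; the generator below is our own illustration of the 10-fold scheme, not the published code:

```python
def ten_fold_splits(rows):
    """Split dataset rows into 10 folds; yield (train, test) pairs where
    each fold serves once as the test set and the remaining 9 parts as
    training data, mirroring the cross-validation used for Setup 1."""
    k = 10
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i
                 for row in fold]
        yield train, test

rows = list(range(100))
splits = list(ten_fold_splits(rows))
assert len(splits) == 10
for train, test in splits:
    assert len(test) == 10 and len(train) == 90
    assert sorted(train + test) == rows   # every row used exactly once
```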

All computations, for both the benchmarking and the cross-validation, were performed on a Linux AMD64 machine with a 2.83 GHz CPU and 32 GB of memory.

To test and evaluate our approach in application we present a case study with five datasets also considered by Mohammadi and colleagues, allowing a subsequent comparison with their results. We complement the case study with a performance analysis using synthetic data.

The following breast cancer datasets have been presented by Farazi et al. (

Breast cancer dataset description: overall number of samples, number of positive and negative samples, number of miRNAs taken into account and binarization threshold applied for data binarization.

Dataset | Samples | Positive | Negative | miRNAs | Threshold
---|---|---|---|---|---
All | 178 | 167 | 11 | 478 | 250
Triple- | 82 | 71 | 11 | 456 | 250
Her2+ | 86 | 75 | 11 | 438 | 1,250
ER+ Her− | 32 | 21 | 11 | 392 | 1,250
Cell Line | 17 | 6 | 11 | 375 | 50

We searched for classifiers for the breast cancer

For the combined Breast Cancer

Classifiers for Breast Cancer All data set.

The first classifier consists of only one gate and one input

As mentioned before, it is also worth considering whether one type of error is less desirable or entirely forbidden. Here, we present results of optimization where we do not allow false positives to occur. In this case, we find 6 optimal solutions. However, all of them are artifacts of the implementation: they are isomorphic to the classifier presented in Figure

Both classifiers presented in Figure

For the largest dataset we also performed a 3-fold cross-validation. The folds consist on average of 56 positive samples and only 4 negative samples. We divided the dataset into 3 almost equal subsets (subset sizes differ by at most 1 sample) without taking an even distribution of positive and negative samples between subsets into account. For all folds it was necessary to apply the constraints relaxation procedure. The classifiers result on average in classification with FN rate = 0.01 and FP rate = 0.56 (FN occurrence average = 1, FP occurrence average = 3). The results show that our method was able to classify the positive samples almost perfectly. The very high FP rate may be a result of the very imbalanced division of negative vs. positive samples. We address the influence of imbalanced datasets on the results in the discussion. The cross-validation resulted in 2 different classifiers: (¬ miR-144) and (¬ miR-10b) ∧ (¬ miR-193a-5p). Both

Classifiers for the breast cancer

For the breast cancer

The resulting classifier for the dataset

Classifier for the breast cancer

For the

Results for the breast cancer

For the Cell Line dataset the optimization results in six perfect classifiers, each consisting of only one gate and one input, where five of them are gates of Type 2 and only one is a gate of Type 1. That is, these classifiers distinguish cancerous from healthy samples based merely on the expression level of one miRNA (FN rate = 0.00, FP rate = 0.00). As an example we present a classifier with a negative input of miRNA miR-145 (see Figure

We evaluated our classifiers with the S_{AUC} and S_{m} scores. We ran the calculations keeping the same biochemical parameter sets and binarization thresholds for each dataset as proposed by Mohammadi et al., and report the resulting S_{AUC} and S_{m} values. Additionally, we calculated FN and FP rates for these circuits to compare the scores in the binary setting. Note that Mohammadi et al. (

Evaluation of breast cancer classifiers with scores: false negative rate, false positive rate, AUC, average margin and worst margin.

Dataset | Classifier | FN rate | FP rate | S_{AUC} | S_{m}
---|---|---|---|---|---
BC All | (¬ miR-378) | 0.02 | 0.27 | 0.96 | 0.16
Triple- | (¬ miR-378) ∧ (¬ miR-144) | 0.04 | 0.18 | 0.98 | 0.24
 | (miR-24-1) ∧ (¬ miR-378) | 0.04 | 0.18 | 0.99 | 0.25
Her2+ | (miR-21) ∧ (¬ miR-451-DICER1) ∧ (¬ miR-320-RNASEN) | 0.00 | 0.09 | 0.99 | 0.31
ER+ Her− | (miR-21) | 0.00 | 0.27 | 1.00 | 0.50
 | (miR-21) ∧ (¬ miR-320-RNASEN) | 0.00 | 0.18 | 0.96 | 0.14
Cell Line | (¬ miR-145) | 0.00 | 0.00 | 1.00 | 1.50
 | (¬ miR-143) | 0.00 | 0.00 | 1.00 | 1.16
 | (¬ miR-199a-2-5p) | 0.00 | 0.00 | 1.00 | 0.96
 | (¬ miR-451-DICER1) | 0.00 | 0.00 | 1.00 | 0.93
 | (¬ miR-146a) | 0.00 | 0.00 | 1.00 | 0.55
 | (¬ miR-425) | 0.00 | 0.00 | 1.00 | 0.32

For the Breast Cancer All dataset the classifier proposed by Mohammadi et al. scores S_{AUC} = 1.00, S_{m} = 0.40. Our one-input classifier for this dataset is shorter (three inputs less). The lower S_{m} shows that the margins are smaller, although the FN rate is noticeably improved. The cross-validation results for the same dataset for 3-fold cross-validation presented by Mohammadi et al. are: S_{AUC} = 0.99, S_{m} = 0.31. Our results for a 3-fold cross-validation (S_{AUC} = 0.93, S_{m} = 0.24) show that our method separated the new samples with a very similar accuracy. Note that the samples were divided into random subsets by us and by Mohammadi et al. (

In case of the Triple- dataset the final design is chosen by looking first at the highest S_{AUC} score and then (in case of equal values) at the highest S_{m}. Based on the same strategy we chose the second classifier (S_{AUC} = 1.00, S_{m} = 0.51). However, the error rates correspond to 7 errors in total. The classifier optimized with our approach is simpler (one input less). Thus, it could be easier to assemble. Additionally, we again reduced the overall number of errors in the binary setting.

For the Her2+ dataset the classifier proposed by Mohammadi et al. scores S_{AUC} = 1.00, S_{m} = 0.53. The lower S_{m} score of our classifier is probably related to the one outlying sample captured by the FP rate. Our classifier is also shorter (two inputs less). Thus, it could be easier to assemble in the laboratory. In this case it is worth considering which constraint is more important.

For the ER+ Her− dataset the classifier proposed by Mohammadi et al. scores S_{AUC} = 1.00, S_{m} = 0.65. Accuracy of our single-input classifier

Results for the

Lastly, in case of the Cell Line dataset the scores (S_{AUC} = 1.00, S_{m} = 1.71) are comparable.

In all cases we were able to find shorter classifiers and in most cases improve the accuracy of classification in the binary setting; otherwise, the accuracy is identical. In the continuous setting we obtained comparable results. However, the scores linking the Boolean to the continuous classifier are difficult to interpret. We address this problem in the discussion.

Two case studies cannot give a broad picture of the performance of an algorithm. In particular we were interested in the approximate number of samples and miRNAs at the breaking point when the ASP solver does not find a solution anymore. Clearly, the answer depends on a lot of parameters: How is the data generated? What are the constraints that specify feasible solutions? How long do we wait before a problem is deemed unsolvable?

Each considered dataset consists of a random 0-1 matrix and an annotation that specifies which rows of the matrix correspond to positive and which to negative samples. Here, we made use of two approaches to data generation (described in detail in section 2.7):

Results of the benchmarking. Black filled squares indicate that the time-out was reached. Squares outlined in black indicate that the infeasibility of the problem was proved within the time limit.

To obtain the benchmark in a reasonable amount of time we used time-outs between 10 min and 1 h. Figure

The problem of finding a feasible solution seems to increase equally with the sample and the miRNA dimension, see Figure

In Setup 2 we marked problems that were proven to have no solution with a black outline in the heat map. Interestingly, the likelihood that a solution exists increases with the number of miRNAs, see Figures

Overall, the two plots of Setup 1 show that only a small portion of problems may not be solved, even within the time limit of up to 15 min which could be well extended in practice. The plots of Setup 2 show that the infeasibility of constraints and data can also be decided in many cases within the 15 min.

Finally, we decided to benchmark the

Results of the scalability test. The plot is a benchmark for 50 samples with miRNA numbers ranging between 50 and 10,000 and a time-out of 30 min.

We see that the mean and standard deviation are both below 10 min, even for 10,000 miRNAs. Also, the number of problems that were not solvable within the time-limit remains below 20 for a bin size of 400 problems.

Note that for the tests we used computing power corresponding to that of a personal computer. We were able to find feasible solutions on a scale of minutes. For real-world applications one may certainly invest more time and computing power to obtain feasible, globally optimal solutions for large datasets.

Here, we present the results for the 10-fold cross-validation for Setup 1 described in the previous section. The plot in Figure

Results of the cross-validation. The results of the cross-validation for Setup 1.

The circular region with an increased error in the top right corner, and in particular, the 4 data points that were assigned an error rate of 1.0, are explained by the way we deal with time-outs during the cross validation. If no feasible classifier is found during a cross-validation we count the whole tenth as false predictions. The bright yellow dots are problems in which the time-out of 10 min was insufficient for each of the 10 calculations and hence everything was counted as misclassified.

The same is happening in Figure

The main goal of this study was to show the potential of Answer Set Programming for design problems in synthetic biology, in particular, in the context of miRNA-based classifier design. We created a multi-step workflow for classifier optimization, which makes it possible to obtain globally optimal perfect and imperfect (in case a perfect classifier does not exist) classifiers in a short time using the computing power of a personal computer. The constraints we employ, that is, the gate types and bounds on inputs and occurrences, reflect real-life requirements for practical circuit designs (Mohammadi et al.,

Five real-world case studies demonstrate that the ASP-based approach makes it possible to find shorter classifiers than heuristic methods (Mohammadi et al.,

Unfortunately, the criteria and scores in both settings, binary and continuous, are not easily comparable and cannot be intuitively interpreted together. Future work will aim at integrating the different aspects employed in choosing the optimal classifier in the optimization criterion used for scanning the design space. Beyond the notions already explored in this paper, we plan to furthermore integrate weights representing the assessment of data quality, sensitivity to data discretization and a preference for particular circuit building blocks to foster reusability of available molecular constructs.

Although we find globally optimal feasible solutions, the datasets used for the case study analysis were imbalanced: most of them consist of many positive samples and only a few negative samples. It is worth investigating whether the imbalance affects the results and employing additional statistical methods in a pre-processing step to decrease its possible influence.

The breast cancer datasets we considered here were pre-processed by Mohammadi et al. (

The benchmarks suggest that if a set of samples has a feasible solution then it can be found efficiently using ASP. That is, even for hundreds of samples and miRNAs, solutions may typically be obtained on the scale of minutes rather than hours or days using a personal computer. Thus, the benchmarks underline the feasibility of our approach for large datasets, especially in medical applications. For classification based on personalized miRNA profiles, similar to the case studies presented in this work, the ASP-based method seems adequate and does not require additional computational power.

We proposed a classifier design method which makes it possible to obtain globally optimal solutions in a short time. The method is flexible with respect to the given constraints resulting from the complexity of the biological problem. We also presented several possibilities to extend the presented tool in the future. However, classifier design remains a complex task demanding ongoing cooperation between the experimental and computational sides to achieve a compromise between biological requirements and computational possibilities.

Python scripts and pre-processed data sets used for case study analysis are available on GitHub:

The Potsdam Answer Set Solving Collection Potassco is available at:

KB, HK, and HS conceived the study. KB and HK implemented the code and prepared the first draft of the manuscript. HK designed and performed the simulated data analysis. MN extended the framework with the evaluation procedure, performed the case studies and wrote the final version of the manuscript in consultation with KB, HK, and HS. HS supervised the project.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We would like to thank Niko Beerenwinkel and Yaakov Benenson for the discussions in Basel and Pejman Mohammadi for clarifying technical details regarding his work on these synthetic circuit designs.

The Supplementary Material for this article can be found online at: