Integrating and refining the predictions of ensembled plant effector detection programs using machine learning
Effectors are pathogen proteins that facilitate infection by manipulating plant immunity. Computational programs have been developed that identify effectors from sequence data. Many of these programs rely on internal models with unavoidable biases that arise from their training processes and from the diverse nature of effector sequence, function and phylogeny. Each program's ability to predict effectors across a broad range of plant pathogens is therefore limited. We hypothesised that a meta-predictor constructed using machine learning (ML) approaches could integrate predictions from multiple programs and predict effectors from bacteria, fungi, and oomycetes more accurately. We trained a range of classifiers using classical ML approaches and deep neural networks (DNNs), then selected eight for evaluation: Random Forest (RF), Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), and five DNNs. The training, test and validation data were carefully curated from effector and non-effector annotated sequences derived from the training and sample data of six programs: EffectorP 3.0, deepredeff, WideEffHunter, EffectorO, EffectiveT3, and T3SEpp. When compared with existing programs on a test dataset, our models performed better. The best-performing model was a DNN (Model_2) that balanced improved sensitivity with specificity across the three taxa. SHAP (SHapley Additive exPlanations) analysis showed that all features contributed to the output of Model_2, which may explain its superior performance. The DNN was developed into a package, fimep, to allow easy use of our model.
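To make the meta-predictor idea concrete, the following is a minimal sketch rather than the fimep implementation. It assumes that each of the six programs yields one score in [0, 1] per protein, it uses synthetic scores in place of real program output, and it substitutes a plain logistic regression trained by gradient descent for the RF, SVM, XGBoost, and DNN models evaluated in this work. All function names, the feature order, and the hyperparameters are hypothetical.

```python
import math
import random

# Assumed feature order: one score in [0, 1] per upstream program.
PROGRAMS = ["EffectorP 3.0", "deepredeff", "WideEffHunter",
            "EffectorO", "EffectiveT3", "T3SEpp"]


def make_synthetic_scores(n, seed=0):
    """Generate toy per-program scores; real inputs would be program outputs."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        label = rng.randint(0, 1)  # 1 = effector, 0 = non-effector
        # Each toy "program" is noisy: effectors centre near 0.65, others near 0.35.
        centre = 0.65 if label else 0.35
        row = [min(1.0, max(0.0, rng.gauss(centre, 0.25))) for _ in PROGRAMS]
        rows.append(row)
        labels.append(label)
    return rows, labels


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, row):
    """Combine the per-program scores into one effector probability."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))


def train_meta_predictor(rows, labels, epochs=300, lr=0.5):
    """Fit logistic-regression weights over the program scores (batch gradient descent)."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, label in zip(rows, labels):
            err = predict_proba(weights, bias, row) - label
            for j, x in enumerate(row):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def accuracy(weights, bias, rows, labels):
    """Fraction of proteins classified correctly at a 0.5 threshold."""
    hits = sum((predict_proba(weights, bias, r) >= 0.5) == bool(y)
               for r, y in zip(rows, labels))
    return hits / len(rows)


# Train on one synthetic set and evaluate on a separately seeded one.
train_rows, train_labels = make_synthetic_scores(400, seed=0)
test_rows, test_labels = make_synthetic_scores(200, seed=1)
weights, bias = train_meta_predictor(train_rows, train_labels)
meta_accuracy = accuracy(weights, bias, test_rows, test_labels)
```

In this toy setting the learned weights indicate how strongly the combined prediction relies on each program, which is loosely analogous to, but much simpler than, the SHAP analysis described above.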