Modeling

Once properties have been selected and testing data defined, models can be executed to generate predictions.

The modeling tab, presented in Figure 7, offers various modeling and visualization options. First, users can select the model to use from the dropdown menu at the bottom left of the tab. The two options are Knowledge Driven and Random Forest. Users can click on Run to start the modeling process. The validation score (accuracy on the validation dataset) will be displayed to the right of the model selection dropdown menu. Three visualization options are available: Feature Importance, Confusion Matrix, and ROC Curve. Users can click on the corresponding button to display the visualization. Finally, by clicking on the Export to GA menu, users can export all computed results to GA.

_images/modeling_tab.png

Figure 7 The ‘modeling tab’ and its different options.

Knowledge-Driven

Users can select the “Knowledge Driven” option from the dropdown menu. This option enables them to set minimum and maximum threshold values and a weight for each selected property, allowing them to construct a model based on their geological knowledge and compare it with a data-driven approach, as shown in Figure 8. The index overlay score is computed for every point with the following formula, where each property contributes its weight when its value lies strictly between the thresholds and 0 otherwise:

\[\text{score} = \sum_{i=1}^{n} \mathbf{1}\left( \min_{i} < \text{property}_{i} < \max_{i} \right) \times \text{weight}_{i}\]
_images/index_overlay.png

Figure 8 “Index Overlay” panel allows users to define the minimum and maximum threshold values and a weight for each selected property.
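As an illustration of how such an index overlay can be evaluated, the following sketch computes the score for every point with NumPy. The property values, thresholds, and weights are hypothetical placeholders, not the tool’s internal implementation.

```python
import numpy as np

# Hypothetical example: three properties sampled at 5 points each.
properties = {
    "magnetic_susceptibility": np.array([0.2, 0.8, 0.5, 0.1, 0.9]),
    "density":                 np.array([2.6, 2.9, 2.7, 2.5, 3.0]),
    "distance_to_fault":       np.array([50., 400., 120., 800., 30.]),
}

# User-defined (min, max, weight) per property, as in the Index Overlay panel.
rules = {
    "magnetic_susceptibility": (0.3, 1.0, 2.0),
    "density":                 (2.65, 2.95, 1.0),
    "distance_to_fault":       (0.0, 200.0, 3.0),
}

# Sum of weight_i for each property whose value lies between min_i and max_i.
score = np.zeros(5)
for name, (vmin, vmax, weight) in rules.items():
    inside = (properties[name] > vmin) & (properties[name] < vmax)
    score += inside * weight

print(score)  # one index overlay score per point
```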

Random Forest

The second modeling option employs a Random Forest approach. This model is trained on the balanced data obtained from the Train-Test-Split panel and tested on the testing data.

The model optimization uses the grid search approach, leveraging the principle of cross-validation to identify the best parameters. Cross-validation tests various parameter combinations using different subsets of the training data. To address the issue of spatial correlation, the distinct groups defined in the Train-Test-Split tab are used sequentially in the validation phase. This strategy ensures that parameter selection is not compromised by overfitting due to spatial correlation.

The parameter combination that yields the best results is then applied to recompute the algorithm over the entire training dataset. Upon training the model, a score and interpretative figures are generated and made available to users.
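A minimal sketch of this optimization with scikit-learn is shown below, assuming the balanced training features, labels, and spatial group identifiers come from the Train-Test-Split step. The parameter grid and the synthetic arrays are illustrative, not the tool’s actual search space.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

# Hypothetical balanced training data: X_train (features), y_train (0/1 labels),
# and groups (one spatial-group id per sample, from the Train-Test-Split tab).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = rng.integers(0, 2, size=200)
groups = rng.integers(0, 5, size=200)  # 5 spatial groups

# Each cross-validation fold holds out whole spatial groups, so validation
# points are never spatially correlated with the training points.
cv = GroupKFold(n_splits=5)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train, groups=groups)

# The best parameter combination is refit on the entire training dataset.
model = search.best_estimator_
print(search.best_params_, search.best_score_)
```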

For both methods, a score representing the accuracy is provided:

- A perfect classification results in a score of 1.
- A random classification yields a score of 0.5.
- The worst classification results in a score of 0.

The accuracy is obtained by the following formula:

\[\text{Score} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}\]
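As a small worked example, with five hypothetical test labels of which the model predicts four correctly:

```python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0]   # hypothetical test labels
y_pred = [1, 0, 0, 1, 0]   # hypothetical model predictions
print(accuracy_score(y_true, y_pred))  # 4 correct / 5 total = 0.8
```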

The score, calculated solely based on the testing dataset, serves as an indicator and does not necessarily reflect the model’s relevance. Specifically, a model scoring near 1 may be indicative of overfitting rather than genuinely predictive of new areas of interest.

Interpretation Figures

The modeling process generates three figures to help users interpret the model’s performance and behavior: Feature Importance, Confusion Matrix, and ROC Curve. These figures are available for both the Knowledge-Driven and Random Forest models.

Feature Importance

The first figure presented is the feature importance plot. This figure is generated by permutation importance: the values of one property are randomly shuffled within the testing dataset, and the impact on the model’s accuracy is measured. A decrease in accuracy, shown as a positive score for the feature in the figure, indicates that the property is crucial for prediction. If the shuffle does not affect the accuracy, the property is likely not used by the model, resulting in a score close to zero. An increase in accuracy implies the property is used by the model but that the correlation learned from the training dataset is not present in the testing dataset, leading to a negative score. This process is repeated several times for each property to obtain a distribution of scores, which is displayed in a box plot.

“Feature Importance” panel displays the distribution of scores for each property.

This figure offers geologists insights into which properties are impactful for the model’s predictions and which might introduce biases regarding the validation set. However, it is important to interpret these figures as indicative rather than conclusive. The relevance of a property is determined by its contribution to the model, which does not always align with its actual significance in identifying positive points in the real world. A property being classified as unused, positively important, or negatively impacting the model does not definitively dictate its overall relevance or irrelevance in identifying real-world targets. Moreover, even if a model relies on a property, removing it can lead to another solution with a better score. Conversely, removing a property with a low score can decrease the model’s performance. Therefore, the best properties must be selected through a trial-and-error process.
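For reference, the sketch below reproduces the permutation-importance procedure with scikit-learn on a small synthetic dataset. The model, test arrays, and property names are hypothetical stand-ins, not the tool’s internal code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Hypothetical stand-ins for the fitted model and held-out test set.
rng = np.random.default_rng(0)
X_test = rng.normal(size=(100, 3))
y_test = (X_test[:, 0] > 0).astype(int)   # only property_A matters here
feature_names = ["property_A", "property_B", "property_C"]
model = RandomForestClassifier(random_state=0).fit(X_test, y_test)

# Shuffle one property at a time, n_repeats times, and record the
# resulting change in accuracy on the testing dataset.
result = permutation_importance(model, X_test, y_test,
                                scoring="accuracy", n_repeats=10,
                                random_state=0)

# One distribution of scores per property, displayed as a box plot.
plt.boxplot(result.importances.T, labels=feature_names, vert=False)
plt.xlabel("Decrease in accuracy after shuffling")
plt.show()
```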

Confusion Matrix

The second figure is a confusion matrix, a representation used in machine learning to visualize the performance of an algorithm. It is a table that allows the user to see the frequency of correct and incorrect predictions made by the model, classified into four categories: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). True positives and true negatives represent the instances correctly identified by the model as positive and negative, respectively. In contrast, false positives are negative instances incorrectly identified as positive, and false negatives are positive instances incorrectly marked as negative. This matrix provides an insightful snapshot of the model’s accuracy, helping to pinpoint where it performs well and where it may require adjustments.

“Confusion Matrix” panel displays the frequency of correct and incorrect predictions made by the model.
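The sketch below shows how such a matrix is laid out in scikit-learn, using hypothetical label and prediction arrays.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical test labels
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # hypothetical model predictions

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```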

The confusion matrix is specifically applied to the test data that was set aside earlier in the Train-Test-Split panel. This ensures that the matrix reflects the model’s performance on unseen data, providing a more accurate representation of its predictive capabilities in real-world scenarios. However, applying a confusion matrix to a test set that is not spatially correlated and has been balanced with undersampling may not fully capture the model’s performance in real-world, spatially complex scenarios, potentially skewing its perceived effectiveness.

Analyzing the matrix can reveal important behavioral patterns of the model. For example, a model with more false negatives than false positives might be overly cautious, potentially missing out on identifying positive cases. Conversely, a model with more false positives might be too aggressive, leading to the identification of too many instances as positive, including those that are not.

ROC Curve

The True Positive Rate (TPR) and False Positive Rate (FPR) are used to evaluate the performance of classification models. The TPR, also known as sensitivity or recall, is defined by the formula:

\[TPR = \frac{TP}{TP + FN}\]

where TP represents the number of true positives, and FN represents the number of false negatives. This measure indicates the proportion of actual positives correctly identified by the model. Conversely, the FPR is defined by the formula:

\[FPR = \frac{FP}{FP + TN}\]

where FP is the number of false positives, and TN is the number of true negatives. The FPR measures the proportion of actual negatives that are incorrectly classified as positives by the model.

The output of a Random Forest model, like many classification models, is a probability ranging from 0 to 1. By setting a threshold that varies from 0 to 1 to perform the classification, one can observe the model’s behavior: at lower thresholds, the model may classify more instances as positive, potentially increasing both the number of true positives (thus increasing TPR) and the number of false positives (thus increasing FPR). As the threshold increases, the model becomes more stringent, possibly reducing both TPR and FPR. This variability highlights the trade-off between capturing as many positives as possible while minimizing the misclassification of negatives.

A ROC curve, as displayed below, is a graphical representation of the TPR against the FPR across different thresholds. The curve is generated by plotting the TPR on the y-axis and the FPR on the x-axis, with each point on the curve corresponding to a different threshold, decreasing from 1 at the bottom-left corner to 0 at the top-right corner.

“ROC Curve” panel displays the trade-off between TPR and FPR across different thresholds.

In the context of the ROC curve, a “perfect” model would exhibit a scenario where the curve shoots straight up to the top-left corner, indicating a TPR of 1 (or 100%) and an FPR of 0 simultaneously. Such an outcome suggests that the model correctly identifies all positives without any false positives. However, this ideal scenario might also hint at overfitting.

The ROC curve visually represents the trade-off between TPR and FPR across different thresholds, providing users with a powerful tool to assess a model’s diagnostic ability. By examining the curve, users can identify the model’s performance across all possible thresholds, aiming to choose a threshold that balances sensitivity and specificity. The area under the ROC curve (AUC) gives an aggregate measure of performance across all possible classification thresholds. A higher AUC indicates a model with better discrimination capability, able to differentiate between the positive and negative classes effectively. This value is calculated for the user and is annotated in the top left of the ROC Curve plot. Through the ROC curve, users gain insights into the model’s predictive accuracy and its potential for practical application, ensuring informed decision-making in selecting the optimal threshold for classification.
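A minimal sketch of how such a curve and its AUC can be computed with scikit-learn is shown below; the test labels and predicted probabilities are hypothetical, not output from the tool.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Hypothetical test labels and predicted probabilities from the model.
y_test = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_proba = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])

# Sweep the classification threshold and record the (FPR, TPR) pairs.
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)  # area under the ROC curve

plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")  # random-classifier baseline
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```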