Split Train and Test Data
=========================

Why split the data?
-------------------

`Spatial auto-correlation`_ refers to the tendency of points that are close to each other in space to exhibit similar values.
In machine learning applied to geoscience, it complicates model development: spatial correlation between the training and testing datasets can artificially inflate performance metrics.
It also encourages overfitting during parameter tuning, because searching for good parameters in a spatially correlated dataset tends to select parameters that fit local patterns.
The model then focuses on specific combinations of values in the data space rather than on global statistical trends, and it does not generalize well to new data.

To address this challenge and reduce the spatial correlation between the training and testing data, a clustering algorithm (`DBSCAN`_) is used to segregate the positive data into distinct clusters based on their spatial coordinates.
Points end up in different clusters when they are separated by more than a user-defined distance.
This separation process, which may take several minutes due to its reliance on a background machine learning model, is crucial for mitigating the risk of overfitting.

Given the common issue of `imbalanced`_ positive (mineralized) and negative points, the most represented class (typically the negative) is resampled.
This resampling, conducted with a `K-means`_ algorithm over the selected properties, aims to ensure a representative sampling of the negative data.

Train Test Split Tab
--------------------

The "Train Test Split" tab is divided into two sections: one on the left, to select the training and testing data, and one on the right, to visualize the clusters generated by the DBSCAN algorithm.

In the left section, users can select a **distance** value (in meters, see :numref:`figure_train_test_split`).
This distance represents the maximum distance between two neighboring points for them to be considered part of the same cluster.
Each created cluster has an identifier.
Users can also transfer clusters between the training and testing datasets using the provided arrows, as shown in :numref:`figure_train_test_split`.
The percentage of the data that will be used for testing is displayed above.

.. _figure_train_test_split:

.. figure:: ./images/train_test_split/split_options.png
    :align: center
    :scale: 70%

    *The interface to select the training and the testing dataset.*

The right section of the "Train Test Split" tab displays the clusters generated by the DBSCAN algorithm.
In this visualization, negative points are represented by circles and positive points by diamonds.
Training points are outlined (empty), whereas testing points are filled (solid).
Each cluster is distinguished by a unique color.

.. raw:: html
    :file: ./images/train_test_split/train_test_split.html

*Train Test Split panel.*

How to select the clusters?
---------------------------

Selecting clusters requires both technical and geological considerations.

On the technical side, users should keep a limited number of clusters in the training dataset (around 3-5), which together should represent about 70-80% of the dataset.
Selecting more clusters may lead to slower training.
Each training cluster must also be statistically representative on its own, so avoid selecting clusters that are too small, as they may not accurately represent the overall data.
In the validation set, however, all clusters are considered as one.

On the geological side, a good selection of the testing data relies on geological insight.
The testing set should reflect the characteristics of the new zones that users anticipate discovering.

.. _Spatial auto-correlation: https://medium.com/locale-ai/spatial-autocorrelation-how-spatial-objects-affect-other-nearby-spatial-objects-e05fa7d43de8
.. _DBSCAN: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
.. _K-means: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
.. _imbalanced: https://medium.com/@okanyenigun/handling-class-imbalance-in-machine-learning-cb1473e825ce
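
For readers who want to reproduce the idea outside the interface, the clustering of positive points, the cluster-wise split, and the resampling of negatives described on this page can be sketched with scikit-learn.
This is only an illustrative sketch and not necessarily what the application does internally: the function names, the synthetic data, the choice of ``min_samples=1``, and the rule of keeping the negative sample nearest to each K-means centroid are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans


def cluster_positives(coords, distance):
    """Label positive points so that neighbors within `distance` share a cluster."""
    # Assumption: min_samples=1, so every point joins a cluster and none is noise.
    return DBSCAN(eps=distance, min_samples=1).fit_predict(coords)


def split_by_cluster(labels, test_clusters):
    """Return boolean train/test masks; a whole cluster goes to one side only."""
    test_mask = np.isin(labels, list(test_clusters))
    return ~test_mask, test_mask


def undersample_negatives(features, n_keep, seed=0):
    """Keep the negative sample closest to each of `n_keep` K-means centroids."""
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=seed).fit(features)
    # transform() returns the distance from every sample to every centroid.
    nearest = km.transform(features).argmin(axis=0)
    # Two centroids may share the same nearest sample, hence the de-duplication.
    return np.unique(nearest)


# Three synthetic groups of positive points about 1 km apart (coordinates in meters).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [1000.0, 0.0], [0.0, 1000.0]])
positives = np.vstack([c + rng.uniform(-20, 20, size=(10, 2)) for c in centers])

labels = cluster_positives(positives, distance=100.0)
train_mask, test_mask = split_by_cluster(labels, test_clusters={labels[-1]})
print(f"{len(set(labels))} clusters, {test_mask.mean():.0%} of positives held out")

# Hypothetical property values for the over-represented negative class.
negatives = rng.normal(size=(200, 3))
kept = undersample_negatives(negatives, n_keep=30)
print(f"{len(kept)} negatives kept out of {len(negatives)}")
```

Because the split is made on cluster identifiers rather than on individual points, no cluster contributes to both the training and the testing data, which is the property the distance setting is meant to guarantee.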