Split Train and Test Data
=========================

Why split the data?
-------------------

`Spatial auto-correlation`_ refers to the tendency of points that are close to each other in space to exhibit similar values. In machine learning applied to geoscience, this complicates model development: spatial correlation between the training and testing datasets can artificially inflate performance metrics and encourage overfitting during parameter selection. Searching for good parameters on a spatially correlated dataset tends to select parameters that overfit. The model then focuses on specific combinations of values in the data space rather than on global statistical trends, and does not generalize well to new data.

To address the challenge of spatial correlation between training and testing data, a clustering strategy is applied to divide the positive points into spatially distinct groups based on their coordinates. This process combines `t-SNE`_ (t-distributed Stochastic Neighbour Embedding) for spatial regrouping, followed by `K-means`_ clustering to define clusters in the re-projected space. This spatial separation is essential to minimize overfitting and improve model generalization.
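
The two-step clustering described above can be illustrated with scikit-learn. This is only a minimal sketch under stated assumptions, not the application's actual implementation: the function name ``cluster_positive_points``, the synthetic coordinates, the number of clusters, and the perplexity heuristic are all hypothetical choices made for the example.

.. code-block:: python

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.manifold import TSNE


    def cluster_positive_points(coords, n_clusters=5, random_state=0):
        """Return one cluster label per point (hypothetical helper)."""
        coords = np.asarray(coords, dtype=float)
        # t-SNE requires perplexity < number of samples; this cap is an
        # illustrative heuristic, not a value taken from the application.
        perplexity = min(30.0, (len(coords) - 1) / 3.0)
        embedding = TSNE(
            n_components=2,
            perplexity=perplexity,
            init="random",
            random_state=random_state,
        ).fit_transform(coords)
        # K-means defines the clusters in the re-projected (embedded) space.
        kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
        return kmeans.fit_predict(embedding)


    # Synthetic example: three spatially separated groups of positive points.
    rng = np.random.default_rng(0)
    centres = np.array([[0.0, 0.0], [50.0, 0.0], [0.0, 50.0]])
    coords = np.vstack([c + rng.normal(scale=2.0, size=(40, 2)) for c in centres])
    labels = cluster_positive_points(coords, n_clusters=3)

Each entry of ``labels`` is the cluster identifier of the corresponding point; whole clusters, rather than individual points, are then assigned to the training or testing dataset.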

Users also have the option to define clustering groups manually during the `target selection`_ step, by selecting the properties used for clustering. This flexibility allows users to tailor the clustering process to their specific geological or analytical requirements.

.. _target selection: target_selection.rst

Train Test Split Tab
--------------------

The "Train Test Split" tab is divided into two sections: one on the left, to select the training and testing data, and one on the right, to visualize the generated clusters.

In the left section, users can select several **Groups** values (see :numref:`figure_train_test_split`). Each created group has an identifier. Users can also transfer clusters between the training and testing datasets using the provided arrows. The percentage of the data that will be used for testing is displayed above.

.. _figure_train_test_split:

.. figure:: ./images/train_test_split/split_options.png
    :align: center
    :scale: 70%

    *The interface to select the training and the testing dataset.*


The "Train Test Split" tab also allows visualization of the clusters generated (or previously selected). In this visualization, only positive points are represented by diamonds. Training points are outlined (empty), whereas testing points are filled (solid). Each cluster is distinguished by a unique colour.


.. figure:: ./images/train_test_split/train_test_split.png
    :align: center
    :scale: 40%

    *Train Test Split panel.*

How to select the clusters?
---------------------------

Selecting clusters requires both technical and geological considerations.

Users should select a limited number of clusters for the training dataset (around 3 to 5), which together should represent about 70-80% of the dataset. Selecting more clusters may slow down training. Each training cluster must also be statistically representative on its own, so avoid selecting clusters that are too small, as they may not accurately represent the overall data. In the validation set, however, all clusters are considered as one.
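
The guidelines above can be checked numerically before training. The following is a rough sketch, assuming one cluster label per positive point is available as an array; the helper ``summarize_split`` and the 5% threshold used to flag small clusters are hypothetical and are not part of the application.

.. code-block:: python

    import numpy as np


    def summarize_split(labels, train_clusters, min_cluster_fraction=0.05):
        """Return (train fraction, test fraction, small training clusters)."""
        labels = np.asarray(labels)
        ids, counts = np.unique(labels, return_counts=True)
        # Share of the whole dataset held by each cluster.
        fractions = dict(zip(ids.tolist(), (counts / labels.size).tolist()))
        train_fraction = sum(fractions[c] for c in train_clusters)
        # Flag training clusters below the (assumed) minimum share.
        small = [c for c in train_clusters if fractions[c] < min_cluster_fraction]
        return train_fraction, 1.0 - train_fraction, small


    # Five clusters holding 30, 25, 20, 22 and 3 points respectively.
    labels = np.repeat([0, 1, 2, 3, 4], [30, 25, 20, 22, 3])
    train_fraction, test_fraction, small = summarize_split(
        labels, train_clusters=[0, 1, 2, 4]
    )

In this example the training clusters hold 78 of the 100 points, which falls within the suggested 70-80% range, but cluster ``4`` is flagged because it contains only 3% of the data.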

There are also geological considerations to take into account when selecting the clusters. A good selection of the testing data relies on geological insight: the testing set should reflect the characteristics of the new zones that users anticipate discovering.


.. _Spatial auto-correlation: https://medium.com/locale-ai/spatial-autocorrelation-how-spatial-objects-affect-other-nearby-spatial-objects-e05fa7d43de8

.. _t-SNE: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

.. _K-means: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html