Rather than fitting a single model of soil thickness, we fitted three separate models:
Model 1. Predicting the occurrence of rock outcrops.
Model 2. Predicting the thickness of soils within the 0-2 m range.
Model 3. Predicting the occurrence of deep soils (soils greater than 2 m thick).
Models 1 and 3 used the categorical variant of the Ranger RF model. Each required a preliminary relabelling of the observations: for Model 1, observations deemed to be rock outcrops were distinguished from soils; for Model 3, soils less than 2 m thick (excluding rock outcrops) were distinguished from soils greater than 2 m thick. Both Models 1 and 3 were therefore binary categorical models. For each model, 50 repeats of 5-fold cross-validation (CV) of the Ranger RF model were run.
Model 2 used the regression form of the random forest model. After removing the observations regarded as rock outcrops and as soils greater than 2 m thick from the total data set, 111,302 observations remained. Of these, 67,698 had explicitly defined soil thickness values. The remaining 43,604 were right-censored and were treated as follows. For each repeated 5-fold iteration, before the data were split into calibration and validation sets, values were drawn at random from a beta distribution, one for each of the 43,604 censored observations. Each value (between 0 and 1) was multiplied by the censored soil thickness and the product was added to that same censored thickness, giving a simulated pseudo-soil thickness between one and two times the censored value. Once the simulated data were combined with the actual soil thickness data, the values were square-root transformed to approximate a normal distribution. Ranger RF modelling proceeded after optimising the hyperparameter settings as described above for the categorical modelling. As with the categorical modelling, 50 repeated 5-fold CV iterations were computed.
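The treatment of the right-censored observations can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the R code used in this work. The function names and the beta shape parameters `alpha` and `beta` are hypothetical; the shape parameters of the beta distribution are not stated here, so the defaults below are placeholders.

```python
import math
import random


def simulate_pseudo_thickness(censored_cm, alpha=2.0, beta=5.0, rng=None):
    """Simulate a pseudo-thickness for each right-censored observation.

    alpha and beta are placeholder shape parameters (an assumption);
    the text does not give the values that were actually used.
    """
    rng = rng or random.Random()
    # Draw b in [0, 1] from a beta distribution and return c + b * c,
    # so each simulated value lies between c and 2c.
    return [c + rng.betavariate(alpha, beta) * c for c in censored_cm]


def sqrt_transform(values):
    """Square-root transform applied before regression modelling."""
    return [math.sqrt(v) for v in values]
```

In a workflow of this kind, a fresh set of draws would be made for every repeated iteration before the data are split, so that each iteration sees a different realisation of the censored values.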
All three models were integrated via a simple ‘if-then’ pixel-based procedure. At each pixel, if Model 1 indicated the presence of rock outcrop in 45 or more of the 50 resampling iterations (90%), the pixel was classed as rock outcrop, with an effective soil thickness of 0 cm. The same rule was applied to Model 3, the model predicting deep soils (soils >2 m thick). In no case did Models 1 and 3 both predict in the positive in 90% or more of iterations at the same pixel. Where neither Model 1 nor Model 3 predicted in the positive in at least 90% of iterations, the prediction outputs of Model 2 were used.
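The integration rule for a single pixel can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the implementation used in this work. The function name, the class labels, and the use of a single summary value `model2_cm` for the Model 2 output are all hypothetical. No numeric thickness is assigned to the deep soil class because none is specified in the text.

```python
def integrate_pixel(rock_votes, deep_votes, model2_cm, min_votes=45):
    """Combine the three model outputs at one pixel.

    rock_votes: number of the 50 iterations in which Model 1 predicted rock outcrop.
    deep_votes: number of the 50 iterations in which Model 3 predicted deep soil.
    model2_cm:  Model 2 thickness estimate in cm (hypothetical summary value).
    min_votes:  45 of 50 iterations, i.e. the 90% rule described in the text.
    """
    # Rock outcrop is checked first. The order only matters if both models
    # reach the threshold together, which is reported never to have occurred.
    if rock_votes >= min_votes:
        return ("rock_outcrop", 0.0)
    if deep_votes >= min_votes:
        # Deep soil (>2 m); no numeric value is assumed here.
        return ("deep_soil", None)
    return ("soil", model2_cm)
```

For example, `integrate_pixel(47, 0, 80.0)` yields the rock outcrop class with a thickness of 0 cm, while `integrate_pixel(10, 3, 80.0)` falls through to the Model 2 estimate.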
After model integration, we derived a set of soil thickness exceedance probability maps. These were derived empirically at each pixel by tallying the number of the 50 iterations in which the estimated soil depth exceeded each threshold depth (10 cm, 50 cm, 100 cm and 150 cm). Each tally was divided by 50 to give the exceedance probability for that threshold depth.
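The exceedance calculation for one pixel amounts to a count divided by the number of iterations. A minimal Python sketch is given below for illustration. The function name is hypothetical, and the input is assumed to be the list of 50 integrated thickness estimates (in cm) for that pixel.

```python
def exceedance_probabilities(predictions_cm, thresholds_cm=(10, 50, 100, 150)):
    """Empirical probability that thickness exceeds each threshold at one pixel."""
    n = len(predictions_cm)  # 50 in the procedure described above
    return {
        t: sum(1 for p in predictions_cm if p > t) / n
        for t in thresholds_cm
    }
```

If, say, 30 of the 50 estimates at a pixel were 120 cm and the other 20 were 5 cm, the exceedance probability would be 0.6 for the 10 cm, 50 cm and 100 cm thresholds and 0 for the 150 cm threshold.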
All processing for the generation of these products was undertaken using the R programming language (R Core Team, 2020).