Variable depth forest: a more random random-forest for heterogeneous disease genetics

Bayat A, Wilson L, O'Brien AR, Szul P, Dunne R and Bauer DC

Transformational Bioinformatics Team, CSIRO.

Genome-wide association studies (GWAS) now often apply Random Forest because it can take interactions between genes into account. Random Forest is an ensemble technique that adds randomness to decision trees, which are individual predictive machine learning models that capture interactions between genomic loci (features). This randomness allows a larger solution space of associated loci to be evaluated in GWAS-style analyses. An important Random Forest parameter is the number of features evaluated at each node of each tree (mtry). This parameter directly controls the randomness in the model and substantially affects performance, because it can limit how exhaustively the solution space is explored. This is especially crucial for multi-gene diseases, where sets of features may acquire strong disease association only incrementally (deep trees) and therefore initially compete with individual features of moderate association (shallow trees). There have been efforts in the literature to find the optimal value of mtry. However, the optimal value depends strongly on the dataset and its characteristics. In our work, we propose a method in which the value of mtry varies during the training process. Thus, not all trees are built using the same mtry, which allows the creation of trees with diverse depths. The ensemble hence captures not only the strongest individual features but, importantly, also sets of features associated with the disease. Furthermore, we evaluate changing the value of mtry at the node level, which allows an even more comprehensive search of the solution space. We assess our approach on Bone Mineral Density (BMD) case/control datasets.
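
The idea of varying mtry can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The abstract does not say how mtry is varied, so drawing it uniformly at random from a small candidate set is an assumption made here for illustration, as are all function and parameter names (fit_variable_mtry_forest, mtry_choices, per_node) and the use of Gini impurity on integer-coded genotypes. With per_node=False one mtry is drawn per tree; with per_node=True a fresh mtry is drawn at every node.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y, rows, features):
    """Best (feature, threshold) among `features`, or None if no split helps."""
    best, best_score = None, gini([y[i] for i in rows])
    for f in features:
        # Every observed value except the largest is a candidate threshold.
        for t in sorted({X[i][f] for i in rows})[:-1]:
            left = [y[i] for i in rows if X[i][f] <= t]
            right = [y[i] for i in rows if X[i][f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score - 1e-12:
                best, best_score = (f, t), score
    return best


def build_tree(X, y, rows, draw_mtry, rng):
    """Grow a tree; draw_mtry() gives the number of features tried at this node."""
    labels = [y[i] for i in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:
        return majority  # pure node becomes a leaf
    n_features = len(X[0])
    candidates = rng.sample(range(n_features), min(draw_mtry(), n_features))
    split = best_split(X, y, rows, candidates)
    if split is None:
        return majority  # no candidate feature reduces impurity
    f, t = split
    left = [i for i in rows if X[i][f] <= t]
    right = [i for i in rows if X[i][f] > t]
    return (f, t,
            build_tree(X, y, left, draw_mtry, rng),
            build_tree(X, y, right, draw_mtry, rng))


def fit_variable_mtry_forest(X, y, n_trees=25, mtry_choices=None,
                             per_node=False, seed=0):
    """Fit a forest in which mtry varies per tree (or per node if per_node=True).

    ASSUMPTION: mtry is drawn uniformly from `mtry_choices`; the source
    abstract does not specify the schedule.
    """
    rng = random.Random(seed)
    n_features = len(X[0])
    if mtry_choices is None:
        mtry_choices = sorted({1, max(1, int(n_features ** 0.5)),
                               max(1, n_features // 2), n_features})
    forest = []
    for _ in range(n_trees):
        if per_node:
            tree_mtry = None  # a fresh value is drawn at every node
            draw = lambda: rng.choice(mtry_choices)
        else:
            tree_mtry = rng.choice(mtry_choices)  # fixed for this tree
            draw = lambda m=tree_mtry: m
        rows = [rng.randrange(len(X)) for _ in X]  # bootstrap sample
        forest.append((tree_mtry, build_tree(X, y, rows, draw, rng)))
    return forest


def predict_tree(node, x):
    """Follow splits until a leaf (a non-tuple class label) is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if x[f] <= t else right
    return node


def predict_forest(forest, x):
    """Majority vote over all trees."""
    votes = Counter(predict_tree(tree, x) for _, tree in forest)
    return votes.most_common(1)[0][0]


def tree_depth(node):
    """Depth of a tree; a single leaf has depth 0."""
    if not isinstance(node, tuple):
        return 0
    return 1 + max(tree_depth(node[2]), tree_depth(node[3]))
```

Calling fit_variable_mtry_forest(X, y, mtry_choices=[1, 2, 3]) on a list of genotype rows X (for example, values 0/1/2 per locus) and case/control labels y returns a list of (mtry, tree) pairs. Comparing tree_depth across trees grown with different mtry values is one simple way to inspect the depth diversity the abstract refers to.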