Multiple-instance learning of somatic mutations for the classification of tumour type and the prediction of microsatellite status


Large-scale genomic data are well suited to analysis by deep learning algorithms. However, for many genomic datasets, labels are at the level of the sample rather than for individual genomic measures. Machine learning models leveraging these datasets generate predictions by using statically encoded measures that are then aggregated at the sample level. Here we show that a single weakly supervised end-to-end multiple-instance-learning model with multi-headed attention can be trained to encode and aggregate the local sequence context or genomic position of somatic mutations, hence allowing for the modelling of the importance of individual measures for sample-level classification and thus providing enhanced explainability. The model solves synthetic tasks that conventional models fail at, and achieves best-in-class performance for the classification of tumour type and for predicting microsatellite status. By improving the performance of tasks that require aggregate information from genomic datasets, multiple-instance deep learning may generate biological insight.

Nature Biomedical Engineering