Metagenomics studies the DNA content of samples containing multiple microbes to better understand microbial diversity. Third-generation sequencing technologies have enabled the recovery of longer DNA sequences from these samples, but advanced computational methods are still needed to organize the sequences into genomic bins. This task is known as metagenomic binning.
Despite recent advances in sequencing technologies and assembly methods, obtaining high-quality microbial genomes from metagenomic samples is still not a trivial task. Current metagenomic binners do not take full advantage of assembly graphs and are not optimized for long-read assemblies. The assembly graph is constructed while combining overlapping sequencing reads into longer sequences (contigs).
Deep graph learning algorithms have been proposed in other fields to deal with complex graph data. These algorithms can learn features (embeddings) based on the graph structure, i.e., the features of each node are influenced by its neighbors. In the same way, the assembly graph could be integrated with contig features to obtain better genomic bins.
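To make the idea of neighbor-influenced features concrete, here is a minimal sketch of a single round of mean aggregation over a toy assembly graph. The contig names, feature values, and the `build_adjacency` and `aggregate` helpers are hypothetical and chosen only for illustration. Real graph learning models use learned weights and several layers, none of which is shown here.

```python
# Toy assembly graph: contigs are nodes, edges are overlaps between contigs.
# All names and numbers below are hypothetical illustration values.
edges = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
features = {
    "c1": [1.0, 0.0],
    "c2": [0.8, 0.2],
    "c3": [0.9, 0.1],
    "c4": [0.0, 1.0],
    "c5": [0.2, 0.8],
}


def build_adjacency(edges, nodes):
    """Return an undirected adjacency map {node: set(neighbors)}."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def aggregate(features, adj):
    """One round of mean aggregation over each node and its neighbors."""
    out = {}
    for node, vec in features.items():
        group = [vec] + [features[n] for n in sorted(adj[node])]
        # Average each feature dimension across the node and its neighbors.
        out[node] = [sum(col) / len(group) for col in zip(*group)]
    return out


adjacency = build_adjacency(edges, features)
embeddings = aggregate(features, adjacency)
```

After one round, contigs connected in the graph have more similar vectors than before, while the two disconnected components stay clearly separated.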
We propose a method called GraphMB, which uses a Graph Neural Network (GNN) to leverage the assembly graph in the binning process. We combine the GNN with a Variational AutoEncoder, inspired by VAMB (Nissen et al. 2021), which generates node-level features. The GNN generates graph-level features, which are combined with the node features to assign each node to a bin.
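As a rough illustration of how node-level and graph-level features might be combined for bin assignment, the sketch below concatenates two feature vectors per contig and assigns each contig to the nearest of a few seed centroids. This is an assumption made for illustration, not GraphMB's actual procedure: the feature values stand in for autoencoder and GNN outputs, and the nearest-centroid rule and the `combine` and `assign_bins` helpers are hypothetical.

```python
import math

# Hypothetical stand-ins for autoencoder output (per-contig features)
# and GNN output (graph-derived features).
node_features = {"c1": [1.0, 0.0], "c2": [0.9, 0.1], "c3": [0.0, 1.0]}
graph_features = {"c1": [0.5], "c2": [0.4], "c3": [-0.5]}


def combine(node_features, graph_features):
    """Concatenate the two feature vectors of every contig."""
    return {c: node_features[c] + graph_features[c] for c in node_features}


def assign_bins(vectors, centroids):
    """Assign each contig to the bin whose centroid is closest (Euclidean)."""
    bins = {}
    for contig, vec in vectors.items():
        bins[contig] = min(centroids, key=lambda b: math.dist(vec, centroids[b]))
    return bins


combined = combine(node_features, graph_features)
# Seed centroids picked by hand for the example (an assumption).
centroids = {"bin_A": combined["c1"], "bin_B": combined["c3"]}
bins = assign_bins(combined, centroids)
```

Here `c1` and `c2` fall into the same bin because they are close in both the node features and the graph features, while `c3` is assigned to the other bin.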
We evaluated GraphMB on long-read datasets of different complexities and compared its performance with that of other state-of-the-art binners in terms of the number of High Quality (HQ) genome bins obtained. With our approach, we obtained unique bins on all real datasets and more bins on most datasets.
The results indicate that a deep learning model combining local contig-specific information with global information about the assembly graph can improve metagenomic binning. We hypothesize that this approach can be further optimized by combining node learning, graph learning, and the subsequent clustering process into a continuous end-to-end process. Furthermore, other features, such as genome modifications, can in principle be easily integrated into our method.
GraphMB is available from https://github.com/MicrobialDarkMatter/GraphMB and the Bioinformatics journal paper is available here: https://doi.org/10.1093/bioinformatics/btac557