The rhizosphere, the region of soil surrounding plant roots, is inhabited by a unique population of plant growth-promoting rhizobacteria (PGPR). Many important PGPR, as well as plant pathogens, belong to the genus Pseudomonas. However, the separation of beneficial and pathogenic strains remains uncertain, as earlier work suggested that genomic characteristics have only a limited ability to distinguish them. Here, we use Genome Properties (GP), an annotation system for common biological pathways, together with machine learning (ML) to establish the relationship between genome-wide GP composition and the plant-associated lifestyles of 91 Pseudomonas strains isolated from the rhizosphere and phyllosphere, representing both plant-associated phenotypes. GP enrichment analysis, Random Forest model fitting, and feature selection revealed 28 discriminating features. A test set of 75 new strains confirmed the importance of the selected features for classification. The results show that GP annotations are a promising computational tool for better classifying plant-associated lifestyles.
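As an illustration of what a per-property enrichment analysis could look like, the sketch below tests whether a single GP is over-represented in one lifestyle group. It assumes a one-sided Fisher's exact test, which the text does not name as the test actually used. The function names and the presence counts are hypothetical; only the group sizes (58 PGPR and 33 plant-pathogenic strains) come from the study description.

```python
from math import comb


def hypergeom_pmf(k, n_total, group_size, n_present):
    """Probability that exactly k of the n_present strains carrying a GP
    fall into a group of size group_size, out of n_total strains."""
    return (
        comb(group_size, k)
        * comb(n_total - group_size, n_present - k)
        / comb(n_total, n_present)
    )


def enrichment_p_value(present_in_group, group_size, present_in_other, other_size):
    """One-sided Fisher's exact test (assumed test, not confirmed by the source):
    probability of seeing at least present_in_group carriers in the group
    if GP presence were independent of lifestyle."""
    n_total = group_size + other_size
    n_present = present_in_group + present_in_other
    k_max = min(group_size, n_present)
    return sum(
        hypergeom_pmf(k, n_total, group_size, n_present)
        for k in range(present_in_group, k_max + 1)
    )


# Group sizes from the study; the presence counts below are made up.
N_PGPR, N_EPP = 58, 33
p = enrichment_p_value(present_in_group=50, group_size=N_PGPR,
                       present_in_other=5, other_size=N_EPP)
print(f"one-sided enrichment p-value: {p:.3g}")
```

Applied to all 1286 properties, such a test would need a multiple-testing correction before any property is called enriched.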
Introduction
One of the targets set by the United Nations towards its goal of eliminating hunger is to double agricultural food production1. Early attempts to increase crop yields and production focused on plant breeding, chemical pest control, and the introduction of synthetic fertilizers to make better use of limited global supplies2,3. While these strategies have improved production, their growing negative impact on the environment is forcing us to look for sustainable alternatives4,5,6.
Numerous studies have shown that cooperative microbiomes can play an important positive role in plant growth, development and fitness2,3,7. A particular hotspot is the rhizosphere, the region of soil surrounding plant roots that is colonized by plant growth-promoting rhizobacteria (PGPR)8. A stable population of PGPR can increase stress tolerance, growth, and crop yields by improving soil nutrient uptake and altering phytohormone status and plant metabolism7,9,10,11,12,13,14,15. The best-studied PGPR are Pseudomonas spp., a functionally diverse group that includes species beneficial to plants as well as (opportunistic) pathogens such as P. syringae, which can live on plant surfaces as epiphytes. Under the right conditions, P. syringae can also colonize the internal tissues of the plant and cause disease16,17,18.
The plant-associated lifestyle of a Pseudomonas strain is the result of a diverse spectrum of plant-host interaction pathways. Genome-based correlation approaches have identified several marker genes that contribute to the phenotype19,20,21. However, these marker genes are shared to a certain extent by both groups22, so the uncertainty about where to draw the line between them grows with each new genome added. There is still no general description of the presence and completeness of the biological functions and pathways that contribute to the plant-associated lifestyle of a Pseudomonas strain. Such knowledge would provide critical insight into a strain's potential to improve plant productivity and resilience.
Comparative functional genomics is possible when genes are placed in a biological context. Genome Properties (GP) is a domain-based functional annotation system with which functional attributes can be assigned to a genome23. The resource is a set of 1286 common biological pathways, each supported by a distinct set of protein domains. For large-scale functional comparison, protein domains scale better and are less sensitive to sequence variation than methods based on sequence similarity24,25. Here we apply GP-based functional genomics, using all 1286 properties as traits, together with machine learning techniques to compare 91 fully sequenced Pseudomonas strains with documented lifestyles: 58 soil-dwelling Pseudomonas strains classified as PGPR and 33 known plant-pathogenic strains (EPP). Because strains with different lifestyles often belong to the same species, it has been suggested that genomic islands gained and lost through homologous recombination may encode important determinants of the plant-associated lifestyle. System wide analysis