Methods

About PhenDB

In recent years, metagenomics projects have often focused on taxonomic composition and diversity, which were used to infer the roles of uncultivated microbes in their communities. However, because of lateral gene transfer (LGT) and the rapid mutation rates observed in many bacteria, inferring the presence or absence of a phenotypic trait in an operational taxonomic unit (OTU) from taxonomic position alone can be misleading. Here, we provide a framework that gives a rough overview of the phenotypic traits potentially present in a user-provided set of metagenomic bins on the basis of their genomic content.

PhenDB is a publicly available resource that screens previously assembled and binned metagenomes, or genomes of cultivated isolates, for 47 different bacterial traits. For trait prediction, we use support vector machines (SVMs) trained on manually curated datasets of gene presence/absence patterns.

The PhenDB workflow

The PhenDB pipeline utilizes several applications to provide the user with trait predictions based on uploaded metagenomic bins.

Trait prediction is performed with a machine learning-based method called PICA, as published in Feldbauer et al., 2015.

Together with the prediction, we also provide a rough estimate of bin completeness and contamination, which PhenDB uses to gauge prediction quality. These values are derived from the presence and absence of 34 marker genes extracted from the EggNOG annotation, similar to the approach of CheckM. The completeness and contamination values are used to adjust the maximum accuracy achievable by each model individually. Note that our marker gene set is based on different HMM profiles than the one used by CheckM; therefore, the values for highly contaminated genomes may differ from those reported by CheckM.
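The marker-gene idea can be illustrated with a minimal sketch. It assumes that completeness is the fraction of marker ENOGs found at least once and that contamination is the number of surplus marker copies divided by the size of the marker set; these formulas, the function name estimate_quality, and the ENOG identifiers are illustrative assumptions, not the scoring PhenDB actually applies.

```python
from collections import Counter


def estimate_quality(annotated_enogs, marker_enogs):
    """Return (completeness, contamination) as fractions of the marker set.

    Assumed definitions (not taken from PhenDB):
      completeness  = markers seen at least once / number of markers
      contamination = surplus marker copies / number of markers
    """
    # Count only hits that belong to the single-copy marker set.
    counts = Counter(e for e in annotated_enogs if e in marker_enogs)
    n_markers = len(marker_enogs)
    present = sum(1 for m in marker_enogs if counts[m] >= 1)
    # Every copy beyond the first suggests sequence from a second genome.
    surplus = sum(counts[m] - 1 for m in marker_enogs if counts[m] > 1)
    return present / n_markers, surplus / n_markers


# Hypothetical marker set and bin annotation.
markers = {"ENOG_A", "ENOG_B", "ENOG_C", "ENOG_D"}
bin_hits = ["ENOG_A", "ENOG_A", "ENOG_B", "ENOG_C", "ENOG_X"]
completeness, contamination = estimate_quality(bin_hits, markers)
```

In this toy case three of four markers are present and one marker occurs twice, so the bin would be reported as 75% complete with 25% contamination.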

Marker genes are also taxonomically classified using the bactNOG database. We use a lowest common ancestor (LCA) algorithm, as applied by MEGAN, with a majority rule to provide a taxonomic classification for each input bin. Note that contaminated bins may remain unclassified because of the stringent cutoffs we apply.
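A majority-rule lowest common ancestor can be sketched as follows. This is a simplified illustration, not MEGAN's or PhenDB's implementation: each marker gene is assumed to carry a lineage listed from root to tip, and a taxon is accepted at a given depth only if more than min_fraction of all marker genes support it. The function name majority_lca and the 0.5 threshold are assumptions.

```python
from collections import Counter


def majority_lca(lineages, min_fraction=0.5):
    """Walk down from the root, keeping a taxon only while a strict
    majority (> min_fraction) of all lineages still agrees on it."""
    consensus = []
    candidates = list(lineages)
    total = len(lineages)
    depth = 0
    while candidates:
        # Vote among lineages that are resolved at this depth.
        votes = Counter(l[depth] for l in candidates if len(l) > depth)
        if not votes:
            break
        taxon, n = votes.most_common(1)[0]
        if n / total <= min_fraction:
            break  # no majority: stop at the previous rank
        consensus.append(taxon)
        # Only lineages consistent with the consensus vote further down.
        candidates = [l for l in candidates if len(l) > depth and l[depth] == taxon]
        depth += 1
    return consensus


# Hypothetical marker-gene lineages for one bin.
lineages = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
    ["Bacteria", "Proteobacteria", "Alphaproteobacteria"],
    ["Bacteria", "Firmicutes"],
]
classification = majority_lca(lineages)
```

Here the bin is classified only down to Proteobacteria, because just half of the marker genes agree on a class. If the marker genes of a contaminated bin were split evenly at the top rank, the function would return an empty list, which corresponds to an unclassified bin.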

 

Specific methods in PhenDB

  • ENOG Assignment

    All PhenDB predictions are performed not on the bin itself, but on the set of orthologous groups of proteins (ENOGs) in which the uploaded bin’s predicted proteins are represented. Thus, in the first step, protein-coding genes are predicted in the bins/genomes [...]

  • Quality Analysis of Uploaded Bins (CompleConta)

    Completeness/Contamination estimation
    To assess the quality of the bin itself, we check the presence or absence of marker genes in the ENOGs returned from the annotation. The 34 universal single-copy ENOGs we use were derived from the originally [...]

  • Trait Prediction (PHENOTREX)

    Bacterial Trait Prediction with PICA

    We use phenotrex, a reimplementation of the PICA software (Feldbauer et al. 2015, https://doi.org/10.1186/1471-2105-16-S14-S1). Phenotrex uses Python 3 and the linear SVM implementation of scikit-learn to train models and perform predictions on sets of [...]
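To make the presence/absence representation concrete, the sketch below encodes a bin as a binary vector over a fixed list of ENOGs and applies the decision rule of a linear SVM, which predicts the trait when w·x + b is positive. It does not use the phenotrex API; the feature list, weights, and bias are invented for illustration, whereas a real model would learn them from the curated training genomes.

```python
def presence_vector(bin_enogs, feature_enogs):
    """Encode a bin as 1/0 flags, one per ENOG in the model's feature list."""
    present = set(bin_enogs)
    return [1 if e in present else 0 for e in feature_enogs]


def linear_decision(x, weights, bias):
    """Signed score of a linear classifier: w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


# Hypothetical model: three features with made-up weights.
feature_enogs = ["ENOG_1", "ENOG_2", "ENOG_3"]
weights = [1.5, -2.0, 0.5]
bias = -0.25

# ENOG_9 is not a model feature and is therefore ignored.
x = presence_vector(["ENOG_1", "ENOG_3", "ENOG_9"], feature_enogs)
score = linear_decision(x, weights, bias)
has_trait = score > 0
```

ENOGs outside the model's feature list do not affect the score, so bins with different gene content can still be compared on the same feature space.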