FeGenie: a comprehensive tool for the identification of iron genes and iron gene neighborhoods in genomes and metagenome assemblies

Iron is a micronutrient for nearly all life on Earth. It can be used as an electron donor and electron acceptor by iron-oxidizing and iron-reducing microorganisms, and is used in a variety of biological processes, including photosynthesis and respiration. While it is the fourth most abundant metal in the Earth’s crust, iron is often limiting for growth in oxic environments because it is readily oxidized and precipitated. Much of our understanding of how microorganisms compete for and utilize iron is based on laboratory experiments. However, the advent of next-generation sequencing and the associated surge in publicly-available sequence data has now made it possible to probe the structure and function of microbial communities in the environment. To bridge the gap between our understanding of iron acquisition and utilization in model microorganisms and the plethora of sequence data available from environmental studies, we have created a comprehensive database of hidden Markov models (HMMs) that is based on genes related to iron acquisition, storage, and reduction/oxidation. Along with this database, we present FeGenie, a bioinformatics tool that accepts genome and metagenome assemblies as input and uses our comprehensive HMM database to annotate the provided datasets with respect to iron-related genes and gene clusters. An important contribution of this tool is the efficient identification of genes involved in iron oxidation and dissimilatory iron reduction, which have been largely overlooked by standard annotation pipelines. While this tool will not replace the reliability of culture-dependent analyses of microbial physiology, it provides reliable predictions derived from the most up-to-date genetic markers. FeGenie’s database will be maintained and continually-updated as new genetic markers are discovered. FeGenie is freely available: https://github.com/Arkadiy-Garber/FeGenie.


58
Iron is the fourth most abundant element in the Earth's crust (Morgan and Anders, 1980), 59 where it occurs primarily as ferrous [Fe(II)] or ferric [Fe(III)] iron. Under circumneutral pH and 60 aerobic conditions, ferrous iron spontaneously oxidizes to its ferric form, which precipitates and 61 settles out of solution becoming highly-limiting to microbial life (Emerson, 2016). Nonetheless, 62 microorganisms have evolved mechanisms to deal with this limitation, as evidenced by the variety  Table S3. Even though the sensitivity 152 of each HMM has been calibrated against NCBI's nr database (see "HMM development and 153 calibration" for details on the calibration process), we recommend that users take advantage of an 154 optional cross-validation feature of the program that allows users to search each FeGenie-155 identified putative iron gene against a user-chosen database of reference proteins (e.g. NCBI's nr, 156 RefSeq). Based on these analyses, FeGenie outputs the following files: 157 158 • CSV file summarizing all identified putative iron-related genes, their functional category, 159 bit-scores (shown in the context of the calibrated bit-score cutoff of the matching HMM), 160 number of canonical heme-binding motifs, amino acid sequence, and closest homolog to a 161 user-provided database (optional; e.g., NCBI nr database). 162 • Heatmap summary comparing the number of genes identified from each iron-related 163 category across the analyzed genomes/metagenomes.   Table S1). To expand the diversity of each of the collected proteins, those sequences were then used as queries in a BLASTp v.2.6.0 (Madden, 2013) search against NCBI's 178 RefSeq (Release 89) database (Pruitt et al., 2007), with a minimum amino acid identity cutoff of 179 35% (Rost, 1999) over at least 70% of the query length. These search results were then de-180 replicated so that each seed sequence is represented by a unique set of non-overlapping BLAST 181 hits. Using MMseqs2 (Steinegger and Söding, 2017), each seed sequence and its set of BLAST 182 hits were then collapsed with a 70% amino acid identity cutoff to remove overrepresented protein 183 sequences, which would otherwise create biases in resulting HMMs. Each collapsed set of 184 sequences was then aligned using Muscle v.3.8.31 (Edgar, 2004) and each alignment was manually 185 inspected and curated. These curated alignments were then used as seeds for the generation of   Figure S1). Thus, two separate HMMs were constructed, one for MtrA homologs 216 encoded by known iron-reducers and one for MtoA homologs encoded by known and suspected iron-oxidizers. The MtoA HMM includes PioA, which has been genetically-verified to be 218 necessary for iron oxidation in Rhodopseudomonas palustris TIE-1 (Jiao and Newman, 2007) 219 Moreover, the mtrA-encoding operon in iron-reducing bacteria typically encodes mtrC, an outer-220 membrane cytochrome thought to participate in dissimilatory iron reduction (Lower et al., 2007).

221
MtrC is not encoded by iron-oxidizing bacteria (Shi et al., 2014), supporting its use as an additional 222 indicator for iron-reducing potential. In light of these ambiguities in the function MtoA, 223 identification of MtoAB by FeGenie is treated with caution as a potential iron oxidase/reductase.

224
Other HMMs used for determination of iron oxidation potential include genes from iron-oxidizing    FeGenie can also be used to identify siderophore synthesis genes and potential operons.

244
Siderophores are microbially-produced products (500-1200 Da) that have a preference for binding 245 ferric iron (up to 10 -53 M) (Ehrlich and Newman, 2008), enabling microorganisms to obtain this   and FeGenie will identify putative siderophore synthesis genes based on the genomic proximity of 254 each identified gene ( Table 1). In contrast, the NIS pathway consists of multiple enzymes that 255 each have a single role in the production of a siderophore, such as aerobactin, which was the first 256 siderophore discovered to be synthesized by this pathway (Kadi and Challis, 2009). The operon involved in aerobactin biosynthesis is iucABCD, and homologs of the genes iucA and iucC (which 258 are included in FeGenie) are indicators of siderophore production via the NIS pathway (Carroll 259 and Moore, 2018). The HMM library that represents siderophore synthesis consists of HMMs 260 derived from the Pfam database, as well as those constructed here (Table 1). Because many 261 different siderophore synthesis pathways share homologous genes, we developed HMMs that were 262 sensitive to the entirety of each gene family, rather than for each individual siderophore.  infer siderophore and heme transport include both custom-made and Pfam models ( Table 1 and   285   Supplemental Table S1).  Table 1 and Supplemental Table S1. Heme oxygenase and transport genes define another strategy that microorganisms, 299 especially pathogens, use to obtain iron from their environment. In particular, heme oxygenases 300 enable pathogens to obtain iron from a host through oxidative cleavage of heme, thereby releasing 301 iron (Wilks and Heinzl, 2014). Heme oxygenases are categorized into two groups: 1) "canonical" 302 heme oxygenases (HmuO, PigA, and HemO), which degrade heme to biliverdin and carbon 303 monoxide, and 2) "non-canonical" heme oxygenases (IsdG, IsdI, MhuD, and Isd-LmHde), which      (Emerson et al., 2013). Since mtoAB are homologous to genes also implicated in iron reduction (mtrAB), FeGenie classified these genes 378 as potentially related to iron oxidation or iron reduction (i.e. the "potential iron oxidation/potential 379 iron reduction" category).

381
FeGenie accurately identified iron-reduction genes and operons in known iron-reducing 382 bacteria. For example, Shewanella oneidensis MR-1, a model organism for iron reduction, was 383 found to encode both copies of its porin-cytochrome module: mtrCAB and mtrDEF (mtrDEF is 384 homologous to mtrCAB, and was identified as such by FeGenie). Additionally, FeGenie identified 385 two more operons that each encode only mtrAB, which FeGenie categorizes as "probable iron 386 reduction" due to the lack of mtrC. Interestingly, within the mtrCABDEF operon, FeGenie also 387 identified the ferrous iron transport genes feoAB, which could be involved in the uptake of ferrous 388 iron that is generated during iron reduction. This same operon also encodes a catalase (not included 389 in FeGenie), which is a heme-containing protein that deals with oxidative stress and may 390 potentially be expressed together with the iron-reduction genes to deal with the oxidative stress of 391 high intracellular iron concentrations (Touati, 2000) potentially resulting from dissimilatory iron 392 reduction.

394
Some of the identified iron-reducers, for example, R. ferrireducens (Finneran et al., 2003),  Without any other candidate iron reductases in AMB-1's genome, this indicates that MtrAB may 411 be utlized in iron reduction without the outer-membrane component.

413
FeGenie was also used to identify iron acquisition and transport genes in model 414 microorganisms, including siderophore transport and synthesis genes, heme transport and 415 oxygenases, and Fe(II)/Fe(III) transport. It is worth noting that in these organisms not linked to 416 respriatory iron oxidation or dissimilatory iron reduction, FeGenie did not identify genes related to these metabolisms. In Escherichia coli and Bacillus subtilis, FeGenie identified three genes that 418 are necessary for the uptake of iron, efeUOB (Cao et al., 2007), in addition to other iron transport 419 genes (Supplemental File 4). Iron transport potential was also identified in nearly every genome 420 analyzed (including the CPR, discussed more in the section "Case Study: Iron-related genes 421 encoded within the Candidate Phyla Radiation"). This is expected, given that iron is a necessary 422 micronutrient for the vast majority of life. As an example of FeGenie's capability to identify the 423 siderophore gene families, we will focus on siderophore synthesis by Bacillus anthracis. B. 424 anthracis is known to produce anthrabactin (bacACEBF) and petrobactin (asbABCDEF) (Oves- available in public repositories. This step will add additional confidence that a gene identified as 441 iron-related is indeed so, based on its closest known relative.

443
FeGenie was also used to analyze the iron relevant genes encoded by five phototrophs,   Table 2 for site 482 descriptions). Generally, FeGenie's analysis indicate that there are discernable differences in iron 483 maintenance and metabolism strategies based on locale, likely due to differential iron availability 484 and general redox conditions ( Figure 5A, Supplemental Files 6 and 7). For example, where iron 485 oxidation and reduction gene counts are high, there appears to be fewer genes for iron acquisition.

486
As expected, the genetic potential for iron acquisition and storage appears to be more important in 487 environments where microorganisms are more likely to encounter iron limitations (Crosa, 1989;488 Andrews, 1998). This is supported by hierarchical clustering of the iron gene abundances across 489 analyzed metagenomes (Figure 5B), an optional step in FeGenie's pipeline. This offers support 490 for FeGenie's ability to provide meaningful insights into the iron-related genomic potential in 491 environmental metagenomic datasets.  Figure 5A). While cyc2 appears to be the 499 most widely-distributed gene that is associated with iron oxidation, other putative iron oxidases 500 are also identified (e.g. sulfocyanin, mtoAB, foxE). Iron reduction is predicted from the occurrence 501 of homologs to mtrCAB, as well as various porin-cytochrome operons homologous to those 502 encoded by Geobacter and Desulfovibrio species. In addition, we identified homologs to the 503 cytochrome OmcS from Geobacter sulfurreducens, thought to be involved in long-distance   The microbial capability to uptake iron is critical to understanding human oral infections 547 (Wang et al., 2012). This is because host iron-binding proteins, such as transferrin, lactoferrin, 548 hemoglobin, and ferritin, maintain an environment of low free iron concentrations (estimated 10 -549 18 M free iron in living tissues (Weinberg, 1978)), inhibiting bacterial growth (Mukherjee, 1985). 550 Here, we used FeGenie to analyze four representative strains from the human oral biofilm  Files 8 and 9), all publicly-574 available CPR strains were analyzed (Supplemental Files 10 and 11). The 17 selected genomes were chosen to demonstrate differences within these genomes with regard to genomic potential for 576 iron acquisition and utilization. The CPR members presented here include members of the 577 candidate phyla OP9 (Caldatribacterium), as well as Candidata Rokubacteria, Nealsonbacteria, 578 Zixibacteria, and the novel Archaeal phylum AR4.

580
Genes for siderophore synthesis were detected in only one of the CPR genomes analyzed 581 (Figure 6), while potential for siderophore transport is found in nearly all of the genomes.

582
Candidates for heme transport genes were found in only 4 of the 17 CPR strains analyzed. Out of 583 the 17 CPR genomes analyzed, none were found to encode genes associated with iron acquisition 584 from heme or hemophores, although the heme transport gene hmuV was identified in Candidatus 585 Nitrospira defluvii and Candidatus Raymondbacteria. Interestingly, some CPR microbes, such as 586 Candidatus Nealsonbacteria, do not seem to encode any genes associated with iron maintenance 587 or metabolism, with the exception of some putative iron transporters. One possible reason for this 588 is that these microorganisms, whose genomes are considerably smaller than typical free-living Here, we describe a new HMM database of iron-related genes and a bioinformatics tool, 610 FeGenie, that utilizes this database to analyze genomes and metagenomes. We validated this tool 611 against a select set of 26 isolate genomes and demonstrate that FeGenie accurately detects genes 612 related to iron oxidation/reduction, magnetosome formation, iron regulation, iron transport, 613 siderophore synthesis, and iron storage. Analysis of 27 environmental metagenomes using metabolic strategies across diverse ecosystems, and demonstrates that FeGenie can provide useful 616 insights into the iron gene inventories across habitats. We also used FeGenie to provide insights 617 into the iron metabolisms of 17 of the recently discovered CPR microorganisms, and revealed 618 genetic potential not identified in previous reports. FeGenie will be continuously updated with new 619 versions as new iron-related genes are discovered. Acknowledgments 625 We gratefully thank the following people for their support, advice, and comments: David

648
The authors declare no competing financial interests in relation to this work.           Table S1 1171 for more information, including the corresponding Pfam or TIGRFAMs families and the sequences used to create the HMMs. including external programs/dependencies, optional databases for cross-reference, and custom Python 1213 scripts 1214 representative isolate genomes. The isolate genomes were selected as model microorganisms to 1217 demonstrate the accuracy of FeGenie for identifying genes involved in iron oxidation and reduction, iron 1218 transport (including siderophores and heme), iron storage, and iron gene regulation. The genomes were 1219 obtained from the NCBI RefSeq and GenBank databases and analyzed by FeGenie. The size of each dot 1220 reflects the number of genes identified for each category and normalized to the number of protein-coding 1221 genes predicted within each genome. *This category is reserved for genes related to the MtoAB/PioAB 1222 gene family. 1223 27 metagenomes. The dendrogram was created using Rscript by hierarchically-clustering the distance 1228 matrix (Euclidian), which is created from a scaled version of FeGenie's matrix output, which summarizes 1229 the amount of iron genes for each category present in each metagenome assembly. The FeGenie output 1230 summary is represented as the heatmap, which is created from the same scaled matrix that was used for 1231 the dendrogram. The size of each dot reflects the number of genes identified for each category, 1232 normalized to the number of protein-coding genes predicted within each metagenome. *This category is 1233 reserved for genes related to the MtoAB/PioAB gene family. 1234