Wednesday, September 5, 2018

Tenure-Track Investigator Recruitment in Data Science, Biomedical Informatics, and Computational Biology @NLM, NIH

The National Library of Medicine (NLM) of the National Institutes of Health (NIH) is seeking 2-3 tenure-track investigators to lead world-class research programs within its Intramural Research Program. The goal of this search is to identify candidates with the potential to develop a dynamic, innovative, and independent computational research program that will enhance NLM’s collaborative environment by performing novel, cutting-edge research and, thereby, advance the objectives of NLM’s new Strategic Plan, 2017-2027, to accelerate data-driven discovery. As part of this plan, and in alignment with NIH’s new Strategic Plan for Data Science, NLM is embarking on a major expansion of its Intramural Research Program, with a specific emphasis on data science methodologies, analytics, visualization, computer-assisted curation, and applications of novel methods for basic biological and biomedical discovery.

The NIH Intramural Research Program offers unique opportunities for interaction and collaboration with an extensive community of scientists who have broad expertise across all fields of biomedical research, proximity to NLM’s advanced biomedical data and information systems, access to NIH’s high-performance computing resources, and a funding model that facilitates the conduct of long-term, high-impact science that can be difficult to undertake in other research environments.

NLM seeks to recruit investigators with significant experience in the use of statistical, bioinformatic, machine learning, optimization, and advanced mathematical methodologies as applied to biomedical and health science, including one or more of the following domains of interest:

• Analysis of large biomedical data sets, multi-dimensional medical images, and unstructured text to generate new biomedical knowledge
• Standards and ontologies for the content and structure of biomedical and health data, to facilitate the automated aggregation and analysis of these data
• Natural language processing and computational linguistics, including question answering, deidentification, entity extraction, text indexing, and text mining
• Network and systems approaches to answering relevant questions related to molecular biology, biochemistry, and disease
• Genomic studies, including evolutionary genomics, metagenomics, ancient DNA-based genomics, cancer genomics, personalized genomics, and population genetics
• Novel methods for processing and analyzing emerging types of large-scale data sets, such as those arising from single-cell expression studies and multi-dimensional (3D/4D) chromatin structure studies

We encourage individuals to apply who are at the forefront of working with complex biomedical and health-related data with the ability, expertise, and interpersonal skills to direct research projects yielding generalizable, scalable, and reusable methodologies. Ideal candidates are expected to have an outstanding record of publication and achievements in the aforementioned domains of interest.

Successful candidates will join a diverse, collegial, and cooperative community of investigators within NLM, providing a supportive environment for high-quality mentoring. Successful candidates will be supported with long-term, stable resources equivalent to those provided to faculty in academic departments; this includes the ability to recruit postdoctoral fellows and a budget for software, consumables, and equipment. Salary is competitive and will be commensurate with experience, and full federal benefits are included.

Applicants must have a Ph.D., M.D., or equivalent degree in a field appropriate to this recruitment, as well as demonstrated accomplishment commensurate with several years of postdoctoral research experience. Appointees may be U.S. citizens, resident aliens, or non-resident aliens with (or eligible to obtain) a valid employment authorization visa.

 Interested applicants should submit the following materials electronically to investigator_search_2018@ncbi.nlm.nih.gov:

• A one-page cover letter 
• Curriculum vitae
• A three-page description of proposed research
• A one-page statement describing the candidate’s mentoring approach and philosophy. This statement should include information on relevant mentoring experience involving women and individuals from groups that are underrepresented in the biomedical research community.
• Names and contact information for three referees familiar with the applicant’s work

Applications will be reviewed starting on October 1, 2018 and will be accepted until the position is filled. 

For more information on NLM, please visit https://www.nlm.nih.gov. Additional information on the NIH Intramural Research Program can be found at http://irp.nih.gov. Specific questions regarding the recruitment may be directed to Dr. Andy Baxevanis, the Search Chair, at andy@mail.nih.gov. 


Department of Health and Human Services • National Institutes of Health. DHHS and NIH are Equal Opportunity Employers and encourage applications from women and minorities.

Tenure-Eligible Senior Investigator Recruitment in Data Science, Biomedical Informatics, and Computational Biology @ the NLM, NIH

The National Library of Medicine (NLM) of the National Institutes of Health (NIH) is seeking to recruit a tenure-eligible Senior Investigator to lead a world-class research program within its Intramural Research Program. The goal of this search is to identify candidates to develop a dynamic, innovative, and independent computational research program that will enhance NLM’s collaborative environment by performing novel, cutting-edge research and, thereby, advance the objectives of NLM’s new Strategic Plan, 2017-2027, to accelerate data-driven discovery. As part of this plan, and in alignment with NIH’s new Strategic Plan for Data Science, NLM is embarking on a major expansion of its Intramural Research Program, with a specific emphasis on data science methodologies, analytics, visualization, computer-assisted curation, and applications of novel methods for basic biological and biomedical discovery.

The NIH Intramural Research Program offers unique opportunities for interaction and collaboration with an extensive community of scientists who have broad expertise across all fields of biomedical research, proximity to NLM’s advanced biomedical data and information systems, access to NIH’s high-performance computing resources, and a funding model that facilitates the conduct of long-term, high-impact science that can be difficult to undertake in other research environments.

NLM seeks to recruit a Senior Investigator with significant experience in the use of statistical, bioinformatic, machine learning, optimization, and advanced mathematical methodologies as applied to biomedical and health science, including one or more of the following domains of interest:

• Analysis of large biomedical data sets, multi-dimensional medical images, and unstructured text to generate new biomedical knowledge
• Standards and ontologies for the content and structure of biomedical and health data, to facilitate the automated aggregation and analysis of these data
• Natural language processing and computational linguistics, including question answering, deidentification, entity extraction, text indexing, and text mining
• Network and systems approaches to answering relevant questions related to molecular biology, biochemistry, and disease
• Genomic studies, including evolutionary genomics, metagenomics, ancient DNA-based genomics, cancer genomics, personalized genomics, and population genetics
• Novel methods for processing and analyzing emerging types of large-scale data sets, such as those arising from single-cell expression studies and multi-dimensional (3D/4D) chromatin structure studies

We encourage individuals to apply who are at the forefront of working with complex biomedical and health-related data with the ability, expertise, and interpersonal skills to direct research projects yielding generalizable, scalable, and reusable methodologies. The ideal candidate is expected to have an outstanding record of publication and achievements in the aforementioned domains of interest.

The successful candidate will join a diverse, collegial, and cooperative community of investigators within NLM, providing a supportive environment for high-quality mentoring. The successful candidate will be supported with long-term, stable resources equivalent to those provided to faculty in academic departments; this includes the ability to recruit postdoctoral fellows and a budget for software, consumables, and equipment. Salary is competitive and will be commensurate with experience, and full federal benefits are included.

Applicants must have a Ph.D., M.D., or equivalent degree in a field appropriate to this recruitment, as well as demonstrated accomplishment commensurate with several years of experience as an independent investigator. Appointees may be U.S. citizens, resident aliens, or non-resident aliens with (or eligible to obtain) a valid employment authorization visa.



Interested applicants should submit the following materials electronically to investigator_search_2018@ncbi.nlm.nih.gov:

• A one-page cover letter
• Curriculum vitae
• A three-page description of proposed research
• A one-page statement describing the candidate’s mentoring approach and philosophy. This statement should include information on relevant mentoring experience involving women and individuals from groups that are underrepresented in the biomedical research community.
• Names and contact information for three referees familiar with the applicant’s work

Applications will be reviewed starting on October 1, 2018 and will be accepted until the position is filled. 

For more information on NLM, please visit https://www.nlm.nih.gov. Additional information on the NIH Intramural Research Program can be found at http://irp.nih.gov. Specific questions regarding the recruitment may be directed to Dr. Andy Baxevanis, the Search Chair, at andy@mail.nih.gov.


Department of Health and Human Services • National Institutes of Health. DHHS and NIH are Equal Opportunity Employers and encourage applications from women and minorities.

Friday, May 18, 2018

Release factor evolution and resolution in ribosome quality control (RQC) pathway


One of the more fundamental mysteries in the evolution of Life concerns the emergence of the translation release factors. Release factors are the proteins which sever the terminal tRNA-aminoacyl bond at the ribosome, enabling the completed protein chain to diffuse from the ribosome. Release factors accomplish this task by adopting a structural shape that mimics the tRNA, allowing them to access and interact with the stop codon in the ribosome. On gaining access to the ribosome, the release factor catalyzes hydrolysis of the tRNA-aminoacyl bond by coordinating a water molecule with an absolutely conserved glutamine residue.

Despite this crucial and universal function, no single release factor can be traced to the last universal common ancestor (LUCA) of life: in fact, the principal release factor proteins of the bacterial and archaeo-eukaryotic lineages belong to two entirely distinct protein folds. Dueling parsimonious evolutionary scenarios can account for this observation: 1) one of the two release factor folds was found in the LUCA and was later displaced in one of the lineages, or 2) the two versions emerged independently in the lineages, each displacing the ancestral release factor. In the latter scenario, the ancestral release factor could have been a tRNA or tRNA-related ribozyme, consistent with other RNA-world hypotheses.

To throw light on the question of early release factor diversification, we specifically investigated the evolutionary history of the archaeo-eukaryotic release factors (aeRF1s) [see Verma et al.]. Through this analysis, we identified a pair of novel clades in the aeRF1 superfamily, both of which surprisingly had a substantial bacterial component. One of these clades contained 4 families with an unusually complicated evolutionary history: the earliest-branching family is found only in archaea and retains the core architectural features of the classical archaeal aeRF1s, suggesting it was an ancient duplication of the classical versions. At some point, representatives from this family were transferred to a terminally differentiated bacterial lineage, eventually giving rise to two distinct families. One of these families, found primarily in Bacteroidetes, was then acquired early in the evolution of eukaryotes, giving rise to the final family in the clade. This eukaryotic family is sporadically distributed across several lineages, but was fixed early in the crown group eukaryotes (plants-fungi-amoebozoa-animals) as the central catalytic core of the Vms1/ANKZF1-like proteins.



Through a collaboration with the Deshaies laboratory, this family was characterized in a recent publication in the Nature magazine as the key missing release factor of the ribosome quality control (RQC) pathway [Verma et al.], the pathway that rescues “jammed” ribosomes which are stalled on mRNA with the growing peptide chain still attached. We suspect, due to shared sequence and domain architectural features, that the prokaryotic families of this clade (named the VLRF1 clade for Vms1-like aeRF1 clade) are also likely to be involved in the clearance of stalled ribosomes.

The second clade we identified contained a total of 14 previously unrecognized families found across a diverse assortment of bacterial lineages (named the baeRF1 clade for bacterial-aeRF1). Despite the monophyly of these families, a wide range of structural, domain architectural, and sequence diversity is observed, suggestive of considerable selective pressure being applied to these families. Perhaps most notably, the characteristic loop region of the aeRF1 superfamily, which typically houses the active site glutamine residue, varies tremendously in length and content both across and within the baeRF1 families; many families are even predicted to be catalytically inactive due to the lack of a strongly conserved glutamine residue. While these families remain functionally uncharacterized, one strongly conserved genomic contextual association was consistently observed across several families: shared genome association with an HPF-like ribosome hibernation factor. These domains are known to directly interact with the translational machinery and induce conformational change in the ribosome to promote the inactivation of ribosomes. The association between baeRF1 and HPF-like domains could indicate that baeRF1 proteins play a complementary role in inducing ribosome inactivation, potentially by occupying the typical tRNA binding sites on the ribosome (consistent with the inactivation of the enzyme in most families). Alternatively, the association could act as a regulatory switch, with the baeRF1 displacing HPF and restoring ribosome function. Given the rapidly evolving features of the baeRF1 clade, it seems likely that this function is tied to bacterial conflict. As such, baeRF1 could be activated in response to the detection of invasive elements, potentially to prevent the element from hijacking the endogenous translation machinery.

While these findings speak to a previously poorly understood complexity in the evolution of the aeRF1 superfamily and resolve one of the last remaining mysteries of the RQC pathway, ultimately little is revealed about the state of the release factor in the LUCA. While it may be tempting to suggest that the VLRF1 and baeRF1 clades represent the surviving remnants of a potential ancestral bacterial aeRF1 presence that was displaced early in the bacterial lineage by the bacterial-specific release factor fold, our analysis indicates that both clades likely emerged from later transfers from a classical archaeal aeRF1 progenitor.

The most striking evolutionary finding from this analysis is the clear acquisition of the eukaryotic Vms1/ANKZF1-like release factors from the Bacteroidetes lineage. This observation adds to an increasing list of key eukaryotic factors which have their direct antecedents in the Bacteroidetes, suggesting that an important complement of genetic material in eukaryotes was likely inherited early in eukaryotic evolution from a Bacteroidetes symbiont, independent of the α-proteobacterial mitochondrial progenitor.

Wednesday, April 25, 2018

Resolution of the ribosome quality control (RQC) pathway and unexpected diversity in archaeo-eukaryotic release factors

RQC is a critical pathway conserved across animals, essentially acting as the cell’s last resort for the rescue of “jammed” ribosomes. A jammed ribosome is one that stalls on the mRNA during translation, preventing further reading along the mRNA and the release of the elongating peptide chain. Disruptions to this pathway have been associated with several ribosomopathies, or human disease conditions related to ribosome malfunction.

Since the RQC pathway was first discovered around fifteen years ago, its steps have been more or less worked out: very generally, the jammed ribosome is split into its two subunits, the faulty mRNA is removed and degraded, the incomplete peptide chain is elongated and pushed out through the exit tunnel via a remarkable tailing process [Shen et al.], and the chain itself is concomitantly tagged for degradation. Still, one outstanding question remained: across life, elongating peptide chains can only be released from the ribosome by the action of a release factor, which severs the last amino acid residue from the tRNA to which it is bound. In RQC, the responsible release factor was unknown.

Our group has had a longstanding interest in the evolution of both release factors and the RQC pathway [Leipe et al.][Burroughs and Aravind]. Through a collaborative effort with the Deshaies lab, we discovered the elusive RQC release factor. In humans, it is found in the ANKZF1 protein; its yeast ortholog is Vms1. Several elegant experiments clearly demonstrating the tRNA-aminoacyl hydrolase activity of Vms1/ANKZF1 and its impact on the degradation of the stalled peptide chain were recently published in the Nature magazine [Verma et al.]. Given that the ANKZF1 protein was recently directly linked to infantile-onset inflammatory bowel disease [van Haaften-Visser et al.], this study both links the condition to ribosome dysfunction and provides the first genetic function leads for a serious contributor to this complex condition.

While this discovery finally completes our understanding of the central steps in the RQC pathway, it also throws new light on another long-standing mystery: the deep evolution of release factor proteins. While most core pieces of the translation machinery are universally conserved across all Life, the solution to the translation termination step differs between the archaeo-eukaryotic and bacterial lineages, which each have their own distinct release factors. It is possible that one of these release factors was present in the last universal common ancestor (LUCA) and was displaced in the other lineage, or that both lineages independently evolved their own solutions.

We found that the Vms1/ANKZF1 release factor was part of a previously unknown and extensive radiation of release factors related to the archaeo-eukaryotic release factors (aeRF1s). Surprisingly, these new members were also found across a broad range of bacteria: the first discovery of any aeRF1s in bacteria. Careful analysis of these new versions allowed us to divide them into two groups, which we call the VLRF1 clade (for Vms1/ANKZF1-like release factor clade) and the baeRF1 clade (for bacterial archaeo-eukaryotic release factor clade). We predict that the release factor proteins belonging to the first clade, including those found in the bacteria, are all likely to be involved in the rescuing of jammed ribosomes. One of the more striking evolutionary observations of this clade is that the direct antecedents of the eukaryotic Vms1/ANKZF1 release factor proteins are found almost exclusively in the Bacteroidetes lineage of the bacteria. While the bulk of the bacterially inherited genetic component of the eukaryotes was inherited from the α-proteobacterial mitochondrial symbiont progenitor, this finding adds to a growing list of key eukaryotic protein domains which have their roots in early symbionts from other lineages.

The second group, the baeRF1 clade, is found only in diverse bacterial species. These release factors are notably under tremendous selective pressure: in other words, they are evolving rapidly in terms of their structure and sequence features. In fact, we predict that most of the versions belonging to this clade are likely to be catalytically inactive, incapable of cleaving tRNA-aminoacyl bonds in the ribosome. This raises the question: what is their function in these bacteria, and why are they evolving so rapidly? Our analysis identified a conserved link in bacteria between baeRF1-like proteins and ribosome hibernation factors (Figure 1). We suspect that baeRF1 proteins are likely involved in bacterial conflict: potentially activated in response to infection by a virus or other invasive element and then contributing to either the shutting down or re-starting of translation along with the ribosome hibernation factor.

Figure 1.  Gene neighborhoods of baeRF1
While our findings speak to a previously poorly-understood complexity in the evolution of aeRF1 proteins, ultimately little is revealed about the state of the release factor in the LUCA. While it might be tempting to speculate that the VLRF1 and baeRF1 clades represent surviving remnants of a potential ancestral bacterial aeRF1 presence displaced early in bacteria by the bacteria-specific release factor fold, our analysis indicates that both clades likely emerged from later transfers from a classical archaeal aeRF1 progenitor.



Monday, July 24, 2017

Cases of rank unethical scientific practice

June 30-2017
In the current scientific environment, where life runs by the maxim “publish or perish”, we see an enormous uptick in new journals, publication of short but poorly documented papers, and the culture of the minimal publishable unit. This has greatly contributed to the swelling bulk of scientific literature while diminishing its quality. Further, it has become difficult to keep up with all the scientific literature even in one’s own field. Hence, it is conceivable that one misses key papers pertaining to research one is seeking to publish. Yet, when one fails to cite key papers germane to one’s own publication it raises serious questions. This is especially serious if the primary novel idea or finding being presented in a publication has previously been presented elsewhere. Instructional resources pertaining to scientific plagiarism provided by the Department of Health and Human Services, USA (https://ori.hhs.gov/images/ddblock/plagiarism.pdf), have the following to say regarding plagiarism of ideas:

 “Appropriating an idea (e.g., an explanation, a theory, a conclusion, a hypothesis, a metaphor) in whole or in part, or with superficial modifications without giving credit to its originator.” 

We believe that this is precisely what has happened in two cases pertaining to discoveries we have formerly published, and we outline the situation below. Recently, two papers examining the active CRISPR polymerase domain of Type III CRISPR systems have become public. One was published in the Science magazine and the other was deposited in bioRxiv, a preprint archive for biological sciences (Jul-24-2017: see update appended below). Both papers carry out a set of experiments which test hypotheses that were exactingly reached as part of a rigorous computational study published by our group in late 2015. The new papers confirm the predictions down to very minute details, and the conclusions they reach and discuss adopt almost exactly the same language as our computational paper. Thus, experimental researchers have contributed to the understanding of Type III CRISPR systems, firmly establishing a concept first explicitly laid out in our paper and resolving the mystery and the very origins of a key arm of prokaryotic counter-nucleic acid immunity, along with implications for parallel animal immunity mechanisms.

There’s only one problem: absent from either paper is any mention, let alone citation, of our original paper which was the first to lay out this concept. There can be absolutely no doubt that our computational paper was the sole source of the authors’ “inspiration”. The current literature is rife with misunderstanding regarding the role of the Type III CRISPR polymerase, with various dubious assertions linking it to various kinds of activities, most extraordinarily, nuclease activity. Prior to a week ago, our paper was the only paper to provide a unifying functional hypothesis for the role of the Type III CRISPR polymerase which was supported by several lines of consilient evolutionary and genomic evidence. From this starting point, it is impossible to excuse the lack of reference to the paper, but it becomes even harder when considering that the paper was published in a high-profile journal (Nucleic Acids Research), which one would assume any supposed CRISPR specialist with the capacity to do something known as PubMed Search, a basic tool of biological research, could access. Further, the work has been presented as a lecture in several high-profile conferences and other venues, and was extensively described from more of a layperson’s perspective in a detailed blog post.

Unethical citation takes many forms and we have been very frequently victims of such. There is a general tendency in the community of wet-lab workers to extensively use computationally derived ideas and predictions while failing to acknowledge or even disparaging them when it comes to actual acknowledgement of prior research in their own publications. This would be rather unthinkable in certain other sciences, such as physics, where a clearly published preexisting prediction cannot be ignored by experimental workers who subsequently publish confirmation. In the same paper of ours we also had several other (in our opinion) important predictions concerning nucleotide recognition in various signaling networks, including immunity systems. One of these was the first prediction for the TIR domain being an enzyme which potentially hydrolyzes the base-sugar linkage. A subsequent experimental verification of this hypothesis has been published with due citation of our prior prediction. This shows that ethical practice does not preclude high-profile publication. Hence, we see the actions of non-citation by the above-mentioned authors pertaining to the other discoveries described in the same paper as an act of blatant unethical scientific practice, which in our eyes is no less than plagiarism. This continues with a point we made in a previous post regarding the inability of the peer-review and the editorial system at the scientific magazines and other journals to prevent bad scientific practice. Since the paper deposited in bioRxiv has not been formally published we still hope that the authors might take corrective steps prior to the final publication of their work and hope that this represents an example of the preprints helping to reform the publishing system in biology. 

Finally, we have circumstantial evidence that the authors of the Science paper from Lithuania accessed our paper and blog well before they submitted their article for review to Science. Their article was received for publication June 8, 2017 and accepted for publication June 22, 2017. We have visitor logs for our blog (presented below) showing that visitors from the same university using the same mail server accessed our blog over 14 times on March 7 and March 8, 2017. Further, as can be seen below, they accessed our original paper in Nucleic Acids Research, as well as the key figure in the post that essentially forms the basis of the discoveries which they fail to cite. We find this to be further reason to question their scientific ethics in regard to this publication (Update #2, August 31-2017).











Additionally, below is a screenshot of a Google search with some elementary related keywords that would have trivially retrieved our previous publication for a reviewer/editor.



Update (Jul-24-2017): After publishing the above entry, we directly contacted the authors of the bioRxiv preprint, briefly outlining the above concerns. To their credit the authors have properly cited our work in their final released draft which was published last week in the Nature magazine. This is a good illustration of the value of the preprint movement in “reforming” scientific practices.

Update 2 (31-August-2017): In an updated version of the paper from the Lithuanian group in the Science magazine, which appears in its final print form in the August 11, 2017 issue, the authors have cited our work, albeit in perhaps the most grudging way possible. While we would like to conclude that this was done by the authors of their own accord, in reality we communicated our objections forcefully to the editorial staff at the Science magazine by referring them to this very post. Hence it is possible that the citation was included post-review after communication between the editors and authors. The upshot of this entire exercise is that unless you are vehement in your objections and go to great lengths, sacrificing considerable time and effort, such unethical scientific practice can easily pass through the gates of editorial oversight and peer review. This is not an isolated example of such behavior; numerous examples go unseen until a point where it is too late to make adjustments, and in some cases even clear and substantial objections fall on deaf editorial ears.

Thursday, July 6, 2017

Universal gating mechanism for TRPM class of ion channels?

We have discussed the implications of our findings from a recent publication and a scientifically unethical publication confirming one of the discoveries reported therein at length in other posts. One additional finding from our paper which is yet to receive any attention is the discovery of a new member of the SLOG superfamily found fused to the well-studied yet still in many ways poorly-understood Transient Receptor Potential Melastatin family (TRPM) of ion channels. Our research points to a likely regulatory role for this newly-discovered SLOG domain in TRPM channels, possibly by functioning in a universal gating mechanism for this clade of channels.

TRPM channels are usually involved in the transmembrane passage of monovalent ions, but have also been shown to transport divalent cations with varying specificity. Eight distinct paralogs (TRPM1-8) have been recognized in mammals; previous studies have traced individual paralogs to distinct starting points during animal evolution, ultimately pointing to their common origin in choanoflagellates. Studies devoted to understanding TRPM1-8 have reported widely varying tissue expression distribution patterns, gating properties, and functional roles, and have linked these channels to a range of human disease conditions including, but not limited to, hypomagnesemia with secondary hypocalcemia (TRPM6), Guamanian amyotrophic lateral sclerosis/Parkinsonism dementia (TRPM7), and cardiovascular disease (TRPM7).

As is frequently observed in other classes of ion channels, several distinct protein domains have been identified in TRPMs, adorning the central transmembrane region which forms the channel. Prior to our recently published study, the known core architecture shared across TRPM proteins proceeded as follows: a large N-terminal region with no previously known homology to any other domain, followed by a 6 TM helix-containing region forming the ion channel and the so-called “TRP-box” motif. Additionally, the TRPM2 protein is C-terminally fused to a NUDIX domain while the TRPM6/7 proteins are C-terminally fused to a Protein Kinase domain (see Figure 1).

Figure 1

Deletion studies of TRPM channels had previously determined that the large N-terminal region was essential for ion channel function, although the presence of any specific protein domain units in this region had eluded researchers. At least one study had specifically linked this N-terminal region to a role in channel activation, and many of the SNPs associated with the disease conditions above mapped to this same N-terminal region. It was within this region that we observed the presence of the novel SLOG domain, which we dubbed the “LSDAT” family of SLOG domains. This discovery prompted further investigation into this region, and we were able to determine the complete domain architectures of TRPM channels: the SLOG domain is followed by three divergent Ankyrin repeats (a fusion shared with other families of TRP channels) before leading into the previously-described core architecture (see Figure 1). Additionally, several TRPM channels were further fused to a C-terminal cysteine-rich region. Delineation of the core architecture (SLOG+3 Ankyrin+ion-channel) led to novel evolutionary insights into the TRPM channels: beyond the versions identified in animals and choanoflagellates, the core TRPM architecture is also present in algae such as the cryptomonad Guillardia and the haptophyte Emiliania. This suggests a potentially deeper evolutionary origin for the TRPM proteins than previously thought. Versions of the LSDAT SLOG domains were also identified in ciliate genomes. While these often exist as standalone domains, on occasion they are found fused to a distinct ion channel or Ras-like GTPase domains, suggesting multiple independent recruitment events of the LSDAT domain to distinct transmembrane channels in eukaryotes.
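
For readers who prefer to see the architecture spelled out, the domain order described above can be written down as a small Python sketch. The names `CORE_TRPM`, `C_TERMINAL_FUSIONS` and `architecture` are illustrative labels chosen for this sketch, not part of any published tool. The C-terminal cysteine-rich region is left out because the text does not say which channels carry it.

```python
# Core TRPM architecture from N- to C-terminus, as described in the post.
CORE_TRPM = [
    "SLOG (LSDAT)",
    "Ankyrin repeat", "Ankyrin repeat", "Ankyrin repeat",
    "6-TM ion channel",
    "TRP-box",
]

# C-terminal fusions named in the post; other paralogs get none in this sketch.
C_TERMINAL_FUSIONS = {
    "TRPM2": ["NUDIX"],
    "TRPM6": ["Protein Kinase"],
    "TRPM7": ["Protein Kinase"],
}

def architecture(paralog: str) -> list[str]:
    """Return the domain order for a mammalian TRPM paralog (illustrative only)."""
    return CORE_TRPM + C_TERMINAL_FUSIONS.get(paralog, [])

print(" + ".join(architecture("TRPM2")))
```

Keeping the shared core separate from the paralog-specific fusions mirrors the argument in the text: the SLOG+3 Ankyrin+ion-channel core is the conserved unit, and the fusions are lineage-specific additions.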

The same paper published by our group provides the first overview of the SLOG domain superfamily, a Rossmannoid domain most closely related to the deoxyribohydrolase (DRHyd) and TIR superfamilies to the exclusion of all other Rossmannoid domains; the three superfamilies feature a shared atypical nucleotide-binding pocket. We observed that SLOG domains appear to function in one of four general roles in the cell: 1) as an enzyme which modifies nucleotides via a base-clipping reaction (as is seen in the classical LOG family which generates cytokinin and cytokinin-like molecules in plants and some bacteria); 2) as a single-stranded DNA-binding protein (Smf/DprA); 3) as a sensor or processor of nucleotides in conflict systems which are centered on the production of a nucleotide intermediate; or 4) as a probable regulator of transmembrane domains, predicted as such due to their genomic association with various distinct transmembrane domains.

The LSDAT SLOG domain falls into the last category: it is most closely related to a family of bacterial SLOG domains whose lone conserved genomic association is fusion to, or adjacent positioning in the genome with, the SLATT transmembrane domain family, whose members are predicted to function as potential membrane pore-forming effectors. In some bacteria this system might have been “domesticated” into a signaling system in which the SLOG and SLATT domains work cooperatively, with the SLOG domain regulating the formation of, or flux through, the transmembrane pore constituted by the transmembrane helices of the SLATT domain. This could involve a gating mechanism controlled by the binding and/or processing of a nucleotide ligand by the SLOG domain with additional interactions via the C-terminal cytoplasmic region of the SLATT domain (Figure 2).



In the case of the LSDAT, residue conservation patterns observed through multiple sequence alignment construction suggest there is unlikely to be any enzymatic activity; however, the key residues involved in structuring the atypical Rossmannoid binding pocket shared between the SLOG-DRHyd-TIR superfamilies and the residues likely to be involved in contacting a nucleotide ligand are retained. The sum of these observations led us to predict a role for LSDAT in ligand-binding which influences TRPM membrane channel activity. A universal TRPM ligand has never been identified; instead, a range of distinct small molecule ligands have been linked to gating and regulation of the channels including ADP-ribose, cyclic ADP-ribose (cADPR), NAADP, cAMP, H2O2, and phosphatidylinositol (4,5)-bisphosphate (PIP2). ADP-ribose, cADPR and NAADP are generally believed to act via the C-terminal cytoplasmic Nudix domain found in TRPM2 channels. In light of our discovery of the LSDAT SLOG domain in all TRPM channels, it appears very likely that nucleotide-derived ligands are more generally involved in TRPM channel regulation. It remains possible that these domains recognize such a universal ligand which gates ion transport in TRPM channels.

Wednesday, January 4, 2017

Who is responsible for bad science: researchers, peer reviewers or editors?

The following is partly a case study of how research can take a wrong turn in modern molecular biology/biochemistry and partly a reflection on the sociology of the science.

In the current scientific culture it is considered a big deal to publish in particular “high profile venues”, such as the Science and the Nature magazines. We can attest based on personal experience as well as reports from our peers that publication in these venues often entails enormous difficulty for the researchers involved on account of the peer review practices at these venues. Thus, pushing a paper through to these venues can earn one a virtual badge of a survivor or even hero of a great battle. This badge confers tangible benefits in the modern scientific system: 1) The tenure decisions of scientists at many institutions are favorably influenced by such magazine-publications. 2) The scientific productivity reviews of researchers at many institutions often receive a great boost from such publications. 3) A scientist stands to get press, awards, and higher visibility (as measured by citations), or more generally “fame”, from such publications. 4) Last and perhaps most importantly in academia, such publications could be a big factor in earning grant money for continuing research.

Unfortunately, this system of incentives makes the magazine-publication an end in itself, ahead of the actual science. In the case below we illustrate how this, together with the system at the magazines, can engender bad science.

Recently a paper was published in the Science magazine entitled: “A nuclease that mediates cell death induced by DNA damage and poly(ADP-ribose) polymerase-1”. Given our interest in novel biochemistries and our long-standing investigations of self-inflicted nucleic acid damage in programmed cell death, our interest was piqued by this paper. A closer look soon revealed that the paper had problems. Briefly:

MIF, a protein with previously reported tautomerase activity, is a member of the tautomerase superfamily which does not feature nucleases. The authors suggest that MIF is a DNase by claiming a structural relationship to nucleases of the Restriction endonuclease (REase) fold, which frequently but not always contain a PD-(D/E)XK motif. They claim that MIF contains three copies of this motif implying that it contains three copies of the REase fold. However, none of this is supported by structure or sequence evidence: 1) A DALI search with the structure of MIF does not recover any REase fold structures with Z-scores suggestive of genuine relationships (Z>3); as expected it recovers several tautomerase superfamily structures. 2) The REase fold is topologically unrelated to the tautomerase fold, which emerged from an internal duplication of a simple two-strand-helix unit. 3) REase fold catalysis requires residues (not conserved in MIF) beyond the metal-coordinating acidic residues of the PD-(D/E)XK. Moreover, the aspartates and glutamates identified by the authors are often on opposite ends of different structural elements and not proximal to coordinate a metal ion. 4) These motifs, unlike the catalytic prolines of the tautomerases, are not well-conserved even among animal orthologs.

Further, the authors claim glutamate 22 to be the catalytic residue equivalent to that of the so-called Exonuclease-Endonuclease-Phosphatase (synaptojanin-like) domain (Pfam: PF03372). This fold is unrelated to both REases (so-called PD-(D/E)XK) and tautomerases. Their claim that MIF contains a “CxxCxxHx(n)C Zinc finger” (the Rad18 Zn-finger present in FAN1/KIAA1018) is also untenable as the side chains of these residues are nowhere proximal to coordinate a Zn-ion in MIF. Hence, this proposal of DNase function in MIF is based on a flawed structural hypothesis and should be viewed with utmost caution. We have explained the above in great detail in this preprint available from bioRxiv.
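
To make concrete what a purely sequence-level "motif" claim amounts to, here is a minimal Python sketch of a pattern scan. The pattern is a deliberately loose, hypothetical rendering of PD-(D/E)XK; the spacing bounds `min_gap` and `max_gap` and the function name `scan_motif` are assumptions chosen for illustration, and the toy sequence is made up. A hit from such a scan only flags a candidate stretch of residues; it says nothing about whether those residues sit in an REase fold or are positioned to coordinate a metal ion, which is why the structural points above carry the weight of the argument.

```python
import re

def scan_motif(seq: str, min_gap: int = 3, max_gap: int = 30) -> list[tuple[int, str]]:
    """Return (0-based start, matched text) for every loose PD...(D/E)XK-like hit."""
    # A lookahead is used so that overlapping candidates are all reported.
    pattern = re.compile(r"(?=(PD.{%d,%d}[DE].K))" % (min_gap, max_gap))
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq.upper())]

toy = "AAPDGGGGGGELKAA"  # made-up sequence, not MIF or any real protein
print(scan_motif(toy))
```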

Given the serious problems with this published paper we decided to resort to the standard process of post-facto scientific engagement. Within 10 days of reading the said paper we submitted a technical comment along the lines of the above-linked preprint to the Science Magazine (10/26/2016) detailing why the paper is problematic. We believe that this was important because the extraordinary claims made in the article flew in the face of the foundations of biology, i.e. the evolutionary theory, and we felt this should be made apparent to the research community. After more than a month (12/6/2016) we heard back from the editor at the Science Magazine who had handled the original article. Despite all that time spent we received no peer reviews of our technical note. Rather the editor decided that even though “your discussion of our recent paper is interesting” it was not worth publishing. However, the editor suggested that we submit a short summary of our note as an “eLetter” (which allows only 500 words; similar to the summary we provide above). These are non-peer reviewed comments that are posted below the article on the Science Magazine website at the discretion of the editor. They are neither visible in any obvious way with the article nor are they available on Pubmed, which is the standard resource used by researchers to find literature of relevance. We followed the suggestion of the editor by posting an “eLetter” (12/9/2016); eventually, the editors decided to post our comment on the magazine website (12/20/2016). Thus, almost 2 months after the original submission some form of dissenting commentary appeared on the magazine.

So what are the lessons to learn from this story regarding bad science? We see that there are three parties, each of which has to be blamed for different even if somewhat overlapping reasons:

1) The authors. They are to blame because they have utterly disregarded the foundations of biology in planning their experiments. In any other science, like physics, a researcher is unlikely to have any chance of being accepted as a serious scientific player in the community if s/he were unaware of the basic foundations of the science, like say classical mechanics, let alone publish in the Science Magazine. However, unfortunately, in biology several researchers can spend their publishing career without having any more than a sketchy grasp of the foundations of biology, i.e. the evolutionary theory, and how it is applied to study the functions of biomolecules. This is exactly what we see with this paper. The authors blithely disregard very basic principles of protein structure and sequence evolution to form their starting conjecture, which they then go on to support with wet lab experiments. Now, if the wet lab experiments did support their conjecture, then we have reason to be suspicious of the way they were done: contamination, poor controls, or even worse, some kind of unscrupulous practice. There are indeed suspicious features regarding the experimental result of nuclease activity as pointed out in our preprint.

If this were not enough, the authors responded to our eLetter on 12/24/2016 as can be seen on the Science website. This response betrays not just a lack of understanding of the evolutionary theory but even more fundamental issues. Though the authors aligned a monomer to claim the presence of the so-called “PD-(D/E)XK” motif in MIF in their original paper, now they claim that the REase fold is seen only in the trimer of the Tautomerase fold! This is not merely an about-face regarding their original inference but indicates an even deeper lack of an idea of what comprises a protein fold. In conclusion, based on this all we can say is that they are either visually challenged or lack discernment regarding elementary geometric issues such as symmetry.

2) The reviewers. The Science Magazine helpfully provides the following details regarding the review of this paper: It was handled by a single editor who had it reviewed by three full reviewers and one advisor by the single blind method. The whole process took from 11/02/2015 to 8/22/2016 with two rounds of review prior to acceptance. This is an extraordinary overkill both in terms of time spent and number of reviewers for an article that should have been rejected outright in the first round of review itself or by the advisor if such were consulted by the editor before formal review. Why did this not happen? Given the point we make above, we fear that sadly the reviewers too, like the authors, lacked the basic qualification, i.e. sound knowledge of the implications of the evolutionary theory as applied to biomolecules. Instead, as is typical of such magazine venues, it appears that the reviewers sent the authors around for almost 11 months on a wild-goose chase of doing more experiments which are utterly worthless given that the starting premise itself is flawed. What this also shows is that at magazine venues reviewers are wont to give trouble to authors for irrelevant things while not really focusing on the key scientific issues in the paper. It is a certain mentality which is sadly not uncommon among wet lab scientists, where technical issues take center-stage before asking whether the science could be meaningful (and useful) or not.

3) The Editors. Given the vastness of the field and literature and the technical expertise needed for things like sequence/structure analysis we would not blame the editor overly for missing the bad science in the original submission. Nevertheless, at venues like the magazines and high-profile journals the editors are typically former scientists themselves. Hence, we would expect that at the least they would have some quick intuition for good versus bad science. From personal experience we can say that editors at such venues often fail to see merit in genuinely good and interesting science submitted to them, while taking bad pieces like this one. Thus, we may say that the intuition for discriminating between submissions might be weak among the editors. However, even more damaging is their failure to get proper peer reviews as doorkeepers of these magazines which, as noted above, play an important role in the career of researchers. Finally, we believe that the editor’s unwillingness to properly publish a technical comment that reveals why the article amounts to bad science does even further damage to science. Although the editors did finally post an eLetter from us, as noted above, this is not visible via the regular channels of literature search. Hence, no corrective is available to go hand in hand with the original article. Thus, this move on the part of the editor strikes against the much valued “self-correcting process” that exists in science.

In conclusion, while we are not involved in any kind of design of science policy, we still have a few recommendations to make. 
  • The first relates to biology education. Modern biology education necessarily needs to go hand-in-hand with proper teaching of the evolutionary theory as it applies to biomolecules, along with the accompanying biochemistry that is needed to properly understand it. Technical skills with handling laboratory equipment and experimentation, however important, cannot be privileged over such education in the above-stated fundamentals.
  • Second, the scientific status-measuring apparatus needs to go slow on emphasizing publications in magazine-like venues as a “badge of honor”. Publications at such venues are short, thereby giving little space for detailed scientific description that helps develop a foundational argument properly and spot flawed ideas. Their peer-review system is aimed more at “causing sufficient trouble” rather than providing honest peer comments on the science.
  • Third, the members of the editorial system should come out of the echo-chambers fostered by certain “big-name researchers” or artificially constructed “hot science” and pay independent attention to the literature from diverse journals in their respective fields.
Finally, in case a reader were to think that this is an example of us making much ado about nothing or that we are sensationally drumming up an isolated case, then all we’ll say is that this is just the tip of the proverbial iceberg. Right in this venue we hope to briefly bring up more examples as and when time permits. One might also look at an earlier paper we had written on this topic. Apparently, things have not entirely changed in all those years.

Wednesday, December 7, 2016

Origins of cyclic- and oligo-nucleotides in biological conflicts from microbes to vertebrate interferons: The “CRISPR polymerase” finds a function

Recent research has pointed to fundamental ties between the emergence of conflict and the fixation of diverse polymerase activities from distinct protein folds (see earlier post). One byproduct of the emergence of several stable polymerase-based replicative systems, which depend on nucleotide transfer during nucleic acid polymer elongation, was a concomitant increase in the diversity of homologous nucleotidyltransferases (NTases) which function in the production of cyclic or oligonucleotides. The appearance of these enzymes in biological conflict contexts likely led to their selection as core components of pathways involved in conflict, with their nucleotide products contributing as signaling molecules during conflict and environmental stress response.

A recent study from our group identified a wealth of previously-unidentified systems, distributed widely across a broad swath of prokaryotic phyla, centering on just such secondary messenger nucleotide-generating NTase enzymes and their corresponding nucleotide sensor domains (click to read). In addition to this NTase-sensor pair, these conflict systems invariably contain an effector domain which is predicted to either attack a non-self-entity or initiate cell suicide. The most frequently-observed NTase embedded in these systems is a representative of the SMODS family, which is typically coupled to one of two novel sensor domains, the SAVED or AGS-C domains. 

The SMODS family belongs to the DNA polβ fold, which includes the experimentally-characterized Vibrio cholerae DncV protein which is known to generate cyclic GMP-AMP (cGAMP). Within the DNA polβ fold, the SMODS family forms a higher-order assemblage with both the eukaryotic cGAS (cGAMP synthase) and OAS (2’-5’ oligoadenylate synthase) enzymes. While the SMODS NTase was likely the “founder” NTase for these newly-identified, nucleotide-dependent conflict systems, several systems display clear evidence of displacement of the core nucleotide-generating NTase component.


In a subset of systems, we found a previously-unidentified enzyme occupying the typical SMODS position, suggesting a displacement by a novel, uncharacterized NTase domain. Careful analysis of this domain identified it as a new member of the RRM-like fold containing a “palm” domain, with a surprisingly close relationship to the catalytic domain of the CRISPR polymerases (frequently referred to as Cmr2 or Cas10). As this novel enzyme conserved the structure and sequence features necessary for NTase activity, yet lacked the N-terminal fusion to the HD phosphoesterase domain observed in the CRISPR polymerase domains, we named it the mCpol (minimal CRISPR polymerase) domain (click to read). The CRISPR polymerase NTase domain and the mCpol domain together form a higher-order assemblage of palm domain NTases with the GGDEF family of cyclic di-GMP synthetases, to the exclusion of all other families.

Strikingly, mCpol domains were consistently linked in nucleotide-dependent conflict system contexts to the CARF sensor domain. This represents a further parallel between the CRISPR polymerase and mCpol domains; previous research from our group has described enrichment of CARF domains specifically in the so-called “Type III” CRISPR systems which harbor active CRISPR polymerase and HD domain fusion proteins (click to read). mCpol- and CRISPR polymerase-centered systems can thus each be thought of as containing three core components: 1) the nucleotide synthetase component, 2) the CARF sensor component, and 3) effector domain components. In mCpol systems, effectors include various pore-forming domains and the HEPN RNase domain. In CRISPR systems, the effector takes the form of the HEPN RNase domain found C-terminally fused to the CARF domain, and might also thematically extend to include other CRISPR effectors, including interference caused by the cascade complexes.
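
The three-component view can be sketched as a small data structure, which makes the parallel between the two kinds of system easy to read off. This is only an illustration: the class name `ConflictSystem`, the helper `shared`, and the string labels are choices made for this sketch, and the component lists contain only what is named in the paragraph above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConflictSystem:
    synthetase: str       # 1) nucleotide synthetase component
    sensors: frozenset    # 2) nucleotide sensor component(s)
    effectors: frozenset  # 3) effector domain component(s)

# Components as named in the post; the labels themselves are illustrative.
MCPOL = ConflictSystem(
    "mCpol",
    frozenset({"CARF"}),
    frozenset({"pore-forming domain", "HEPN RNase"}),
)
CRISPR_POL = ConflictSystem(
    "CRISPR polymerase",
    frozenset({"CARF"}),
    frozenset({"HEPN RNase"}),
)

def shared(a: ConflictSystem, b: ConflictSystem) -> dict:
    """Return the sensor and effector components that two systems have in common."""
    return {"sensors": a.sensors & b.sensors, "effectors": a.effectors & b.effectors}

print(shared(MCPOL, CRISPR_POL))
```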

The discovery of the mCpol domain and its placement within a larger context of nucleotide-dependent conflict systems therefore offers substantive insight into the evolution and function of CRISPR systems containing the CRISPR polymerase. In evolutionary terms, it appears likely that certain CRISPR systems, including the classical Type I and III systems, emerged through combination of the more minimal mCpol-CARF (and potentially HEPN) units with other mobile elements including the RAMPs and Cas1-Cas2 dyad.

In functional terms, the CRISPR polymerase itself has long remained an enigmatic domain with regard to its possible role in the CRISPR systems, with speculation at various points ranging from a crRNA-amplifying polymerase to a template-independent terminal transferase or a cyclic nucleotide synthetase. Based on its relationship to the GGDEF synthetases and now with the evolutionary parallels to our newly-described, nucleotide intermediate-dependent conflict systems, we can say that these polymerases are likely generating (cyclic) nucleotides which are in turn sensed by accompanying CARF domains. Again by parallel to the many other described nucleotide-dependent conflict systems (click to read), this nucleotide signal is likely to activate the HEPN effector in these CRISPR systems. In light of this, the HD-phosphoesterase domain N-terminally fused to the CRISPR polymerase domain could provide a means of terminating this effector-activating signal by hydrolyzing the nucleotide, an action comparable to the cNMP phosphodiesterases with HD domains in classical cNMP signaling.

 


The discovery of these evolutionary connections and their resulting functional inferences will undoubtedly deepen experimental understanding of the endogenous regulation of different classes of CRISPR systems. Additionally, there is potential scope for these discoveries to bring improvements to biotechnological application of CRISPR systems in the lab.

Monday, December 5, 2016

Damage, conflict, repair and the early history of nucleic acid polymerases

The accumulated wisdom of sequence analysis and structural biology over the past three decades has led to the realization that the catalytic domains of all extant nucleic acid polymerases belong to just four great superfamilies which have had independent origins. The most widespread is the RRM (RNA-recognition motif)-like fold of the “palm” domains seen in DNA polymerases of superfamilies A, B and Y, reverse transcriptases, viral RNA-dependent RNA polymerases, DNA-dependent RNA polymerases of mitochondria and certain viruses (e.g. phage T7), archaeo-eukaryotic type primases, and the tRNA repair enzyme Thg1. The second most prevalent fold, the pol-β fold, is that displayed by superfamily X (e.g. pol β), bacterial PolIII-type DNA polymerases and various template-independent RNA- and DNA-polymerases, such as the CCA-adding enzymes, the polyA polymerases and the terminal transferases. Notably, both these folds also feature multiple independent innovations of synthetases for cyclic nucleotide and oligonucleotide signaling. A further innovation of polymerase activity is seen in the TOPRIM domain shared by DnaG-type primases and the majority of topoisomerases and gyrases. Last, the DNA- or RNA-templated RNA polymerases, comprising all cellular transcription enzymes, certain viral and plasmid RNA polymerases, and the small RNA-amplifying enzymes involved in the RNAi process, display two copies of the double-psi-β-barrel fold.

Two questions raised by these observations are: 1) What do the structures of the catalytic domains of these polymerases tell us about the early protein-nucleic acid world? 2) What are the implications of the repeated innovation of cyclic nucleotide signaling among the nucleic acid polymerases? First, in the case of at least three of the above folds, in addition to nucleic acid polymerase activity, we also see ancient non-metal-binding, non-catalytic versions that are likely to have just bound RNA. This suggests that the nucleotidyltransferase catalysis probably arose in the context of a more general RNA-binding activity. Thus, the proteins, which were probably at first “protective” or scaffolding partners of the ribozymes, displaced the RNAs in terms of catalysis. The presence of both template-dependent and template-independent activities in at least three of these folds suggests that, like Thg1 or the CCA-adding enzymes, their earliest activities were probably relatively generic without major participation of the protein in template recognition.

The presence of any type of code, in the simplest case in the form of a complementary template, or more elaborately in the form of multi-layered “reading” that emerged in the translation apparatus meant that: 1) there had to be safe-guards for the code against environmental insults such as chemical and radiation damage. 2) The code becomes an excellent invariant for attack by competing rival replicators. This selected a wide range of nucleic-acid- and ribosome-targeting effectors as mediators of this conflict that continues to this date at all levels of biological organization. 3) The previous two factors meant that there was strong selection for multiple nucleic-acid-repair systems. We posit that it was this process that favored the multiple originations of nucleic-acid-repair activities in several structurally distinct RNA-binding folds via emergence of key metal-coordinating residues (see Burroughs and Aravind, 2016). On one hand selection channeled some of these into bona fide replication and transcription enzymes, while on the other hand some of their paralogs remained in a simpler state closer to the ancestral enzymes, as is seen today in the catalytic domains of RNA repair enzymes like Thg1 and the CCA-adding enzymes.


Attacks on Nucleic acids, repair and the provenance of nucleotide signaling
Finally, we posit that a natural byproduct of the activities of at least some of these nucleotidyltransferases was the production of cyclic nucleotides or oligonucleotides such as oligo 2’-5’A. The generation of these in the context of attacks on nucleic acids, due to the ongoing repair activity, probably selected for them functioning as signaling molecules for both biological conflict and environmental stress. Thus, we posit that the emergence of these enzymes contributed directly not only to the structure of the core of biology, i.e. information flow in “the central dogma”, but also to the less-conserved “periphery” in the form of signaling systems. Consistent with this proposal, our studies have obtained strong evidence for a major role for nucleotide signals as mediators of counter-invader attack systems. Further support emerges from the fact that the synthetases involved in nucleotide generation in several such systems, like the CRISPR systems and several cyclic-dinucleotide and 2’-5’A-centered systems (in bacteria and animal interferon signaling), share a common origin with either the CCA-adding enzyme-like clade of pol-β family nucleotidyltransferases or Thg1-like RNA repair enzymes. For more discussions on related matters, read our latest papers [6,12].

References




Thursday, November 3, 2016

Quod erat demonstrandum? No restriction endonuclease fold in MIF

Recently a seriously flawed article was published in the Science magazine. We present an analysis pointing to the flaws in the paper. Click the following link to read more. http://biorxiv.org/content/early/2016/11/02/085258

Monday, January 4, 2016

DNA adenine methylation in eukaryotes




A video abstract of our latest study on Adenine methylation in eukaryotes published in Bioessays follows.  Click here to access the full paper.





Tuesday, December 17, 2013

PIWI domain evolution

A good deal of manuscript ink has been spilled in the study of PIWI proteins, the core catalytic engine of the RNA interference (RNAi) pathway which, among many linked functional roles, is perhaps best known for triggering post-transcriptional gene silencing in eukaryotes via binding to small RNAs, which in turn bind reverse-complementary homologous stretches in target mRNAs. Despite copious knowledge gained into almost every minute aspect of PIWI interaction with small RNA and target RNA, the evolution of PIWI and how it came to functionally occupy the central role in such a well-studied pathway remains shrouded in relative murkiness. In a recent review published by our group along with Dr. Yoshinari Ando at Johns Hopkins, we sought to clear some of the fog surrounding the natural history of the PIWI proteins (see our recent review).

PIWI PROTEIN ANATOMY
Much of this confusion surrounding the origin of PIWI stems from a profound lack of understanding of the domain architecture and the individual domains comprising the PIWI protein, and of the extent to which this is conserved in prokaryotic PIWI proteins. The core conserved architecture of eukaryotic PIWI proteins is, in order from N- to C-terminus: 1) A dyad of PIWI-N-terminal domains (PNTD1 and PNTD2). These two domains have arisen through duplication followed by a circular permutation at the N-terminus of one of the copies from an ancestral domain with 4 strands and two helices (see the review for more details). The boundaries of these two domains have been inaccurately established in several studies, resulting in two inappropriately-defined segments termed the N-terminal and Linker-1 (L1) domains. 2) These domains are followed by PAZ, an RNA-binding domain adopting an SH3-like fold which plays an important role in recognition of the 3’ end of the guide strand. 3) A conserved “linker” region (typically termed Linker-2 or L2). 4) The α/β sandwich MID domain with a Rossmannoid topology that specifically binds the 5’ end of the guide strand. 5) The PIWI catalytic domain itself, belonging to the RNase H fold, which binds the target strand and, if active, uses its metal-dependent RNase H active site to cleave target and passenger strands.

As recently recognized by the Tomari laboratory at the University of Tokyo, this core eukaryotic architecture is observed in some PIWI proteins in prokaryotes [click for ref]. However, in contrast to the strict adherence to this core architecture observed in eukaryotes, prokaryotic PIWI (pPIWI) proteins are more diverse in their architectural construction. One form of elaboration is seen in the fusion of a Sirtuin fold nuclease to the N-terminus of the standard eukaryotic architecture. Another is the potential uncharacterized N-terminal module in the newly-discovered pPIWI-RE family [click for ref] in lieu of the PNTD1/PNTD2/PAZ/L2 domains. More strikingly, pPIWI proteins commonly comprise only the L2+MID+PIWI domains. The gene encoding this protein is adjacent to another gene encoding a conserved region in a wide range of prokaryotic lineages. This domain is related to a region that was previously claimed to be a novel domain, referred to as the APAZ (Analog of PAZ) domain, fused to several prokaryotic PIWI domains; the authors of this prediction reasoned that this apparently novel domain was displacing the PAZ domain and was therefore likely functionally equivalent to PAZ [click for ref]. However, in our recently published work, we determine that this assignment of a novel domain to the so-called “APAZ” region was in error; in fact, the region consists of rather standard versions of the PNTD1 and PNTD2 domains and likely a C-terminal PAZ domain, although one defining characteristic of the PAZ domain, like other members of the SH3 barrel fold, is its tendency to diverge rapidly, preventing homology detection using even the most sensitive of methods.
Therefore we have determined that, outside the possibly distinct N-terminal module of the pPIWI-RE family, the core domain architecture established in eukaryotes is largely observed across all PIWI proteins, although in many prokaryotes this architecture is sundered into two distinct polypeptides, the first containing the PNTD1+PNTD2+PAZ domain order and the second containing the L2+MID+PIWI domain order. Within these split versions, mirroring the Sirtuin fusion to the complete core architecture mentioned above, the PNTD1+PNTD2+PAZ protein is often further fused at the N-terminus to nuclease domains derived from several distinct folds, including the Restriction Endonuclease (REase) fold, the TIR fold, and the Sirtuin fold.
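
For readers who think in code, the architectural comparison above can be sketched as a toy check that one or two polypeptides, read N- to C-terminus and ignoring an N-terminal nuclease fusion, reproduce the core domain order. This is only an illustration: the names `CORE`, `NUCLEASES`, `strip_nuclease`, and `matches_core` are hypothetical, and this is not the sequence-analysis method used in the review.

```python
# Core domain order (N- to C-terminus) as described in the text.
CORE = ("PNTD1", "PNTD2", "PAZ", "L2", "MID", "PIWI")

# N-terminal nuclease fusions named in the text (illustrative labels).
NUCLEASES = {"Sirtuin", "REase", "TIR"}


def strip_nuclease(domains):
    """Drop a single leading nuclease domain, if one is present."""
    domains = tuple(domains)
    if domains and domains[0] in NUCLEASES:
        return domains[1:]
    return domains


def matches_core(*polypeptides):
    """Return True if the polypeptides, joined in order, reproduce CORE."""
    joined = tuple(d for p in polypeptides for d in strip_nuclease(p))
    return joined == CORE


# Complete architecture, with and without an N-terminal Sirtuin fusion.
full = CORE
sirtuin_fused = ("Sirtuin",) + CORE

# Split architecture: two polypeptides, the first fused to a TIR domain.
split_n = ("TIR", "PNTD1", "PNTD2", "PAZ")
split_c = ("L2", "MID", "PIWI")
```

Under these assumptions, `matches_core(full)`, `matches_core(sirtuin_fused)`, and `matches_core(split_n, split_c)` all hold, whereas the L2+MID+PIWI protein on its own does not reproduce the core order.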

Delineation of the PNTD1/2 domain duplication event, together with establishment of its deep prokaryotic roots and early adoption into the core PIWI protein architecture, clarifies functional roles recently attributed to the N-terminal region of PIWI proteins: namely, melting of the double-stranded RNA duplexes formed during PIWI loading and after target binding, and prevention of duplex propagation. Introduction of the duplicated PNTD1/2 domains into the core PIWI architecture assisted in the formation of an extended channel, shaping an inbuilt and ancestral switch that allows the RNase H domain of the PIWI proteins to catalyze cleavage only when the former domains establish an appropriate interface with the bound nucleic acids. Together with the MID domain, which recognizes the opposite small RNA terminus, PNTD1/2 appears to have been the primary evolutionary constraint on the characteristic modal length of the small RNAs deployed in RNAi.

With the preceding information in hand, we can begin to trace the evolutionary trajectory of PIWI domain architecture. The RNase H-like PIWI domain is most closely related to the UvrC/Endonuclease V clade of RNase H domains, with UvrC highly conserved across bacteria and EndoV conserved across eukaryotes and archaea. This suggests that at least a single copy of this RNase H clade was present in the Last Universal Common Ancestor (LUCA). Both the UvrC and EndoV domains are endoDNases and not RNases. The relatively sporadic and limited distribution of known pPIWI domains suggests they likely emerged from one of these two more broadly distributed lineages later in prokaryotic evolution, followed by dispersal across a diverse range of prokaryotes via horizontal gene transfer. This process appears to have resulted in a shift from DNA duplex to RNA-DNA hybrid duplex specificity. Emergence of the RNase H PIWI domain likely coincided with direct association with the MID domain, which descended from an unknown Rossmannoid fold precursor. This core pairing is observed in the pPIWI-RE family, which likely represents the most ancestral extant version of the PIWI domain. As the nature of the N-terminal domains fused to pPIWI-RE remains opaque, the exact timing of the association with the PNTD1/PNTD2/PAZ module (as well as the L2 domain) remains unclear, possibly occurring with the emergence of pPIWI-RE or prior to the divergence of the class I and class II divisions of the classical pPIWI proteins. The eukaryotic PIWI protein was thus necessarily inherited from the class II division, given the core domain architecture shared between eukaryotes and class II in contrast to the sundered architecture in class I.


FUNCTIONAL SHIFTS IN PIWI EVOLUTION
Applying genome contextual information, in the form of conserved operon associations, to the above evolutionary framework throws considerable light on the functional shifts that occurred during PIWI evolution. The most basal pPIWI lineage, the pPIWI-RE family, is contained within a three-gene island additionally encoding both a helicase and a REase DNase, strongly suggesting the pPIWI-RE family functions as a plasmid/phage defense system (see post below for more details on pPIWI-RE). In our work, we find evidence supporting similar functional roles for classic pPIWI protein families in the form of strong, family-specific genome linkages to endoDNases of various distinct folds. These associations were observed in all branches of the class I division and at least two branches in the class II division. Recent small RNA profiling in the bacterium Rhodobacter sphaeroides found that hybrid RNA-DNA duplexes associate with pPIWI and play a role in plasmid silencing. This R. sphaeroides pPIWI protein belongs to a class II family associating with a DNA REase domain, further supporting a role for many classical pPIWI protein families in phage/plasmid restriction and drawing a straight line from the predicted function in the pPIWI-RE family to the classical pPIWI families. Thus, ancestrally the pPIWI domains appear to have functioned in the context of RNA-guided restriction of invasive DNA by endoDNases.

We also observed additional contextual associations in the class II division: 1) At least two families have been recruited to previously unrecognized/uncharacterized CRISPR systems. The CRISPR moniker refers to a collection of phage restriction systems following a similar mode of action: incorporating fragments of phage genomes into genomic loci, transcribing these fragments, and using the transcripts as guide RNA to attack the DNA (and in some cases, the RNA) of infecting agents. Despite functional similarities, the protein components comprising these systems are astonishingly diverse, incorporating several distinct nucleases and RNA-binding domains [http://www.biologydirect.com/content/6/1/38]. Our review is the first to link CRISPR-like systems with pPIWI; these systems are notable for their lack of any known processing RNase, suggesting the pPIWI domain functions in processing and utilizing CRISPR RNAs during the phage targeting step. 2) One pPIWI family associates with an endoRNase HEPN domain [click for ref]. 3) One family conspicuously lacks any conserved association with other domains. Strikingly, the pPIWI proteins in this family share the strongest sequence affinity with the eukaryotic PIWI proteins. As the earliest eukaryotic PIWI proteins were clearly recruited to RNA-targeting systems, it appears possible that the shift from DNA targeting to RNA targeting may have actually occurred first in prokaryotes and, given the HEPN connection, possibly on multiple independent occasions.

ADOPTION OF PIWI AS THE CENTRAL COMPONENT OF EUKARYOTIC RNAi AMIDST THE RNA MILIEU OF EARLY EUKARYOTES
As part of our review, we compared small RNA data across diverse eukaryotic lineages and identified three sources of small RNA potentially utilized by the earliest-emerging iterations of eukaryotic RNAi systems: small RNA derived from 1) overlapping sites of sense-antisense transcription, 2) genomically encoded, independently transcribed hairpin sequences, and 3) double-stranded sections from larger, non-coding RNA entities (including snoRNA, tRNA, etc.). Surprisingly, the most broadly distributed and ancestral of these three sources appears to be sense-antisense transcriptional sites. Thus, it appears possible that the earliest PIWI-centered RNAi systems in eukaryotes may have acquired substrates from sense-antisense transcription. This dovetails nicely with recent research on RNA expression indicating that bacteria are engulfed in a transcriptional landscape consisting of such sense-antisense RNA transcriptional products [click for ref], a condition likely mirrored in the eukaryotic stem lineage.

While the above neatly explains both the architectural inheritance and functional shifts taking place during PIWI evolution, it fails to address the logic behind selection of the PIWI domain as the central catalytic component of eukaryotic RNAi. After all, prokaryotes possess their own widespread, well-elaborated RNA-based interference/restriction system: the aforementioned CRISPR/Cas system, in addition to the less frequently observed pPIWI-dependent systems. Why rebuild an RNAi system from scratch, in the process selecting a central component from a relatively infrequently utilized restriction system? A possible answer to this question is observed in the loss of several other multigene defense systems during the prokaryote-eukaryote transition, such as the classic restriction-modification (R-M) systems, the Pgl system, and toxin–antitoxin systems. All of these systems are themselves mobile, selfish elements that appear to depend on strong genomic linkage (i.e. existence of operons) for the physical assembly of their products and for neutralization of their toxic components via the linkage of transcription and translation in prokaryotes. The emergence of the nucleus in eukaryotes, with the resulting breakdown of transcription–translation coupling, rendered such systems incapable of survival owing to the potential danger of the toxic restriction components to the cell. Indeed, expression of CRISPR/Cas systems (e.g., Type II systems) in eukaryotes with appropriate RNA guides introduces double-strand breaks in DNA with serious mutagenic consequences. The eukaryotic RNAi system therefore appears to have been rebuilt by elaboration around a core formed by the simpler prokaryotic pPIWI-based systems, specifically those that did not have strong operonic linkages with DNA-targeting components.

The Cas9-containing CRISPR systems, which are thematically similar in combining an RNase H domain with a restriction system-like HNH domain inserted into the former, have recently proven to be raging successes as biotechnological reagents for gene disruption [click for ref]. In light of this, it might be useful to explore the diverse range of pPIWI-guided restriction systems as potential biotechnological reagents for similar purposes.