AcrFinder: genome mining anti-CRISPR operons in prokaryotes and their viruses



What are Acr and Aca, and why are anti-CRISPRs significant?

Acr (anti-CRISPR) proteins were first discovered in 2013 in Pseudomonas phages and prophages (PMID: 23242138). Acr-encoding genes often form operons with putative transcription regulator genes that encode Aca (Acr-associated) proteins (PMID: 31474367). These short Acr proteins (< 200 aa) are made by phages and other mobile genetic elements to inhibit the CRISPR-Cas systems of their hosts. Acrs are therefore "naturally occurring off-switches" of CRISPR-Cas, with great potential to serve as modulators of CRISPR-Cas genome editing tools for more controllable genome engineering (e.g., PMID: 30377362).

Methods for new Acr identification in the literature

As of May 2020, 65 Acr proteins have been experimentally characterized (see here and here), but most have no sequence homologs beyond the species level and no conserved Pfam domains (PMID: 30208287). Aca proteins are more conserved, and all of them have a helix-turn-helix (HTH) DNA-binding domain. Searching for the HTH domains of the more conserved Aca proteins and then using the gene neighborhood to probe for new Acrs has therefore proven very successful; this is known as the guilt-by-association (GBA) approach (reviewed in PMID: 29062071, PMID: 30208287, and others). Additionally, the self-targeting idea (first proposed by Rauch et al. 2017), i.e., a bacterial genome carrying both CRISPR spacers and their targets (protospacers), has also been applied to the search for new Acrs (e.g., PMID: 30190307). We recently published a bioinformatics data-mining study of putative Acr-Aca loci in 75,000 bacterial genomes that combined sequence homology search, GBA, and self-targeting approaches (PMID: 31506266). This pipeline found all the published/characterized Acr-Aca loci and therefore has a recall of 100%. Precision cannot be calculated because no true-negative Acr-Aca dataset is available. AcrFinder describes a bioinformatics workflow rather than a predictive algorithm.
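The guilt-by-association idea can be illustrated with a minimal Python sketch. It is not the AcrFinder implementation: the `Gene` record, the function name `gba_candidates`, the 150 bp intergenic gap, and the three-gene window are all hypothetical choices made for illustration. Only the 200 aa length cutoff echoes the "short Acr proteins" description above.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    gene_id: str
    start: int      # 1-based start coordinate on the contig
    end: int        # inclusive end coordinate
    strand: str     # "+" or "-"
    length_aa: int  # protein length in amino acids
    has_hth: bool   # True if an HTH domain was detected (Aca-like)


def gba_candidates(genes, max_acr_len=200, max_gap=150, window=3):
    """Map each HTH-containing gene to short, same-strand, closely spaced
    neighbours. All thresholds are illustrative assumptions."""
    genes = sorted(genes, key=lambda g: g.start)
    hits = {}
    for i, anchor in enumerate(genes):
        if not anchor.has_hth:
            continue
        neighbours = []
        for step in (-1, 1):  # walk upstream, then downstream
            prev, j, taken = anchor, i + step, 0
            while 0 <= j < len(genes) and taken < window:
                cand = genes[j]
                gap = prev.start - cand.end if step == -1 else cand.start - prev.end
                if (cand.strand != anchor.strand or gap > max_gap
                        or cand.length_aa >= max_acr_len):
                    break  # the operon-like run ends here
                neighbours.append(cand.gene_id)
                prev, j, taken = cand, j + step, taken + 1
        if neighbours:
            hits[anchor.gene_id] = sorted(neighbours)
    return hits


# Toy data (made up): g3 carries the HTH domain; g1 and g2 are short neighbours.
example_genes = [
    Gene("g1", 100, 400, "+", 100, False),
    Gene("g2", 450, 800, "+", 116, False),
    Gene("g3", 820, 1100, "+", 93, True),
    Gene("g4", 2000, 3500, "+", 500, False),
    Gene("g5", 3600, 3900, "-", 100, False),
]
gba_result = gba_candidates(example_genes)
print(gba_result)
```

In this toy example the distant, long gene g4 stops the downstream walk, so only the two short upstream genes are reported as candidate Acrs next to the Aca-like gene g3.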

Fig.1 - guilt-by-association (GBA) approach (source: PMID:30208287 [Stanley SY and Maxwell KL, 2018] - fig2b)
Fig.2 - self-targeting idea (source: PMID:30208287 [Stanley SY and Maxwell KL, 2018] - fig2c)
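
The self-targeting idea can likewise be sketched in a few lines of Python. The sketch assumes exact string matching, whereas real analyses typically use an alignment tool and tolerate mismatches. The function names, the toy sequences, and the array coordinates are all hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def find_self_targets(genome, spacers, array_regions):
    """Report exact spacer matches that lie outside the CRISPR arrays.

    genome        : DNA sequence as a string
    spacers       : dict mapping spacer name -> spacer sequence
    array_regions : list of (start, end) 0-based half-open array intervals
    Returns a list of (spacer_name, position, strand) tuples.
    """
    hits = []
    for name, spacer in spacers.items():
        for strand, query in (("+", spacer), ("-", reverse_complement(spacer))):
            pos = genome.find(query)
            while pos != -1:
                end = pos + len(query)
                in_array = any(s <= pos and end <= e for s, e in array_regions)
                if not in_array:  # a match outside the array is a protospacer
                    hits.append((name, pos, strand))
                pos = genome.find(query, pos + 1)
    return hits


# Toy genome (made up): one array holding the spacer, plus a protospacer
# on the opposite strand further downstream.
spacer = "ACGTTGCAAGGCTA"
prefix = "TTTTT"
array = "GGGG" + spacer + "GGGG"
filler = "CCCCCCCC"
genome = prefix + array + filler + reverse_complement(spacer) + "AAAAA"
array_regions = [(len(prefix), len(prefix) + len(array))]

self_targets = find_self_targets(genome, {"sp1": spacer}, array_regions)
print(self_targets)
```

The copy of the spacer inside the array itself is ignored; only the reverse-strand match downstream is reported as a self-target.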

Similar web resources

The study of anti-CRISPRs is a very young and rapidly growing research field (PMID: 30309933). Before 2020, no web server or standalone tool had been published to predict Acrs from a protein or DNA sequence file. Since March 2020, however, four tools have been published in peer-reviewed journals or on bioRxiv. There are also related resources; please see Menu -> Links. For example, the anti-CRISPRDB (PMID: 29036676) collects experimentally characterized Acr proteins and their homologs and presents them on the web.

AcrFinder input and output

Genome sequences in fna, gff and faa formats are taken as input. A single fna file is also acceptable as input; in that case, the gff and faa files are generated by running Prodigal (PMID: 22796954). The AcrFinder standalone program outputs a folder containing two files and three sub-folders. The two files contain the homology-based and GBA-based Acr-Aca search results. The three sub-folders contain: (i) the input files; (ii) the CRISPRCasFinder (PMID: 29790974) result files; (iii) all the intermediate result files. The computational workflow is described at https://github.com/HaidYi/acrfinder#workflow; it is a modified version of the bioinformatics pipeline reported in our recent paper (PMID: 31506266). This pipeline does not simply chain others’ tools. It is a workflow that processes the gff and faa files to extract genomic operons and examine their gene neighborhoods, and it includes multiple steps of complex data filtering based on sequence features of known Acr-Aca loci.
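
As a rough illustration of what "extracting genomic operons" from a gff file can involve, the sketch below parses CDS records and groups consecutive same-strand genes separated by short intergenic gaps. It is not the AcrFinder code: the function names and the 150 bp gap threshold are assumptions, and the actual filtering steps are those described in the workflow link above.

```python
def parse_cds(gff_text):
    """Extract (contig, start, end, strand) tuples for CDS features."""
    genes = []
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF comments/directives
        cols = line.split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        genes.append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return genes


def group_loci(genes, max_gap=150):
    """Group genes into runs on the same contig and strand whose
    intergenic distance is at most max_gap (an assumed threshold)."""
    loci = []
    for gene in sorted(genes):
        contig, start, _end, strand = gene
        if loci:
            p_contig, _p_start, p_end, p_strand = loci[-1][-1]
            if (contig == p_contig and strand == p_strand
                    and start - p_end <= max_gap):
                loci[-1].append(gene)
                continue
        loci.append([gene])
    return loci


# Toy GFF (made up), in the column layout Prodigal writes with "-f gff".
example_gff = (
    "##gff-version 3\n"
    "c1\tProdigal\tCDS\t100\t400\t.\t+\t0\tID=1_1\n"
    "c1\tProdigal\tCDS\t450\t800\t.\t+\t0\tID=1_2\n"
    "c1\tProdigal\tCDS\t1500\t1900\t.\t+\t0\tID=1_3\n"
    "c1\tProdigal\tCDS\t1950\t2300\t.\t-\t0\tID=1_4\n"
)
loci = group_loci(parse_cds(example_gff))
for locus in loci:
    print(locus)
```

Here the first two genes form one candidate locus, while the third and fourth are split off by a long gap and a strand change, respectively.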

Intro to AcrFinder website

This website is free and open to all users, and there is no login requirement. The job submission page lets users try out the sample data: one bacterial genome and one viral genome. A help page provides detailed instructions on how to use the web server, particularly on how to interpret the data in the result page. A typical bacterial genome submission is expected to finish within 2 minutes. A result web link and a job ID are provided while the job is running. The result page has data tables that show the member genes of the identified Acr-Aca loci, along with their genomic positions, strand, sequence, and length, and whether they are adjacent to mobile genetic elements, match known Acr or Aca proteins, or are adjacent to self-targeting CRISPR spacers. JBrowse is used to graphically display the gene neighborhood. See the Help page for a detailed description of the webpages.