Please contact bowen.yang@huskers.unl.edu if you have any questions regarding any of the files within this folder (Created Mar 2020).

This folder contains all of the data that AcrFinder uses to search for novel Anti-CRISPR loci. This README file explains the origin, purpose, and format of all files in this folder:

# Version 1.0 release
# 03/01/2020

1. AcrFinder_AcaDB.faa
This file contains all 4591 potential Aca protein sequences identified using the guilt-by-association approach. The potential Aca proteins were identified from:
1) A total of 171355 bacterial genomes and 961 archaeal genomes from the RefSeq database.
2) All viral contigs in the HuVirDB, GVD, and IMG databases.

The records for the 4591 Aca fasta sequences are formatted as follows:
Example:
>AcrIE4-IF7-0-WP_064584003.1|HTH_3 WP_064584003.1 hypothetical protein [Pseudomonas citronellolis]

In the example above, the sequence ID is: AcrIE4-IF7-0-WP_064584003.1|HTH_3; the sequence description is: WP_064584003.1 hypothetical protein [Pseudomonas citronellolis].

The sequence ID reflects the reason this protein was selected as a potential Aca. It contains several layers of information: 1st, how far this potential Aca protein is from a potential Acr protein; 2nd, what type of HTH domain this potential Aca contains; 3rd, which known Acrs the potential Acr protein is homologous to, if any.

In our example above, the sequence ID is AcrIE4-IF7-0-WP_064584003.1|HTH_3. It indicates that protein WP_064584003.1 is downstream of an AcrIE4-IF7 homolog, and that there are 0 genes in between the AcrIE4-IF7 homolog and the potential Aca. The potential Aca WP_064584003.1 contains an HTH_3 domain.

Another example is Rampelli_10002_NODE_590_length_11027_cov_2.771054_8|HTH_XRE|AcrIIA1.
The sequence ID indicates that the protein Rampelli_10002_NODE_590_length_11027_cov_2.771054_8 has an HTH_XRE domain and has no Acr homolog genes close to it. However, this potential Aca gene is itself a homolog of AcrIIA1, which means that this potential Aca may be able to function as both an Acr and an Aca. This dual function has been experimentally validated for the AcrIIA1 protein and is biologically possible for this potential Aca.

Sequence IDs that contain "Aca", or that are in any format other than those described above, are published Acas.
Example: AcrIF10-Aca2|WP_037415913.1 indicates that protein WP_037415913.1 is a published Aca2 and that this Aca2 was found downstream of an AcrIF10.
Example: AcrIF11|Aca5|WP_050101207.1 indicates that protein WP_050101207.1 is an Aca5 found next to AcrIF11 in the original paper.
If only "Aca" is given rather than a specific Aca type, the discovered Aca was not found to belong to any currently known Aca type.
Another example is "AcrIIA5-AcrIIA1|NP_695135.1", which indicates that Aca protein NP_695135.1 is an AcrIIA1 protein (this protein can function as both an Acr and an Aca) and that it was found downstream of AcrIIA5.

2. AcrFinder_AcaDB_info.tsv
This table contains information on all 4591 potential Aca proteins in AcrFinder_AcaDB.faa.
1) For potential Acas identified in the RefSeq database, the fields of the table represent the following 4 columns:
GCF# Protein_ID Sequence_ID_in_AcaDB Species_Info
2) For potential Acas identified in the IMG database, the fields of the table represent the following 24 columns:
Contig# Protein_ID Sequence_ID_in_AcaDB UViG TAXON_OID Scaffold_ID VIRAL_CLUSTERS Ecosystem Ecosystem_Category Ecosystem_Type Ecosystem_Subtype Habitat perc_VPF Host Host_detection Host_domain Estimated_completeness Quality Predicted_genome_size Rationale_for_predicted_genome_size POGs_ORDER POGs_FAMILY POGs_SUBFAMILY POGs_GENUS putative_retrovirus
3) For potential Acas identified in the HuVirDB and GVD databases, the fields of the table represent the following 3 columns:
Contig# Protein_ID Sequence_ID_in_AcaDB

3. Known_AcrDB.faa
This fasta file contains the protein sequences of the 57 currently experimentally validated Acrs collected from https://tinyurl.com/anti-CRISPR as of March 2020.

4. Known_AcrDB.xlsx
This Excel file contains detailed information about the 57 experimentally validated Acrs.

The files below are not used by AcrFinder, but contain Acr homologs identified using published Acrs.

5. Prokaryote_AcrHomolog.faa
Contains all Acr homolog sequences identified by a DIAMOND homology search against the published Acrs in the bacterial and archaeal GCFs in the RefSeq database. All sequences have a coverage greater than 80% and an e-value < 0.01 to the experimentally validated Acrs.

6. Prokaryote_AcrHomolog_info.csv
Contains information on the sequences in the Prokaryote_AcrHomolog.faa file.
Table columns are as follows:
GCF#, ProteinID, best_AcrType, evalue, coverage, Species_info

7. Viruses_AcrHomolog.faa
Contains all Acr homolog sequences identified by a DIAMOND homology search against the published Acrs using the viral contigs in the GVD, IMG_VR, and HuVirDB databases combined. All sequences have a coverage greater than 80% and an e-value < 0.01 to the experimentally validated Acrs.

8. Viruses_AcrHomolog_info.csv
Contains information on the sequences in the Viruses_AcrHomolog.faa file.
Table columns are as follows:
1) For proteins in the HuVirDB and GVD databases:
Contig#, ProteinID, best_AcrType, evalue, coverage
2) For proteins in the IMG_VR database:
Contig#, ProteinID, best_AcrType, evalue, coverage, TAXON_OID, Scaffold_ID, VIRAL_CLUSTERS, Ecosystem, Ecosystem_Category, Ecosystem_Type, Ecosystem_Subtype, Habitat, perc_VPF, Host, Host_detection, Host_domain, Estimated_completeness, Quality, Predicted_genome_size, Rationale_for_predicted_genome_size, POGs_ORDER, POGs_FAMILY, POGs_SUBFAMILY, POGs_GENUS, putative_retrovirus

# Version 2.0 release
# 06/08/2020
# All files of previous version releases will have a "V#" following the corresponding file names. The latest version of a file will not have a version name following the actual file name.

**Version 2.0 updates:
1. Known_AcrDB.faa
Added new protein sequences: AcrVIA1; AcrVIA2; AcrVIA3; AcrVIA4; AcrVIA5; AcrVIA6; AcrVIA7; AcrIB; AcrVIA1(Lse)
This fasta file contains the protein sequences of the 66 currently experimentally validated Acrs collected from https://tinyurl.com/anti-CRISPR as of May 2020.

2. Known_AcrDB.xlsx
Added new Anti-CRISPR sequence information: AcrVIA1; AcrVIA2; AcrVIA3; AcrVIA4; AcrVIA5; AcrVIA6; AcrVIA7; AcrIB; AcrVIA1(Lse)
This Excel file contains detailed information about the 66 experimentally validated Acrs.
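
Appendix: reading the sequence IDs in AcrFinder_AcaDB.faa (illustrative sketch)
The sequence ID layout described under Version 1.0, item 1 can be split into its parts with a few lines of Python. The sketch below is not part of AcrFinder. The names parse_aca_header and AcaId are hypothetical, and the layouts it assumes (<Acr homolog>-<genes in between>-<protein ID>|<HTH domain>, optionally followed by |<Acr homolog of the protein itself>) are inferred only from the examples in this README. IDs that contain "Aca" or do not fit these layouts are reported as published Acas, as described above. Protein or contig IDs that themselves contain "Aca", or a -<number>- pattern, could be misread.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Assumed layout of the neighbour-based prefix (inferred from the README examples):
#   <Acr homolog>-<number of genes in between>-<protein ID>
_NEIGHBOUR = re.compile(r"^(?P<acr>Acr.+)-(?P<gap>\d+)-(?P<protein>.+)$")


class AcaId(NamedTuple):
    """Hypothetical container for the parts of one sequence ID."""
    kind: str                               # "predicted" or "published"
    protein_id: Optional[str] = None
    hth_domain: Optional[str] = None
    neighbour_acr: Optional[str] = None     # Acr homolog found next to the Aca
    genes_between: Optional[int] = None     # genes between that Acr and the Aca
    self_acr_homolog: Optional[str] = None  # Acr the Aca itself is homologous to


def parse_aca_header(header: str) -> Tuple[AcaId, str]:
    """Split one FASTA header into (parsed sequence ID, description)."""
    seq_id, _, description = header.lstrip(">").strip().partition(" ")
    fields = seq_id.split("|")

    # Per the README, IDs containing "Aca" or using any other layout
    # are published Acas; they are not broken down further here.
    if "Aca" in seq_id or len(fields) not in (2, 3):
        return AcaId("published"), description

    match = _NEIGHBOUR.match(fields[0])
    self_acr = fields[2] if len(fields) == 3 else None
    if match is None and self_acr is None:
        return AcaId("published"), description

    parsed = AcaId(
        kind="predicted",
        protein_id=match["protein"] if match else fields[0],
        hth_domain=fields[1],
        neighbour_acr=match["acr"] if match else None,
        genes_between=int(match["gap"]) if match else None,
        self_acr_homolog=self_acr,
    )
    return parsed, description


if __name__ == "__main__":
    example = (">AcrIE4-IF7-0-WP_064584003.1|HTH_3 WP_064584003.1 "
               "hypothetical protein [Pseudomonas citronellolis]")
    parsed, description = parse_aca_header(example)
    print(parsed)
    print(description)
```

The "Aca" check runs first because the published IDs shown above use several different layouts (for example AcrIF10-Aca2|WP_037415913.1 and AcrIF11|Aca5|WP_050101207.1), so the sketch only labels them instead of guessing at their fields.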