# dbCAN release 4.0
# 07/20/2015
# total 345 CAZyme HMMs
# data based on CAZyDB released on 03/17/2015
# New family models: CBM68, CBM69, CBM70, CBM71, GH133, GT95, GT96, GT97, PL23, AA11, AA12, AA13
# questions/comments to Yanbin Yin: yanbin.yin@gmail.com

# dbCAN release 3.0
# 05/11/2013
# total 333 CAZyme HMMs
# data based on CAZyDB released on 03/22/2013
# New family models in v3 compared with v2: CBM65, CBM66, CBM67, GH131, GH132, AA1, AA2, AA3, AA4, AA5, AA6, AA7, AA8, AA9, AA10
# Removed family models in v3: GH61 (now AA9), CBM33 (now AA10)

# dbCAN release 2.0
# 06/06/2012
# total 320 CAZyme HMMs
# data based on CAZyDB released on 01/09/2012
# New family models in v2 compared with v1: CBM63, CBM64, GT93, GT94, GH126, GH127, GH128, GH129, GH130
# Updated family model in v2: GH101

# dbCAN release 1.0
# 08/23/2011
# data based on CAZyDB released on 03/22/2011

dbCAN-fam-HMMs.txt, same as dbCAN-fam-HMMs.txt.v4: HMMs for 345 dbCAN families (342 CAZyme families + 3 cellulosome modules)
dbCAN-fam-HMMs.txt.v3: HMMs for 333 dbCAN families (330 CAZyme families + 3 cellulosome modules)
dbCAN-fam-HMMs.txt.v2: HMMs for 320 dbCAN families (317 CAZyme families + 3 cellulosome modules)
dbCAN-fam-HMMs.txt.v1: HMMs for 311 dbCAN families (308 CAZyme families + 3 cellulosome modules)

** To run dbCAN CAZyme annotation on your local Linux computer, do the following:
** 1. download dbCAN-fam-HMMs.txt and hmmscan-parser.sh
** 2. download the HMMER 3.0 package [hmmer.org] and install it properly
** 3. format the HMM db: hmmpress dbCAN-fam-HMMs.txt
** 4. run: hmmscan --domtblout yourfile.out.dm dbCAN-fam-HMMs.txt yourfile > yourfile.out
** 5. run: sh hmmscan-parser.sh yourfile.out.dm > yourfile.out.dm.ps
      (if alignment > 80aa, use E-value < 1e-5, otherwise use E-value < 1e-3; covered fraction of HMM > 0.3)

Cols in yourfile.out.dm.ps:
 1. Family HMM
 2. HMM length
 3. Query ID
 4. Query length
 5. E-value (how similar to the family HMM)
 6. HMM start
 7. HMM end
 8. Query start
 9. Query end
 10. Coverage

** To decide which E-value and coverage cutoffs you should use, we have done some evaluation analyses using Arabidopsis, rice, Aspergillus nidulans FGSC A4, Saccharomyces cerevisiae S288c, Escherichia coli K-12 MG1655, Clostridium thermocellum ATCC 27405 and Anaerocellum thermophilum DSM 6725. Our suggestion is:
**    for plants, use E-value < 1e-23 and coverage > 0.2
**    for bacteria, use E-value < 1e-18 and coverage > 0.35
**    for fungi, use E-value < 1e-17 and coverage > 0.45

** We have also performed the evaluation for the five CAZyme classes separately, which suggests that the best threshold varies between CAZyme classes (please see http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4132414/ for details). Basically, (i) to annotate GH proteins, one should use a very relaxed coverage cutoff or the sensitivity will be low (Supplementary Tables S4 and S9); (ii) to annotate CE families, a very stringent E-value cutoff and coverage cutoff should be used, otherwise the precision will be very low due to a very high false positive rate (Supplementary Tables S5 and S10).

dbCAN-all-domains.txt: Signature domain sequences for all proteins in dbCAN (E-value < 1)
dbCAN-all-domains.txt.1e-5: Signature domain sequences for all proteins in dbCAN (if alignment > 80aa, use E-value < 1e-5, otherwise use E-value < 1e-3)
dbCAN-all-domains-metagenome-only.txt
dbCAN-all-domains-example.txt: the first 100 rows in dbCAN-all-domains.txt (the full file is too big to open in Windows)
Columns:
 1. Family
 2. Seq ID
 3. Source DB (13 data sources)
 4. Source accession number
 5. Domain position
 6. E-value (how similar to the family HMM)
 7. Domain sequences

dbCAN-subfam-info.txt: Sub-classifications of sequences in dbCAN-all-domains.txt (sequences share >40% identity within each subfamily)
dbCAN-subfam-info-example.txt: the first 100 rows in dbCAN-subfam-info.txt (the full file is too big to open in Windows)
Columns:
 1. Family
 2. Seq ID (same as column #2 in dbCAN-all-domains.txt)
 3. Identity (how similar to the representative sequence, 100 if itself)
 4. Seq ID of the representative sequence
 5. Subfamily ID

CAZyDB-ec-info.txt: EC numbers of CAZyDB proteins
Columns:
 1. EC number
 2. GenBank accession number
 3. ENZYME description

CAZyDB-phylogeny.tar.gz: Phylogenies of 308 CAZy families with signature domain sequences from CAZyDB proteins
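
The suggested organism-specific cutoffs for yourfile.out.dm.ps can be applied with a few lines of Python. The following is only a minimal sketch and not part of the dbCAN distribution: the names Hit, parse_hits, filter_hits and SUGGESTED_CUTOFFS are hypothetical, and it assumes that each row of yourfile.out.dm.ps holds the 10 documented columns separated by whitespace. Only the numeric cutoffs are taken from this README.

```python
from typing import Iterable, Iterator, List, NamedTuple

# Cutoffs quoted from this README: (maximum E-value, minimum coverage).
SUGGESTED_CUTOFFS = {
    "plants": (1e-23, 0.2),
    "bacteria": (1e-18, 0.35),
    "fungi": (1e-17, 0.45),
}


class Hit(NamedTuple):
    """One row of yourfile.out.dm.ps, in the documented column order."""
    family: str
    hmm_length: int
    query_id: str
    query_length: int
    evalue: float
    hmm_start: int
    hmm_end: int
    query_start: int
    query_end: int
    coverage: float


def parse_hits(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield a Hit for every line that has exactly 10 fields.

    Assumption: fields are whitespace-separated and IDs contain no spaces.
    Blank or malformed lines are skipped.
    """
    for line in lines:
        f = line.split()
        if len(f) != 10:
            continue
        yield Hit(f[0], int(f[1]), f[2], int(f[3]), float(f[4]),
                  int(f[5]), int(f[6]), int(f[7]), int(f[8]), float(f[9]))


def filter_hits(hits: Iterable[Hit], group: str) -> List[Hit]:
    """Keep hits with E-value below and coverage above the cutoffs for group."""
    max_evalue, min_coverage = SUGGESTED_CUTOFFS[group]
    return [h for h in hits
            if h.evalue < max_evalue and h.coverage > min_coverage]


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name):
    #   python filter_hits.py yourfile.out.dm.ps bacteria
    path, group = sys.argv[1], sys.argv[2]
    with open(path) as handle:
        for hit in filter_hits(parse_hits(handle), group):
            print("\t".join(str(value) for value in hit))
```

The cutoffs are kept in a single table so that class-specific values (for example a relaxed coverage cutoff for GH, or stricter cutoffs for CE) could be added in the same way once chosen from the cited evaluation.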