# dbCAN release 5.0
# 07/24/2016
# total 360 CAZyme HMMs
# data based on CAZyDB released on 07/15/2016
# New family models: CBM72, CBM73, CBM74, CBM75, CBM76, CBM77, CBM78, CBM79, CBM80, GH134, GH135, GT98, GT99, PL24, GT2_Cellulose_synt
# questions/comments to Yanbin Yin: yanbin.yin@gmail.com

# dbCAN release 4.0
# 07/20/2015
# total 345 CAZyme HMMs
# data based on CAZyDB released on 03/17/2015
# New family models: CBM68, CBM69, CBM70, CBM71, GH133, GT95, GT96, GT97, PL23, AA11, AA12, AA13
# questions/comments to Yanbin Yin: yanbin.yin@gmail.com

# dbCAN release 3.0
# 05/11/2013
# total 333 CAZyme HMMs
# data based on CAZyDB released on 03/22/2013
# New family models in v3 compared with v2: CBM65, CBM66, CBM67, GH131, GH132, AA1, AA2, AA3, AA4, AA5, AA6, AA7, AA8, AA9, AA10
# Removed family models in v3: GH61 (now AA9), CBM33 (now AA10)

# dbCAN release 2.0
# 06/06/2012
# total 320 CAZyme HMMs
# data based on CAZyDB released on 01/09/2012
# New family models in v2 compared with v1: CBM63, CBM64, GT93, GT94, GH126, GH127, GH128, GH129, GH130
# Updated family model in v2: GH101

# dbCAN release 1.0
# 08/23/2011
# data based on CAZyDB released on 03/22/2011

dbCAN-fam-HMMs.txt, same as dbCAN-fam-HMMs.txt.v5: HMMs for 360 dbCAN families (356 CAZyme families + GT2_Cellulose_synt[PF03552.11] + 3 cellulosome modules)
dbCAN-fam-HMMs.txt.v4: HMMs for 345 dbCAN families (342 CAZyme families + 3 cellulosome modules)
dbCAN-fam-HMMs.txt.v3: HMMs for 333 dbCAN families (330 CAZyme families + 3 cellulosome modules)
dbCAN-fam-HMMs.txt.v2: HMMs for 320 dbCAN families (317 CAZyme families + 3 cellulosome modules)
dbCAN-fam-HMMs.txt.v1: HMMs for 311 dbCAN families (308 CAZyme families + 3 cellulosome modules)

** if you want to run dbCAN CAZyme annotation on your local Linux computer, do the following:
** 1. download dbCAN-fam-HMMs.txt, hmmscan-parser.sh
** 2. download HMMER 3.0 package [hmmer.org] and install it properly
** 3. format HMM db: hmmpress dbCAN-fam-HMMs.txt
** 4. run: hmmscan --domtblout yourfile.out.dm dbCAN-fam-HMMs.txt yourfile > yourfile.out
** 5. run: sh hmmscan-parser.sh yourfile.out.dm > yourfile.out.dm.ps (if alignment > 80aa, use E-value < 1e-5, otherwise use E-value < 1e-3; covered fraction of HMM > 0.3)

Cols in yourfile.out.dm.ps:
1. Family HMM
2. HMM length
3. Query ID
4. Query length
5. E-value (how similar to the family HMM)
6. HMM start
7. HMM end
8. Query start
9. Query end
10. Coverage

** About what E-value and Coverage cutoff thresholds you should use (in order to further parse yourfile.out.dm.ps file), we have done some evaluation analyses using Arabidopsis, rice, Aspergillus nidulans FGSC A4, Saccharomyces cerevisiae S288c and Escherichia coli K-12 MG1655, Clostridium thermocellum ATCC 27405 and Anaerocellum thermophilum DSM 6725. Our suggestion is that for plants, use E-value < 1e-23 and coverage > 0.2; for bacteria, use E-value < 1e-18 and coverage > 0.35; and for fungi, use E-value < 1e-17 and coverage > 0.45.

** We have also performed evaluation for the five CAZyme classes separately, which suggests that the best threshold varies for different CAZyme classes (please see http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4132414/ for details).
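
The kingdom-level cutoffs suggested above could be applied to yourfile.out.dm.ps with a short filter. The following Python sketch is not part of the dbCAN distribution: the names CUTOFFS, filter_hits and filter_file are hypothetical, and it assumes the file is whitespace-delimited with the ten columns listed above and no header row.

```python
# Suggested cutoffs quoted from the text above: (max E-value, min coverage).
CUTOFFS = {
    "plants": (1e-23, 0.2),
    "bacteria": (1e-18, 0.35),
    "fungi": (1e-17, 0.45),
}


def filter_hits(lines, kingdom):
    """Yield the fields of each row that passes the cutoffs for `kingdom`.

    Assumption: column 5 is the E-value and column 10 is the coverage,
    as in the column list for yourfile.out.dm.ps.
    """
    max_evalue, min_coverage = CUTOFFS[kingdom]
    for line in lines:
        fields = line.split()
        if len(fields) < 10 or line.startswith("#"):
            continue  # skip blank, comment, or malformed rows
        evalue = float(fields[4])
        coverage = float(fields[9])
        if evalue < max_evalue and coverage > min_coverage:
            yield fields


def filter_file(path, kingdom):
    """Read a parsed hmmscan file and return the rows that pass."""
    with open(path) as handle:
        return list(filter_hits(handle, kingdom))
```

For example, filter_file("yourfile.out.dm.ps", "bacteria") would keep only hits with E-value < 1e-18 and coverage > 0.35. Both comparisons are strict, matching the "<" and ">" used in the suggestion.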
In brief, for individual classes: (i) to annotate GH proteins, one should use a very relaxed coverage cutoff or the sensitivity will be low (Supplementary Tables S4 and S9); (ii) to annotate CE families, a very stringent E-value cutoff and coverage cutoff should be used; otherwise the precision will be very low due to a very high false positive rate (Supplementary Tables S5 and S10).

dbCAN-all-domains.txt: Signature domain sequences for all proteins in dbCAN (E-value < 1)
dbCAN-all-domains.txt.1e-5: Signature domain sequences for all proteins in dbCAN (if alignment > 80aa, use E-value < 1e-5, otherwise use E-value < 1e-3)
dbCAN-all-domains-metagenome-only.txt
dbCAN-all-domains-example.txt: the first 100 rows in dbCAN-all-domains.txt (too big to open in Windows)
Columns:
1. Family
2. Seq ID
3. Source DB (13 data sources)
4. Source accession number
5. Domain position
6. E-value (how similar to the family HMM)
7. Domain sequences

dbCAN-subfam-info.txt: Sub-classifications of sequences in dbCAN-all-domains.txt (sequences share >40% identity within each subfamily)
dbCAN-subfam-info-example.txt: the first 100 rows in dbCAN-subfam-info.txt (too big to open in Windows)
Columns:
1. Family
2. Seq ID (same as column #2 in dbCAN-all-domains.txt)
3. Identity (how similar to the representative sequence, 100 if itself)
4. Seq ID of the representative sequence
5. Subfamily ID

CAZyDB-ec-info.txt: EC numbers of CAZyDB proteins
Columns:
1. EC number
2. GenBank accession number
3. ENZYME description

CAZyDB-phylogeny.tar.gz: Phylogenies of 308 CAZy families with signature domain sequences from CAZyDB proteins
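
As an illustration of the dbCAN-subfam-info.txt column layout listed above, the following Python sketch groups sequence IDs by family and subfamily. It is only a sketch under stated assumptions: the file is assumed to be tab-delimited with no header row, and the function names group_subfamilies and read_subfam_info are hypothetical, not part of dbCAN.

```python
import csv
from collections import defaultdict


def group_subfamilies(rows):
    """Map (family, subfamily ID) to the list of member sequence IDs.

    Each row is expected to hold the five columns described for
    dbCAN-subfam-info.txt: family, seq ID, identity,
    representative seq ID, subfamily ID.
    """
    groups = defaultdict(list)
    for row in rows:
        if len(row) < 5:
            continue  # skip blank or malformed rows
        family, seq_id, _identity, _representative, subfamily = row[:5]
        groups[(family, subfamily)].append(seq_id)
    return dict(groups)


def read_subfam_info(path):
    """Assumption: the file is tab-delimited; adjust the delimiter if not."""
    with open(path, newline="") as handle:
        return group_subfamilies(csv.reader(handle, delimiter="\t"))
```

Trying read_subfam_info on dbCAN-subfam-info-example.txt first is a quick way to check the delimiter assumption before loading the full file.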