# dbCAN release 4.0
# 07/20/2015
# total 345 CAZyme HMMs
# data based on CAZyDB released on 03/17/2015
# New family models: CBM68, CBM69, CBM70, CBM71, GH133, GT95, GT96, GT97, PL23, AA11, AA12, AA13
# questions/comments to Yanbin Yin: yanbin.yin@gmail.com

# dbCAN release 3.0
# 05/11/2013
# total 333 CAZyme HMMs
# data based on CAZyDB released on 03/22/2013
# New family models in v3 compared with v2: CBM65, CBM66, CBM67, GH131, GH132, AA1, AA2, AA3, AA4, AA5, AA6, AA7, AA8, AA9, AA10
# Removed family models in v3: GH61 (now AA9), CBM33 (now AA10)

# dbCAN release 2.0
# 06/06/2012
# total 320 CAZyme HMMs
# data based on CAZyDB released on 01/09/2012
# New family models in v2 compared with v1: CBM63, CBM64, GT93, GT94, GH126, GH127, GH128, GH129, GH130
# Updated family model in v2: GH101

# dbCAN release 1.0
# 08/23/2011
# data based on CAZyDB released on 03/22/2011

dbCAN-fam-HMMs.txt, same as dbCAN-fam-HMMs.txt.v4: HMMs for 345 dbCAN families (342 CAZyme families + 3 cellulosome modules)
dbCAN-fam-HMMs.txt.v3: HMMs for 333 dbCAN families (330 CAZyme families + 3 cellulosome modules)
dbCAN-fam-HMMs.txt.v2: HMMs for 320 dbCAN families (317 CAZyme families + 3 cellulosome modules)
dbCAN-fam-HMMs.txt.v1: HMMs for 311 dbCAN families (308 CAZyme families + 3 cellulosome modules)

** To run dbCAN CAZyme annotation on your local Linux computer, do the following:
** 1. download dbCAN-fam-HMMs.txt and hmmscan-parser.sh
** 2. download the HMMER 3.0 package [hmmer.org] and install it properly
** 3. format the HMM db: hmmpress dbCAN-fam-HMMs.txt
** 4. run: hmmscan --domtblout yourfile.out.dm dbCAN-fam-HMMs.txt yourfile > yourfile.out (yourfile is your protein sequence file in FASTA format)
** 5. run: sh hmmscan-parser.sh yourfile.out.dm > yourfile.out.dm.ps (hits are kept with E-value < 1e-5 if the alignment is > 80 aa, otherwise with E-value < 1e-3; the covered fraction of the HMM must be > 0.3)
Columns in yourfile.out.dm.ps:
1. Family HMM
2. HMM length
3. Query ID
4. Query length
5. E-value (how similar to the family HMM)
6. HMM start
7. HMM end
8. Query start
9. Query end
10. Coverage (covered fraction of the family HMM)
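
The ten columns above can be read into named fields for downstream filtering. The following is a minimal Python sketch, not part of dbCAN. It assumes the file is tab-separated with the columns in the order listed, and it assumes the alignment length is measured on the query (query end minus query start); the `Hit` class, the function names and the example row are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One row of yourfile.out.dm.ps (column order as listed above)."""
    family: str
    hmm_length: int
    query_id: str
    query_length: int
    evalue: float
    hmm_start: int
    hmm_end: int
    query_start: int
    query_end: int
    coverage: float


def parse_hit(line: str) -> Hit:
    """Split one row into its ten fields (assumes tab-separated columns)."""
    f = line.rstrip("\n").split("\t")
    return Hit(f[0], int(f[1]), f[2], int(f[3]), float(f[4]),
               int(f[5]), int(f[6]), int(f[7]), int(f[8]), float(f[9]))


def passes_default_cutoffs(hit: Hit) -> bool:
    """Apply the rule described in step 5.

    Assumption: alignment length is taken as query end minus query start.
    """
    aligned = hit.query_end - hit.query_start
    evalue_cutoff = 1e-5 if aligned > 80 else 1e-3
    return hit.evalue < evalue_cutoff and hit.coverage > 0.3


# Made-up row for illustration only.
row = "GH5\t300\tseq1\t450\t2.5e-40\t10\t280\t35\t310\t0.9\n"
hit = parse_hit(row)
print(hit.family, hit.query_id, passes_default_cutoffs(hit))
```

Rows written by hmmscan-parser.sh should already satisfy these default cutoffs, so the check is mainly useful as a starting point for stricter, custom thresholds.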
** To help choose E-value and coverage cutoffs, we have done some evaluation analyses using Arabidopsis, rice, Aspergillus nidulans FGSC A4, Saccharomyces cerevisiae S288c, Escherichia coli K-12 MG1655, Clostridium thermocellum ATCC 27405 and Anaerocellum thermophilum DSM 6725. Our suggestion is: for plants, use E-value < 1e-23 and coverage > 0.2; for bacteria, use E-value < 1e-18 and coverage > 0.35; and for fungi, use E-value < 1e-17 and coverage > 0.45.
** We have also evaluated the five CAZyme classes separately, which suggests that the best thresholds vary between CAZyme classes (please see http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4132414/ for details). In brief: (i) to annotate GH proteins, one should use a very relaxed coverage cutoff, or the sensitivity will be low (Supplementary Tables S4 and S9); (ii) to annotate CE families, very stringent E-value and coverage cutoffs should be used; otherwise the precision will be very low due to a very high false positive rate (Supplementary Tables S5 and S10).
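
The organism-specific cutoffs suggested above can be applied to yourfile.out.dm.ps with a short filter. This is a minimal Python sketch under stated assumptions: the file is tab-separated, the E-value is column 5 and the coverage is column 10. The names `SUGGESTED_CUTOFFS` and `filter_hits` and the example rows are hypothetical, not part of dbCAN.

```python
import csv
import io

# Cutoffs suggested in the text above: (maximum E-value, minimum coverage).
SUGGESTED_CUTOFFS = {
    "plants": (1e-23, 0.2),
    "bacteria": (1e-18, 0.35),
    "fungi": (1e-17, 0.45),
}


def filter_hits(handle, group):
    """Yield rows whose E-value (column 5) and coverage (column 10) pass."""
    max_evalue, min_coverage = SUGGESTED_CUTOFFS[group]
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 10:
            continue  # skip blank or malformed lines
        if float(row[4]) < max_evalue and float(row[9]) > min_coverage:
            yield row


# Made-up rows for illustration only.
example = io.StringIO(
    "GT2\t170\tprotA\t600\t1e-30\t1\t160\t50\t210\t0.94\n"
    "CE1\t200\tprotB\t400\t1e-20\t20\t100\t20\t100\t0.40\n"
)
kept = list(filter_hits(example, "plants"))
print([row[2] for row in kept])
```

For a real run, pass an open file instead of the in-memory example, e.g. `with open("yourfile.out.dm.ps") as fh: hits = list(filter_hits(fh, "bacteria"))`.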

dbCAN-all-domains.txt: Signature domain sequences for all proteins in dbCAN (E-value < 1)
dbCAN-all-domains.txt.1e-5: Signature domain sequences for all proteins in dbCAN (if alignment > 80aa, use E-value < 1e-5, otherwise use E-value < 1e-3)
dbCAN-all-domains-metagenome-only.txt
dbCAN-all-domains-example.txt: the first 100 rows of dbCAN-all-domains.txt (the full file is too big to open in Windows)
Columns:
1. Family
2. Seq ID
3. Source DB (13 data sources)
4. Source accession number
5. Domain position
6. E-value (how similar to the family HMM)
7. Domain sequences
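
Because each row carries a domain sequence, the table can be turned into FASTA records for use with other tools. This is a minimal Python sketch that assumes the file is tab-separated with the seven columns in the order listed. The function name, the `>SeqID|Family` header layout and the example rows (including the source DB names, accessions and the domain position format) are made up for illustration.

```python
def domains_to_fasta(lines):
    """Yield one FASTA record per row of a dbCAN-all-domains style table.

    Assumption: columns are tab-separated and ordered as listed above
    (1 = Family, 2 = Seq ID, 7 = Domain sequence).
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        family, seq_id, domain_seq = fields[0], fields[1], fields[6]
        # Hypothetical header layout: sequence ID, then family.
        yield f">{seq_id}|{family}\n{domain_seq}"


# Made-up rows for illustration only.
example_rows = [
    "GH5\tseq0001\tExampleDB\tACC0001\t10-120\t1e-30\tMKTAYIAKQR\n",
    "CBM1\tseq0002\tExampleDB\tACC0002\t5-40\t2e-8\tGQCGGIGY\n",
]
for record in domains_to_fasta(example_rows):
    print(record)
```

An open file handle can be passed in place of the list, so the full file is streamed row by row rather than loaded at once.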

dbCAN-subfam-info.txt: Sub-classifications of sequences in dbCAN-all-domains.txt (sequences share >40% identity within each subfamily)
dbCAN-subfam-info-example.txt: the first 100 rows of dbCAN-subfam-info.txt (the full file is too big to open in Windows)
Columns:
1. Family
2. Seq ID (same as column #2 in dbCAN-all-domains.txt)
3. Identity (percent identity to the representative sequence; 100 if the sequence is itself the representative)
4. Seq ID of the representative sequence
5. Subfamily ID
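
To see which sequences fall into each subfamily, the rows can be grouped by family and subfamily ID. This is a minimal Python sketch that assumes the file is tab-separated with the five columns in the order listed. Keying on the (family, subfamily ID) pair is a precaution in case subfamily IDs repeat across families, which is an assumption. The function name and the example rows are hypothetical.

```python
from collections import defaultdict


def group_by_subfamily(lines):
    """Map (family, subfamily ID) to a list of (Seq ID, identity) pairs.

    Assumption: columns are tab-separated and ordered as listed above.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        family, seq_id, identity, _representative, subfam_id = fields[:5]
        groups[(family, subfam_id)].append((seq_id, float(identity)))
    return dict(groups)


# Made-up rows for illustration only.
example_rows = [
    "GH5\tseq0002\t85.2\tseq0001\t1\n",
    "GH5\tseq0003\t62.0\tseq0001\t1\n",
    "GH5\tseq0005\t71.5\tseq0004\t2\n",
]
for key, members in sorted(group_by_subfamily(example_rows).items()):
    print(key, len(members))
```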

CAZyDB-ec-info.txt: EC numbers of CAZyDB proteins
Columns:
1. EC number
2. GenBank accession number
3. ENZYME description

CAZyDB-phylogeny.tar.gz: Phylogenies of 308 CAZy families with signature domain sequences from CAZyDB proteins