Contents |
- Introduction
- Data Source
  - Methods for Collecting CAZyme Sequence Data
  - Methods for Collecting Annotation Data
- BLAST Page
- Annotate Page
- Download Page
- Search Function
- API
  - Javascript API Library
    - PlantCAZymeDomain
    - PlantCAZymeAnnotate
    - PlantCAZymeBLAST
  - HTTP Requests
    - POST http://cys.bios.niu.edu/plantcazyme/api/domain.php
    - POST http://cys.bios.niu.edu/plantcazyme/api/annotate.php
    - POST http://cys.bios.niu.edu/plantcazyme/api/blast.php |
Introduction |
Data Source |
Methods for Collecting CAZyme Sequence Data |
The HMMER package v3.0 was used to search the 330 dbCAN HMMs against the protein datasets of 35 genomes. We tested the performance of the dbCAN-based search on all 330 CAZyme families as a whole (denoted as All) using different combinations of E-value and coverage cutoffs. Figure 1 shows the F-measure values of the different parameter combinations for the All sets of Arabidopsis (Figure 1A) and rice (Figure 1B), where F-measure = 2 * (Sensitivity * Precision) / (Sensitivity + Precision). We then selected the combination that gave the highest F-measure value and report it in Table 2 and Table 3.

Tables 2 and 3 show that the combination of coverage > 0.2 and E-value < 1e-23 gave the best F-measure for both Arabidopsis (F-measure = 0.91, sensitivity = 0.89 and precision = 0.92) and rice (F-measure = 0.85, sensitivity = 0.84 and precision = 0.85). We also evaluated the five classes separately, which suggests that the best F-measure varies between CAZyme classes (Tables 2 and 3). Overall, the two largest classes, GT and GH (81% of CAZyme families), have higher F-measures in both plants than the three smaller classes CE, PL and CBM. The evaluation also suggests that: (i) to annotate GH proteins, one should use a very relaxed coverage cutoff, or the sensitivity will be low; (ii) to annotate CE families, a very stringent E-value cutoff and coverage cutoff should be used; otherwise the precision will be very low because of a very high false positive rate.

Although it would work best to use different parameter combinations for different classes and for different plants, we decided to use coverage > 0.2 and E-value < 1e-23 as the universal cutoffs, because this setting performs best for both the dicot (Arabidopsis) and the monocot (rice), and because it keeps the parsing process simple and easy for others to reproduce. Domain sequences and full-length protein sequences were retrieved for further bioinformatics analyses. |
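The evaluation and the universal cutoffs described above can be sketched in a few lines of JavaScript. This is only an illustration, not the code used to build the database: the function names and the hit fields (`hmmFrom`, `hmmTo`, `hmmLength`, `evalue`) are hypothetical, and coverage is assumed here to mean the fraction of the HMM covered by the alignment.

```javascript
// F-measure exactly as defined in the text:
// 2 * (Sensitivity * Precision) / (Sensitivity + Precision).
function fMeasure(sensitivity, precision) {
  if (sensitivity + precision === 0) return 0;
  return (2 * sensitivity * precision) / (sensitivity + precision);
}

// Compare predicted CAZyme IDs with a reference list of known CAZyme IDs.
function evaluate(predictedIds, knownIds) {
  const known = new Set(knownIds);
  const predicted = new Set(predictedIds);
  let truePositives = 0;
  for (const id of predicted) {
    if (known.has(id)) truePositives++;
  }
  const sensitivity = known.size ? truePositives / known.size : 0;
  const precision = predicted.size ? truePositives / predicted.size : 0;
  return { sensitivity, precision, f: fMeasure(sensitivity, precision) };
}

// Apply the universal cutoffs: coverage > 0.2 and E-value < 1e-23.
// ASSUMPTION: coverage = aligned HMM positions / HMM length.
function passesCutoffs(hit, minCoverage = 0.2, maxEvalue = 1e-23) {
  const coverage = (hit.hmmTo - hit.hmmFrom + 1) / hit.hmmLength;
  return coverage > minCoverage && hit.evalue < maxEvalue;
}

// Hypothetical usage with made-up identifiers.
const score = evaluate(["p1", "p2", "p3"], ["p1", "p2", "p4", "p5"]);
console.log(score.sensitivity, score.precision, score.f);
```

Scanning a grid of `minCoverage` and `maxEvalue` values and keeping the pair with the highest `f` would reproduce the kind of parameter search summarized in Figure 1.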
Methods for Collecting Annotation Data |
- RPS-BLAST (http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/wwwblast/node20.html) was run with full-length CAZyme protein sequences as the query and the NCBI CDD database (hyperlink) as the database. CDD is a protein annotation resource that contains well-annotated sequence models. CDD domain matches with E-value < 1e-2 were kept.
- Gene Ontology annotation was retrieved from the Phytozome annotation for each genome. For an example, see ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Athaliana/annotation/Athaliana_167_annotation_info.txt.gz.
- Full-length sequences were run through the pepwindow program (http://emboss.sourceforge.net/apps/release/6.0/emboss/apps/pepwindow.html) to generate a graph of the classic Kyte & Doolittle hydropathy plot.
- Full-length sequences were run through TMHMM (http://www.cbs.dtu.dk/services/TMHMM/) to predict transmembrane regions.
- Signal peptides were predicted using SignalP (http://www.cbs.dtu.dk/services/SignalP/).
- PSSpred (http://zhanglab.ccmb.med.umich.edu/PSSpred/) was run to predict secondary structures.
- Full-length sequences were run through the COILS program (http://embnet.vital-it.ch/software/COILS_form.html) to predict coiled-coil regions.
- Plant EST (expressed sequence tag) data were downloaded from EBI (ftp://ftp.ebi.ac.uk/pub/databases/embl/release/). TBLASTN was run with full-length proteins as the query to search for homologous EST matches. Matches with E-value < 1e-2 were kept as significant.
- The NCBI non-redundant protein sequence database (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz) was downloaded. BLASTP was run with full-length proteins as the query to search for homologous protein matches. Matches with E-value < 1e-2 were kept as significant.
- The Protein Data Bank protein sequence database (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/pdbaa.gz) was downloaded. BLASTP was run with full-length proteins as the query to search for homologous protein matches. Matches with E-value < 1e-2 were kept as significant. A significant PDB match means that the browsed protein has a close homolog with a solved 3D structure.
- CAZyme domain sequences were run through the OrthoMCL program (http://orthomcl.org/orthomcl/). This was done in a number of steps, each of which is explained below:
  1. The domain FASTA sequences were sorted into files based on their family.
  2. Each family FASTA file was turned into a BLAST database using the program makeblastdb with default settings.
  3. For each family, the FASTA file was run against the database of the same name using blastp with default settings, with all output being tabular.
  4. orthomclAdjustFasta was run for each family, using the family as the identifier.
  5. orthomclBlastParser was run using the BLAST results from step 3 and the compliant FASTA file created in step 4.
  6. orthomclInstallSchema was run. The default configuration file was used, except for the login information to the database.
  7. orthomclLoadBlast was run, using the file generated in step 5.
  8. orthomclPairs was run with cleanup.
  9. orthomclDumpPairs was run.
  10. mcl mclInput was run, using --abc and -I 1.5.
  11. orthomclMclToGroups was run using the file generated from step 3, and inputting the file from step 10.
  12. The resulting file gives each orthologous group on a separate line.
- For each orthologous group, the alignment was generated by MAFFT (http://www.ebi.ac.uk/Tools/msa/mafft/). The graph was then generated by inputting the resulting file into the PHP script that we provide here (http://cys.bios.niu.edu/plantcazyme/scripts.php?script=alignment).
- The above alignment of each orthologous group was used to run FastTree (http://www.microbesonline.org/fasttree/) to generate a phylogenetic tree. The Newick format tree file was turned into a graph using a Biopython script (http://cys.bios.niu.edu/plantcazyme/scripts.php?script=tree).
- For Arabidopsis CAZymes, publication records were retrieved from the TAIR database (ftp://ftp.arabidopsis.org/User_Requests/Locus_Published_20130305.txt). |
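Several of the searches above keep only matches with E-value < 1e-2. As a rough sketch of that filtering step, the JavaScript below parses tabular BLAST output and applies the cutoff. It assumes the standard 12-column tabular layout; the text only states that the OrthoMCL blastp runs were tabular, so using the same format for the EST, nr and PDB searches is an assumption, and the function names are hypothetical.

```javascript
// Standard column order of tabular BLAST output.
const BLAST_COLUMNS = [
  "queryID", "subjectID", "identity", "alignmentLength", "mismatches",
  "gapOpens", "qStart", "qEnd", "sStart", "sEnd", "evalue", "bitScore",
];

// Turn tab-separated BLAST lines into objects; comment and blank lines are skipped.
function parseBlastTabular(text) {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "" && !line.startsWith("#"))
    .map((line) => {
      const cells = line.split("\t");
      const hit = {};
      BLAST_COLUMNS.forEach((name, i) => {
        // The first two columns are identifiers; the rest are numeric.
        hit[name] = i < 2 ? cells[i] : Number(cells[i]);
      });
      return hit;
    });
}

// Keep only matches below the E-value cutoff used in the text (1e-2).
function significantHits(hits, maxEvalue = 1e-2) {
  return hits.filter((hit) => hit.evalue < maxEvalue);
}

// Hypothetical two-line example with made-up identifiers.
const sample =
  "q1\ts1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-50\t400\n" +
  "q1\ts2\t30.0\t50\t35\t2\t10\t60\t1\t50\t0.5\t20.1\n";
console.log(significantHits(parseBlastTabular(sample)).map((h) => h.subjectID));
```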
BLAST Page |
On the BLAST page, you can submit your own protein (blastp) or DNA/RNA (blastx) sequences to search against our pre-computed CAZyme protein sequences. You may also choose a specific species or the CAZy database to search against. If you are submitting a large dataset, expect a long wait; it is best to leave your email address so that the result can be sent to you when the job has finished. |
Annotate Page |
This function is the same as the one provided by the dbCAN server (http://csbl.bmb.uga.edu/dbCAN/annotate.php), except that the results are not shown graphically. Instead, the hmmsearch output, a parseable table of per-domain hits, is provided. The format description is available at ftp://selab.janelia.org/pub/software/hmmer3/3.1b1/Userguide.pdf. Briefly, the columns to pay attention to are: target name, query name, tlen, qlen, c-Evalue, ali coord (from and to), and hmm coord (from and to). |
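As a rough illustration of how the per-domain table can be read, the JavaScript sketch below pulls out the columns listed above. It assumes the whitespace-separated per-domain layout documented in the HMMER user guide (22 fixed columns followed by a free-text description); the function name and the sample line are hypothetical.

```javascript
// Parse HMMER per-domain table text into objects holding the key columns.
function parseDomainTable(text) {
  const rows = [];
  for (const line of text.split(/\r?\n/)) {
    if (line.trim() === "" || line.startsWith("#")) continue; // skip comments
    const f = line.trim().split(/\s+/);
    if (f.length < 22) continue; // not a complete per-domain row
    rows.push({
      targetName: f[0],
      tlen: Number(f[2]),
      queryName: f[3],
      qlen: Number(f[5]),
      cEvalue: Number(f[11]),
      hmmFrom: Number(f[15]),
      hmmTo: Number(f[16]),
      aliFrom: Number(f[17]),
      aliTo: Number(f[18]),
    });
  }
  return rows;
}

// Hypothetical row: a protein (target) hit by one family HMM (query).
const sampleRow =
  "protein1 - 325 FAM1.hmm - 255 1.2e-80 270.1 0.0 1 1 3.4e-83 1.2e-80 " +
  "269.9 0.0 2 254 45 320 44 321 0.98 example description";
console.log(parseDomainTable(sampleRow)[0]);
```

Splitting on whitespace works because only the trailing description column may contain spaces, and it is not used here.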
Download Page |
The Download page allows you to bulk download the CAZyme sequences of each family or each plant. |
Search Function |
The PlantCAZyme database can be queried using the search bar at the top of the page. Multiple types of criteria can be specified to narrow your search. There are two ways to search: unformatted search and formatted search.

In an unformatted search, you enter a query with no formatting (i.e. without brackets [] in the query). This will run your query only against the following fields:

- ID, e.g. AT2G46570.1
- Family, e.g. CBM10
- Species, e.g. Arabidopsis Thaliana
- Domain, e.g. Cellulose_synt

See formatted searching for a description of these fields. Please note that unformatted searching is space delimited. So, entering the query "Arabidopsis Thaliana" will yield results from both Arabidopsis Thaliana and Arabidopsis Lyrata, as both contain the word "Arabidopsis".

Formatted searching allows you to be more specific and to search through more categories. A formatted search is indicated by the use of brackets []. For example, if you want to search for the species Arabidopsis Thaliana, you can search "Arabidopsis Thaliana[Species]". You can write more than one specifier in a query. So if you only wanted the AA1 family of that species, you could write the query as "Arabidopsis Thaliana[Species] AA1[Family]". The specifiers are combined in an AND fashion, so a result will only appear if it matches all of the criteria you have given. The categories that can be searched and their description are as follows:

Note that formatted queries are still delimited by spaces. This means that each word is searched rather than the phrase. E.g. searching "Arabidopsis Thaliana[Species]" will bring up anything with a species containing "Arabidopsis" or "Thaliana". However, the search can be restricted in either of two ways. The first is to simply split the query. E.g. "Arabidopsis[Species] Thaliana[Species]" would only bring up plants from the species "Arabidopsis Thaliana". The second is to use parentheses. A phrase enclosed in parentheses is treated as a single query. So, "(Arabidopsis Thaliana)[Species]" will result in only plants of the species "Arabidopsis Thaliana". |
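The matching rules for formatted searches can be sketched in JavaScript as follows. This is not the server's implementation, only a minimal model of the behaviour described above: specifiers are combined with AND, the words inside one specifier with OR, and a parenthesized phrase counts as one term. The function names, the record layout and the case-insensitive substring matching are assumptions.

```javascript
// Split a formatted query into { field, terms } specifiers.
function parseFormattedQuery(query) {
  const specifiers = [];
  const pattern = /([^\[\]]+)\[([^\[\]]+)\]/g; // text followed by [Field]
  let match;
  while ((match = pattern.exec(query)) !== null) {
    const text = match[1].trim();
    const field = match[2].trim();
    const phrase = /^\((.*)\)$/.exec(text);
    // A parenthesized phrase is one term; otherwise every word is a term.
    const terms = phrase
      ? [phrase[1].trim()]
      : text.split(/\s+/).filter(Boolean);
    specifiers.push({ field, terms });
  }
  return specifiers;
}

// A record matches if every specifier has at least one matching term.
function matchesQuery(record, specifiers) {
  return specifiers.every(({ field, terms }) => {
    const value = String(record[field] || "").toLowerCase();
    return terms.some((term) => value.includes(term.toLowerCase()));
  });
}

// Hypothetical record, used to compare the three query styles from the text.
const lyrata = { Species: "Arabidopsis Lyrata", Family: "AA1" };
for (const q of [
  "Arabidopsis Thaliana[Species]",
  "Arabidopsis[Species] Thaliana[Species]",
  "(Arabidopsis Thaliana)[Species]",
]) {
  console.log(q, matchesQuery(lyrata, parseFormattedQuery(q)));
}
```

Under these assumptions only the first query matches the Arabidopsis Lyrata record, which mirrors the space-delimited behaviour the text warns about.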
API |
The PlantCAZyme API can be used to query the PlantCAZyme database from an external source. The API works over HTTP requests. A JavaScript file for working with the API is provided, and its documentation is found below. Alternatively, you can send the requests yourself. The documentation for these requests is provided in the respective section below. |
The Javascript API Library |
The Javascript API library provides everything you need to make the queries that PlantCAZyme supports. There are three distinct function objects that can be passed information in order to make a query. Each of these requires jQuery in order to execute. The JavaScript API library can be found here, and jQuery can be downloaded here, or you can link to Google's jQuery here. All of the objects work in the same way. First, construct the query by passing it values. Then, assign it functions to perform when it is done, when it fails, or every time. Use the next() function to iterate over results, and use the get() function to retrieve information. |
PlantCAZymeDomain |
This function object allows you to look up the signature domain that a protein belongs to. The protein must already be part of the PlantCAZyme database, and you must use the ID that PlantCAZyme uses.

Set up the object.

    var result = new PCAZymeDomain("AT1G18140.1");

The next() function allows us to iterate over results. Use this in a while() loop.

    while(result.next())

The get() function allows us to get information from the result. Passing no parameters will return an array; otherwise we pass in a value. These values are part of the object's data variable. They are as follows:

- data.ID
- data.Family
- data.Start
- data.End
- data.Evalue
- data.Sequence

Alternatively, you can pass in the corresponding integer to get the same results.

The object's done() function must be set, as the query is done asynchronously. If one wishes, they can also set the fail() and always() functions.

The following code will query PlantCAZyme for the domains of the protein "AT1G18140.1". It will alert the user of all of the domains upon success, and alert the user of failure upon a failure.

    var result = new PCAZymeDomain("AT1G18140.1");
    // Called when the asynchronous query succeeds.
    result.done = function(){
        while(result.next()){
            alert(result.get(result.data.Family));
        }
    };
    // Called when the query fails.
    result.fail = function(){
        alert("The domain query failed.");
    }; |
PlantCAZymeAnnotate |
This function object allows you to annotate protein sequences to find likely domains. This is done using the HMMs of dbCAN.

Set up the object. The ID and sequence are mandatory, but the evalue is optional. The default is 10.

    var sequence =
        "MAIKNILALVVLLSVVGVSVAIPQLLDLDYYRSKCPKAEEIVRGVTVQYVSRQKTLAAKL" +
        "LRMHFHDCFVRGCDGSVLLKSAKNDAERDAVPNLTLKGYEVVDAAKTALERKCPNLISCA" +
        "DVLALVARDAVAVIGGPWWPVPLGRRDGRISKLNDALLNLPSPFADIKTLKKNFANKGLN" +
        "AKDLVVLSGGHTIGISSCALVNSRLYNFTGKGDSDPSMNPSYVRELKRKCPPTDFRTSLN" +
        "MDPGSALTFDTHYFKVVAQKKGLFTSDSTLLDDIETKNYVQTQAILPPVFSSFNKDFSDS" +
        "MVKLGFVQILTGKNGEIRKRCAFPN";
    var result = new PCAZymeAnnotate("AT1G05240.1", sequence, {evalue: 10});

The next() function allows us to iterate over results. Use this in a while() loop.

    while(result.next())

The get() function allows us to get information from the result. Passing no parameters will return an array; otherwise we pass in a value. These values are part of the object's data variable. They are as follows:

- data.targetName
- data.targetAccession
- data.targetLength
- data.queryName
- data.queryAccession
- data.queryLength
- data.fullEvalue
- data.fullScore
- data.fullBias
- data.domainResultNum
- data.domainResultTotal
- data.domainc-Evalue
- data.domaini-Evalue
- data.domainScore
- data.domainBias
- data.hmmFrom
- data.hmmTo
- data.aliFrom
- data.aliTo
- data.envFrom
- data.envTo
- data.acc
- data.description

Alternatively, you can pass in the corresponding integer to get the same results.

The object's done() function must be set, as the query is done asynchronously. If one wishes, they can also set the fail() and always() functions.

The following code will run an hmmscan for the domains of the protein "AT1G05240.1". It will alert the user of all of the domains upon success, and alert the user of failure upon a failure.

    // "sequence" is the protein sequence string defined above.
    var result = new PCAZymeAnnotate("AT1G05240.1", sequence, {evalue: 10});
    // Called when the asynchronous query succeeds.
    result.done = function(){
        while(result.next()){
            // The target of each hit is the matching dbCAN domain model.
            alert(result.get(result.data.targetName));
        }
    };
    // Called when the query fails.
    result.fail = function(){
        alert("The annotation query failed.");
    }; |
PlantCAZymeBLAST |
This function object allows you to run a BLAST against the proteins of the PlantCAZyme database.

Set up the object. The ID and sequence are mandatory. Options are included in braces, and are case sensitive. Options include evalue, program, database, matrix, filter, and mask. The default values are 10, "blastp", "all", "BLOSUM62", "yes", and "no" respectively.

    var sequence =
        "MAIKNILALVVLLSVVGVSVAIPQLLDLDYYRSKCPKAEEIVRGVTVQYVSRQKTLAAKL" +
        "LRMHFHDCFVRGCDGSVLLKSAKNDAERDAVPNLTLKGYEVVDAAKTALERKCPNLISCA" +
        "DVLALVARDAVAVIGGPWWPVPLGRRDGRISKLNDALLNLPSPFADIKTLKKNFANKGLN" +
        "AKDLVVLSGGHTIGISSCALVNSRLYNFTGKGDSDPSMNPSYVRELKRKCPPTDFRTSLN" +
        "MDPGSALTFDTHYFKVVAQKKGLFTSDSTLLDDIETKNYVQTQAILPPVFSSFNKDFSDS" +
        "MVKLGFVQILTGKNGEIRKRCAFPN";
    var result = new PCAZymeBLAST("AT3G13560.1", sequence,
        {evalue: 10, program: "blastp", matrix: "BLOSUM62"});

The next() function allows us to iterate over results. Use this in a while() loop.

    while(result.next())

The get() function allows us to get information from the result. Passing no parameters will return an array; otherwise we pass in a value. These values are part of the object's data variable. They are as follows:

- data.queryID
- data.subjectID
- data.identity
- data.alignmentLength
- data.mismatches
- data.gapOpens
- data.qStart
- data.qEnd
- data.sStart
- data.sEnd
- data.evalue
- data.bitScore

The following code will run a BLAST search against the PlantCAZyme proteins for the protein "AT3G13560.1". It will make a list of all hits and alert the user with it upon success, and alert the user of failure upon a failure.

    // "sequence" is the protein sequence string defined above.
    var result = new PCAZymeBLAST("AT3G13560.1", sequence);
    // Called when the asynchronous query succeeds.
    result.done = function(){
        var hits = [];
        while(result.next()){
            hits.push(result.get(result.data.subjectID));
        }
        // Join the subject IDs into one comma-separated list.
        alert(hits.join(","));
    };
    // Called when the query fails.
    result.fail = function(){
        alert("The BLAST query failed.");
    }; |
HTTP Requests |
Sometimes you may wish to get information from PlantCAZyme without using JavaScript. No library is provided for this, but the requests themselves are supported. Each request is sent through POST to the corresponding API page, and a successful response is JSON encoded. The examples provided are written in Perl. |
POST http://cys.bios.niu.edu/plantcazyme/api/domain.php |
Sending a request here will retrieve the domains for a given protein. The only field is "id".

    my $request = POST('http://cys.bios.niu.edu/plantcazyme/api/domain.php', ['id' => 'AT1G18140.1']);
    my $response = $ua->request($request);

Returned is an array of arrays. Each row is a domain match; the columns are in the order ID, Family, Start, End, Evalue, Sequence. -1 is returned upon failure.

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request::Common qw(POST);
    use JSON::Parse 'parse_json';

    my $ua = LWP::UserAgent->new;
    my $request = POST('http://cys.bios.niu.edu/plantcazyme/api/domain.php', ['id' => 'AT1G18140.1']);
    my $response = $ua->request($request);

    if ($response->is_success) {
        my $result = $response->decoded_content;
        $result =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        if ($result ne '-1') {
            my $rows = parse_json($result);
            for my $row (@$rows) {
                print $row->[1], "\n";    # print the domain family
            }
        }
        else {
            print "Failure!\n";
        }
    }
    else {
        print $response->code, "\n";    # HTTP error code
    } |
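The response format described above can also be decoded in other languages. The JavaScript sketch below maps each row onto the documented column order and treats a body of -1 as failure; the function name and the sample response are hypothetical, and no request is sent.

```javascript
// Column order documented for api/domain.php.
const DOMAIN_COLUMNS = ["ID", "Family", "Start", "End", "Evalue", "Sequence"];

// Decode a response body: returns null on failure (-1), else an array of objects.
function decodeDomainResponse(body) {
  if (body.trim() === "-1") return null; // the API signals failure with -1
  const rows = JSON.parse(body); // expected to be an array of arrays
  return rows.map((row) =>
    Object.fromEntries(DOMAIN_COLUMNS.map((name, i) => [name, row[i]]))
  );
}

// Hypothetical response body with a single made-up domain match.
const body = '[["protein1", "GH1", 10, 200, "1e-50", "MKTAYIAK"]]';
const domains = decodeDomainResponse(body);
console.log(domains === null ? "Failure!" : domains.map((d) => d.Family));
```

Naming the columns once keeps later code from depending on bare numeric indices.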
POST http://cys.bios.niu.edu/plantcazyme/api/annotate.php |
Sending a request here will run an hmmscan for a given protein. The fields are "id", "sequence", and "evalue". "evalue" is optional. In the fragment below, $sequence holds the protein sequence, as defined in the full example that follows.

    my $request = POST('http://cys.bios.niu.edu/plantcazyme/api/annotate.php',
        ['id' => 'AT1G05240.1', 'sequence' => $sequence, 'evalue' => 10]);
    my $response = $ua->request($request);

Returned is an array of arrays. Each row is a domain match; the columns are in the order target name, target accession, target length, query name, query accession, query length, full sequence evalue, full sequence score, full sequence bias, this domain result number, total results in this domain, this domain c-Evalue, this domain i-Evalue, this domain score, this domain bias, hmm model start position, hmm model end position, domain start position, domain end position, domain envelope start, domain envelope end, probability, and description. -1 is returned upon failure.

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request::Common qw(POST);
    use JSON::Parse 'parse_json';

    my $sequence =
          'MAIKNILALVVLLSVVGVSVAIPQLLDLDYYRSKCPKAEEIVRGVTVQYVSRQKTLAAKL'
        . 'LRMHFHDCFVRGCDGSVLLKSAKNDAERDAVPNLTLKGYEVVDAAKTALERKCPNLISCA'
        . 'DVLALVARDAVAVIGGPWWPVPLGRRDGRISKLNDALLNLPSPFADIKTLKKNFANKGLN'
        . 'AKDLVVLSGGHTIGISSCALVNSRLYNFTGKGDSDPSMNPSYVRELKRKCPPTDFRTSLN'
        . 'MDPGSALTFDTHYFKVVAQKKGLFTSDSTLLDDIETKNYVQTQAILPPVFSSFNKDFSDS'
        . 'MVKLGFVQILTGKNGEIRKRCAFPN';

    my $ua = LWP::UserAgent->new;
    my $request = POST('http://cys.bios.niu.edu/plantcazyme/api/annotate.php',
        ['id' => 'AT1G05240.1', 'sequence' => $sequence, 'evalue' => 10]);
    my $response = $ua->request($request);

    if ($response->is_success) {
        my $result = $response->decoded_content;
        $result =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        if ($result ne '-1') {
            my $rows = parse_json($result);
            for my $row (@$rows) {
                print $row->[0], "\n";    # print the domain (target name)
            }
        }
        else {
            print "Failure!\n";
        }
    }
    else {
        print $response->code, "\n";    # HTTP error code
    } |
POST http://cys.bios.niu.edu/plantcazyme/api/blast.php |
Sending a request here will run a BLAST search for a given protein. The fields are "id", "sequence", "evalue", "program", "database", "matrix", "filter", and "mask". Filter and mask are set simply by defining them. In the fragment below, $sequence holds the protein sequence, as defined in the full example that follows.

    my $request = POST('http://cys.bios.niu.edu/plantcazyme/api/blast.php',
        ['id' => 'AT3G13560.1', 'sequence' => $sequence,
         'evalue' => 10, 'program' => 'blastp', 'matrix' => 'BLOSUM62']);
    my $response = $ua->request($request);

Returned is an array of arrays. Each row is a match; the columns are in the order query ID, subject ID, % identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, evalue, bit score. -1 is returned upon failure.

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request::Common qw(POST);
    use JSON::Parse 'parse_json';

    my $sequence =
          'MAIKNILALVVLLSVVGVSVAIPQLLDLDYYRSKCPKAEEIVRGVTVQYVSRQKTLAAKL'
        . 'LRMHFHDCFVRGCDGSVLLKSAKNDAERDAVPNLTLKGYEVVDAAKTALERKCPNLISCA'
        . 'DVLALVARDAVAVIGGPWWPVPLGRRDGRISKLNDALLNLPSPFADIKTLKKNFANKGLN'
        . 'AKDLVVLSGGHTIGISSCALVNSRLYNFTGKGDSDPSMNPSYVRELKRKCPPTDFRTSLN'
        . 'MDPGSALTFDTHYFKVVAQKKGLFTSDSTLLDDIETKNYVQTQAILPPVFSSFNKDFSDS'
        . 'MVKLGFVQILTGKNGEIRKRCAFPN';

    my $ua = LWP::UserAgent->new;
    my $request = POST('http://cys.bios.niu.edu/plantcazyme/api/blast.php',
        ['id' => 'AT3G13560.1', 'sequence' => $sequence,
         'evalue' => 10, 'program' => 'blastp', 'matrix' => 'BLOSUM62']);
    my $response = $ua->request($request);

    if ($response->is_success) {
        my $result = $response->decoded_content;
        $result =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        if ($result ne '-1') {
            my $rows = parse_json($result);
            for my $row (@$rows) {
                print $row->[1], "\n";    # print the match (subject ID)
            }
        }
        else {
            print "Failure!\n";
        }
    }
    else {
        print $response->code, "\n";    # HTTP error code
    } |
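For callers who want to assemble the same POST fields outside Perl, the JavaScript sketch below builds the form body for blast.php. The helper name is hypothetical. Because the text only says that filter and mask are set by defining them, sending the value "1" for these flags is an assumption, as is stripping whitespace from the sequence before sending it.

```javascript
// Build the form-encoded fields for a request to api/blast.php.
function buildBlastForm(id, sequence, options = {}) {
  const form = new URLSearchParams();
  form.set("id", id);
  // ASSUMPTION: whitespace (e.g. wrapped lines) is not part of the sequence.
  form.set("sequence", sequence.replace(/\s+/g, ""));
  // Optional valued fields are sent only when the caller supplies them.
  for (const name of ["evalue", "program", "database", "matrix"]) {
    if (options[name] !== undefined) form.set(name, String(options[name]));
  }
  // Flags are "set simply by defining them"; the value "1" is an assumption.
  for (const flag of ["filter", "mask"]) {
    if (options[flag]) form.set(flag, "1");
  }
  return form;
}

// Hypothetical usage with a short made-up sequence.
const form = buildBlastForm("protein1", "MAIKN ILALV", {
  evalue: 10,
  program: "blastp",
  mask: true,
});
console.log(form.toString());
```

The resulting object can be passed as the body of a POST request to http://cys.bios.niu.edu/plantcazyme/api/blast.php, for example with fetch in an environment that provides it.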