
About PlantPro20Fam

Contents
Introduction
Citation
Usage
Search
Output
Statistics
Download


Introduction

From green algae to higher plants, all species in kingdom Plantae have cell walls, complex living biomaterials that play critical roles in the lives of plants, humans, and ultimately, the entire biosphere. Plant cell walls are composed primarily of structural polysaccharides, and recent advances in genomics have led to increasingly sophisticated polysaccharide-centric cell wall models. Missing from wall models, however, are the secreted Pro-rich structural proteins that compose up to 10% of the cell wall mass in higher plants and play critical roles in cell wall physiology. Heavily post-translationally modified, often cross-linked into the cell wall matrix, and composed of tandem repeat (TR) architectures, these secreted structural TR proteins (TRPs) have confounded both biochemical purification and computational analysis, leaving their phylogenetic diversity in the plant kingdom poorly described.

PlantPro20Fam is the first database of its kind dedicated to the storage, classification, and phylogenetics of Pro-rich TRPs targeted to the plant secretory pathway. Built using bioinformatics tools specifically designed for identifying and clustering TR motifs from large multi-genomic databases, PlantPro20Fam classifies 31 distinct groups of secreted Pro-rich TRPs with phylogenetic distributions ranging from broadly conserved secretome “core modules” (e.g. all angiosperms) to highly restricted secretome signatures (e.g. individual plant families). In addition, the PlantPro20Fam database includes hundreds of as yet unclassified Pro-rich TRPs.


Citation

Please cite: Newman AM and Cooper JB (2011) Global analysis of proline-rich tandem repeat proteins reveals broad phylogenetic diversity in plant secretomes. PLoS ONE 6(8): e23167. (Paper)


Usage

Steps 1-6 below correspond to the same numbers shown on the PlantPro20Fam homepage.

1. Choose species from the dropdown menu or select taxonomic group from the species tree.

2. Using the TR/TRP taxonomy tree, select TRP class(es) or TR motif class(es). Alternatively, select unclassified. A detailed overview of each TR/TRP class is provided in Newman and Cooper (2011); see Citation.

3. (Optional) Specify the percent TR domain coverage as a TRP filter. For example, a setting of >66% TR coverage retrieves only proteins in which more than 66% of all residues lie within TR domains.

4. (Optional) Filter retrieved sequences for either a signal peptide or a GPI anchor.

5. To avoid errors introduced when automatically aligning TR-rich sequences, we manually curated thousands of full and partial ORFs from the PlantPro20Fam database to arrive at a high-quality non-redundant sequence database. Choose the Curated database for the non-redundant set of proteins. Select All Sequences to retrieve unclassified sequences, as well as additional redundant and/or partial sequences captured by the TR/TRP taxonomies.

6. Press Submit!
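
The coverage filter in step 3 can be illustrated with a minimal Python sketch. It is not the PlantPro20Fam implementation: the function names, the 1-based inclusive range convention, and the choice to count overlapping TR domains only once are assumptions made for illustration.

```python
def tr_coverage(seq_len, tr_ranges):
    """Return the percent of residues that lie within at least one TR domain.

    seq_len   -- length of the protein sequence in residues
    tr_ranges -- list of (start, end) TR domain coordinates,
                 1-based and inclusive (an assumed convention)
    """
    covered = set()
    for start, end in tr_ranges:
        # Overlapping domains are counted once (an assumption).
        covered.update(range(start, end + 1))
    return 100.0 * len(covered) / seq_len


def passes_coverage_filter(seq_len, tr_ranges, min_percent):
    """Hypothetical filter: keep proteins with more than min_percent TR coverage."""
    return tr_coverage(seq_len, tr_ranges) > min_percent


# Two overlapping TR domains in a 100-residue protein cover residues 1-70.
example_coverage = tr_coverage(100, [(1, 50), (41, 70)])
```

Under these assumptions, a 100-residue protein with TR domains at 1-50 and 41-70 has 70% coverage and would pass a >66% filter, whereas a protein with exactly 66% coverage would not.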


Search

The search bar, located below the main interface on the PlantPro20Fam homepage, scans the sequence database for known locus identifiers (e.g. AT3G62680), published Pro-rich TRP names (e.g. PRP3), or previous sequence annotations. If the query is a locus identifier contained in the database, only one sequence is returned. For all other queries, every sequence containing the query as a substring is retrieved. In some cases, you will need to open the sequence cluster (see Fig. 4, below) to find the sequence whose annotation matches your query. A more advanced search feature, incorporating Boolean filtering, is currently under development.
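
The two search behaviors described above (a single hit for a known locus identifier, substring matching otherwise) can be sketched in Python as follows. This is an illustration only: the record fields (`locus`, `name`, `annotation`), the case-insensitive comparison, and the toy records are all invented and do not reflect the actual database schema or contents.

```python
def search(records, query):
    """Hypothetical search: exact locus match first, substring match otherwise."""
    q = query.strip().lower()
    if not q:
        return []
    # A query equal to a known locus identifier returns only that sequence.
    exact = [r for r in records if r["locus"].lower() == q]
    if exact:
        return exact[:1]
    # Otherwise, return every record containing the query as a substring
    # in any searched field (case-insensitivity is an assumption).
    fields = ("locus", "name", "annotation")
    return [r for r in records if any(q in r[f].lower() for f in fields)]


# Invented toy records, for illustration only.
records = [
    {"locus": "LOCUS0001", "name": "toyPRP1", "annotation": "toy proline-rich protein"},
    {"locus": "LOCUS0002", "name": "toyPRP2", "annotation": "toy proline-rich protein"},
    {"locus": "LOCUS0003", "name": "toyEXT1", "annotation": "toy extensin-like protein"},
]

locus_hits = search(records, "locus0002")   # exact locus match: one record
name_hits = search(records, "prp")          # substring match: two records
```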


Output

After retrieving all sequences matching your search criteria, PlantPro20Fam generates a sequence report (e.g., see Fig. 1).

TR domains are shown as colored characters, and predicted signal peptides are underlined. The abbreviated name of each sequence is given in bold white characters (e.g. HPOA), and a descriptive name is given underneath. The sequence report also includes the phylogenetic distributions of all retrieved sequences (Fig. 1A), together with information about the TR content of each sequence (Fig. 1B) and the members of each protein sequence cluster (Fig. 1C).

Figure 1. Sequence Report

Select Show TRs (see Fig. 1B) to display the TR architecture(s) corresponding to a particular protein (e.g., see Fig. 2). Several TR statistics are displayed: class name (TR Motif), % coverage (with respect to entire protein sequence), sequence range, copy number, period, and % identity (degree of similarity between the TR consensus sequence and the aligned TR domain it represents). Unclassified TRs are denoted '-'.
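
To make the TR statistics listed above concrete, here is a minimal Python sketch of one way they could be computed from a consensus sequence and its aligned repeat copies. The formulas are assumptions, not the method used by PlantPro20Fam: period is taken as the consensus length, copy number as TR residues divided by period, and % identity as the fraction of alignment columns that match the consensus (gaps, written '-', count as mismatches). The sequences are invented.

```python
def tr_stats(consensus, copies, protein_len):
    """Return illustrative TR statistics for one TR domain.

    consensus   -- TR consensus sequence
    copies      -- repeat copies aligned to the consensus, each the same
                   length as the consensus, with '-' for gaps
    protein_len -- length of the full protein sequence in residues
    """
    period = len(consensus)
    if any(len(c) != period for c in copies):
        raise ValueError("each aligned copy must match the consensus length")
    # Residues in the TR domain, ignoring alignment gaps.
    residues = sum(len(c.replace("-", "")) for c in copies)
    # Alignment columns where a copy agrees with the consensus.
    matches = sum(a == b for c in copies for a, b in zip(c, consensus))
    return {
        "period": period,
        "copy_number": residues / period,
        "percent_coverage": 100.0 * residues / protein_len,
        "percent_identity": 100.0 * matches / (period * len(copies)),
    }


# Invented example: three copies of a 5-residue repeat in a 56-residue protein.
stats = tr_stats("PPVYK", ["PPVYK", "PPVHK", "PPV-K"], 56)
```

In this toy example the domain contains 14 residues, giving a period of 5, a copy number of 2.8, and 25% coverage of the 56-residue protein.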

Figure 2. TR Domains

To retrieve TR consensus sequences from the database that match a particular class, select the hyperlinked TR class name (e.g. p3vpvyk, see: Fig. 1B or Fig. 2). A list of aligned consensus sequences will be shown (see Fig. 3). The proline backbone category is also given (e.g. P2/P3/P1). To display aligned TR consensus sequences for all TR classes simultaneously, select All from the phylogenetic distribution table (see Fig. 1A).

Figure 3. TR Consensus Alignment

Many of the protein sequences in the curated database represent "master sequences," or sequences chosen to represent two or more highly similar sequences in the database. To retrieve all sequences associated with a particular master sequence, select the blue hyperlinked sequence ID under the Source table (e.g., see Fig. 1C or Fig. 2). As shown for ID '25_14' in Fig. 4, the Sequence Cluster page has three major sections. The master sequence, given in Figure 4A, includes a sequence header organized in the following manner:
>ID:sequence ID | NAME: protein name [family genus species] master sequence identifier (and forward reading frame number); sequence identifiers (and reading frames) for all additional clustered sequences.

You can browse the multiple sequence alignments used to identify this sequence cluster (Fig. 4B). As shown in Figure 4C, all clustered sequences are also listed. Many sequences are cross-referenced to their original databases. Hover your mouse over the sequence identifier (in the sequence list) to determine whether the sequence is hyperlinked.

Figure 4. Sequence Cluster

To explore the phylogenetic distributions of retrieved sequences, select any hyperlinked number in the phylogenetic distribution table (Fig. 1A). This will open up a new page that displays an interactive phylogenetic wheel (e.g., see Fig. 5). If the wheel is behaving oddly, try refreshing your browser.

Figure 5. Phylogenetic Wheel


PlantPro20Fam Statistics

All Protein Sequences (Full/Partial ORFs)

Source                                    Classified  Unclassified  Total
TC (download 10/09)                             4222           559   4781
TA (release 07/07)                              1209           141   1350
NR (download 10/09)                             1394           169   1563
NR (new since 10/09; download 02/11)             397            76    473
JGI (download 02/10)                              53           102    155
Mt3.0 (download 12/09)                            17             0     17
Total                                           7292          1047   8339


Curated "Master" Protein Sequences

Source                                    Total  Signal Peptide  GPI Anchor
TC (download 10/09)                         356             320          40
TA (release 07/07)                          250             222          29
NR (download 10/09)                         343             319          18
NR (new since 10/09; download 02/11)         61              47           2
JGI (download 02/10)                         47              29           0
Mt3.0 (download 12/09)                       17              11           0
Total                                      1074             948          89


Download PlantPro20Fam

coming soon!