PROFESS

Release 1.1
(2009-11-06)







Database Statistics

Function

COG Clusters 9,725
Enzyme Classes 5,093
GO Terms 27,606
Ligands 9,330
PFAM 10,340
Protein Interactions 11,422

Evolution

Essential Genes 0

Structure

CATH Classes 2,178
Protein Structures 56,699
Structure Comparisons 59,520

Sequence

Protein Sequences 292,953
Sequence Comparisons 95,021
Last update: 2009-11-06

Development Team

Contact:    Thomas Triplet


Database Design and Programming
Peter Revesz
Thomas Triplet


Genomics
Mark A. Griep
Robert Powers
Matthew D. Shortridge

Documentation







Getting Started



With the queries


As with most search engines, you type keywords to retrieve data. However, the PROFESSor goes well beyond this simple approach.

Just type a keyword, for example selenocysteine. By default, the PROFESSor suggests entries from all core databases. Suggestions help you refine your query, but selecting one is optional.

For example, when searching for selenocysteine, the PROFESSor will retrieve COG clusters, Enzyme Classes, ligands, and terms from the Gene Ontology.




With the Interface



Advanced Queries



Restricting Search to a Given Database


You may also restrict a query to a particular database of your choice by prefixing the keyword with [KEY], where KEY depends on the database and may be one of the following (note that this list will grow with the number of core databases):

ALL, COG, EC, GO, LIGAND, PDB, PFAM


The key ALL allows the user to search in all core databases. It is implicitly used when no key is given. Hence, it will mainly be useful when using multiple keywords (see below).


For example, the query [COG] selenocysteine will retrieve COG clusters related to selenocysteines.



Alternatively, you can type the ID commonly used to refer to an item. For example, you can type the PDB_ID of a protein structure or an E.C. number. When searching for a ligand, you can also type the formula of the compound.

Note that if you search for cysteine, for example, selenocysteine entries are returned only if no cysteine entry can be found.



Multiple Search


PROFESS may also be queried with several keywords from several databases, combined using boolean logic. For examples of advanced queries, see below.

In regular-expression notation, the general syntax for queries is:

([OR]{0,1} [KEY]{0,1} keywords ([OR] keywords)*)+


KEY depends on the database and may be one of the following (note that this list will grow with the number of core databases):

ALL, COG, EC, GO, LIGAND, PDB, PFAM
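The grammar above can be checked mechanically. The following is a minimal sketch, not PROFESS's own code: it assumes that keys are written in upper case, that a run of keywords is any text without square brackets, and that whitespace between tokens is optional. The name `is_valid_query` is hypothetical.

```python
import re

# Keys listed in the documentation (assumed here to be upper case only).
KEYS = ("ALL", "COG", "EC", "GO", "LIGAND", "PDB", "PFAM")

# Assumption: a run of keywords is bracket-free text with at least one
# non-space character.
_WORDS = r"[^\[\]]*[^\[\]\s][^\[\]]*"
_KEY = r"\[(?:" + "|".join(KEYS) + r")\]"

# ([OR]{0,1} [KEY]{0,1} keywords ([OR] keywords)*)+
_QUERY = re.compile(
    rf"(?:\s*(?:\[OR\])?\s*(?:{_KEY})?\s*{_WORDS}(?:\[OR\]{_WORDS})*)+"
)


def is_valid_query(query: str) -> bool:
    """Return True if the query follows the documented syntax."""
    return bool(query.strip()) and _QUERY.fullmatch(query) is not None


print(is_valid_query("[COG] lyase [OR] primase"))  # a well-formed query
print(is_valid_query("[FOO] lyase"))               # unknown key
```

The check only covers syntax; it says nothing about whether a query returns results.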


By default, all keywords after a [KEY] are treated as a single string for the query. This behavior can be altered by prefixing the keywords with [OR]. The wildcard characters % (any number n of characters, with n ≥ 0) and _ (exactly one character) may be used in a query. A logical AND is performed between successive [KEY] groups, even when the same key is repeated.
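To make the wildcard semantics concrete, here is a small sketch that translates a pattern using % and _ into a Python regular expression. It is an illustration only: the function name `wildcard_to_regex` is hypothetical, and the sketch assumes the pattern must match the whole value, which the documentation does not state for every database.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate % (zero or more characters) and _ (exactly one) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # any number of characters, including none
        elif ch == "_":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


ec_class_4 = wildcard_to_regex("4.%")
print(bool(ec_class_4.fullmatch("4.1.1.1")))  # E.C. number starting with "4."
print(bool(ec_class_4.fullmatch("14.1")))     # does not start with "4."
```

Escaping the literal characters matters here: without it, the dot in "4.%" would itself behave as a wildcard.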


Note that, in general, the following 3 queries will return different results:

[COG] lyase primase returns clusters matching the string "lyase primase",
[COG] lyase [OR] primase returns clusters matching either the string "lyase" or the string "primase",
[COG] lyase [COG] primase returns clusters matching both strings "lyase" and "primase".
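The difference between these three queries can be sketched in a few lines. This is an illustrative model, not the PROFESS implementation: `parse_query` and `matches` are hypothetical names, matching is assumed to be a case-insensitive substring test, and the record is made-up data.

```python
import re

TOKEN = re.compile(r"\[([A-Z]+)\]")


def parse_query(query):
    """Split a query into (key, [alternative strings]) groups."""
    groups = []
    # With one capture group, re.split yields: text, tag, text, tag, text, ...
    pieces = TOKEN.split(query)
    leading = pieces[0].strip()
    if leading:
        groups.append(("ALL", [leading]))      # no key given: ALL is implied
    for tag, text in zip(pieces[1::2], pieces[2::2]):
        text = text.strip()
        if tag == "OR":
            if groups and text:
                groups[-1][1].append(text)     # alternative within the same group
        else:
            groups.append((tag, [text] if text else []))
    return groups


def matches(groups, record):
    """AND across groups, OR across the alternatives inside a group."""
    def field(key):
        return " ".join(record.values()) if key == "ALL" else record.get(key, "")
    return all(
        any(alt.lower() in field(key).lower() for alt in alts)
        for key, alts in groups
    )


record = {"COG": "DNA primase"}  # hypothetical cluster description
for q in ("[COG] lyase primase", "[COG] lyase [OR] primase", "[COG] lyase [COG] primase"):
    print(q, "->", matches(parse_query(q), record))
```

For this record only the [OR] form matches, because the cluster mentions "primase" but not "lyase".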



Examples of queries


[COG] lyase
Returns clusters matching the string "lyase",
 
[COG] 15 [OR] 520 [OR] 369
Returns COG clusters 15, 520 and 369,
 
[COG] 15 [COG] 520 [COG] 369
Returns an empty set because no COG cluster number can be 15, 520 and 369 at the same time,
 
[EC] selenocysteine [OR] selenium
Returns COG clusters in which one or more proteins are related to "selenocysteine" or "selenium" in the Enzyme Classification,
 
[COG] lyase [EC] selenocysteine [OR] selenium
Returns COG clusters containing "lyase" proteins in which one or more proteins are related to "selenocysteine" or "selenium" in the Enzyme Classification,
 
[EC] 4.% [LIGAND] C12 H17
Returns COG clusters with one or more proteins whose E.C. number starts with "4." and with one or more proteins that bind a ligand containing "C12 H17", that is, a ligand with 12 carbon atoms and 17 hydrogen atoms.



Query Assistant


Coming soon...





Core Databases



The following databases are integrated in PROFESS. This list will grow based on user feedback.


Note that the last update and version indicate the state of each database as loaded into PROFESS. Although we will update PROFESS frequently, it may not correspond to the latest version of the database.



CATH Database


CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels: Class (C), Architecture (A), Topology (T) and Homologous superfamily (H). The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures which include computational techniques, empirical and statistical evidence, literature review and expert analysis.

Last update: 04/24/2009
Version: 3.2.0
Website: http://www.cathdb.info/



Clusters of Orthologous Groups (COG) of proteins database


Clusters of Orthologous Groups of proteins (COGs) were delineated by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.

Last update: 09/11/2003
Website: http://www.ncbi.nlm.nih.gov/COG/



Enzyme Classification


Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes by the reactions they catalyse.

Last update: 05/03/2009
Version: 13
Website: http://www.chem.qmul.ac.uk/iubmb/enzyme/



Database of Essential Genes (in E. coli)


Essential genes are those indispensable for the survival of an organism, and therefore are considered a foundation of life. DEG hosts records of currently available essential genes among a wide range of organisms.

Last update: 05/03/2009
Version: 5.2
Website: http://www.essentialgene.org/



Gene Ontology


The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism.

Last update: 05/03/2009
Website: http://www.geneontology.org/



Kyoto Encyclopedia of Genes and Genomes (KEGG) - Ligands


KEGG LIGAND contains our knowledge on the universe of chemical substances and reactions that are relevant to life. It is a composite database consisting of COMPOUND, DRUG, GLYCAN, REACTION, RPAIR, and ENZYME databases, whose entries are identified by C, D, G, R, RP, and EC numbers, respectively.

Last update: 01/09/2009
Website: http://www.genome.jp/kegg/ligand.html



Protein DataBank (PDB)


The Protein Data Bank (PDB) archive is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids. These are the molecules of life that are found in all organisms including bacteria, yeast, plants, flies, other animals, and humans.

Last update: 04/28/2009
Website: http://www.rcsb.org



Protein Families (PFAM) database


The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).

Last update: 07/01/2008
Version: 23.0
Website: http://pfam.sanger.ac.uk/



Protein/Protein interactions in E. coli


Protein–protein interactions play key roles in protein function and the structural organization of a cell. A thorough description of these interactions should facilitate elucidation of cellular activities, targeted-drug design, and whole cell engineering. A large-scale comprehensive pull-down assay was performed using a His-tagged Escherichia coli ORF clone library.

Last update: 01/27/2006
Website: http://genome.cshlp.org/content/16/5/686.abstract



Structural Classification of Proteins (SCOP)


The SCOP database, created by manual inspection and abetted by a battery of automated methods, aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known. As such, it provides a broad survey of all known protein folds, detailed information about the close relatives of any particular protein, and a framework for future research and classification.
Due to copyright issues, we only provide links to retrieve data from the SCOP website rather than reproducing SCOP data on our pages.

Last update: 06/01/2009
Website: http://scop.mrc-lmb.cam.ac.uk/scop/index.html



UniProt Knowledgebase - Taxonomy


Organisms are classified in a hierarchical tree structure. The UniProtKB-Taxonomy database contains every node (taxon) of the tree.

Last update: 04/12/2009
Website: http://www.uniprot.org/taxonomy/





Modules for Protein Functions



Functions


This module aggregates and displays, in a single table, structures from the Protein Data Bank within the current COG cluster, along with taxonomic data from UniProtKB and associated functions from the Enzyme Classification, the Gene Ontology, KEGG-Ligands, and PFAM. The table may be downloaded in CSV format.

Since: v1.0
Data Sources: COG, EC, GO, KEGG-LIGAND, NEWT, PDB, PFAM



Ligands


This module displays details about ligands that bind proteins belonging to the current COG cluster. Buffers, detergents, ions and solvents are listed separately to give the user quicker access to the most relevant ligands. The table may be downloaded in CSV format.

Since: v1.0
Data Sources: COG, KEGG-LIGAND, PDB



Protein Interactions


This module lists Protein Interactions in E. coli from the Protein/Protein interactions database within the current COG cluster.

Since: v1.0
Data Sources: COG, PIN



Summary


This module aggregates and generates statistics using data from the Enzyme Classification, the Gene Ontology, and PFAM. For each of the three classifications, we compute the number of proteins within each class (within the current COG cluster) and represent the protein distribution as pie charts. This allows the user to quickly differentiate relevant classes from outliers. Classes are sorted by decreasing number of proteins. The darker the color in the pie chart, the higher the number of proteins.

The aggregate data may be downloaded as:
     • CSV format, by clicking on DL in the title for each of the three classifications.
     • PNG images in high-definition (1000x300), by clicking on the thumbnail.
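The counting step behind these charts can be sketched as follows. This is an illustration under assumptions: the input is taken to be (protein, class) pairs, each pair is counted once, and the function name `class_distribution` and the sample identifiers are hypothetical.

```python
from collections import Counter


def class_distribution(assignments):
    """Count proteins per class, sorted by decreasing number of proteins.

    `assignments` is an iterable of (protein_id, class_label) pairs.
    Duplicate pairs are dropped so that each protein counts once per class.
    """
    unique_pairs = dict.fromkeys(assignments)  # de-duplicate, keep input order
    counts = Counter(label for _, label in unique_pairs)
    return counts.most_common()


# Made-up E.C. assignments for proteins in one cluster.
pairs = [("1ABC", "4.1.1.1"), ("2DEF", "4.1.1.1"), ("3GHI", "2.7.7.7")]
for label, n in class_distribution(pairs):
    print(label, n)
```

A class that holds most of the proteins appears first, which is what lets a reader separate dominant classes from outliers at a glance.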

Since: v1.0
Data Sources: COG, EC, GO, PDB, PFAM



Modules for Protein Evolution



Essential Genes


This module shows Essential Genes in E. coli from DEG within the current COG cluster. Genes are displayed with corresponding protein structures from the PDB (see module Sequence Similarities for more details about the association Gene/Structure).

Since: v1.0
Data Sources: COG, DEG, NEWT, PDB



Sequence-based Phylogenetic Trees


This module shows the unrooted phylogenetic tree generated using protein chain sequences from the PDB.

First, the sequences were aligned using ClustalW2. Second, the tree was computed by ClustalW2 using the multiple sequence alignment as a guide. The final image was generated using DrawTree from the PHYLIP package.

The phylogenetic tree may be downloaded as:
     • PHYLIP format, by clicking on DL in the title of the module
     • PNG images in high-definition (3000*3000), by clicking on the thumbnail.

Since: v1.0
Data Sources: COG, PDB



Structure-based Phylogenetic Trees


This module shows the unrooted phylogenetic tree generated using protein structures from the PDB.

First, the structures were aligned using MAMMOTH-mult, a tool for multiple structure alignments. Second, the tree was computed by ClustalW2 using the multiple structure alignment as a guide. The final image was generated using DrawTree from the PHYLIP package.
The phylogenetic tree may be downloaded in PHYLIP format (by clicking on DL in the title) or as PNG images in high-definition (3000*3000) by clicking on the thumbnail.

Since: v1.1
Data Sources: COG, PDB



Modules for Protein Structures



Structures


This module aggregates, in a single table, data from CATH, the Protein Data Bank, SCOP and the UniProtKB Taxonomy. Note that due to copyright issues, we only provide links to retrieve data from the SCOP website rather than reproducing SCOP data on our pages. The table may be downloaded in CSV format.

Since: v1.0
Data Sources: CATH, COG, NEWT, PDB, SCOP



Structure Comparisons


The pairwise structure comparison tool DaliLite was used to measure the backbone structure similarity of proteins within each orthologous cluster defined by the COG database. All-against-all pairwise structural comparisons within and between the Proteobacteria and Firmicutes were carried out for all COGs where both phyla were represented by a minimum of two organisms. The Dali Z-scores were normalized to calculate a Fractional Structure Similarity (FSS) score:

FSS = Z_{AB} / Z_{AA}


where Z_{AB} is the Dali Z-score when protein B is compared to protein A and Z_{AA} is the Z-score when protein A is compared to itself. Thus, Z_{AA} represents the maximum Z-score that can be achieved for perfect similarity. FSS provides a simple quantitative measure of how far the two proteins have diverged in their structures.
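As a worked illustration of the normalization, the sketch below computes FSS from two Z-scores. The function name and the numeric values are hypothetical and are not taken from PROFESS data.

```python
def fractional_structure_similarity(z_ab: float, z_aa: float) -> float:
    """Normalize a pairwise Dali Z-score by the self-comparison Z-score."""
    if z_aa <= 0:
        raise ValueError("the self-comparison Z-score must be positive")
    return z_ab / z_aa


# Made-up scores: B against A gives 20.0, A against itself gives 40.0.
print(fractional_structure_similarity(20.0, 40.0))  # 0.5
```

Because the self-comparison score is the maximum achievable for protein A, the ratio is a relative measure that can be compared across proteins of different sizes.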



Since: v1.0
Data Sources: COG, NEWT, PDB



Modules for Protein Sequences



Sequences


This module lists protein chain sequences within the COG cluster. Sequences may be downloaded in FASTA format.
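A downloaded FASTA file can be read with a few lines of code. The sketch below is a generic FASTA reader: the function name `read_fasta` is hypothetical, and the header format shown in the sample is an assumption, since the exact headers in PROFESS downloads are not documented here.

```python
def read_fasta(lines):
    """Parse FASTA text into a dict mapping each header to its sequence."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # start a new record
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


sample = [">1ABC:A", "MKT", "AYI", "", ">2DEF:B", "GSH"]  # made-up records
print(read_fasta(sample))
```

The function accepts any iterable of lines, so an open file object can be passed in directly.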

Since: v1.0
Data Sources: COG, NEWT, PDB



Sequence Similarities


By using the COG/KOG databases to connect sequences to structures, we created a bridge between sequence-based databases (protein interactions, essentiality...) and structure-based databases (CATH, EC, PFAM...). The COG database includes 4,876 unique COGs from 66 completed prokaryote genomes and 4,852 unique KOGs from 7 eukaryote genomes. The Basic Local Alignment Search Tool (BLAST), implemented with the Protein Mapping and Comparison Tool (PROMPT v4), was used to match sequences in the PDB with COG and KOG sequences.

As of April 28, 2009, the BLAST search gave a hit rate of 97.7% of total PDB protein sequences (53,933 of 55,159 total sequences) matching the COG or KOG databases, using a BLAST expectation (E-value) cut-off of 10^-9. Of the 53,933 PDB/COG hits, 54.2% matched with greater than 50% sequence identity (17% gave 100% identity).
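The filtering described above can be sketched as follows. This is an illustration, not the PROMPT pipeline: the `Hit` record, the function names, and the sample values are hypothetical, and treating the cut-off as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    pdb_chain: str   # PDB sequence that was searched
    cog_id: str      # best-matching COG or KOG
    evalue: float    # BLAST expectation value
    identity: float  # percent sequence identity


def significant_hits(hits, max_evalue=1e-9):
    """Keep hits at or below the E-value cut-off (inclusive by assumption)."""
    return [h for h in hits if h.evalue <= max_evalue]


def share_above_identity(hits, min_identity=50.0):
    """Fraction of hits with more than `min_identity` percent identity."""
    if not hits:
        return 0.0
    return sum(h.identity > min_identity for h in hits) / len(hits)


hits = [  # made-up results
    Hit("1ABC:A", "COG0001", 1e-30, 100.0),
    Hit("2DEF:A", "COG0002", 1e-12, 42.0),
    Hit("3GHI:A", "COG0003", 1e-3, 80.0),
]
kept = significant_hits(hits)
print([h.pdb_chain for h in kept], share_above_identity(kept))
```

In this sample the third hit is discarded by the E-value cut-off even though its identity is high, which shows that the two criteria are applied independently.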

The table may be downloaded in CSV format.

Since: v1.0
Data Sources: COG, PDB