Assessment of the probabilities for evolutionary structural changes in protein folds

Juris Viksna, Institute of Mathematics and Computer Science, University of Latvia, Riga, Latvia
David Gilbert, Bioinformatics Research Centre, Glasgow University, Glasgow, UK


We propose a model of “evolutionary mutations” for protein structures. The choice of particular mutations is based on structure evolution models proposed by biologists, as well as on topology changes in protein structures that have been observed in the Protein Topology Database. Currently the proposed set contains the following mutations:

An algorithm for computing the evolutionary distance under the given set of structure mutations has been developed and implemented. The algorithm is mainly intended for estimating the comparative probabilities of different types of structure mutations, and it allows fast all-against-all comparisons of large sets of protein structures; however, the evolutionary distance that can be computed in practice is limited.
    Experiments on several representative sets of protein domains (according to the CATH classification) have been performed in order to estimate the comparative probabilities of different types of structure mutations.

The main conclusions:

A heuristic algorithm for structure comparison, which measures an evolutionary distance between two protein domains on the basis of proposed structural changes and their comparative probabilities, is currently under development.

Representative datasets used for assessment of mutation probabilities

The experiments were performed on representative sets of protein domains from CATH class 2 (mainly beta) and CATH class 3 (mixed alpha-beta). The following files are provided here:

   Domain names - a text file containing the domain names used in the representative sets. There are 685 protein domains in the representative set for CATH class 2 and 1182 domains in the representative set for CATH class 3. Domain names are strings of 7 characters and are separated by linebreaks.
   Topologies - a text file containing the topologies of the representative domains. Each line in this file describes one domain and contains the domain name, the sequence of secondary structure elements (in the direction from the C to the N terminus), and information about hydrogen bonds between strands. For example, the entry "1an8002 EEHEHEEE 1~2,1~6,2~4,4-7,6~8" corresponds to CATH domain "1an8002" and shows that it is composed of two strands (EE), followed by a helix (H), a strand (E), a helix (H), and three strands (EEE). Positions are numbered from 1 and include helices. A pair written with "~" is adjacent (connected by hydrogen bonds) and antiparallel: here the strands in positions 1 and 2, 1 and 6, 2 and 4, and 6 and 8. A pair written with "-" is adjacent and parallel: here the strands in positions 4 and 7.
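The entry format described above can be read with a few lines of Python. This is only a minimal sketch: the names `StrandPair` and `parse_topology_line` are hypothetical, and it assumes that the three fields are separated by whitespace and that a domain without strand pairs simply omits the third field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrandPair:
    first: int      # 1-based position of the first strand
    second: int     # 1-based position of the second strand
    parallel: bool  # True for "-" (parallel), False for "~" (antiparallel)


def parse_topology_line(line):
    """Split one topology line into (domain name, element string, strand pairs)."""
    fields = line.split()
    name, elements = fields[0], fields[1]
    pairs = []
    if len(fields) > 2:  # assumption: the third field may be absent
        for token in fields[2].split(","):
            separator = "~" if "~" in token else "-"
            a, b = (int(part) for part in token.split(separator))
            # Positions count helices too, so both must point at an "E".
            for position in (a, b):
                if elements[position - 1] != "E":
                    raise ValueError(f"position {position} is not a strand in {name}")
            pairs.append(StrandPair(a, b, parallel=(separator == "-")))
    return name, elements, pairs


name, elements, pairs = parse_topology_line("1an8002 EEHEHEEE 1~2,1~6,2~4,4-7,6~8")
parallel_pairs = [(p.first, p.second) for p in pairs if p.parallel]
```

For the example entry, `parallel_pairs` holds the single pair (4, 7) and the remaining four pairs are antiparallel.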

Other types of files contain symmetric matrices with rows separated by linebreaks and columns separated by tab symbols. The size of each matrix corresponds to the number of domains. The first row of each matrix contains the domain names corresponding to the columns, and the first entry in each row is the domain name corresponding to that row.
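A matrix file with this layout could be loaded as sketched below. The function name `read_matrix` is hypothetical, the toy input is invented purely for illustration (its names and values are not taken from the data files), and the sketch assumes that the header row may or may not begin with an empty cell above the row-name column.

```python
import io


def read_matrix(handle):
    """Read a tab-separated, labelled matrix into a dict of dicts of strings."""
    rows = [line.rstrip("\r\n").split("\t") for line in handle if line.strip()]
    header = rows[0]
    if header and header[0] == "":  # assumption: optional empty corner cell
        header = header[1:]
    matrix = {}
    for row in rows[1:]:
        row_name, values = row[0], row[1:]
        if len(values) != len(header):
            raise ValueError(f"row {row_name} has {len(values)} entries, "
                             f"expected {len(header)}")
        matrix[row_name] = dict(zip(header, values))
    return matrix


# Invented two-domain example; real files use 7-character CATH domain names.
toy = io.StringIO("\tdomA\tdomB\ndomA\t100\t37\ndomB\t37\t100\n")
matrix = read_matrix(toy)

# The matrices are described as symmetric, which is cheap to verify.
symmetric = all(matrix[a][b] == matrix[b][a] for a in matrix for b in matrix[a])
```

Entries are kept as strings here because their interpretation differs between the file types.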

   Sequence similarity - matrix entries are numbers from 0 to 100 representing the normalised sequence similarity between domains as computed by the Smith-Waterman algorithm.
   Helix distances - matrix entries contain the numbers of mutations of type H observed between structurally related proteins (the value "999" means that no structural relation has been detected).
   Strand distances - matrix entries contain the numbers of mutations of all other types (except H) observed between structurally related proteins (the value "999" means that no structural relation has been detected).
   Mutations - matrix entries are comma-delimited lists containing the types of mutations (except H) observed between structurally related proteins (the value "n" means that no structural relation has been detected and the value "i" means that the structures are identical).
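The special values in these matrices are easy to mishandle (for instance, by averaging "999" as if it were a real distance), so a small decoding step may help. The sketch below uses hypothetical helper names, and the mutation labels "X" and "Y" are placeholders rather than actual mutation types. It also assumes that "n" and "i" only ever appear alone in an entry.

```python
NO_RELATION = 999  # value used in the helix and strand distance matrices


def decode_distance(entry):
    """Return the mutation count, or None when no structural relation was detected."""
    value = int(entry)
    return None if value == NO_RELATION else value


def decode_mutations(entry):
    """Return a list of mutation type labels, [] for identical structures,
    or None when no structural relation was detected."""
    entry = entry.strip()
    if entry == "n":
        return None
    if entry == "i":
        return []
    return [label.strip() for label in entry.split(",")]


# "X" and "Y" are placeholder labels, not real mutation types.
examples = [decode_mutations(e) for e in ("n", "i", "X,Y")]
distances = [decode_distance(e) for e in ("0", "3", "999")]
```

Returning None for undetected relations keeps them distinct from a genuine distance of zero, which matters when the values are later aggregated.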

Data files are available here