Protomata-Learner version 0.07 (warning: beta version)

Given a sample of (unaligned) sequences belonging to a structural or functional family of proteins, Protomata-Learner infers automata characterizing the family. Automata are graphical models representing a (potentially infinite) set of sequences. They can be used to gain new insights into the family when classical multiple sequence alignments are insufficient, or to search for new family members in sequence data banks. Compared with classical sequence patterns (such as PSSMs, profile HMMs, or PROSITE patterns), automata offer a finer level of expressivity, which makes it possible to model heterogeneous sequence families.

Quick start guide to Protomata-Learner:

1. Prepare sequence sample.
To obtain better results, quality matters more than quantity here. Prefer sequences whose membership in the family has been established experimentally and discard suspicious sequences. To speed up the program and avoid bias, we also advise you to reduce redundancy in your sequence set: for instance, you can use the program "Decrease Redundancy" by Cédric Notredame, available at ExPASy (its default options are usually fine). You can also start from a non-redundant database such as UniRef90.
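As a rough illustration of the simplest form of redundancy reduction, the sketch below removes exact duplicate sequences from a FASTA sample. It is only a stand-in under stated assumptions: it is not the "Decrease Redundancy" program, it does not detect near-identical sequences, and the function names and sample data are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_exact_duplicates(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


# Hypothetical sample: seq2 duplicates seq1 (lower case), seq3 differs.
sample = """>seq1
MKHCCLA
>seq2
mkhccla
>seq3
MKHCCLV
"""
unique = drop_exact_duplicates(read_fasta(sample))
print([header for header, _ in unique])  # ['seq1', 'seq3']
```

For real samples, a dedicated tool that works on sequence identity thresholds remains the better choice.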

2. Run Protomata-Learner.
Default parameters have been chosen as a good compromise. If you want a characterization based on "domain similarity", you can increase the maximal fragment size to 20 or more in order to favor the emergence of longer blocks. Conversely, if you think that the characterization should rather be based on "amino-acid similarity", you can lower the maximal fragment size to 10 or less to get shorter blocks.

Tip: filling in the email field lets you receive an email when the job is finished, with a URL to retrieve the results. This is particularly useful if you have to close your browser while the job is still running.

3. Look at results and generate new automata if needed.
To view automata and related alignments, several formats are available (opening links in new windows can be convenient for viewing several results side by side). Adjust the quorum if needed and, optionally, proceed to the identification and generalization of positions with respect to physico-chemical properties. Additional results are faster to compute than the first ones, so don't hesitate to play with these parameters.

4. In case of unsatisfactory results, try changing the "Partial Local Multiple Alignments" parameters from the first page.
If the vertical delimitation of blocks is too fuzzy and you want stronger consensuses in the blocks, you can increase the threshold for the significance of similarity, but this can lead to trivial results. Another possibility is to switch the consensus mode to strong: in this mode, all the fragments in a block are required to be significantly similar to all the other fragments in the block. Since this is a strong requirement, the significance threshold for fragment similarity can be lowered at the same time (typically to 3 or 1). These latter settings will result in finer characterizations but can slightly increase computation time.

5. Use automata to search for new sequences in public data banks or in a personal set of sequences. Choose an automaton and follow the link to the Protomata-Scan tool.

Tip for tuning the score threshold before scanning a whole data bank: BLAST a representative member of your family and select the first hit whose annotation shows that its sequence cannot belong to the family. By scanning this sequence with a high threshold score, you will get the score for the acceptance of this sequence, which can then be used to set the score threshold for subsequent scans.

Detailed help

Personal parameters

Mail :
Your email address is optional, but you should provide it if you want to receive your results by email and be able to close your Web browser.


Support for DNA sequences is still experimental. If you want to test it anyway, you can provide DNA sequences in FASTA format and check either the DNA box or, if your DNA sequences are likely to be totally or partially translated, the DNA (with coding regions) box.

Partial Local Multiple Alignments

The first step of Protomata Learner consists in searching for a set of locally conserved regions which are characteristic of the family, resulting in a partial local multiple alignment (abbreviated to PLMA).

You can either set the parameters of the PLMA search or import an existing PLMA file. The latter option allows you to reuse a PLMA saved in a previous session and to skip this time-consuming part of the program (the task is comparable to multiple sequence alignment). It can also be used to generate Protomata from a PLMA obtained by means other than our algorithm.
The parameters of the PLMA search are the following:
Threshold for significance of fragment similarity
The search for the blocks of locally conserved regions is based on significantly similar fragment pairs computed by Dialign 2.2.2 [1]. For each fragment pair, a weight score (denoted w in Dialign [1]) related to the significance of the fragment pair similarity can be computed. Only fragment pairs with a weight higher than the threshold parameter (significant fragment pairs) will be considered to build the PLMA blocks.
Consensus mode
In strong consensus mode, all the fragments in a block are required to be significantly similar to all the other fragments in the block (clique) whereas, in weak consensus mode, two fragments that are not significantly similar may be in the same block as long as it is possible to connect them by a chain of significant fragment pairs (connected component). Strong mode usually gives finer results than weak mode but is slightly more time-consuming. We advise you to begin with weak mode, which is usually sufficient, and to switch to strong mode if the results are not satisfactory. Weak mode should still be better suited to sets of sequences showing internal evolutionary derivation.
Minimum fragment size and maximal fragment size
By limiting the size of the significant fragments to consider, one can influence the shape of the resulting automata: bigger fragments induce longer blocks and vice versa (note: limiting the maximal size of the fragments does not limit the maximal size of the resulting blocks). Since the weight function tends to favor longer fragments, tuning can be done essentially by setting the maximal fragment size: bigger values favor characterizations based on domain similarity while smaller values favor characterizations based on amino-acid similarity. The minimum fragment size should have little effect on the result, but can be used to limit the number of significant fragments and thus speed up the search.
Some indicative sets of parameters:

Parameter set       Significance threshold   Consensus   Min size   Max size
Quality             1 or 3                   strong      1          15
Domain              5 or 10                  weak        10         20 or more
Domain strong       5                        strong      10         20 or more
Amino-acid strong   1 or 3                   strong      1          10
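The difference between the strong and weak consensus modes described above can be illustrated on a toy graph whose nodes are fragments and whose edges are significant fragment pairs. This is only a sketch of the two graph notions (clique and connected component), not the algorithm used by Protomata-Learner; the fragment names and function names are hypothetical.

```python
from itertools import combinations


def connected_components(nodes, edges):
    """Group nodes linked by a chain of edges (weak consensus view)."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


def is_clique(group, edges):
    """True if every pair in the group is directly linked (strong consensus view)."""
    linked = {frozenset(e) for e in edges}
    return all(frozenset(pair) in linked for pair in combinations(group, 2))


fragments = ["f1", "f2", "f3"]
# Hypothetical significant pairs: f1-f3 is NOT significantly similar.
significant_pairs = [("f1", "f2"), ("f2", "f3")]

weak_blocks = connected_components(fragments, significant_pairs)
print(len(weak_blocks))                              # 1: all three chained together
print(is_clique(weak_blocks[0], significant_pairs))  # False: f1-f3 is missing
print(is_clique({"f1", "f2"}, significant_pairs))    # True
```

Here the three fragments would share one block in weak mode, while strong mode could not keep f1 and f3 in the same block.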


Quorum :
The quorum is the required minimal number of sequences per block. By default, three automata are displayed, with quorums set respectively to 3/3, 2/3 and 1/3 of the number of sequences. If you already have an idea of the ideal quorums, you can set your own values (in that case, only one automaton will be displayed).
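The default quorums can be sketched as follows. The rounding rule is an assumption (fractions are rounded up here; the page does not say how the tool rounds), and the block names and support counts are hypothetical.

```python
def default_quorums(n_sequences):
    """Quorums at 3/3, 2/3 and 1/3 of the sample size, rounded up (assumed rounding)."""
    return [(n_sequences * k + 2) // 3 for k in (3, 2, 1)]


def blocks_meeting_quorum(block_support, quorum):
    """Keep the blocks whose number of supporting sequences reaches the quorum."""
    return [name for name, support in block_support.items() if support >= quorum]


# Hypothetical blocks with the number of sequences taking part in each.
block_support = {"block_A": 10, "block_B": 7, "block_C": 4, "block_D": 2}

for quorum in default_quorums(10):
    print(quorum, blocks_meeting_quorum(block_support, quorum))
# 10 ['block_A']
# 7 ['block_A', 'block_B']
# 4 ['block_A', 'block_B', 'block_C']
```

Lowering the quorum keeps more blocks, which is why the three default automata range from the most conserved view to the most detailed one.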
Identification and generalization of positions with respect to physico-chemical properties :
Checking this box activates the generalization of positions with respect to physico-chemical properties. First, a set of amino acid groups has to be chosen. You can select one of the predefined sets (FIXME: list them here, add a reference to Taylor, and make these files downloadable) or use your own set (one line per group, without spaces between amino acids).
Then, the threshold parameters for the likelihood test have to be set between 0 and 1 (0 corresponds to forced generalization and 1 to always rejecting the generalization). There are two thresholds to set: the first (g) controls generalization to the smallest group including all the amino acids of the current position. The second (s) is only used if generalization to a group failed: it then controls generalization to the wild card X (representing any amino acid).
Some examples of extreme values:
g   s
0   1   generalization to the group in which the amino acid set is included (others not changed)
0   0   generalization to the group in which the amino acid set is included (X for others)
1   0   strict identification of groups (X for others)
1   1   no generalization

You can set values close to these ones according to the desired kind of generalization.
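The notion of "smallest group including all the amino acids of the current position" can be sketched as below, using the one-group-per-line file format described above. The groups shown are hypothetical and are not one of the predefined sets. The likelihood test controlled by g and s is not modelled here: the sketch only finds the candidate group.

```python
def parse_groups(text):
    """Read one group per line, amino acids written without spaces."""
    return [frozenset(line.strip()) for line in text.splitlines() if line.strip()]


def smallest_enclosing_group(position, groups):
    """Return the smallest group containing every amino acid of the position, or None."""
    candidates = [g for g in groups if set(position) <= g]
    return min(candidates, key=len) if candidates else None


# Hypothetical groups, for illustration only.
groups = parse_groups("""
KRH
DE
KRHDE
ILVM
""")

print(sorted(smallest_enclosing_group("KH", groups)))  # ['H', 'K', 'R']
print(sorted(smallest_enclosing_group("KD", groups)))  # ['D', 'E', 'H', 'K', 'R']
print(smallest_enclosing_group("KL", groups))          # None
```

In the last case no group contains both K and L, which corresponds to the situation where only generalization to the wild card X (controlled by s) remains possible.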
Description of your request :
It allows you to add information about your request to your future results.

Legend for Protomata and PLMA views

Protomata are automata with 3 types of states: characteristic, gap or exception. As for automata, a sequence is accepted (or recognized) by a protomaton if it can be read in its entire length, beginning in the start state and finishing in the end state.
Conventions used to represent the 3 types of states in protomata:

Characteristic region reading a K or an H followed by two C (derived from a PLMA block with greater support than the quorum).
Gap state reading any sequence of amino acids (corresponding to the sequence segments linking the source to the target characteristic states, with segment lengths ranging from 58 to 64 amino acids).
Set of exception paths, each reading one of the sequence segments linking the source to the target characteristic states, with segment lengths ranging from 73 to 78 amino acids.
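The acceptance rule stated above (a sequence must be read over its entire length, from the start state to the end state) can be sketched for a single linear path of gap and characteristic states by translating the path into a regular expression. This is a simplification under stated assumptions: real protomata may branch and contain exception paths, which are not modelled here, and the path, gap lengths and function names are hypothetical.

```python
import re


def path_to_regex(path):
    """Translate a linear path of states into a compiled regular expression."""
    parts = []
    for kind, spec in path:
        if kind == "gap":
            low, high = spec
            parts.append(".{%d,%d}" % (low, high))
        elif kind == "characteristic":
            # One character class per position of the characteristic region.
            parts.append("".join("[%s]" % allowed for allowed in spec))
        else:
            raise ValueError("unknown state type: %r" % kind)
    return re.compile("".join(parts))


# Hypothetical toy path: short gap, then K or H followed by two C, then a short gap.
toy_path = [
    ("gap", (2, 4)),
    ("characteristic", ["KH", "C", "C"]),
    ("gap", (0, 3)),
]
pattern = path_to_regex(toy_path)


def accepts(sequence):
    """A sequence is accepted only if it is read over its entire length."""
    return pattern.fullmatch(sequence) is not None


print(accepts("MAHCCLV"))     # True: gap "MA", region "HCC", gap "LV"
print(accepts("MAHCCLVGGG"))  # False: the trailing gap would need 5 residues
print(accepts("MARCCLV"))     # False: R is neither K nor H
```

Using `fullmatch` rather than `search` is what enforces reading the whole sequence, as opposed to finding the characteristic region anywhere in it.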

PLMA views show the alignment of the training sequences by a protomaton.
Each sequence is given a number used to label the transitions related to the sequence. Since, in this view, there is only one sequence per gap or exception state, positions in the sequence are given instead of length of segments:
Positions 1 to 64 of sequence 1 in gap region
Positions 1 to 73 of sequence 4 in exception region

A toy example of Protomaton and associated PLMA:

Sequences 1, 2 and 3 are accepted through the path consisting of a gap, a characteristic region and a second gap, whereas sequences 4 and 5 are accepted as exceptions.

[1] B. Morgenstern (1999). DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15, 211-218.


A paper describing Protomata Learner will soon be available. Protomata Learner improves upon its predecessors Protomata-L and Protomata-CL with a new algorithmic approach, but relies on the same fragment merging ideas, which were introduced in the following papers:

- Learning Automata on Protein Sequences, François Coste and Goulven Kerbellec, JOBIM 2006.

- A Similar Fragments Merging Approach to Learn Automata on Proteins, François Coste and Goulven Kerbellec, ECML 2005.

In French :

- Problème d'optimisation de recherche de cliques pour caractériser des familles de protéines, François Coste and Goulven Kerbellec, ROADEF 2007

- Apprentissage d'automates par fusions de paires de fragments significativement similaires et premières expérimentations sur les protéines MIP, François Coste, Goulven Kerbellec, Boris Idmont, Daniel Fredouille and Christian Delamarche, JOBIM'04


François Coste
Laetitia Guillot
Thi Hong Hanh Hoang
Boris Idmont
Goulven Kerbellec

Thanks also to the authors of Dialign2, GABIOS, ZGRviewer and Graphviz programs or packages.

Please send questions and comments about Protomata-Learner to fcoste@irisa.fr
