SMaRTFinder

What is it?
Download
License
Acknowledgements
Troubleshooting
Contact Information
Related links

What is it?

SMaRTFinder is a command line tool to search biosequences for structured motifs (or structured models), that is lists of patterns separated by distance intervals. An example of a DNA structured motif is AG[CG]TAG(1)<10,50>TGG<23,78>AC[ACG]G, which specifies a subsequence matching the first motif with at most one error, followed 10 to 50 bp further by an exact occurrence of the second motif and 23 to 78 bp further by an exact match of the third motif. You can also specify that a match where, say, only two out of three motifs are found, is a valid match. SMaRTFinder computes relative distances between two (possibly non-adjacent) component patterns as described in the companion paper Structured Motifs Search, which describes the algorithms used by SMaRTFinder.

SMaRTFinder reads a sequence in FASTA format and a structured model specification, either from a file or from the standard input, then searches for exact/approximate occurrences of the structured model (in one or both strands, if the sequence is DNA) and outputs a file in GFF format, which can be visualized, for instance, by Apollo.

See the included README file for more detailed information.

Download

Current version is DR4 (Developer Release), released on February 21, 2007.

License

SMaRTFinder is free software released under the GNU General Public License. See the included files for details.

Acknowledgements

We wish to thank prof. Stefan Kurtz, who re-licensed his Wotd suffix tree code under the Clarified Artistic License allowing us to include it in SMaRTFinder, and Simone Scalabrin, who implemented a prototype version of SMaRTFinder.

Troubleshooting

SMaRTFinder does not compile
SmaRTFinder is written in standard C++ (apart from Kurtz's code which is in C), so it should compile on any ISO compliant compiler. It will not compile with g++ 2.x (g++ 3.x or above is needed), and probably (not tested) with MS Visual C++ 6.0 (7.0 or above is needed). The reason is that g++ 2.x does not fully support the using keyword, and MS VC++ 6.0 does not support covariant return types. SMaRTFinder uses both features.

So far, SMaRTFinder has only been tested under Linux, Mac OS X, Sun OS and Windows XP.

SMaRTFinder crashes
SMaRTFinder is still under development, so many bugs may still be around, although much care has been devoted to avoid them. However, SMaRTFinder is not totally safe from out of memory conditions. Keep in mind that you may undergo shortage of memory if you process very big files: each occurrence of a component pattern in a structured model requires the allocation of approximately 28 bytes, so if you find one million matches, you will need more or less 30Mb of free space. Besides, suffix trees are eager of memory: the Wotd algorithm by Stefan Kurtz is one of the most space efficient constructions: it needs, on average, 10 bytes per processed symbol. So, if your sequence is a 30Mb chromosome, you will need about 300Mb of free memory to entirely build its suffix tree. You can save space by building the suffix tree lazily (which is the default behaviour of SMaRTFinder).
SMaRTFinder eats a lot of memory
See also note above. If you use SMaRTFinder continuously for a while, you may notice that the portion of memory it allocates tends to increase (or, at least, to never decrease). This is due to a technique for memory allocation known as a "memory pool", which has been implemented for faster performance. Basically, it means that SMaRTFinder allocates chunks of memory for its operation which are not released to the system until the program quits. This is not a bug, but the correct behaviour of the program: SMaRTFinder never uses more memory than the maximum amount of memory needed for a user search. Possibly, better memory management strategies will be implemented in the future.
SMaRTFinder is slow when searching for approximate matches
Currently, SMaRTFinder implements a naive approximate matching algorithm (Sellers' algorithm) which is suboptimal, being quadratic (the best approximate matching algorithms are linear on worst case, or even sublinear on average). This will change in the future.

Contact Information

Nicola Vitacolonna.

Related links

SMaRTFinder has similar features as the Anrep Pattern Matching System by Gene Myers. There are some differences, though (apart from performance considerations): Anrep does not allow searching for partial matches, i.e. matches where only a fraction of all the component patterns is present. On the other hand, Anrep implements an approximate network expression matching algorithm, which SMaRTFinder, currently, does not (if you want to contribute with such an implementation, you are welcome!).