Alignment-free (AF) sequence analysis tools have spread rapidly through biological research. Because these programs can run many hundreds of times faster than comparable alignment-based approaches, they have been applied to problems such as NGS analysis, whole-genome phylogeny, and the identification of recombined and horizontally transferred genes, among many others. Because of this wide range of applications, benchmarking alignment-free predictions remains a difficult challenge for method developers and users.
The AFproject service aims to simplify and standardize alignment-free benchmarking. For users, the benchmarks provide a way to identify the most effective methods for the problem at hand.
Characterize the performance of all well-established AF programs under different evolutionary scenarios.
Create a catalogue of the most effective methods for the problem at hand.
Support developers during the method implementation process by letting them test their tools at different stages of progress and by offering the opportunity to disseminate the results publicly.
Provide a platform for defining new datasets as technology develops: users and developers can request changes or new datasets.
The server benchmarks AF tools against 11 reference datasets, which can be classified into 5 application categories.
Reference short name
Low sequence identity (<40%) | High sequence identity (>40%)
Reference gene trees
Whole Genome Phylogeny
29 E. coli/Shigella strains (unassembled genomes) | 29 E. coli/Shigella strains (assembled genomes) | 14 plant species (unassembled genomes) | 14 plant species (assembled genomes)
Horizontal Gene Transfer
27 E. coli/Shigella genomes | 7 Yersinia species | 33 simulated E. coli genomes
How does it work?
The AF method developer downloads the FASTA dataset for a given category (1-5) from the server.
The developer uses the downloaded dataset as input to their alignment-free program. The output file should contain all-versus-all pairwise sequence distances, in either TSV or Phylip format (see Formats below).
The developer uploads the output file to the server.
The server benchmarks the uploaded predictions and presents a report showing the submitted method's performance and how it compares with other available methods. The developer can choose to make the report publicly available.
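The second step above can be sketched in Python as follows. This is only an illustration of producing all-versus-all distances from FASTA input: the k-mer Jaccard distance is a toy stand-in rather than any benchmarked method, and the function names (read_fasta, kmer_set, jaccard_distance, all_vs_all_tsv) and the choice k=3 are hypothetical.

```python
from itertools import combinations


def read_fasta(text):
    """Parse FASTA text into {identifier: sequence}, keeping input order."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence identifier.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def kmer_set(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """Toy distance: 1 - |intersection| / |union| of two k-mer sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def all_vs_all_tsv(records, k=3):
    """Return one 'id1<TAB>id2<TAB>distance' line per unordered pair."""
    profiles = {name: kmer_set(seq, k) for name, seq in records.items()}
    lines = []
    for x, y in combinations(records, 2):
        dist = jaccard_distance(profiles[x], profiles[y])
        lines.append(f"{x}\t{y}\t{dist:.3f}")
    return "\n".join(lines) + "\n"


fasta = ">A\nACGTACGT\n>B\nACGTACGA\n>C\nTTTTTTTT\n"
print(all_vs_all_tsv(read_fasta(fasta)), end="")
```

A real submission would replace jaccard_distance with the developer's own measure; only the shape of the output (one line per sequence pair) is the point here.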
Benchmarks for all 11 datasets accept pairwise sequence distances in either of the following formats:
TSV (tab-separated values) format
A simple text file with three tab-separated columns. The first two columns store the identifiers of the two sequences being compared. The third column holds the numerical distance value for that comparison. A TSV file can have more than 3 columns (the extra columns are ignored).
Example of TSV format (4 sequences)
A	B	8.876
A	C	6.120
A	D	4.321
B	C	5.231
B	D	3.983
C	D	0.663
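Before uploading, it can help to check a TSV file locally. The sketch below reads the three required columns, ignores any extra ones, and lists the sequence pairs that have no distance. The function names are hypothetical, and whether the server expects each pair once or in both orders is not stated here, so the check treats a pair as present if it appears in either order.

```python
from itertools import combinations


def parse_distance_tsv(text):
    """Return {(id1, id2): distance} from tab-separated text."""
    distances = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(
                f"line {line_no}: expected at least 3 tab-separated columns"
            )
        # Only the first three columns are used; extra columns are ignored.
        id1, id2, value = fields[0], fields[1], fields[2]
        distances[(id1, id2)] = float(value)
    return distances


def missing_pairs(distances):
    """List unordered identifier pairs that have no distance in either order."""
    ids = sorted({name for pair in distances for name in pair})
    seen = {frozenset(pair) for pair in distances}
    return [p for p in combinations(ids, 2) if frozenset(p) not in seen]


example = "A\tB\t8.876\tnote\nA\tC\t6.120\n"
parsed = parse_distance_tsv(example)
print(parsed)
print(missing_pairs(parsed))
```

An empty list from missing_pairs means every pair of identifiers seen in the file has a distance, which is what an all-versus-all comparison requires.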
Phylip format (symmetric distance matrix)
A 0.000 8.876 6.120 9.321
B 8.876 0.000 2.231 3.983
C 6.120 2.231 0.000 0.663
D 9.321 3.983 0.663 0.000
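A symmetric matrix like this one can be generated from a table of pairwise distances. The following is a minimal sketch with a hypothetical function name and parameters. Standard PHYLIP distance files begin with a line giving the number of sequences; whether the server expects that line is an assumption left to the with_count flag, since the example above shows only the matrix rows.

```python
def to_phylip_square(ids, distances, with_count=True):
    """Format {(id1, id2): distance} as a symmetric distance matrix."""

    def lookup(x, y):
        if x == y:
            return 0.0  # the diagonal is zero by definition
        if (x, y) in distances:
            return distances[(x, y)]
        return distances[(y, x)]  # pairs may be stored in either order

    # Assumption: an optional first line with the number of sequences,
    # as in standard PHYLIP files.
    rows = [str(len(ids))] if with_count else []
    for x in ids:
        values = " ".join(f"{lookup(x, y):.3f}" for y in ids)
        rows.append(f"{x} {values}")
    return "\n".join(rows) + "\n"


pairs = {
    ("A", "B"): 8.876, ("A", "C"): 6.120, ("A", "D"): 9.321,
    ("B", "C"): 2.231, ("B", "D"): 3.983, ("C", "D"): 0.663,
}
print(to_phylip_square(["A", "B", "C", "D"], pairs, with_count=False), end="")
```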
Phylip format (lower-triangle distance matrix)
A
B 8.876
C 6.120 2.231
D 9.321 3.983 0.663
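In a lower-triangle matrix each row lists the distances from that sequence to the sequences in the rows before it, so the first row holds only an identifier and row i holds i-1 values. The sketch below expands such rows back into explicit pairs. The function name is hypothetical, and the code assumes the text contains only matrix rows (no leading sequence-count line).

```python
def lower_triangle_to_pairs(text):
    """Expand lower-triangle rows into (ids, {(earlier_id, id): distance})."""
    ids, pairs = [], {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name, values = fields[0], fields[1:]
        # Each row must have one value per previously seen sequence.
        if len(values) != len(ids):
            raise ValueError(
                f"row {name!r}: expected {len(ids)} values, got {len(values)}"
            )
        for other, value in zip(ids, values):
            pairs[(other, name)] = float(value)
        ids.append(name)
    return ids, pairs


matrix = "A\nB 8.876\nC 6.120 2.231\nD 9.321 3.983 0.663\n"
ids, pairs = lower_triangle_to_pairs(matrix)
print(ids)
print(pairs[("C", "D")])
```

The row-length check catches the most common mistake, a row with a missing or extra value, before the file is uploaded.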
Branch lengths can be incorporated, but are not required.