About Service

Alignment-free (AF) sequence analysis tools have become widespread in biological research. Because these programs can run hundreds of times faster than comparable alignment-based approaches, they have been applied to problems such as NGS analysis, whole-genome phylogeny, and the identification of recombined and horizontally transferred genes, among many others. Given this wide range of possible applications, benchmarking alignment-free predictions remains a difficult challenge for method developers and users.

The AFproject service aims to simplify and standardize alignment-free benchmarking. For users, the benchmarks provide a way to identify the most effective methods for the problem at hand.

Service Features

Characterize the performance of all well-established AF programs under different evolutionary scenarios.
Create a catalogue of the most effective methods for the problem at hand.
Support developers during method implementation by letting them test their tools at different stages of progress and giving them the opportunity to disseminate the results publicly.
Provide a platform for defining new datasets as technology develops: users and developers can request changes or new datasets.

Service Content

The server benchmarks AF tools against 11 reference datasets, which can be classified into 5 application categories.

#  Category                  Reference dataset                                 Reference short name
1  Regulatory sequences      Cis-regulatory modules                            CRM
2  Protein homology          Low sequence identity (<40%)                      protein-low-ident
                             High sequence identity (>40%)                     protein-high-ident
3  Gene Trees                Reference gene trees                              swisstree
4  Whole Genome Phylogeny    29 E.coli/Shigella strains (unassembled genomes)  unassembled-ecoli
                             29 E.coli/Shigella strains (assembled genomes)    assembled-ecoli
                             14 plant species (unassembled genomes)            unassembled-plants
                             14 plant species (assembled genomes)              assembled-plants
5  Horizontal Gene Transfer  27 E.coli/Shigella genomes                        unsimulated-ecoli
                             7 Yersinia species                                unsimulated-shigella
                             33 simulated E.coli genomes                       sim_hgt

How does it work?

The AF method developer downloads the FASTA dataset for a given category (1-5) from the server.
The developer uses the downloaded dataset as input to their alignment-free program. The output file should contain all-versus-all pairwise sequence distances in either TSV or Phylip format (see Formats below).
The developer uploads the output file to the server.
The server benchmarks the uploaded predictions and presents a report with the submitted method's performance and comparison to other available methods. The developer can choose to make the report publicly available.
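
The second step above (turning a FASTA dataset into all-versus-all distances) can be sketched as follows. This is only an illustration under stated assumptions, not the method of any particular AF tool: it uses a toy distance (Euclidean distance between k-mer count vectors), and the function names read_fasta, kmer_counts, euclidean and all_vs_all are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def read_fasta(lines):
    """Parse FASTA lines into a dict {identifier: sequence}."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the first whitespace-delimited token is the sequence ID.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def kmer_counts(seq, k):
    """Count overlapping words of length k in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def euclidean(c1, c2):
    """Euclidean distance between two k-mer count vectors (toy AF distance)."""
    return sqrt(sum((c1[w] - c2[w]) ** 2 for w in set(c1) | set(c2)))


def all_vs_all(records, k=2):
    """Return [(id1, id2, distance)] for every unordered pair of sequences."""
    counts = {name: kmer_counts(seq, k) for name, seq in records.items()}
    return [(a, b, euclidean(counts[a], counts[b]))
            for a, b in combinations(sorted(counts), 2)]


if __name__ == "__main__":
    demo = [">A example", "ACGT", "ACGT", ">B", "AAAA"]
    for id1, id2, dist in all_vs_all(read_fasta(demo)):
        print(id1, id2, dist)
```

In practice the list passed to read_fasta would be an open file handle on the downloaded dataset, and the toy distance would be replaced by the program being benchmarked.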

Formats

Benchmarks of all 11 datasets accept pairwise sequence distances in TSV or Phylip format.

TSV (tab-separated values) format

A plain text file with three tab-separated columns. The first two columns hold the identifiers of the two sequences being compared; the third column holds the numerical distance for that pair. A TSV file may have more than three columns (the extra columns are ignored).

Example of TSV format (4 sequences)

A   B   8.876
A   C   6.120
A   D   4.321
B   C   5.231
B   D   3.983
C   D   0.663
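
A file in this layout could be produced and read back along the following lines. This is a minimal sketch: the helper names write_tsv and read_tsv are hypothetical, and the reader mirrors the description above by using only the first three columns.

```python
import io


def write_tsv(pairs, handle):
    """Write (id1, id2, distance) triples as tab-separated lines."""
    for id1, id2, dist in pairs:
        handle.write(f"{id1}\t{id2}\t{dist}\n")


def read_tsv(handle):
    """Read tab-separated triples, ignoring any columns after the third."""
    pairs = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        pairs.append((fields[0], fields[1], float(fields[2])))
    return pairs


if __name__ == "__main__":
    example = [("A", "B", 8.876), ("A", "C", 6.120), ("A", "D", 4.321),
               ("B", "C", 5.231), ("B", "D", 3.983), ("C", "D", 0.663)]
    buffer = io.StringIO()
    write_tsv(example, buffer)
    print(buffer.getvalue(), end="")
```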

Phylip format (symmetric distance matrix)

   4
A        0.000 8.876 6.120 9.321
B        8.876 0.000 2.231 3.983
C        6.120 2.231 0.000 0.663
D        9.321 3.983 0.663 0.000
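
A symmetric matrix like the one above could be generated from a list of pairwise distances roughly as follows. This is a sketch, not a reference implementation: the function name pairs_to_phylip is hypothetical, and the name-column width and three-decimal formatting are assumptions chosen to mimic the example (the server's exact whitespace requirements are not stated here).

```python
def pairs_to_phylip(pairs, name_width=9):
    """Format (id1, id2, distance) triples as a symmetric Phylip matrix.

    Assumes every unordered pair of identifiers is present exactly once;
    a missing pair raises KeyError.
    """
    names = sorted({name for a, b, _ in pairs for name in (a, b)})
    dist = {}
    for a, b, d in pairs:
        dist[(a, b)] = dist[(b, a)] = d  # store both directions

    lines = [f"{len(names):>4}"]  # header line: number of sequences
    for a in names:
        values = [0.0 if a == b else dist[(a, b)] for b in names]
        row = " ".join(f"{v:.3f}" for v in values)
        # Pad the name, always leaving at least one space before the values.
        lines.append(a.ljust(name_width - 1) + " " + row)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = [("A", "B", 8.876), ("A", "C", 6.120), ("A", "D", 9.321),
               ("B", "C", 2.231), ("B", "D", 3.983), ("C", "D", 0.663)]
    print(pairs_to_phylip(example), end="")
```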

Phylip format (lower-triangle distance matrix)

   4
A
B        8.876
C        6.120 2.231
D        9.321 3.983 0.663
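
Both Phylip layouts carry the same information, because in either one the first i values of row i are the distances to the i preceding sequences. The sketch below relies on that to read either layout into a list of pairs. The function name phylip_to_pairs is hypothetical, and the code assumes identifiers contain no whitespace.

```python
def phylip_to_pairs(text):
    """Convert a symmetric or lower-triangle Phylip matrix to pair triples."""
    lines = [line for line in text.splitlines() if line.strip()]
    n = int(lines[0])  # header line: number of sequences
    rows = [line.split() for line in lines[1:n + 1]]
    names = [row[0] for row in rows]

    pairs = []
    for i, row in enumerate(rows):
        values = [float(v) for v in row[1:]]
        # Only the values below the diagonal are needed in either layout.
        for j in range(i):
            pairs.append((names[j], names[i], values[j]))
    return pairs


if __name__ == "__main__":
    lower = """   4
A
B        8.876
C        6.120 2.231
D        9.321 3.983 0.663
"""
    for id1, id2, dist in phylip_to_pairs(lower):
        print(f"{id1}\t{id2}\t{dist}")
```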

Newick format

(B,(C,D),A);

Branch lengths can be incorporated, but are not required.

(B:2.13125,(C:0.90675,D:1.56975):0.64425,A:6.74475);
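
Before uploading a tree, it can be useful to confirm that its leaf names match the sequence identifiers in the dataset, with or without branch lengths. The following is a rough sketch under simplifying assumptions: the function name newick_leaves is hypothetical, and the regular expression handles only unquoted labels without comments, not the full Newick grammar.

```python
import re


def newick_leaves(newick):
    """Return leaf labels of a simple Newick string, in order of appearance."""
    tree = newick.strip()
    if not tree.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    if tree.count("(") != tree.count(")"):
        raise ValueError("unbalanced parentheses")
    # A leaf label directly follows "(" or ","; labels after ")" belong to
    # internal nodes, and anything after ":" is a branch length.
    return re.findall(r"[(,]\s*([^(),:;\s]+)", tree)


if __name__ == "__main__":
    print(newick_leaves("(B,(C,D),A);"))
    print(newick_leaves("(B:2.13125,(C:0.90675,D:1.56975):0.64425,A:6.74475);"))
```

Comparing set(newick_leaves(tree)) with the set of identifiers from the FASTA file would flag missing or misspelled sequence names.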