The dataset pertains to cis-regulatory sequence modules (CRMs) that are known to regulate expression in the same tissue and/or developmental stage in fly or human. A CRM can be loosely defined as a contiguous non-coding sequence that contains multiple transcription factor binding sites and drives some aspect of a gene's expression profile.
The dataset was collected by Kantorovitz et al. (2007) to test how well alignment-free measures identify functional relationships between regulatory sequences (e.g. enhancers or promoters).
The dataset has 1 directory containing 370 FASTA files:
crm/
├── FB.001.1.fasta
├── FB.002.1.fasta
├── FB.003.1.fasta
├── FB.004.1.fasta
├── FB.005.1.fasta
├── FB.006.1.fasta
├── FB.007.1.fasta
├── FB.008.1.fasta
├── FB.009.1.fasta
├── FB.010.1.fasta
├── ...

Dataset Structure
The dataset FASTA file is a mixture of 7 subsets of CRM sequences, each taken from a different tissue of D. melanogaster or Homo sapiens. Each of the 7 subsets has 2n sequences, where the first n sequences are CRMs ("positive half") and the next n sequences are random non-coding sequences of matching lengths, chosen from the respective genome ("negative half").
# | Subset | n CRM seqs (positive) | n random seqs (negative) |
---|---|---|---|
1 | fly_blastoderm | 82 | 82 |
2 | fly_eye | 17 | 17 |
3 | fly_pns | 23 | 23 |
4 | fly_tracheal_system | 9 | 9 |
5 | human_HBB | 17 | 17 |
6 | human_liver | 9 | 9 |
7 | human_muscle | 28 | 28 |
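As a quick consistency check, the subset sizes in the table can be written down and summed. The sketch below is only illustrative: the names `SUBSET_SIZES` and `split_halves` are hypothetical, and it assumes the sequence identifiers of a subset are available as a list in dataset order (positive half first).

```python
# Subset sizes copied from the table above: each subset has n CRMs
# followed by n random non-coding sequences.
SUBSET_SIZES = {
    "fly_blastoderm": 82,
    "fly_eye": 17,
    "fly_pns": 23,
    "fly_tracheal_system": 9,
    "human_HBB": 17,
    "human_liver": 9,
    "human_muscle": 28,
}


def split_halves(ids):
    """Split an ordered list of 2n IDs into (positive, negative) halves.

    Hypothetical helper: assumes the first n IDs are CRMs and the
    next n IDs are the length-matched random sequences.
    """
    if len(ids) % 2 != 0:
        raise ValueError("expected an even number of sequences (2n)")
    n = len(ids) // 2
    return ids[:n], ids[n:]


# Total number of sequences across all 7 subsets.
total_sequences = 2 * sum(SUBSET_SIZES.values())
```

The total agrees with the 370 FASTA files listed for the `crm/` directory.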
The test evaluates whether functionally related CRM sequence pairs (from the positive half) are scored better by a given alignment-free tool (i.e., have lower distance values) than unrelated pairs of sequences (from the negative half).
Specifically, the benchmark procedure takes as input a user-supplied file containing the distances between all sequence pairs present in the dataset file. The procedure starts by extracting, from the user's file, the sequence pairs that fall within the "positive" half and within the "negative" half of each of the 7 subsets. For each subset, the pairs are ranked by distance and the top half (or the top 300, whichever is smaller) is examined. The number of "positive" pairs in this top half is reported. The overall assessment of method accuracy is the weighted average of positive pairs across all 7 subsets.
The benchmark accepts a file in one of the following formats:
Simple text file with three tab-separated columns: the first two columns store the identifiers of the two sequences being compared, and the third column holds the numerical distance value of this comparison.
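The ranking step described above can be sketched as follows. This is not the benchmark's own code. It assumes distances are held in a hypothetical dictionary keyed by `frozenset` pairs of identifiers, that ties are resolved against positive pairs, and that each subset is weighted by the number of pairs examined; the text does not specify the tie rule or the weights.

```python
from itertools import combinations


def score_subset(positive_ids, negative_ids, dist, cap=300):
    """Return (positives_in_top, examined) for one subset.

    dist: hypothetical dict mapping frozenset({id_a, id_b}) -> distance.
    cap:  upper limit on examined pairs (300 in the description above).
    """
    pairs = [(dist[frozenset(p)], True) for p in combinations(positive_ids, 2)]
    pairs += [(dist[frozenset(p)], False) for p in combinations(negative_ids, 2)]
    # Tuples sort by distance first, then by the flag; False < True,
    # so on ties negative pairs rank ahead (a conservative assumption).
    pairs.sort()
    examined = min(len(pairs) // 2, cap)
    positives = sum(1 for _, is_positive in pairs[:examined] if is_positive)
    return positives, examined


def overall_score(subset_results):
    """Weighted average of the positive fraction over all subsets.

    Assumption: weights are the numbers of pairs examined per subset,
    which reduces to total positives divided by total pairs examined.
    """
    total = sum(examined for _, examined in subset_results)
    if total == 0:
        raise ValueError("no pairs were examined")
    return sum(positives for positives, _ in subset_results) / total
```

A method that separates CRMs perfectly would place only positive pairs in the examined top half and score 1.0 under this weighting.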
Example of Text File Format (4 sequences)
A	B	8.876
A	C	6.120
A	D	4.321
B	C	5.231
B	D	3.983
C	D	0.663
Square distance matrix in Phylip format
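A file in the three-column format could be read along these lines. The function name `read_pairwise_tsv` is hypothetical, and the sketch assumes exactly one pair per line, tab-separated fields, and that the order of the two identifiers within a pair does not matter.

```python
def read_pairwise_tsv(lines):
    """Parse 'id1<TAB>id2<TAB>distance' lines into a dict.

    Keys are frozensets so that (A, B) and (B, A) refer to the same pair.
    Blank lines are skipped; malformed lines raise ValueError.
    """
    dist = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 tab-separated columns")
        id_a, id_b, value = fields
        dist[frozenset((id_a, id_b))] = float(value)
    return dist


# Usage with the first rows of the example above.
example = ["A\tB\t8.876", "A\tC\t6.120", "A\tD\t4.321"]
distances = read_pairwise_tsv(example)
```

Any iterable of strings works as input, including an open file handle.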
Example of Phylip distance matrix (for 4 sequences)
4
A 0.000 8.876 6.120 9.321
B 8.876 0.000 2.231 3.983
C 6.120 2.231 0.000 0.663
D 9.321 3.983 0.663 0.000
Lower-triangle distance matrix in Phylip format
Example of lower-triangle Phylip distance matrix (for 4 sequences)
4
A
B 8.876
C 6.120 2.231
D 9.321 3.983 0.663
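A minimal reader for the square layout might look like the sketch below. It assumes the relaxed convention in which identifiers contain no whitespace and fields are separated by spaces or tabs (strict Phylip uses fixed-width names); `read_phylip_square` is a hypothetical name.

```python
def read_phylip_square(text):
    """Parse a square Phylip distance matrix into (ids, rows).

    The first non-empty line holds the number of sequences n; each of the
    next n lines holds an identifier followed by n distance values.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])
    ids, rows = [], []
    for ln in lines[1:n + 1]:
        fields = ln.split()
        ids.append(fields[0])
        rows.append([float(x) for x in fields[1:]])
    if len(ids) != n or any(len(row) != n for row in rows):
        raise ValueError("matrix is not n x n")
    return ids, rows


# Usage with the 4-sequence example above.
square_text = """4
A 0.000 8.876 6.120 9.321
B 8.876 0.000 2.231 3.983
C 6.120 2.231 0.000 0.663
D 9.321 3.983 0.663 0.000
"""
square_ids, square_rows = read_phylip_square(square_text)
```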
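The lower-triangle layout carries the same information as a square matrix: row i lists only the distances to the i sequences before it. A sketch of expanding it into a full symmetric matrix follows, under the same assumption of whitespace-separated fields and identifiers without spaces; `read_phylip_lower` is a hypothetical name.

```python
def read_phylip_lower(text):
    """Parse a lower-triangle Phylip matrix into (ids, full_matrix).

    Row i (0-based) must contain an identifier followed by exactly i values.
    The diagonal is filled with 0.0 and the matrix is mirrored.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])
    if len(lines) - 1 < n:
        raise ValueError("fewer rows than the declared number of sequences")
    ids = []
    matrix = [[0.0] * n for _ in range(n)]
    for i, ln in enumerate(lines[1:n + 1]):
        fields = ln.split()
        values = [float(x) for x in fields[1:]]
        if len(values) != i:
            raise ValueError(f"row {i + 1}: expected {i} values")
        ids.append(fields[0])
        for j, value in enumerate(values):
            matrix[i][j] = value
            matrix[j][i] = value  # mirror into the upper triangle
    return ids, matrix


# Usage with the 4-sequence example above.
lower_text = """4
A
B 8.876
C 6.120 2.231
D 9.321 3.983 0.663
"""
lower_ids, lower_matrix = read_phylip_lower(lower_text)
```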