IKKE: RNA Motif Discovery Tool

IKKE is a C program that identifies enriched sequence motifs in protein-bound RNA. Its pipeline counts k-mers, computes their frequencies, and runs an enrichment analysis, then ranks the k-mers by how likely each is to be part of a motif.
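
As a rough illustration of the first pipeline stage, k-mer counting can be sketched in C as shown here. This is a minimal sketch and not IKKE's actual source: the names base_code and count_kmers, the 2-bit base encoding, and the choice to skip windows that contain characters other than A, C, G, U/T are all assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Map a nucleotide to a 2-bit code; returns -1 for anything else (e.g. N). */
static int base_code(char b)
{
    switch (b) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'U': case 'u':
    case 'T': case 't': return 3;
    default:            return -1;
    }
}

/*
 * Count every k-mer of seq into counts[], which must hold 4^k entries
 * that the caller has zero-initialised. Windows containing an unknown
 * base are skipped (an assumption of this sketch).
 * Returns the number of k-mers counted.
 */
static size_t count_kmers(const char *seq, int k, unsigned long *counts)
{
    size_t n = strlen(seq), counted = 0;
    size_t mask = ((size_t)1 << (2 * k)) - 1;  /* keeps the last k bases */
    size_t index = 0;                          /* rolling 2-bit encoding */
    int valid = 0;                             /* valid bases in the window */

    for (size_t i = 0; i < n; i++) {
        int code = base_code(seq[i]);
        if (code < 0) {            /* unknown base: restart the window */
            valid = 0;
            index = 0;
            continue;
        }
        index = ((index << 2) | (size_t)code) & mask;
        if (valid < k)
            valid++;
        if (valid == k) {
            counts[index]++;
            counted++;
        }
    }
    return counted;
}
```

For k = 3 the caller would pass a zeroed array of 4^3 = 64 counters. Dividing each counter by the returned total gives the k-mer frequency that the later stages work with.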

Usage

ikke -t [test] -c [control] [OPTIONS]

At minimum, you must provide a test set. A control set is also required unless -p or -s is used:

-t, --test : RNA sequences bound to protein (test set)
-c, --control : Control RNA sequences (background set)

Command Line Options

General

-h, --help Print help and exit.
--detailed-help Print detailed help, including hidden options, and exit.
-V, --version Print program version and exit.

Input/Output

-t, --test=filename Input file containing protein-bound RNA sequences. Only one test file is supported.
-c, --control=filename Input file containing control RNA sequences. Only one control file is supported.
-o, --output=filename Set the output file prefix. Default: motif.
-k, --kmer=INT Length of k-mers to analyze. Default: 3.
-i, --iterations=INT Number of iterations for the analysis. Default: 1.
--threads=INT Number of threads for parallel computation. Default: 1. ⚠️ Warning: For long sequences (>16,000 nt), multi-threading may cause incorrect counts or failures. Use with caution.
-d, --delimiter=char Output delimiter. Default: , (CSV). Supported delimiters:
- , → .csv
- \t (tab) → .tsv
- : or | or space → .dsv
--no-log Disable log2 normalization of enrichment values. By default, enrichments are log2-normalized for interpretability.

Algorithms

-R, --enrichments Compute enrichment values (R values) from k-mer frequencies.
-s, --shuffle Shuffle sequences while preserving k-let counts.
--klet=INT Set k-let size for sequence shuffling (ushuffle). Default: -1 (automatic).
-p, --independent-probs Compute enrichments without using a control. Instead, expected frequencies are derived from mono-/di-nucleotide distributions.
-b, --bootstrap[=INT] Perform bootstrapping by subsampling sequences. Default: 10 iterations. Computes mean enrichments and standard deviation from resampled subsets.
--sample=INT Percent of sequences randomly subsampled per bootstrap iteration. Range: 1–100. Default: 10.
--seed=INT Random seed for bootstrap sampling. Default: -1 (random). Use a fixed seed for reproducible results.

Examples

Basic run with default settings
```
ikke -t bound.fa -c control.fa
```
Produces motif.csv with log2-normalized 3-mer enrichments computed by iterative k-mer knockout.

IKKE Iterations
```
ikke -t bound.fa -c control.fa -i 10
```
Produces motif.csv with 3-mer enrichments from 10 iterations of iterative k-mer knockout. The output contains the 10 most enriched k-mers. Set the number of iterations with the -i flag.

Custom k-mer size and output prefix
```
ikke -t bound.fa -c control.fa -k 7 -o experiment1
```
Produces experiment1.csv with 7-mer enrichments. Specify k-mer size with the -k flag.

Regular Enrichments
```
ikke -t bound.fa -c control.fa -R -k 5
```
Produces motif.csv with 5-mer enrichments. The -R flag computes regular enrichments: all 4^k values (1,024 for 5-mers), ordered by R value.
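
The ranking step described above might look roughly like this in C. The sketch assumes that R is the ratio of a k-mer's frequency in the test set to its frequency in the control set, and it adds a pseudocount of 1 to every count to avoid division by zero. Neither detail is confirmed by this README, and the names kmer_score, by_r_desc, and rank_enrichments are hypothetical.

```c
#include <stdlib.h>

/* One k-mer (identified by its index in the 4^k count table) and its R value. */
struct kmer_score {
    size_t index;
    double r;
};

/* qsort comparator: larger R values sort first. */
static int by_r_desc(const void *a, const void *b)
{
    double ra = ((const struct kmer_score *)a)->r;
    double rb = ((const struct kmer_score *)b)->r;
    return (ra < rb) - (ra > rb);
}

/*
 * Fill out[0..n_kmers-1] with R = f_test / f_control and sort descending.
 * ASSUMPTION: a pseudocount of 1 is added to every count so that k-mers
 * absent from the control set do not cause a division by zero.
 */
static void rank_enrichments(const unsigned long *test,
                             const unsigned long *ctrl,
                             size_t n_kmers, struct kmer_score *out)
{
    double test_total = 0.0, ctrl_total = 0.0;

    for (size_t i = 0; i < n_kmers; i++) {
        test_total += (double)test[i] + 1.0;
        ctrl_total += (double)ctrl[i] + 1.0;
    }
    for (size_t i = 0; i < n_kmers; i++) {
        double f_test = ((double)test[i] + 1.0) / test_total;
        double f_ctrl = ((double)ctrl[i] + 1.0) / ctrl_total;
        out[i].index = i;
        out[i].r = f_test / f_ctrl;
    }
    qsort(out, n_kmers, sizeof *out, by_r_desc);
}
```

Because log2 is monotonic, applying the default log2 normalization to each R value would not change this order.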

Bootstrapped enrichment analysis
```
ikke -t bound.fa -c control.fa --bootstrap=50 --sample=20 --seed=123
```
Runs 50 bootstrap iterations, subsampling 20% of sequences each time. Specify seed for reproducible results.
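
One way to picture a single bootstrap iteration is to draw a fixed percentage of sequence indices from a seeded generator. The sketch below assumes sampling without replacement and uses the C library's rand(). IKKE's own sampler and random number generator may differ, and subsample_indices is a hypothetical name.

```c
#include <stdlib.h>

/*
 * Choose (n * percent / 100) distinct indices in [0, n), at least one
 * when n > 0, using a partial Fisher-Yates shuffle.
 * picked[] must have room for n entries; the first m hold the sample.
 * Returns m, the number of indices chosen.
 * NOTE: rand() % range has a small modulo bias and a limited range
 * (RAND_MAX may be as low as 32767); good enough for a sketch only.
 */
static size_t subsample_indices(size_t n, int percent, unsigned seed,
                                size_t *picked)
{
    size_t m = (n * (size_t)percent) / 100;

    if (m == 0 && n > 0)
        m = 1;
    if (m > n)
        m = n;

    srand(seed);                      /* same seed, same sample */
    for (size_t i = 0; i < n; i++)
        picked[i] = i;

    for (size_t i = 0; i < m; i++) {
        size_t j = i + (size_t)rand() % (n - i);
        size_t tmp = picked[i];
        picked[i] = picked[j];
        picked[j] = tmp;
    }
    return m;
}
```

Repeating this for each iteration and recomputing the enrichments on the chosen sequences would yield the per-k-mer values from which a mean and standard deviation can be taken.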

Tab-delimited output
```
ikke -t bound.fa -c control.fa -d "\t"
```
Produces motif.tsv instead of CSV. Columns will be tab delimited.
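
The delimiter-to-extension mapping listed under Input/Output can be expressed as a small lookup. The sketch below is illustrative only: parse_delimiter and extension_for_delimiter are hypothetical names, and the handling of the two-character argument \t (which is what the shell passes for "\t") is an assumption about how such an option could be interpreted, not a description of IKKE's parser.

```c
#include <stddef.h>

/*
 * Turn a -d argument into a single delimiter character.
 * ASSUMPTION: the literal two-character string backslash + 't' means tab.
 * Returns -1 if the argument is not a single character.
 */
static int parse_delimiter(const char *arg)
{
    if (arg[0] == '\\' && arg[1] == 't' && arg[2] == '\0')
        return '\t';
    if (arg[0] != '\0' && arg[1] == '\0')
        return (unsigned char)arg[0];
    return -1;
}

/* Map a delimiter to the output extension documented in this README. */
static const char *extension_for_delimiter(int delim)
{
    switch (delim) {
    case ',':
        return ".csv";
    case '\t':
        return ".tsv";
    case ':':
    case '|':
    case ' ':
        return ".dsv";
    default:
        return NULL;   /* not in the documented list */
    }
}
```

With the default prefix, a tab delimiter would therefore give motif.tsv and a colon would give motif.dsv.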

Multi-threading (use with caution for long sequences)
```
ikke -t bound.fa -c control.fa --threads=4
```

Control Independent Enrichments
```
ikke -t bound.fa -p
```
Produces a motif.csv file with 3-mer enrichments. No control sequences are required.
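
To make the idea of a control-free expectation concrete, the sketch below derives the expected probability of a k-mer from mononucleotide frequencies alone, as the product of its base frequencies. It is a simplified illustration: the dinucleotide case mentioned in the option description is not covered, and the names base_code, base_frequencies, and expected_kmer_prob are hypothetical.

```c
#include <stddef.h>

/* Map a nucleotide to 0..3 (A, C, G, U/T); returns -1 for anything else. */
static int base_code(char b)
{
    switch (b) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'U': case 'u':
    case 'T': case 't': return 3;
    default:            return -1;
    }
}

/* Fill freq[4] with the mononucleotide frequencies of seq.
 * Unknown characters are ignored. Returns the number of bases counted. */
static size_t base_frequencies(const char *seq, double freq[4])
{
    size_t counts[4] = {0, 0, 0, 0}, total = 0;

    for (const char *p = seq; *p; p++) {
        int code = base_code(*p);
        if (code >= 0) {
            counts[code]++;
            total++;
        }
    }
    for (int i = 0; i < 4; i++)
        freq[i] = total ? (double)counts[i] / (double)total : 0.0;
    return total;
}

/* Expected probability of kmer if bases were independent:
 * the product of the frequencies of its bases.
 * Returns 0.0 if kmer contains an unknown character. */
static double expected_kmer_prob(const char *kmer, const double freq[4])
{
    double p = 1.0;

    for (const char *c = kmer; *c; c++) {
        int code = base_code(*c);
        if (code < 0)
            return 0.0;
        p *= freq[code];
    }
    return p;
}
```

An enrichment could then be formed by dividing a k-mer's observed frequency in the test set by this expected probability, though whether IKKE computes it exactly this way is an assumption.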

Sequence Shuffling
```
ikke -t bound.fa -s
```
Produces a motif.csv file with 3-mer enrichments. The control dataset is generated by shuffling the test sequences with k-let=1.
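
With k-let=1, a shuffle only has to preserve single-nucleotide counts, so it reduces to a random permutation of each sequence. The sketch below shows that special case with a Fisher-Yates shuffle and the C library's rand(). It is not the ushuffle algorithm that IKKE refers to for larger k-let sizes, and shuffle_sequence is a hypothetical name.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Permute seq in place with a Fisher-Yates shuffle.
 * Every permutation keeps the same number of each base, which is all
 * that k-let=1 requires. Larger k-let sizes need a different algorithm.
 * NOTE: rand() % i has a small modulo bias; acceptable for a sketch.
 */
static void shuffle_sequence(char *seq, unsigned seed)
{
    size_t n = strlen(seq);

    srand(seed);                       /* fixed seed, repeatable shuffle */
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i; /* 0 <= j < i */
        char tmp = seq[i - 1];
        seq[i - 1] = seq[j];
        seq[j] = tmp;
    }
}
```

Shuffling every test sequence this way would give a background set with the same base composition but no positional signal.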

Sequence Shuffling with custom k-mer and k-let sizes
```
ikke -t bound.fa -k 6 -s --klet=3
```
Produces a motif.csv file with 6-mer enrichments. The shuffled control sequences preserve trinucleotide counts.

Output Format

The output file contains k-mers ranked by enrichment. Columns typically include:

k-mer : The k-mer sequence
Enrichment : Raw or log2-normalized enrichment value
Std. Dev. (if bootstrapped) : Variation across bootstrap iterations

Notes

Input sequences should be provided as FASTA, FASTQ, or raw reads.
Bootstrapping can be combined with any enrichment algorithm (ikke, R, probs, and shuffle).
For reproducibility of randomized methods (shuffle/bootstrap), specify --seed.

Citation

If you use IKKE in your research, please cite the associated publication.