KSTRUCT: RNA Base-Pair Probability Analysis Tool

KSTRUCT is a C program that analyzes RNA sequences to identify base-pair interactions enriched in protein-bound RNA. Its pipeline folds the RNA, counts k-mers, and runs an enrichment analysis, then writes a CSV file of k-mers ranked by their likelihood of being bound or unbound.

Usage

kstruct [OPTIONS] -t <bound.fa> -c <control.fa>

At minimum, you must provide:

-t, --test : RNA sequences bound to protein (test set)
-c, --control : Control RNA sequences (background set)

Command Line Options

General

-h, --help Print help and exit.
--detailed-help Print detailed help, including hidden options, and exit.
-V, --version Print program version and exit.

Input/Output

-t, --test=filename Input file containing protein-bound RNA sequences. Only one test file is supported.
-c, --control=filename Input file containing control RNA sequences. Only one control file is supported.
-o, --output=filename Set the output file prefix. Default: rna (produces rna.csv).
-k, --kmer=INT Length of k-mers to analyze. Default: 5.
--threads=INT Number of threads for parallel computation. Default: 1. ⚠️ Warning: For long sequences (>16,000 nt), multi-threading may cause incorrect counts or failures. Use with caution.
-d, --delimiter=char Output delimiter. Default: , (CSV). Supported delimiters:
- , → .csv
- \t (tab) → .tsv
- : or | or space → .dsv
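
The delimiter-to-extension mapping above can be sketched in C as follows. This is a minimal illustration, not KSTRUCT's actual source: the function names `parse_delimiter`, `extension_for_delimiter`, and `output_filename` are hypothetical, and the sketch assumes that the two-character argument `\t` (backslash followed by `t`) is interpreted as a tab.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical: turn the -d argument into a single delimiter character.
 * Assumption: the two-character string "\t" (backslash, t) means a tab. */
char parse_delimiter(const char *arg)
{
    assert(arg != NULL);
    if (strcmp(arg, "\\t") == 0)
        return '\t';
    /* Any other argument must be exactly one character long. */
    return (strlen(arg) == 1) ? arg[0] : '\0';
}

/* Map a delimiter to the extension listed above; NULL if unsupported. */
const char *extension_for_delimiter(char delim)
{
    switch (delim) {
    case ',':  return ".csv";
    case '\t': return ".tsv";
    case ':':
    case '|':
    case ' ':  return ".dsv";
    default:   return NULL;
    }
}

/* Write "<prefix><extension>" into buf. Returns 0 on success, -1 if the
 * delimiter is unsupported or buf is too small. */
int output_filename(char *buf, size_t size, const char *prefix, char delim)
{
    const char *ext = extension_for_delimiter(delim);
    if (ext == NULL)
        return -1;
    int n = snprintf(buf, size, "%s%s", prefix, ext);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

With the default prefix and delimiter, `output_filename(buf, sizeof buf, "rna", ',')` fills `buf` with `rna.csv`. An unsupported delimiter returns -1 rather than falling back to a default extension.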

Algorithms

-w, --seq-windows[=INT] Split sequences into sliding windows of the specified size (default: 20) and compute mean base-pair probabilities per nucleotide. Example: for sequence AGCUUCGA with window size 5:
```
AGCUU
 GCUUC
  CUUCG
   UUCGA
```
Each window’s base-pair probabilities are calculated, then averaged across overlapping nucleotides to yield position-specific probabilities.
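
The averaging step described above can be sketched in C. This is an illustration under stated assumptions, not KSTRUCT's implementation: `window_fold_fn`, `windowed_mean_probs`, and `toy_fold` are hypothetical names, and `toy_fold` is a placeholder that stands in for a real folding routine so the overlap averaging can be seen in isolation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callback: fill probs[0..len-1] with the pairing probability
 * of each nucleotide in the window seq[0..len-1]. A real pipeline would
 * wrap a folding routine here. */
typedef void (*window_fold_fn)(const char *seq, size_t len, double *probs);

/* Placeholder "fold": marks only the two window ends as paired.
 * It exists purely to make the averaging visible; it is not biology. */
void toy_fold(const char *seq, size_t len, double *probs)
{
    (void)seq;
    for (size_t j = 0; j < len; j++)
        probs[j] = (j == 0 || j == len - 1) ? 1.0 : 0.0;
}

/* Average per-nucleotide probabilities over all sliding windows.
 * out must hold strlen(seq) doubles. Returns 0 on success, -1 on bad
 * input or allocation failure. */
int windowed_mean_probs(const char *seq, size_t window,
                        window_fold_fn fold, double *out)
{
    assert(seq != NULL && fold != NULL && out != NULL);
    size_t n = strlen(seq);
    if (window == 0 || n == 0)
        return -1;
    if (window > n)
        window = n;  /* assumption: a short sequence is folded as one window */

    double *probs = malloc(window * sizeof *probs);
    size_t *cover = calloc(n, sizeof *cover);
    if (probs == NULL || cover == NULL) {
        free(probs);
        free(cover);
        return -1;
    }
    for (size_t i = 0; i < n; i++)
        out[i] = 0.0;

    /* Slide the window one nucleotide at a time and sum per position. */
    for (size_t start = 0; start + window <= n; start++) {
        fold(seq + start, window, probs);
        for (size_t j = 0; j < window; j++) {
            out[start + j] += probs[j];
            cover[start + j]++;
        }
    }
    /* Divide by the number of windows that covered each position. */
    for (size_t i = 0; i < n; i++)
        out[i] /= (double)cover[i];

    free(probs);
    free(cover);
    return 0;
}
```

For AGCUUCGA with window size 5 there are four windows. The first nucleotide is covered by one window and the fourth by all four, so each position is divided by its own coverage count rather than by the total number of windows.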

Examples

Basic run with default settings
```
kstruct -t bound.fa -c control.fa
```
Produces rna.csv with 5-mer base-pair enrichment analysis.

Custom k-mer size and output name
```
kstruct -t bound.fa -c control.fa -k 7 -o structure_test
```
Produces structure_test.csv with 7-mer probabilities.

Sliding window analysis (30 nt window)
```
kstruct -t bound.fa -c control.fa --seq-windows=30
```
Computes mean base-pair probability across 30-nt windows.

Tab-delimited output
```
kstruct -t bound.fa -c control.fa -d "\t"
```
Produces rna.tsv instead of CSV.

Multi-threading (use cautiously with long sequences)
```
kstruct -t bound.fa -c control.fa --threads=4
```

Output Format

The output file contains ranked k-mers with their associated base-pair probabilities. Columns typically include:

k-mer : The k-mer sequence
Test probability : Average base-pair probability in the test set
Control probability : Average base-pair probability in the control set
Enrichment : Relative difference in base-pairing likelihood

If sliding windows are used, additional per-position averaged probabilities are included.
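
As a rough illustration of how the per-k-mer columns could be derived from per-nucleotide probabilities, consider the C sketch below. Everything in it is an assumption made for illustration: the function names are hypothetical, a k-mer's probability is taken to be the mean over its k positions, and "relative difference" is read as (test - control) / control. KSTRUCT's actual definitions may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map a nucleotide to 0..3, or -1 for anything else (e.g. N). */
int base_code(char c)
{
    switch (c) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'U': case 'u': case 'T': case 't': return 3;
    default: return -1;
    }
}

/* Encode seq[0..k-1] as a base-4 index, or -1 if it has an unknown base. */
long kmer_index(const char *seq, int k)
{
    long idx = 0;
    for (int i = 0; i < k; i++) {
        int code = base_code(seq[i]);
        if (code < 0)
            return -1;
        idx = idx * 4 + code;
    }
    return idx;
}

/* For every k-mer occurrence in seq, add the mean of its k per-nucleotide
 * probabilities to sum[index] and increment count[index].
 * Both tables must hold 4^k entries; probs must hold strlen(seq) entries. */
void accumulate_kmer_probs(const char *seq, const double *probs, int k,
                           double *sum, unsigned long *count)
{
    assert(seq != NULL && probs != NULL && k > 0);
    size_t n = strlen(seq);
    for (size_t start = 0; start + (size_t)k <= n; start++) {
        long idx = kmer_index(seq + start, k);
        if (idx < 0)
            continue;  /* skip k-mers containing ambiguous bases */
        double total = 0.0;
        for (int j = 0; j < k; j++)
            total += probs[start + j];
        sum[idx] += total / k;
        count[idx]++;
    }
}

/* Assumed enrichment: (test - control) / control; 0 if control is 0. */
double relative_difference(double test_mean, double control_mean)
{
    return control_mean > 0.0
        ? (test_mean - control_mean) / control_mean
        : 0.0;
}
```

Running `accumulate_kmer_probs` once over the test set and once over the control set, each with its own `sum` and `count` tables, gives the two per-k-mer means as `sum[i] / count[i]`. Those means can then be passed to `relative_difference` and the k-mers sorted by the result.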

Notes

Input sequences should be in FASTA format.
Window size (--seq-windows) determines resolution: smaller windows capture fine structure, larger windows smooth over broader regions.
Parallelization may speed up folding but increases memory usage.

Citation

If you use KSTRUCT in your research, please cite the associated publication.