KSTRUCT: RNA Base-Pair Probability Analysis Tool

KSTRUCT is a C program that analyzes RNA sequences to identify base-pair interactions enriched in protein-bound RNA. Its pipeline folds the RNA, counts k-mers, and runs an enrichment analysis, then writes a CSV file of k-mers ranked by their likelihood of being bound or unbound.

Usage

kstruct [OPTIONS] -t <bound.fa> -c <control.fa>

At minimum, you must provide:

  • -t, --test : RNA sequences bound to protein (test set)
  • -c, --control : Control RNA sequences (background set)

Command Line Options

General

  • -h, --help Print help and exit.

  • --detailed-help Print detailed help, including hidden options, and exit.

  • -V, --version Print program version and exit.

Input/Output

  • -t, --test=filename Input file containing protein-bound RNA sequences. Only one test file is supported.

  • -c, --control=filename Input file containing control RNA sequences. Only one control file is supported.

  • -o, --output=filename Set the output file prefix. Default: rna (produces rna.csv).

  • -k, --kmer=INT Length of k-mers to analyze. Default: 5.

  • --threads=INT Number of threads for parallel computation. Default: 1. ⚠️ Warning: For long sequences (>16,000 nt), multi-threading may cause incorrect counts or failures. Use with caution.

  • -d, --delimiter=char Output delimiter. Default: , (CSV). Supported delimiters:

    • , (comma) → .csv
    • \t (tab) → .tsv
    • : or | or space → .dsv
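
The delimiter-to-extension mapping above, combined with the -o prefix, can be sketched in C as follows. This is an illustrative sketch only, not KSTRUCT's actual source: the function names extension_for_delimiter and output_filename are hypothetical, and rejecting unsupported delimiters is an assumption.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: map a delimiter to the extension listed above.
 * Returns NULL for a delimiter that is not in the supported list. */
const char *extension_for_delimiter(char delim)
{
    switch (delim) {
    case ',':  return ".csv";
    case '\t': return ".tsv";
    case ':':
    case '|':
    case ' ':  return ".dsv";
    default:   return NULL;
    }
}

/* Hypothetical helper: write "<prefix><extension>" into buf.
 * Returns 0 on success, -1 if the delimiter is unsupported or buf is too small. */
int output_filename(char *buf, size_t size, const char *prefix, char delim)
{
    assert(buf != NULL && prefix != NULL);

    const char *ext = extension_for_delimiter(delim);
    if (ext == NULL)
        return -1;

    int n = snprintf(buf, size, "%s%s", prefix, ext);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

With the default prefix rna, a comma delimiter gives rna.csv and a tab gives rna.tsv, matching the defaults described above.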

Algorithms

  • -w, --seq-windows[=INT] Split sequences into sliding windows of the specified size (default: 20) and compute mean base-pair probabilities per nucleotide. Example: for sequence AGCUUCGA with window size 5:

    AGCUU
     GCUUC
      CUUCG
       UUCGA

    Each window’s base-pair probabilities are calculated, then averaged across overlapping nucleotides to yield position-specific probabilities.
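
    The averaging step can be sketched in C as follows. This is a minimal illustration, not KSTRUCT's implementation: it assumes the per-window pairing probabilities have already been computed by the folding step and are stored row by row, and the function name average_window_probs is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * win_probs holds one row per window in row-major order:
 * win_probs[i * window + j] is the pairing probability of
 * nucleotide (i + j) as computed inside window i.
 * mean_out must have room for seq_len values.
 * Returns the number of windows, or 0 if window is 0 or longer
 * than the sequence.
 */
size_t average_window_probs(const double *win_probs, size_t seq_len,
                            size_t window, double *mean_out)
{
    assert(win_probs != NULL && mean_out != NULL);
    if (window == 0 || window > seq_len)
        return 0;

    size_t n_windows = seq_len - window + 1;

    for (size_t pos = 0; pos < seq_len; pos++) {
        /* First and last window index that cover this position. */
        size_t first = (pos + 1 >= window) ? pos + 1 - window : 0;
        size_t last = (pos < n_windows) ? pos : n_windows - 1;

        double sum = 0.0;
        for (size_t i = first; i <= last; i++)
            sum += win_probs[i * window + (pos - i)];

        mean_out[pos] = sum / (double)(last - first + 1);
    }
    return n_windows;
}
```

    For AGCUUCGA with window size 5 there are 8 - 5 + 1 = 4 windows, and the eight positions are covered by 1, 2, 3, 4, 4, 3, 2 and 1 windows respectively, so nucleotides near the sequence ends are averaged over fewer windows than those in the middle.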

Examples

  1. Basic run with default settings

    kstruct -t bound.fa -c control.fa

    Produces rna.csv with 5-mer base-pair enrichment analysis.

  2. Custom k-mer size and output name

    kstruct -t bound.fa -c control.fa -k 7 -o structure_test

    Produces structure_test.csv with 7-mer probabilities.

  3. Sliding window analysis (30 nt window)

    kstruct -t bound.fa -c control.fa -w 30

    Computes mean base-pair probability across 30-nt windows.

  4. Tab-delimited output

    kstruct -t bound.fa -c control.fa -d "\t"

    Produces rna.tsv instead of CSV.

  5. Multi-threading (use cautiously with long sequences)

    kstruct -t bound.fa -c control.fa --threads=4

Output Format

The output file contains ranked k-mers with their associated base-pair probabilities. Columns typically include:

  • k-mer : The k-mer sequence
  • Test probability : Average base-pair probability in the test set
  • Control probability : Average base-pair probability in the control set
  • Enrichment : Relative difference in base-pairing likelihood

If sliding windows are used, additional per-position averaged probabilities are included.
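
One plausible way to arrive at a per-k-mer average probability is sketched below in C. It is an assumption-laden illustration, not KSTRUCT's code: it assumes each k-mer occurrence is scored by the mean pairing probability of its k nucleotides, that occurrences are then averaged per k-mer, and that k-mers containing ambiguous bases are skipped. The names kmer_index and accumulate_kmer_probs are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Encode k nucleotides starting at s as a base-4 index
 * (A=0, C=1, G=2, U or T=3). Returns -1 if any character
 * is not one of these nucleotides. */
long kmer_index(const char *s, size_t k)
{
    long idx = 0;
    for (size_t i = 0; i < k; i++) {
        int code;
        switch (s[i]) {
        case 'A': case 'a': code = 0; break;
        case 'C': case 'c': code = 1; break;
        case 'G': case 'g': code = 2; break;
        case 'U': case 'u':
        case 'T': case 't': code = 3; break;
        default: return -1;
        }
        idx = idx * 4 + code;
    }
    return idx;
}

/* Add one sequence to the running totals.
 * probs[i] is the pairing probability of nucleotide i.
 * sums[] and counts[] each have 4^k entries, indexed by kmer_index(). */
void accumulate_kmer_probs(const char *seq, const double *probs, size_t len,
                           size_t k, double *sums, unsigned long *counts)
{
    assert(seq != NULL && probs != NULL && sums != NULL && counts != NULL);
    if (k == 0 || k > len)
        return;

    for (size_t start = 0; start + k <= len; start++) {
        long idx = kmer_index(seq + start, k);
        if (idx < 0)
            continue;  /* assumption: skip k-mers with ambiguous bases such as N */

        double total = 0.0;
        for (size_t j = 0; j < k; j++)
            total += probs[start + j];

        sums[idx] += total / (double)k;
        counts[idx]++;
    }
}
```

Under these assumptions, sums[i] / counts[i] gives the mean probability for k-mer i within one set. Running the accumulation once over the test sequences and once over the control sequences would yield two values per k-mer that are comparable to the test and control probability columns.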

Notes

  • Input sequences should be in FASTA format.
  • Window size (--seq-windows) determines resolution: smaller windows capture fine structure, larger windows smooth over broader regions.
  • Parallelization may speed up folding but increases memory usage.
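
Because the input must be FASTA and the --threads warning applies to sequences longer than 16,000 nt, it can help to scan an input file before a run. The C sketch below is a standalone helper, not part of KSTRUCT; the function name scan_fasta is hypothetical. It counts records and reports the length of the longest sequence.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Scan a FASTA stream. Stores the length of the longest sequence in
 * *longest and returns the number of records (lines starting with '>').
 * Returns -1 if sequence data appears before the first header. */
long scan_fasta(FILE *fp, size_t *longest)
{
    assert(fp != NULL && longest != NULL);

    long records = 0;
    size_t current = 0;      /* residues seen in the current record */
    int at_line_start = 1;
    int in_header = 0;
    int c;

    *longest = 0;
    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n') {
            at_line_start = 1;
            in_header = 0;
            continue;
        }
        if (at_line_start && c == '>') {
            records++;
            in_header = 1;
            current = 0;
        } else if (!in_header && !isspace(c)) {
            if (records == 0)
                return -1;   /* not FASTA: residues before any header */
            current++;
            if (current > *longest)
                *longest = current;
        }
        at_line_start = 0;
    }
    return records;
}
```

If the reported longest length exceeds 16,000, running with the default of one thread avoids the multi-threading problem described under --threads.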

Citation

If you use KSTRUCT in your research, please cite the associated publication.