KSTRUCT: RNA Base-Pair Probability Analysis Tool
KSTRUCT is a C program designed to analyze RNA sequences and identify enriched base-pair interactions in protein-bound RNA. It follows a structured pipeline including RNA folding, k-mer counting, and enrichment analysis, producing a CSV file with ranked k-mers based on their likelihood of being bound or unbound.
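To make the k-mer counting stage of the pipeline concrete, here is a minimal Python sketch of counting overlapping k-mers in one sequence. It is not KSTRUCT's own implementation (KSTRUCT is written in C); the function name `count_kmers` and the T-to-U normalization are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(seq: str, k: int = 5) -> Counter:
    """Count every overlapping k-mer in seq (illustrative helper, not KSTRUCT code)."""
    # Assumption: input may use DNA-style T, so normalize it to RNA U.
    seq = seq.upper().replace("T", "U")
    # A sequence of length n has n - k + 1 overlapping k-mers (none if n < k).
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    counts = count_kmers("AGCUUCGAGCUU", k=5)
    for kmer, n in counts.most_common():
        print(kmer, n)
```

The default `k=5` mirrors the documented default of the `-k` option.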
Usage
kstruct [OPTIONS] [<input.fa>] [<bound.fa>]
At minimum, you must provide:
- -t, --test: RNA sequences bound to protein (test set)
- -c, --control: Control RNA sequences (background set)
Command Line Options
General
- -h, --help: Print help and exit.
- --detailed-help: Print detailed help, including hidden options, and exit.
- -V, --version: Print program version and exit.
Input/Output
- -t, --test=filename: Input file containing protein-bound RNA sequences. Only one test file is supported.
- -c, --control=filename: Input file containing control RNA sequences. Only one control file is supported.
- -o, --output=filename: Set the output file prefix. Default: rna (produces rna.csv).
- -k, --kmer=INT: Length of k-mers to analyze. Default: 5.
- --threads=INT: Number of threads for parallel computation. Default: 1. ⚠️ Warning: For long sequences (>16,000 nt), multi-threading may cause incorrect counts or failures. Use with caution.
- -d, --delimiter=char: Output delimiter. Default: , (CSV). Supported delimiters:
  - , → .csv
  - \t (tab) → .tsv
  - : or | or space → .dsv
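The delimiter-to-extension rule listed above can be summarized in a few lines of Python. This is only a sketch of the documented mapping; the helper name `output_filename` is hypothetical, and raising an error on an unsupported delimiter is an assumption, since KSTRUCT's actual behavior in that case is not described here.

```python
def output_filename(prefix: str = "rna", delimiter: str = ",") -> str:
    """Return the output file name implied by a prefix and delimiter (illustrative)."""
    if delimiter == ",":
        extension = "csv"
    elif delimiter == "\t":
        extension = "tsv"
    elif delimiter in (":", "|", " "):
        extension = "dsv"
    else:
        # Assumption: anything outside the documented set is rejected.
        raise ValueError(f"unsupported delimiter: {delimiter!r}")
    return f"{prefix}.{extension}"


if __name__ == "__main__":
    print(output_filename())                        # defaults: prefix "rna", comma
    print(output_filename("structure_test", "\t"))  # custom prefix, tab delimiter
```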
Algorithms
- -w, --seq-windows[=INT]: Split sequences into sliding windows of the specified size (default: 20) and compute mean base-pair probabilities per nucleotide. Example: for sequence AGCUUCGA with window size 5:
  AGCUU GCUUC CUUCG UUCGA
  Each window’s base-pair probabilities are calculated, then averaged across overlapping nucleotides to yield position-specific probabilities.
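The windowing and averaging steps can be sketched in Python as follows. This is an illustration under stated assumptions, not KSTRUCT's code: windows advance one nucleotide at a time (as in the AGCUUCGA example), a sequence shorter than the window is kept as a single window, and the per-window probabilities are placeholder values rather than real folding output. The names `sliding_windows` and `average_per_position` are hypothetical.

```python
def sliding_windows(seq: str, size: int = 20) -> list[str]:
    """Split seq into overlapping windows of `size`, stepping by one nucleotide."""
    if len(seq) <= size:
        # Assumption: short sequences are treated as one window.
        return [seq]
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


def average_per_position(seq_len: int, window_probs: list[list[float]]) -> list[float]:
    """Average per-window probabilities over every window covering each position.

    window_probs[w][j] is the pairing probability of the j-th nucleotide of
    window w, where window w starts at sequence position w.
    """
    totals = [0.0] * seq_len
    coverage = [0] * seq_len
    for start, probs in enumerate(window_probs):
        for offset, p in enumerate(probs):
            totals[start + offset] += p
            coverage[start + offset] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, coverage)]


if __name__ == "__main__":
    seq = "AGCUUCGA"
    windows = sliding_windows(seq, size=5)
    print(windows)
    # Placeholder probabilities (made up for illustration): one value per
    # nucleotide in each window.
    fake_probs = [[0.2] * len(w) for w in windows]
    print(average_per_position(len(seq), fake_probs))
```

Positions near the ends of a sequence are covered by fewer windows than positions in the middle, so their averages rest on fewer estimates.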
Examples
Basic run with default settings
kstruct -t bound.fa -c control.fa
Produces rna.csv with 5-mer base-pair enrichment analysis.

Custom k-mer size and output name
kstruct -t bound.fa -c control.fa -k 7 -o structure_test
Produces structure_test.csv with 7-mer probabilities.

Sliding window analysis (30 nt window)
kstruct -t bound.fa -c control.fa -w 30
Computes mean base-pair probability across 30-nt windows.

Tab-delimited output
kstruct -t bound.fa -c control.fa -d "\t"
Produces rna.tsv instead of CSV.

Multi-threading (use cautiously with long sequences)
kstruct -t bound.fa -c control.fa --threads=4
Output Format
The output file contains ranked k-mers with their associated base-pair probabilities. Columns typically include:
- k-mer : The k-mer sequence
- Test probability : Average base-pair probability in the test set
- Control probability : Average base-pair probability in the control set
- Enrichment : Relative difference in base-pairing likelihood
If sliding windows are used, additional per-position averaged probabilities are included.
Notes
- Input sequences should be in FASTA format.
- Window size (--seq-windows) determines resolution: smaller windows capture fine structure, larger windows smooth over broader regions.
- Parallelization may speed up folding but increases memory usage.
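Because inputs must be FASTA, it can be useful to sanity-check a file before running KSTRUCT. The following Python sketch reads FASTA records from any text handle; it is a generic reader written for illustration (the name `read_fasta` is hypothetical) and says nothing about how KSTRUCT itself parses its input.

```python
import io
from typing import Iterator, TextIO


def read_fasta(handle: TextIO) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA text (illustrative reader)."""
    header = None
    chunks: list[str] = []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Made-up two-record example; with a real file, use open("bound.fa").
    example = io.StringIO(">seq1\nAGCUU\nCGA\n>seq2\nGGGAAACCC\n")
    for name, seq in read_fasta(example):
        print(name, len(seq))
```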
Citation
If you use KSTRUCT in your research, please cite the associated publication.