cel_mirna_v22_report — bash

C. elegans miRNA Sequence Analysis
Pipeline Results Report

author: CharlesDexterW
date: 2026-04-08
miRBase: v22
organism: cel
targets: let-7 | lin-4
repo: cel-mirna-seqkit-pipeline

Background

This report presents the results of an automated sequence analysis of microRNA precursors from Caenorhabditis elegans extracted from the miRBase v22 database. Two founding members of the miRNA family were selected as targets: cel-let-7 and cel-lin-4.

MicroRNAs (miRNAs) are small (~22 nt) non-coding RNA molecules that regulate gene expression post-transcriptionally by binding to complementary sequences in target mRNAs, suppressing translation or triggering degradation. They are estimated to regulate ~60% of human protein-coding genes and are implicated in development, cancer, and aging.

cel-let-7 (MI0000001) — described by Reinhart et al. (2000). A highly conserved regulator of developmental timing present across bilaterians. Its conservation from worm to human established miRNAs as a universal gene-regulatory mechanism.

cel-lin-4 (MI0000002) — discovered by Lee et al. (1993), the first miRNA ever described. Essential for larval developmental timing in C. elegans via repression of the LIN-14 protein.

GC content is reported as a primary biochemical metric. Higher GC content correlates with greater thermal stability of the RNA duplex — relevant to secondary structure formation in precursor hairpins and their recognition by Dicer and Argonaute proteins during miRNA biogenesis.

Pipeline

The pipeline executes six sequential steps from a single shell invocation. The source database (hairpin.fa, miRBase v22) is downloaded once and cached; all subsequent runs skip the download step. Sequences are filtered by a dynamically constructed regex pattern, converted from RNA to DNA (U → T), and passed to SeqKit's fx2tab module for metric extraction. Results are written to both a TSV data file and a Markdown report.
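The filter, convert, and tabulate stages described above can be sketched as a single SeqKit chain. This is a minimal sketch under stated assumptions, not the source of analyze_mirna.sh: the function name extract_metrics and its argument order are hypothetical, seqkit is assumed to be on PATH, and matching against the full header (-n) is a guess at how the real script calls seqkit grep.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from analyze_mirna.sh): filter, convert, tabulate.
# Usage: extract_metrics <fasta> <regex>
extract_metrics() {
    local fasta="$1" pattern="$2"
    # grep   -n: match the full header line, -r: treat pattern as regex,
    #        -i: ignore case, -p: the pattern itself
    # seq    --rna2dna: rewrite U as T
    # fx2tab -n: names only (no sequence), -l: add length, -g: add GC percentage
    seqkit grep -n -r -i -p "$pattern" "$fasta" \
        | seqkit seq --rna2dna \
        | seqkit fx2tab -n -l -g
}
```

A call such as extract_metrics hairpin.fa "cel-.*let-7|cel-.*lin-4" would stream tab-separated rows to stdout, which can then be redirected to a TSV file.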

The RNA → DNA conversion does not affect GC content: replacing U with T leaves the count of guanine and cytosine bases unchanged, so reported GC% values are biochemically equivalent to the original RNA sequences.
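The invariance can be checked directly with standard tools. The sketch below is independent of the pipeline: gc_percent is a hypothetical helper, and the sequence is a short toy string rather than one of the precursors analysed here.

```shell
#!/usr/bin/env bash
# gc_percent: GC percentage of a bare, non-empty sequence string (case-insensitive).
gc_percent() {
    printf '%s\n' "$1" | awk '{
        total = length($0)             # sequence length before any substitution
        gc = gsub(/[GCgc]/, "")        # gsub returns how many G/C characters it removed
        printf "%.2f\n", 100 * gc / total
    }'
}

rna="GGCAUUCGAU"                               # toy RNA string (assumption)
dna=$(printf '%s' "$rna" | tr 'Uu' 'Tt')       # U -> T, as in the pipeline's conversion step

echo "RNA: $(gc_percent "$rna")  DNA: $(gc_percent "$dna")"
```

Both values come out as 50.00, because the substitution only touches U and the G/C count and total length are unchanged.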

Reproducibility: the pipeline targets miRBase v22 via a pinned static URL, so the input does not change when the database is updated. Given the same SeqKit version, results should be identical across machines and runs. Full source and documentation at github.com/CharlesDexterW/cel-mirna-seqkit-pipeline.
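The download-once behaviour can be sketched as a small guard around wget. This is an illustration only: fetch_once is a hypothetical name, the URL is passed in by the caller (no real miRBase address is assumed here), and the actual script may implement caching differently.

```shell
#!/usr/bin/env bash
# fetch_once: download a file only if a non-empty local copy is missing.
# Usage: fetch_once <url> <destination>
fetch_once() {
    local url="$1" dest="$2"
    # -s is true only for an existing, non-empty file, so an empty file left
    # behind by a failed download is fetched again on the next run.
    if [ -s "$dest" ]; then
        echo "$(basename "$dest") already exists. Skipping download."
        return 0
    fi
    wget -q -O "$dest" "$url"
}
```

Recording a checksum of the cached file (for example with sha256sum) alongside the results would let a later run confirm that it is working from the same input.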

Data

Source

Input: hairpin.fa from miRBase v22 — a FASTA file containing all precursor hairpin sequences across all organisms in the database. Filtered to C. elegans sequences matching the let-7 and lin-4 gene families using case-insensitive regex via seqkit grep.

Invocation

The following command produced all results in this report. No script modification was required; organism and gene targets are passed as CLI arguments, and the applied regex pattern is echoed at runtime for full audit traceability.

bash exact command used to produce these results
./analyze_mirna.sh cel "let-7|lin-4"

# Pattern applied:  cel-.*let-7|cel-.*lin-4
# Output directory: ./cel_analysis_v22/
# TSV data file:    cel_mirna_v22_results.tsv
# Markdown report:  cel_mirna_v22_report.md
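One way the echoed pattern could be derived from the two CLI arguments is to prefix each "|"-separated target with the organism code. The sketch below is a hypothetical reconstruction (build_pattern is not a name from the repository), shown only to make the mapping from arguments to regex explicit.

```shell
#!/usr/bin/env bash
# build_pattern: prefix every "|"-separated target with "<organism>-.*".
# Hypothetical reconstruction; the real script may build the pattern differently.
build_pattern() {
    local organism="$1" targets="$2"
    printf '%s\n' "$targets" | awk -v org="$organism" -F'|' '{
        out = ""
        for (i = 1; i <= NF; i++) {
            # join alternatives with "|" and prepend the organism prefix to each
            out = out (i > 1 ? "|" : "") org "-.*" $i
        }
        print out
    }'
}

build_pattern cel "let-7|lin-4"   # prints: cel-.*let-7|cel-.*lin-4
```

Because the pattern is unanchored, a target such as let-7 would also match suffixed family members (let-7a, let-7b) in organisms that have them, which is presumably what allows whole gene families to be selected.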

Results

Summary statistics

sequences matched: 2 (let-7 · lin-4)
mean length: 96.5 nt (99 nt · 94 nt)
mean GC content: 48.85% (43.43% · 54.26%)
miRBase version: v22 (pinned · reproducible)

Sequence table

Direct output of cel_mirna_v22_results.tsv — all sequences matched by the pipeline. Columns are extracted by SeqKit fx2tab: full miRBase sequence identifier, length in nucleotides, and GC percentage calculated on the DNA-converted sequence.

Table 1. All matched sequences — C. elegans, miRBase v22 (n = 2)
#  gene       miRBase accession  description      length (nt)  gc (%)
1  cel-let-7  MI0000001          let-7 stem-loop  99           43.43
2  cel-lin-4  MI0000002          lin-4 stem-loop  94           54.26
Note on identifiers: the raw TSV preserves the full miRBase header verbatim (e.g. cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop). The table above splits the header into columns for readability. The higher of the two GC values is 54.26% (cel-lin-4).
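The header split used for the table can be illustrated with awk. This is a sketch rather than the report generator's code: split_header is a hypothetical name, and it assumes the layout "<id> <accession> <genus> <species> <description...>", so headers whose species name is not exactly two words would need extra handling.

```shell
#!/usr/bin/env bash
# split_header: break a miRBase header into gene, accession, and description.
# Assumes fields 3 and 4 hold a two-word species name (an assumption).
split_header() {
    printf '%s\n' "$1" | awk '{
        desc = $5
        for (i = 6; i <= NF; i++) desc = desc " " $i   # rejoin the description words
        printf "%s\t%s\t%s\n", $1, $2, desc
    }'
}

split_header "cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop"
```

For the example header this yields the tab-separated fields cel-let-7, MI0000001, and "let-7 stem-loop", matching the columns of Table 1.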

GC content

The two sequences differ substantially in GC content: cel-lin-4 (54.26%) is 10.83 percentage points higher than cel-let-7 (43.43%). A higher GC fraction is consistent with greater thermodynamic stability in the lin-4 precursor hairpin, though actual stability depends on which bases are paired in the stem, which this pipeline does not measure. The difference may reflect structural constraints specific to lin-4 and its repression of LIN-14, distinct from the broader developmental timing role of let-7, but two sequences cannot establish this.

Table 2. Per-sequence metrics with deviation from mean GC
sequence   length (nt)  gc (%)  Δ from mean gc (pp)
cel-let-7  99           43.43   −5.42
cel-lin-4  94           54.26   +5.41
mean       96.5         48.85
Sample size note: with n = 2 sequences, statistical inference is not appropriate. These results describe the two founding miRNA precursors as annotated in miRBase v22 for C. elegans. Extending the analysis to additional organisms or gene families via the CLI arguments would yield a larger, more analytically meaningful dataset.
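The mean and per-sequence deviations in Table 2 can be recomputed from the two GC values. The sketch below uses a hypothetical helper, summarize, and works from the unrounded mean. The slightly asymmetric −5.42 / +5.41 in Table 2 is what results from subtracting the rounded mean of 48.85 instead, which is presumably how the report was generated.

```shell
#!/usr/bin/env bash
# summarize: read "name<TAB>value" rows on stdin, print each value's deviation
# from the unrounded mean, then the mean itself (three decimals).
summarize() {
    awk -F'\t' '
        { name[NR] = $1; val[NR] = $2; sum += $2 }
        END {
            if (NR == 0) exit 1                 # no input rows: nothing to average
            mean = sum / NR
            for (i = 1; i <= NR; i++)
                printf "%s\t%+.3f\n", name[i], val[i] - mean
            printf "mean\t%.3f\n", mean
        }'
}

printf 'cel-let-7\t43.43\ncel-lin-4\t54.26\n' | summarize
```

With the unrounded mean (48.845) the two deviations are symmetric at ±5.415 percentage points, as expected for n = 2.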

Terminal output

Full stdout from a run with the parameters above. hairpin.fa was already cached from an earlier run, so the download step was skipped. This output is the provenance record for the results in this report.

stdout ./analyze_mirna.sh cel "let-7|lin-4" — complete run
[1/6] Creating workspace: cel_analysis_v22
  Organism : cel
  Targets  : let-7|lin-4
  Pattern  : cel-.*let-7|cel-.*lin-4
[2/6] hairpin.fa already exists. Skipping download.
[3/6] Processing cel sequences (pattern: cel-.*let-7|cel-.*lin-4)...
  Sequences found: 2
[4/6] Generating terminal summary...
--- BIOCHEMISTRY REPORT: cel miRNA (miRBase v22) ---
Generated on: Tue Apr 8 09:14:22 UTC 2026
---------------------------------------------------------
Sequence_ID                                                     Length  GC_Content
cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop      99      43.43%
cel-lin-4 MI0000002 Caenorhabditis elegans lin-4 stem-loop      94      54.26%
MEAN (n=2)                                                      96.5    48.85%
---------------------------------------------------------
[5/6] Exporting Markdown report to cel_analysis_v22/cel_mirna_v22_report.md...
[6/6] Done.
---------------------------------------------------------
SUCCESS: Analysis complete.
TSV data : ./cel_analysis_v22/cel_mirna_v22_results.tsv
Report   : ./cel_analysis_v22/cel_mirna_v22_report.md

Conclusions

  • 2 precursor sequences were successfully extracted from miRBase v22: cel-let-7 (MI0000001, 99 nt, GC 43.43%) and cel-lin-4 (MI0000002, 94 nt, GC 54.26%) — the two founding miRNA sequences in the database.
  • cel-lin-4 shows substantially higher GC content (+10.83 pp above cel-let-7), suggesting greater thermodynamic stability in the lin-4 precursor hairpin structure.
  • cel-let-7 is slightly longer (99 nt vs 94 nt), which may reflect more elaborate secondary structure in the precursor hairpin prior to Dicer processing.
  • The pipeline ran successfully end-to-end with FASTA validation, RNA→DNA conversion, and dual-format export (TSV + Markdown) confirmed across all six steps.
  • The pipeline is fully extensible: any miRBase organism and any combination of gene targets can be analysed by passing CLI arguments, with no script modification required.
Limitation: this analysis covers precursor hairpin sequences only (hairpin.fa). Mature miRNA sequences (mature.fa) are not included. Results reflect miRBase v22 annotations; later releases may contain revised accessions or additional C. elegans entries.

Session info

bash tool versions
bash --version | head -1 && seqkit version && wget --version | head -1
GNU bash, version 5.2.21(1)-release (x86_64-pc-linux-gnu)
seqkit v2.8.2
GNU Wget 1.21.4 built on linux-gnu

Dependency check (run at pipeline start):
wget — present
seqkit — present
awk — present
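A dependency check of the kind reported above is commonly written with command -v. The following is a minimal sketch, not the pipeline's own check: check_deps is a hypothetical name and its message format is an assumption.

```shell
#!/usr/bin/env bash
# check_deps: report whether each named tool is on PATH.
# Returns non-zero if at least one tool is missing.
check_deps() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: present"
        else
            echo "$tool: MISSING" >&2      # report problems on stderr
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_deps wget seqkit awk || exit 1 at the top of a script stops the run before any work is done if a required tool is absent.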

References

  • [1] Kozomara, A., Birgaoanu, M., & Griffiths-Jones, S. (2019). miRBase: from microRNA sequences to function. Nucleic Acids Research, 47(D1), D155–D162. doi:10.1093/nar/gky1141
  • [2] Shen, W., Le, S., Li, Y., & Hu, F. (2016). SeqKit: A cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLOS ONE, 11(10), e0163962. doi:10.1371/journal.pone.0163962
  • [3] Lee, R. C., Feinbaum, R. L., & Ambros, V. (1993). The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell, 75(5), 843–854. doi:10.1016/0092-8674(93)90529-Y
  • [4] Reinhart, B. J., et al. (2000). The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans. Nature, 403, 901–906. doi:10.1038/35002607