C. elegans miRNA Sequence Analysis
Pipeline Results Report
Background
This report presents the results of an automated sequence analysis of microRNA precursors
from Caenorhabditis elegans, extracted from the miRBase v22 database. The two founding
miRNAs, the first to be discovered, were selected as targets: cel-let-7
and cel-lin-4.
MicroRNAs (miRNAs) are small (~22 nt) non-coding RNA molecules that regulate gene expression post-transcriptionally by binding to complementary sequences in target mRNAs, suppressing translation or triggering degradation. They are estimated to regulate ~60% of human protein-coding genes and are implicated in development, cancer, and aging.
cel-let-7 (MI0000001): described by Reinhart et al. (2000). A highly conserved regulator of developmental timing present across bilaterians. Its conservation from worm to human helped establish miRNAs as a widespread gene-regulatory mechanism in animals.
cel-lin-4 (MI0000002): the first miRNA ever described, discovered by Lee et al. (1993). Essential for larval developmental timing in C. elegans via repression of the LIN-14 protein.
GC content is reported as a primary biochemical metric. Higher GC content correlates with greater thermal stability of the RNA duplex — relevant to secondary structure formation in precursor hairpins and their recognition by Dicer and Argonaute proteins during miRNA biogenesis.
Pipeline
The pipeline executes six sequential steps from a single shell invocation. The source
database (hairpin.fa, miRBase v22) is downloaded once and cached; all
subsequent runs skip the download step. Sequences are filtered by a dynamically
constructed regex pattern, converted from RNA to DNA (U → T),
and passed to SeqKit's fx2tab module for metric extraction. Results are
written to both a TSV data file and a Markdown report.
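The download-once behaviour described above can be sketched as follows. This is a minimal illustration, not the code in analyze_mirna.sh: the function name `fetch_once`, the `.part` temporary suffix, and the `MIRBASE_HAIRPIN_URL` variable are all hypothetical, and the report does not state the actual download URL.

```shell
#!/bin/sh
# fetch_once URL DEST: download only when DEST is missing or empty.
# Hypothetical helper; the real script's caching logic is not shown in this report.
fetch_once() {
    url=$1
    dest=$2
    if [ -s "$dest" ]; then
        echo "cache hit: $dest"
        return 0
    fi
    tmp="$dest.part"
    # Download to a temporary name so a failed transfer never looks cached.
    if wget -q -O "$tmp" "$url"; then
        mv "$tmp" "$dest"
        echo "downloaded: $dest"
    else
        rm -f "$tmp"
        echo "download failed: $url" >&2
        return 1
    fi
}

# Usage (MIRBASE_HAIRPIN_URL is an assumed variable you would set yourself):
#   fetch_once "$MIRBASE_HAIRPIN_URL" hairpin.fa
```

Testing for a non-empty file with `-s` rather than mere existence with `-e` means an interrupted download is not mistaken for a valid cache.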
The RNA → DNA conversion does not affect GC content: replacing U with T leaves the count of guanine and cytosine bases unchanged, so reported GC% values are biochemically equivalent to the original RNA sequences.
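The invariance of GC% under U → T conversion can be checked directly. The sketch below assumes a plain `tr` substitution on a bare sequence string; the report does not say which tool the pipeline uses for the conversion, and the input string is only an illustration, not a pipeline input.

```shell
#!/bin/sh
# Percent G+C of a bare sequence string (case-insensitive), two decimals.
gc_percent() {
    printf '%s\n' "$1" | awk '{
        seq = toupper($0)
        total = length(seq)
        gc = gsub(/[GC]/, "", seq)   # gsub returns the number of replacements
        printf "%.2f\n", 100 * gc / total
    }'
}

rna="UGAGGUAGUAGGUUGUAUAGUU"               # illustrative RNA string
dna=$(printf '%s' "$rna" | tr 'Uu' 'Tt')   # U -> T (assumed conversion method)

echo "RNA GC%: $(gc_percent "$rna")"
echo "DNA GC%: $(gc_percent "$dna")"
```

Both lines print the same percentage, because the substitution touches neither the G/C count nor the sequence length.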
Data
Source
Input: hairpin.fa from miRBase v22 — a FASTA file containing all precursor
hairpin sequences across all organisms in the database. Filtered to C. elegans
sequences matching the let-7 and lin-4 gene families using
case-insensitive regex via seqkit grep.
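To make the filtering step concrete, here is a portable awk stand-in for a case-insensitive header filter. With SeqKit installed, a roughly equivalent call would be `seqkit grep -n -r -i -p "$pattern" hairpin.fa`; the exact flags used by analyze_mirna.sh are not shown in this report, so treat that as an assumption. The function name `filter_fasta` and the placeholder sequences in the demo are hypothetical.

```shell
#!/bin/sh
# filter_fasta PATTERN FILE: keep FASTA records whose header matches PATTERN,
# ignoring case. Hypothetical stand-in for the pipeline's seqkit grep step.
filter_fasta() {
    awk -v pat="$1" '
        BEGIN { pat = tolower(pat) }
        /^>/  { keep = (tolower(substr($0, 2)) ~ pat) }   # decide once per record
        keep                                              # print header and sequence lines
    ' "$2"
}

# Demo on a toy file; the sequences are placeholders, not real miRBase data.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
>cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop
ACGUACGU
>cel-lin-4 MI0000002 Caenorhabditis elegans lin-4 stem-loop
GGCCAAUU
>xyz-mir-0 MI9999999 hypothetical non-matching record
UUUUAAAA
EOF
filter_fasta 'cel-.*let-7|cel-.*lin-4' "$tmp"
rm -f "$tmp"
```

Lower-casing the pattern is a shortcut that suits simple patterns like the one here; it would break patterns that rely on case-sensitive escape classes.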
Invocation
The following command produced all results in this report. No script modification was required; organism and gene targets are passed as CLI arguments, and the applied regex pattern is echoed at runtime for full audit traceability.
    ./analyze_mirna.sh cel "let-7|lin-4"
    # Pattern applied: cel-.*let-7|cel-.*lin-4
    # Output directory: ./cel_analysis_v22/
    # TSV data file: cel_mirna_v22_results.tsv
    # Markdown report: cel_mirna_v22_report.md
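The echoed pattern (cel-.*let-7|cel-.*lin-4) could be derived from the two CLI arguments along these lines. This is a sketch under assumptions: `build_pattern` is a hypothetical helper name, and the actual construction inside analyze_mirna.sh may differ.

```shell
#!/bin/sh
# build_pattern ORG GENES: turn an organism prefix and a pipe-separated gene
# list into "ORG-.*GENE1|ORG-.*GENE2". Hypothetical helper for illustration.
build_pattern() {
    org=$1
    genes=$2
    pattern=""
    old_ifs=$IFS
    IFS='|'                      # split the gene list on the pipe character
    for gene in $genes; do
        pattern="${pattern:+$pattern|}${org}-.*${gene}"
    done
    IFS=$old_ifs
    printf '%s\n' "$pattern"
}

build_pattern cel "let-7|lin-4"   # prints: cel-.*let-7|cel-.*lin-4
```

Prefixing every alternative with the organism code keeps a gene name such as let-7 from matching the same family in other species.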
Results
Summary statistics
Sequence table
Direct output of cel_mirna_v22_results.tsv — all sequences matched by
the pipeline. Columns are extracted by SeqKit fx2tab: full miRBase
sequence identifier, length in nucleotides, and GC percentage calculated on the
DNA-converted sequence.
| # | gene | miRBase accession | description | length (nt) | gc (%) |
|---|---|---|---|---|---|
| 1 | cel-let-7 | MI0000001 | let-7 stem-loop | 99 | 43.43 |
| 2 | cel-lin-4 | MI0000002 | lin-4 stem-loop | 94 | 54.26 |
Each row corresponds to one FASTA header from hairpin.fa (for example: cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop).
The table above splits the header into columns for readability. cel-lin-4 has
the higher GC content of the two sequences.
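The three extracted columns (header, length, GC%) can be cross-checked without SeqKit. The awk sketch below is an independent re-computation, not the pipeline's code; with SeqKit, output of roughly this shape would come from `seqkit fx2tab -n -l -g`, though the flags the script actually passes are an assumption. `fasta_metrics` is a hypothetical name.

```shell
#!/bin/sh
# fasta_metrics FILE: print "header<TAB>length<TAB>GC%" for each FASTA record.
# Hypothetical cross-check for the pipeline's fx2tab step.
fasta_metrics() {
    awk '
        function emit() {
            if (name != "") {
                n = length(seq)
                gc = gsub(/[GCgc]/, "", seq)   # count G and C, either case
                printf "%s\t%d\t%.2f\n", name, n, (n ? 100 * gc / n : 0)
            }
        }
        /^>/ { emit(); name = substr($0, 2); seq = ""; next }
             { seq = seq $0 }                  # join wrapped sequence lines
        END  { emit() }
    ' "$1"
}

# Usage on the converted FASTA (the file name here is an assumption):
#   fasta_metrics cel_filtered_dna.fa
```

Because only G and C are counted, the function gives the same GC% for the RNA and DNA forms of a sequence.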
GC content
The two sequences differ substantially in GC content: cel-lin-4
(54.26%) is 10.83 percentage points higher than cel-let-7 (43.43%).
Higher GC content is consistent with greater thermodynamic stability of the lin-4 precursor
hairpin, although GC% alone does not determine folding energy. The difference may reflect
structural constraints tied to the LIN-14 repression mechanism, as distinct from the broader
developmental timing role of let-7.
| sequence | length (nt) | gc (%) | Δ from mean gc |
|---|---|---|---|
| cel-let-7 | 99 | 43.43 | −5.42 |
| cel-lin-4 | 94 | 54.26 | +5.41 |
| mean | 96.5 | 48.85 | — |
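The deltas in the table are not symmetric (−5.42 vs +5.41) because they appear to be taken from the mean after rounding it to two decimals; the exact mean is 48.845, which would give ±5.415. The sketch below reproduces the tabulated values with integer arithmetic in hundredths of a percent. The half-up rounding rule is an assumption about how the mean was rounded.

```shell
#!/bin/sh
# GC% values from the table, scaled by 100 to stay in integer arithmetic.
let7=4343   # 43.43 %
lin4=5426   # 54.26 %

# Mean rounded half-up to two decimals: (a + b + 1) / 2 in hundredths.
mean=$(( (let7 + lin4 + 1) / 2 ))   # 4885, i.e. 48.85 %

d_let7=$(( let7 - mean ))           # -542, i.e. -5.42 pp
d_lin4=$(( lin4 - mean ))           #  541, i.e. +5.41 pp
gap=$(( lin4 - let7 ))              # 1083, i.e. 10.83 pp

echo "mean x100:        $mean"
echo "delta let-7 x100: $d_let7"
echo "delta lin-4 x100: $d_lin4"
echo "gap x100:         $gap"
```

Working in integers avoids floating-point rounding at the 48.845 boundary.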
Terminal output
Full stdout from a clean run with the parameters above. This constitutes the complete provenance record for the results in this report.
Conclusions
- 2 precursor sequences were successfully extracted from miRBase v22: cel-let-7 (MI0000001, 99 nt, GC 43.43%) and cel-lin-4 (MI0000002, 94 nt, GC 54.26%), the two founding miRNA sequences in the database.
- cel-lin-4 shows substantially higher GC content (+10.83 pp above cel-let-7), suggesting greater thermodynamic stability in the lin-4 precursor hairpin structure.
- cel-let-7 is slightly longer (99 nt vs 94 nt), which may reflect more elaborate secondary structure in the precursor hairpin prior to Dicer processing.
- The pipeline ran successfully end-to-end with FASTA validation, RNA→DNA conversion, and dual-format export (TSV + Markdown) confirmed across all six steps.
- The pipeline is fully extensible: any miRBase organism and any combination of gene targets can be analysed by passing CLI arguments, with no script modification required.
Scope: only precursor hairpin sequences were analysed (hairpin.fa). Mature miRNA sequences (mature.fa) are not
included. Results reflect miRBase v22 annotations; later releases may contain
revised accessions or additional C. elegans entries.
Session info
bash --version && seqkit version && wget --version | head -1
References
- [1] Kozomara, A., Birgaoanu, M., & Griffiths-Jones, S. (2019). miRBase: from microRNA sequences to function. Nucleic Acids Research, 47(D1), D155–D162. doi:10.1093/nar/gky1141
- [2] Shen, W., Le, S., Li, Y., & Hu, F. (2016). SeqKit: A cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLOS ONE, 11(10), e0163962. doi:10.1371/journal.pone.0163962
- [3] Lee, R. C., Feinbaum, R. L., & Ambros, V. (1993). The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell, 75(5), 843–854. doi:10.1016/0092-8674(93)90529-Y
- [4] Reinhart, B. J., et al. (2000). The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans. Nature, 403, 901–906. doi:10.1038/35002607