Bioinformatics · Python · Genomic Analysis
Mouse Genome Sequence Analysis Pipeline (mm9)
End-to-end annotation parsing, sequence extraction, coding-region quantification, and regulatory motif detection in Mus musculus
Python 3.12 · NumPy · Matplotlib · FASTA parsing · UCSC knownGene · TSS analysis · TATA motif · mm9 build · Reproducible pipeline

A reproducible Python pipeline that processes 10,674 transcripts across four mouse chromosomes — from raw FASTA and UCSC annotation files through to gene-length statistics, coding/non-coding quantification, and strand-aware TATA box detection. Structured for a single-command run with automated report and plot output.

10,674
Transcripts analysed across chr6, 11, 15 & 16
45.3%
Fraction of chromosome 6 bases marked as coding (67.7 of 149.5 Mb)
15,509 bp
Median transcript length (mean 45,343 bp)
~3.2 Mb
Longest transcript in the dataset
40 bp
TSS search window for TATA motif detection
2 files
Auto-generated outputs: report .txt + histogram .png
01
FASTA parser — LoadFastaFile()
Handles both plain and zipped FASTA files. Streams line-by-line with garbage collection to keep memory use stable on large chromosomes. Correctly captures the final sequence record — a common edge-case bug in naive parsers.
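The final-record edge case can be illustrated with a minimal sketch. The names `open_fasta_text` and `parse_fasta` are hypothetical stand-ins, not the pipeline's `LoadFastaFile()`, and the sketch assumes a zip archive contains a single FASTA member.

```python
import gzip
import io
import zipfile


def open_fasta_text(path):
    """Return a text handle for a plain, gzip, or zip FASTA file (sketch)."""
    if path.endswith(".zip"):
        archive = zipfile.ZipFile(path)
        # Assumption: the archive holds exactly one FASTA member.
        member = archive.namelist()[0]
        return io.TextIOWrapper(archive.open(member), encoding="ascii")
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="ascii")
    return open(path, "r", encoding="ascii")


def parse_fasta(handle):
    """Stream an open text handle into a {record_name: sequence} dict."""
    sequences = {}
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                sequences[name] = "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    # Flush the last record: no further '>' line arrives to trigger it.
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences
```

Collecting lines in a list and joining once per record avoids repeated string concatenation, which matters for chromosome-sized sequences.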
02
Gene annotation loader — LoadGene()
Parses UCSC knownGene tab-delimited tables into a keyed dictionary supporting both transcript and CDS coordinate modes via a single boolean flag.
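A minimal sketch of such a loader follows. It assumes the standard UCSC knownGene column order (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, ...); the function name `load_genes`, the `use_cds` flag, and the tuple layout are illustrative choices, not the signature of `LoadGene()`.

```python
def load_genes(lines, use_cds=False):
    """Parse knownGene rows into {transcript_id: (chrom, strand, start, end)}.

    With use_cds=False the transcript coordinates (txStart, txEnd) are kept;
    with use_cds=True the CDS coordinates (cdsStart, cdsEnd) are kept instead.
    """
    genes = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.rstrip("\n").split("\t")
        name, chrom, strand = fields[0], fields[1], fields[2]
        if use_cds:
            start, end = int(fields[5]), int(fields[6])
        else:
            start, end = int(fields[3]), int(fields[4])
        genes[name] = (chrom, strand, start, end)
    return genes
```

Because the function accepts any iterable of lines, it can be called with an open file handle, for example `load_genes(open(GENE_FILE))`, or with a list of strings in a test.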
03
Sequence extraction
Retrieves any transcript's nucleotide sequence by slicing the chromosomal string. Demonstrated on Cntn4 (uc009dcr.2, chr6: 105,627,738–106,624,264), including first-ATG start codon detection.
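The slicing step can be sketched as below. UCSC table coordinates are 0-based with an exclusive end, so they map directly onto a Python slice. The helper names are hypothetical, and the sketch covers forward-strand transcripts only; a minus-strand transcript would need reverse-complementing before the ATG search.

```python
def extract_sequence(chrom_seq, start, end):
    """Return the bases for a 0-based, end-exclusive interval."""
    return chrom_seq[start:end]


def first_atg(seq):
    """Return the 0-based offset of the first ATG in seq, or -1 if absent.

    The search is case-insensitive because UCSC FASTA files use lowercase
    letters for soft-masked repeats.
    """
    return seq.upper().find("ATG")
```

With a chromosome dictionary from the FASTA step, a call might look like `first_atg(extract_sequence(chroms["chr6"], tx_start, tx_end))`, where `chroms`, `tx_start`, and `tx_end` are placeholders for the loaded data.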
04
Gene length distribution — log-scaled histogram
Bins computed in log₁₀ space to expose the multi-order-of-magnitude spread. Log y-axis, human-readable bp tick labels, median annotation, minimal spine styling. Exported to PNG at 150 dpi.
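One way to compute log-spaced bins is sketched here with NumPy only. The function name `log_bin_counts` and the default of 50 bins are assumptions, and all lengths are assumed to be positive.

```python
import numpy as np


def log_bin_counts(lengths, n_bins=50):
    """Histogram positive lengths using bin edges evenly spaced in log10."""
    lengths = np.asarray(lengths, dtype=float)
    lo, hi = np.log10(lengths.min()), np.log10(lengths.max())
    edges = np.logspace(lo, hi, n_bins + 1)
    # Pin the outer edges to the data so rounding in 10**x cannot push
    # the smallest or largest value outside the binned range.
    edges[0], edges[-1] = lengths.min(), lengths.max()
    counts, edges = np.histogram(lengths, bins=edges)
    return counts, edges
```

The returned `edges` can be passed to Matplotlib as `ax.hist(lengths, bins=edges)`, followed by `ax.set_xscale("log")` and `ax.set_yscale("log")`, so that each bar has equal width on the log axis.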
05
Coding / non-coding quantification
Boolean NumPy array marks every annotated base position on chr6. Overlapping transcripts handled correctly via idempotent assignment — no double-counting. Result: 67.7 M coding bp vs. 81.8 M non-coding bp.
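The idempotent-assignment idea can be shown on a toy chromosome. This is a sketch under the assumption that intervals are 0-based and end-exclusive; `covered_bases` is a hypothetical name.

```python
import numpy as np


def covered_bases(chrom_length, intervals):
    """Count bases covered by at least one interval, and those covered by none."""
    mask = np.zeros(chrom_length, dtype=bool)
    for start, end in intervals:
        # Setting True twice changes nothing, so overlapping transcripts
        # cannot inflate the count.
        mask[start:end] = True
    covered = int(mask.sum())
    return covered, chrom_length - covered


# Two overlapping intervals and one separate interval on a 20 bp toy sequence.
covered, uncovered = covered_bases(20, [(2, 8), (5, 12), (15, 18)])
```

A boolean array costs one byte per base, so a chromosome of roughly 150 Mb needs about 150 MB of memory. The reported figures are self-consistent: 67.7 / (67.7 + 81.8) ≈ 0.453, which matches the 45.3% fraction.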
06
Strand-aware TATA motif search — TSSChroms()
Locates TSS positions accounting for strand direction, then searches a configurable upstream window for the canonical TATA box. Reports hit count and mean motif–TSS distance.
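The strand logic can be sketched as follows. The helper names are hypothetical and do not reflect the internals of `TSSChroms()`. The motif string `TATAAA`, the choice of the hit nearest the TSS, and measuring distance from the motif's first base are all assumptions made for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def tss_position(strand, tx_start, tx_end):
    """TSS in 0-based coordinates: txStart on '+', the last base on '-'."""
    return tx_start if strand == "+" else tx_end - 1


def upstream_window(chrom_seq, strand, tx_start, tx_end, window=40):
    """Return the bases upstream of the TSS, read 5'->3' on the gene's strand."""
    if strand == "+":
        return chrom_seq[max(0, tx_start - window):tx_start]
    # On the minus strand, upstream lies at higher genomic coordinates.
    return reverse_complement(chrom_seq[tx_end:tx_end + window])


def tata_distance(upstream, motif="TATAAA"):
    """Distance in bp from the motif start to the TSS, or None if no hit."""
    idx = upstream.upper().rfind(motif)  # rfind picks the hit nearest the TSS
    return None if idx < 0 else len(upstream) - idx


def summarise_hits(distances):
    """Return (hit count, mean distance) while ignoring windows with no motif."""
    hits = [d for d in distances if d is not None]
    return len(hits), (sum(hits) / len(hits) if hits else None)
```

Reverse-complementing the minus-strand window means a single forward-orientation motif search serves both strands.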
07
Automated plain-text report
write_report() collects all computed values — counts, statistics, coding fractions, motif results, example gene details — into a datestamped structured text file generated at the end of every run.
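A datestamped report writer might look like the sketch below. The signature, file-name pattern, and layout are assumptions for illustration; only the name `write_report` comes from the pipeline.

```python
from datetime import date


def format_report(results, run_date=None):
    """Render a {label: value} mapping as aligned plain-text lines."""
    run_date = run_date or date.today()
    lines = [f"Genome analysis report ({run_date.isoformat()})", "=" * 40]
    for label, value in results.items():
        lines.append(f"{label:<32}{value}")
    return "\n".join(lines) + "\n"


def write_report(results, prefix="report", run_date=None):
    """Write the report to '<prefix>_<YYYY-MM-DD>.txt' and return the path."""
    run_date = run_date or date.today()
    path = f"{prefix}_{run_date.isoformat()}.txt"
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(format_report(results, run_date))
    return path
```

Separating formatting from file output keeps the text layout testable without touching the file system.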
Main_file.py — top-of-file constants
# To adapt this pipeline to a different organism or window size,
# only these three constants need to change:
GENE_FILE  = 'mm9_sel_chroms_knownGene.txt'
FASTA_FILE = 'selChroms_mm9.fa.zip'
WINDOW     = 40   # bp upstream/downstream of each TSS