Upwork Portfolio — mm9 Genomic Analysis

Python 3.12, NumPy, Matplotlib, FASTA parsing, UCSC knownGene, TSS analysis, TATA motif, mm9 build, Reproducible pipeline

A reproducible Python pipeline that processes 10,674 transcripts across four mouse chromosomes — from raw FASTA and UCSC annotation files through to gene-length statistics, coding/non-coding quantification, and strand-aware TATA box detection. Structured for a single-command run with automated report and plot output.

Key results

10,674

Transcripts analysed across chr6, 11, 15 & 16

45.3%

Coding fraction of chromosome 6 annotation span

15,509 bp

Median transcript length (mean 45,343 bp)

~3.2 Mb

Longest transcript in the dataset

40 bp

TSS search window for TATA motif detection

2 files

Auto-generated outputs: report .txt + histogram .png

What was built

FASTA parser — LoadFastaFile()

Handles both plain and zipped FASTA files. Streams line by line, with explicit garbage collection, to keep memory use stable on large chromosomes. Correctly captures the final sequence record, which naive parsers often drop because no further header line follows it.
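The parsing approach can be sketched as follows. This is a minimal illustration, not the project's LoadFastaFile(): the names iter_fasta_lines and load_fasta are hypothetical, the zip archive is assumed to hold a single FASTA member, and explicit garbage collection is left out for brevity.

```python
import io
import zipfile


def iter_fasta_lines(path):
    """Yield text lines from a plain or zipped FASTA file (hypothetical helper)."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            # Assumption: the archive contains exactly one FASTA member.
            member = archive.namelist()[0]
            with archive.open(member) as raw:
                yield from io.TextIOWrapper(raw, encoding="ascii")
    else:
        with open(path, encoding="ascii") as handle:
            yield from handle


def load_fasta(path):
    """Return {record name: sequence} for every record in the file."""
    sequences = {}
    name, chunks = None, []
    for line in iter_fasta_lines(path):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                sequences[name] = "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    # Flush the last record: no header follows it to trigger the save above.
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences
```

Collecting lines in a list and joining once per record avoids repeated string concatenation, which matters for chromosome-sized sequences.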

Gene annotation loader — LoadGene()

Parses UCSC knownGene tab-delimited tables into a keyed dictionary supporting both transcript and CDS coordinate modes via a single boolean flag.
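A sketch of the single-flag design, assuming the standard knownGene column order (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, ...). The function name load_genes, the use_cds parameter, and the tuple layout are illustrative choices, not the signature of LoadGene().

```python
from typing import Dict, Iterable, Tuple

# (chrom, strand, start, end): hypothetical record layout for this sketch
GeneRecord = Tuple[str, str, int, int]


def load_genes(lines: Iterable[str], use_cds: bool = False) -> Dict[str, GeneRecord]:
    """Key each transcript by its ID, with transcript or CDS coordinates."""
    # Columns 3-4 hold txStart/txEnd, columns 5-6 hold cdsStart/cdsEnd.
    start_col, end_col = (5, 6) if use_cds else (3, 4)
    genes: Dict[str, GeneRecord] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        name, chrom, strand = fields[0], fields[1], fields[2]
        genes[name] = (chrom, strand, int(fields[start_col]), int(fields[end_col]))
    return genes
```

Because the function accepts any iterable of lines, it works on an open file handle as well as on an in-memory list, which keeps it easy to test.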

Sequence extraction

Retrieves any transcript's nucleotide sequence by slicing the chromosomal string. Demonstrated on Cntn4 (uc009dcr.2, chr6: 105,627,738–106,624,264), including first-ATG start codon detection.
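Slicing and first-ATG detection can be illustrated in a few lines. This sketch assumes UCSC table coordinates (0-based start, exclusive end), which map directly onto Python slices, and it ignores strand; the helper names are hypothetical.

```python
def extract_sequence(chrom_seq: str, start: int, end: int) -> str:
    """Slice a transcript out of the chromosome string.

    Assumes 0-based, half-open coordinates, as used in UCSC tables.
    """
    return chrom_seq[start:end]


def first_atg(seq: str) -> int:
    """Return the offset of the first ATG in the sequence, or -1 if absent."""
    # UCSC FASTA files use lower case for soft-masked repeats, so normalise first.
    return seq.upper().find("ATG")
```

For a minus-strand transcript the slice would need to be reverse-complemented before searching; that step is omitted here.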

Gene length distribution — log-scaled histogram

Bins computed in log₁₀ space to expose the multi-order-of-magnitude spread. Log y-axis, human-readable bp tick labels, median annotation, minimal spine styling. Exported to PNG at 150 dpi.
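One way to build such a histogram is sketched below. The helper names, the default of 50 bins, and the styling details are assumptions for illustration; Matplotlib is imported inside the plotting function so the bin helpers can be used without it.

```python
import numpy as np


def log_bins(lengths, n_bins=50):
    """Return n_bins + 1 edges evenly spaced in log10 between min and max length."""
    lengths = np.asarray(lengths, dtype=float)
    return np.logspace(np.log10(lengths.min()), np.log10(lengths.max()), n_bins + 1)


def format_bp(value):
    """Render a length as a short human-readable label."""
    if value >= 1e6:
        return f"{value / 1e6:g} Mb"
    if value >= 1e3:
        return f"{value / 1e3:g} kb"
    return f"{value:g} bp"


def plot_length_histogram(lengths, out_path, n_bins=50):
    """Save a log-log histogram of transcript lengths as a PNG."""
    import matplotlib
    matplotlib.use("Agg")  # render to file, no display needed
    import matplotlib.pyplot as plt
    from matplotlib.ticker import FuncFormatter

    fig, ax = plt.subplots()
    ax.hist(lengths, bins=log_bins(lengths, n_bins))
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.xaxis.set_major_formatter(FuncFormatter(lambda v, _pos: format_bp(v)))
    median = float(np.median(lengths))
    ax.axvline(median, linestyle="--", label=f"median {format_bp(median)}")
    ax.legend(frameon=False)
    for side in ("top", "right"):
        ax.spines[side].set_visible(False)  # minimal spine styling
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
```

With linear bins, the few megabase-scale transcripts would stretch the axis and squeeze most of the data into the first bar; log-spaced edges give each order of magnitude equal width.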

Coding / non-coding quantification

A Boolean NumPy array marks every annotated base position on chr6. Overlapping transcripts are handled correctly because setting a position to True is idempotent, so no base is counted twice. Result: 67.7 Mb coding vs. 81.8 Mb non-coding.
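The masking idea fits in a few lines. The function name covered_bases and the half-open (start, end) interval format are assumptions of this sketch.

```python
import numpy as np


def covered_bases(intervals, chrom_length):
    """Count positions covered by at least one interval, and the remainder."""
    mask = np.zeros(chrom_length, dtype=bool)
    for start, end in intervals:
        # Setting True twice changes nothing, so overlaps are counted once.
        mask[start:end] = True
    covered = int(mask.sum())
    return covered, chrom_length - covered
```

The reported totals are consistent with the headline figure: 67.7 / (67.7 + 81.8) ≈ 45.3%.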

Strand-aware TATA motif search — TSSChroms()

Locates TSS positions accounting for strand direction, then searches a configurable upstream window for the canonical TATA box. Reports hit count and mean motif–TSS distance.
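A strand-aware search might look like the sketch below. It is not the project's TSSChroms(): the function names are hypothetical, the motif TATAAA is an assumed consensus, the window is taken upstream only, and distance is measured from the motif's 5' end to the TSS. Coordinates are assumed 0-based and half-open.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def upstream_window(chrom_seq, strand, tx_start, tx_end, window):
    """Return up to `window` bases upstream of the TSS, read 5'->3' on the gene's strand."""
    if strand == "+":
        # Plus strand: the TSS is txStart, upstream lies at lower coordinates.
        return chrom_seq[max(0, tx_start - window):tx_start]
    # Minus strand: the TSS is the last base before txEnd, upstream lies at
    # higher coordinates and must be reverse-complemented.
    return reverse_complement(chrom_seq[tx_end:tx_end + window])


def tata_distance(chrom_seq, strand, tx_start, tx_end, window=40, motif="TATAAA"):
    """Distance in bp from the first motif hit to the TSS, or None if no hit."""
    region = upstream_window(chrom_seq, strand, tx_start, tx_end, window).upper()
    idx = region.find(motif)
    return None if idx == -1 else len(region) - idx


def summarise_tata(chrom_seq, genes, window=40, motif="TATAAA"):
    """Return (hit count, mean motif-TSS distance) over (strand, txStart, txEnd) tuples."""
    distances = []
    for strand, tx_start, tx_end in genes:
        d = tata_distance(chrom_seq, strand, tx_start, tx_end, window, motif)
        if d is not None:
            distances.append(d)
    mean = sum(distances) / len(distances) if distances else None
    return len(distances), mean
```

Reverse-complementing the minus-strand window means one motif string and one distance formula serve both strands.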

Automated plain-text report

write_report() collects all computed values — counts, statistics, coding fractions, motif results, example gene details — into a datestamped structured text file generated at the end of every run.
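A datestamped report writer could be as simple as the sketch below. The signature, the nested-dictionary input, and the report_YYYY-MM-DD.txt file name are assumptions for illustration and may differ from the project's write_report().

```python
from datetime import date
from pathlib import Path


def write_report(results, out_dir=".", today=None):
    """Write {section: {label: value}} to a datestamped text file and return its path."""
    today = today or date.today()
    stamp = today.isoformat()
    path = Path(out_dir) / f"report_{stamp}.txt"  # hypothetical naming scheme
    lines = [f"mm9 analysis report ({stamp})", ""]
    for section, values in results.items():
        lines.append(section)
        lines.append("-" * len(section))
        for label, value in values.items():
            lines.append(f"{label:<30}{value}")
        lines.append("")
    path.write_text("\n".join(lines), encoding="utf-8")
    return path
```

Passing the date in as a parameter keeps the function deterministic under test while defaulting to the run date in normal use.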

Configurable in three lines

Main_file.py — top-of-file constants

# To adapt this pipeline to a different organism or window size,
# only these three constants need to change:
GENE_FILE  = 'mm9_sel_chroms_knownGene.txt'
FASTA_FILE = 'selChroms_mm9.fa.zip'
WINDOW     = 40   # bp upstream/downstream of each TSS