A Public Engineering Experiment: Building an Open-Source GAMP 5 Training Dataset
I'm publishing an initial 50 URS + 50 FS as an open, CC-BY-SA, synthetically generated GAMP-aligned corpus — then attempting to fine-tune Qwen 3 (7B) on it. Public engineering, in real time, including the failures.
Everyone in pharma IT is currently discussing the potential of AI for GxP documentation. But fine-tuning a model to actually understand the SDLC, follow SOPs, and generate compliant documentation requires something that currently does not exist in the public domain: a high-quality, GAMP-aligned training corpus.
Over the next 30 days, I am running a public experiment to build and test exactly that.
Today, I am publishing an initial synthetically generated open-source dataset of 50 User Requirements Specifications (URS) and 50 Functional Specifications (FS) on my site and on my GitHub account.
Pharma orgs deploying high-risk AI in 2026 need GAMP-aligned training data for the technical-documentation and data-governance obligations (Art. 10, Art. 11). Today there's no public corpus to seed that against. I know because I searched.
To be absolutely clear on provenance: every document in this corpus is synthetically generated strictly from regulatory primary sources (FDA guidance, ISPE GAMP 5, ICH, ISO 13485). It contains zero anonymised customer data from past validation projects.
Over the next 30 days, the repository will expand to include more document types from this internal set, adding VP, VR, TP, TR, RA, IQ/OQ/PQ, Traceability Matrix and more.
High-quality training data shouldn't be a proprietary bottleneck; it is shared infrastructure. I'm publishing this corpus to provide a reliable baseline, so we can focus on the actual engineering challenge.
Once the dataset is fully public, the next phase begins: attempting to fine-tune a Qwen 3 (7B) model exclusively on this regulatory corpus.
I will be upfront: forcing an LLM to reliably navigate compliance and pass a human QA reviewer's rubric is not a trivial, weekend fine-tune. It is going to require rigorous testing, iteration, and likely some failures along the way. I will document that process as it happens.
I don't know the answer yet. I will share the evaluation results when we get there.
Where to find it
- Download (ZIP, ~1.2 MB): neuralarchitects.ae/gxp-corpus — the corpus hub page, no email gate, CC-BY-SA 4.0.
- Direct zip: gxp-corpus-v1.3.zip
- GitHub repo: github.com/neuralarchitects-de/gamp5-corpus — star or watch the repo for release notifications. v1.3 release ships the same zip as a downloadable asset.