Research program

Ayvu-Talian

Language modeling for minority language preservation.

Active research · Open work

low-resource NLP · cultural persistence · from-scratch transformers

PyTorch · From-scratch transformer

The question

How small can a useful language model be when the language itself is small — and what does 'useful' mean when the goal is preservation, not production?

Overview

A decoder-only transformer trained for Talian, a Venetian-derived language spoken in southern Brazil. The project explores what language modeling looks like when the training corpus is a community's living memory rather than the internet, and when the failure mode is cultural loss.

Approach

  1. Decoder-only transformer built from scratch: every layer understood, no pretrained shortcuts hiding the real constraints.
  2. Corpus construction as fieldwork: gathering, cleaning, and structuring a scarce, living textual record.
  3. Tokenization strategies for a language with unstable orthography and heavy code-switching with Portuguese.
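
To make the tokenization point concrete, here is a minimal sketch of one possible strategy: a character-level vocabulary with UTF-8 byte fallback. It is an illustration under stated assumptions, not the tokenizer this project uses. The class name `ByteFallbackCharTokenizer` and the sample strings are hypothetical. The idea is that any spelling variant or Portuguese insertion stays encodable, because characters missing from the vocabulary fall back to raw bytes instead of an unknown token.

```python
import unicodedata


class ByteFallbackCharTokenizer:
    """Character-level tokenizer with UTF-8 byte fallback (illustrative only)."""

    def __init__(self, corpus):
        # IDs 0..255 are reserved for raw bytes, so any string can be encoded.
        chars = sorted({ch for line in corpus for ch in self._norm(line)})
        self.char_to_id = {ch: 256 + i for i, ch in enumerate(chars)}
        self.id_to_char = {i: ch for ch, i in self.char_to_id.items()}

    @staticmethod
    def _norm(text):
        # NFC merges a base letter plus combining accent into one code point.
        # It does not change spelling, so competing orthographies stay distinct.
        return unicodedata.normalize("NFC", text)

    def encode(self, text):
        ids = []
        for ch in self._norm(text):
            if ch in self.char_to_id:
                ids.append(self.char_to_id[ch])
            else:
                ids.extend(ch.encode("utf-8"))  # fall back to byte IDs
        return ids

    def decode(self, ids):
        out = bytearray()
        for i in ids:
            if i < 256:
                out.append(i)  # raw byte from the fallback path
            else:
                out.extend(self.id_to_char[i].encode("utf-8"))
        return out.decode("utf-8")


# Placeholder strings for illustration only; not vetted Talian data.
tok = ByteFallbackCharTokenizer(["parola", "cità"])
sample = "São Paulo"
assert tok.decode(tok.encode(sample)) == sample  # lossless round trip
```

The design choice worth noting is that normalization here is limited to Unicode composition. Collapsing spelling variants into one form would shrink the vocabulary, but it would also discard exactly the variation the project wants to keep, so this sketch leaves that decision open.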

Open problems

Stated plainly, because pretending they're solved would be the opposite of research.

  • Evaluation without native benchmarks or large speaker populations.
  • Generation that respects dialectal variation instead of flattening it.
  • Transfer from high-resource relatives (Italian, Venetian) without erasing what makes Talian itself.
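
As a small illustration of the evaluation problem, the sketch below computes bits per character on held-out text, a quantity that stays comparable across different tokenizers. It is offered as one possible starting point, not as the project's metric. The function name `bits_per_character` and its inputs are assumptions: `token_log_probs` stands for the natural-log probabilities a model assigned to each token of the held-out text.

```python
import math


def bits_per_character(token_log_probs, text):
    """Negative log2-likelihood of `text`, averaged over its characters.

    token_log_probs: natural-log probability of each token in `text`,
    as produced by some model (hypothetical input for this sketch).
    """
    if not text:
        raise ValueError("text must be non-empty")
    total_nats = -sum(token_log_probs)
    total_bits = total_nats / math.log(2)  # convert nats to bits
    return total_bits / len(text)


# Two tokens, each given probability 0.25, covering four characters:
# 2 * 2 bits / 4 characters.
bpc = bits_per_character([math.log(0.25)] * 2, "abcd")
assert abs(bpc - 1.0) < 1e-9
```

A number like this only measures fit to the held-out text. It says nothing about whether speakers would accept the output or whether dialectal variation survives, which is why evaluation remains listed as open.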