I am one of these people! I am one of a handful of people who speak my ancestral language, Kiksht. I am lucky to be uniquely well-suited to this work, as I am (as far as I know) the lone person from my tribe whose academic research background is in linguistics, NLP, and ML. (We have, e.g., linguists, but very few computational linguists.)
So far I have not had much luck getting the models to learn Kiksht grammar and morphology via in-context learning; I think a model will have to be trained on the corpus to actually work. That mostly makes sense, since Kiksht has functionally nothing in common with Western languages.
To illustrate the point a bit: the bulk of training data is still English, and in English the semantics of a sentence are derived mainly from the specific order in which the words appear, mostly because English lost its cases some centuries ago. Its morphology is mainly "derivational" and mainly suffixal, meaning that words can be arbitrarily complicated by adding suffixes to them. Word order is so baked into English that we sometimes insert words into sentences simply to make the word order sensible. E.g., when we say "it's raining outside", the "it's" refers to nothing at all; it is there entirely because the word order of English demands that it exist.
Kiksht, in contrast, is completely different. Its semantics are derived nearly entirely from the triple-prefixal structure of (in particular) verbs. Word ordering almost does not matter. There are something like 12 tenses, and some of them require both a prefix and a reflexive suffix. Verb stems are often only 1 or 2 characters, and with the prefix structure, a single verb can often be a complete sentence. And so on.
I will continue working on this because I think it will eventually be of help. But right now the deep learning work that has been most helpful to me has been in things like computational typology. For example, discovering the "vowel inventory" of a language is shockingly hard. Languages have somewhat consistent consonants, but discovering all the varieties of `a` that one can say in a language is very hard, and deep learning is strangely good at it.
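To make the typology idea concrete: one common framing (not necessarily the parent's actual method) is to treat vowel-inventory discovery as clustering acoustic measurements, e.g. first/second formant (F1/F2) values extracted from recordings, and letting a model decide how many vowel categories the data supports. A minimal sketch with synthetic formant data standing in for real field recordings:

```python
# Sketch: estimating a vowel inventory by clustering formant
# measurements (F1/F2, in Hz). Real work would extract formants from
# field recordings; here three synthetic vowel-like clusters stand in
# so the example is self-contained. All numbers are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical F1/F2 centers for three vowel qualities.
centers = np.array([[300, 2300], [700, 1200], [400, 800]])
frames = np.vstack([c + rng.normal(0, 40, size=(200, 2)) for c in centers])

# Pick the number of vowel categories by BIC over candidate counts.
best = min(
    (GaussianMixture(n_components=k, random_state=0).fit(frames)
     for k in range(1, 7)),
    key=lambda m: m.bic(frames),
)
print(best.n_components)  # the recovered number of vowel categories
```

With cleanly separated synthetic clusters this recovers three categories; real recordings are far messier, which is where deeper acoustic models earn their keep.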
I am also working on low-resource languages (in Central America, not my heritage). From Wikipedia [0] it seems Kiksht is a case of revival. Are you collecting resources/data or using existing material? (I see some links on Wikipedia.)
We are fortunate to have a (comparatively) large amount of written and recorded language artifacts. Kiksht (and Chinookan languages generally) were heavily studied in the early 1900s by linguists like Sapir.
re: revival, the Wikipedia article is a little misleading: Gladys was the last person whose first language was Kiksht, not the last speaker. And, in any event, languages are constantly changing. If we had been left alone in 1804, the language would still be different now than it was then. We will mold the language to our current context just like any other people.
Super interesting, thank you very much for sharing your thoughts!
HN is still one of the few places on the internet to get such esoteric, expert, and intellectually stimulating content. It's like an island where the spirit of 'the old internet' still lives on.
I am applying for graduate school (after 20 years in the software industry) with the intent of studying computational linguistics; specifically, for the documentation and support of dying/dead languages.
While I am not indigenous, I hope to help alleviate this problem. I'd love to hear about your research!
I did research on entirely unrelated NLP, actually. I worked on search for a while (I am co-recipient of a SIGIR best paper award, '17) and a bit on space- and memory-efficient NLP tasks.
Wow, Kiksht sounds like a pretty cool language! Are there any resources you'd recommend for the language itself? I'm mostly curious about the whole "a verb with prefix structure can be a whole sentence" thing; that sounds like a pretty cool language feature!
So, bad news. Culturally, the Wasq'u consider Kiksht something that is for the community rather than outsiders. So unfortunately I think it will be extremely challenging to find someone to teach you, or resources to teach yourself.
How do you square that feeling/observation with what you're working on now, which I'm guessing you'll eventually want to publish/make available somehow? Or maybe I misunderstood the "it will eventually be of help" part of your first comment.
I’m not sure that the AI itself has many ethical complications. Many terms of service grant AI companies ownership of the data you supply to them, though, and that is very problematic for many tribes which consider the language sacred and not to be shared.
Good luck, I wish you the best. I think you will almost certainly need to create a LoRA and fine-tune an existing model. Is there enough written material available? I think this would be a valuable effort for humanity: the more languages we can model, the more powerful our models will become, because they embody different semantic structures with different strengths. (Beyond the obvious benefits of language preservation.)
There is more material than you'd normally find, but it is very hard to even fine-tune at that volume, unfortunately. I think we might be able to bootstrap something like that with a shared corpus of related languages, though.
You can always fine-tune with the corpus you have and then try in-context learning on top of the fine-tuning, even if it's insufficient. Then with that, and perhaps augmenting with RAG against the corpus, you might be able to build a context that's stable enough in the language that you can generate human-mediated synthetic corpora and other reinforcement data to enrich the LoRA.
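For readers unfamiliar with the term: the LoRA being discussed is just a low-rank additive update to a frozen pretrained weight matrix, so only a small number of parameters need to be trained on the tiny corpus. A minimal numpy sketch of the math (toy dimensions, not a real model or anyone's actual setup):

```python
# LoRA in one matrix: effective weight = W + (alpha/r) * B @ A, where
# W is frozen and only the low-rank factors A (r x d) and B (d x r)
# are trained. All dimensions here are toy values for illustration.
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 16, 4, 8                 # hidden size, LoRA rank, scaling
W = rng.normal(size=(d, d))            # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01     # trainable down-projection
B = np.zeros((d, r))                   # trainable up-projection, zero-init

def forward(x):
    # Base output plus the low-rank correction learned on the new corpus.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(2, d))
# With B zero-initialized, the adapter starts as an exact no-op,
# so training begins from the pretrained model's behavior.
assert np.allclose(forward(x), x @ W.T)
```

The practical appeal for a low-resource language is the parameter count: here the adapter trains 2*d*r = 128 values against d*d = 256 frozen ones, and the ratio gets far more favorable at real model sizes, which is what makes tiny corpora viable at all.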
> I will continue working on this because I think it will eventually be of help.
They say language shapes thought. Having an LLM speak such a different language natively seems like it would uncover so much about how these models work, as a side effect of helping preserve Kiksht. What a cool place to be!
I learned it from my grandmother and from Gladys's grandkid. Gladys was the last person whose first language was Kiksht, not the last person who speaks it.
Maybe there are some sources saying fully fluent people are still alive? Currently the article gives the impression she was the last person to "fully" speak it.
I’m aware and I think people do not care much to correct it. The tribe has had consistently bad experiences with outsiders attempting to document this kind of thing and I sense that people are mostly wanting to be left alone at this point. There is some possibility that I would make a different decision, but it’s not for me to decide, it’s for the elders.
That being said, the OpenAI models do a fantastic job of translating sentences, so I've put my own model research on the back burner. (Will try to find some examples.)
I can't speak to true preservation, with not many native speakers left, but in my mind that's not even all that important from a personal/cultural perspective.
If the youth who are interested in learning more about their language have a nice interface with 70-80% accurate results and they enjoy doing/learning it then that is a win to me. (and kind of how language evolves anyway) (the noun replacement seems to work great, but grammar is obviously wishy-washy)
(At this point, I just rushed to get my tribe's dictionary crawlable, so hopefully it will be in a few models' next training phases.)
I was looking to name a property recently and tried ChatGPT for suggestions on a theme of geographical features in various languages. I had it try Kaurna (Adelaide area) on a whim and was pleasantly surprised (given that Google Translate doesn't cover it, I guess) to find that it gave loads of relevant suggestions without any issue. Any that I picked to selectively check verified with Kaurna dictionaries OK.
FWIW, this is the original use case of LLMs. The whole context awareness came about for the purpose of translation, and the famous "Attention Is All You Need" paper was from the Google translation team.
Say what you will about LLMs being overhyped, but this is the original core use case.
I've always resented the way "traditional" cultures are "preserved". It's like when people want to protect "authentic" locales from the corrupting influence of tourism. I'm grateful that my own culture is not (yet) seen primarily through this lens. Imagine the day when legislation makes exceptions for fentanyl as a traditional ritual substance of nomadic Trailer Americans.
To me the beauty of these things is in their liveliness, in the aspiration to flourish and grow, not merely to conserve a little longer, to spend one more night with terminal cancer before the inevitable.
It seems like this could be incredibly fraught with danger once LLMs are involved (the story isn't exactly clear whether they are). If there are no surviving native speakers of a language (or very few) doing something like training an LLM to generate text in that language would run the risk of e.g. transferring English grammar to the vocabulary, and causing a hybrid language to become the dominant form of that language, because the LLM is used widely for study and the native-speaking elders are not.
Is that so bad? If the only way to ensure a language's preservation and continued use is to mangle and evolve it? I think that's kind of a neat concept, to be honest. To live is to change.
The content of modern culture is too much for dying or ancient languages, and what you actually get is English/modern thoughtspace expressed in the lexicon of an until-now separate culture. This flood of spam destroys what was unique and interesting about the culture, and "skin-suits" it.
Maybe, unless some form of Sapir-Whorf is true, and the lexica—and grammar, and phonology, etc.—impose an inalienable umwelt, one inaccessible to hegemony.
They certainly do. But languages are also very elastic. If you speak modern German, for example, you'll find that it's very similar to modern American English in thought and expression. But this was not true of late 19th-century German and 19th-century English. There has unfortunately been substantial thought convergence due to mass communication. If all you speak is English, you may also find that the current dialectical forms are all much weaker and samey compared to the older forms.
Anyway, in resurrected ancient languages, this modern samey-ness is a problem. Go read Latin Wikipedia, for example. It's much more like reading modern English authors (though with Latin words and grammar) than it is like reading classical authors.
--
I add a natural language translation (from ChatGPT) of your statement for anyone who doesn't share your lexica:
> "Maybe, unless some version of the Sapir-Whorf hypothesis is correct, and language—along with its grammar, sounds, and other aspects—creates an inescapable worldview that can't be dominated."
This take is overly cynical in this case: the original and primary use case for LLMs is to model languages in a comprehensive way. It's literally in the name.
Hallucinations rarely produce invalid grammar or invent non-existent words; what we're concerned about is facts, which aren't relevant at all when the goal is language preservation.
Depending on how powerful the language modeling is, I suspect it could lead to an LLM which will confidently and convincingly tell you how to say things like "floppy disk" and "SSD" in every extinct language and even those that went extinct long before computers ever existed, which is... interesting, but not exactly truth.
I've seen LLMs hallucinate nonexistent things in programming languages. It's hard to believe it won't do the same to human ones.
Importantly, hallucinating non-existent things in programming languages still means stringing together valid English words into something that looks like it ought to be a correct concept. It doesn't construct new words from scratch.
If a language model were asked what the word for "floppy disk" was in an extinct language and it invented a decent circumlocution, I don't think that would be a bad thing. People who are just engaging with the model as a way of connecting with their cultural heritage won't mind much if there is some creative application, and scholars are going to be aware of the limitations.
Again, the misapplication of language models as databases is why hallucinations are a problem. This use case isn't treating the model as a database, it's treating the model as a model, so the consequences for hallucination are much smaller to the point of irrelevance.
I also don't think it was ever much of a problem for machine translation to begin with. Modern conversational systems are already addressing the problem (with things like contrastive/counterfactual fine-tuning and RAG/in-context learning) and will just tell you if they don't know something instead of hallucinating. But I'm pretty sure the OP doesn't know the difference between a language model and a conversational model anyway; it's just the generic "anti-LLM" opinion without much substance.
How would terms like "floppy disk" and "SSD" even appear in the target language if those terms weren't around when the speakers of the language were alive? Or are you thinking of a multi-language LLM that tries to automatically translate terms it didn't actually see in the source/target language during training?
It's common in non-dead languages. If your language is smaller, you're bound to have a lot of "foreign words" in it. To this day I find it funny how my native-language teachers would start to speak with tons of English and French words when they wanted to "showcase the complexity/beauty of our native language"/appear smart.
An LLM should do fine with that since it's usually the foreign word spelled in a way that makes sense in that language. I'm more curious about the inverse though. It's sometimes quite difficult to explain the meaning of a word in a language that does not have an equivalent, be it because it has a ton of different meanings or because it's some very specific action/object.
They're kind of bad at pretty much all languages except simpler forms of English and Python. The tonality in the big LLMs tends to be distinctly inhuman as well.
I suspect it'll be hard to find more material in some obscure, dying language than there is of either of those in the common training sets.
What is "they"? Are you saying transformer architecture somehow is biased towards English? Or are you saying that existing LLMs have that bias?
The only way this project is going to make sense will be to train it fresh on text in the language to be preserved, in order to avoid accidentally corrupting your model with English. If it's trained fresh on only target language content, I'm not sure how we can possibly generalize from the whole-internet models that you're familiar with.
I don't really care about the minutiae of the technical implementations; I'm talking from the experience of pushing text into LLMs, locally and remote, and getting translation in some direction or other back.
To me it doesn't make sense. It seems like an awful way to store information about a fringe language, but I'm certainly not an expert.
> and getting translation in some direction or other back.
This seems to make a lot of English speakers upset: that LLM outputs appear translated from the perspective of primarily non-English speakers. But hey, it's n>=2 even on HN now.
I don't know, the translation errors are often pretty weird, like 'hot sauce' being translated to the target equivalent of 'warm sauce'.
Since LLMs work by probabilistically stringing together sequences of words (or tokens or whatever), I don't expect them to ever become fully fluent, unless natural language degenerates and loses a lot of its flexibility and flourish and analogy and so on. Then we might hit some level of expressiveness that they can actually simulate fully.
The current boom is different from, but also very similar to, the previous age of symbolic AI. Back then they expected computers to be able to automate warfare and language and whatnot, but the prime use case turned out to be credit checks.
What language have you tried that they're bad at? I've tried a bunch of European languages and they are all perfect (or perfect enough for me to never know otherwise)
Swedish, German, European Spanish, French, and Luxembourgish French.
Sometimes they do a decent translation; too often they trip up on grammar or vocabulary, or just assume that a string of bytes always means the same thing. I find they work best in extremely formal settings, like documents produced by governments and lawyers.
Have you had the opportunity to interact with less wrapped versions of the models?
There's a lot of intentionality behind the way LLMs are presented by places like ChatGPT/DeepSeek/Claude; you're distinctly trying to talk to something that's actively limited in the way it can speak to you.
The problem isn't exactly nonexistent outside of them, but they make it worse than it is.
Does it matter? Even most Chinese models are trained with a <50% Chinese dataset last I checked, and they still manage to show an AliExpress accent that would be natural for a Chinese speaker with ESL training. They're multilingual but not language-agnostic; they can just grow English-to-$LANG translation ability as long as English stays the dominant and defining language in them.
I've run a bunch locally, sometimes with my own takes on system prompts and other adjustments because I've tried to make them less insufferable to use. Not as absurdly submissive, not as eager to do things I've not asked, not as censored, stuff like that.
I find they struggle a lot with things like long sentences and advanced language constructs regardless of the natural language they try to simulate. When it doesn't matter it's useful anyway, I can get a rough idea about the contents of documents in languages I'm not fluent in or make the bulk of a data set queryable in another language, but it's like a janky hack, not something I'd put in front of people paying my invoices.
Maybe there's a trick I ought to learn, I don't know.
This sounds like a grounded use of LLMs. Presumably they're feeding indigenous-language text into the NN to build a model of how the language works. From this model one may then derive starting points for grammar, morphology, vocabulary, and so on. Like, how would you say "large language model" in Navajo? If fed data on Navajo neologisms, an LLM might come up with some long word that means "the large thing by means of which, one can teach metal to speak" or similar. And the tribal community can take, leave, or modify that suggestion but it's based on patterns that are manifest in the language which AI statistical methods can elicit.
Machine learning techniques are really, really good at finding statistical patterns in data they're trained on. What they're not good at is making inferences on facts they haven't been specifically trained to accommodate.
I think what most LLMs have in common, is that they're pretty good at translations out-of-the-box, even the general purpose ones. Probably even the free versions of ChatGPT, Claude or fully free DeepSeek can help you out pretty well with that.
No doubt, it's excellent for archiving, but that's not the same as "preserving" culture. If it's not alive and kicking it's not a culture IMO. You see this happen even with texts : once things start being written down, the actual knowledge tends to get lost (see India for example).
This "AI to help low-resource languages" thing is a big deal in India too, but it just feels like another "jumla" (empty promise) for academics/techbros to make money. I mean, India has brutal/vicious policies that are out to destroy any and every language that's not English (since anything else is automatically a threat to central rule from Delhi), but pretty much no intellectual, either in India or the US, actually cares about the mass wiping-out of Indian languages by English... not even the ones who go "ree bad British man destroyed India" on Twitter all day.
>When they don't serve a purpose anymore, they should be replaced by something more functional.
If and when that's a voluntary and organic process, sure. The problem is that replacement more often than not comes about through ethnic cleansing and violence. These indigenous languages were perfectly functional to the people who spoke them at the time, but they were replaced because they didn't serve the purpose of colonizers.
And "lost" languages do get reclaimed from time to time. Hebrew and Irish being two examples.
Hebrew was brought back from the dead for the Jewish refugees populating Israel to have a common language. This solved a genuine practical problem.
Teaching Irish to kids who all speak English does not help anyone communicate better. It seems like a nationalist pride project, and those are not my favorite.
Israelis could just as well have chosen English for a common language, or Arabic, or any other living language. Reviving Hebrew was just as much about nationalist pride (or preserving culture if you like) as reviving Irish was for Ireland.
What is the purpose of culture? It's a way of life. Arguably no culture has purpose, so do we force people to live in a different way?
Cultures and languages die out because they're slowly (or revolutionarily quickly!) replaced by another. It's not like there are people out there speaking no language because their mother tongue has died out.
And who gets to decide that a language has no purpose?
> once things start being written down, the actual knowledge tends to get lost (see India for example)
Curious, what do you mean by this?
> pretty much no intellectual, either in India or the US, actually cares about the mass-wiping out of Indian languages by English
Well I've never heard of this so lack of awareness would be an obvious cause if it's an issue, are there any orgs raising awareness of it? Also seems surprising to me, Bollywood movies are immensely popular and are all in Hindi. Is there a danger of English overtaking Indian society to the extent where Bollywood movies would mostly be made in English?
Once an LLM knows an indigenous language, even if the last speaker dies out, future generations will be able to learn the language and use the LLM to converse with them in that language.
Learning a new language is a good use case for LLMs, not just indigenous languages, but any language.
As for your comment "ree bad British man destroyed India": this sounds more like politics than anything of substance.
Yes, but the LLM is roughly equivalent to lossy compression of the corpus it is trained on, so why wouldn't you preserve the actual corpus so that it can be used to train some better LLM, or something better than an LLM, in the future?
(There may be a good answer to that question: perhaps, for example, the corpus can't be preserved for data protection reasons but the LLM trained on it can be preserved? For various reasons that doesn't seem very plausible, however.)
Right, you can't keep a culture in stasis. It always changes. There's something to be said for protectionism though (e.g. language laws), with varying degrees of success (Japan quashed Christianity quite well, brutally). There's a reason behind the choice in semantics particularly when it comes to traditional cultures. We don't call it "preserving culture" when historians and archivists document things.
Indigeneity generally refers to the descendants of people who inhabited a territory prior to colonization or the formation of a country, often those who are disadvantaged as a result, and who continue to inhabit the land. Connection to the land is a fundamental element of indigeneity, as is the specific condition of dispossession by colonization: indigeneity, as it is understood today, emerges in relation to colonial processes.
So, although all distinct ethnicities may originate from specific places and times, indigeneity as a political and social identity is meaningful only in the context of colonial domination and resistance.
Not every engineer is engaged in an ongoing struggle for sovereignty against colonizing powers.
The same way linguists are already using audio recordings to preserve dying languages that belong to people who have never had access to audio recordings. You can make a record of a culture in a medium that they didn't/don't have themselves.
AI fabricates and hallucinates (or whatever term you want to use) by design - it doesn't provide a neutral record the way audio recordings might.
I think a better question might be what, specifically, does AI bring to the table that justifies its inherent inaccuracy? Why not just use audio recordings, etc?
LLMs, when trained on a language, provide a neutral representation of the statistics of that data set. Used correctly, that makes them perfect for this kind of task. You're reacting to the widespread misuse of LLMs as databases, but LLMs as language models are what they were designed for. It's in the name.
Used judiciously as a model, there's no problem. They only become a problem when people try to treat them as a database.
I mean, it's what it says on the tin: linguists go into remote areas and take recordings of dying languages to preserve them. They'll sometimes prompt for specific words and sentences, other times just record people speaking naturally. Then they'll also write down the rules of the grammar and the vocabulary (another example of using a medium they may not themselves have).