174 points by kiyanwang 33 days ago | 16 comments
antics 29 days ago
I am one of these people! I am one of a handful of people who speak my ancestral language, Kiksht. I am lucky to be uniquely well-suited to this work, as I am (as far as I know) the lone person from my tribe whose academic research background is in linguistics, NLP, and ML. (We have, e.g., linguists, but very few computational linguists.)

So far I have not had much luck getting the models to learn Kiksht grammar and morphology via in-context learning; I think the model will have to be trained on the corpus to actually work for it. I think this mostly makes sense, since Kiksht has functionally nothing in common with Western languages.

To illustrate the point a bit: the bulk of training data is still English, and in English the semantics of a sentence are derived mainly from the specific order in which the words appear, mostly because the language lost its cases some centuries ago. Its morphology is mainly "derivational" and mainly suffixal, meaning that words can be arbitrarily complicated by adding suffixes to them. Word order is so baked into English that we sometimes insert words into sentences simply to make the word order work, e.g., when we say "it's raining outside", the "it's" refers to nothing at all; it is there entirely because the word order of English demands that it exists.

Kiksht, in contrast, is completely different. Its semantics are nearly entirely derived from the triple-prefixal structure of (in particular) verbs. Word order almost does not matter. There are, like, 12 tenses, and some of them require both a prefix and a reflexive suffix. Verbs are often 1 or 2 characters, and with the prefix structure, a single verb can often be a complete sentence. And so on.

I will continue working on this because I think it will eventually be of help. But right now the deep learning that has been most helpful to me has been things like computational typology. For example, discovering the "vowel inventory" of a language is shockingly hard. Languages have somewhat consistent consonants, but discovering all the varieties of `a` that one can say in a language is very hard, and deep learning is strangely good at it.
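For the curious, here is a rough sketch of the kind of pipeline I mean, not my actual setup: embed audio frames with a pretrained self-supervised speech model and cluster them to approximate vowel categories. The checkpoint, file name, and cluster count below are placeholders.

  import torch, torchaudio
  from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
  from sklearn.cluster import KMeans

  # Pretrained self-supervised speech model (placeholder checkpoint).
  extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
  model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

  # Hypothetical field recording, resampled to the 16 kHz the model expects.
  waveform, sr = torchaudio.load("field_recording.wav")
  waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)

  inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
  with torch.no_grad():
      frames = model(**inputs).last_hidden_state.squeeze(0)  # (time, hidden_dim)

  # Cluster the frame embeddings; k=8 is a guess to be checked against a
  # linguist's hand-labeled vowel tokens, not a discovered inventory size.
  labels = KMeans(n_clusters=8).fit_predict(frames.numpy())

A linguist still has to decide which clusters are genuinely distinct vowels; the model just narrows the search.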

ks2048 29 days ago
Awesome. Good luck to you!

I am also working on low-resource languages (in Central America, but not my heritage). I see on Wikipedia [0] that it seems to be a case of revival. Are you collecting resources/data or using existing ones? (I see some links on Wikipedia.)

[0] https://en.wikipedia.org/wiki/Upper_Chinook_language

antics 29 days ago
We are fortunate to have a (comparatively) large amount of written and recorded language artifacts. Kiksht (and Chinookan languages generally) were heavily studied in the early 1900s by linguists like Sapir.

Re: revival, the Wikipedia article is a little misleading: Gladys was the last person whose first language was Kiksht, not the last speaker. And, in any event, languages are constantly changing. If we had been left alone in 1804, it would be different now than it was then. We will mold the language to our current context just like any other people.

yaseer 29 days ago
Super interesting, thank you very much for sharing your thoughts!

HN is still one of the few places on the internet to get such esoteric, expert, and intellectually stimulating content. It's like an island where the spirit of 'the old internet' still lives on.

bovermyer 29 days ago
I am applying for graduate school (after 20 years in the software industry) with the intent of studying computational linguistics; specifically, for the documentation and support of dying/dead languages.

While I am not indigenous, I hope to help alleviate this problem. I'd love to hear about your research!

antics 29 days ago
I did research on entirely unrelated NLP, actually. I worked on search for a while (I am a co-recipient of the SIGIR ’17 best paper award) and a bit on space- and memory-efficient NLP tasks.
bovermyer 29 days ago
Still, that's cool. Is this the paper in question? https://dl.acm.org/doi/10.1145/3077136.3080789
antics 28 days ago
That’s the one!
amarant 29 days ago
Wow, Kiksht sounds like a pretty cool language! Are there any resources you'd recommend for the language itself? I'm mostly curious about the whole "a verb with prefix structure can be a whole sentence" thing, that sounds like a pretty cool language feature!
thaumasiotes 29 days ago
> I'm mostly curious about the whole "a verb with prefix structure can be a whole sentence" thing, that sounds like a pretty cool language feature!

That's a fairly common language feature; such languages are generally called "agglutinating".

Prominent examples of agglutinating languages are the Eskimo languages, Turkic languages, and Finnish.

https://archive.is/QQnB6

There should be no shortage of resources available if you want to learn Turkish or Finnish.

antics 29 days ago
So, bad news. Culturally, the Wasq'u consider Kiksht something that is for the community rather than outsiders. So unfortunately I think it will be extremely challenging to find someone to teach you, or resources to teach yourself.
diggan 29 days ago
How do you reconcile that feeling/observation with what you're working on now, which I'm guessing you'll eventually want to publish/make available somehow? Or maybe I misunderstand what the "it will eventually be of help" part of your first comment means.
antics 29 days ago
I myself am very conflicted. I will always do what the elders say but opening things up greatly enhances the chance of survival.
diggan 29 days ago
Yeah, I understand, somewhat of a dilemma. Still, I wish you luck and hope that you manage to find a way that is acceptable to everyone involved.
RobotToaster 29 days ago
Does that present philosophical questions about whether an AI is part of the community or an outsider?
antics 29 days ago
I’m not sure that the AI itself has many ethical complications. Many terms of service grant AI companies ownership of the data you supply to them, though, and that is very problematic for many tribes which consider the language sacred and not to be shared.
birdyrooster 29 days ago
[dead]
macinjosh 29 days ago
I wonder why their language is dying. /s
Affric 29 days ago
Having a language that only your in-group can speak is a good survival strategy.
fnordpiglet 29 days ago
Good luck, I wish you the best. I think you will almost certainly need to create a LoRA and fine-tune an existing model. Is there enough written material available? I think this would be a valuable effort for humanity, as I think the more languages we can model, the more powerful our models will become, because they embody different semantic structures with different strengths. (Beyond the obvious benefits of language preservation.)
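A minimal sketch of what that LoRA fine-tune could look like with Hugging Face PEFT; the base model, rank, and file name are placeholders rather than a tested recipe:

  from datasets import load_dataset
  from peft import LoraConfig, get_peft_model
  from transformers import (AutoModelForCausalLM, AutoTokenizer,
                            DataCollatorForLanguageModeling, Trainer, TrainingArguments)

  base = "mistralai/Mistral-7B-v0.1"  # placeholder; any open base model
  tok = AutoTokenizer.from_pretrained(base)
  tok.pad_token = tok.eos_token
  model = AutoModelForCausalLM.from_pretrained(base)

  # Wrap the base model with low-rank adapters; only the adapters get trained.
  model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

  # Hypothetical plain-text corpus, one sentence per line.
  ds = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
  ds = ds.map(lambda x: tok(x["text"], truncation=True, max_length=512), batched=True)

  Trainer(
      model=model,
      args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=1,
                             num_train_epochs=3),
      train_dataset=ds,
      data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
  ).train()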
antics 29 days ago
There is more material than you'd normally find, but it is very hard to even fine-tune at that volume, unfortunately. I think we might be able to bootstrap something like that with a shared corpus of related languages, though.
fnordpiglet 28 days ago
You can always fine-tune with the corpus you have and then try in-context learning on top of the fine-tuning, even if it's insufficient. Then with that - and perhaps augmenting with RAG against the corpus - you might be able to build a context that's stable enough in the language that you can generate human-mediated synthetic corpora and other reinforcement to enrich the LoRA.
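The RAG piece could look roughly like this; the encoder and file name are placeholders, and the point is just to keep generations anchored to attested sentences:

  from sentence_transformers import SentenceTransformer, util

  # Hypothetical file of attested sentences from the corpus, one per line.
  corpus = open("corpus.txt", encoding="utf-8").read().splitlines()

  # Off-the-shelf multilingual encoder as a stand-in; a corpus-tuned one would be better.
  embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
  corpus_emb = embedder.encode(corpus, convert_to_tensor=True)

  def build_prompt(task: str, k: int = 5) -> str:
      # Retrieve the k closest attested sentences and prepend them to the task.
      query_emb = embedder.encode(task, convert_to_tensor=True)
      hits = util.semantic_search(query_emb, corpus_emb, top_k=k)[0]
      examples = "\n".join(corpus[h["corpus_id"]] for h in hits)
      return (f"Attested examples:\n{examples}\n\n"
              f"Task: {task}\n"
              "Prefer forms supported by the examples above.")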
troyvit 28 days ago
> I will continue working on this because I think it will eventually be of help.

They say language shapes thought. Having an LLM speak such a different language natively seems like it would uncover so much about how these models work, as a side effect of helping preserve Kiksht. What a cool place to be!

tomrod 29 days ago
Almost sounds like Cebuano / Waray-Waray in that sense.
graemep 29 days ago
I was wondering about what the limitations were.

Lots of languages, even Indo-European languages, have a very different word order from English, or give word order much less significance.

koolba 29 days ago
According to Wikipedia there were 69 fluent speakers of Kiksht in 1990, and the last one passed away in 2012. How did you learn the language?

https://en.wikipedia.org/wiki/Upper_Chinook_language

antics 29 days ago
I learned it from my grandmother and from Gladys's grandkid. Gladys was the last person whose first language was Kiksht, not the last person who speaks it.
diggan 29 days ago
FYI, the Wikipedia article currently states:

> The last fully fluent speaker of Kiksht, Gladys Thompson, died in July 2012

Which is from https://web.archive.org/web/20191010153203/http://www.opb.or...

Maybe there are some sources saying fully fluent speakers are still alive? As it stands, the article gives the impression she was the last person to "fully" speak it.

antics 29 days ago
I’m aware and I think people do not care much to correct it. The tribe has had consistently bad experiences with outsiders attempting to document this kind of thing and I sense that people are mostly wanting to be left alone at this point. There is some possibility that I would make a different decision, but it’s not for me to decide, it’s for the elders.
SavageNoble 29 days ago
This is the way ;)
thomasfromcdnjs 29 days ago
I'm an Indigenous Australian and have been slowly working on this problem in my own way for a few years.

https://github.com/australia/mobtranslate.com/

In its current iteration the homepage is just running dictionaries through OpenAI. (My tribe's dictionary fits in a 100k context window.)
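For anyone curious, the mechanics are roughly this; the file, model name, and prompt below are placeholders, not exactly what the site uses:

  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
  dictionary = open("dictionary.txt", encoding="utf-8").read()  # the whole thing fits in context

  resp = client.chat.completions.create(
      model="gpt-4o",  # placeholder model name
      messages=[
          {"role": "system",
           "content": "You are a careful translator. Use only the dictionary below.\n\n" + dictionary},
          {"role": "user", "content": "Translate into the target language: the dog sleeps by the river"},
      ],
  )
  print(resp.choices[0].message.content)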

My old ambitions can be found somewhat here -> https://github.com/australia/mobtranslate-server

That being said, the OpenAI models do a fantastic job at translating sentences so I've put my own model research further to the back. (will try to find some examples)

I can't speak to true preservation (not many native speakers are left), but in my mind that's not even all that important from a personal/cultural perspective.

If the youth who are interested in learning more about their language have a nice interface with 70-80% accurate results and they enjoy doing/learning it then that is a win to me. (and kind of how language evolves anyway) (the noun replacement seems to work great, but grammar is obviously wishy-washy)

(At this point, I just rushed to get my tribe's dictionary crawlable, so hopefully it will be in a few models' next training phases.)

prawn 29 days ago
I was looking to name a property recently and tried ChatGPT for suggestions on a theme of geographical features in various languages. I had it try Kaurna (Adelaide area) on a whim and was pleasantly surprised (given that Google Translate doesn't cover it, I guess) to find that it gave loads of relevant suggestions without any issue. Everything I picked to selectively check verified OK against Kaurna dictionaries.
AnotherGoodName 29 days ago
FWIW this is the original use case for LLMs. The whole context-awareness idea came about for the purpose of translation, and the famous ‘Attention Is All You Need’ paper was from the Google translation team.

Say what you will about LLMs being overhyped, but this is the original core use case.

sdsd 29 days ago
I've always resented the way "traditional" cultures are "preserved". It's like when people want to protect "authentic" locales from the corrupting influence of tourism. I'm grateful that my own culture is not (yet) seen primarily through this lens. Imagine the day when legislation makes exceptions for fentanyl as a traditional ritual substance of nomadic Trailer Americans.

To me the beauty of these things is in their liveliness, in the aspiration to flourish and grow, not merely to conserve a little longer, to spend one more night with terminal cancer before the inevitable.

throwaway970598 29 days ago
It seems like this could be incredibly fraught with danger once LLMs are involved (the story isn't exactly clear whether they are). If there are no surviving native speakers of a language (or very few), doing something like training an LLM to generate text in that language would run the risk of, e.g., transferring English grammar onto the vocabulary and causing a hybrid language to become the dominant form of that language, because the LLM is used widely for study and the native-speaking elders are not.
flocciput 29 days ago
Is that so bad? If the only way to ensure a language's preservation and continued use is to mangle and evolve it? I think that's kind of a neat concept, to be honest. To live is to change.
joshdavham 29 days ago
This is a fantastic use case for LLMs. Also, godspeed to these researchers! There's unfortunately not a lot of time left for many of these languages.
romaaeterna 29 days ago
"His dream is to revive dying languages..."

The content of modern culture is too much for dying or ancient languages, and what you actually get is English/modern thoughtspace expressed in the lexicon of an until-now separate culture. This flood of spam destroys what was unique and interesting about the culture, and "skin-suits" it.

rexpop 29 days ago
Maybe, unless some form of Sapir-Whorf is true, and the lexica—and grammar, and phonology, etc.—impose an inalienable umwelt, one inaccessible to hegemony.
romaaeterna 28 days ago
They certainly do. But languages are also very elastic. If you speak modern German, for example, you'll find that it's very similar to modern American English in thought and expression. But this was not true of late 19th-century German and 19th-century English. There has unfortunately been substantial thought convergence due to mass communication. If all you speak is English, you may also find that the current dialectal forms are all much weaker and samey compared to the older forms.

Anyway, in resurrected ancient languages, this modern samey-ness is a problem. Go read Latin Wikipedia, for example. It's much more like reading modern English authors (though with Latin words and grammar) than it is like reading classical authors.

--

I add a natural language translation (from ChatGPT) of your statement for anyone who doesn't share your lexica:

> "Maybe, unless some version of the Sapir-Whorf hypothesis is correct, and language—along with its grammar, sounds, and other aspects—creates an inescapable worldview that can't be dominated."

rexpop 27 days ago
ChatGPT has misinterpreted me.

By "inalienable unwelt," I mean it in the sense of "inalienable rights" that cannot be removed by external circumstance.

By "inaccessible to hegemony," I mean an umwelt which cannot be perceived by speakers of the dominant language.

userbinator 29 days ago
s/preserve/hallucinate/

The next few decades are going to be really, really weird.

lolinder 29 days ago
This take is overly cynical in this case—the original and primary use case for LLMs is to model languages in a comprehensive way. It's literally in the name.

Hallucinations rarely take the form of invalid grammar or invented non-existent words; what we're concerned about is facts, which aren't relevant at all when the goal is language preservation.

userbinator 29 days ago
Depending on how powerful the language modeling is, I suspect it could lead to an LLM which will confidently and convincingly tell you how to say things like "floppy disk" and "SSD" in every extinct language and even those that went extinct long before computers ever existed, which is... interesting, but not exactly truth.

I've seen LLMs hallucinate nonexistent things in programming languages. It's hard to believe it won't do the same to human ones.

lolinder 29 days ago
Importantly, when a model hallucinates non-existent things in a programming language, it is still stringing together valid English words to make something that looks like it ought to be a correct concept. It doesn't construct new words from scratch.

If a language model were asked what the word for "floppy disk" was in an extinct language and it invented a decent circumlocution, I don't think that would be a bad thing. People who are just engaging with the model as a way of connecting with their cultural heritage won't mind much if there is some creative application, and scholars are going to be aware of the limitations.

Again, the misapplication of language models as databases is why hallucinations are a problem. This use case isn't treating the model as a database, it's treating the model as a model, so the consequences for hallucination are much smaller to the point of irrelevance.

lyu07282 29 days ago
I also don't think it was ever much of a problem for machine translation to begin with. Modern conversational systems are already addressing the problem (with things like contrastive/counterfactual fine-tuning and RAG/in-context learning) and will just tell you if they don't know something instead of hallucinating. But I'm pretty sure the OP doesn't know the difference between a language model and a conversational model anyway; it's just the generic "anti-LLM" opinion without much substance.
diggan 29 days ago
How would terms like "floppy disk" and "SSD" even appear in the target language if those terms weren't around when the speakers of the language were alive? Or are you thinking of a multi-language LLM that tries to automatically translate terms it didn't actually see in the source/target language during training?
idunnoman1222 29 days ago
…Hallucinates a non-existent setting that could easily be added in a PR and merged tomorrow
ahoef 29 days ago
Isn't this something humans would also do if there were native speakers left to speak the language?
Lanolderen 29 days ago
It's common in non-dead languages. If your language is smaller, you're bound to have a lot of "foreign words" in it. To this day I find it funny how my native-language teachers would start to speak with tons of English and French words when they wanted to "showcase the complexity/beauty of our native language"/appear smart.

An LLM should do fine with that since it's usually the foreign word spelled in a way that makes sense in that language. I'm more curious about the inverse though. It's sometimes quite difficult to explain the meaning of a word in a language that does not have an equivalent, be it because it has a ton of different meanings or because it's some very specific action/object.

cess11 29 days ago
They're kind of bad at pretty much all languages except simpler forms of English and Python. The tonality in the big LLMs tends to be distinctly inhuman as well.

I suspect it'll be hard to find more material in some obscure, dying language than there is of either of those in the common training sets.

lolinder 29 days ago
What is "they"? Are you saying transformer architecture somehow is biased towards English? Or are you saying that existing LLMs have that bias?

The only way this project is going to make sense will be to train it fresh on text in the language to be preserved, in order to avoid accidentally corrupting your model with English. If it's trained fresh on only target language content, I'm not sure how we can possibly generalize from the whole-internet models that you're familiar with.

cess11 29 days ago
I don't really care about the minutiae of the technical implementations; I'm talking from the experience of pushing text into LLMs, locally and remote, and getting translation in some direction or other back.

To me it doesn't make sense. It seems like an awful way to store information about a fringe language, but I'm certainly not an expert.

numpad0 29 days ago

> and getting translation in some direction or other back.
This seems to upset a lot of English speakers: that LLM outputs appear translated from the perspective of primarily non-English speakers. But hey, it's n>=2 even at HN now.
cess11 29 days ago
I don't know, the translation errors are often pretty weird, like 'hot sauce' being translated to the target equivalent of 'warm sauce'.

Since LLMs work by probabilistically stringing together sequences of words (or tokens or whatever), I don't expect them to ever become fully fluent, unless natural language degenerates and loses a lot of flexibility and flourish and analogy and so on. Then we might hit some level of expressiveness that they can actually simulate fully.

The current boom is different but also very similar to the previous age of symbolic AI. Back then they expected computers to be able to automate warfare and language and whatnot, but the prime use case turned out to be credit checks.

ahoef 29 days ago
What language have you tried that they're bad at? I've tried a bunch of European languages and they are all perfect (or perfect enough for me to never know otherwise)
cess11 29 days ago
Swedish, German, Spanish from Spain, French, and Luxembourgish French.

Sometimes they do a decent translation; too often they trip up on grammar, vocabulary, or just assume that a string of bytes always means the same thing. I find they work best in extremely formal settings, like documents produced by governments and lawyers.

joseda-hg 29 days ago
Have you had the opportunity to interact with less-wrapped versions of the models? There's a lot of intentionality behind the way LLMs are presented by places like ChatGPT/DeepSeek/Claude; you're distinctly talking to something that's actively limited in the way it can speak to you.

It's not exactly nonexistent outside of them, but they make it worse than it is.

numpad0 29 days ago
Does it matter? Even most Chinese models are trained with <50% Chinese data last I checked, and they still manage to show an AliExpress accent that would be natural for a Chinese speaker with ESL training. They're multilingual but not language-agnostic; they just grow English-to-$LANG translation ability so long as English stays the dominant and defining language in them.
cess11 29 days ago
I've run a bunch locally, sometimes with my own takes on system prompts and other adjustments because I've tried to make them less insufferable to use. Not as absurdly submissive, not as eager to do things I've not asked, not as censored, stuff like that.

I find they struggle a lot with things like long sentences and advanced language constructs regardless of the natural language they try to simulate. When it doesn't matter it's useful anyway, I can get a rough idea about the contents of documents in languages I'm not fluent in or make the bulk of a data set queryable in another language, but it's like a janky hack, not something I'd put in front of people paying my invoices.

Maybe there's a trick I ought to learn, I don't know.

sofixa 29 days ago
Have you tried Mistral's models? They're explicitly trained on a bunch of languages, not only English.
cess11 29 days ago
Yes.
bitwize 29 days ago
This sounds like a grounded use of LLMs. Presumably they're feeding indigenous-language text into the NN to build a model of how the language works. From this model one may then derive starting points for grammar, morphology, vocabulary, and so on. Like, how would you say "large language model" in Navajo? If fed data on Navajo neologisms, an LLM might come up with some long word that means "the large thing by means of which, one can teach metal to speak" or similar. And the tribal community can take, leave, or modify that suggestion but it's based on patterns that are manifest in the language which AI statistical methods can elicit.

Machine learning techniques are really, really good at finding statistical patterns in data they're trained on. What they're not good at is making inferences on facts they haven't been specifically trained to accommodate.

sealeck 29 days ago
[flagged]
deadbabe 29 days ago
Finally, a good use case for LLMs that isn’t just trying to anthropomorphize some already solved automation problem.
DrillShopper 29 days ago
Yeah but anthropomorphizing the LLM makes VCs spend stupid amounts of money!
Mengkudulangsat 29 days ago
On a side note, can anyone recommend an AI tool I can use to learn a random niche language as a hobby (e.g. Toki Pona)?
diggan 29 days ago
I think what most LLMs have in common is that they're pretty good at translations out of the box, even the general-purpose ones. Probably even the free versions of ChatGPT or Claude, or the fully free DeepSeek, can help you out pretty well with that.
tho23i423434 29 days ago
I wonder how useful this really is.

No doubt, it's excellent for archiving, but that's not the same as "preserving" culture. If it's not alive and kicking it's not a culture IMO. You see this happen even with texts : once things start being written down, the actual knowledge tends to get lost (see India for example).

This "AI to help low-resource languages" thing is a big deal in India too, but it just feels like another "jumla" for academics/techbros to make money. I mean, India has brutal/vicious policies that are out to destroy any and every language that's not English (since it's automatically a threat to central rule from Delhi), but pretty much no intellectual, either in India or the US, actually cares about the mass-wiping out of Indian languages by English... Not even the ones who go "ree bad British man destroyed India" on twitter all day.

BurningFrog 29 days ago
It's really documenting the culture. A great thing in itself.

I think of cultures and languages as tools. When they don't serve a purpose anymore, they should be replaced by something more functional.

krapp 29 days ago
>When they don't serve a purpose anymore, they should be replaced by something more functional.

If and when that's a voluntary and organic process, sure. The problem is that replacement more often than not comes about through ethnic cleansing and violence. These indigenous languages were perfectly functional to the people who spoke them at the time, but they were replaced because they didn't serve the purpose of colonizers.

And "lost" languages do get reclaimed from time to time. Hebrew and Irish being two examples.

BurningFrog 29 days ago
Hebrew and Irish are illustrative examples.

Hebrew was brought back from the dead for the Jewish refugees populating Israel to have a common language. This solved a genuine practical problem.

Teaching Irish to kids who all speak English does not help anyone communicate better. It seems like a nationalist pride project, and those are not my favorite.

krapp 28 days ago
Israelis could just as well have chosen English for a common language, or Arabic, or any other living language. Reviving Hebrew was just as much about nationalist pride (or preserving culture if you like) as reviving Irish was for Ireland.
meigwilym 29 days ago
What is the purpose of culture? It's a way of life. Arguably no culture has purpose, so do we force people to live in a different way?

Cultures and languages die out because they're slowly (or revolutionarily quickly!) replaced by another. It's not like there are people out there speaking no language because their mother tongue has died out.

And who gets to decide that a language has no purpose?

I don't like the way this argument goes.

hollerith 29 days ago
>who gets to decide that a language has no purpose?

The parents of the child doing the language learning or the adult doing it.

encipriano 29 days ago
This demonstrates English utilitarianism basically won the philosophical war in Europe.
DrillShopper 29 days ago
> When they don't serve a purpose anymore, they should be replaced by something more functional.

Okay colonizer.

Boldened15 29 days ago
> once things start being written down, the actual knowledge tends to get lost (see India for example)

Curious, what do you mean by this?

> pretty much no intellectual, either in India or the US, actually cares about the mass-wiping out of Indian languages by English

Well I've never heard of this so lack of awareness would be an obvious cause if it's an issue, are there any orgs raising awareness of it? Also seems surprising to me, Bollywood movies are immensely popular and are all in Hindi. Is there a danger of English overtaking Indian society to the extent where Bollywood movies would mostly be made in English?

aussieguy1234 29 days ago
Once an LLM knows an indigenous language, even if the last speaker dies out, future generations will be able to learn the language and use the LLM to converse with them in that language.

Learning a new language is a good use case for LLMs, not just indigenous languages, but any language.

As for your comment "ree bad British man destroyed India", this sounds more like politics than anything of substance.

bloak 29 days ago
Yes, but the LLM is roughly equivalent to lossy compression of the corpus it is trained on so why wouldn't you preserve that actual corpus so that it can be used to train some better LLM, or something better than an LLM, in the future?

(There may be a good answer to that question: perhaps, for example, the corpus can't be preserved for data protection reasons but the LLM trained on it can be preserved? For various reasons that doesn't seem very plausible, however.)

crackalamoo 29 days ago
You can do both. Preserving the corpus and building the LLM probably gives the best chance for future generations.
ks2048 29 days ago
> but pretty much no intellectual, either in India or the US, actually cares about the mass-wiping out of Indian languages by English

That’s surprising and seems different than what I’ve seen for other languages in other parts of the world (even if it’s a relatively new phenomenon).

slothtrop 29 days ago
Right, you can't keep a culture in stasis. It always changes. There's something to be said for protectionism though (e.g. language laws), with varying degrees of success (Japan quashed Christianity quite well, brutally). There's a reason behind the choice in semantics particularly when it comes to traditional cultures. We don't call it "preserving culture" when historians and archivists document things.
tomp 29 days ago
Terrible title. Every engineer is indigenous somewhere.
rexpop 29 days ago
Indigeneity generally refers to the descendants of people who inhabited a territory prior to colonization or the formation of a country, often those who are disadvantaged as a result, and who continue to inhabit the land. Connection to the land is a fundamental element of indigeneity, as is the specific condition of dispossession by colonization: indigeneity, as it is understood today, emerges in relation to colonial processes.

So, although all distinct ethnicities may originate from specific places and times, indigeneity as a political and social identity is meaningful only in the context of colonial domination and resistance.

Not every engineer is engaged in an ongoing struggle for sovereignty against colonizing powers.

iamnotsure 29 days ago
me too
temptemptemp111 29 days ago
[dead]
highcountess 29 days ago
[dead]
reportgunner 29 days ago
I don't understand: how can they use AI to preserve their culture when AI was never a part of it?
lolinder 29 days ago
The same way linguists are already using audio recordings to preserve dying languages that belong to people who have never had access to audio recordings. You can make a record of a culture in a medium that they didn't/don't have themselves.
krapp 29 days ago
AI fabricates and hallucinates (or whatever term you want to use) by design - it doesn't provide a neutral record the way audio recordings might.

I think a better question might be what, specifically, does AI bring to the table that justifies its inherent inaccuracy? Why not just use audio recordings, etc?

lolinder 29 days ago
LLMs, when trained on a language, provide a neutral representation of the statistics of that data set. Used correctly, that makes them perfect for this kind of task. You're reacting to the widespread misuse of LLMs as databases, but LLMs as language models is what they are designed for. It's in the name.

Used judiciously as a model, there's no problem. They only become a problem when people try to treat them as a database.

reportgunner 29 days ago
I have never heard of that, can you elaborate ?
lolinder 29 days ago
I mean, it's what it says on the tin: linguists go into remote areas and take recordings of dying languages to preserve them. They'll sometimes prompt for specific words and sentences, other times just record people speaking naturally. Then they'll also write down the rules of the grammar and the vocabulary (another example of using a medium they may not themselves have).
reportgunner 28 days ago
Well, that I understand, but how is the AI involved? You don't use AI to travel to remote areas or to record voice.