Of course this has to backfire.
However, I'm of the opinion that this is a good thing. Copyright is a sham and needs to be abolished.
Great observation!
> Of course this has to backfire.
I fear that BigCorp will find ways to make this legal. Facebook is obviously sure that it can.
> However, I'm of the opinion that this is a good thing. Copyright is a sham and needs to be abolished.
As others have posted, copyright is a great thing! When someone puts a lot of work into something, like writing a book or painting a picture (or writing software), they should have strong rights that protect this work. The problem with current copyright law is, again, big corporations like Disney, who fought tooth and nail to prolong those rights to maddening lengths.
So I'm honestly a bit flabbergasted. How can you recognize that corporations are using LLMs to steal from millions of creative people and are about to make billions of dollars from it, and then conclude that this is a great way of dealing with it, because "fuck copyright"?
No offense, but with “always” this sounds awfully like a conspiracy theory. I thought those models were primarily meant to just generate plausible outputs, and copyright wasn’t really considered until relatively recently.
Disclaimer: IANAL.
The law of course depends on jurisdiction, but in many countries copyright law means that (a) everything is forbidden unless a license contract permits the licensee to do it, or (b) it falls under a few pre-defined exemptions (e.g. fair use for science and teaching, which is restricted to parts of works, such as chapters or individual papers, for teaching a closed group, or to citation of material as scientific standards require; ML training is not mentioned).
No mention => no license.
The software license then comes from the agreement that you can get a copy of a particular piece of software with particular usage limits, and those limits only; that agreement is not derived from copyright law.
If you want to read a book by author X, you still need to buy the book; no LLM today can reproduce a book, or even an article.
Terabytes of training data reduced to a model that fits on a memory stick. The models can certainly spit out bits and pieces, but I haven't seen any evidence they can recite back an entire book.
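The scale mismatch can be sketched with back-of-envelope arithmetic. All figures below are illustrative assumptions, not measured values:

```python
# Back-of-envelope: could the weights "contain" the training data verbatim?
# Every number here is an assumed, illustrative figure.

training_data_bytes = 5 * 10**12   # assume ~5 TB of training text
model_params = 65 * 10**9          # assume a 65B-parameter model
bytes_per_param = 2                # fp16 weights
model_bytes = model_params * bytes_per_param

ratio = training_data_bytes / model_bytes
print(f"Model size: {model_bytes / 10**9:.0f} GB")
print(f"Training data is ~{ratio:.0f}x larger than the weights")
```

Under these assumptions the weights are dozens of times smaller than the corpus, so lossless storage of the whole training set is arithmetically impossible; only compressed statistical patterns, plus fragments, can survive.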
(I argue that what the model is doing is a lot less like copying, and a lot more like learning).
I guess the AI companies' best defense is that, yes, they obtained copies without paying, but it wasn't a "performance" or other distribution of copyrighted material. The infringement was done internally.
(God this irony is not lost on me: "hey gpt, what's the best argument to make for the $X-circuit court to persuade Judge $Y that we are not infringing on claim $Z under statute $W?")
Possession rather than use, I think, but that seems likely, yes. But I don't believe the model itself is infringing.
Remains to be seen what the courts do. It doesn’t seem viable to put this genie back in the bottle, so it doesn’t seem like there are good options for the judge either.
"a number of books which appear to be commercially sold" is a bit of an understatement. According to a quick search of the list, it contains the complete works of the likes of Stephen King, John Grisham, Michael Crichton, Dan Brown and J.R.R Martin. So not just a few less-known books included by mistake...
Methinks anthropomorphizing clocks is as silly as worrying whether the boiling water in the pot feels pain. Even if it were feasible for a machine to be conscious, it would be conscious at a lower level than "software", which, to me, has come to resemble a hyper hyperbolic information layer (emerging and part/parcel of human consciousness). (Call it an egregore if you want, but to me it seems more like a data lake.) This may be a platitude, but any agency the machine possesses is due to the human agency that nudged it in such a direction. We've created a lovely mirror test for ourselves.
Also, not sure what was going on with your word salad there, but it really started to make no sense around this point: "hyper hyperbolic information layer (emerging and part/parcel of human consciousness)"... maybe this is some LessWrong insider slang?
That's my most charitable interpretation.
Yes they generally won the digital IP battles of the 90s, but this is much harder. Millions of people have downloaded Llama models and there's a great variety of derivative and distilled models. It's an open secret that practically every AI company uses copyrighted data.
Moreover there's the political angle. If the US forces the west to obey copyright law for AI it'll be very hard to compete with China.
This alone is awful and wrong, and we all know this.
Now comes Facebook and builds an LLM. The LLM can now write somewhat nice-sounding books in no time. The people who wrote those books and put all the work and time into them are angry, and rightly so.
It is very telling how many words and arguments are used here to make Facebook's behavior sound like a good thing, or a necessity.
1. Authors' works were (are?) illegally distributed in 'Books3': This seems to clearly be a copyright infringement. That being said, I'd be shocked if someone could prove this copyright infringement made an appreciable impact on an author's income. Something tangible, not a hand-wavey "but they would have bought my book". I know that I don't go out downloading books from places like The Pile if I want to read something. I'd wager most people don't.
2. Facebook (et al.) acquired illegal collections of books vs legally acquiring the books: If that is true (seems likely) then they should suffer punishment for the acquisition. That being said, it's the same cudgel that'd be used to sue individuals into oblivion, so the end outcomes might be less constructive for the rest of humanity. I do feel like corporations involved in mass copyright infringement should be held accountable though.
3. Facebook (et al.) trained foundational models on collections of books they acquired (disconnect #2 from #3 here): I'd argue strongly that the foundational model training is not a copyright violation. It's not storing copyrighted works and distributing them; it is using them to model language patterns and token frequencies that could be used to create an approximation of a copyrighted work if the training was poor and you prompt it properly. There are plenty of experts in this matter who could discuss it in depth, but the essence of the fight boils down to whether you believe the copyrighted works are being distributed via these trained models or not. Now, if people want to change copyright law such that there are specific laws around how ML models are trained on copyrighted works, then perhaps this problem gets resolved in one direction or the other. Until then, all parties are just talking past each other and hoping the courts eventually agree with their arguments.
Approximately nobody gets their reading material from The Pile. People buy books from book shops or from Amazon, borrow them from a library or a friend, or buy them from a second-hand shop.
> The LLM can now write somewhat nicely sounding books in no time.
That is again not a thing. People don’t read longform LLM output instead of published books.
Also, for what it's worth, Facebook (which should be shut down for unrelated reasons), isn't the only one doing this.
I cannot imagine claiming that I have a right to the -ideas- or concepts in my work, though.
That is what patents are for, and a patent also doesn’t give me the right to exclusive use of my ideas, just limited exclusivity of implementation.
The relatively free sharing of ideas is a fundamental pillar of society, and without the free sharing of ideas, modern civilization would be impossible.
Copyright doesn’t grant you exclusivity to the ideas in your work. It doesn’t grant immunity to having a few sentences repeated even verbatim. Fan fiction is fair use, as are most derivative works.
The only legitimate use of intellectual work is the extraction of knowledge from it, basically the equivalent of reading or teaching it. Training is teaching. It could have been done (much more slowly) by going to the library and using the books in the library to train from. It is not the redistribution of these works. It is the use of these works. If a book cannot be used to transfer knowledge or ideas, then is it legal to read it at all?
There may have been flaws in the process of “reading” this work that involved technical violations of copyright law. But the use of information that is published and not covered by an NDA is part and parcel of the fundamental benefits of living in a society as we understand it.
The use, interpretation, application, and processing of knowledge is what we are really discussing here. If we say that copyright prevents this, then we need to abolish libraries, schools, newspapers, the internet, and any mechanism by which copyrighted material may be exposed to persons who might wish to use, interpret, or apply that information. What tools they use to do this is irrelevant.
For those who would say that training creates a copy, I defy you to have any LLM reproduce a significant body of copyrighted work. You’ll have about as much luck as asking someone to do the same from memory. And if they can, what does that mean? Is the savant reader guilty of copyright infringement if they write down their recollection? Perhaps.
Or is it the actor that causes them to do so? Or is the act of distributing that perfect memory the violation of copyright? (I’d argue this is the line in the sand.)
Copyright helps Spotify more than musicians and bands. It helps Meta and Google more than authors and journalists. And it’s ignored by anyone who feels like it.
We need to tax and restrict corporatism (more so than capitalism) to pay and make way for things we care about, including the arts.
...really? There's strawmen, and then there's this. Even if there were more Blake Lemoines out there, the "consciousness" of a LLM begins and ends with its context window. The weights by themselves are not alive.
I’m pretty confident the optimum is at or near or zero copyright restrictions and finding other ways to fund the creators. Which, yes, means accepting that free markets are not the best tool for every job.
Current laws aren't about protecting authors; they're about allowing a bunch of (non-creative) people to manage portfolios of IP and make loads of money from them while ripping off the artists. That's why current copyright laws are bad.
It's the "straight to jail" meme for any action you want to do with music
Making/receiving copies without authorization of the rightsholder can be permissible - it'll come down to a fair use analysis. Purpose to me seems highly transformative (and "The more transformative the new work, the less will be the significance of other factors"), but the other factors could swing the other way.
Worth noting that when Google Books was determined to be fair use, the amount/substantiality factor didn't weigh against them despite storing full copies of millions of books, because "what matters in such cases is not so much 'the amount and substantiality of the portion used' in making a copy, but rather the amount and substantiality of what is thereby made accessible to a public". That could be seen here as the amount present in generated responses, as opposed to everything the model was trained on.
> I suspect we're about to hear some arguments from AI-maximalists that LLaMA is sentient and that deleting it would be akin to murder - and wiping out AIs trained on stolen property is literally genocide.
Not sure if I can call this a strawman since there will inevitably be someone somewhere making an argument like this, but it's not a defense being used in the lawsuits in question.
My primary issues with "wiping out AIs trained on stolen property" are:
1. It's not just LLMs that are trained like this. If you're making a model to segment material defects or detect tumors, you typically first pre-train on a dataset like ImageNet before fine-tuning on the (far smaller) task-specific dataset. Even if you believe LLMs are mostly hype, there's a whole lot else - much of which is fairly uncontroversially beneficial - that you'd be inhibiting
2. Copyright's basis is "To promote the Progress of Science and useful Arts". Wiping out existing models, and likely stifling future ones, seems hard to justify under this basis. Ensuring rightsholders profit is intended as a means to achieve such progress, not a goal in and of itself to which progress can take a back seat
3. I do not believe stricter copyright law would help individuals. Realistically, developers training models would go to Reddit/X/Github/Getty/etc. selling licensed (by ToS agreement) user content, and there's little incentive for those companies to pass on the profit beyond maybe some temporary PR moves. Much of what's possible for open-source or academic communities may no longer be, on account of licensing fees
4. It doesn't seem politically viable to demand models are wiped out. Leading in the field, and staying ahead of China, is currently seen as a big important issue - we're not going to erase our best models because the NYT asked us to. Could hope for mandatory licensing - I think it'd still likely be a negative for open-source development, but it's more plausible than deleting models trained on copyrighted material
> amount and substantiality of what is thereby made accessible to a public
This starts to get at it, but what really gets to me is the argument that it is the use of the data in training which copyright maximalists assert is wrong. Copyright doesn't protect style. If your model cannot be coerced into producing a duplicate of its original data, then there is no copyright question in the first place.
Of course if it can, and if the amount and substantiality of the portions that it can reproduce are significant, then that protection goes away. But that's quite hard to do: there are cases where precise text reproduction can be done and has been demonstrated, but it's a real challenge to do it reliably and at length, except for some pathological special cases.
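One way to quantify that kind of reproduction is to measure the longest verbatim run shared between a generated passage and the source text. The function name and toy strings below are purely illustrative, not from any real model output:

```python
from difflib import SequenceMatcher

def longest_verbatim_overlap(source: str, generated: str) -> str:
    """Return the longest contiguous substring shared by both texts."""
    m = SequenceMatcher(None, source, generated).find_longest_match(
        0, len(source), 0, len(generated))
    return source[m.a:m.a + m.size]

# Toy example with made-up text, not real copyrighted material.
source = "the quick brown fox jumps over the lazy dog near the river"
generated = "a model might emit the quick brown fox jumps in passing"

overlap = longest_verbatim_overlap(source, generated)
print(repr(overlap))  # the longest shared run of characters
```

A copyright analysis would then turn on whether such shared runs are long and substantial enough to matter, which is exactly the "amount and substantiality" question the parent comments raise.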
To me the best analogy with what's going on today is the format shifting debate in the 2000s, except that to make that comparison you have to assert that an imperfect, unreliable destination format not intended for reproduction is equally harmful, and in the same way, as an mp3 rip of a CD. That's a hard argument to make.
It's also very clear that the primary purpose of an LLM is exactly not duplication. You want them to be able to generalise over their input, not regurgitate it. They have several significant non-infringing uses. If they didn't, they wouldn't be worth the effort. Betamax would apply, in spades.
This is what the US constitution says, but there is more legal theory than the US constitution. In fact, the idea of copyright predates the US constitution.
For example, the "labor theory of copyright" [1] argues that copyright is an exclusive right because every individual has a right to the product of their labor.
See also "Philosophy of copyright" [2] on Wikipedia.