Of course this has to backfire.
However, I'm of the opinion that this is a good thing. Copyright is a sham and needs to be abolished.
Great observation!
> Of course this has to backfire.
I fear that BigCorp will find ways to make this legal. Facebook is obviously sure that it can.
> However, I'm of the opinion that this is a good thing. Copyright is a sham and needs to be abolished.
As others have posted, copyright is a great thing! When someone puts a lot of work into something, like writing a book or painting a picture (or writing software), they should have strong rights that protect this work. The problem with current copyright law is, again, big corporations like Disney, who fought tooth and nail to prolong those rights to maddening lengths.
So I'm honestly a bit flabbergasted. How can you recognize that corporations are using LLMs to steal from millions of creative people and are about to make billions of dollars from it, and then conclude that this is a great way of dealing with it, because "fuck copyright"?
No offense, but with “always” this sounds awfully like a conspiracy theory. I thought those models were primarily meant to just generate plausible outputs, and copyright wasn’t really considered until relatively recently.
Disclaimer: IANAL.
The law of course depends on jurisdiction, but in many countries copyright law means that (a) everything is forbidden unless a license contract permits the licensee to do it, or (b) it falls under a few pre-defined exemptions (e.g. fair use for science and teaching, which is restricted to parts of works, such as chapters or individual papers, for teaching a closed group, or to citation of material as scientific standards require; ML training is not mentioned).
No mention => no license.
The software license then comes from the agreement that you can get a copy of a particular piece of software with particular usage limits, and those limits only; that agreement is not derived from copyright law.
If you want to read a book by author X, you still need to buy the book; no LLM today can reproduce a book, or even an article.
Terabytes of training data reduced to a model that fits on a memory stick. The models can certainly spit out bits and pieces, but I haven't seen any evidence they can recite back an entire book.
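The scale mismatch can be sketched with back-of-envelope arithmetic. All figures below are illustrative assumptions, not measured values:

```python
# Back-of-envelope: could the weights "contain" the training data verbatim?
# Every number here is an assumed, illustrative figure.

training_data_bytes = 5 * 10**12   # assume ~5 TB of training text
model_params = 65 * 10**9          # assume a 65B-parameter model
bytes_per_param = 2                # fp16 weights
model_bytes = model_params * bytes_per_param

ratio = training_data_bytes / model_bytes
print(f"Model size: {model_bytes / 10**9:.0f} GB")
print(f"Training data is ~{ratio:.0f}x larger than the weights")
```

Under these assumptions the weights are dozens of times smaller than the corpus, so lossless storage of the whole training set is arithmetically impossible; only compressed statistical patterns, plus fragments, can survive.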
(I argue that what the model is doing is a lot less like copying, and a lot more like learning).
I guess the AI companies' best defense is that, yes, they obtained copies without paying, but it wasn't a "performance" or other distribution of copyrighted material. The infringement was done internally.
(God this irony is not lost on me: "hey gpt, what's the best argument to make for the $X-circuit court to persuade Judge $Y that we are not infringing on claim $Z under statute $W?")
Possession rather than use, I think, but that seems likely, yes. But I don't believe the model itself is infringing.
Remains to be seen what the courts do. It doesn’t seem viable to put this genie back in the bottle, so it doesn’t seem like there are good options for the judge either.
"a number of books which appear to be commercially sold" is a bit of an understatement. According to a quick search of the list, it contains the complete works of the likes of Stephen King, John Grisham, Michael Crichton, Dan Brown and J.R.R Martin. So not just a few less-known books included by mistake...
Methinks anthropomorphizing clocks is as silly as worrying whether the boiling water in the pot feels pain. Even if it were feasible for a machine to be conscious, it would be conscious at a lower level than "software", which, to me, has come to resemble a hyper hyperbolic information layer (emerging and part/parcel of human consciousness). (Call it an egregore if you want, but to me it seems more like a data lake.) This may be a platitude, but any agency the machine possesses is due to the human agency that nudged it in such a direction. We've created a lovely mirror test for ourselves.
Also, not sure what was going on with your word salad there, but it really started to make no sense around this point: "hyper hyperbolic information layer (emerging and part/parcel of human consciousness)"... maybe this is some LessWrong insider slang?
That's my most charitable interpretation.
Yes they generally won the digital IP battles of the 90s, but this is much harder. Millions of people have downloaded Llama models and there's a great variety of derivative and distilled models. It's an open secret that practically every AI company uses copyrighted data.
Moreover there's the political angle. If the US forces the west to obey copyright law for AI it'll be very hard to compete with China.
This alone is awful and wrong, and we all know this.
Now comes Facebook and builds an LLM. The LLM can now write somewhat nice-sounding books in no time. The people who wrote those books and put all the work and time into them are angry, and rightly so.
It is very telling how many words and arguments are used here to make Facebook's behavior sound like a good thing, or a necessity.
1. Authors' works were (are?) illegally distributed in 'Books3': This seems to clearly be a copyright infringement. That being said, I'd be shocked if someone could prove this copyright infringement made an appreciable impact on an author's income. Something tangible, not a hand-wavey "but they would have bought my book". I know that I don't go out downloading books from places like The Pile if I want to read something. I'd wager most people don't.
2. Facebook (et al.) acquired illegal collections of books vs legally acquiring the books: If that is true (seems likely) then they should suffer punishment for the acquisition. That being said, it's the same cudgel that'd be used to sue individuals into oblivion, so the end outcomes might be less constructive for the rest of humanity. I do feel like corporations involved in mass copyright infringement should be held accountable though.
3. Facebook (et al.) trained foundational models on collections of books they acquired (disconnect #2 from #3 here): I'd argue strongly that the foundational model training is not a copyright violation. It's not storing copyrighted works and distributing them; it is using them to model language patterns and token frequencies that could be used to create an approximation of a copyrighted work if the training was poor and you prompt it properly. There are plenty of experts in this matter who could discuss it in depth, but the essence of the fight boils down to whether you believe the copyrighted works are being distributed via these trained models or not. Now, if people want to change copyright law such that there are specific laws around how ML models are trained on copyrighted works, then perhaps this problem gets resolved in one direction or the other. Until then, all parties are just talking past each other and hoping the courts eventually agree with their arguments.
Approximately nobody gets their reading material from The Pile. People buy books from book shops or from Amazon, borrow them from a library or a friend, or buy them from a second-hand shop.
> The LLM can now write somewhat nicely sounding books in no time.
That is again not a thing. People don’t read longform LLM output instead of published books.
Also, for what it's worth, Facebook (which should be shut down for unrelated reasons), isn't the only one doing this.
I cannot imagine claiming that I have a right to the -ideas- or concepts in my work, though.
That is what patents are for, and a patent also doesn’t give me the right to exclusive use of my ideas, just limited exclusivity of implementation.
The relatively free sharing of ideas is a fundamental pillar of society, and without the free sharing of ideas, modern civilization would be impossible.
Copyright doesn’t grant you exclusivity to the ideas in your work. It doesn’t grant immunity to having a few sentences repeated even verbatim. Fan fiction is fair use, as are most derivative works.
The only legitimate use of intellectual work is the extraction of knowledge from it, basically the equivalent of reading or teaching it. Training is teaching. It could have been done (much more slowly) by going to the library and using the books in the library to train from. It is not the redistribution of these works. It is the use of these works. If a book cannot be used to transfer knowledge or ideas, then is it legal to read it at all?
There may have been flaws in the process of “reading” this work that involved technical violations of copyright law. But the use of information that is published and not covered by an NDA is part and parcel of the fundamental benefits of living in a society as we understand it.
The use, interpretation, application, and processing of knowledge is what we are really discussing here. If we say that copyright prevents this, then we need to abolish libraries, schools, newspapers, the internet, and any mechanism by which copyrighted material may be exposed to persons who might wish to use, interpret, or apply that information. What tools they use to do this is irrelevant.
For those who would say that training creates a copy, I defy you to have any LLM reproduce a significant body of copyrighted work. You’ll have about as much luck as asking someone to do the same from memory. And if they can, what does that mean? Is the savant reader guilty of copyright infringement if they write down their recollection? Perhaps.
Or is it the actor that causes them to do so? Or is the act of distributing that perfect memory the violation of copyright? (I’d argue this is the line in the sand.)
Copyright helps Spotify more than musicians and bands. It helps Meta and Google more than authors and journalists. And it’s ignored by anyone who feels like it.
We need to tax and restrict corporatism (more so than capitalism) to pay and make way for things we care about, including the arts.
...really? There's strawmen, and then there's this. Even if there were more Blake Lemoines out there, the "consciousness" of a LLM begins and ends with its context window. The weights by themselves are not alive.
I’m pretty confident the optimum is at or near or zero copyright restrictions and finding other ways to fund the creators. Which, yes, means accepting that free markets are not the best tool for every job.
Current laws aren't about protecting authors; they're about allowing a bunch of (non-creative) people to manage portfolios of IP and make loads of money from them while ripping off the artists. That's why current copyright laws are bad.
It's the "straight to jail" meme for any action you want to do with music
Making/receiving copies without authorization of the rightsholder can be permissible - it'll come down to a fair use analysis. Purpose to me seems highly transformative (and "The more transformative the new work, the less will be the significance of other factors"), but the other factors could swing the other way.
Worth noting that when Google Books was determined to be fair use, the amount/substantiality factor didn't weigh against them despite storing full copies of millions of books, because "what matters in such cases is not so much 'the amount and substantiality of the portion used' in making a copy, but rather the amount and substantiality of what is thereby made accessible to a public". That could be seen here as the amount present in generated responses, as opposed to everything the model was trained on.
> I suspect we're about to hear some arguments from AI-maximalists that LLaMA is sentient and that deleting it would be akin to murder - and wiping out AIs trained on stolen property is literally genocide.
Not sure if I can call this a strawman since there will inevitably be someone somewhere making an argument like this, but it's not a defense being used in the lawsuits in question.
My primary issues with "wiping out AIs trained on stolen property" are:
1. It's not just LLMs that are trained like this. If you're making a model to segment material defects or detect tumors, you typically first pre-train on a dataset like ImageNet before fine-tuning on the (far smaller) task-specific dataset. Even if you believe LLMs are mostly hype, there's a whole lot else - much of which is fairly uncontroversially beneficial - that you'd be inhibiting
2. Copyright's basis is "To promote the Progress of Science and useful Arts". Wiping out existing models, and likely stifling future ones, seems hard to justify under this basis. Ensuring rightsholders profit is intended as a means to achieve such progress, not a goal in and of itself to which progress can take a back seat
3. I do not believe stricter copyright law would help individuals. Realistically, developers training models would go to Reddit/X/Github/Getty/etc. selling licensed (by ToS agreement) user content, and there's little incentive for those companies to pass on the profit beyond maybe some temporary PR moves. Much of what's possible for open-source or academic communities may no longer be, on account of licensing fees
4. It doesn't seem politically viable to demand models are wiped out. Leading in the field, and staying ahead of China, is currently seen as a big important issue - we're not going to erase our best models because the NYT asked us to. Could hope for mandatory licensing - I think it'd still likely be a negative for open-source development, but it's more plausible than deleting models trained on copyrighted material
> amount and substantiality of what is thereby made accessible to a public
This starts to get at it, but what really gets to me is the argument that it is the use of the data in training which copyright maximalists assert is wrong. Copyright doesn't protect style. If your model cannot be coerced into producing a duplicate of its original data, then there is no copyright question in the first place.
Of course if it can, and if the amount and substantiality of the portions that it can reproduce are significant, then that protection goes away. But that's quite hard to do: there are cases where precise text reproduction can be done and has been demonstrated, but it's a real challenge to do it reliably and at length, except for some pathological special cases.
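One way to quantify that kind of reproduction is to measure the longest verbatim run shared between a generated passage and the source text. The function name and toy strings below are purely illustrative, not from any real model output:

```python
from difflib import SequenceMatcher

def longest_verbatim_overlap(source: str, generated: str) -> str:
    """Return the longest contiguous substring shared by both texts."""
    m = SequenceMatcher(None, source, generated).find_longest_match(
        0, len(source), 0, len(generated))
    return source[m.a:m.a + m.size]

# Toy example with made-up text, not real copyrighted material.
source = "the quick brown fox jumps over the lazy dog near the river"
generated = "a model might emit the quick brown fox jumps in passing"

overlap = longest_verbatim_overlap(source, generated)
print(repr(overlap))  # the longest shared run of characters
```

A copyright analysis would then turn on whether such shared runs are long and substantial enough to matter, which is exactly the "amount and substantiality" question the parent comments raise.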
To me the best analogy with what's going on today is the format shifting debate in the 2000s, except that to make that comparison you have to assert that an imperfect, unreliable destination format not intended for reproduction is equally harmful, and in the same way, as an mp3 rip of a CD. That's a hard argument to make.
It's also very clear that the primary purpose of an LLM is exactly not duplication. You want them to be able to generalise over their input, not regurgitate it. They have several significant non-infringing uses. If they didn't, they wouldn't be worth the effort. Betamax would apply, in spades.
This is what the US constitution says, but there is more legal theory than the US constitution. In fact, the idea of copyright predates the US constitution.
For example, the "labor theory of copyright" [1] argues that copyright is an exclusive right because every individual has a right to the product of their labor.
See also "Philosophy of copyright" [2] on Wikipedia.