Framing that as "you cannot have our user's data" feels misleading to me, especially when they presumably still support anonymous "git clone" operations.
And the same system will also get you banned by your ISP if you port scan the Department of Defense...
Why are we not doing the same thing against DoS attackers? Why are ISPs willing to cut people off over spam mail, but not over DoS?
The first D in DDoS stands for "distributed", meaning the traffic comes from many different origins, usually hacked devices. If we started cutting off every compromised network, we'd only have a few (secure) networks left. Network equipment vendors would probably have to redo their security quickly so it actually protects people.
So yeah, good question.
I wonder if anyone has tried going in the opposite direction? Something like adding to their license: “By training a machine learning algorithm on this source code, or including data crawled from this site, you agree that your model is free to use by all, will be openly distributed, and any output generated by the model is licensed under open source terms.” (But, ya know, in bulletproof legalese.) I guess most of these thieves won’t respect the bit about distributing. But at least if the model leaks or whatever, the open source community can feel free to use it without any moral conflict or legal stress.
I hope one of the big proprietary models leaks one day so we get to see OpenAI or Google tie themselves in knots to argue that training on libgen is fine, but distilling a leaked copy of GPT or Gemini warrants death by firing squad.
Also, it remains to be seen whether copyright law will outright allow model training without a license, or if there will be case law to make it fair use in the USA, or if models will be considered derivative works that require a license to prepare, or what other outcome will happen.
Source? If you're referring to Thaler v. Perlmutter, I would argue that is an incorrect assessment.
> The court held that the Copyright Act requires all eligible works to be authored by a human being. Since Dr. Thaler listed the Creativity Machine, a non-human entity, as the sole author, the application was correctly denied. The court did not address the argument that the Constitution requires human authorship, nor did it consider Dr. Thaler's claim that he is the author by virtue of creating and using the Creativity Machine, as this argument was waived before the agency.
If someone had instead argued that they themselves were the author, by means of instructing an LLM to generate the output from a prompt they made up, plus the influence of prior art (the pretraining data), that would be a much more interesting decision IMO. You could argue that's not only what he probably did anyway (he just argued it poorly), but in a way it's also how humans make art themselves.
So I would argue in most cases people will get away with it. We must remember that the only person's opinion that matters on what's actually illegal or not is a judge's.
I would assume that clause would be unenforceable. They may be able to try to sue for violating the terms of the license, but I'm fairly confident they're not going to get a judge to order them to give their model away for free even if they won. And they would likely still need to show damages in order to win a contract case.
The hope is to flip the script. Sure, the company might not fulfill their full obligation under the terms and conditions they agreed to (in the same way that we all agree to those “continuing to use this site means you agree to our terms and conditions” notices, they are agreeing to the terms and conditions by continuing to scrape the site). But at least if a model leaks, or someone pirates software that was generated by one of their LLMs, they can say: well, it was open source.
I disagree... if those terms are unenforceable, then 1. someone else's model does not become free just because they say it does, and 2. I'm not convinced that is even legal to begin with.
Typically contracts (a license is a contract) have to be fair and mutually beneficial... I don't think a judge would agree that giving a whole model away for free just because you trained on some of their data is fair, assuming there's even anything legally wrong with using said data for training in the first place.
You would also need to show a "reasonable purpose" for such a stipulation in the contract. Giving their product away for free as a punishment doesn't sound very reasonable to me, and I don't think a judge would say so either.
If they don’t like the contract, they can try and get it invalidated, but at least it will be a distributed problem for them.
> If they don’t think it is mutually beneficial, they don’t have to use the site.
That's not how contracts work insofar as legal enforcement though. If a judge finds that clause to be unenforceable, then they wouldn't be giving anything up.
> Anubis uses a multi-threaded proof of work check to ensure that users browsers are up to date and support modern standards.
This is so not cool. Further gatekeeping websites from older browsers. That is absolutely not their call to make. My choice of browser version is entirely my decision. Web standards are already a change treadmill, this type of artificial "You must be at least Internet Explorer 11" or "this website works best in Chrome" nonsense makes it much worse.
My browser is supported by your website if it implements all the things your website needs. That is the correct test. Not: "Your User-Agent is looking at me funny!" or "The bot prevention system we chose has an arbitrary preference for particular browser versions".
Just run the thing single threaded if you have to.
Well, I guess your browser does not support everything needed? Being able to run a multi-threaded proof of work is not the same as checking arbitrary user agents; any browser can implement it.
If necessary, it would also be possible to do what powxy does, which is display an explanation of how the proof of work works, in case you want to implement your own.
Offering alternative access to some files via other protocols might also help.
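For anyone who does want to implement their own, here is a minimal single-threaded sketch of the generic hashcash-style scheme such challenges are typically based on. This is an illustration, not Anubis's actual protocol: the function names, challenge format, and difficulty encoding are all assumptions.

```python
import hashlib

def solve_pow(challenge: str, difficulty_bits: int) -> int:
    """Find a nonce so that sha256(challenge + nonce) starts with
    `difficulty_bits` zero bits. Expected work grows as 2**difficulty_bits."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        # Interpret the digest as a big-endian 256-bit integer and
        # check whether its top `difficulty_bits` bits are all zero.
        if int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0:
            return nonce
        nonce += 1

def verify_pow(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server-side check: a single hash, cheap regardless of difficulty."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0
```

The asymmetry is the whole point: the client burns CPU searching for a nonce, while the server verifies it with one hash. Multi-threading only splits the nonce search space; a single-threaded solver like this produces exactly the same proofs, just more slowly.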
We have an Apache licensed project. You absolutely can use it for anything you want, including AI analysis. I don't appreciate third parties deciding on their own what can be done with code that isn't theirs.
In fact I'd say AI is overall a benefit to our project, because we have a large, quite complex platform, and the fact that ChatGPT actually manages to sometimes correctly write scripts for it is quite wonderful. I think it helps new people get started.
In fact, in light of the recent GitHub discussion, I'd say I personally see this as a reason to avoid SourceHut. Sorry, but I want all the visibility I can get.
Big-Tech deciding that all our work belongs to them: Good
Small Code hosting platform does not want to be farmed like a Field of Corn: Bad
I understand their standpoint: it's their infrastructure, and their bills.
However my concerns are with my project, not with their infrastructure bills. I seek to maximize the prominence, usability and general success of my project, and as such I want it to have a presence everywhere it can have one. I want ChatGPT/copilot/etc to know it exists and to write code for it, just in case that brings in more users.
Blocking abusive behavior? Sure. But I very specifically disagree with the blanket prohibition of "Anything used to feed a machine learning model". I do not see it being in my interest.
SourceHut is doing exactly how I expect them to act, and that's a good thing.
What did you expect from SourceHut, and why didn't you take this mindset off to GitHub in the first place?
I expect them (especially if they charge for it) to work in my interests as much as possible. Sure, defending themselves against abuse is fine. They have to survive to keep providing service.
However I don't appreciate the imposition of their own philosophy to something that isn't theirs. This here:
# Disallowed:
# [...]
# - Anything used to feed a machine learning model
is not okay with me. They have to do something, because I pay for a service, and if I can't use it, I'm not paying in the future. If that means blocking the AI companies, that's fine; they can contact me if they want to use my code, and we'll figure something out.
For example I expect a host not to have an arbitrary beef with Bing or Kagi, or to refuse to allow connections from France. Blocking can of course be rarely necessary, but what I want from a host is a blocking policy as minimal and selective as possible.
Yes, I understand it's a lot of work and is quite inconvenient, but especially if I'm paying for a service, I'm interested in my interests, not in what's convenient for the host.
I don't believe you are paying for Sourcehut hosting, so why do you care?
For that matter, "This has been part of our terms of service since they were originally written in 2018" so even if you are paying for hosting, why did you start using their services in the first place?
I don't expect hosts to be neutral to the maximum possible extent. I exercise my freedom of association to select hosts which are more aligned to my beliefs.
My beliefs, for example, say that I shouldn't use services from companies which build tools to support an apartheid state (eg, https://www.theverge.com/news/643670/microsoft-employee-prot... ), nor from companies which host those projects.
Even if being neutral were more profitable for them and cheaper for me.
I theoretically could, and it's posted here I imagine to discuss the linked post. So I am.
> For that matter, "This has been part of our terms of service since they were originally written in 2018"
The "No AI" bit seems to show up only in late 2024. Which I'd regard as an extremely unwelcome development had I been paying.
> I don't expect hosts to be neutral to the maximum possible extent. I exercise my freedom of association to select hosts which are more aligned to my beliefs.
Likewise. In my case, my belief is that when you pay somebody, it's to get things done your way. So, for instance, I'd be a lot more pleased with a setting.
The 2018 restriction on using "this data for recruiting, solicitation, or profit" would have been an offense to your belief that restrictions should be "as minimal and selective as possible."
Why? Who gets to say what's acceptable?
You can go to ChatGPT and ask it: "please write a Python script that prints "Hello world" in red". And that works.
And you can also go to ChatGPT and ask it: "Please write an Overte script that makes an object red when it's clicked".
And I really like that this works. I certainly don't want it to stop working because Sourcehut has something against LLMs.
How do you feel about someone with more funding than you going to an LLM and saying, "Reimplement the entire Overte source for me, but change it superficially so that Overte has a hard time suing me for stealing their IP?"
All the same though, I don't like my host being so opinionated. I don't want a host that has something against any of the common search engines, and I don't want a host that has something against LLMs. Hosts should be as neutral as possible.
That's not what the Apache license says.
According to the Apache license, derivative work needs to provide attribution and a copyright notice, which no major LLMs do.
Unlike you with your GitHub-based project, I avoid Microsoft like the plague. I do not want to be complicit in supporting their monopoly power, their human rights abuses, and their environmental destruction.
I'm curious what your project is. Blockchain?
But I see no reason to have any issues with LLMs. ChatGPT/copilot/etc helping new people getting started? That sounds absolutely great to me.
Crypto-focused projects of this kind include Decentraland. Which tend to devolve into things like selling virtual land. That's not our jam and doesn't align with the way the project works anyway -- you can have all the land you want, and set things up for free on your own server.
I don't have any more issues with ChatGPT than I have with Google and Kagi. All of those are closed source projects. But by all means, I love open source, so if an open source LLM can do something like writing code for our platform, that'd be wonderful.
I can understand that, but the various AI companies pounding sourcehut into the ground also results in zero visibility.
As an aside, I saw this for the first time on a kernel.org website a few days ago and actually thought it might have been hacked since I briefly saw something about kH/s (which screams cryptominer to me).
A screenshot for anyone who hasn't seen it: https://i.imgur.com/dHOmHtn.png
(This screen appears only very briefly, so while it is clear what it is from a static screenshot, it's very hard to tell in real time.)
Could anyone teach me what makes this a fair characterization of Cloudflare?
For example, https://developers.cloudflare.com/workers-ai/guides/demos-ar... has examples that visit websites; then, for the people on the other side (who want to protect themselves against those visits), there is https://developers.cloudflare.com/waf/detections/firewall-fo...
Just a guess though; I don't know the author's intentions/meanings for sure.
That's not far off from what Cloudflare does either, the majority of DDoS-for-hire outfits are hidden behind and protected by Cloudflare.
Whenever this comes up I do a quick survey of the top DDG results for "stresser" to see if anything's changed, and sure enough:
mrstresser.com. 21600 IN NS sterling.ns.cloudflare.com.
silentstress.cc. 21472 IN NS ernest.ns.cloudflare.com.
maxstresser.com. 21600 IN NS edna.ns.cloudflare.com.
darkvr.su. 21600 IN NS paige.ns.cloudflare.com.
stresser.sh. 21600 IN NS luke.ns.cloudflare.com.
stresserhub.org. 21600 IN NS fay.ns.cloudflare.com.
I am reminded of this posting from years past:
https://news.ycombinator.com/item?id=38496499
"A lot of hacking groups, terror organizations and other malicious actors have been using cloud flare for a while without them doing shit about it. ... It's their business model. More DDoS means more cloudflare customers, yaaay."
I've not spent much time on this topic, but I am very interested in the notion that a well-meaning third party established some kind of looking glass that surveyed Cloudflare behavior, and that third party was sued?
I'd like to learn more about that situation ...
[edit] Here's the source: https://sourcehut.org/blog/2024-01-19-outage-post-mortem/#:~...
Maybe they should update their bullet points...
The footnote saying "fuck you now, maybe come back later" is really encouraging.
OMG! Yet another firewall telling me what browser and OS to use.
https://developers.google.com/search/docs/crawling-indexing/...
I’m also a little confused by what you’re saying here; are you asking whether scraper bots are illegal, or whether they’re immoral/unethical?
I’m getting to the ethical aspect but also trying to be pragmatic. “Publishing a bunch of information on the internet accessible without authentication” is an action that is fundamentally incompatible with controlling the use of that information.
The law cannot substitute for common sense; criminals are gonna crime.
Crawling SourceHut once is collecting public information. Crawling it once a day using deltas might still be that. What these AI companies are doing is not that.
The point of a web host is to serve the users’ data to the public.
Anything else means the web host is broken.
I'm not sure how it's supposed to work, as I see the public internet as just that, a public square. What goes there can be picked up by anyone, for any purpose. If I want something to be secret, I don't put it on the public internet.
Gonna be interesting to see how that "public but in my control" movement continues to evolve, because it feels like they're climbing an impossible wall.
IANAL, but my read of this is that if the content has the appropriate licence, the licence holder can withhold certain rights & access from certain groups of potential licensees. I'm loosely aware that the common open source licences are highly permissive, so probably they can't be used in this way... but presumably not all licences are like that. So, even though the work is "public", it should still be possible to enforce subsets of rights.
And to take your "public square" analogy... before we had cameras everywhere, there was some expectation of "casual" privacy even in public spaces. Not everyone in the square hears everything said by everyone else in the square. The fact that digital tools make privacy breaches much easier doesn't mean it should be tolerated.
(that said, I'm fairly careful what I publish online)
Laws.
This one is called copyright.
> Gonna be interesting to see how that "public but in my control" movement continues to evolve
The Berne Convention is nearly 140 years old now...
It's the same complaint about the GDPR: if it works, why are sites still doing X/Y/Z... Well, because all people do is complain online; you need to report violations and be prepared to take legal action.
"Dear Russia, pwetty pweeese don't hammer my server to death stealing everything in sight"
[Crickets]
Russian IP ban
Well yes. Process Service tends to be quite different from country to country, and if you don't know how it works there, you probably won't be able to make vague "legal complaints" and have them taken seriously.
If you really want to make a legal complaint in Russia, I would suggest you look into what is called a Process Service Specialist who has specific experience with Arbitrazh.
There's absolutely no reason why you couldn't drag OpenAI to court... you'd need a ton of money, but you could, and if you won, the rest of the AI companies would get very busy adjusting their behaviour.
Though that also cuts both ways for those posting content. Systems that generate an infinite number of URL permutations for viewing the same information are a poor design. Even conscientious people can easily discover that a simple attempt to slowly mirror a website overnight with wget has turned into a rabbit hole of forum view-parameter explosion.
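One mitigation on the mirroring side is to canonicalize URLs before fetching, collapsing the view-state parameters that don't change the content. A minimal sketch; the parameter names in `IGNORED_PARAMS` are hypothetical, and a real list would depend on the specific site being mirrored:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical view-state parameters that do not change the page content.
IGNORED_PARAMS = {"sort", "view", "sid", "utm_source", "utm_medium"}

def canonicalize(url: str) -> str:
    """Collapse URL permutations that serve the same content, so a
    polite crawler fetches each page once instead of once per view state."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
            if k not in IGNORED_PARAMS]
    kept.sort()  # stable ordering, so ?a=1&b=2 and ?b=2&a=1 collapse too
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))
```

Deduplicating on the canonical form before enqueueing a fetch keeps a mirror from spidering thousands of sort/view variants of the same thread.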
> I feel like we're part of a dying generation or something
Well... yes. Just like every living thing that's ever existed ;)
I think we have a word for that - copyright
The problem remains the same: as soon as the content is visible on the public internet, you lose 100% of the control over its propagation, illegal or not.
The party that allowed you to download it committed copyright infringement by distributing it. The same restrictions apply to them as to you*. A copyright holder may well want that party to stop distributing it, which means no one else can download it.
It's just moving the ‘control’ up a level to a different party. It's still there.
> as soon as the content is visible on the public internet, you lose 100% control of it's propagation, illegal or not.
GPL license: thou shall always* provide access to this software’s source code —> form of control in the opposite direction to music copyright, but still a form of control
* subject to terms, conditions, locale and other legal things (IANAL)
That's why every social media site in existence puts terms in its EULA demanding that users grant the site a blanket license to redistribute their content, over and above any separate licenses they may put on it. After it's been redistributed to third parties, the copyright holder has no more control (at least, not via copyright law) over how those copies are privately used.
E.g., on HN: "With respect to the content or other materials you upload through the Site or share with other users or recipients (collectively, 'User Content'), you represent and warrant that you own all right, title and interest in and to such User Content, including, without limitation, all copyrights and rights of publicity contained therein. By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose in any form, medium or technology now known or later developed."
If you post something on a physical bulletin board, you expect people will come by and read it
If a bunch of "scrapers" come by and form a massive mob so that nobody else can read it, and then cover the board with slightly different copies attributed to themselves, that isn't exactly the "public square" you imagined
Or birds, or public cameras, or people who take photos next to it, or people archiving bulletin boards, or...
If I put stuff in public spaces, I expect anyone and anything to be able to read, access and store it. Basically how I treat this very comment, and everything else I put on the public internet.
When it's my server and I'm paying for it, then banning resource wasters is the right move.
“public, but, no, not like that” isn’t a thing and no technological measure can make it a thing.
Drew makes it perfectly clear in TFA that "the public", as he sees it, is fully entitled to, and should, make use of the data SourceHut provides.