Framing that as "you cannot have our user's data" feels misleading to me, especially when they presumably still support anonymous "git clone" operations.
And the same system will also get you banned by your ISP if you port scan the Department of Defense...
Why are we not doing the same thing against DoS attackers? Why are ISPs willing to cut people off over spam mail, but not over DoS?
The first D in DDoS stands for "distributed", meaning the traffic comes from many different origins, usually hacked devices. If we started cutting off every compromised network, we'd only have a few (secure) networks left. Network equipment vendors would probably have to redo their security quickly so it actually protects people.
So yeah, good question.
I wonder if anyone has tried going in the opposite direction? Something like adding to their license: “By training a machine learning algorithm on this source code, or including data crawled from this site, you agree that your model is free to use by all, will be openly distributed, and any output generated by the model is licensed under open source terms.” (But, ya know, in bulletproof legalese.) I guess most of these thieves won’t respect the bit about distributing. But at least if the model leaks or whatever, the open source community can feel free to use it without any moral conflict or legal stress.
I hope one of the big proprietary models leaks one day so we get to see OpenAI or Google tie themselves in knots to argue that training on libgen is fine, but distilling a leaked copy of GPT or Gemini warrants death by firing squad.
Also, it remains to be seen whether copyright law will outright allow model training without a license, or if there will be case law to make it fair use in the USA, or if models will be considered derivative works that require a license to prepare, or what other outcome will happen.
Source? If you're referring to Thaler v. Perlmutter, I would argue that is an incorrect assessment.
> The court held that the Copyright Act requires all eligible works to be authored by a human being. Since Dr. Thaler listed the Creativity Machine, a non-human entity, as the sole author, the application was correctly denied. The court did not address the argument that the Constitution requires human authorship, nor did it consider Dr. Thaler's claim that he is the author by virtue of creating and using the Creativity Machine, as this argument was waived before the agency.
If someone had instead argued that they themselves were the author, by means of instructing an LLM to generate the output from a prompt they made up, plus the influence of prior art (the pretraining data), that would be a much more interesting decision IMO. You could argue that's not only what he probably did anyway (he just argued it poorly), but in a way it's also how humans make art themselves.
So I would argue in most cases people will get away with it. We must remember that the only person's opinion that matters on what's actually illegal or not is a judge's.
I would assume that clause would be unenforceable. They may be able to try to sue for violating the terms of the license, but I'm fairly confident they're not going to get a judge to order them to give their model away for free even if they won. And they would likely still need to show damages in order to win a contract case.
The hope is to flip the script. Sure, the company might not fulfill their full obligation under the terms and conditions they agreed to (in the same way that we all agree to those “continuing to use this site means you agree to our terms and conditions” notices, they are agreeing to the terms and conditions by continuing to scrape the site). But at least if a model leaks, or someone pirates software that was generated by one of their LLMs, they can say: well, it was open source.
I disagree... if those terms are unenforceable, then 1. someone else's model does not become free just because they say it does, and 2. I'm not convinced that is even legal to begin with.
Typically contracts (a license is a contract) have to be fair and mutually beneficial... I don't think a judge would agree that giving a whole model away for free just because you trained on some of their data is fair, assuming there's even anything legally wrong with using said data for training in the first place.
You would also need to show a "reasonable purpose" for such a stipulation in the contract. Giving their product away for free as a punishment doesn't sound very reasonable to me, and I don't think a judge would say so either.
If they don’t like the contract, they can try and get it invalidated, but at least it will be a distributed problem for them.
> If they don’t think it is mutually beneficial, they don’t have to use the site.
That's not how contracts work insofar as legal enforcement though. If a judge finds that clause to be unenforceable, then they wouldn't be giving anything up.
> Anubis uses a multi-threaded proof of work check to ensure that users browsers are up to date and support modern standards.
This is so not cool. Further gatekeeping websites from older browsers. That is absolutely not their call to make. My choice of browser version is entirely my decision. Web standards are already a change treadmill, this type of artificial "You must be at least Internet Explorer 11" or "this website works best in Chrome" nonsense makes it much worse.
My browser is supported by your website if it implements all the things your website needs. That is the correct test. Not: "Your User-Agent is looking at me funny!" or "The bot prevention system we chose has an arbitrary preference for particular browser versions".
Just run the thing single threaded if you have to.
Well, I guess your browser does not support everything needed? Being able to run a multi-threaded proof of work is not the same as checking arbitrary user agents; any browser can implement it.
If necessary, it would also be possible to do what powxy does, which is display an explanation of how the proof of work works, in case you want to implement your own.
Offering alternative access to some files via other protocols might also help.
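For anyone who does want to implement their own, here is a minimal single-threaded sketch of the generic hashcash-style scheme such challenges are typically based on. This is an illustration, not Anubis's actual protocol: the function names, challenge format, and difficulty encoding are all assumptions.

```python
import hashlib

def solve_pow(challenge: str, difficulty_bits: int) -> int:
    """Find a nonce so that sha256(challenge + nonce) starts with
    `difficulty_bits` zero bits. Expected work grows as 2**difficulty_bits."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        # Interpret the digest as a big-endian 256-bit integer and
        # check whether its top `difficulty_bits` bits are all zero.
        if int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0:
            return nonce
        nonce += 1

def verify_pow(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server-side check: a single hash, cheap regardless of difficulty."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0
```

The asymmetry is the whole point: the client burns CPU searching for a nonce, while the server verifies it with one hash. Multi-threading only splits the nonce search space; a single-threaded solver like this produces exactly the same proofs, just more slowly.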
We have an Apache licensed project. You absolutely can use it for anything you want, including AI analysis. I don't appreciate third parties deciding on their own what can be done with code that isn't theirs.
In fact I'd say AI is overall a benefit to our project, because we have a large, quite complex platform, and the fact that ChatGPT actually manages to sometimes correctly write scripts for it is quite wonderful. I think it helps new people get started.
In fact, in light of the recent GitHub discussion, I'd say I personally see this as a reason to avoid SourceHut. Sorry, but I want all the visibility I can get.
Big-Tech deciding that all our work belongs to them: Good
Small Code hosting platform does not want to be farmed like a Field of Corn: Bad
I understand their standpoint: it's their infrastructure, and their bills.
However my concerns are with my project, not with their infrastructure bills. I seek to maximize the prominence, usability and general success of my project, and as such I want it to have a presence everywhere it can have one. I want ChatGPT/copilot/etc to know it exists and to write code for it, just in case that brings in more users.
Blocking abusive behavior? Sure. But I very specifically disagree with the blanket prohibition of "Anything used to feed a machine learning model". I do not see it being in my interest.
SourceHut is doing exactly how I expect them to act, and that's a good thing.
What did you expect from SourceHut, and why didn't you take this mindset off to GitHub in the first place?
I expect them (especially if they charge for it) to work in my interests as much as possible. Sure, defending themselves against abuse is fine. They have to survive to keep providing service.
However I don't appreciate the imposition of their own philosophy to something that isn't theirs. This here:
# Disallowed:
# [...]
# - Anything used to feed a machine learning model
is not okay with me. They have to do something, because I pay for a service, and if I can't use it, I'm not paying in the future. If that means blocking the AI companies, that's fine; they can contact me if they want to use my code, and we'll figure something out.
For example I expect a host not to have an arbitrary beef with Bing or Kagi, or to refuse to allow connections from France. Blocking can of course be rarely necessary, but what I want from a host is a blocking policy as minimal and selective as possible.
Yes, I understand it's a lot of work and is quite inconvenient, but especially if I'm paying for a service, I'm interested in my interests, not in what's convenient for the host.
I don't believe you are paying for Sourcehut hosting, so why do you care?
For that matter, "This has been part of our terms of service since they were originally written in 2018" so even if you are paying for hosting, why did you start using their services in the first place?
I don't expect hosts to be neutral to the maximum possible extent. I exercise my freedom of association to select hosts which are more aligned to my beliefs.
My beliefs, for example, say that I shouldn't use services from companies which build tools to support an apartheid state (eg, https://www.theverge.com/news/643670/microsoft-employee-prot... ), nor from companies which host those projects.
Even if being neutral were more profitable for them and cheaper for me.
I theoretically could, and it's posted here I imagine to discuss the linked post. So I am.
> For that matter, "This has been part of our terms of service since they were originally written in 2018"
The "No AI" bit seems to show up only in late 2024. Which I'd regard as an extremely unwelcome development had I been paying.
> I don't expect hosts to be neutral to the maximum possible extent. I exercise my freedom of association to select hosts which are more aligned to my beliefs.
Likewise. In my case, my belief is that when you pay somebody, it's to get things done your way. So, for instance, I'd be a lot more pleased with a setting.
The 2018 restriction on using "this data for recruiting, solicitation, or profit" would have been an offense to your belief that restrictions should be "as minimal and selective as possible."
Why? Who gets to say what's acceptable?
You can go to ChatGPT and ask it: "please write a Python script that prints "Hello world" in red". And that works.
And you can also go to ChatGPT and ask it: "Please write an Overte script that makes an object red when it's clicked".
And I really like that this works. I certainly don't want it to stop working because Sourcehut has something against LLMs.
How do you feel about someone with more funding than you going to an LLM and saying, "Reimplement the entire Overte source for me, but change it superficially so that Overte has a hard time suing me for stealing their IP?"
All the same though, I don't like my host being so opinionated. I don't want a host that has something against any of the common search engines, and I don't want a host that has something against LLMs. Hosts should be as neutral as possible.
That's not what the Apache license says.
According to the Apache license, derivative work needs to provide attribution and a copyright notice, which no major LLMs do.
Unlike you with your GitHub-based project, I avoid Microsoft like the plague. I do not want to be complicit in supporting their monopoly power, their human rights abuses, and their environmental destruction.
I'm curious what your project is. Blockchain?
But I see no reason to have any issues with LLMs. ChatGPT/copilot/etc helping new people getting started? That sounds absolutely great to me.
Crypto-focused projects of this kind include Decentraland. Which tend to devolve into things like selling virtual land. That's not our jam and doesn't align with the way the project works anyway -- you can have all the land you want, and set things up for free on your own server.
I don't have any more issues with ChatGPT than I have with Google and Kagi. All of those are closed source projects. But by all means, I love open source, so if an open source LLM can do something like writing code for our platform, that'd be wonderful.
I can understand that, but the various AI companies pounding sourcehut into the ground also results in zero visibility.
As an aside, I saw this for the first time on a kernel.org website a few days ago and actually thought it might have been hacked since I briefly saw something about kH/s (which screams cryptominer to me).
A screenshot for anyone who hasn't seen it: https://i.imgur.com/dHOmHtn.png
(This screen appears only very briefly, so while it is clear what it is from a static screenshot, it's very hard to tell in real time.)
Could anyone teach me what makes this a fair characterization of Cloudflare?
For example, https://developers.cloudflare.com/workers-ai/guides/demos-ar... has examples that visit websites; then, for the people on the other side (who want to protect themselves against those visits), there is https://developers.cloudflare.com/waf/detections/firewall-fo...
Just a guess though; I don't know the author's intentions/meanings for sure.
That's not far off from what Cloudflare does either, the majority of DDoS-for-hire outfits are hidden behind and protected by Cloudflare.
Whenever this comes up I do a quick survey of the top DDG results for "stresser" to see if anything's changed, and sure enough:
mrstresser.com. 21600 IN NS sterling.ns.cloudflare.com.
silentstress.cc. 21472 IN NS ernest.ns.cloudflare.com.
maxstresser.com. 21600 IN NS edna.ns.cloudflare.com.
darkvr.su. 21600 IN NS paige.ns.cloudflare.com.
stresser.sh. 21600 IN NS luke.ns.cloudflare.com.
stresserhub.org. 21600 IN NS fay.ns.cloudflare.com.
I am reminded of this posting from years past:
https://news.ycombinator.com/item?id=38496499
"A lot of hacking groups, terror organizations and other malicious actors have been using cloud flare for a while without them doing shit about it. ... It's their business model. More DDoS means more cloudflare customers, yaaay."
I've not spent much time on this topic, but I am very interested in the notion that a well-meaning third party established some kind of looking glass that surveyed Cloudflare behavior, and that third party was sued?
I'd like to learn more about that situation ...
[edit] Here's the source: https://sourcehut.org/blog/2024-01-19-outage-post-mortem/#:~...
Maybe they should update their bullet points...
The footnote saying "fuck you now, maybe come back later" is really encouraging.
OMG! Yet another firewall telling me what browser and OS to use.
https://developers.google.com/search/docs/crawling-indexing/...
I’m also a little confused by what you’re saying here; are you asking whether scraper bots are illegal, or whether they’re immoral/unethical?
I’m getting to the ethical aspect but also trying to be pragmatic. “Publishing a bunch of information on the internet accessible without authentication” is an action that is fundamentally incompatible with controlling the use of that information.
The law cannot substitute for common sense; criminals are gonna crime.
Crawling SourceHut once is collecting public information. Crawling it once a day using deltas might still be that. What these AI companies are doing is not that.
The point of a web host is to serve the users’ data to the public.
Anything else means the web host is broken.
I'm not sure how it's supposed to work, as I see the public internet as just that, a public square. What goes there can be picked up by anyone, for any purpose. If I want something to be secret, I don't put it on the public internet.
Gonna be interesting to see how that "public but in my control" movement continues to evolve, because it feels like they're climbing an impossible wall.
IANAL, but my read of this is that if the content has the appropriate licence, the licence holder can withhold certain rights & access from certain groups of potential licensees. I'm loosely aware that the common open source licences are highly permissive, so probably they can't be used in this way... but presumably not all licences are like that. So, even though the work is "public", it should still be possible to enforce subsets of rights.
And to take your "public square" analogy... before we had cameras everywhere, there was some expectation of "casual" privacy even in public spaces. Not everyone in the square hears everything said by everyone else in the square. The fact that digital tools make privacy breaches much easier doesn't mean it should be tolerated.
(that said, I'm fairly careful what I publish online)
Laws.
This one is called copyright.
> Gonna be interesting to see how that "public but in my control" movement continues to evolve
The Berne Convention is nearly 140 years old now...
It's the same complaint about the GDPR: if it works, why are sites still doing X/Y/Z... Well, because all people do is complain online; you need to report violations and be prepared to take legal action.
"Dear Russia, pwetty pweeese don't hammer my server to death stealing everything in sight"
[Crickets]
Russian IP ban
Well yes. Process Service tends to be quite different from country to country, and if you don't know how it works there, you probably won't be able to make vague "legal complaints" and have them taken seriously.
If you really want to make a legal complaint in Russia, I would suggest you look into what is called a Process Service Specialist who has specific experience with Arbitrazh.
There's absolutely no reason why you couldn't drag OpenAI to court... you'd need a ton of money, but you could, and if you won, the rest of the AI companies would get very busy adjusting their behaviour.
Though that also cuts both ways for those posting content. Systems that generate an infinite number of URL permutations for viewing the same information are a poor design. Even conscientious people can easily discover that a simple attempt to slowly mirror a website overnight with wget has turned into a rabbit hole of forum view-parameter explosion.
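One mitigation on the mirroring side is to canonicalize URLs before fetching, collapsing the view-state parameters that don't change the content. A minimal sketch; the parameter names in `IGNORED_PARAMS` are hypothetical, and a real list would depend on the specific site being mirrored:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical view-state parameters that do not change the page content.
IGNORED_PARAMS = {"sort", "view", "sid", "utm_source", "utm_medium"}

def canonicalize(url: str) -> str:
    """Collapse URL permutations that serve the same content, so a
    polite crawler fetches each page once instead of once per view state."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
            if k not in IGNORED_PARAMS]
    kept.sort()  # stable ordering, so ?a=1&b=2 and ?b=2&a=1 collapse too
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))
```

Deduplicating on the canonical form before enqueueing a fetch keeps a mirror from spidering thousands of sort/view variants of the same thread.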
> I feel like we're part of a dying generation or something
Well... yes. Just like every living thing that's ever existed ;)
I think we have a word for that - copyright
The problem remains the same: as soon as the content is visible on the public internet, you lose 100% of the control over its propagation, illegal or not.
The party that allowed you to download it committed copyright infringement by distributing it. The same restrictions apply to them as to you*. A copyright holder may well want that party to stop distributing it, which means no one else can download it.
It's just moving the ‘control’ up a level to a different party. It's still there.
> as soon as the content is visible on the public internet, you lose 100% control of it's propagation, illegal or not.
GPL license: thou shall always* provide access to this software’s source code —> form of control in the opposite direction to music copyright, but still a form of control
* subject to terms, conditions, locale and other legal things (IANAL)
That's why every social media site in existence puts terms in its EULA demanding that users grant the site a blanket license to redistribute their content, over and above any separate licenses they may put on it. After it's been redistributed to third parties, the copyright holder has no more control (at least, not via copyright law) over how those copies are privately used.
E.g., on HN: "With respect to the content or other materials you upload through the Site or share with other users or recipients (collectively, 'User Content'), you represent and warrant that you own all right, title and interest in and to such User Content, including, without limitation, all copyrights and rights of publicity contained therein. By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose in any form, medium or technology now known or later developed."
If you post something on a physical bulletin board, you expect people will come by and read it
If a bunch of "scrapers" come by and form a massive mob so that nobody else can read it, and then cover the board with slightly different copies attributed to themselves, that isn't exactly the "public square" you imagined
Or birds, or public cameras, or people who take photos next to it, or people archiving bulletin boards, or...
If I put stuff in public spaces, I expect anyone and anything to be able to read, access and store it. Basically how I treat this very comment, and everything else I put on the public internet.
When it's my server and I'm paying for it, then banning resource wasters is the right move.
“public, but, no, not like that” isn’t a thing and no technological measure can make it a thing.
Drew makes it perfectly clear in TFA that "the public", as he sees it, is fully entitled to, and should, make use of the data SourceHut provides.