373 points by kepano 22 hours ago | 29 comments
tmpfs 21 hours ago
Interesting, as I was researching this recently and was certainly not impressed with the quality of the Readability implementations in various languages. Although Readability.js was clearly the best, it being Javascript didn't suit my project.

In the end I found the Python trifatura library to extract the best-quality content with accurate metadata.

You might want to compare your implementation to trifatura to see if there is room for improvement.

acrophobic 16 hours ago
> ...it being Javascript didn't suit my project.

If you're using Go, I maintain Go ports of Readability[0] and Trafilatura[1]. They're actively maintained, and for Trafilatura, the extraction performance is comparable to the Python version.

[0]: https://github.com/go-shiori/go-readability

[1]: https://github.com/markusmobius/go-trafilatura

breadchris 2 hours ago
this is what i came here to see, thanks!
fabmilo 18 hours ago
reference to the library: https://trafilatura.readthedocs.io/en/latest/

for the curious: Trafilatura means "extrusion" in Italian.

| This method creates a porous surface that distinguishes pasta trafilata for its extraordinary way of holding the sauce. search maccheroni trafilati vs maccheroni lisci :)

(btw I think you meant trafilatura not trifatura)

thm 16 hours ago
Been using it since day one but development has stalled quite a bit since 2.0.0.
winddude 4 hours ago
It's a bit old, but I benchmarked a number of the web extraction tools years ago (https://github.com/Nootka-io/wee-benchmarking-tool); resiliparse-plain was my clear winner at the time.
creakingstairs 20 hours ago
I was just looking at Obsidian Web Clipper's source code because I've been quite impressed with its markdown conversion results, and came across Defuddle in there. I'll be using it for my bespoke read-it-later/knowledge-base app, so thank you in advance :D
Tsarp 16 hours ago
Been using the Obsidian clipper since it came out and this is really neat. The per-website, profile-based extraction is awesome.

Even if you are not an Obsidian user, the markdown extraction quality is the most reliable I've seen.

audessuscest 13 hours ago
thanks for the tip!
binarymax 6 hours ago
Really nice work. I appreciate the example with JSDOM as that’s exactly how I use readability, and this looks like a nice drop-in replacement.

Question: How did you validate this? You say it works better than readability but I don’t see any tests or datasets in the repo to evaluate accuracy or coverage. Would it be possible to share that as well?

acrophobic 17 hours ago
Is Mozilla's Readability really abandoned? The latest release (v0.6.0) was just 2 months ago, and its maintainer (Gijs) is pretty active in responding to issues.
khasan222 14 hours ago
That codebase definitely leaves much to be desired, I’ve already had to fork it for work in order to fix some bugs.

One such bug: take a page in a language that uses commas instead of periods as the decimal separator in numbers, like Dutch (I think), with a lot of prices on the page. It’ll think all the numbers are relevant text.

And of course I tried to open a PR and get it merged, but they require tests, and of course the tests don’t work on the page I’m testing. It’s just very snafu imho
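
To illustrate the decimal-comma bug described above, here is a minimal Python sketch. It is not Readability's actual code: `naive_comma_score` and `locale_aware_comma_score` are hypothetical names, and the only assumption is that a heuristic which rewards commas as a sign of prose will overrate a block of prices written like 12,99.

```python
import re

# A comma with a digit on both sides is assumed to be a decimal or
# thousands separator rather than punctuation in running text.
_NUMERIC_COMMA = re.compile(r"(?<=\d),(?=\d)")


def naive_comma_score(text: str) -> int:
    """Hypothetical heuristic: every comma counts as evidence of prose."""
    return text.count(",")


def locale_aware_comma_score(text: str) -> int:
    """Same heuristic, but commas inside numbers are not counted."""
    return text.count(",") - len(_NUMERIC_COMMA.findall(text))


price_list = "€ 12,99 € 4,50 € 199,00 € 7,25"
prose = "First, we parse the page, then we score it."

print(naive_comma_score(price_list), locale_aware_comma_score(price_list))
print(naive_comma_score(prose), locale_aware_comma_score(prose))
```

With the naive version the price list scores higher than the sentence; with the digit check it drops to zero while the sentence is unaffected.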

fabrice_d 3 hours ago
This seems to be https://github.com/mozilla/readability/pull/853#issuecomment... and I think their expectations are pretty reasonable.
khasan222 2 hours ago
Meh, maybe I'm standing too close to the problem, Idk. It is always frustrating trying to use a tool and having it not work, though. I know it's free and all, but then I feel like helping people make good contributions is paramount in maintaining and fixing bugs.

Clearly the comma thing is a bug, it's the lack of wanting to fix it actually that is a bit disheartening, and why I think it is a deadish repo

infogulch 2 hours ago
Since it's written in javascript is there any chance it could be packaged as a bookmarklet?
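
For context on what "packaged as a bookmarklet" involves in general, here is a rough Python sketch that turns a JavaScript snippet into a `javascript:` URL. `make_bookmarklet` is a hypothetical helper, the `alert(1)` payload is a placeholder, and nothing here is specific to this library: a real bookmarklet would still need to inline or load the bundled script, and some pages' Content Security Policy may block it.

```python
from urllib.parse import quote


def make_bookmarklet(js_source: str) -> str:
    """Wrap JS in an IIFE and percent-encode it as a javascript: URL."""
    # The IIFE keeps the snippet's variables out of the page's global scope.
    wrapped = f"(function(){{{js_source}}})();"
    # Parentheses are left readable; braces, spaces, etc. are percent-encoded.
    return "javascript:" + quote(wrapped, safe="()")


print(make_bookmarklet("alert(1)"))
```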
jeanlucas 18 hours ago
Obsidian Web Clipper is a great tool to turn ChatGPT conversations into markdown, or to just print them (believe me, it is a use case)
emaro 12 hours ago
Not sure about other clients, but Kagi Assistant directly offers to save a conversation as Markdown. Using Obsidian's web-clipper is a good idea too though.
kouru225 4 hours ago
Is that a paid plugin?
T0Bi 17 hours ago
I just ask ChatGPT to provide the summary or whatever I need as a markdown file.
ahsd1 3 hours ago
Cool. I'm looking for something similar but for stripping signatures and boilerplate disclaimers from HTML email. Could this work for that?
miketromba 1 hour ago
Excellent work. A modern alternative to readability was much needed. This is especially useful for building clean web context for LLMs. Thanks for open-sourcing this!
elcritch 54 minutes ago
I found LLMs are really good at taking a web page and transforming it to markdown. Well, rather, commercial LLMs like Claude and Gemini are.

Unfortunately I tried a bunch of Hugging Face models that I could run on my MacBook, and all of them ignored my prompts despite my trying every variation I could think of. Half the time they just tried summarizing it and describing what JavaScript was. :/

severusdd 9 hours ago
This is very cool! Given how messy and busy many websites have become, we really need a robust markdown converter that lets readers focus on reading the content. Nice to see something stepping up where Readability left off.

Thank you for picking up this work :-)

ricardonunez 8 hours ago
I’ll give it a try. I’m not happy with my current setup for markdown to HTML on the wysiwyg editor I’m using, this may provide better results if I go with my own tool bar and editor.
jonplackett 13 hours ago
Does anyone know why readers don’t work for some websites where it looks like they should, i.e. a normal article with lots of text?

You just get a completely white page (on the iPhone reader). Usually it’s a news website.

Is this the website intentionally obscuring the content to ensure they can serve their ads? If so how do they go about it?

miki123211 12 hours ago
Cookie and "we care about your privacy" banners are often the cause here, especially if you're in the EU / UK / possibly California[1].

On some websites, those are just modals that obscure the content, something that reader mode can usually deal with just fine, but on others, they're implemented as redirects or rendered server-side.

If reader mode doesn't work, dismiss those first and try again.

shrinks99 18 hours ago
I've been super happy with Obsidian Web Clipper! It's worked really well for me with the one exception of importing publish dates (which is more than forgivable!)
rcarmo 22 hours ago
The Python analogues seem to be well maintained. I did my own implementation of the Readability algorithm years ago and dropped it in favor of them, and I have a few scrapers going strong with regular updates.
kepano 22 hours ago
Are there any in particular you can recommend?
khimaros 21 hours ago
not parent, but this one looks maintained https://github.com/buriy/python-readability
ulrischa 11 hours ago
I have built something similar: https://devkram.de/markydown but with PHP. Easy for self-hosting
Andr2Andr 11 hours ago
Serious question - who would be using this tool, and why? What is the use case? In other comments I have only seen exporting ChatGPT conversations to md
rollcat 11 hours ago
This is a library, not a tool. You can use it for a number of purposes:

- Providing "reader mode" for your visitors

- Using it in a browser extension to add reader mode

- Scraping

- Plugging it into a [reverse] proxy that automatically removes unnecessary bloat from pages, for e.g. easier access on retro hardware <https://web.archive.org/web/20240621144514/https://humungus....> (archive.org link, because the website goes down regularly)

degosuke 11 hours ago
I use LogSeq a lot - and having the option to scrape a website with only the text in MD seems like a great fit.
timdeve 11 hours ago
Looks good, I'm gonna try to swap readability in my RSS reader with this.

And with Pocket going away I might have to add save it later to it...

inhumantsar 20 hours ago
can confirm that readability seems to be on life support. I used it in slurp, an obsidian plugin which serves the same basic purpose as web clipper, and always had a hard time getting PRs reviewed and merged.

i started working on my own alternative but life (and web clipper) derailed the work.

it's funny. somehow slurp keeps gaining new users even though web clipper exists. so i might have to refactor it to use your library sometime soon even though I don't use slurp myself anymore.

billconan 21 hours ago
Are you using AI models behind the scenes? I saw Gemini and others in the code. I am asking mainly to understand the cost of using yours vs. readability. Thanks!
kepano 21 hours ago
No, it's all rules-based. I think the code you're referring to is "extractors", which are website-specific rules that I'm working on to standardize the output from sites with comment threads (e.g. HN, Reddit) and conversational chats (ChatGPT, Claude, Gemini).
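
One way to picture website-specific rules sitting alongside generic ones is a lookup keyed by hostname with a fallback. The Python sketch below is only an illustration of that idea, not how the library is actually implemented: every name in it (`SITE_EXTRACTORS`, `pick_extractor`, the three extractor functions) and the hostnames chosen are hypothetical.

```python
from typing import Callable, Dict
from urllib.parse import urlparse

Extractor = Callable[[str], str]


def generic_extractor(html: str) -> str:
    """Placeholder for the general rules applied to any page."""
    return "generic rules"


def hn_extractor(html: str) -> str:
    """Placeholder for rules tuned to a comment-thread layout."""
    return "comment-thread rules"


def chat_extractor(html: str) -> str:
    """Placeholder for rules tuned to a chat-transcript layout."""
    return "chat-transcript rules"


# Hypothetical registry: hostname -> site-specific extractor.
SITE_EXTRACTORS: Dict[str, Extractor] = {
    "news.ycombinator.com": hn_extractor,
    "chatgpt.com": chat_extractor,
}


def pick_extractor(url: str) -> Extractor:
    """Return the site-specific extractor for a URL, else the generic one."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return SITE_EXTRACTORS.get(host, generic_extractor)


print(pick_extractor("https://news.ycombinator.com/item?id=1")("<html></html>"))
print(pick_extractor("https://example.com/post")("<html></html>"))
```

No model is involved at any step: the choice of rules is a dictionary lookup, so the cost is the same as running any other local parsing code.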
pugio 16 hours ago
I would love something which reliably extracted a markdown back/forth from all the main LLM providers. I tried `defuddle` on a shared Gemini URL and it returned nothing but the "Sign In" link. Maybe I'm using your extractor wrong? How are you managing to get the rendered conversation HTML?
bambax 12 hours ago
I think most LLM APIs return markdown and the conversion md->html happens after; so if you query the API directly you get markdown "for free".
90s_dev 19 hours ago
Neat. With ~3 more lines of code, you could get a URL and render it in simpler HTML and be a full fledged replacement.
khaki54 18 hours ago
seems pretty much perfect including obsidian clipper. Thanks!
ioma8 10 hours ago
Tried it on some webpages, doesn't work well.
revskill 11 hours ago
Interesting that Markdown does not support form elements.
busymom0 22 hours ago
In the playground, after I enter a url, I can't seem to figure out how to submit it to fetch the url? I tried pressing the return key on iOS keyboard but it didn't do anything. Am I missing something?
kepano 22 hours ago
The input is there to test the url option — which I admit is a bit confusing, so I have removed it for now. I haven't found a good and free way to proxy requests from a GitHub page (yet).
input_sh 20 hours ago
A bit off-topic, but I'm very excited to see the launch of Bases! I've obsessively followed the roadmap for like a year awaiting this day and have been frequently disappointed to still see it stuck somewhere under "planned".

Not that I didn't already implement a read-it-later solution with Obsidian+Dataview, but this definitely makes things simpler!

jeanlucas 18 hours ago
Didn't it release just some days ago?
sn0n 16 hours ago
Bases?
input_sh 11 hours ago
https://help.obsidian.md/bases

Note that I'm using a preview (catalyst) version, it will reach stable soon. I'm assuming kepano will submit it here then.

fkfyshroglk 21 hours ago
For those not in the know: [Readability](https://github.com/mozilla/readability)
andrethegiant 20 hours ago
[flagged]
simpaticoder 19 hours ago
Interesting. How do you avoid users misusing such a tool? How do users know you won't misuse the tool against users? On a technical note, do you rotate IPs on each request, even for sub-resources of the same page?
ghilston 19 hours ago
Interesting! Your website does not explain what the free tier limits are. Can you explain those?
andrethegiant 19 hours ago
Free tier (i.e. using API keys but without a paid subscription) is rate-limited to 10 requests per minute. https://pure.md/docs/#section/Rate-limits
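
For anyone scripting against a limit like 10 requests per minute, a client-side throttle can be sketched as below. This is a generic sliding-window limiter in Python, not code from the service: `SlidingWindowLimiter` and `try_acquire` are hypothetical names, and how the server actually counts requests is an assumption.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `period`-second window."""

    def __init__(
        self,
        max_calls: int = 10,
        period: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_calls = max_calls
        self._period = period
        self._clock = clock  # injectable so the limiter can be tested
        self._calls: Deque[float] = deque()

    def try_acquire(self) -> bool:
        """Record a call and return True if it fits in the window."""
        now = self._clock()
        # Forget calls that have aged out of the window.
        while self._calls and now - self._calls[0] >= self._period:
            self._calls.popleft()
        if len(self._calls) < self._max_calls:
            self._calls.append(now)
            return True
        return False


limiter = SlidingWindowLimiter(max_calls=10, period=60.0)
if limiter.try_acquire():
    print("ok to send a request")
```

A caller would check `try_acquire()` before each request and sleep or queue the request when it returns False.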
ghilston 17 hours ago
Thanks!
xyzzy_plugh 19 hours ago
[flagged]
latchkey 19 hours ago
[flagged]
kepano 18 hours ago
Feel free to help :)
latchkey 18 hours ago
As an open source developer for 3 decades now, I used to have this flippant attitude. Trust me when I say, it doesn't work.

Build the framework for tests and then require anyone who wants to help build the product to write tests with their PRs.

You can't just push some code out there and expect people to "feel free to help", it doesn't happen, and is quite a turnoff.

To the downvoters, this is what I see as valid feedback to a rather flippant response.

jeanlucas 18 hours ago
You just wanted to complain and not add anything? Not really getting your point at all
latchkey 17 hours ago
Sorry you're not getting my point. It isn't a complaint. I'm responding to a rather flippant "feel free to help" with some advice from someone who's been doing this a long time.

I've got a project that has been going for 6 years now and attracted 500 stars and gets 49k downloads a month. It works because it has comprehensive unit tests and people can rely on it. When I was just starting out on that project, I didn't tell people to feel free to help. I put the effort in. It is important to lay the groundwork beyond just writing the utility.

m0zzie 17 hours ago
Apologies if you already know this, but I noticed you’re getting flagged so thought I’d add some context: the author is the CEO of Obsidian and has a few successful projects, so bragging about your 500 stars and saying things like “when I was just starting out, I didn't tell people to feel free to help. I put the effort in” is probably rubbing people the wrong way.
latchkey 17 hours ago
Clarified "starting out on that project". I've been doing this for 30 years and I'm also a CEO. I've had multiple successful projects, like starting Java@Apache and open sourcing Tomcat.

I made a lot of mistakes along the way and one of them was being flippant on my responses to people like that. Just sharing my insights.

m0zzie 16 hours ago
Your follow up post and edits help with clarifying the tone, hopefully readers see that.