I think the reason there are so many MongoDB wire-compatible projects (like this Postgres extension from Microsoft, and FerretDB) is that people have systems with MongoDB clients built into their storage layers but don't want to be running on MongoDB anymore, exactly because "performance, scale, and flexibility hit a wall".
If you can change the storage engine, but keep the wire protocol, it makes migrating off Mongo an awful lot cheaper.
MongoDB's protocol has fewer features (RIP easy synchronization and replication), but worse is better, and if the "core" feature of a database is its query language, then the best/worst query language won: the simple, dumb, easy-to-replicate one.
We started with Postgres and added support for MongoDB after they deprecated Atlas Device Sync in September last year.
For known attributes, the model that works best for me is rigid tables with strict typing, with the unknowns placed in one or more JSON fields. For example, there are maybe 100 known attributes of a product, and the category-specific ones, the unknowns generated dynamically at runtime, go into the "product_attributes" JSON column.
You can do actual searches inside the JSON, index the table on its JSONB contents, etc. Things that became available in MongoDB very, very late.
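A minimal sketch of that layout, with everything except the "product_attributes" column name made up for illustration: strictly typed columns for the known attributes, a JSONB column for the dynamic ones, and a GIN index so containment searches inside the JSON can use an index.

  CREATE TABLE products (
      id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      sku         text NOT NULL,
      name        text NOT NULL,
      price_cents integer NOT NULL,
      -- ... the rest of the ~100 known, strictly typed attributes ...
      product_attributes jsonb NOT NULL DEFAULT '{}'
  );

  CREATE INDEX products_attrs_gin ON products USING gin (product_attributes);

  -- Containment search inside the dynamic attributes, served by the GIN index:
  SELECT id, name
  FROM products
  WHERE product_attributes @> '{"color": "red", "connector": "usb-c"}';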
Going from 150ms to 8ms with an 80% CPU reduction does not make any sense perf-wise. I stand by my point: your post is missing a lot of details, and it's probably misuse.
Postgres has a good-enough (if not better) document DB inside it, but it also has a billion other features and widespread adoption. Mongo needs a huge benefit somewhere to justify using it.
Edit: on top of that, very few things that you store in the DB are non-relational. So you will always end up recreating a relational database on top of any NoSQL database. So why bother when you can just go for an RDBMS from the start?
Sure, you can implement things to make it better, but those are added layers that balloon the complexity. Most robust systems end up requiring more than one type of database. It is nice to work on projects with a limited scope where an RDBMS is good enough.
Lol. No relational database slows to a crawl on `is_deleted=true` or versioning
In general, so far not a single claim made by NoSQL databases has been shown to be true. Except KV databases, as they have their own uses.
There's a large amount of solutions for different kinds of data for a reason.
> ... takes solutions that require engineering teams.
All it took was an understanding of the data. And just one guy (me), not an "engineering team". Mongo knows only one way of sharding data. That one way may work for some use-cases, but for the vast majority of use-cases it's a Bad Idea. Postgres lets me do things in many different ways, and that's without extensions.
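To make "many different ways" concrete, here is a small sketch with made-up table names (not the actual setup): built-in declarative partitioning alone lets you lay the same kind of data out by range or by hash, per table, before you even reach for FDW-based sharding or extensions.

  -- Range-partitioned by time:
  CREATE TABLE events_by_time (
      id         bigint,
      tenant_id  int,
      created_at timestamptz NOT NULL
  ) PARTITION BY RANGE (created_at);

  CREATE TABLE events_2025 PARTITION OF events_by_time
      FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

  -- Or hash-partitioned by tenant:
  CREATE TABLE events_by_tenant (
      id        bigint,
      tenant_id int NOT NULL
  ) PARTITION BY HASH (tenant_id);

  CREATE TABLE events_tenant_0 PARTITION OF events_by_tenant
      FOR VALUES WITH (MODULUS 4, REMAINDER 0);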
If you don't understand your data, and you buy in to the marketing bullshit of a proprietary "solution", and you're too gullible to see through their lies, well, you're doomed to fail.
This fear-mongering that you're trying to pull in favour of the pretending-to-be-a-DB that is Mongo is not going to work anymore. It's not the early 2010s.
I have worked with tables on this scale. It definitely is not a walk in the park with traditional setups. https://www.timescale.com/blog/scaling-postgresql-to-petabyt...
Now, data chunked into objects and distributed around to be accessed by lots of servers, that's no sweat.
I'd love to see how you handle database maintenance when your active data is over 100TB.
Define "huge". Define "massive".
For modern RDBMS that starts at volumes that can't really fit on one machine (for some definition of "one machine"). I doubt Mongo would be very happy at that scale, too.
On top of that an analysis of the query plan usually shows trivially fixable bottlenecks.
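As a hypothetical example of what "trivially fixable" looks like: the usual `is_deleted` slowdown is often just a missing partial index, and EXPLAIN ANALYZE points straight at it (table and column names here are made up).

  -- Partial index covering only the live rows of a hypothetical "orders" table:
  CREATE INDEX orders_live_idx
      ON orders (customer_id, created_at)
      WHERE NOT is_deleted;

  -- The hot query now gets an index scan over live rows instead of
  -- filtering deleted rows out of the whole table:
  EXPLAIN ANALYZE
  SELECT *
  FROM orders
  WHERE customer_id = 42 AND NOT is_deleted
  ORDER BY created_at DESC
  LIMIT 20;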
On top of that it also depends on how you store your versioned data (wikipedia stores gzipped diffs, and runs on PHP and MariaDB).
Again, none of the claims you presented have any solid evidence in the real world.
I'm getting paid to move a database that size this morning.
I assume since then it's grown at least ten-fold. It's already an amount of data that would cripple most NoSQL solutions on the market.
I honestly feel like talking to functional programming zealots. There's this fictional product that is oh so much better than whatever tool you're talking about. No one has seen it, no one has proven it exists, or works better than the current perfectly adequate and performant tool. But trust us, for some ridiculous vaguely specified constraints it definitely works amazingly well.
This time "RDBMS is bad at soft deletions and versions because 19TBs of revisions on one of the world's most popular websites is tiny"
[1] https://meta.wikimedia.org/wiki/Data_dumps/Dumps_sizes_and_g...
They store revisions in compressed, mostly read-only storage for archival. https://wikitech.wikimedia.org/wiki/MariaDB#External_storage
They have the layout and backup plans of their servers available.
They've got an efficient layout, and they use caching, and it is by nature very read intensive.
https://wikitech.wikimedia.org/wiki/MariaDB#/media/File:Wiki...
Archival read-only servers don't have to worry about any of the maintenance mentioned. Use ChatGPT or something to play devil's advocate for you, because what you're saying is magical and non-existent is actually quite common.
Indeed, the tag line for one of the releases, I think 9.4 or 9.5, was “NoSQL on ACID”.
Though pretty sure Bruce is a teetotaler.
Hmm, interesting! I've been having some intermittent issues recently with a Postgres table that uses JSONB, where everything seems to lock up for several minutes at a time, with no useful logs as to why.
Do you have any more info about the GIN issue, and how it can be verified as being a problem please?
It’s not a super well-documented feature in postgres, so I also wound up doing some digging in the mailing lists, but I was really appreciative of all the detail in that GL issue and the associated links.
I think it illustrates the issue he's mentioning well enough.
> You can do actual searches inside the JSON, index the table
To be clear, you can do those things with Mongo (and the features have been available for a very long time), they just don't work as well, and are less reliable at scale.
Someone looking to use jsonb to replace a NoSQL DB would probably need to look deeper.
My team has compared PostgreSQL and MongoDB on analytical queries on JSON, and the results so far have been unimpressive: https://github.com/ClickHouse/JSONBench/
It's powerful, I love that it's there, but actually using it is always such a chore.
SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc @> '{"tags": ["qui"]}';
or
SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc @? '$.tags[*] ? (@ == "qui")';
which find documents in which the key "tags" contains array element "qui"
db.api.find(
{ "tags": "qui" },
{
"guid": 1,
"name": 1,
"_id": 0
}
)
I think the biggest use case is big data and dev platforms that need application compatibility, where wrapping Atlas is less attractive for some reason.
And with things like this in the source https://github.com/microsoft/documentdb/blob/a0c44348fb473da...
I'd definitely stay away from it.
Looks like it supports background indexing too.
Date and Binary are the ones I wish JSONB supported most often.
Selectivity is pretty hard to get right though so I can see why they stubbed it out and will deal with an actual implementation later.
That design allows arbitrary nested JSON data to be indexed using inverted indexes on top of a variation of B-trees called Bw-trees, and it seems like a nice way of indexing data automatically while preserving the ability to do both exact and range matching on arbitrarily nested values.
Not sure if the query capabilities and syntax match azure docdb but the basic functionality should be workable.
GIN does not support range searches (needed for <, <=, >, >=), prefix or wildcard, etc. It also doesn't support index-only scans, last I checked. You cannot efficiently ORDER BY a nested GIN value.
I recommend reading the paper.
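For the range and ORDER BY cases specifically, the usual Postgres-side workaround (sketched here with made-up table and key names) is a B-tree expression index on the particular JSONB path you need:

  -- B-tree expression index on one extracted, cast value:
  CREATE INDEX products_price_idx
      ON products (((product_attributes->>'price')::numeric));

  -- Range filters and ORDER BY on that same expression can now use the index:
  SELECT id
  FROM products
  WHERE (product_attributes->>'price')::numeric BETWEEN 10 AND 50
  ORDER BY (product_attributes->>'price')::numeric;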
Developers, developers, developers
I think a `||` that works with JSON patches would be nice, but you can easily implement that as an extension if you need it.
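For context, a tiny illustrative snippet (not from the thread): the existing `||` merges shallowly, replacing top-level keys wholesale, so patching a nested value today means reaching for jsonb_set.

  -- Shallow merge: the whole "a" object from the left side is replaced.
  SELECT '{"a": {"x": 1}, "b": 2}'::jsonb || '{"a": {"y": 3}}'::jsonb;
  -- => {"a": {"y": 3}, "b": 2}

  -- Patching a nested path instead:
  SELECT jsonb_set('{"a": {"x": 1}, "b": 2}'::jsonb, '{a,y}', '3');
  -- => {"a": {"x": 1, "y": 3}, "b": 2}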
Not the end of the world, but I have to look up the syntax every time I need to write a query.
I would prefer to use Postgres as my database, so this is worth investigating. Taking a brief look at the github page, it looks like it will be easy to swap it out in my code. So I think I know what I'll be spending my next sprint on.
The complexity of maintaining a relational database is at the core of their proprietary business model.
For operational systems, documents are just better in every way.
Leave the rdbms for analytics and reporting.
They (Microsoft) really don’t want you using it.
Cosmos can be a drop-in replacement.
If you start thinking in Domain-Driven Design terms, you realize the technology should be dictated by the business models. In many (if not most) cases, a document database is more than sufficient for most bounded contexts and its services, events, commands, and data.
Spinning up a relational database, managing schema changes, and tuning indexes is a constant drag on productivity.
If Microsoft wanted to compete with DynamoDB or MongoDB, they'd make the Cosmos document database a first line service in Azure. But you have to first spin up Cosmos and then identify a type of data storage. There is no technical reason for this setup other than to lump various non-relational data storage types into one service and create confusion and complexity.
I've done bake-offs between Microsoft and AWS and when I talk about a standard serverless architecture, the MS people are usually confused and respond, "What do you mean?" and the AWS folks are "Cool, so Lambdas and DynamoDB. We have 17 sample solutions for you."
I'm not saying you can't do serverless in Azure. I'm saying the support and advocacy is not there.
They call it (or called it, 3 years ago) a "multi-model" database, but really it was just a wrapper around three engines. It did come with fairly standardised pricing and availability guarantees, though, so my impression was that it was trying to sell to cloud architects and CIOs who might appreciate that.