149 points by napsterbr 1 day ago | 12 comments
johnthuss 18 hours ago
This is a super useful thing to know and I'm thankful for this article highlighting this aspect of Postgres.

But I would disagree with the takeaway to focus on optimizing your indexes and not your tables. The reason is that the order of columns in a multi-column index is highly meaningful and intentional, in order to support matching on a range of values for the last column in the index. The way databases work, you can only utilize a multi-column index on (customer_id int4, date timestamp) if you have an equality match on customer_id, like "WHERE customer_id = 1 AND date BETWEEN '2024-01-01' AND '2025-01-01'". If you reorder these columns in the index to put the larger date column first, then, sure, you save space in the index, but you also make it worthless – it will never be used by the query above. As such, optimizing a multi-column index is only useful when all the columns are queried for equality rather than a range.
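
A concrete sketch of the two orderings (the table name is made up):

  -- equality on customer_id plus a range on date: this column order serves it well
  CREATE INDEX ON orders (customer_id, date);

  SELECT * FROM orders
   WHERE customer_id = 1
     AND date BETWEEN '2024-01-01' AND '2025-01-01';

  -- with (date, customer_id) the index can at best be range-scanned on date
  -- and filtered on customer_id, which is far less efficient for this query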

In contrast, when you are creating a NEW table you might not think hard about the order of the columns in the table, and especially not about the data sizes of each column and their alignment. But doing so at the time you create the table can be tremendously beneficial if it is going to be very large. It is important to note that you not only save space on disk, but also in precious RAM when the tuples are loaded.
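
A quick way to see the padding (numbers from a quick local test; the exact values matter less than the 6-byte gap):

  -- int2 before int8 forces 6 bytes of alignment padding before the int8:
  SELECT pg_column_size(ROW(0::int2, 0::int8));  -- typically 40
  -- int8 first needs no padding:
  SELECT pg_column_size(ROW(0::int8, 0::int2));  -- typically 34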

rand_r 11 hours ago
This is sort of an aside, but a very interesting thing about Postgres is that it can efficiently combine independent column indexes together, so there is much less of a need, compared to older databases, to even create multi-column indexes. It's a feature from 8.1 called "Bitmap Scan". Basically, if you create an index on column X and an index on column Y, it can use them to do queries involving either or both columns pretty efficiently (for any number of columns).

It's not as fast as a multi-column index, but the savings of not having to worry about all the combinations of columns that can be queried together could well be worth it.
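
A minimal sketch (table and column names are invented; whether the planner combines the indexes depends on statistics):

  CREATE INDEX ON events (x);
  CREATE INDEX ON events (y);

  -- the plan will typically combine both via a BitmapAnd of two Bitmap Index
  -- Scans feeding a Bitmap Heap Scan, instead of a sequential scan:
  EXPLAIN SELECT * FROM events WHERE x = 1 AND y = 2;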

- https://www.postgresql.org/docs/release/8.1.0/

- https://www.postgresql.org/docs/current/indexes-bitmap-scans...

williamdclt 7 hours ago
It’s very cool, but at high throughput you really see the difference. The bitmap scan can take huge amounts of CPU, which drops to almost nothing (and the query gets much faster) once a proper multi-column index is set up.

On small/medium tables and lowish throughput though, yeah, it’s often good enough and avoids having many indexes for specific use cases (which is a cost in itself, in memory/CPU/storage).

jamespo 6 hours ago
Do you have any benchmarks? Be interesting to compare.
Pamar 7 hours ago
Bitmap Scan sounds a lot like Rushmore technology in Foxpro[1]. Are they the same?

1) https://en.wikipedia.org/wiki/FoxPro

It is difficult to find a complete explanation of Rushmore nowadays. From what I remember, it would create a bitmap where each bit represented the nth record of the table you wanted to search; then, with a single fast sequential scan, it would set the nth bit to 1 if the record satisfied all clauses of your search, and 0 otherwise.

Try to see if this makes any sense to you: http://www.foxpert.com/docs/howfoxproworks.en.htm

jeltz 18 hours ago
This is a limitation which is currently being worked on. The order will still matter, of course, but it will allow PostgreSQL to make some use of indexes even when the order of columns does not match.

https://www.postgresql.org/message-id/CAH2-Wzmn1YsLzOGgjAQZd...

johnthuss 16 hours ago
>>> This is feasible in cases where the total number of distinct values in the column 'a' is reasonably small (think tens or hundreds, perhaps even thousands for very large composite indexes).

It's great this is improving, but this is a fairly narrow improvement. Personally, the multi-column indexes I use would not be improved by this change since column 'a' does not store a "reasonably small" distribution of values.

sgarland 18 hours ago
This (index ordering resulting in useless indices) is not true, at least not in newer versions of Postgres (I just tried with 15).

While the query will take much longer (for me, it was about 47 msec vs 0.6 msec for 2,000,000 rows), it still uses the index.

Similarly, while normally you wouldn’t expect a query using predicate Y to use an index defined on columns (X, Y, Z) if predicate X is also not referenced, Postgres may choose to do so, depending on table stats, and the relative difference between random_page_cost and seq_page_cost. I’ve seen it happen before.
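
Roughly the scenario I mean, with made-up names (whether the planner actually picks the index depends on table statistics and the cost settings mentioned above):

  CREATE INDEX ON t (x, y, z);

  -- no predicate on the leading column x, yet the planner may still choose a
  -- scan of this index over a sequential scan:
  EXPLAIN SELECT * FROM t WHERE y = 42;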

ghusbands 15 hours ago
> This (index ordering resulting in useless indices) is not true, at least not in newer versions of Postgres (I just tried with 15).

> While the query will take much longer (for me, it was about 47 msec vs 0.6 msec for 2,000,000 rows), it still uses the index.

I'd argue that something being 78x slower can make it pretty useless, but it is indeed at least used, in some cases.

sgarland 15 hours ago
Certainly I wouldn’t encourage this, no, but it is possible, and is still usually faster than a sequential scan.
branko_d 8 hours ago
From the article:

  Having the data aligned ensures faster access time when retrieving pages from disk.
Byte-level alignment cannot possibly have anything to do with retrieving pages from disk, simply because the unit of retrieval is the whole page. From the hardware/OS perspective, a page is just an opaque blob of bytes (made up of one or more blocks on the physical drive).

Only after these bytes have reached RAM does byte-level alignment play a role, because the CPU works more slowly on misaligned data.

The article itself then goes on to illustrate the above (and seemingly contradict itself):

  SQLite does not pad or align columns within a row. Everything is tightly packed together using minimal space. Two consequences of this design:

  SQLite has to work harder (use more CPU cycles) to access data within a row once it has that row in memory.
  SQLite uses fewer bytes on disk, less memory, and spends less time moving content around because there are fewer bytes to move.
winternewt 8 hours ago
I think the idea is that padding results in lower storage efficiency, which means fewer rows per page and hence lower I/O throughput. By changing the column order you can reduce the amount of padding required.
branko_d 7 hours ago
Sure, having less padding increases I/O efficiency. I was just commenting on the author's apparent confusion as to why the padding is there in the first place.

Here is the full(er) quote:

  Postgres will happily add padding to the underlying data in order to make sure it is properly aligned at the physical layer. Having the data aligned ensures faster access time when retrieving pages from disk.
This might be misunderstood as "Postgres adds padding to speed up disk I/O", which is the opposite of what actually happens. Padding slows down I/O but speeds up the CPU processing afterwards.

SQLite made the opposite tradeoff.

napsterbr 2 hours ago
You are absolutely correct, the current wording causes confusion as to where the speed-up happens. Over the weekend I'll add a note and link to this thread, thanks for pointing that out.
dspillett 4 hours ago
The wording implies to me that Postgres is doing the padding for alignment to reduce IO costs, which, as branko_d suggests, would do the opposite. You are reading it as saying that the DBA's intervention of reordering columns to remove padding will improve IO efficiency by fitting more rows into each page, and you are right that it would be beneficial in that way.

Postgres will be performing padding for alignment to improve processing speed once data is in local memory – CPUs are usually much faster at reading & writing aligned data⁰. This is trading off memory use and IO efficiency for CPU gain, which is the right optimisation if you assume that your core working set fits nicely into RAM and that your CPU(s) have large enough cache that you don't create the same problem there¹. Other DBs don't do this padding at all either because they didn't think of it or, more likely in the case of the big ones, because they are optimising more for IO than being concerned about CPU bottlenecks, or perhaps they natively rearrange the fields where it makes a difference instead of being beholden to the column ordering given by the user².

----

[0] in fact some architectures don't directly support unaligned access at all, though probably not any architectures Postgres supports

[1] causing extra cache evictions if cache segment width aligns badly with the padding such that less data fits in the available cache

[2] if the user needs to care about physical ordering like this, you have a leaky abstraction

gnfargbl 16 hours ago
The ever-useful postgres_dba tool (https://github.com/NikolayS/postgres_dba) can help with this. Install it, start psql, run :dba and choose option p1 to see how much table space you would save by reordering your columns.
samokhvalov 14 hours ago
Thanks for mentioning!
gnfargbl 4 hours ago
Thanks for the tool! It has helped me out of the mire more than once.
wccrawford 18 hours ago
Interesting. But if this is important, why doesn't Postgres do it invisibly, automatically? Surely there's a most-correct sequence, with the longest values first, and the shortest values last, and then (probably) the variable-length values?

It could then happily report back the data in the order that the DB Admin/developer expects.

jeltz 18 hours ago
Because nobody has implemented it yet. A patch would be very welcome by many but nobody has written one.
j45 18 hours ago
Updating column order in MySQL is pretty trivial, and I've wondered why it's not the same in Postgres, even at the GUI level.

Ordering columns for performance might be a different order of recommendation.

rtuin 18 hours ago
It’s really one of those things you don’t know applies to your Postgres DB until you run into a situation where it does.

The author explains this very well, it’s a good read! I learned about this padding a little over a year ago, while I was designing a data-intensive application with a colleague. I was skeptical about the advantage at first, but for our specific design, where we have 100 to 480+ columns in one table, it makes a huge difference in table store size. Not so much on the indexes, though.

sgarland 18 hours ago
> where we have 100 to 480+ columns in one table

I’m sorry, what? Why?

magicalhippo 9 hours ago
Not OP, but in a similar situation. The main reason is that the primary tables represent various official declarations, and they have a lot of fields. In addition, our customers are required by law to keep the data readily available for 5-10 years. Also, the law states our customers are responsible for the data they send through our systems, so they want to see what they're sending.

So just to give an example of how the field count "blows up": there can be up to 8 different parties related to a single declaration, and for each we want to have, for display/print purposes, our internal id, the id in the customer's system, organization number, name, address, postal code, location, country code, contact person, and contact person phone and email.

So we're talking about almost 100 fields just to store parties.

We want to store all that separately from, say, our "companies" table, as a company might change name, location etc., and we're required to show/print the details as they are today 5+ years down the line.

We could put that in a separate table, but that means 8 joins to fetch data for a single record just there. And that's just one piece of the declarations, and at least 3-4 of these parties our customers want to see in overview grids and be able to filter on there.

Now tack on 20+ years of doing this, with various official systems coming and going with different technical requirements, and it adds up.

mkesper 6 hours ago
Sounds like it's ripe to convert to JSON fields maybe?
sgarland 2 hours ago
You do not want to do this, I assure you. The TOAST / DE-TOAST overhead alone would wreck performance, plus OP said it was update-heavy; Postgres can’t update a JSON object on disk, it just rewrites the entire blob.
magicalhippo 6 hours ago
If we could limit ourselves to only PostgreSQL we could perhaps do that for new modules. Sadly we have some customers who want to use their own MSSQL database.

Though we'll see what the future holds. PostgreSQL is interesting due to cost and features, and many of the large customers are shifting their POV on how they want to run things, almost doing 180's in some cases.

rtuin 17 hours ago
Glad you asked!

This system contains measurements and state of physical devices (time series). It’s designed for both heavy write and read, with a slight emphasis on write. Each table is one type of device and contains 1 to 5 different measurements/states. But here’s the trick: because data is queried with a minimum bucket size of 15 minutes, I figured we could just create a column for each measurement + quarter of the day (i.e. measure0000, measure0015), so that’s 100 columns for each measurement (96 quarters + 4 for DST), include the date in the key, et voilà: excellent write performance (because it’s mainly UPDATE queries) and good read performance.
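
Roughly the shape of such a table, with invented names (the real one would have a column per quarter-hour, measure0000 .. measure2345, plus the DST spares):

  CREATE TABLE device_power (
      device_id   bigint NOT NULL,
      day         date   NOT NULL,
      measure0000 real,
      measure0015 real,
      measure0030 real,
      PRIMARY KEY (device_id, day)
  );

  -- writes are mostly updates of a single quarter-hour slot in an existing row:
  UPDATE device_power SET measure0015 = 42.0
   WHERE device_id = 1 AND day = '2024-06-01';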

Okay, the queries to make sense of the data aren’t pretty, but can be generated.

I find it really cool how effective this is for time-series data without Postgres extensions (we’re on RDS).

aidos 16 hours ago
The updates wouldn’t be in place anyway because of the Postgres mvcc approach, right?

I can see the advantage in terms of just needing a single tuple for reads. So a timestamp + value model would likely take twice as much heap space as your approach?

Given that you’re probably always just inserting new data you could use a brin index to get fast reads on the date ranges. Would be interesting to see it in action and play around to see the tradeoffs. The model you’ve settled on sounds like it would be a pain to query.
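
A minimal sketch of the BRIN suggestion (names made up):

  -- a BRIN index only stores per-block-range summaries, so it stays tiny and
  -- works well when rows are inserted in roughly increasing date order:
  CREATE INDEX ON device_power USING brin (day);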

rtuin 7 hours ago
Good observation. The updates are HOT updates as much as possible. I wasn't familiar with BRIN indexes before, so I'll have to look into that.

At first glance our solution follows a similar approach, let me elaborate (rough sketch after the list):

- no index columns are updated ever, only inserted

- all tables are partitioned based on date (partition range is 1 month)

- for some tables there is another layer of partitioning (3 sub-partitions, based on one specific column)

- finding an appropriate fillfactor is important to improve the speed of UPDATE statements (HOT-updates)

- standard vacuum / auto vacuum settings work great for us so far.

- to improve ANALYZE performance, set column statistics of value-only columns to 0.
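
Roughly what those settings look like in DDL (all names invented, details simplified):

  CREATE TABLE measurements (
      device_id   bigint NOT NULL,
      day         date   NOT NULL,
      measure0000 real,
      measure0015 real,
      PRIMARY KEY (device_id, day)
  ) PARTITION BY RANGE (day);

  -- one partition per month, leaving free space in each page for HOT updates:
  CREATE TABLE measurements_2024_06 PARTITION OF measurements
      FOR VALUES FROM ('2024-06-01') TO ('2024-07-01')
      WITH (fillfactor = 70);

  -- skip gathering statistics for value-only columns:
  ALTER TABLE measurements ALTER COLUMN measure0000 SET STATISTICS 0;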

sgarland 15 hours ago
> The updates wouldn’t be in place anyway because of the Postgres mvcc approach, right?

They might be if the columns being updated weren’t indexed [0], but since it sounds like at least one is, no, not in-place.

Though interestingly, your comment on BRIN indexes is quite relevant, as that’s the one type of index that HOT can still work with.

[0]: https://www.postgresql.org/docs/current/storage-hot.html

sevensor 12 hours ago
Fascinating. How reliable is the measurement apparatus? Or to put it another way, how tolerant of missed measurements is this scheme? I’ve been in a similar situation in a past life and we took a different approach. I was never satisfied with how we ended up doing it though.
rtuin 7 hours ago
I see your point. Some of our measurements are counters, and we interpolate missing data via certain rules. We store these interpolations in a separate table with exactly the same format and join the two tables directly when querying the data. Since the real value always takes precedence over the interpolated value, this is just a COALESCE(realmeasure0000, interpolation0000) statement.

This is super fast when taking advantage of postgres' partition-wise joins.
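
A sketch of what such a read might look like (table names invented; partition-wise joins require both tables to be partitioned the same way and enable_partitionwise_join = on):

  SELECT m.device_id,
         m.day,
         COALESCE(m.measure0000, i.measure0000) AS measure0000,
         COALESCE(m.measure0015, i.measure0015) AS measure0015
    FROM measurements m
    LEFT JOIN measurements_interpolated i USING (device_id, day);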

sevensor 4 hours ago
That’s really neat, thanks for explaining!
sgarland 14 hours ago
I assume you looked at the various Postgres functions, like width_bucket()?

I’ve never run a PG-based TSDB so I’m reluctant to pass judgment one way or the other, but as I hope you can understand, “we have hundreds of columns in a table” is alarming at first glance.

rtuin 7 hours ago
The schema is organized by how we need to query it, which is mostly by fetching all measurements for an entire day or month. Querying statistics on those metrics is not often needed, maybe a few times per week or day, so we do that in Snowflake if necessary. Or we create a specialized streaming data flow when the stats are queried more often.
napsterbr 17 hours ago
One interesting thing your team may want to look into (if you haven't already) is compression.

Of course there are a multitude of variables we don't have access to from the outside, but Postgres only compresses data that is TOASTed, and based on your description of the table, the data is not being TOASTed (and therefore not being compressed).

Instead, if you could somehow pack your timeseries entries into an array, you would get the benefits of compression automatically.

Given your write performance requirements, using an array may be out of the question (and you may get too much overhead from dead tuples) -- but who knows? Always a good idea to benchmark.
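
A hedged sketch of the idea (names invented; compression only kicks in once the row exceeds the ~2 kB TOAST threshold):

  CREATE TABLE device_power_packed (
      device_id bigint NOT NULL,
      day       date   NOT NULL,
      measures  double precision[],  -- one slot per quarter-hour
      PRIMARY KEY (device_id, day)
  );

  -- updating a single slot de-TOASTs and rewrites the whole array, which is
  -- why this may not suit an update-heavy workload:
  UPDATE device_power_packed
     SET measures[2] = 42.0
   WHERE device_id = 1 AND day = '2024-06-01';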

I actually considered mentioning this in the post but figured it was too long already and it could be material for a future one :)

rtuin 7 hours ago
This is interesting! Turning the values into arrays is not suitable for our base-tables, but might be for the several materialization tables we run.
epgui 14 hours ago
I don't know what they're doing, but highly-denormalized tables are very common, and sometimes even the norm, depending on how things are set up, especially in OLAP contexts.
sgarland 14 hours ago
Highly denormalized tables are often the norm simply because the tables weren’t properly normalized to begin with, and the data model wasn’t properly done, such that reasonable joins are overly difficult.

OLAP is of course its own problem, and most of the best practices for OLTP do not apply.

didgetmaster 17 hours ago
It's been a while since I dug into the bowels of PG, but the overall row size with respect to page size used to be important. If you had a table where every row took up 4097 bytes and the page size was 8K, then your disk footprint for that table was doubled. Only one row fit in a page and the other (nearly) half of each page was wasted.

If that is still true, then alignment issues could push you over the edge if you are close to it.
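
A rough way to check how close you are (table name is a placeholder; the pgstattuple extension gives exact numbers):

  -- approximate average tuples per 8 kB page, from the planner's statistics:
  SELECT relname, reltuples / nullif(relpages, 0) AS avg_tuples_per_page
    FROM pg_class
   WHERE relname = 'my_table';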

SoftTalker 17 hours ago
This is the sort of thing that good DBAs used to know about and manage, but nowadays that isn't a fashionable job.
sgarland 15 hours ago
It’s coming back in the form of DBRE (hi, it’s me), but unfortunately actual core RDBMS knowledge is often lacking. RDBMS administration is legitimately difficult, as they have a million knobs to turn, and a million gotchas that can tank performance if you aren’t careful. I should mention here that this difficulty really only occurs at scale. If your DB has a few million rows, as long as your queries are indexed it’s unlikely that you’ll experience problems. You probably won’t have optimal performance, but it’s likely to be good enough.

Personally, I love it, but I also find it frustrating. The number of times I’ve presented “here’s your problem, here’s how to fix it, sorry it’s gonna suck” only to be told “nah, we’ll just upsize the instance” is far too high.

remus 19 hours ago
Very interesting, but I think the author overstates the importance of alignment a little. Unless your data/indexes are already of a challenging size for your hardware (or you expect them to be imminently), fiddling with byte alignment details feels like a fairly premature optimisation.

Disk is cheap, memory is plentiful, your time is expensive etc.

egnehots 18 hours ago
As mentioned in the article, it's a good idea to consider this when creating a new table, since it's essentially a free optimization. However, it's probably not worth the hassle of reordering a production table for that.

> Disk is cheap, memory is plentiful, but your time is expensive, etc.

One thing to keep in mind, though, is that while you often have plenty of disk space, RAM is still relatively expensive. It's also divided into many smaller buffers, such as working memory and shared buffers, which are not that large. These optimizations help to fit more data into cache.

However, what the article said about alignment being important for indexes is somewhat misleading. Reordering an index field is not the same as reordering columns in a table. Besides having to rewrite queries, it also changes the access pattern and the time required to access the data, which is often much more significant than the space saved. Indexes are, by nature, a tradeoff where you give up space to gain time, so this mindset doesn't really apply there.

napsterbr 18 hours ago
Hey, author here.

> Indexes are, by nature, a tradeoff where you give up space to gain time, so this mindset doesn't really apply there.

I agree that (re)aligning indexes is a different beast entirely, but (as mentioned in my recommendation) ideally the developer should keep this in mind when creating the index initially.

Factors like cardinality and even readability should take precedence over perfect alignment, but all else being equal, aligning your indexes from the very moment they are introduced in the codebase is the ideal scenario IMO.

sgarland 17 hours ago
> Disk is cheap, memory is plentiful, your time is expensive etc.

Spend 30 minutes one day playing around with Postgres, trying different column combinations out. Boom, you now know how best to order columns. This doesn’t seem like a big ask.
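
For example, something like this (made-up columns; same data, two orders):

  CREATE TABLE t_padded  (a boolean, b bigint, c smallint, d bigint);
  CREATE TABLE t_aligned (b bigint, d bigint, c smallint, a boolean);

  INSERT INTO t_padded  SELECT true, i, 1, i FROM generate_series(1, 1000000) i;
  INSERT INTO t_aligned SELECT i, i, 1, true FROM generate_series(1, 1000000) i;

  -- the heap of t_padded ends up noticeably larger due to alignment padding:
  SELECT pg_size_pretty(pg_relation_size('t_padded')),
         pg_size_pretty(pg_relation_size('t_aligned'));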

The flip side is that changing data at scale is HARD, so if you put things like this off, when you do finally need to squeeze bytes, it’s painful.

Also, memory is absolutely not plentiful. That’s generally the biggest bottleneck (or rather, the lack of it then makes IO the bottleneck) for an RDBMS, assuming you have connection pooling and aren’t saturating the CPU with overhead.

koolba 19 hours ago
> Disk is cheap, memory is plentiful, your time is expensive etc.

Taking the time to know the in-memory sizing of your data types is well worth it. Taking the time to think about the types to use and to sort them by size is also minimal effort and well worth it.

It may make sense for the system to do it automatically for newly created tables. But maybe not, as it’s possible you’d want the data layout to match some existing structure.

epgui 14 hours ago
> Disk is cheap, memory is plentiful, your time is expensive etc.

Index size is not solely a storage concern. I also don't really care about how much disk space I pay for, but sometimes I care a lot about how long it takes to vacuum a table.

OliverJones 18 hours ago
Good material!

It has to be said, the order of columns in correctly designed multicolumn BTREE indexes is governed by the shape of the queries the indexes support.

So don't arbitrarily reorder columns in your indexes to handle alignment.

koolba 19 hours ago
The content itself is fine but the unnecessary image above the caption “You have the right to remain aligned”, with the usual AI-generated garbage text, cheapens the entire article.
loloquwowndueo 19 hours ago
Are you kidding, that image is hilarious.

(Articles that intersperse annoying memes every 3 paragraphs with obnoxious gif animations are much much worse).

tiffanyh 15 hours ago
How much of this article is rooted in hard drives with physically mechanical spinning disk latency assumptions, vs ssd/nvme?
nzach 1 hour ago
As the sibling said: "not at all".

But I see why this could be confusing. In the article he wrote:

> Postgres will happily add padding to the underlying data in order to make sure it is properly aligned at the physical layer. Having the data aligned ensures faster access time when retrieving pages from disk.

And this is correct. The problem is that "physical layer" refers to the physical memory layout and how things are loaded into the CPU, and not to how they are stored on disk (mostly).

I'm no expert in this subject, but as far as I understand, the main factor for this kind of behavior is the way a CPU reads data from its cache, i.e. one line (64 bytes) at a time. And this is why we always pad to factors of 64 (2, 4, 8, 16, 32).

This is the first time I've read about this in the context of PG, but I've already encountered the same issue in C and Go. So for me this is just a new manifestation of the same underlying problem.

https://mecha-mind.medium.com/demystifying-cpu-caches-with-e...

pornel 4 hours ago
None at all. The alignment is for cheap and easy access of data once it is in memory. It's probably rooted in PostgreSQL being written in C where aligned access is trivial, and dereference of an unaligned pointer is Undefined Behavior and requires more clunky code instead.
delduca 18 hours ago
Could the Django ORM deal with this?
sgarland 18 hours ago
Yes, but it doesn’t that I’m aware of. Certainly able to override the classes yourself, though. Make a lookup table of column type / length and then reorder based on that.
jbkkd 17 hours ago
It can't right now, but this is a doable extension