But I would disagree with the takeaway to focus on optimizing your indexes rather than your tables. The reason is that the order of columns in a multi-column index is highly meaningful and intentional: it exists to support a match on a range of values for the last column in the index. The way databases work, you can only utilize a multi-column index on (customer_id int4, date timestamp) if you have an equality match on customer_id, like "WHERE customer_id = 1 AND date BETWEEN '2024-01-01' AND '2025-01-01'". If you reorder these columns in the index to put the larger date column first, then, sure, you save space in the index, but you also make it worthless – it will never be used by the query above. As such, optimizing a multi-column index for alignment is only useful when all the columns are queried for equality rather than a range.
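The leading-column requirement can be sketched by modeling the index as a sorted list of key tuples (a Python toy model, not Postgres internals; the customer ids and day numbers are made up):

```python
from bisect import bisect_left, bisect_right

# Model a B-tree index as a sorted list of key tuples:
# 5 customers, 365 "days" each.
rows = [(cid, day) for cid in range(1, 6) for day in range(1, 366)]

idx_cid_date = sorted(rows)                       # index on (customer_id, date)
idx_date_cid = sorted((d, c) for c, d in rows)    # index on (date, customer_id)

# Query: customer_id = 1 AND date BETWEEN 100 AND 200.
# With (customer_id, date) the matches form ONE contiguous run:
lo = bisect_left(idx_cid_date, (1, 100))
hi = bisect_right(idx_cid_date, (1, 200))
hits = idx_cid_date[lo:hi]                        # 101 rows, no filtering needed

# With date first, the matching rows are scattered across the date range:
# we must scan every entry with date in [100, 200] and filter on customer,
# touching 5x as many index entries for the same result.
scanned = idx_date_cid[bisect_left(idx_date_cid, (100, 0)):
                       bisect_right(idx_date_cid, (200, 999))]
filtered = [t for t in scanned if t[1] == 1]
print(len(hits), len(scanned), len(filtered))     # 101 505 101
```

Flipping the column order doesn't just change sizes; it changes which queries can use the index as a single contiguous scan.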
In contrast, when you are creating a NEW table you might not think hard about the order of its columns, and especially not about the data sizes of each column and their alignment. But doing so at the time you create the table can be tremendously beneficial if it is going to be very large. It is important to note that you not only save space on disk, but also in precious RAM when the tuples are loaded.
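The size effect of column order can be sketched with Python's struct module, whose native mode pads fields much like a C compiler does — roughly analogous to, though not identical to, Postgres's tuple layout (exact sizes assume a typical 64-bit platform):

```python
import struct

# Native ('@', the default) alignment: each field is aligned to its size.
# int4, int8, int4 -> 4 + 4 (padding) + 8 + 4 = 20 bytes
# int8, int4, int4 -> 8 + 4 + 4               = 16 bytes
unordered = struct.calcsize('iqi')   # int4, int8, int4
ordered   = struct.calcsize('qii')   # int8, int4, int4
print(unordered, ordered)            # 20 16
```

Same three fields, 20% smaller just by putting the widest column first — and the saving repeats on every row.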
It's not as fast as a multi-column index, but the savings of not having to worry about all the combinations of columns that can be queried together could well be worth it.
- https://www.postgresql.org/docs/release/8.1.0/
- https://www.postgresql.org/docs/current/indexes-bitmap-scans...
On small/medium tables and lowish throughput, though, yeah, it’s often good enough and avoids having many indexes for specific use cases (which is a cost in itself, in memory/CPU/storage)
1) https://en.wikipedia.org/wiki/FoxPro
It is difficult to find a complete explanation of Rushmore nowadays. From what I remember, it would create a bitmap where each bit represented the nth record of the table you wanted to search; then, with a single, fast sequential scan, it would set the nth bit to 1 if the record satisfied all clauses of your search, 0 otherwise.
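From that description, a toy version of the idea might look like the following Python sketch (my own reconstruction, not actual Rushmore code; the records and clauses are made up):

```python
# Toy bitmap selection over a table: one bit per record.
records = [
    {"city": "Rome",  "year": 1994},
    {"city": "Milan", "year": 1994},
    {"city": "Rome",  "year": 1995},
    {"city": "Rome",  "year": 1994},
]

def select_bitmap(records, clauses):
    """Single sequential scan: set bit n iff record n satisfies every clause."""
    bits = 0
    for n, rec in enumerate(records):
        if all(clause(rec) for clause in clauses):
            bits |= 1 << n
    return bits

# WHERE city = 'Rome' AND year = 1994
bm = select_bitmap(records, [lambda r: r["city"] == "Rome",
                             lambda r: r["year"] == 1994])
print([n for n in range(len(records)) if bm >> n & 1])   # [0, 3]
```

Once you have the bitmap, fetching the matching records is a matter of walking the set bits — and bitmaps for different clauses can be combined with cheap bitwise AND/OR, which is essentially what Postgres's bitmap scans do too.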
Try to see if this makes any sense to you: http://www.foxpert.com/docs/howfoxproworks.en.htm
https://www.postgresql.org/message-id/CAH2-Wzmn1YsLzOGgjAQZd...
It's great this is improving, but this is a fairly narrow improvement. Personally, the multi-column indexes I use would not be improved by this change since column 'a' does not store a "reasonably small" distribution of values.
While the query will take much longer (for me, it was about 47 msec vs 0.6 msec for 2,000,000 rows), it still uses the index.
Similarly, while normally you wouldn’t expect a query using predicate Y to use an index defined on columns (X, Y, Z) if predicate X is also not referenced, Postgres may choose to do so, depending on table stats, and the relative difference between random_page_cost and seq_page_cost. I’ve seen it happen before.
> While the query will take much longer (for me, it was about 47 msec vs 0.6 msec for 2,000,000 rows), it still uses the index.
I'd argue that something being 78x slower can make it pretty useless, but it is indeed at least used, in some cases.
Having the data aligned ensures faster access time when retrieving pages from disk.
Byte-level alignment cannot possibly have anything to do with retrieving pages from disk, simply because the unit of retrieval is the whole page. From the hardware/OS perspective, a page is just an opaque blob of bytes (composed of one or more blocks on the physical drive). Only after these bytes have reached RAM does byte-level alignment play a role, because the CPU works more slowly on misaligned data.
The article itself then goes on to illustrate the above (and seemingly contradict itself):
SQLite does not pad or align columns within a row. Everything is tightly packed together using minimal space. Two consequences of this design:
SQLite has to work harder (use more CPU cycles) to access data within a row once it has that row in memory.
SQLite uses fewer bytes on disk, less memory, and spends less time moving content around because there are fewer bytes to move.
Here is the full(er) quote:
Postgres will happily add padding to the underlying data in order to make sure it is properly aligned at the physical layer. Having the data aligned ensures faster access time when retrieving pages from disk.
This might be misunderstood as "Postgres adds padding to speed up disk I/O", which is the opposite of what actually happens. Padding slows down I/O but speeds up the CPU processing afterwards. SQLite made the opposite tradeoff.
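Both tradeoffs are visible side by side with Python's struct module: native ('@') mode pads for alignment the way Postgres does, while standard ('=') mode packs fields tightly the way SQLite does (sizes assume a typical 64-bit platform; only the space cost shows up here — the CPU cost of misaligned access does not):

```python
import struct

layout = 'iqi'                            # int4, int8, int4
aligned = struct.calcsize('@' + layout)   # 20: padded for alignment, like Postgres
packed  = struct.calcsize('=' + layout)   # 16: tightly packed, like SQLite
print(aligned, packed)                    # 20 16
```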
Postgres will be performing padding for alignment to improve processing speed once data is in local memory – CPUs are usually much faster at reading & writing aligned data⁰. This is trading off memory use and IO efficiency for CPU gain, which is the right optimisation if you assume that your core working set fits nicely into RAM and that your CPU(s) have large enough cache that you don't create the same problem there¹. Other DBs don't do this padding at all either because they didn't think of it or, more likely in the case of the big ones, because they are optimising more for IO than being concerned about CPU bottlenecks, or perhaps they natively rearrange the fields where it makes a difference instead of being beholden to the column ordering given by the user².
----
[0] in fact some architectures don't directly support unaligned access at all, though probably not any architectures Postgres supports
[1] causing extra cache evictions if cache segment width aligns badly with the padding such that less data fits in the available cache
[2] if the user needs to care about physical ordering like this, you have a leaky abstraction
It could then happily report back the data in the order that the DB Admin/developer expects.
Ordering columns for performance might be a different order of recommendation.
The author explains this very well; it’s a good read! I learned about this padding a little over a year ago, while designing a data-intensive application with a colleague. I was skeptical about the advantage at first, but for our specific design, where we have 100 to 480+ columns in one table, it makes a huge difference in table store size. Not so much on the indexes, though.
I’m sorry, what? Why?
So just to give an example of how field count "blows up", there can be up to 8 different parties related to a single declaration, and for each we want to have for display/print purposes our internal id, id in customers system, organization number, name, address, postal code, location, country code, contact person, contact person phone and email.
So we're talking about almost 100 fields just to store parties.
We want to store all that separate from, say, our "companies" table, as a company might change name, location etc., and we're required to show/print the details as they were at the time, 5+ years down the line.
We could put that in a separate table, but that means 8 joins to fetch data for a single record just there. And that's just one piece of the declarations, and at least 3-4 of these parties our customers want to see in overview grids and be able to filter on there.
Now tack on 20+ years of doing this, with various official systems coming and going with different technical requirements, and it adds up.
Though we'll see what the future holds. PostgreSQL is interesting due to cost and features, and many of the large customers are shifting their POV on how they want to run things, almost doing 180's in some cases.
This system contains measurements and state of physical devices (time series). It’s designed for both heavy write and read, with a slight emphasis on write. Each table is one type of device and contains 1 to 5 different measurements/states. But here’s the trick: because data is queried with a minimum bucket size of 15 minutes, I figured we could just create a column for each measurement + quarter of the day (e.g. measure0000, measure0015), so that’s 100 columns for each measurement (96 quarters + 4 for DST). Include the date in the key, et voilà: excellent write performance (because it’s mainly UPDATE queries) and good read performance.
Okay, the queries to make sense of the data aren’t pretty, but can be generated.
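For illustration, the 96 quarter-of-day column names (leaving aside the 4 extra DST columns) and the generated SET clauses can be produced mechanically — a sketch assuming the measureHHMM naming described above:

```python
# Generate the 96 quarter-hour column names: measure0000 .. measure2345.
cols = [f"measure{h:02d}{m:02d}" for h in range(24) for m in (0, 15, 30, 45)]

def set_clause(hour, minute, placeholder="%s"):
    """SET fragment of a generated UPDATE for one quarter-hour slot."""
    return f"measure{hour:02d}{minute:02d} = {placeholder}"

print(len(cols), cols[0], cols[-1])   # 96 measure0000 measure2345
print(set_clause(13, 45))             # measure1345 = %s
```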
I find it really cool how effective this is for time-series data without Postgres extensions (we’re on RDS).
I can see the advantage in terms of just needing a single tuple for reads. So a timestamp + value model would likely take twice as much heap space as your approach?
Given that you’re probably always just inserting new data, you could use a BRIN index to get fast reads on the date ranges. Would be interesting to see it in action and play around to see the tradeoffs. The model you’ve settled on sounds like it would be a pain to query.
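The "twice as much" guess can be sanity-checked with back-of-envelope arithmetic, assuming roughly 24 bytes of per-row tuple overhead (Postgres's 23-byte header rounded up), 8-byte timestamps and 8-byte float8 values — all assumed figures, ignoring page overhead and indexes:

```python
# Heap bytes for one device-day of quarter-hour readings, two layouts.
HDR = 24                            # assumed per-tuple header/overhead
narrow = 96 * (HDR + 8 + 8)         # 96 rows of (timestamp, value)
wide   = HDR + 8 + 96 * 8           # 1 row: date key + 96 value columns
print(narrow, wide, round(narrow / wide, 1))   # 3840 800 4.8
```

By this rough estimate the narrow layout is closer to 5x, not 2x, because the per-tuple header is paid 96 times instead of once.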
At first glance our solution follows a similar approach, let me elaborate:
- no index columns are updated ever, only inserted
- all tables are partitioned based on date (partition range is 1 month)
- for some tables there is another layer of partitioning (3 sub-partitions, based on one specific column)
- finding an appropriate fillfactor is important to improve the speed of UPDATE statements (HOT-updates)
- standard vacuum / auto vacuum settings work great for us so far.
- to improve ANALYZE performance, set column statistics of value-only columns to 0.
They might be if the columns being updated weren’t indexed [0], but since it sounds like at least one is, no, not in-place.
Though interestingly, your comment on BRIN indexes is quite relevant, as that’s the one type of index that HOT can still work with.
[0]: https://www.postgresql.org/docs/current/storage-hot.html
This is super fast when taking advantage of postgres' partition-wise joins.
I’ve never run a PG-based TSDB, so I’m reluctant to pass judgment one way or the other, but as I hope you can understand, “we have hundreds of columns in a table” is alarming at first glance.
Of course, there are a multitude of variables we don't have access to from the outside, but Postgres only compresses data that is TOASTed, and based on your description of the table, the data is not being TOASTed (and therefore not being compressed).
Instead, if you could somehow pack your timeseries entries into an array, you would get the benefits of compression automatically.
Given your write performance requirements, using an array may be out of the question (and you may get too much overhead from dead tuples) -- but who knows? Always a good idea to benchmark.
I actually considered mentioning this at the post but figured it was too long already and could be the material for a future one :)
OLAP is of course its own problem, and most of the best practices for OLTP do not apply.
If that is still true, then alignment issues could push you over the edge if you are close to it.
Personally, I love it, but also find it frustrating. The amount of times I’ve presented “here’s your problem, here’s how to fix it, sorry it’s gonna suck” only to be told “nah, we’ll just upsize the instance” is far too many.
Disk is cheap, memory is plentiful, your time is expensive etc.
> Disk is cheap, memory is plentiful, your time is expensive etc.
One thing to keep in mind, though, is that while you often have plenty of disk space, RAM is still relatively expensive. It's also divided into many smaller buffers, such as work memory (work_mem) and shared buffers (shared_buffers), which are not that large. These optimizations help fit more data into cache.
However, what the article said about alignment being important for indexes is somewhat misleading. Reordering index columns is not the same as reordering columns in a table. Besides having to rewrite queries, it also changes the access pattern and the time required to access the data, which is often much more significant than the space saved. Indexes are, by nature, a tradeoff where you give up space to gain time, so this mindset doesn't really apply there.
> Indexes are, by nature, a tradeoff where you give up space to gain time, so this mindset doesn't really apply there.
I agree that (re)aligning indexes is a different beast entirely, but (as mentioned in my recommendation) ideally the developer should keep this in mind when creating the index initially.
Factors like cardinality and even readability should take precedence over perfect alignment, but all else being equal, aligning your indexes from the very moment they are introduced in the codebase is the ideal scenario IMO.
Spend 30 minutes one day playing around with Postgres, trying different column combinations out. Boom, you now know how best to order columns. This doesn’t seem like a big ask.
The flip side is that changing data at scale is HARD, so if you put things like this off, when you do finally need to squeeze bytes, it’s painful.
Also, memory is absolutely not plentiful. That’s generally the biggest bottleneck (or rather, the lack of it then makes IO the bottleneck) for an RDBMS, assuming you have connection pooling and aren’t saturating the CPU with overhead.
Taking the time to know the in-memory sizing of your data types is well worth it. Thinking about which types to use and sorting them by size takes minimal effort and is also well worth it.
It may make sense for the system to do it automatically for newly created tables. But maybe not as it’s possible you’d want the data layout to match some existing structure.
Index size is not solely a storage concern. I also don't really care about how much disk space I pay for, but sometimes I care a lot about how long it takes to vacuum a table.
It has to be said, the order of columns in correctly designed multicolumn BTREE indexes is governed by the shape of the queries the indexes support.
So don't arbitrarily reorder columns in your indexes to handle alignment.
(Articles that intersperse annoying memes every 3 paragraphs with obnoxious gif animations are much much worse).
But I see why this could be confusing. In the article he wrote:
> Postgres will happily add padding to the underlying data in order to make sure it is properly aligned at the physical layer. Having the data aligned ensures faster access time when retrieving pages from disk.
And this is correct. The problem is that "physical layer" refers to the physical memory layout and how things are loaded into the CPU, not (mostly) how they are stored on disk.
I'm no expert in this subject, but as far as I understand, the main factor behind this kind of behavior is the way a CPU reads data from its cache, i.e. one line (64 bytes) at a time. And this is why we always pad to divisors of 64 (2, 4, 8, 16, 32).
This is the first time I've read about this in the context of PG, but I've already encountered the same issue in C and Go. So for me this is just a new manifestation of the same underlying problem.
https://mecha-mind.medium.com/demystifying-cpu-caches-with-e...