Do other ISAs have anything similar?
ARM cores also have a similar set of promises.
Perhaps this indicates a missing concept in our compilation and execution of code: metadata that demands a section of code not be optimized.
It doesn't carry any security guarantees (Rust's documentation for its wrapper around it explicitly rules out "cryptographic or security" purposes), but one could imagine a similar intrinsic with stronger guarantees.
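Roughly what such an intrinsic looks like in C today is the empty inline-asm barrier (a GCC/Clang extension; this is only a sketch). Like Rust's wrapper, it constrains the compiler and promises nothing about the hardware:

    #include <stdint.h>

    /* Ask the compiler to treat x as an unknowable value so it can't
       constant-fold it or specialize branches on it. This constrains
       only the compiler; the CPU still does whatever it wants. */
    static inline uint32_t opaque_u32(uint32_t x) {
        __asm__ volatile("" : "+r"(x));   /* GCC/Clang inline-asm barrier */
        return x;
    }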
I mean, I can walk you through how my C# gets compiled down to IL, but start going lower and, certainly by the time you're in assembly and machine code, it's as inscrutable to me as how AI spits out answers.
The days when someone could explain how every piece of these machines worked are long gone
But other than that, yes: with techniques like load value prediction and memory-data-dependent prefetching, there is little you can do at the codegen level.
I can recognize some shared concepts and a general vibe maybe but I am materially incapable of reading it without hundreds of hours of study
What I'm pointing out, though, is that I don't actively know it currently and am still able to produce software and earn a significant income despite that. I don't know that everyone failing to understand how AI works, or even how the software it produces is supposed to work, adds a new type of risk to society.
> 1. Compilers are applying optimization techniques that are heuristically good for performance on general software, but happen to leak information through timing-based sidechannels when used on secret data. Since general-purpose CPUs and compilers are applied to a wide variety of tasks, compilers have no reason to stop doing such optimizations.
> 2. Constant-time coding techniques mostly aim at fooling the compiler, to prevent it from applying these optimizations. Since compilers keep getting smarter, these techniques lose their efficiency.
> 3. Just-in-time (JIT) compilers have access to runtime data; they can, and will, use such information to perform extra optimizations that static compilers cannot do, and further destroy the developer’s attempt at achieving constant-time code.
> 4. JIT compilation is becoming pervasive and is now employed in various contexts, in particular inside CPUs. Such compilers can do the same optimizations as “external” JIT compilers, but are not publicly documented, and the output of their “machine” code translation cannot be inspected to detect deviations from constant-time execution.
> 5. Modern hardware and software stacks use a very layered structure, with each layer striving to hide its internal functioning details from upper layers. The socio-economic context which allowed such hardware to exist inherently relies on such abstraction (under the name of “industrial secrecy”). An effective constant-time coding process would require a model of computation with strong guarantees on timing-related characteristics, that would be maintained through all these layers, down to the semiconductors that implement the logic gates. This industry-wide vertical cooperation is unlikely to happen.
I imagine it's not *easy* easy, because it requires every layer to cooperate, but if there's demand I think it would be done.
IPv6 is in a similar situation where it requires widespread adoption, but I also suspect most people have issues with it, and there isn't much demand because NAT (and whatever support cellular networks have) has made it unnecessary.
Compilers like Clang actually generate terrible code up front; it's expected that a sufficiently smart optimizer (of which LLVM is one) will clean it up anyway, so Clang makes no attempt to generate good code. Rust is similar. For example, a simple for-loop's induction variable is stored to and loaded from an alloca (i.e., the stack) on every use; it isn't an SSA variable. So one of the first things in the optimization pipeline is to promote those to SSA registers/variables. Disabling that would cost a ton of perf just right there, never mind the impact on instruction combining/value tracking/scalar evolution, and crypto is pretty perf-sensitive, right after security.
BTW, Clang/LLVM already has such a function-level attribute, `optnone`, which was actually added to support LTO. But it's all or nothing; LLVM IR/Clang doesn't have the info needed to know what instructions are timing sensitive.
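For reference, a sketch of how that attribute looks from C with Clang (the function name and body are just an illustration). It switches off optimization for the entire function, which is exactly the all-or-nothing problem:

    #include <stddef.h>
    #include <stdint.h>

    /* Clang skips its optimization pipeline for this one function.
       Note the granularity: per-function, not per-value, so there is
       no way to say "only `secret` is timing-sensitive". */
    __attribute__((optnone))
    int ct_compare(const uint8_t *secret, const uint8_t *guess, size_t n) {
        uint8_t diff = 0;
        for (size_t i = 0; i < n; i++)
            diff |= secret[i] ^ guess[i];
        return diff == 0;
    }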
Note that memset_s() is an existing special case already. https://en.cppreference.com/w/c/string/byte/memset
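Usage is roughly as below, assuming a libc that actually ships C11 Annex K (many don't; explicit_bzero or C23's memset_explicit are the more commonly available relatives):

    #define __STDC_WANT_LIB_EXT1__ 1   /* opt in to Annex K, if the libc has it */
    #include <string.h>

    void wipe_key(unsigned char *key, size_t len) {
        /* Unlike plain memset, the standard says this call may not be
           optimized away as a dead store. */
        memset_s(key, len, 0, len);
    }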
> Modern hardware and software stacks use a very layered structure, with each layer striving to hide its internal functioning details from upper layers. The socio-economic context which allowed such hardware to exist inherently relies on such abstraction (under the name of “industrial secrecy”). An effective constant-time coding process would require a model of computation with strong guarantees on timing-related characteristics, that would be maintained through all these layers, down to the semiconductors that implement the logic gates. This industry-wide vertical cooperation is unlikely to happen
It's not mentioned in the paper, but hardware description languages provide strong guarantees on timing-related characteristics at the semiconductor level. Hardware companies are also willing to provide user access to FPGAs, as evidenced by the userspace FPGA subsystem in the Linux kernel.[1] You can buy Versal SoCs right now that combine ARM CPUs with FPGAs on one die, run Linux on them, and outsource your crypto tasks to the FPGA.
I see a future where we give up on constant-time coding for CPUs and move crypto libraries to FPGAs, given the hundreds of billions of dollars in our economy reliant on strong encryption. That'd be a world in which every corporate laptop is required to have an FPGA for compliance purposes.
It'd be interesting to know if anyone has tried this.
[1] https://www.phoronix.com/news/FPGA-User-Space-Interface-AMD
Many high-frequency trading companies use FPGAs over ASICs for similar reasons. FPGAs are more expensive but allow them to have full control over the algorithms implemented and don't require giving secret information to a foundry.
In other words, eliminate the impedance mismatch between responsibility and control for developing a secure system.
It'll be cheaper to implement cryptography on an ASIC. But the author of this paper wants to control every single aspect of the encryption/decryption process. Said developer can confidently say the system is secure. You can't say you've delivered a secure system if you're getting your ASICs from another company that doesn't provide implementation details because it'd (justifiably) give others a competitive advantage.
Question I'd have is whether the cost difference between ASICs/FPGAs is worth it for the majority of applications. $1 or $10 on every CPU might mean a world in which every laptop has an FPGA, but $100? What about for server-side applications? Would a hyperscaler spend $1000 extra on every rack if it allowed guaranteed constant-time encryption?
The easiest migration from FPGA to ASIC is to just make a chip with the same circuit elements and wire segments, but instead of making switches, you just short out connections in the "on" state and leave the rest open.
Meanwhile, Windows is requiring that every laptop have a small crypto processor for its own compliance processes (i.e., BitLocker).
In terms of moving the data over to the FPGA, I have no idea. But if it's all on the same die, it doesn't seem like there's a big physical attack surface that's different than just doing stuff on the CPU.
Abstractions cannot send messages before they receive them. So any delay you add at the top must be magnified as you go down. The exception to this is if the contents of the message require different behaviors for different payloads, in which case they are correct. But encrypted payloads are opaque to the layers they are sent through. You can observe who the message was sent to, and know the maximum amount of data the conversation might have contained. But not a very clear idea of how long it took to build the message, unless you’ve already compromised the machine.
So practice your martial arts on the bow of a rowboat. Balance, Daniel-san.
The whole timing thing is essentially an issue of getting off-balance, as in martial arts, and the solution is that there are actions you simply never do, and ones that look like they lack economy of motion because so much of the work is in keeping your center of gravity from shifting outside of your base. Tai chi is infinitely better at protecting you from concussing yourself on black ice than it is at protecting you from a mugger.
So if you have a conditional move for one calculation, then move a to b if !c and move it to d if c. And when the compilers chase that down in another generation, or someone notices that the L1 cache timing is different if d gets cold, then use both calculations in the output. Calculate complements where the runtime of a + !a = 1 to several decimal places.
Will it work for eternity? Unlikely. Will it work for a decade? Probably.
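For anyone following along, the standard branch-free select that CMOV-style code boils down to looks like this minimal sketch (not the parent's exact complement trick):

    #include <stdint.h>

    /* Branch-free select: returns a when c == 1, b when c == 0.
       Both inputs are always read and the mask is all-ones or all-zeros,
       so there is no secret-dependent branch, until an optimizer decides
       to turn it back into one (the cat-and-mouse game described above). */
    static inline uint32_t ct_select_u32(uint32_t c, uint32_t a, uint32_t b) {
        uint32_t mask = (uint32_t)0 - (c & 1);   /* c=1 -> 0xFFFFFFFF, c=0 -> 0 */
        return (a & mask) | (b & ~mask);
    }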
ChaCha20 or Kyber might be using that. Although, I don't know if any cryptographic algorithm depends on vector-only calculation as a basis for its strength.
Alternatively, I suppose that specialised cryptographic CPUs could be made to follow strict standards ensuring constant-time execution.
Is that actually guaranteed? Having to do the whole vector doesn't necessarily mean the time can't vary for different values in complex operations (like division).
I can't find the details, but https://gleissen.github.io/papers/iodine.pdf at least mentions "Dually, if timing variability is unavoidable, e.g., in SIMD or floating-point units, making this variability explicit can better inform..." so it sounds like SIMD is at least situation-specific here.
So I would think if you treat all of the bytes as one SIMD instruction, do a SIMD compare, and then fire another operation as a result… if you could guarantee that the third operation would have to fire for at least one of the vector values every time, then I think you’d have your solution.
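As a rough sketch of the "whole vector at once" idea using SSE2 intrinsics; whether this actually survives the microarchitectural issues discussed elsewhere in the thread is the open question:

    #include <emmintrin.h>   /* SSE2 */
    #include <stdint.h>

    /* Compare two 16-byte blocks with one lane-parallel operation.
       Every byte lane is processed whether or not it matches, so there
       is no early exit for an attacker to time. */
    int ct_equal_16(const uint8_t *a, const uint8_t *b) {
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        __m128i eq = _mm_cmpeq_epi8(va, vb);      /* 0xFF in lanes that match */
        return _mm_movemask_epi8(eq) == 0xFFFF;   /* 1 iff all 16 lanes match */
    }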
> SHA-1 and AES also vary by the round
Vary what by the round? SHA-1 ( https://www.nayuki.io/res/cryptographic-primitives-in-plain-... ) is structured very similarly to MD5, with one big difference being that SHA-1 doesn't vary the rotation by the round number. AES has a public sequence of round constants that gets applied to the key expansion, but otherwise doesn't do anything special to the ciphertext per round.
The logic that can cause timing issues includes using a memory lookup table for the S-box and finite-field multiplication in AES, as well as data-dependent rotations in rare ciphers like https://en.wikipedia.org/wiki/RC5 and RC6.
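To make the S-box point concrete: the leak comes from indexing a table with a secret byte, and the usual (slow) mitigation is to read the whole table and select the wanted entry with masks. A minimal sketch, not taken from any particular library:

    #include <stdint.h>

    extern const uint8_t sbox[256];   /* some S-box table */

    /* Leaky: which cache line is touched depends on the secret index. */
    uint8_t sbox_lookup_leaky(uint8_t secret) {
        return sbox[secret];
    }

    /* Scan-the-table version: read every entry and keep only the one we
       want via a mask. Much slower, and still hostage to compiler/CPU
       cleverness, but there is no secret-dependent address. */
    uint8_t sbox_lookup_scanned(uint8_t secret) {
        uint8_t result = 0;
        for (uint32_t i = 0; i < 256; i++) {
            uint32_t diff = i ^ secret;
            uint8_t mask = (uint8_t)((diff - 1) >> 8);   /* 0xFF only when i == secret */
            result |= sbox[i] & mask;
        }
        return result;
    }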
The mention of CMOV is good. Additional tricks are available through vectorization - necessarily all elements of the vector have to execute at the same speed, which gives you extra options to put crafted data in a different column to force computation to proceed at a known rate.
We should not forget that some CPUs offer dedicated instructions - e.g. Intel AES-NI.
Another solution is to just stop trying to do it on the general purpose processor altogether and move it to some sort of HSM or TPM. Put your eggs in a smaller, more defensible basket. In a world of GPUs and AI accelerators it's easier to ask for a crypto accelerator (or rather decelerator!).
Is there any good work on constant-time crypto on the GPU?
If anyone manages to make homomorphic encryption viable, that provides another solution, at a huge performance penalty.
AES-NI, the SHA instructions, and constant-time subsets of instructions are generally good enough that you can do this in assembly.
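For example, a single AES round through the AES-NI intrinsic (sketch assumes x86 and compiling with -maes; the key schedule and round loop are omitted). The instruction is generally treated as data-independent in timing, unlike table-based software AES:

    #include <wmmintrin.h>   /* AES-NI intrinsics; compile with -maes */

    /* One full AES encryption round (the SubBytes/ShiftRows/MixColumns/
       AddRoundKey steps) in a single instruction, with no secret-indexed
       table loads for an attacker to observe. */
    __m128i aes_round(__m128i state, __m128i round_key) {
        return _mm_aesenc_si128(state, round_key);
    }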
1. Clock how long the operation takes on any type of data. By "operation" I mean something like a function call over a whole block or buffer.
2. Add a small margin of time to that.
3. Modify the function call to make sure it has waited at least that long before returning a response, whether it finished early or not.
The result prevents the function call from leaking timing information (a rough sketch follows below).
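A rough sketch of that padding idea on a POSIX system, using an absolute deadline; the budget is something you'd have to measure and pick yourself, which is exactly where the objections below come in:

    #define _POSIX_C_SOURCE 200809L
    #include <errno.h>
    #include <time.h>

    /* Run op(), then sleep until a fixed deadline so the caller always
       observes (at least) the same elapsed time, whether op() finished
       early or not. */
    void run_with_floor(void (*op)(void *), void *arg, long budget_ns) {
        struct timespec deadline;
        clock_gettime(CLOCK_MONOTONIC, &deadline);
        deadline.tv_nsec += budget_ns;
        while (deadline.tv_nsec >= 1000000000L) {   /* normalize the timespec */
            deadline.tv_nsec -= 1000000000L;
            deadline.tv_sec += 1;
        }

        op(arg);

        int r;
        do {   /* absolute-deadline sleep, retried if a signal interrupts it */
            r = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL);
        } while (r == EINTR);
    }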
High-assurance security certification in the 1980s also required mitigating covert storage channels. That would require, at a minimum, overwriting any memory used during the call and all the CPU registers before returning to the caller.
https://security.stackexchange.com/a/96493/271113
For example, consider the case of a cache-timing leak (rather than the classical "string compare" one). You'd have to have a lot of control over the behavior of your operating environment to guarantee it doesn't leak bits of your secret from cache misses.
When you add a delay to your processing, unless the same tasks are being executed, I would expect power-analysis leakage and random jitter from the kernel's scheduler to reveal that fact. It might take a more costly analysis to detect, but it's still there.
Generally, I think this is a doomed line of reasoning.
The previous comment was suggesting making sure that every code path takes the same amount of time, not adding a random delay (which doesn't work).
And while I agree that power-analysis attacks etc. are still going to apply, the overarching context here is just timing analysis.
I'm not just echoing the argument made by the link, though. I'm adding to it.
I don't think the "cap runtime to a minimum value" strategy will actually help, due to how much jitter your cap measurements will experience from the normal operation of the machine.
If you filter it out when measuring, you'll end up capping too low, so some values will be above it. For a visualization, let's pretend that you capped the runtime at 80% of what it actually takes in the real world:
    function biased() {
      // Pretend the real runtime is Math.random() and the cap is 0.8:
      // runs below the cap get clamped to it, the rest leak through.
      return Math.max(0.8, Math.random());
    }

    let samples = [];
    for (let i = 0; i < 1000; i++) {
      samples.push(biased());
    }
    // Now plot `samples`: a big spike at 0.8, plus a visible tail above it.
Alternatively, let's say you cap it sufficiently high that there's always some slack time at the end. Will the kernel switch away to another process on the same machine?
If so, will the time between "the kernel has switched to another process since we're really idling" to "the kernel has swapped back to our process" be measurable?
It's better to make sure your actual algorithm is actually constant-time, even if that means fighting with compilers and hardware vendors' decisions.
Not in the presence of DVFS, it turns out: https://www.hertzbleed.com/hertzbleed.pdf
Recently, people have come up with partitioned caches to deal with this. I don't know if they exist in production. A simple strategy is turning off shared caches while running processes of different security levels on their own cores. It also helps to invest in multi-core and many-core architectures for this.
Finally, many of us pushed for randomized execution or scheduling to throw off the timing of specific things. Combined with fine-grained processes (e.g., separation kernels), that should reduce what they can do a lot.
AES cache-timing was broken over a network (but required, like, 2^28 samples).
I wouldn't bet the farm on this line of thinking providing resilience. It might be just annoying enough for an attacker to not really bother. (Or maybe only if the target is, like, a software cryptocurrency wallet with enough value to loot if they're successful.)
Most of the points brought up in the article have already been happening for a decade (or decades); it's not something that 'will soon' happen.
I wonder if they’re fast enough…
That doesn't solve the hardware-JIT-level problem, but it would appear to solve the compiler-level problem.
ETA: I only read up to p. 3, but I did search for the text "volatile" and didn't find it.
This would be something like a 'secret' keyword that would qualify integer types; i.e. in C you could have a variable of type 'secret unsigned int' and the compiler would reject optimizations that would reveal timing information about the variable's contents, while optimizing the rest of the program's non-secret variables. Values can be promoted to secret but not demoted. Code that intends to reveal secrets (e.g. to write them out to secure storage) must explicitly cast away the secretness.
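Purely hypothetical syntax to make the idea concrete; no compiler accepts a `secret` qualifier today, and `load_key`, `store_to_secure_storage`, and the `(not_secret)` cast are made up for illustration:

    /* Hypothetical syntax: no current compiler accepts this, and
       load_key / store_to_secure_storage / (not_secret) are made up. */
    secret unsigned int key = load_key();    /* timing-protected              */
    unsigned int pub = 42;                   /* ordinary, freely optimizable  */

    secret unsigned int mixed = key ^ pub;   /* ok: public promotes to secret */
    unsigned int oops = mixed;               /* error: no implicit demotion   */

    if (key & 1) { /* ... */ }               /* error: branch on secret data  */

    store_to_secure_storage((not_secret)mixed);   /* explicit cast away */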
AFAIK Golang has crypto/subtle, but that's more a set of special functions for doing certain things in a data-independent way than pervasive language support for it (and, presumably, they have to keep updating the compiler to not optimize that specific module).
My guess is that bitslicing only gets you so far.