I'm not a CPU designer, but shouldn't these be points that one could discover using higher-level simulators? I.e., before even needing to do FPGA or gate-level sims?
If so, are they doing a SpaceX thing where they iterate fast with known less-than-optimal solutions just to gain experience building the things?
But there's a tradeoff. It looks like they've chosen small area/low power over absolute speed. Which may be entirely valid for whatever use case they're aiming at.
Note from the git history that this is basically a 2021 design. https://github.com/XUANTIE-RV/openc910/commits/main/
https://linuxgizmos.com/dev-kit-debuts-risc-v-xuantie-c910-s...
Android was already shown running on an earlier C910 board (the ICE EVB, for the same THead ICE test chip) by January 2021:
https://www.hackster.io/news/alibaba-s-t-head-releases-open-...
Vperl, apparently
Also, although it's technically open source, if it's generated Verilog isn't that a lot less useful than the source that was used to generate the RTL?
I remember this post, by an ARM engineer who was highly critical of the RISC-V ISA:
The details would be in the microarchitecture, which would not be specified by RISC-V.
> This document was originally written several years ago. At the time I was working as an execution core verification engineer at Arm. The following points are coloured heavily by working in and around the execution cores of various processors. Apply a pinch of salt; points contain varying degrees of opinion.
> It is still my opinion that RISC-V could be much better designed; though I will also say that if I was building a 32 or 64-bit CPU today I'd likely implement the architecture to benefit from the existing tooling.
Just to take one example. Yes, on ARM and x86 you can often do array indexing in one instruction. And then it is broken down into several µops that don't run any faster than a sequence of simpler instructions (or, if it's not broken down, then it's on the critical path and forces a lower clock speed, just as the single-cycle multiply on Cortex-M0 does, for example).
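To make that concrete, here's the operation in question and roughly what each ISA emits for it. This is a hedged sketch from memory (the function name is mine, and register allocation varies by compiler):

    // Scaled array indexing: one addressing-mode instruction on
    // x86-64/arm64, a short sequence of simple ops on base RV64I.
    // Assumed typical codegen, roughly:
    //   x86-64:      mov rax, [rdi + rsi*8]
    //   arm64:       ldr x0, [x0, x1, lsl #3]
    //   RV64I:       slli a1, a1, 3 ; add a0, a0, a1 ; ld a0, 0(a0)
    //   RV64 + Zba:  sh3add a0, a1, a0 ; ld a0, 0(a0)
    long load_elem(const long *base, long idx) {
        return base[idx];
    }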
Plus, an isolated indexing into an array is rare and never speed critical. The important ones are in loops where the compiler uses "strength reduction" and "code motion out of loops" so that you're not doing "base + array_offset + index*elt_size" every time, but just "p++". And if the loop is important and tight then it is unrolled, so you get ".. = p[0]; .. = p[1]; .. = p[2]; .. = p[3]; p += 4", which RISC-V handles perfectly well.
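A minimal C sketch of that transformation (hand-written for illustration, with function names of my choosing; real compilers do this automatically at higher optimisation levels):

    // Naive form: every iteration conceptually computes base + i*8.
    long sum_naive(const long *a, long n) {
        long s = 0;
        for (long i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    // Strength-reduced and unrolled form: just a bumped pointer,
    // which base RISC-V loads handle directly.
    long sum_unrolled(const long *a, long n) {
        long s = 0;
        const long *p = a;
        const long *end = a + (n & ~3L);
        for (; p != end; p += 4) {
            s += p[0]; s += p[1]; s += p[2]; s += p[3];
        }
        for (const long *tail = a + n; p != tail; p++)  // leftover elements
            s += *p;
        return s;
    }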
"But code size!" you say. That one is extremely easily answered, and not with opinion and hand-waving. Download amd64, arm64, and riscv64 versions of your favourite Linux distro .. Ubuntu 24.04, say, but it doesn't matter which one. Run "size" on your choice of programs. The RISC-V will always be significantly smaller than the other two -- despite supposedly being missing important instructions.
A lot of the criticisms were of a reflexive "bigger is better" nature, without any examination of HOW MUCH better, or of what else you can no longer do because of that cost. For example, both conditional-branch range and JAL/JALR range are criticised as being limited by the one or more 5-bit register specifiers in the instruction: "compare and branch" is a single instruction (instead of using condition codes), and JAL/JALR explicitly specify where to store the return address instead of always using the same register.
RISC-V conditional branches have a range of ±4 KB while arm64 conditional branches have a range of ±1 MB. Is it better to have 1 MB? In the abstract, sure. But how often do you actually use it? 4 KB is already a very large function -- let alone loop -- in modern code. If you really need it then you can always branch on the opposite condition over an unconditional ±1 MB jump. If your loop is that large then the overhead of one more instruction is going to be far down in the noise .. 0.1% maybe. I look at a LOT of compiled code and I can't recall the last time I saw such a thing in practice.
What you DO see a lot of is very tight loops, where on a low end processor doing compare-and-branch in a single instruction makes the loop 10% or 20% faster.
This is false. You can find examples of both x86-64 and aarch64 CPUs that handle indexed addressing with no extra latency penalty. For example, AMD's Athlon through Family 10h has a 3-cycle load-to-use latency even with indexed addressing. I can't remember off the top of my head which aarch64 cores do it, but I've definitely come across some.
For the x86-64/aarch64 cores that do take additional latency, it's often just one cycle for indexed loads. To do indexed addressing with "simple" instructions, you'd need at least a shift and a dependent add. That's two extra cycles of latency.
But that is all "true but irrelevant" -- it misses the point.
You can't just compare the speed of an isolated scaled indexed load/store. No one runs software that consists only, or even mostly, of isolated scaled indexed load/store.
You need to show that there is a measurable and significant effect on overall execution speed of the whole program to justify the extra hardware of jamming all of that into one instruction.
A good start would be to modify the compiler for your x86 or Arm to not use those instructions and see if you can detect the difference on SPEC or your favourite real-world workload -- the same experiment that Cocke conducted on IBM 370 and Patterson conducted on VAX.
But even that won't catch the possibility that a RISC-V CPU might need slightly more clock cycles but the processor is enough simpler that it can clock slightly higher. Or enough smaller that you can use less energy or put more cores in the same area of silicon.
And as I said, in the cases where the speed actually matters it's probably in a loop and strength-reduced anyway.
It's so lazy and easy to say that for every single operation faster is better, but many operations are not common enough to matter.
Though I'd imagine that just having the extra cycle conditionally for indexed load/store instrs would still be better than having a whole extra instruction take up decode/ROB/ALU resources (and the respective power cost), or the mess that comes with instruction fusion.
And with RISC-V already requiring a 12-bit adder for loads/stores, and thus an increment/decrement for the top 52 bits, the extra latency of going to a full 64-bit adder is presumably quite a bit less than that of a separate full 64-bit add. (And if the mandatory 64+12-bit adder already pushed the latency up by a cycle, a separate shNadd will result in two cycles of latency over the hypothetical adderless case, despite one cycle clearly being feasible!)
Even if the RISC-V way might be fine for tight loops, most code isn't such. And ideally most tight loops doing consecutive loads would vectorize anyway.
We're in a world where the latest Intel cores can do small immediate adds at rename, usually materializing them in consuming instructions, which I'd imagine is quite a bit of overhead for not that much benefit.
I'll also note that only x86 can do base + scaled index + constant offset in one instruction. Arm needs two instructions, just like RISC-V.
Just ran a quick benchmark - it seems Haswell handles "mov rbx, QWORD PTR [rbx+imm]" with 4c latency if there are no chained instructions (5c latency in all other cases, including an indexed load without chained instructions, and "mov rbx, QWORD PTR [rbx+rcx*8+0x12345678]" always). So even though there are cases where the indexed load pushes it over to the next cycle, there are also cases where the indexed load is free.
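For anyone wanting to reproduce that kind of number, here's a minimal pointer-chase sketch (my own illustration, not the exact harness used above; all names are mine). The chased loads are data-dependent, so the time per iteration approximates load-to-use latency, and the idx = next[idx] line typically compiles to an indexed load along the lines of mov rax, [rbx+rax*8]:

    /* Hedged sketch of a load-to-use latency measurement: build a
     * random single cycle over a small array (Sattolo's algorithm)
     * and chase it with data-dependent loads. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N     (1 << 10)    /* 8 KB of indices: stays in L1 */
    #define ITERS (1L << 26)

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        uint64_t *next = malloc(N * sizeof *next);
        for (uint64_t i = 0; i < N; i++) next[i] = i;
        for (uint64_t i = N - 1; i > 0; i--) {   /* Sattolo: one big cycle */
            uint64_t j = (uint64_t)rand() % i;
            uint64_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        uint64_t idx = 0;
        double t0 = now_sec();
        for (long k = 0; k < ITERS; k++)
            idx = next[idx];   /* dependent indexed load: base + idx*8 */
        double t1 = now_sec();

        printf("%.2f ns per chained load (idx=%llu)\n",
               (t1 - t0) / ITERS * 1e9, (unsigned long long)idx);
        free(next);
        return 0;
    }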
And indexed loads aren't a "here or there", they're a pretty damn common thing; like, a ton more common than most instructions in Zbb/Zbc/Zbs.
But I have a hard time imagining that my general point of "if there's headroom for a full 64-bit adder in the AGU, adding one is very cheap and can provide a couple percent boost in applicable programs" is far from true. Though the register file port requirement might make that less trivial than I'd like it to be.
But yeah, modern x86-64 doesn't have any latency difference between indexed and plain loads[0], nor does Apple M1[1] (nor even Cortex-A53, per some local runs of dougallj's tests; there's an extra cycle of latency if the scale doesn't match the load width, but that doesn't apply to typical usage).
Of course one has to wonder whether that's ended up costing the plain loads something; it kinda saddens me seeing unrolled loops on x86 result in a spam of [r1+r2*8+const] addresses, with the CPU having to evaluate that arithmetic for each, when typically the index could be moved out of the loop (though at the cost of needing to bump multiple pointers if there are multiple). But x86 does handle it, so I suppose there's not much downside. Of course, this isn't applicable to loads outside of tight loops.
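For concreteness, a hedged C sketch of that tradeoff (function names are mine): the indexed form keeps a single induction variable but every access is effectively [base + i*8 (+ const)], while the pointer-bumped form avoids the per-access scaling at the cost of one pointer increment per array:

    // Indexed form: one induction variable, scaled addressing on
    // every access.
    void add_indexed(long *c, const long *a, const long *b, long n) {
        for (long i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    // Pointer-bumped form: no per-access scaling arithmetic, but
    // three pointers to bump instead of one index.
    void add_bumped(long *c, const long *a, const long *b, long n) {
        for (const long *end = a + n; a != end; a++, b++, c++)
            *c = *a + *b;
    }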
I'd imagine at some point (if not already past 8-wide) the idea of "just go wider and spam instruction fusion patterns" will have to yield to adding more complex instructions to keep silicon costs sane.
[0]: https://uops.info/table.html?search=%22mov%20(r64%2C%20m64)%...
[1]: https://dougallj.github.io/applecpu/measurements/firestorm/L... vs https://dougallj.github.io/applecpu/measurements/firestorm/L...
How would you demonstrate that besides actually building such a chip?
What is the disadvantage for a country that only has access to computer technology from the 2010s? They will still make the same airplanes, drones, radars, tanks, and whatever.
It seems to me that it is nice to have SOTA manufacturing capability for semiconductors, but not necessary.
If played well, it could even let them win the AI race even if they and everyone else have to struggle for a decade.
At the same cost and speed? Volume matters.
> is crucial to the national security
National security isn't just about military power. Without the latest chips, e.g. if there were sanctions, it could impact the economy. A nation can become insecure, for example, through more and more people suffering from poverty.
Eventually there'll be fully autonomous drones and how competitive they'll be will be directly proportional to how fast they can react relative to enemy drones. All other things being equal, the country with faster microchips will have better combat drones.
That's very unlikely imo. When it comes to drones, no matter how fast your computation is, there are other bottlenecks: how fast the motors can spin up, how fast the sensor readings are, how battery-efficient they are, etc.
Right now 8-bit ESCs are still as competitive as 32-bit ESCs, and a lot of the "follow me" tasks use a lot less computational power than what your typical smartphone offers these days...
A drone moves faster, but I don't think that changes the calculations.
There doesn't seem to be great interest in having (small) drones shoot things yet; all the current uses seem to be:
- drone itself is the munition
- drone is the spotter for other ground based artillery
- drone dropping unguided munitions (e.g. grenades)
"Large" drones (aircraft rather than quadcopter) seem to follow the same rules as manned aircraft and engage with guided or unguided munitions of their own. If the drone is cheap enough then "drone as munition" seems likely to win.
China did not want to integrate. China has been seeking strategic independence in its economy by developing alternative layers of global economic ties, including the Belt and Road Initiative, PRC-centered supply chains, and emerging-country groupings, for longer than the US has.
Too many technological ties with China are seen as a potential vulnerability. It's not just the technology itself, but its importance in trade and the economy. If the US or its allies have value chains tightly integrated with China for strategic components, it creates dependence.
If anything, China's rise is a stabilizing factor for the whole world. It balances the aggression originating from the United States.
Try running an LLM on hardware from 2010.
Training a frontier model, probably not. Then again, it's not clear what the strategic benefit is of having access to a frontier LLM vs a more compact and less capable one.
That said, doesn't the C910 come with a critically buggy vector block?
It is an amazing achievement in a saturated market. The road to a fully mature and performant large RISC-V implementation is still long (to catch up with the other ones)...
... but a royalty free ISA is priceless, seriously.
- If points are (significantly) higher than comments: the submission is something very niche or highly technical that a lot of people can appreciate but only a few can meaningfully comment on. See this one right now: 120 vs 14.
- If comments are higher than points, or the ratio is gravitating towards 1:1: casual topic or flamewar (politics). See the DOGE post on the front page now, 1340 vs 2131.
That being said, I think "healthy" posts have a 1.5:1 to 2:1 ratio.
We also just saw Arm sue one of their customers following an acquisition of another of Arm's customers, and try to make them destroy IP that was covered by both customers' licenses. Nobody wants to deal with licensing, and when the licensor is that aggressive it makes open alternatives all the more compelling, even if they're not technically on-par.
And Softbank is all in on Team USA.
Getting funded by SoftBank is probably a good proxy indicator that a company is losing its competitive edge.
Arm is still in control. Some left to form another company, however.
Stop being so square and embrace the future, maaaaann.