I'm not a CPU designer, but shouldn't these be points that one could discover using higher-level simulators? I.e., before even needing to do FPGA or gate-level sims?
If so, are they doing a SpaceX thing where they iterate fast with known less-than-optimal solutions just to gain experience building the things?
But there's a tradeoff. It looks like they've chosen small area/low power over absolute speed. Which may be entirely valid for whatever use case they're aiming at.
Note from the git history that this is basically a 2021 design. https://github.com/XUANTIE-RV/openc910/commits/main/
https://linuxgizmos.com/dev-kit-debuts-risc-v-xuantie-c910-s...
Android was already shown running on an earlier C910 board (the ICE EVB, for the same THead ICE test chip) by January 2021:
https://www.hackster.io/news/alibaba-s-t-head-releases-open-...
Vperl, apparently
Also, although it's technically open source, if it's generated Verilog isn't that a lot less useful than the source that was used to generate the RTL?
I remember this post, by an ARM engineer who was highly critical of the RISC-V ISA:
The details would be in the microarchitecture, which would not be specified by RISC-V.
> This document was originally written several years ago. At the time I was working as an execution core verification engineer at Arm. The following points are coloured heavily by working in and around the execution cores of various processors. Apply a pinch of salt; points contain varying degrees of opinion.
> It is still my opinion that RISC-V could be much better designed; though I will also say that if I was building a 32 or 64-bit CPU today I'd likely implement the architecture to benefit from the existing tooling.
Just to take one example. Yes, on ARM and x86 you can often do array indexing in one instruction. And then it is broken down into several µops that don't run any faster than a sequence of simpler instructions (or, if it's not broken down, then it's on the critical path and forces a lower clock speed, just as the single-cycle multiply on Cortex-M0 does, for example).
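To make that concrete, here's the operation in question and roughly what each ISA emits for it. This is a hedged sketch from memory (the function name is mine, and register allocation varies by compiler):

    // Scaled array indexing: one addressing-mode instruction on
    // x86-64/arm64, a short sequence of simple ops on base RV64I.
    // Assumed typical codegen, roughly:
    //   x86-64:      mov rax, [rdi + rsi*8]
    //   arm64:       ldr x0, [x0, x1, lsl #3]
    //   RV64I:       slli a1, a1, 3 ; add a0, a0, a1 ; ld a0, 0(a0)
    //   RV64 + Zba:  sh3add a0, a1, a0 ; ld a0, 0(a0)
    long load_elem(const long *base, long idx) {
        return base[idx];
    }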
Plus, an isolated indexing into an array is rare and never speed critical. The important ones are in loops where the compiler uses "strength reduction" and "code motion out of loops" so that you're not doing "base + array_offset + index*elt_size" every time, but just "p++". And if the loop is important and tight then it is unrolled, so you get ".. = p[0]; .. = p[1]; .. = p[2]; .. = p[3]; p += 4", which RISC-V handles perfectly well.
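A minimal C sketch of that transformation (hand-written for illustration, with function names of my choosing; real compilers do this automatically at higher optimisation levels):

    // Naive form: every iteration conceptually computes base + i*8.
    long sum_naive(const long *a, long n) {
        long s = 0;
        for (long i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    // Strength-reduced and unrolled form: just a bumped pointer,
    // which base RISC-V loads handle directly.
    long sum_unrolled(const long *a, long n) {
        long s = 0;
        const long *p = a;
        const long *end = a + (n & ~3L);
        for (; p != end; p += 4) {
            s += p[0]; s += p[1]; s += p[2]; s += p[3];
        }
        for (const long *tail = a + n; p != tail; p++)  // leftover elements
            s += *p;
        return s;
    }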
"But code size!" you say. That one is extremely easily answered, and not with opinion and hand-waving. Download amd64, arm64, and riscv64 versions of your favourite Linux distro .. Ubuntu 24.04, say, but it doesn't matter which one. Run "size" on your choice of programs. The RISC-V will always be significantly smaller than the other two -- despite supposedly being missing important instructions.
A lot of the criticisms were of a reflexive "bigger is better" nature, without any examination of HOW MUCH better, or of what else you can no longer do because of that cost. For example, both conditional-branch range and JAL/JALR range are criticised as being limited by the one or more 5-bit register specifiers in the instruction: "compare and branch" is a single instruction (instead of using condition codes), and JAL/JALR explicitly specify where to store the return address instead of always using the same register.
RISC-V conditional branches have a range of ±4 KB while arm64 conditional branches have a range of ±1 MB. Is it better to have 1 MB? In the abstract, sure. But how often do you actually use it? 4 KB is already a very large function -- let alone loop -- in modern code. If you really need it then you can always branch on the opposite condition over an unconditional ±1 MB jump. If your loop is that large then the overhead of one more instruction is going to be far down in the noise .. 0.1% maybe. I look at a LOT of compiled code and I can't recall the last time I saw such a thing in practice.
What you DO see a lot of is very tight loops, where on a low end processor doing compare-and-branch in a single instruction makes the loop 10% or 20% faster.
This is false. You can find examples of both x86-64 and aarch64 CPUs that handle indexed addressing with no extra latency penalty. For example, AMD's Athlon through Family 10h has a 3-cycle load-to-use latency even with indexed addressing. I can't remember off the top of my head which aarch64 cores do it, but I've definitely come across some.
For the x86-64/aarch64 cores that do take additional latency, it's often just one cycle for indexed loads. To do indexed addressing with "simple" instructions, you'd need at least a shift and a dependent add. That's two extra cycles of latency.
But that is all "true but irrelevant" -- it misses the point.
You can't just compare the speed of an isolated scaled indexed load/store. No one runs software that consists only, or even mostly, of isolated scaled indexed load/store.
You need to show that there is a measurable and significant effect on overall execution speed of the whole program to justify the extra hardware of jamming all of that into one instruction.
A good start would be to modify the compiler for your x86 or Arm to not use those instructions and see if you can detect the difference on SPEC or your favourite real-world workload -- the same experiment that Cocke conducted on IBM 370 and Patterson conducted on VAX.
But even that won't catch the possibility that a RISC-V CPU might need slightly more clock cycles but the processor is enough simpler that it can clock slightly higher. Or enough smaller that you can use less energy or put more cores in the same area of silicon.
And as I said, in the cases where the speed actually matters it's probably in a loop and strength-reduced anyway.
It's so lazy and easy to say that for every single operation faster is better, but many operations are not common enough to matter.
Though I'd imagine that just having the extra cycle conditionally for indexed load/store instrs would still be better than having a whole extra instruction take up decode/ROB/ALU resources (and the respective power cost), or the mess that comes with instruction fusion.
And with RISC-V already requiring a 12-bit adder for loads/stores, and thus an increment/decrement for the top 52 bits, the extra latency of going to a full 64-bit adder is presumably quite a bit less than that of a separate full 64-bit add. (And if the mandatory 64+12-bit adder already pushed the latency up by a cycle, a separate shNadd will result in two cycles of latency over the hypothetical adderless case, despite one cycle clearly being feasible!)
Even if the RISC-V way might be fine for tight loops, most code isn't such. And ideally most tight loops doing consecutive loads would vectorize anyway.
We're in a world where the latest Intel cores can do small immediate adds at rename, usually materializing them in consuming instructions, which I'd imagine is quite a bit of overhead for not that much benefit.
I'll also note that only x86 can do base + scaled index + constant offset in one instruction. Arm needs two instructions, just like RISC-V.
Just ran a quick benchmark - it seems Haswell handles "mov rbx, QWORD PTR [rbx+imm]" with 4c latency if there are no chained instructions (5c latency in all other cases, including an indexed load without chained instructions, and "mov rbx, QWORD PTR [rbx+rcx*8+0x12345678]" always). So even though there are cases where the indexed load pushes it over to the next cycle, there are also cases where the indexed load is free.
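For anyone wanting to reproduce that kind of number, here's a minimal pointer-chase sketch (my own illustration, not the exact harness used above; all names are mine). The chased loads are data-dependent, so the time per iteration approximates load-to-use latency, and the idx = next[idx] line typically compiles to an indexed load along the lines of mov rax, [rbx+rax*8]:

    /* Hedged sketch of a load-to-use latency measurement: build a
     * random single cycle over a small array (Sattolo's algorithm)
     * and chase it with data-dependent loads. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N     (1 << 10)    /* 8 KB of indices: stays in L1 */
    #define ITERS (1L << 26)

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        uint64_t *next = malloc(N * sizeof *next);
        for (uint64_t i = 0; i < N; i++) next[i] = i;
        for (uint64_t i = N - 1; i > 0; i--) {   /* Sattolo: one big cycle */
            uint64_t j = (uint64_t)rand() % i;
            uint64_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        uint64_t idx = 0;
        double t0 = now_sec();
        for (long k = 0; k < ITERS; k++)
            idx = next[idx];   /* dependent indexed load: base + idx*8 */
        double t1 = now_sec();

        printf("%.2f ns per chained load (idx=%llu)\n",
               (t1 - t0) / ITERS * 1e9, (unsigned long long)idx);
        free(next);
        return 0;
    }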
And indexed loads aren't a "here or there", they're a pretty damn common thing; like, a ton more common than most instructions in Zbb/Zbc/Zbs.
But I have a hard time imagining that my general point of "if there's headroom for a full 64-bit adder in the AGU, adding one is very cheap and can provide a couple percent boost in applicable programs" is far from true. Though the register file port requirement might make that less trivial than I'd like it to be.
But yeah, modern x86-64 doesn't have any latency difference between indexed and plain loads[0], nor does Apple M1[1] (nor even Cortex-A53, per some local runs of dougallj's tests; there's an extra cycle of latency if the scale doesn't match the load width, but that doesn't apply to typical usage).
Of course one has to wonder whether that's ended up costing the plain loads something; it kinda saddens me seeing unrolled loops on x86 result in a spam of [r1+r2*8+const] addresses, with the CPU having to evaluate that arithmetic for each, when typically the index could be moved out of the loop (though at the cost of needing to bump multiple pointers if there are multiple). But x86 does handle it, so I suppose there's not much downside. Of course, this isn't applicable to loads outside of tight loops.
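For concreteness, a hedged C sketch of that tradeoff (function names are mine): the indexed form keeps a single induction variable but every access is effectively [base + i*8 (+ const)], while the pointer-bumped form avoids the per-access scaling at the cost of one pointer increment per array:

    // Indexed form: one induction variable, scaled addressing on
    // every access.
    void add_indexed(long *c, const long *a, const long *b, long n) {
        for (long i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    // Pointer-bumped form: no per-access scaling arithmetic, but
    // three pointers to bump instead of one index.
    void add_bumped(long *c, const long *a, const long *b, long n) {
        for (const long *end = a + n; a != end; a++, b++, c++)
            *c = *a + *b;
    }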
I'd imagine at some point (if not already past 8-wide) the idea of "just go wider and spam instruction fusion patterns" will have to yield to adding more complex instructions to keep silicon costs sane.
[0]: https://uops.info/table.html?search=%22mov%20(r64%2C%20m64)%...
[1]: https://dougallj.github.io/applecpu/measurements/firestorm/L... vs https://dougallj.github.io/applecpu/measurements/firestorm/L...
How would you demonstrate that besides actually building such a chip?
What is the disadvantage for a country that only has access to computer technology from the 2010s? They will still make the same airplanes, drones, radars, tanks, and whatever.
It seems to me that it is nice to have SOTA manufacturing capability for semiconductors, but not necessary.
If played well, it could even let them win the AI race even if they and everyone else have to struggle for a decade.
At the same cost and speed? Volume matters.
> is crucial to the national security
National security isn't just about military power. Without the latest chips, e.g. if there were sanctions, it could impact the economy. A nation can become insecure, for example, through more and more people suffering from poverty.
Eventually there'll be fully autonomous drones and how competitive they'll be will be directly proportional to how fast they can react relative to enemy drones. All other things being equal, the country with faster microchips will have better combat drones.
That's very unlikely imo. When it comes to drones, no matter how fast your computation is, there are other bottlenecks: how fast the motors can spin up, how fast the sensor readings are, how battery-efficient they are, etc.
Right now 8-bit ESCs are still as competitive as 32-bit ESCs, and a lot of the "follow me" tasks use a lot less computational power than what your typical smartphone offers these days...
A drone moves faster, but I don't think that changes the calculations.
There doesn't seem to be great interest in having (small) drones shoot things yet; all the current uses seem to be:
- drone itself is the munition
- drone is the spotter for other ground based artillery
- drone dropping unguided munitions (e.g. grenades)
"Large" drones (aircraft rather than quadcopter) seem to follow the same rules as manned aircraft and engage with guided or unguided munitions of their own. If the drone is cheap enough then "drone as munition" seems likely to win.
China did not want to integrate. China has been seeking strategic independence in its economy by developing alternative layers of global economic ties, including the Belt and Road Initiative, PRC-centered supply chains, and emerging-country groupings, for longer than the US has.
Too many technological ties with China are seen as a potential vulnerability. It's not just the technology itself, but its importance in trade and the economy. If the US or its allies have value chains tightly integrated with China for strategic components, it creates dependence.
If anything, China's rise is a stabilizing factor for the whole world. It balances the aggression originating from the United States.
Try running an LLM on hardware from 2010.
Training a frontier model, probably not. Then again, it's not clear what the strategic benefit is of having access to a frontier LLM vs a more compact and less capable one.
That said, doesn't the C910 come with a critically buggy vector block?
It is an amazing achievement in a saturated market. The road to a fully mature and performant large RISC-V implementation is still long (to catch up with the other ones)...
... but a royalty free ISA is priceless, seriously.
- If points are (significantly) higher than comments: the submission is something very niche or highly technical that a lot of people can appreciate but only a few can meaningfully comment on. See this one right now: 120 vs 14.
- If comments are higher than points, or the ratio is gravitating towards 1:1: casual topic or flamewar (politics). See the DOGE post on the front page now, 1340 vs 2131.
That being said, I think "healthy" posts have a 1.5:1 to 2:1 ratio.
We also just saw Arm sue one of their customers following an acquisition of another of Arm's customers, and try to make them destroy IP that was covered by both customers' licenses. Nobody wants to deal with licensing, and when the licensor is that aggressive it makes open alternatives all the more compelling, even if they're not technically on-par.
And Softbank is all in on Team USA.
Getting funded by SoftBank is probably a good proxy indicator that a company is losing its competitive edge.
Arm is still in control. Some left to form another company, however.
Stop being so square and embrace the future, maaaaann.