Editor: Ana Hu
Source: China Exportsemi Net
Recently, Karen Heyman, technical editor of Semiconductor Engineering, held in-depth discussions with several senior industry figures, including technical experts from Keysight, Arteris, Rambus, Cadence, and Siemens. This issue focuses on "the future development of memory." The China Overseas Semiconductor website has compiled the key points of the conversation for readers:
Picture: Memory technology experts discuss the future development of memory
How will CXL [1] and UCIe [2] play into the future of memory, especially given the cost of moving data?
Randy White, Keysight Memory Solutions Program Manager: The main goals of UCIe (Universal Chiplet Interconnect Express) are interoperability, along with reducing costs and increasing yields. Out of the gate, we're going to get better overall metrics with UCIe, and that will translate not just to memory but to other IP blocks as well. As for CXL (Compute Express Link), with the emergence of many different architectures focused more on artificial intelligence and machine learning, CXL will play a role in managing and minimizing costs. Total cost of ownership has always been an architectural metric for JEDEC, as have power consumption and performance. CXL is essentially optimized for disaggregated computing architectures, reducing over-engineering and the need to design around worst-case latency.
Frank Schirrmeister, VP of Solutions and Business Development, Arteris: If you look at networks-on-chip, protocols such as AXI (Advanced eXtensible Interface), CHI (Coherent Hub Interface), or OCP (Open Core Protocol) are all on-chip connectivity variants. But the moment you go off-die or off-chip, PCIe and CXL are the protocols for those interfaces. CXL has various usage models, including some notion of coherency between different components. At the Open Compute Project forum, when people talk about CXL, all they talk about is the memory-attached usage model.
And UCIe will always be one of the options for chip-to-chip connectivity. On the memory side, UCIe can be used in a chiplet environment, where you have an initiator and a target that attaches additional memory. UCIe, and latency across all of these connectivity options, plays an important role in how you structure your architecture to get data where it needs to be on time. AI/ML architectures live and die by data input and output. We haven't solved the memory wall yet, so from a system perspective you have to make architecturally wise choices about where to keep your data.
Steven Woo, Distinguished Inventor, Rambus: One of the tough challenges is that data sets keep getting larger, and one of the problems CXL can help solve is adding more memory capacity to the nodes themselves. These processors have an increasing number of cores, and each core needs a certain amount of memory capacity. On top of that, data sets are growing, so we need more memory capacity per node. There are many models in use today, so we're seeing data and computation spread across many nodes, especially in artificial intelligence, where large models are trained on many different processors. Protocols like CXL and UCIe give processors the flexibility to change how they access data. Both let implementers share and access data across multiple nodes in whatever way makes the most sense for them, and help address things like the memory wall as well as power and latency issues.
Frank Ferro, Product Management Group Director, Cadence: Regarding CXL, a lot has been said about memory pooling. On the more practical cost side, because of the size of servers and chassis in the data center, you can only stick so much memory in there before it becomes a cost burden. When you get to CXL 3.0, the ability to take your existing infrastructure and keep expanding it is really important to avoid stranded-memory situations, where your processor can't get access to the memory. CXL also adds another tier of memory, so you don't have to fall all the way back to storage/SSD, which minimizes latency. As for UCIe, with the advent of high-bandwidth memory and these very expensive 2.5D structures, UCIe may be a way to help disaggregate those memories and reduce cost. For example, if you have a large processor (GPU or CPU) and you want memory very close to it, such as high-bandwidth memory, you're going to have a fairly large footprint on the silicon interposer, or whatever interposer technology you use. That raises the cost of the overall system, because you need a piece of silicon big enough to host the CPU, the DRAM, and any other components you might want to install. With chiplets, I can put the memory on its own 2.5D structure, put the processor on a cheaper substrate, and connect the two over UCIe. That's a very interesting usage model for reducing cost.
Jongsin Yun, memory technology expert at Siemens EDA: At IEDM, there was a lot of discussion about artificial intelligence and different memories. AI model parameter counts have been climbing rapidly, growing roughly 40-fold in less than five years, so enormous amounts of data have to be fed to AI. However, DRAM performance and the underlying communication links have not improved at anywhere near that pace, only about 1.5x to 2x every two years, which is clearly below what AI actually demands. There have been attempts to improve the communication between memory and the chip, but a large gap remains between the data that memory can supply and the data that AI compute demands, and it still needs to be solved.
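To put Yun's two growth rates side by side, here is a small back-of-the-envelope sketch. The horizon and rates come from the figures quoted above; the compounding model itself is our own illustrative assumption:

```python
# Compare the growth rates Jongsin Yun cites: AI parameters growing
# ~40x in under five years vs. DRAM bandwidth improving ~1.5-2x every
# two years. Pure illustration of the widening supply/demand gap.

def compound_growth(factor: float, period_years: float, horizon_years: float) -> float:
    """Total growth over a horizon, compounding a per-period factor."""
    return factor ** (horizon_years / period_years)

horizon = 5.0  # years

ai_growth = 40.0  # ~40x over the horizon, per the quoted figure
dram_low = compound_growth(1.5, 2.0, horizon)   # ~2.8x over 5 years
dram_high = compound_growth(2.0, 2.0, horizon)  # ~5.7x over 5 years

print(f"AI parameter growth over {horizon:.0f} years: {ai_growth:.0f}x")
print(f"DRAM bandwidth growth over {horizon:.0f} years: "
      f"{dram_low:.1f}x - {dram_high:.1f}x")
print(f"Resulting supply/demand gap: "
      f"{ai_growth / dram_high:.0f}x - {ai_growth / dram_low:.0f}x")
```

Even under the optimistic 2x-every-two-years assumption, memory falls roughly 7x to 15x behind over five years, which is the gap Yun says still needs to be solved.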
How does memory help us solve power consumption and thermal issues?
Randy White, Keysight Memory Solutions Program Manager: A power problem is a memory problem. Something like 50% of data center costs come from memory, whether that's just I/O, or refresh management and cooling. We're talking about volatile memory here, specifically DRAM. As we've said, data volumes are huge, workloads are getting heavier, and speeds are getting faster, all of which means higher performance. As we scale, many initiatives have been launched to deliver the memory bandwidth needed to support these critical volumes, and power has risen accordingly. We've used a few tricks along the way, including lower voltage rails, better voltage regulation, and more efficient I/O signaling. We're also experimenting with more bank groups to make refresh management more efficient, which improves overall throughput as well.
A few years ago, a customer came to us wanting to propose significant changes to how JEDEC specifies memory in terms of temperature ranges. LPDDR has a wider range and different temperature classifications, but for the most part we're talking about commodity DDR, because that's where the capacity growth is and it's essentially standard in data centers. This customer wanted to propose to JEDEC that if we could reduce the operating temperature of DRAM by 5 degrees (since refresh rate increases with temperature), it would save the annual output of roughly three coal-fired power plants. So what's done at the device level translates into the macro picture at the global power-plant level. Also, at the architectural level, over-provisioning in memory design has been around for quite some time. We've introduced the PMIC (power management IC), so voltage regulation is now done at the module level. We have onboard temperature sensors, so the system can now monitor temperature directly on the module. With accurate module- and device-level temperatures, there's much more that thermal management can do.
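White's 5-degree example turns on the fact that DRAM refresh rates step up with temperature. Below is a minimal sketch, assuming the common JEDEC DDR4-style rule of thumb that the average refresh interval (tREFI) halves above 85C; it models only the step change, not any particular device:

```python
# Why a few degrees matters for DRAM refresh power: above the
# normal-temperature threshold the refresh interval is halved, so the
# part must refresh twice as often. Simplified DDR4-style model.

def trefi_us(temp_c: float) -> float:
    """Average refresh interval in microseconds at a given case temperature."""
    base_trefi = 7.8  # us, normal-temperature refresh (<= 85C)
    return base_trefi / 2 if temp_c > 85.0 else base_trefi

def relative_refresh_power(temp_c: float) -> float:
    """Refresh power relative to normal-temperature operation.
    Refresh power scales inversely with the refresh interval."""
    return trefi_us(85.0) / trefi_us(temp_c)

for temp in (80, 85, 90):
    print(f"{temp}C: tREFI = {trefi_us(temp):.1f} us, "
          f"relative refresh power = {relative_refresh_power(temp):.1f}x")

# A device running at 90C spends twice the refresh power of one at 85C.
# Cooling it back below the threshold recovers that energy across every
# DRAM device in the data center, which is how a 5-degree change can
# add up to power-plant scale.
```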
Frank Schirrmeister, VP of Solutions and Business Development, Arteris: If DRAM were a person, it would definitely feel shunned, because nobody wants to talk to it. It's very important, but everyone wants to talk to it as rarely as possible, because of the latency and power costs involved. In an AI/ML architecture, for example, you want to avoid massive cost increases, which is why everyone asks whether the data can be kept local or moved around in a different way. Can I arrange my architecture so that the compute elements receive their data at the right time in the pipeline? That's why latency matters. But when you optimize for latency, you also need to optimize for power. From a system perspective, you obviously want to minimize DRAM accesses. This has very interesting implications for NoC data-transport architectures, because people want to carry data along, keep it in various local memories, and design their architectures around locality so they go out to DRAM as rarely as possible.
Frank Ferro, Product Management Group Director, Cadence: As we adopt different AI architectures, many of the design goals are to keep more data local, or even avoid using DRAM altogether. Some companies make that their value proposition: if you don't have to go off-chip, power and performance improve by orders of magnitude. We've already discussed the size of the data models. They are so large and unwieldy that keeping everything on-chip may not be practical. Still, the more you can do on-chip, the more you save. Even HBM embodies this tradeoff: very wide by intention, and deliberately slow. If you look at previous generations of HBM, the DDR data rate was around 3.2Gb/s. Now they're up to 6.4Gb/s, which is still relatively slow for such a wide DRAM interface, and in this generation they even lowered the I/O voltage to 0.4V to cut I/O power. If you can slow the DRAM down, you can save power. But now you're taking memory and placing it very close to the processor, so you get a larger thermal footprint in a smaller area. You improve some things and make others worse.
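Ferro's "wide but slow" point is easy to quantify. Here is a quick sketch, assuming the standard 1024-bit interface per HBM stack; the per-pin rates come from the generations he mentions:

```python
# The "wide but slow" HBM tradeoff: a 1024-bit interface delivers high
# bandwidth even though each individual pin runs relatively slowly.

def stack_bandwidth_gbs(pin_rate_gbps: float, width_bits: int = 1024) -> float:
    """Peak bandwidth of one HBM stack in GB/s."""
    return pin_rate_gbps * width_bits / 8

for gen, rate in (("HBM2E-class", 3.2), ("HBM3-class", 6.4)):
    print(f"{gen}: {rate} Gb/s/pin x 1024 bits = "
          f"{stack_bandwidth_gbs(rate):.0f} GB/s per stack")

# The wide interface is what delivers the bandwidth; each pin can stay
# slow (and run at a low 0.4V I/O swing), which keeps energy per bit
# down compared with driving a narrow interface fast.
```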
Frank Schirrmeister, VP of Solutions and Business Development, Arteris: In my opinion, IBM's NorthPole AI architecture is an interesting example. Looked at from an energy-efficiency perspective, most of the memory essentially lives on the chip, though not everyone can do that. Essentially, it's an extreme case: keep as much as possible on the chip and go off-chip as little as possible. IBM's research shows it works.
Steven Woo, Distinguished Inventor, Rambus: When you think about DRAM, you have to think very strategically about how to use it. You need an adequate memory hierarchy between what sits above it (i.e., SRAM) and what sits below it (i.e., the disk tier). For any of these elements in the memory hierarchy, you don't want to move large amounts of data if you can avoid it. When you do move data, you need to make sure you reuse it as much as possible to amortize that overhead. The industry has been very good at responding to key needs. If you look at the development of technologies like low-power DRAM and HBM, they were responses to standard memory falling short on certain parameters, such as power efficiency. Some of the advances people are talking about now, especially with artificial intelligence becoming a big driver, improve not only performance but also energy efficiency, for example, taking DRAM and stacking it directly on the processor, which will help. Looking ahead, vendors will respond with architectural changes: not just incremental steps like the low-power roadmap, but bigger changes as well.
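Woo's amortization argument can be made concrete with a toy energy model. The picojoule figures below are hypothetical placeholders, chosen only to reflect the widely observed gap between off-chip access cost and on-chip compute cost:

```python
# Amortizing data movement: once you pay to fetch data from DRAM,
# reuse it as many times as possible before evicting it. Energy
# numbers are hypothetical, for illustration only.

DRAM_ACCESS_PJ_PER_BYTE = 20.0  # hypothetical off-chip access energy
COMPUTE_PJ_PER_OP = 0.5         # hypothetical on-chip op energy

def energy_per_op_pj(reuse_factor: int, bytes_per_op: int = 4) -> float:
    """Average energy per operation when each fetched byte is reused
    `reuse_factor` times before being evicted."""
    movement = DRAM_ACCESS_PJ_PER_BYTE * bytes_per_op / reuse_factor
    return movement + COMPUTE_PJ_PER_OP

for reuse in (1, 10, 100):
    print(f"reuse x{reuse:>3}: {energy_per_op_pj(reuse):6.2f} pJ/op")

# With no reuse, data movement dominates (80.5 pJ/op in this model);
# with 100x reuse the same workload approaches the cost of the compute
# itself, which is the strategic use of DRAM Woo is describing.
```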
Are there other ways memory can help with latency issues besides what we've been discussing?
Randy White, Keysight Memory Solutions Program Manager: Compute is moving closer to memory, and that will address many of the needs of edge computing. Beyond that, the obvious benefit of CXL is that we no longer pass the data itself, but pointers to memory addresses, which is more efficient and reduces overall latency.
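As a loose software analogy for passing pointers instead of data, consider ordinary OS shared memory: the producer shares a mapping and sends only a small handle, and the consumer attaches to it without any bulk copy. This is not CXL itself, just a sketch of the same economics using Python's standard multiprocessing.shared_memory module:

```python
# Analogy for White's point: instead of copying a buffer between
# parties, share one mapping and pass only a small handle (here, the
# shared-memory segment's name). Ordinary OS shared memory, not CXL,
# but the saving is the same in kind: the bulk data never moves.

from multiprocessing import shared_memory

# Producer: create a shared segment and write data into it once.
shm = shared_memory.SharedMemory(create=True, size=1024)
shm.buf[:5] = b"hello"

# Instead of sending 1024 bytes, send only the short name (the "pointer").
handle = shm.name

# Consumer: attach to the same memory by name; no bulk copy happens.
view = shared_memory.SharedMemory(name=handle)
print(bytes(view.buf[:5]))  # b'hello'

# Cleanup.
view.close()
shm.close()
shm.unlink()
```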
Frank Schirrmeister, VP of Solutions and Business Development, Arteris: There's a power issue here as well. We have CXL, CHI, PCIe, and all of these have to work well on and across chips, especially in a chiplet environment. Imagine your data is happily traveling across the chip over AXI or CHI, and now you want to move it from chiplet to chiplet. Suddenly you have to start converting things, and from a power perspective that has consequences. Everyone talks about the open chiplet ecosystem and communication between different players. To make that happen, conversions have to take place all the time. It reminds me of the old days, when there were five different video formats and several different audio formats, and everything needed to be converted. You want to avoid that because of the power consumption and the added latency. From a NoC point of view, if I'm trying to get data from memory and I have to insert a conversion block somewhere, because I have to cross UCIe to another chiplet to reach the memory attached to it, that adds more and more cycles. Because of this, the role of the architect is becoming increasingly important. From a latency and low-power perspective, you want to avoid conversions. It's just gates that don't add any value. If only everyone spoke the same language.
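To see why those conversions worry architects, here is a toy latency model. Every cycle count in it is hypothetical; it only illustrates how each protocol bridge on the path to memory adds cycles:

```python
# Toy model of the conversion overhead Schirrmeister describes: each
# protocol bridge (e.g., CHI -> UCIe at one die edge, UCIe -> CHI at
# the other) inserts extra cycles on the path to memory.

ON_DIE_HOP_CYCLES = 10      # hypothetical NoC traversal cost per die
BRIDGE_CYCLES = 15          # hypothetical cost of one protocol conversion
MEMORY_ACCESS_CYCLES = 100  # hypothetical DRAM access time

def access_latency(bridges: int) -> int:
    """Round-trip cycles for a memory access whose path crosses
    `bridges` protocol conversions each way; one chiplet crossing
    means two bridges (out of one protocol, into the other)."""
    one_way = ON_DIE_HOP_CYCLES * (bridges + 1) + BRIDGE_CYCLES * bridges
    return 2 * one_way + MEMORY_ACCESS_CYCLES

print(f"memory on the same die:       {access_latency(0)} cycles")  # 120
print(f"memory on a neighbor chiplet: {access_latency(2)} cycles")  # 220
```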
1. What is CXL?
CXL (Compute Express Link) is a new high-speed interconnect technology designed to provide higher data throughput and lower latency to meet the needs of modern computing and storage systems. Originally launched by Intel and since backed by AMD, Google, Microsoft, and many other companies, CXL aims to close the memory gap between CPUs and devices, and between devices themselves.
2. What is UCIe?
UCIe (Universal Chiplet Interconnect Express) is a comprehensive specification that can serve immediately as the basis for new designs while laying a solid foundation for future development of the specification. Unlike other specifications, UCIe defines a complete die-to-die interconnect stack, ensuring interoperability between compliant devices, which is a prerequisite for enabling the multi-die system market.
The copyright of this article belongs to the original author; it is reproduced here only for the exchange and sharing of information and technology.