# Design of 4x100Gb/s Single-Ended PAM-4 Voltage-Mode Transmitter for Memory Interface

Park, Jae-Koo

Department of Electrical and Electronic Engineering

Graduate School

**Yonsei University** 


Advisor: Prof. Choi, Woo-Young

**A Dissertation**

Submitted to the Department of Electrical and Electronic Engineering and the Committee on Graduate School of Yonsei University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Park, Jae-Koo

**June 2025** 


# This Certifies that the Dissertation of Park, Jae-Koo is Approved

| Role             | Name            |
|------------------|-----------------|
| Committee Chair  | Choi, Woo-Young |
| Committee Member | Jung, Seong-Ook |
| Committee Member | Park, Kwanseo   |
| Committee Member | Seo, Yung-Hun   |
| Committee Member | Kim, Hyeran     |

Department of Electrical and Electronic Engineering
Graduate School
Yonsei University
June 2025

# TABLE OF CONTENTS

| Section   | Title |
|-----------|-------|
|           | List of Figures |
|           | List of Tables |
|           | Abstract |
| CHAPTER 1 | Introduction |
| 1.1       | Motivation |
| 1.2       | Thesis Organization |
| CHAPTER 2 | Background |
| 2.1       | PAM-4 Signaling |
| 2.1.1     | Multi-Level Signaling |
| 2.1.2     | Binary Code vs Thermometer Code |
| 2.1.3     | Level Separation Mismatch Ratio |
| 2.2       | Feed-Forward Equalization |
| 2.2.1     | UI-Spaced Feed-Forward Equalization |
| 2.2.2     | Fractional-Spaced Feed-Forward Equalization |
| 2.3       | Memory Interface |
| 2.3.1     | Stub Series Terminated Logic |
| 2.3.2     | On-Die Termination |
| 2.3.3     | Pseudo Open-Drain |
| 2.3.4     | Low-Voltage Swing Termination Logic |
| 2.4       | Clocking Architecture |
| 2.4.1     | Full-Rate Clocking |
| 2.4.2     | Half-Rate Clocking |
| 2.4.3     | Quarter-Rate Clocking |
| CHAPTER 3 | Design of an 80-Gb/s LVSTL Transmitter With a Pulse Width Pre-Emphasis and a 4-Tap FFE |
| 3.1       |       |
| 3.2       |       |
| 3.2.1     | Principle of Pulse Width Pre-Emphasis |
| 3.2.2     | Transition Encoder |
| 3.2.3     | 4:1 MUX With Pulse Width Generator |
| 3.3       | Implementation of TX Sub-Block Circuits |
| 3.3.1     | Parallel PRBS Generator |
| 3.3.2     | Re-timer and 8:4 FFE MUX |
| 3.3.3     | Driver |
| 3.4       | Quarter-Rate Clocking |
| 3.4.1     | Clocking Architecture |
| 3.4.2     | Clock Distribution |
| 3.5       | Measurement Results |
| 3.6       | Conclusion |
| CHAPTER 4 | Design of a 4x100-Gb/s/pin POD Transmitter With Quadrature Clock Error Corrector Using Pre-Coded-Data Pattern |
| 4.1       | DCC/QEC Prior Arts |
| 4.2       | Pre-Coded-Data Pattern Based DCC/QEC |
| 4.2.1     | Principle of DCC |
| 4.2.2     | Principle of QEC |
| 4.2.3     | DCC/QEC Implementation |
| 4.2.4     | Analog-Digital Co-simulation |
| 4.3       | Circuits Implementation |
| 4.4       | Measurement Results |
| 4.5       | Conclusion |
| CHAPTER 5 | Conclusion |
|           | Bibliography |
|           | Abstract in Korean |
|           | List of Publications |
|           | International Journal Papers |
|           | International Conference Presentations |
|           | Patents |

# **List of Figures**

| Figure    | Caption |
|-----------|---------|
| Fig. 1-1  | Memory data bandwidth trend [1]. |
| Fig. 1-2  | The evolution of the number of state-of-the-art models over the years, along with the AI accelerator memory capacity [2]. |
| Fig. 2-1  | Eye diagrams of (a) PAM-2 (NRZ), (b) PAM-3, (c) PAM-4 signaling. |
| Fig. 2-2  | Power spectral density of NRZ and PAM-4 signaling. |
| Fig. 2-3  | Timing diagram for (a) binary code and (b) thermometer code. |
| Fig. 2-4  | Level separation mismatch ratio of PAM-4 signaling. |
| Fig. 2-5  | Block diagram of conventional FFE. |
| Fig. 2-6  | Frequency response of 2-tap UI-spaced FFE. |
| Fig. 2-7  | Power spectral density of NRZ. |
| Fig. 2-8  | Block diagram of 2-tap fractional spaced FFE. |
| Fig. 2-9  | Frequency response of 2-tap fractional spaced FFE. |
| Fig. 2-10 | DDR1 SDRAM motherboard termination. |
| Fig. 2-11 | (a) SSTL and (b) POD signaling. |
| Fig. 2-12 | Block diagram and timing diagram of full-rate clocking. |
| Fig. 2-13 | Block diagram and timing diagram of half-rate clocking. |
| Fig. 2-14 | Block diagram and timing diagram of quarter-rate clocking. |
| Fig. 3-1  | Overall architecture of the TX. |
| Fig. 3-2  | Block diagram of PWPE. |
| Fig. 3-3  | Frequency response of PWPE with pulse width and coefficient modulation. |
| Fig. 3-4  | Frequency response of PWPE passing through an ideal 1st order channel. |
| Fig. 3-5  | Frequency response of fractional-spaced FFE passing through an ideal 1st order channel. |
| Fig. 3-6  | Block diagram of capacitive peaking driver. |
| Fig. 3-7  | Frequency responses of capacitive peaking and pulse width pre-emphasis. |
| Fig. 3-8  | Schematic of current-starved inverter-based delay unit. |
| Fig. 3-9  | Simulation result of the current-starved inverter-based delay unit. |
| Fig. 3-10 | Block diagram of the transition encoder. |
| Fig. 3-11 | Timing diagram of the transition encoder and waveforms of the serialized main data and encoder output. |
| Fig. 3-12 | Schematic of the 4:1 MUX and balanced NAND gate. |
| Fig. 3-13 | Simulated waveforms of 4:1 MUX output at 40Gb/s with feedback equalizer (a) on and (b) off. |
| Fig. 3-14 | Schematic of the 4:1 MUX with clock delay unit. |
| Fig. 3-15 | Simulated delay versus bias voltage of delay unit. |
| Fig. 3-16 | (a) Block diagram of 7-bit LFSR for PRBS-7, (b) transition matrix $T$. |
| Fig. 3-17 | General form of $\mathbf{m} \times \mathbf{m}$ transition matrix $\mathbf{T}$ for $m > n$. |
| Fig. 3-18 | Block diagram and *TP* for (a) 8-bit parallel PRBS-7 generator and (b) 16-bit parallel PRBS-7 generator. |
| Fig. 3-19 | (a) *TP* and (b) block diagram for 16-bit parallel PRBS-31 generator. |
| Fig. 3-20 | (a) Block diagram and (b) timing diagram of re-timer. |
| Fig. 3-21 | (a) Block diagram of 8:4 FFE MUX. (b) Look-up table for 4-tap FFE cursors. |
| Fig. 3-22 | Timing diagram of the 8:4 FFE MUX for pre-tap and main-tap configuration. |
| Fig. 3-23 | Schematic of (a) 1-stacked and (b) 2-stacked LVSTL PAM-4 driver. |
| Fig. 3-24 | Operation of each driver based on the four levels of PAM-4 driver. |
| Fig. 3-25 | (a) Test bench for estimating the bandwidth of a CMOS buffer and (b) simulation results of input-to-output pulse width distortion. |
| Fig. 3-26 | Block diagram of the clock distribution and sub-circuits. |
| Fig. 3-27 | Simulated quadrature clock phase difference of 4-stage PPF. |
| Fig. 3-28 | Simulated operating range of (a) DCC and (b) QEC DCDL with 5-bit DCC/QEC control code. |
| Fig. 3-29 | (a) The die photograph. (b) Measurement setup. |
| Fig. 3-30 | Measured waveform using quadrature clock patterns at 36-Gbaud: (a) before and (b) after DCC/QEC calibration. |
| Fig. 3-31 | Measured PAM-4 RLM at 20-Gb/s with PAM-4 ZQ calibration used in [21]. |
| Fig. 3-32 | (a) Measured pulse responses and (b) insertion losses derived from measured pulse responses at 40-Gbaud without ESD and 36-Gbaud with ESD. |
| Fig. 3-33 | Measured 80-Gb/s eye diagram without ESD: with PWPE (a) off and (b) on. Measured 72-Gb/s eye diagram with ESD: with PWPE (c) off and (d) on. |
| Fig. 3-34 | (a) Measured pulse responses and (b) insertion losses derived from measured pulse responses at 28-Gb/s with 80-inch and 120-inch cables. |
| Fig. 3-35 | Measured 56-Gb/s PAM-4 eye diagrams: (a) 80-inch and (b) 120-inch cables without FFE and PWPE, (c) 80-inch cable with FFE only, (d) 80-inch cable with FFE & PWPE, (e) 120-inch cable with FFE only, and (f) 120-inch cable with FFE & PWPE. |
| Fig. 3-36 | Measured power breakdown based on simulation at 80-Gb/s. |
| Fig. 4-1  | DCC/QEC prior art in ref [22]. |
| Fig. 4-2  | Another DCC/QEC prior art in ref [5]. |
| Fig. 4-3  | Block diagram of (a) pulse generator and (b) 4:1 MUX. (c) Timing diagram of the 4:1 MUX with quadrature clock phase error. |
| Fig. 4-4  | Timing diagram of the driver outputs in incorrect DCC operation. |
| Fig. 4-5  | Timing diagram of the driver outputs in incorrect QEC operation. |
| Fig. 4-6  | Simulated eye diagrams of (a) differential output and (b) single-ended output with remaining duty cycle error. Simulated eye diagrams of (c) differential output and (d) single-ended output with remaining quadrature error. |
| Fig. 4-7  | Timing diagram illustrating changes in output average values of 1010 and 0101 patterns under clock phase shift conditions. |
| Fig. 4-8  | Timing diagram for duty cycle error detection principle. |
| Fig. 4-9  | Timing diagram for quadrature phase error detection principle. |
| Fig. 4-10 | Block diagram of the overall loop of the proposed DCC/QEC. |
| Fig. 4-11 | Flow chart of DCC. |
| Fig. 4-12 | Flow chart of QEC. |
| Fig. 4-13 | Duty cycle error correction calibration sequence. |
| Fig. 4-14 | Quadrature phase error correction calibration sequence. |
| Fig. 4-15 | Simulation result of DCC loop. |
| Fig. 4-16 | Simulation result of QEC loop. |
| Fig. 4-17 | Top-level block diagram of the 4-channel TX. |
| Fig. 4-18 | Schematics of the 4:1 MUX and single-ended driver with equalizer. |
| Fig. 4-19 | (a) Schematic and (b) simulation results of the DCC/QEC DCDL. |
| Fig. 4-20 | Measured 7GHz quadrature clock patterns for 28-Gbaud rate for CH A~D. |
| Fig. 4-21 | Measured 22GHz full-rate clock pattern. |
| Fig. 4-22 | Measured 56-Gb/s PAM-4 eye diagram with PRBS-31 pattern (a) before DCC/QEC and (b) after DCC/QEC. |
| Fig. 4-23 | Measured 4 channel 100-Gb/s PAM-4 eye diagram. |
| Fig. 4-24 | Measured 128-Gb/s PAM-4 eye diagram at 1.4V VDD. |
| Fig. 4-25 | Micrograph of the TX. |
| Fig. 4-26 | Power breakdown per channel at 100-Gb/s. |

# **List of Tables**

| Table     | Title                                                             |
|-----------|-------------------------------------------------------------------|
| Table 2-1 | Signal rates and signal modulation by wireline interface standard |
| Table 2-2 | 2bit-3bit binary to thermometer conversion table                  |
| Table 2-3 | Comparison table of clocking architecture                         |
| Table 3-1 | Comparison table of Capacitive Peaking and PWPE                   |
| Table 3-2 | Performance Summary Table of the State-of-the-Art TX              |
| Table 4-1 | Performance Summary Table of the TX with DCC/QEC                  |

# **Abstract**

# Design of High-Speed Single-Ended Voltage-Mode PAM-4 Transmitter for Memory Interface

With the rapid advancement of AI, cloud services, and machine learning technologies, the demand for high-performance computing (HPC) is growing quickly. In parallel, the performance requirements of DRAM, the primary system memory, are also increasing. However, the performance gap between processors and memory continues to widen every year, making memory a critical bottleneck in overall system performance. To address this issue and expand the bandwidth of the memory interface, this thesis proposes two single-ended transmitter (TX) architectures based on four-level pulse amplitude modulation (PAM-4).

PAM-4 signaling reduces the voltage margin to one-third that of NRZ signaling, making it difficult to secure a sufficient signal-to-noise ratio (SNR) in low-voltage swing termination logic (LVSTL) environments. Additionally, as the data rate increases, intersymbol interference (ISI) in the channel becomes more severe. Although de-emphasis can improve SNR by compensating for ISI, it also reduces signal amplitude. In LVSTL structures with inherently limited signal swing, excessive DC gain loss due to de-emphasis further degrades SNR.

To overcome these limitations, the first TX adopts a low-swing LVSTL-based single-ended structure that integrates a 4-tap reconfigurable feed-forward equalizer (FFE) and pulse width pre-emphasis (PWPE). By minimizing the tap weights of de-emphasis and using PWPE to compensate for the remaining ISI, the transmitter effectively enhances SNR while mitigating swing reduction. Fabricated in a 28nm CMOS process, the TX achieves 80 Gb/s with an eye height of 27 mV, an eye width of 0.16 UI, an RLM of 0.99, an energy efficiency of 3.06 pJ/b, and a chip area of 0.045 mm².

Building on this foundation, a second TX employing a pseudo open drain (POD) structure is developed to further simplify the architecture and enhance performance. The POD structure enables higher signal swing, allowing sufficient ISI compensation using only a simple 2-tap de-emphasis while reducing capacitive load at the output for higher speed operation. Additionally, to address duty cycle errors and quadrature phase errors inherent in quarter-rate clocking, this work proposes a novel phase correction technique that directly detects and compensates clock errors at the driver output using pre-encoded data patterns. The final 4-channel TX, fabricated in a 28nm CMOS process, achieves energy efficiencies of 1.25 pJ/b at 4×100 Gb/s and 0.99 pJ/b at 4×56 Gb/s, with a per-channel area of 0.066 mm².

*Keywords:* feed-forward equalization, four-level pulse amplitude modulation, transmitter, voltage-mode, single-ended, low voltage swing terminated logic, pulse width pre-emphasis, duty cycle correction, quadrature phase error correction, pseudo open drain

# **CHAPTER 1 Introduction**

## 1.1 MOTIVATION

Dynamic Random-Access Memory (DRAM) has become an indispensable component in modern computing systems. From personal devices such as smartphones and laptops to high-performance computing applications like data centers, artificial intelligence (AI), and machine learning (ML), DRAM serves as the backbone of memory storage and retrieval processes. Its ubiquitous use across diverse fields underscores its significance in achieving efficient data handling and processing.

In response to ever-growing computational demands, DRAM has been developed and specialized according to specific requirements. DRAM can be broadly categorized into four types: double data-rate (DDR) synchronous DRAM (SDRAM), optimized for the high-speed and high-capacity memory needed in PCs, servers, and data centers; graphics DDR (GDDR) SDRAM, used in GPUs requiring high-speed data processing and bandwidth; low power DDR (LPDDR) SDRAM, designed for low power consumption and high efficiency in mobile and IoT devices; and high bandwidth memory (HBM), tailored for ultra-fast data processing and high bandwidth in AI, HPC, and data centers. As shown in Fig. 1-1, while the objectives of each product differ, improvements in data bandwidth have been a consistent requirement across all product generations, aligning with advancements in computing capabilities.

Source: ISSCC trend 2024

Fig. 1-1. Memory data bandwidth trend [1].

HBM has achieved performance enhancements by increasing the number of pins. Through advanced packaging technologies like silicon interposers, the width of trace lines has been drastically reduced, allowing a significant increase in the number of pins within the same area, thereby substantially enhancing bandwidth. GDDR, on the other hand, has focused on increasing data speed to improve data bandwidth. It has optimized operating voltage for high-speed operation and enhanced the package channel characteristics between the GPU and GDDR from a data rate perspective, resulting in higher data rates. This stands in contrast to DDR products, which were developed in dual inline memory module (DIMM) form for general-purpose use. While DDR prioritizes stability and high capacity, it exhibits significantly lower data rates compared to GDDR. GDDR7 further advances bandwidth by incorporating three-level pulse amplitude modulation (PAM-3) technology, which encodes 1.5 bits per symbol (3 bits over 2 symbols), improving transmission efficiency by 50% compared to traditional non-return-to-zero (NRZ) signaling.

Fig. 1-2. The evolution of the number of state-of-the-art models over the years, along with the AI accelerator memory capacity [2].

Despite these advancements, a critical challenge persists: the rate of improvement in DRAM performance has not kept pace with the exponential growth in computing demands. As shown in Fig. 1-2, the number of parameters in large transformer models has increased by 410 times every two years, while the memory of a single GPU has only doubled approximately every two years. Memory bandwidth and capacity are therefore growing more slowly than AI workloads require. This disparity has resulted in a phenomenon commonly referred to as the memory bottleneck or memory wall. The memory bottleneck occurs when the speed at which data can be retrieved from or written to DRAM lags behind the speed at which processors can compute, thereby limiting overall system performance.

To mitigate this memory bottleneck, continuous improvements in memory interface technologies are essential. This thesis proposes a low voltage swing termination logic (LVSTL) transmitter (TX) that uses four-level pulse amplitude modulation (PAM-4) signaling to transmit 2 bits of data per symbol and thereby enhance memory interface performance. LPDDR uses LVSTL to achieve low-power operation. However, because LVSTL TXs operate with a low swing, ensuring a sufficient signal-to-noise ratio (SNR) is challenging. To address this issue, we propose an LVSTL PAM-4 TX that minimizes the use of de-emphasis for channel inter-symbol interference (ISI) mitigation to prevent excessive reduction in signal swing. Instead, additional channel equalization is performed using a pulse width pre-emphasis (PWPE) driver so that sufficient SNR can be achieved.

Additionally, this thesis proposes a correction method for quadrature clock phase errors, which is essential for high-speed TXs. Quarter-rate clocks, commonly used in high-speed SerDes, require careful management of both duty cycle errors and quadrature phase errors. Conventional methods, as described in [3], [4], [5], either place detection nodes in the middle of the clock path to detect phase errors or rely on clock patterns for calibration. However, detecting phase errors in the clock path leaves uncorrected any phase errors introduced after the detection node, and using clock patterns can correct duty cycle errors in differential outputs but leaves such errors uncorrected in single-ended drivers like those used in memory interfaces. To address this issue, we propose a duty cycle error correction (DCC) and quadrature phase error correction (QEC) method that uses data patterns capable of detecting clock phase errors, enabling correction even in single-ended drivers.

## 1.2 THESIS ORGANIZATION

This thesis is organized as follows: Chapter 2 discusses background knowledge related to the technologies used in the TX. This background includes an introduction to PAM-4 signaling, feed-forward equalization techniques, termination logic used in DRAM interfaces, and the characteristics of various clocking schemes employed in TXs. Chapter 3 focuses on the LVSTL PAM-4 TX. It describes the overall architecture of the TX, explains the operating principles of PWPE, and details the circuit implementation. Additionally, this chapter presents measurement results and a comparison with state-of-the-art TXs. Chapter 4 covers the pseudo open drain (POD) PAM-4 TX. It outlines the TX's architecture and explains the operating principle of the DCC/QEC technique based on pre-coded data patterns. The chapter also presents measurement results. Finally, Chapter 5 provides a summary of the two proposed TXs and concludes this thesis.

# **CHAPTER 2 Background**

## 2.1 PAM-4 SIGNALING

### 2.1.1 Multi-Level Signaling

Various wireline interfaces have traditionally used NRZ modulation schemes. NRZ signaling transmits either a 0 or 1 per symbol, making it efficient from an SNR perspective and allowing for relatively simple TX and receiver designs. However, as the demand for higher data bandwidth has grown, NRZ signaling methods have encountered significant challenges.

Table 2-1 presents the data rates and signal modulation schemes defined by commonly used wireline interface standards in the industry. Recently, many wireline interface standards have adopted PAM techniques, which transmit multiple data bits per symbol, to overcome the bandwidth limitations of NRZ. This trend is also evident in the DRAM product line. For example, Micron has experimentally applied PAM-4 in GDDR6X [6], and GDDR7 has officially standardized PAM-3 as its signaling method [7].

Fig. 2-1 shows the eye diagrams for PAM-2, PAM-3, and PAM-4. PAM-2 is another name for NRZ. NRZ can transmit 1 bit per symbol, while PAM-3 can theoretically transmit up to 1.58 bits per symbol (since $\log_2 3 \approx 1.58$). However, in practical standards like USB4 Gen4, 11 bits are encoded into 7 symbols, resulting in 1.57 bits per symbol. PAM-4 can transmit 2 bits per symbol.

Table 2-1. Signal rates and signal modulation by wireline interface standard

| Applications     | Data Rate   | Fundamental Frequency | Note  |
|------------------|-------------|-----------------------|-------|
| USB2.0           | 480Mbps     | 240MHz                |       |
| USB3.2 Gen2      | 10Gbps      | 5GHz                  |       |
| USB4             | 20Gbps      | 10GHz                 |       |
| USB4v2           | 25.6GBaud   | 12.8GHz               | PAM-3 |
| DDR5             | 6.4Gbps     | 3.2GHz                |       |
| LPDDR5           | 8.8Gbps     | 4.4GHz                |       |
| GDDR7            | 22GBaud     | 11GHz                 | PAM-3 |
| PCIe Gen4        | 16Gbps      | 8GHz                  |       |
| PCIe Gen5        | 32Gbps      | 16GHz                 |       |
| PCIe Gen6        | 32GBaud     | 16GHz                 | PAM-4 |
| MIPI M-PHY Gear5 | 23.3Gbps    | 11.65GHz              |       |
| MIPI C-PHY       | 6GSa/s      | 3GHz                  |       |
| MIPI D-PHY       | 9GSa/s      | 4.5GHz                |       |
| IEEE 802.3ck     | 53.125GBaud | 26.5625GHz            | PAM-4 |
| IEEE 802.3dj     | 106.25GBaud | 53.125GHz             | PAM-4 |
| HDMI 2.1         | 12Gbps      | 6.0GHz                |       |
| DisplayPort v2.1 | 20Gbps      | 10GHz                 |       |

Fig. 2-1. Eye diagrams of (a) PAM-2 (NRZ), (b) PAM-3, (c) PAM-4 signaling.
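The bits-per-symbol figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function names are hypothetical and not taken from any standard or from this work.

```python
import math

def ideal_bits_per_symbol(levels: int) -> float:
    """Upper bound on information per symbol for a PAM-N alphabet."""
    return math.log2(levels)

def block_code_fits(bits: int, symbols: int, levels: int) -> bool:
    """True if `symbols` PAM-N symbols have enough distinct combinations
    to represent every `bits`-bit word."""
    return levels ** symbols >= 2 ** bits

# PAM-3: theoretical limit versus the 11-bit / 7-symbol mapping
print(f"PAM-3 ideal:     {ideal_bits_per_symbol(3):.3f} bits/symbol")
print(f"11b over 7 syms: {11 / 7:.3f} bits/symbol")
print("11 bits fit in 7 ternary symbols:", block_code_fits(11, 7, 3))
print("12 bits fit in 7 ternary symbols:", block_code_fits(12, 7, 3))
```

Seven ternary symbols offer 3^7 = 2187 combinations, just enough for the 2^11 = 2048 words of an 11-bit block, which is why the practical rate of 11/7 sits slightly below the theoretical limit.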

By using PAM, multiple bits can be transmitted per symbol, allowing for higher bandwidth at the same symbol rate as the modulation order increases. However, for the same peak-to-peak swing, the eye height shrinks as the modulation order increases: it halves for PAM-3 and falls to one-third for PAM-4. Consequently, at the same symbol rate, PAM-3 suffers an SNR degradation of about 6 dB ($20\log_{10} 2$), and PAM-4 a degradation of about 9.5 dB ($20\log_{10} 3$), compared to NRZ.
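As a quick numerical check of these figures, the following sketch computes the eye-height penalty of PAM-N relative to NRZ. It assumes equal peak-to-peak swing and equally spaced levels; the function name is hypothetical.

```python
import math

def pam_eye_penalty_db(levels: int) -> float:
    """Eye-height loss relative to NRZ, in dB, assuming the same
    peak-to-peak swing split into (levels - 1) equal eyes."""
    return 20 * math.log10(levels - 1)

for n in (2, 3, 4):
    print(f"PAM-{n}: {pam_eye_penalty_db(n):.2f} dB")
```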

Despite this SNR degradation, PAM-4 can perform better than NRZ in environments with significant channel loss. Fig. 2-2 illustrates the power spectral density (PSD) of NRZ and PAM-4 at the same data rate. Since PAM-4 transmits 2 bits per symbol, its Nyquist frequency is half that of NRZ. Therefore, if the channel loss at NRZ's Nyquist frequency exceeds the loss at PAM-4's Nyquist frequency by more than 9.5 dB, the reduced channel loss outweighs the SNR degradation of PAM-4. In such conditions, PAM-4 signaling can offer advantages over NRZ signaling.



Fig. 2-2. Power spectral density of NRZ and PAM-4 signaling
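The break-even condition described above can be sketched as a small comparison. The linear dB-per-GHz loss model and the numbers used here are assumptions chosen purely for illustration; they do not describe any channel measured in this work.

```python
import math

PAM4_PENALTY_DB = 20 * math.log10(3)  # eye-height penalty versus NRZ, about 9.54 dB

def nyquist_ghz(data_rate_gbps: float, bits_per_symbol: int) -> float:
    """Nyquist frequency = half the symbol rate."""
    return data_rate_gbps / bits_per_symbol / 2

def linear_loss_db(freq_ghz: float, db_per_ghz: float) -> float:
    """Hypothetical channel model: loss grows linearly with frequency."""
    return db_per_ghz * freq_ghz

def pam4_preferred(data_rate_gbps: float, db_per_ghz: float) -> bool:
    """True when the loss saved by halving the Nyquist frequency
    exceeds the PAM-4 eye-height penalty."""
    nrz_loss = linear_loss_db(nyquist_ghz(data_rate_gbps, 1), db_per_ghz)
    pam4_loss = linear_loss_db(nyquist_ghz(data_rate_gbps, 2), db_per_ghz)
    return (nrz_loss - pam4_loss) > PAM4_PENALTY_DB

# At 100 Gb/s the Nyquist frequencies are 50 GHz (NRZ) and 25 GHz (PAM-4).
print(pam4_preferred(100, db_per_ghz=0.5))  # loss difference 12.5 dB
print(pam4_preferred(100, db_per_ghz=0.3))  # loss difference 7.5 dB
```

With the steeper assumed channel the 12.5 dB saving exceeds the penalty, so PAM-4 comes out ahead; with the flatter one the 7.5 dB saving does not.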

### 2.1.2 Binary Code vs Thermometer Code

The simplest method for configuring a PAM-4 driver is to design the MSB driver with twice the strength of the LSB driver. Fig. 2-3 shows the timing diagram for two configurations: one where the PAM-4 driver is operated using a binary MSB/LSB structure, and another where the driver operates on a 3-bit thermometer code obtained by 2-bit to 3-bit binary-to-thermometer encoding.

When the driver is operated using binary coding, data transitions between '10' and '01' cause the MSB and LSB to switch simultaneously in opposite directions. In such cases, even minor timing mismatches can leave both the MSB and LSB drivers momentarily on, or both momentarily off, resulting in glitches in the output waveform. In contrast, if binary-to-thermometer encoding is applied as shown in Table 2-2, the thermometer-coded driver prevents simultaneous opposite-direction transitions, thereby avoiding glitches in the output waveform.



Fig. 2-3. Timing diagram for (a) binary code and (b) thermometer code.

Table 2-2
2bit-3bit binary to thermometer conversion table

| Decimal | Binary MSB | Binary LSB | Thermometer T2 | Thermometer T1 | Thermometer T0 |
|---------|------------|------------|----------------|----------------|----------------|
| 3       | 1          | 1          | 1              | 1              | 1              |
| 2       | 1          | 0          | 1              | 1              | 0              |
| 1       | 0          | 1          | 1              | 0              | 0              |
| 0       | 0          | 0          | 0              | 0              | 0              |
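The glitch argument can be checked exhaustively. The following sketch encodes Table 2-2 with simple gate expressions (the OR/AND formulation is one possible realization, not necessarily the circuit used in this work) and lists every level transition in which one driver bit rises while another falls.

```python
def bin_to_therm(msb: int, lsb: int) -> tuple:
    """Return (T2, T1, T0) following Table 2-2."""
    t2 = msb | lsb      # on for levels 1, 2, 3
    t1 = msb            # on for levels 2, 3
    t0 = msb & lsb      # on for level 3 only
    return t2, t1, t0

def has_opposite_transitions(prev_bits, next_bits) -> bool:
    """True when at least one bit rises while another bit falls."""
    rising = any(p == 0 and n == 1 for p, n in zip(prev_bits, next_bits))
    falling = any(p == 1 and n == 0 for p, n in zip(prev_bits, next_bits))
    return rising and falling

levels = [(m, l) for m in (0, 1) for l in (0, 1)]
binary_glitch = [(a, b) for a in levels for b in levels
                 if has_opposite_transitions(a, b)]
therm_glitch = [(a, b) for a in levels for b in levels
                if has_opposite_transitions(bin_to_therm(*a), bin_to_therm(*b))]

print(binary_glitch)  # [((0, 1), (1, 0)), ((1, 0), (0, 1))]
print(therm_glitch)   # []
```

Only the '01' to '10' and '10' to '01' transitions are flagged for binary coding, and none are flagged for thermometer coding.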

# 2.1.3 Level Separation Mismatch Ratio

In NRZ signaling, the level separation mismatch ratio (RLM) is not a consideration. However, when using PAM-3 or PAM-4 signaling, RLM must be considered. Fig. 2-4 illustrates what RLM represents in PAM-4 signaling. RLM is defined in IEEE 802.3 Clause 94 as the ratio of the smallest level separation to the ideal level separation [8]. If the RLM is less than 1, the eye margin deteriorates further, which means the TX should be designed to maintain a good RLM. Additionally, IEEE 802.3 recommends that the RLM be 0.92 or higher.



Fig. 2-4. Level separation mismatch ratio of PAM-4 signaling.
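As a minimal sketch of the definition as worded above (smallest separation divided by the ideal separation), RLM can be computed from the four measured levels. The level values in millivolts are hypothetical, and the exact measurement procedure in the standard should be consulted for compliance testing.

```python
def rlm(levels):
    """Ratio of the smallest level separation to the ideal (equal) separation."""
    v = sorted(levels)
    if len(v) != 4:
        raise ValueError("expected four PAM-4 levels")
    min_sep = min(b - a for a, b in zip(v, v[1:]))
    ideal_sep = (v[-1] - v[0]) / 3
    return min_sep / ideal_sep

print(rlm([0, 100, 200, 300]))   # 1.0 (equally spaced levels, in mV)
print(rlm([0, 90, 200, 300]))    # 0.9 (compressed lowest eye, below the 0.92 guideline)
```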

## 2.2 FEED-FORWARD EQUALIZATION

# 2.2.1 UI-Spaced Feed-Forward Equalization

Feed-Forward Equalization (FFE) is a widely used technique in high-speed serial links to mitigate ISI caused by channel loss. FFE operates at the TX, where it pre-compensates for the expected ISI by transmitting a set of weighted taps that account for the channel characteristics. As the signal passes through the channel, this pre-compensation helps ensure that the ISI is minimized at the receiver.

Fig. 2-5 illustrates a block diagram of a conventional UI-spaced FFE. The input signal is delayed by one symbol interval (T) for each tap, and each delayed signal is multiplied by a specific weight. The weighted signals are then summed together and transmitted.

Fig. 2-5. Block diagram of conventional FFE.

The impulse response of a 2-tap FFE is represented by

$$h(t) = C_0 \cdot \delta(t) - C_1 \cdot \delta(t - T). \tag{2.1}$$

The transfer function in the frequency domain using the Laplace transform is given by

$$H(s) = \frac{Y(s)}{X(s)} = C_0 - C_1 \cdot e^{-sT}. \tag{2.2}$$

By substituting s with  $j\omega$  in Equation (2.2), the magnitude of the frequency response is obtained as

$$|H(j\omega)| = \sqrt{C_0^2 + C_1^2 - 2 \cdot C_0 \cdot C_1 \cos(\omega T)}. \tag{2.3}$$

The DC gain of the 2-tap de-emphasis FFE is given by  $C_0 - C_1$ , while the maximum gain occurs when  $\omega T = \pi$ , where the first gain peak at  $C_0 + C_1$  is generated. At this point,  $f_{MAX}$  corresponds to  $f_{Nyquist}$ . Assuming  $\beta$  is defined as  $C_1/C_0$ , Fig. 2-6 illustrates the frequency response of the 2-tap FFE for different values of  $\beta$ . The emphasis gain is expressed as  $20 \log((C_0 + C_1)/(C_0 - C_1))$  with the unit in dB.



Fig. 2-6. Frequency response of 2-tap UI-spaced FFE.
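The DC gain, peak gain, and emphasis gain stated above can be verified numerically from Equation (2.3). The sketch below normalizes frequency to the Nyquist frequency; the choice of $C_0 = 1$ and $\beta = 0.2$ is illustrative.

```python
import math

def ffe2_mag(c0: float, c1: float, f_over_fnyq: float) -> float:
    """|H(jw)| of Eq. (2.3) for a UI-spaced 2-tap FFE."""
    # w*T = 2*pi*f*T and f_Nyquist = 1/(2T), so w*T = pi * f / f_Nyquist.
    wt = math.pi * f_over_fnyq
    return math.sqrt(c0**2 + c1**2 - 2 * c0 * c1 * math.cos(wt))

c0, beta = 1.0, 0.2          # illustrative tap values
c1 = beta * c0
dc = ffe2_mag(c0, c1, 0.0)   # equals C0 - C1
peak = ffe2_mag(c0, c1, 1.0) # equals C0 + C1 at the Nyquist frequency

print(round(dc, 3), round(peak, 3))          # 0.8 1.2
print(round(20 * math.log10(peak / dc), 2))  # 3.52 (emphasis gain in dB)
```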

# 2.2.2 Fractional-Spaced Feed-Forward Equalization

Fig. 2-7 presents the power spectrum of the NRZ signal. Up to the Nyquist frequency, 78% of the total signal power is contained, and up to twice the Nyquist frequency, 90% of the total power is included. The UI-spaced FFE inherently exhibits gain peaking at the Nyquist frequency. Therefore, to maximize power transmission, it is necessary to shift the gain peak region toward a higher frequency range.
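The power fractions can be approximated by integrating the $\mathrm{sinc}^2$ spectrum of a random NRZ stream. The sketch below uses a simple midpoint rule and assumes ideal rectangular pulses; it yields roughly 77% up to the Nyquist frequency and 90% up to twice the Nyquist frequency, close to the values quoted above.

```python
import math

def nrz_power_fraction(f_over_fnyq: float, steps: int = 20000) -> float:
    """Fraction of random-NRZ power below f, assuming a sinc^2(f*T) spectrum."""
    # With x = f*T, the Nyquist frequency corresponds to x = 0.5.
    x_max = 0.5 * f_over_fnyq
    dx = x_max / steps
    area = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx                       # midpoint rule avoids x = 0
        area += (math.sin(math.pi * x) / (math.pi * x)) ** 2 * dx
    return area / 0.5                            # integral of sinc^2 over 0..inf is 0.5

print(round(nrz_power_fraction(1.0), 2))   # 0.77
print(round(nrz_power_fraction(2.0), 2))   # 0.9
```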

Fig. 2-8 illustrates the block diagram of the 2-tap fractional-spaced FFE, where $\alpha$ denotes the fractional coefficient. The impulse response of a 2-tap fractional FFE is expressed as

$$h(t) = C_0 \cdot \delta(t) - C_1 \cdot \delta(t - \alpha T). \tag{2.4}$$



Fig. 2-7. Power spectral density of NRZ.



Fig. 2-8. Block diagram of 2-tap fractional spaced FFE.

Applying the Laplace transform yields

$$H(s) = \frac{Y(s)}{X(s)} = C_0 - C_1 \cdot e^{-s\alpha T}. \tag{2.5}$$

The magnitude of the frequency response is given by

$$|H(j\omega)| = \sqrt{C_0^2 + C_1^2 - 2 \cdot C_0 \cdot C_1 \cos(\omega \alpha T)}. \tag{2.6}$$

Fig. 2-9 shows the frequency response of the 2-tap fractional spaced FFE when  $\beta$  is set to 0.2. The results demonstrate that the use of a fractional FFE extends the equalization effect beyond the Nyquist frequency into higher frequency bands.



Fig. 2-9. Frequency response of 2-tap fractional spaced FFE
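The shift of the gain peak follows directly from Equation (2.6): the magnitude is maximized when the cosine argument equals $\pi$. As a short worked step (the value $\alpha = 0.5$ is only an illustrative choice):

$$\omega_{peak} \cdot \alpha T = \pi \;\Rightarrow\; f_{peak} = \frac{1}{2\alpha T} = \frac{f_{Nyquist}}{\alpha}, \qquad \alpha = 0.5 \;\Rightarrow\; f_{peak} = 2 f_{Nyquist}.$$

The DC gain $C_0 - C_1$ and the peak gain $C_0 + C_1$ are unchanged by $\alpha$; only the frequency at which the peak occurs moves.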

### 2.3 MEMORY INTERFACE

# 2.3.1 Stub Series Terminated Logic

As shown in Fig. 2-10, the signal reflection is attenuated by terminating at  $0.5 \times \text{VDD}$  on the board, along with the memory module inserted into the motherboard. This method was used in DDR1, which supported up to four modules. Since this configuration formed a series stub, it was named Stub Series Terminated Logic (SSTL). As DRAM data rates increased, the SSTL structure, which relied on external termination resistors, became insufficient for suppressing reflections. Consequently, from DDR2 onward, it was replaced with on-die termination (ODT).



Fig. 2-10. DDR1 SDRAM motherboard termination.

## 2.3.2 On-Die Termination

DDR1 supports a maximum data rate of 400 Mbps, while DDR2 and DDR3 achieve 800 Mbps and 1600 Mbps, respectively. With increasing data rates, minimizing signal reflections became essential. To address this, external termination was eliminated, and on-die termination (ODT), in which the termination is integrated within each DRAM die, was introduced on the data lines starting with DDR2.

# 2.3.3 Pseudo Open-Drain

SSTL employs a termination method based on 0.5 × VDD, requiring both a pull-up and a pull-down transistor. As data rates increase, reducing parasitic capacitance at the interface becomes essential. To achieve this, a pseudo open-drain (POD) architecture, like an open-drain structure operating at VDD, was introduced to minimize parasitic capacitance. POD functions similarly to an open-drain configuration, where voltage is pulled down from VDD, allowing for a significant reduction in the size of the pull-up transistor and improving signal speed. As a result, the POD structure has been adopted since DDR4 and continues to be implemented in GDDR7.

Additionally, POD reduces power consumption. As shown in Fig. 2-11, SSTL always generates a current path regardless of whether the data is 1 or 0, whereas in POD, current flows only when the data is 0, effectively lowering overall power consumption.



Fig. 2-11. (a) SSTL and (b) POD signaling.

# 2.3.4 Low-Voltage Swing Termination Logic

The LPDDR4 series, which prioritizes power consumption reduction, adopts LVSTL architecture to minimize power dissipation by reducing signal swing. LVSTL employs VSS as the termination voltage, thereby eliminating static current generation when the data state is '0'. Furthermore, when combined with the data bus inversion (DBI) technique, LVSTL effectively reduces the proportion of transmitted '1's while increasing the occurrence of '0's, leading to additional power savings. This approach also facilitates a more efficient reduction of the driver supply voltage.

The LVSTL driver, implemented using NN transistors, operates at a VDD of 0.6V, significantly reducing power consumption. However, as LVSTL operates at a lower VDD compared to POD, the output swing is inevitably smaller. Consequently, to ensure an adequate SNR, a more refined FFE technique must be incorporated.
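The power benefit of DBI with VSS termination can be illustrated with a byte-level sketch. It assumes the common DC-type rule in which a byte is inverted when it contains more than four '1's; the function names are hypothetical, and the exact rule used by a given memory standard should be taken from its specification.

```python
def dbi_dc_encode(byte: int) -> tuple:
    """Return (transmitted_byte, dbi_flag); invert when more than four bits are '1'."""
    byte &= 0xFF
    ones = bin(byte).count("1")
    if ones > 4:
        return (~byte) & 0xFF, 1
    return byte, 0

def dbi_dc_decode(tx_byte: int, dbi_flag: int) -> int:
    """Undo the inversion at the receiver."""
    return (~tx_byte) & 0xFF if dbi_flag else tx_byte & 0xFF

tx, flag = dbi_dc_encode(0b11110111)
print(bin(tx), flag)   # 0b1000 1
```

With this rule, no transmitted byte carries more than four '1's, which bounds the static current drawn through the VSS termination.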

#### 2.4 CLOCKING ARCHITECTURE

# 2.4.1 Full-Rate Clocking

One of the critical considerations in high-speed serializer design is the clocking architecture, as the clock is an essential element for data serialization. The clocking architecture involves a trade-off among circuit complexity, data bandwidth, and power consumption. Fig. 2-12 shows the block diagram and timing diagram of a TX employing full-rate clocking. The full-rate clock, generated from an external or internal voltage-controlled oscillator (VCO), is divided and used to properly re-time and serialize the data. The serialized DIN, synchronized with the divided CK2, passes through a full-rate D flip-flop (DFF), ensuring a data eye free of timing errors. As a result, DOUT contains only a low level of random jitter and the deterministic jitter inherent to the full-rate clock.

However, as the data rate increases, ensuring tSETUP between the serialized DIN and the full-rate clock, clock dividing, and full-rate DFF operation leads to excessive power consumption and poses significant challenges in TX design.



Fig. 2-12. Block diagram and timing diagram of full-rate clocking.

# 2.4.2 Half-Rate Clocking

The increase in power consumption and design complexity associated with full-rate clocking has been mitigated through half-rate clocking. As shown in Fig. 2-13, half-rate clocking utilizes a half-rate clock generated from an external or internal VCO, enabling data serialization through a 2:1 multiplexer (MUX) without requiring DFF re-timing, before transmitting it to the driver.

As depicted in the timing diagram of Fig. 2-13, since no data re-timing is performed, any duty cycle error in CK2, which is used in the 2:1 MUX, degrades the horizontal eye margin of DOUT. To address this issue, a duty-cycle correction (DCC) circuit must be incorporated into the clock path.

Various approaches can be employed for DCC, but the most common method involves detecting the clock duty cycle error within the clock path and applying feedback to correct it.



Fig. 2-13. Block diagram and timing diagram of half-rate clocking.

## 2.4.3 Quarter-Rate Clocking

As data bandwidth continues to increase, even half-rate clocking experiences significant power consumption in clock distribution. To mitigate this issue, quarter-rate clocking can be an efficient alternative. Fig. 2-14 illustrates the block diagram and timing diagram of a TX employing quarter-rate clocking. In this architecture, four data streams are serialized using quadrature clocks: CK0, CK90, CK180, and CK270.

As observed in the timing diagram, the quadrature clocks are derived by further dividing the half-rate clock into a quarter-rate clock. The use of a quarter-rate clock widens the multiplexing timing margin during serialization. Additionally, in high-speed TXs, clock distribution accounts for a significant portion of the total power consumption. By reducing the clock rate to quarter-rate, it is possible to design a power-efficient clock distribution network. Furthermore, since the clock frequency is reduced to half of the half-rate clock, it exhibits lower sensitivity to clock jitter and noise.

However, the final 4:1 MUX is more complex to design compared to a 2:1 MUX, and it suffers from increased output capacitance. Additionally, the clock lines double in number, and an extra QEC circuit is required.

Table 2-3 compares the characteristics of full-rate, half-rate, and quarter-rate clocking architecture. As demonstrated, the selection of a clocking architecture in high-speed TX design must consider multiple factors, including operating frequency, design complexity, and power efficiency.





Fig. 2-14. Block diagram and timing diagram of quarter-rate clocking.

Table 2-3
Comparison table of clocking architecture

|                               | Full-rate Clocking | Half-rate Clocking | Quarter-rate Clocking |
|-------------------------------|--------------------|--------------------|-----------------------|
| Power Consumption             | High               | Moderate           | Low                   |
| Clock Distribution Complexity | High               | Moderate           | Low                   |
| Multiplexing Margin           | Small              | Moderate           | Large                 |
| Design Complexity             | Moderate           | Moderate           | High                  |
| Additional Circuitry          | -                  | DCC                | DCC/QEC               |
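The clock frequency implied by each architecture can be worked out from the data rate and the modulation order. The sketch below assumes that the final N:1 MUX is driven by N evenly spaced clock phases, so the clock runs at the symbol rate divided by N; the 80 Gb/s PAM-4 example is chosen only for illustration.

```python
def clock_frequency_ghz(data_rate_gbps: float, bits_per_symbol: int, mux_ratio: int) -> float:
    """Clock frequency for a final N:1 MUX: 1 = full-rate, 2 = half-rate, 4 = quarter-rate."""
    symbol_rate_gbaud = data_rate_gbps / bits_per_symbol
    return symbol_rate_gbaud / mux_ratio

for name, ratio in (("full", 1), ("half", 2), ("quarter", 4)):
    print(name, clock_frequency_ghz(80.0, 2, ratio))
# full 40.0
# half 20.0
# quarter 10.0
```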

# CHAPTER 3 Design of an 80-Gb/s LVSTL Transmitter With a Pulse Width Pre-Emphasis and a 4-Tap FFE

In an LVSTL driver, excessive de-emphasis is often applied to mitigate channel-induced ISI. However, this approach significantly attenuates the output swing, making it challenging to maintain sufficient eye margin. To address this issue, de-emphasis was minimized, and instead, PWPE was implemented using an auxiliary driver, thereby reducing the degradation of the output swing.

To implement PWPE, transition information was extracted from parallel data through an encoding process and subsequently serialized to drive the auxiliary driver. Additionally, to enhance the equalization controllability of the pre-emphasis, a fractional-spaced FFE technique was adopted.

#### 3.1 TRANSMITTER ARCHITECTURE

Fig. 3-1 presents the overall architecture of the TX. A differential 10-GHz external clock is first processed through a four-stage active poly-phase filter (PPF), which is implemented using a CMOS inverter-based structure. The resulting quadrature clocks are then passed through a digitally controlled delay line (DCDL) for DCC and QEC before being buffered and fed into the 4:1 MUX. To achieve precise timing alignment between the main and auxiliary drivers, a T-tree clock distribution strategy is employed. The quadrature clock is divided by a factor of 2 in the divider to ensure an appropriate timing margin between D4 and C4 in the 4:1 MUX. To achieve this, the quadrature C8 used in the 8:4 FFE MUX and the re-timer is adjusted via clock rotators. Similarly, the C8PRBS employed in the pattern generator utilizes a clock selector to choose a clock with the appropriate phase, thereby securing the data-to-clock setup time in the re-timer.

The pattern generator produces either a 16-bit parallel pseudo-random binary sequence with a length of $2^{31}-1$ (PRBS-31) or a user-defined 16-bit MSB/LSB pattern. The two pairs of 8-bit data from the pattern generator are expanded into three pairs of 8-bit parallel data through a 2-bit to 3-bit binary-to-thermometer encoding process. The resulting three pairs of 8-bit parallel thermometer-coded data then undergo transition encoding, which extracts transition information and results in six pairs of 8-bit parallel data representing rise and fall transitions. The encoded data pass through a re-timer, followed by serialization through the 8:4 FFE MUX and 4:1 MUX. The final serialized data stream is then driven by a two-stacked N-over-N driver. The output network incorporates ESD protection along with a T-coil, which helps to mitigate the bandwidth degradation caused by ESD diode capacitance.



Fig. 3-1. Overall architecture of the TX.

### 3.2 PULSE WIDTH PRE-EMPHASIS

# 3.2.1 Principle of Pulse Width Pre-Emphasis

To eliminate ISI without reducing the signal swing, PWPE is employed. PWPE operates by activating an auxiliary driver for fractional pulse width only during signal transitions, thereby accelerating data transitions while maintaining the same data level. The block diagram of PWPE is illustrated in Fig. 3-2.

The main signal, x(t), passes through a pulse generator that produces a pulse width of  $\alpha T$  whenever a transition occurs. The transition signal, y(t), is then weighted by a coefficient  $\beta$  and summed with the main signal to generate z(t). Here,  $\alpha T$  represents the pulse width, where  $\alpha$  denotes the fractional coefficient and T represents the period, while  $\beta$  defines the weight of the auxiliary driver. The impulse response of this output can be expressed as

$$h(t) = (1 + 0.5\beta) \cdot \delta(t) - 0.5\beta \cdot \delta(t - \alpha T). \tag{3.1}$$



Fig. 3-2. Block diagram of PWPE.

Applying the Laplace transform, the transfer function is derived as

$$H(s) = \frac{Z(s)}{X(s)} = (1 + 0.5\beta) - 0.5\beta \cdot e^{-s\alpha T}. \tag{3.2}$$

Substituting s with  $j\omega$  to obtain the magnitude of the frequency response results in

$$|H(j\omega)| = \sqrt{(1 + 0.5\beta)^2 + (0.5\beta)^2 - 2 \cdot (1 + 0.5\beta) \cdot 0.5\beta \cdot \cos(\omega \alpha T)}. \tag{3.3}$$

The frequency response obtained from Equation (3.3) is depicted in Fig. 3-3, demonstrating that there is no gain loss at DC due to equalization, while gain boosting occurs beyond the Nyquist frequency. Examining the frequency response after passing through an ideal 1st order low pass filter channel, as described in Fig. 3-4, it can be observed that a pulse width of 0.8 UI results in greater bandwidth expansion compared to a pulse width of 1 UI.
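The two properties noted above (unity DC gain and a boost that peaks beyond the Nyquist frequency) can be checked by evaluating Equation (3.3) directly. In this sketch, frequency is normalized to the Nyquist frequency, and the values of $\beta$ and $\alpha$ are illustrative rather than the settings used in this design.

```python
import math

def pwpe_mag(beta: float, alpha: float, f_over_fnyq: float) -> float:
    """|H(jw)| of Eq. (3.3)."""
    a, b = 1 + 0.5 * beta, 0.5 * beta
    # w*alpha*T with f_Nyquist = 1/(2T), so w*T = pi * f / f_Nyquist.
    wat = math.pi * f_over_fnyq * alpha
    return math.sqrt(a * a + b * b - 2 * a * b * math.cos(wat))

beta, alpha = 0.4, 0.8   # illustrative values only
print(round(pwpe_mag(beta, alpha, 0.0), 3))          # 1.0 (no DC loss)
print(round(pwpe_mag(beta, alpha, 1.0 / alpha), 3))  # 1.4 (peak of 1 + beta at f_Nyquist / alpha)
```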

The frequency response of the 2-tap fractional-spaced FFE described in Fig. 2-9, when applied to an ideal first-order channel, is shown in Fig. 3-5. By appropriately setting the parameters $\alpha$ and $\beta$, a flat frequency response can be achieved even in the high-frequency region. However, increasing $\beta$ excessively leads to severe DC gain reduction due to de-emphasis, making it challenging to maintain a sufficient SNR, especially in LVSTL operation. Therefore, by partially compensating for channel loss through PWPE before applying de-emphasis, it becomes possible to ensure an adequate SNR even in low-swing drivers.

Fig. 3-3. Frequency response of PWPE with pulse width and coefficient modulation.

Fig. 3-4. Frequency response of PWPE passing through an ideal 1st order channel.

Fig. 3-5. Frequency response of fractional-spaced FFE passing through an ideal 1st order channel.

In addition to pre-emphasis, capacitive peaking is another technique that enables channel equalization without degrading signal swing. The block diagram of the driver incorporating capacitive peaking is shown in Fig. 3-6.

Fig. 3-6. Block diagram of capacitive peaking driver.

Since the capacitive peaking driver appears as an open circuit at low frequencies, the TX output swing is preserved and remains equivalent to that of a conventional driver without capacitive peaking. The frequency response of the driver can be analyzed by deriving the transfer function from the input signal X(s) to the output signal Y(s) via Laplace transformation, as shown in Equation (3.4).

$$H(s) = \frac{Y(s)}{X(s)} = \frac{R_T\{s(R_D + R_{EQ})C_C + 1\}}{R_D(sR_{EQ}C_C + 1)(sR_TC_L + 1) + R_T\{s(R_D + R_{EQ})C_C + 1\}}. \tag{3.4}$$

Upon examining the transfer function, assuming  $C_C \gg C_L$ ,  $R_D = R_T$ , and that the main driver strength is typically greater than that of the auxiliary driver (i.e.,  $R_D < R_{EQ}$ ), the zero (Z) is located at  $1/((R_D + R_{EQ})C_C)$ , the dominant pole ( $P_1$ ) is positioned at  $1/(R_{EQ}C_C)$ , and the second pole ( $P_2$ ) arises at  $1/(R_TC_L)$ . These frequency locations follow the order:  $Z < P_1 < P_2$ . To achieve sufficient equalization gain peaking, the Z and the  $P_1$  must be well separated. However, in capacitive peaking structures,  $R_D < R_{EQ}$ , which causes the Z and  $P_1$  to be located in close proximity, thereby limiting the achievable gain peaking.

Fig. 3-7 illustrates the simulation results comparing the frequency responses of capacitive peaking and PWPE. The capacitive peaking configuration assumes  $R_D = 50 \Omega$ ,  $C_C = 1 \text{ pF}$ , and a load of  $C_L = 0.1 \text{ pF}$  in parallel with  $R_T = 50 \Omega$ . The PWPE driver uses a pulse width of 25 ps with equal driver strength, sweeping  $R_{EQ}$  from 50 to 200  $\Omega$ . The results confirm that capacitive peaking exhibits a limited equalization gain compared to PWPE due to the coupling between its zero and dominant pole. Additionally, in capacitive peaking structures, tuning the peaking frequency requires modifying the passive capacitor value, whereas PWPE enables more flexible control via pulse width adjustment. While capacitive peaking offers lower implementation complexity, its limited equalization gain, the parasitic capacitance associated with physical capacitors (typically 5–10%), and poor controllability make PWPE a more attractive candidate. A comparative summary of the advantages and disadvantages of capacitive peaking and PWPE is presented in Table 3-1.

Fig. 3-7. Frequency responses of capacitive peaking and pulse width pre-emphasis.
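The limited peaking of the capacitive structure can be reproduced by evaluating Equation (3.4) numerically with the component values stated above. This is a sketch of the analytical expression only, not a substitute for the circuit simulation in Fig. 3-7; the frequency grid (0.1 GHz steps up to about 100 GHz) and the helper names are assumptions.

```python
import math

def cap_peaking_h(f_hz, r_eq, r_d=50.0, c_c=1e-12, r_t=50.0, c_l=0.1e-12):
    """Evaluate Eq. (3.4) at frequency f_hz."""
    s = 1j * 2 * math.pi * f_hz
    zero_term = s * (r_d + r_eq) * c_c + 1
    num = r_t * zero_term
    den = r_d * (s * r_eq * c_c + 1) * (s * r_t * c_l + 1) + r_t * zero_term
    return num / den

def peaking_db(r_eq):
    """Peak gain relative to DC over a coarse frequency grid."""
    dc = abs(cap_peaking_h(0.0, r_eq))
    peak = max(abs(cap_peaking_h(k * 1e8, r_eq)) for k in range(1, 1000))
    return 20 * math.log10(peak / dc)

for r_eq in (50.0, 100.0, 200.0):
    z = 1 / (2 * math.pi * (50.0 + r_eq) * 1e-12)   # zero location from the text, in Hz
    p1 = 1 / (2 * math.pi * r_eq * 1e-12)           # dominant pole from the text, in Hz
    print(f"R_EQ={r_eq:.0f}: Z={z / 1e9:.2f} GHz, P1={p1 / 1e9:.2f} GHz, "
          f"peaking={peaking_db(r_eq):.2f} dB")
```

The DC gain stays at $R_T/(R_D + R_T) = 0.5$ for every $R_{EQ}$, and the peaking shrinks as $R_{EQ}$ grows because the zero and the dominant pole move closer together.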

Table 3-1
Comparison table of Capacitive Peaking and PWPE

| Category                  | Capacitive Peaking | PWPE     |
|---------------------------|--------------------|----------|
| Implementation Complexity | Low                | Moderate |
| Area Overhead             | Moderate           | Moderate |
| Equalization Performance  | Limited            | Good     |
| Controllability           | Low                | High     |

### 3.2.2 Transition Encoder

To perform PWPE, it is necessary to add a transition data signal when the final output data undergoes a transition. In the case of de-emphasis, fractional-spaced FFE can be achieved by applying a delay to the clock used in the 4:1 MUX for the post tap and then combining main tap and post tap at the output stage, as described in [9], [10]. However, this approach is not suitable for pre-emphasis, which adds a pulse width synchronously with the main tap timing, unlike de-emphasis.

An alternative method involves generating the pulse width by performing an AND or OR operation between the full-rate main tap and the delayed post tap after 4:1 serialization, as suggested in [11]. The back-end implementation of pre-emphasis, as referenced in [11], enables pulse width generation without requiring additional serializer stages. This approach is energy-efficient, relying solely on simple logic gate operations. However, this method imposes constraints on maximizing the data rate. Fig. 3-8 illustrates a current-starved inverter-based delay unit used to generate the delayed post tap. When a single-stage current-starved inverter is used to delay the signal, the signal slope significantly degrades, leading to internal ISI. Moreover, the delay varies depending on the data pattern.



Fig. 3-8. Schematic of current-starved inverter-based delay unit.

To address this, a two-stage current-starved inverter configuration with intermediate buffering is employed to maintain a sharp signal slope. Assuming that the propagation delay induced by the gate stages is compensated via a replica delay buffer on the main tap, the actual delay under various bias voltages is simulated across different data rates to ensure distortion-free pulse shaping. The relationship between bias voltage and resulting delay time, along with the tunable range per UI at various data rates, is presented in Fig. 3-9. As the data rate of the serialized main tap increases, the allowable delay time without introducing signal distortion decreases rapidly, and the tunable range per UI is significantly reduced. To achieve a 20% tunable range per UI for a full-rate 40 Gbps serialized data stream, at least five such delay units would be required in series. Consequently, the full-rate data would need to traverse more than 20 inverter gate stages, and the main data path would require equivalent propagation delay compensation.

In contrast, in a 4:1 MUX structure, the pulse width generated by delaying a quarter-rate clock has only half the Nyquist frequency of the full-rate data. Moreover, due to the deterministic nature of the clock pattern, internal ISI is inherently reduced, enabling implementation with only a single delay unit. Therefore, implementing pre-emphasis in the back-end requires excessive delay lines to secure a sufficient tuning range without internal ISI, which makes the design more susceptible to power supply-induced jitter (PSIJ). Such an approach is unsuitable for high-speed transmitter designs.

Fig. 3-9. Simulation result of the current-starved inverter-based delay unit.

To maintain equalization flexibility while ensuring timing alignment between the main data and pre-emphasis data, an encoding approach is adopted at the front-end of the data path, as illustrated in Fig. 3-10. The transition encoder in Fig. 3-10 takes 8-bit parallel input data signals. Considering the serialization process in the next stage, the transition is detected by performing an XOR operation between the current and previous data before serialization. Additionally, based on whether the current data is '0' or '1', it is classified as either rising transition data or falling transition data. Since D8[0] does not have previous data, the previous D8[7] is stored in a shift register as D8<sub>PRE</sub>[7] for transition detection.

Fig. 3-10. Block diagram of the transition encoder.

Due to data skew introduced during encoding, which may cause sampling timing margin issues in the next-stage re-timer, the shift register in the transition encoder operates with a delayed clock. This clock delay is implemented by replicating the propagation delay (tPD) of the pattern generator and the thermometer encoder, ensuring timing consistency.

Fig. 3-11 presents an example timing diagram illustrating both the output data from the transition encoder when receiving 8-bit parallel input data signals and the final output data after serialization in the subsequent stage. Initially, D8[0:7] is processed by performing an XOR operation with the previously stored D8<sub>PRE</sub>[7], followed by an AND operation, which generates D8R[0:7] and D8F[0:7]. These signals then pass through the re-timer and serialization process, ultimately forming D1R and D1F. Comparing these signals with the main data D1 confirms that the transition signals are enabled only at the transition points. The pulse width for PWPE is generated and adjusted in the 4:1 MUX at the next stage.



Fig. 3-11. Timing diagram of the transition encoder and waveforms of the serialized main data and encoder output.
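The encoding described above can be sketched as a bit-level behavioral model. It assumes that D8[0] is the earliest bit after serialization and that the rising and falling outputs are obtained by ANDing the XOR result with the current bit and its complement, respectively; the function name and the example word are hypothetical, and the hardware evaluates all eight bits in parallel rather than in a loop.

```python
def transition_encode(d8, d8_pre7):
    """Return (D8R, D8F, next D8PRE[7]) for one 8-bit parallel word.

    d8[0] is the earliest bit after serialization; d8_pre7 is D8[7] of the previous word.
    """
    d8r, d8f = [], []
    prev = d8_pre7
    for cur in d8:
        transition = prev ^ cur               # XOR with the preceding bit
        d8r.append(transition & cur)          # transition ending at '1' -> rising
        d8f.append(transition & (cur ^ 1))    # transition ending at '0' -> falling
        prev = cur
    return d8r, d8f, d8[-1]

word = [0, 1, 1, 0, 0, 0, 1, 0]
d8r, d8f, pre = transition_encode(word, d8_pre7=0)
print(d8r)  # [0, 1, 0, 0, 0, 0, 1, 0]
print(d8f)  # [0, 0, 0, 1, 0, 0, 0, 1]
```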

## 3.2.3 4:1 MUX With Pulse Width Generator

To implement PWPE using the transition data generated by the transition encoder, a pulse width generator is required. Fig. 3-12 illustrates the 4:1 MUX used for serializing the main data. To suppress data-dependent skew caused by the connection order of the signals fed into the NAND gate, a balanced NAND gate is employed. Additionally, in full-rate multiplexing, the 4:1 MUX utilizes a 2-stage series transistor structure, which introduces internal ISI at high data rates. To mitigate this issue, feedback equalizers are incorporated at D02, D13, and D0123 to enhance the bandwidth of the 4:1 MUX. Since the feedback equalizer consumes static current, it is activated only during high-speed operation.

Fig. 3-13 presents a simulated 40-Gb/s eye diagram with a PRBS-31 NRZ pattern at node D1, demonstrating that enabling the feedback equalizer improves the bandwidth of the 4:1 MUX.



Fig. 3-12. Schematic of the 4:1 MUX and balanced NAND gate.



Fig. 3-13. Simulated waveforms of 4:1 MUX output at 40Gb/s with feedback equalizer (a) on and (b) off.

The 4:1 MUX used for the auxiliary driver, as presented in Fig. 3-14, also functions as a pulse width generator. The pulse width is generated by applying a delayed clock to the AND gate input port, which is originally supplied with VDD in a conventional 4:1 MUX.

In UI-spaced PWPE operation, the delay unit remains disabled, holding the delay unit output at a high state. In contrast, in fractional-spaced PWPE operation, the delay unit is enabled, allowing the pulse width to be generated.

According to simulation results in Fig. 3-15, the delay unit can delay the clock to below 20 ps, but the minimum achievable pulse width is 20 ps, due to the bandwidth limitation of the 4:1 MUX.



Fig. 3-14. Schematic of the 4:1 MUX with clock delay unit.



Fig. 3-15. Simulated delay versus bias voltage of delay unit.

#### 3.3 IMPLEMENTATION OF TX SUB-BLOCK CIRCUITS

#### 3.3.1 Parallel PRBS Generator

The PRBS generator is a pattern generator used in the TX design to generate random data patterns. PRBS is commonly utilized in bit error rate (BER) measurement equipment, where PRBS data is transmitted and compared with the received data to determine the BER. Additionally, PRBS is widely used in high-speed links such as PCIe, USB4, and DDR5 to assess link stability.

PRBS sequences are typically generated using a linear feedback shift register (LFSR), with common types including PRBS-7, PRBS-15, and PRBS-31. These sequences are derived from PRBS primitive polynomials to generate binary sequences.

To use PRBS as a random data source in the TX serializer, the parallel PRBS must match the serializer's multiplexing ratio. For instance, an 8:1 serializer requires the generation of an 8-bit parallel PRBS. Conventionally, an LFSR-based PRBS generator produces only one bit per clock cycle. A naive approach would be to de-serialize the serial PRBS output into an m-bit parallel PRBS. However, this method requires the LFSR clock frequency to match the final data rate, making it impractical for high-speed operation.

Thus, an m-bit parallel PRBS generator that inherently operates at 1/m of the final data rate is necessary. This can be efficiently implemented using a transition matrix representation [12].

Let  $D_k$  denote the state vector of an LFSR-based PRBS generator, where  $D_k = [D_{-1}, D_{-2}, \ldots, D_{-n}]^T$ . The transition matrix T then describes the relationship between the next state  $D_{k+1}$  and the current state  $D_k$ , which can be expressed as:

$$\mathbf{D}_{k+1} = \mathbf{T} \cdot \mathbf{D}_k. \tag{3.5}$$

In Equation (3.5), T can be directly obtained from the LFSR structure. For example, Fig. 3-16(a) presents the LFSR block diagram for a PRBS-7 sequence based on the polynomial  $X^7 + X^6 + 1$ . The corresponding transition matrix T is shown in Fig. 3-16(b).

When the parallel transition matrix  $T_P$  for an m-bit parallel PRBS is defined, the relationship between the next state  $D_{k+m}$  and the current state  $D_k$  can be expressed as:

$$\boldsymbol{D}_{k+m} = \boldsymbol{T}_{\boldsymbol{P}} \cdot \boldsymbol{D}_k \tag{3.6}$$



(a)

$$T = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 \end{bmatrix}$$
(b)

Fig. 3-16. (a) Block diagram of 7-bit LFSR for PRBS-7, (b) transition matrix **T**.

where  $T_P$  can be derived using Equation (3.5). In Equation (3.6), the parallel transition matrix  $T_P$  is obtained by multiplying the transition matrix T m-times, which can be represented as:

$$T_P = T^m. \tag{3.7}$$

$$T = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{n-1} & 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & \cdots & \cdots & \cdots & \cdots & 0 \\ 0 & 1 & \ddots & & & & & \vdots \\ \vdots & \ddots & \ddots & \ddots & & & & \vdots \\ \vdots & & \ddots & \ddots & \ddots & & & \vdots \\ 0 & 0 & \cdots & \cdots & 0 & 1 & 0 & 0 \\ 0 & 0 & \cdots & \cdots & 0 & 0 & 1 & 0 \end{bmatrix}$$

Fig. 3-17. General form of  $m \times m$  transition matrix T for m > n.

If m is smaller than the highest polynomial order n, T can be used directly as an  $n \times n$  matrix. If m is greater than n, an  $m \times m$  matrix T should be constructed, as illustrated in Fig. 3-17.
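The relation $T_P = T^m$ can be checked with a small GF(2) model of the PRBS-7 generator. The sketch below builds T from Fig. 3-16(b), raises it to the power m, and confirms that one parallel update equals m serial updates. It uses m = 4 so that the $n \times n$ form applies; the helper names and the seed state are assumptions for illustration.

```python
def prbs7_transition_matrix():
    """7x7 matrix T of Fig. 3-16(b) for X^7 + X^6 + 1."""
    n = 7
    t = [[0] * n for _ in range(n)]
    t[0][5] = t[0][6] = 1            # new bit = D[-6] xor D[-7]
    for i in range(1, n):
        t[i][i - 1] = 1              # remaining bits shift by one position
    return t

def matmul_gf2(a, b):
    """Matrix product with modulo-2 arithmetic."""
    return [[sum(a[i][k] & b[k][j] for k in range(len(b))) % 2
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec_gf2(a, v):
    """Matrix-vector product with modulo-2 arithmetic."""
    return [sum(r & x for r, x in zip(row, v)) % 2 for row in a]

def matpow_gf2(t, m):
    """T multiplied by itself m times over GF(2), for m >= 1."""
    result = t
    for _ in range(m - 1):
        result = matmul_gf2(result, t)
    return result

T = prbs7_transition_matrix()
TP = matpow_gf2(T, 4)                 # m = 4 <= n = 7
state = [1, 0, 0, 0, 0, 0, 0]         # any nonzero seed
serial = state
for _ in range(4):
    serial = matvec_gf2(T, serial)
print(serial == matvec_gf2(TP, state))   # True
```

Each row of $T_P$ lists the current-state bits that are XORed to form one next-state bit, which maps directly onto the XOR network of the parallel generator.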

The resulting block diagram and  $T_P$  for the 8-bit parallel PRBS-7 generator are presented in Fig. 3-18(a), while the block diagram and  $T_P$  of the 16-bit parallel PRBS-7 generator are shown in Fig. 3-18(b).

Fig. 3-18. Block diagram and  $T_P$  for (a) 8-bit parallel PRBS-7 generator and (b) 16-bit parallel PRBS-7 generator.

For the TX design in this work, PRBS-31 based on the polynomial  $X^{31} + X^{28} + 1$  is used. To implement a 16-bit parallel PRBS-31 generator supporting 8-bit MSB/LSB PRBS, the transition matrix  $T_P$  shown in Fig. 3-19(a) is utilized, and the corresponding block diagram is illustrated in Fig. 3-19(b).



Fig. 3-19. (a)  $T_P$  and (b) block diagram for 16-bit parallel PRBS-31 generator.

### 3.3.2 Re-timer and 8:4 FFE MUX

The re-timer is a circuit that aligns parallel data with each clock phase before multiplexing to ensure an appropriate clock-to-data timing margin. It is implemented using latches, and its block diagram is shown in Fig. 3-20(a), while the timing diagram is presented in Fig. 3-20(b).

To enhance driver efficiency, the main driver is divided into five segments, as depicted in Fig. 3-1. When employing de-emphasis FFE, each segment is assigned a tap via an 8:4 FFE MUX, allowing flexible equalization. The coarse driver strength is controlled by adjusting the number of segments assigned to each tap, and the fine driver strength is adjusted using a 5-bit ZQ code embedded in each segment driver. The tap configuration and polarity are managed within the 8:4 FFE MUX, which selects the pre-cursor, main, post, or post-2 cursor for each segment and toggles the SIG signal to set the polarity.



Fig. 3-20. (a) Block diagram and (b) timing diagram of re-timer.



|            |       | Selected 8-UI Data |          |          |          |          |          |          |          |
|------------|-------|--------------------|----------|----------|----------|----------|----------|----------|----------|
|            |       | D8S[0]             | D8S[1]   | D8S[2]   | D8S[3]   | D8S[4]   | D8S[5]   | D8S[6]   | D8S[7]   |
| Tap select | PRE   | D8rtm[1]           | D8rtm[2] | D8rtm[3] | D8rtm[4] | D8rtm[5] | D8rtm[6] | D8rtm[7] | D8rtm[0] |
|            | MAIN  | D8rtm[0]           | D8rtm[1] | D8rtm[2] | D8rtm[3] | D8rtm[4] | D8rtm[5] | D8rtm[6] | D8rtm[7] |
|            | POST  | D8rtm[7]           | D8rtm[0] | D8rtm[1] | D8rtm[2] | D8rtm[3] | D8rtm[4] | D8rtm[5] | D8rtm[6] |
|            | POST2 | D8rtm[6]           | D8rtm[7] | D8rtm[0] | D8rtm[1] | D8rtm[2] | D8rtm[3] | D8rtm[4] | D8rtm[5] |

(b)

Fig. 3-21. (a) Block diagram of 8:4 FFE MUX. (b) Look-up table for 4-tap FFE cursors.

As illustrated in Fig. 3-21, the 8:4 FFE MUX comprises a 4:1 selector for the 4-tap FFE cursors, a 4-UI pulse generator for multiplexing, and 2:1 MUXs. The 8-UI aligned data D8rtm[0:7] from the re-timer is mapped to D8S[0:7] according to the tap configuration, as described in Fig. 3-21(b). The signal polarity is determined by an XNOR-based SIG signal in the 4-UI pulse generator.
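The look-up table of Fig. 3-21(b) amounts to a circular shift of the re-timed 8-UI word. The sketch below illustrates this under the assumption that the rotation seen in the PRE, MAIN, and POST rows (+1, 0, and -1 UI) continues to -2 UI for POST2. The names `TAP_OFFSET` and `select_tap` are hypothetical.

```python
# Cursor offset in UI relative to the main cursor (assumed from the table rows).
TAP_OFFSET = {"PRE": 1, "MAIN": 0, "POST": -1, "POST2": -2}

def select_tap(d8rtm, tap):
    """Map the re-timed word D8rtm[0:7] to D8S[0:7] for the chosen tap."""
    offset = TAP_OFFSET[tap]
    return [d8rtm[(i + offset) % 8] for i in range(8)]

# Symbolic check: label each bit by its source position.
labels = [f"D8rtm[{i}]" for i in range(8)]
pre = select_tap(labels, "PRE")
post = select_tap(labels, "POST")
post2 = select_tap(labels, "POST2")
```

Because every tap is a rotation of the same word, a single 4:1 selector per output bit is enough to reassign a segment to any cursor.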

Fig. 3-22 presents an example timing diagram demonstrating the operation of the 8:4 FFE MUX for pre-tap and main-tap configurations. The resulting D4[0:3] is then passed to the driver unit.



Fig. 3-22. Timing diagram of the 8:4 FFE MUX for pre-tap and main-tap configuration.

### 3.3.3 Driver

As discussed in Chapter 2.3, DRAM interfaces are primarily categorized into POD and LVSTL schemes. This TX employs an LVSTL driver, designed for low-power applications. Fig. 3-23 illustrates the PAM-4 LVSTL driver architecture.

Fig. 3-23. Schematic of (a) 1-stacked and (b) 2-stacked LVSTL PAM-4 driver.

In Fig. 3-23(a), the driver utilizes a 1-stacked N-over-N configuration, where the driver strength code and the serialized full-rate data are processed through a NOR operation to drive the output stage. However, this configuration is susceptible to PSIJ due to the increased number of MUX stages in the 4:1 serialization process. Furthermore, the stacked PMOS structure in the NOR gate introduces internal ISI, particularly when operating at full rate.

To address these limitations, Fig. 3-23(b) presents an alternative 2-stacked N-over-N driver architecture, where the driver strength ZQ code and data are integrated within the two-stacked driver itself. This structure shortens the full-rate data path, making it more robust against PSIJ while also eliminating the NOR gate from the data path, thereby reducing ISI-related degradation. The 2-stacked structure does increase the series resistance and parasitic capacitance at the output node. Nevertheless, since the mobility of NMOS is higher than that of PMOS, the driver size can be optimized to minimize this drawback. Therefore, this TX adopts a 2-stacked N-over-N driver.

For PAM-4 signal driving, the MSB/LSB data undergoes a 2-bit to 3-bit binary-to-thermometer code conversion, enabling three identical parallel drivers. Each segment driver maintains an identical structure and size, ensuring uniform drive strength. The enable operation of each driver based on the four levels is depicted in Fig. 3-24.



Fig. 3-24. Operation of each driver based on the four levels of PAM-4 driver.
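The 2-bit to 3-bit conversion can be illustrated with a short sketch. It assumes plain binary weighting of MSB/LSB (no Gray coding). The order in which the three drivers are enabled is also an assumption, since the figure is not reproduced here.

```python
def pam4_thermometer(msb, lsb):
    """Convert a PAM-4 symbol (MSB, LSB) into three thermometer bits.

    Bit k is 1 when the symbol level exceeds k, so the number of enabled
    drivers equals the level (0 to 3).
    """
    level = 2 * msb + lsb
    return tuple(int(level > k) for k in range(3))

# All four PAM-4 levels, lowest to highest.
codes = [pam4_thermometer(m, l) for m in (0, 1) for l in (0, 1)]
```

Since each thermometer bit drives an identical driver, the output level depends only on how many drivers are enabled, which is what keeps the three eye spacings uniform.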

## 3.4 QUARTER-RATE CLOCKING

### 3.4.1 Clocking Architecture

Voltage-mode drivers are known to consume less power than current-mode logic (CML) drivers [13]. However, due to several disadvantages compared to CML, voltage-mode drivers are still less commonly used in ultra-high-speed interfaces [14], [15], [16], [17], [18], [19].

One major limitation is simultaneous switching output (SSO). When multiple drivers switch on and off simultaneously, VDD drops while VSS rises, leading to IR drop and ground bounce. Rapid current variations, coupled with the inductance of the package and PCB, cause power integrity (PI) issues, resulting in signal jitter and timing variations.

Another challenge is impedance control. Unlike CML, which inherently facilitates impedance matching, voltage-mode drivers require careful tuning of both the driver on-resistance and the resistors placed in series with the voltage source. Typically, impedance matching is achieved by performing ZQ calibration to determine the appropriate driver strength code.

Lastly, voltage-mode drivers require rail-to-rail swing. To minimize driver Ron resistance, the driver gate voltage must operate at full swing, which can be challenging in high-speed interfaces.

Despite these drawbacks, DRAM interfaces have adopted single-ended voltage-mode drivers to improve bandwidth efficiency while reducing power consumption. This is achieved by increasing both the number of pins and data rate. The serializer clock also operates in full swing.

To select a clocking architecture capable of stable full swing operation, the bandwidth of CMOS buffers is evaluated, as shown in Fig. 3-25(a). In digital circuit design, a fan-out (FO) of 3–4 is typically considered optimal in terms of power, delay, and stage count. However, for high-speed clocking, the FO-2 configuration was specifically analyzed. A single-bit pulse was applied, and after 10 stages, the pulse width at half-VDD was measured to estimate the bandwidth.

Fig. 3-25(b) presents the simulated results, normalized to the input pulse width, indicating that pulse-width distortion begins to propagate at a clock frequency of approximately 13 GHz. Based on this, a quarter-rate clocking architecture, which requires only a 10 GHz clock for 40-Gbaud operation, is selected to support PAM-4 operation at 80-Gb/s and beyond.



Fig. 3-25. (a) Test bench for estimating the bandwidth of a CMOS buffer and (b) simulation results of input-to-output pulse width distortion.
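The choice of clocking ratio follows from simple arithmetic, sketched below. The helper name `clock_frequency_ghz` is hypothetical, and the 13 GHz limit is the approximate value read from Fig. 3-25(b).

```python
def clock_frequency_ghz(data_rate_gbps, bits_per_symbol=2, phases=4):
    """Clock frequency needed by a multi-phase serializer.

    phases = 1 (full-rate), 2 (half-rate) or 4 (quarter-rate);
    bits_per_symbol = 2 for PAM-4.
    """
    baud_rate = data_rate_gbps / bits_per_symbol
    return baud_rate / phases

BUFFER_LIMIT_GHZ = 13.0  # approximate onset of pulse-width distortion

half_rate_80g = clock_frequency_ghz(80, phases=2)      # 20 GHz, above the limit
quarter_rate_80g = clock_frequency_ghz(80, phases=4)   # 10 GHz, below the limit
quarter_rate_100g = clock_frequency_ghz(100, phases=4) # 12.5 GHz, just below
```

At 100-Gb/s the quarter-rate clock runs at 12.5 GHz, the frequency used for the DCC/QEC simulations, and still sits under the estimated buffer limit.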

### 3.4.2 Clock Distribution

Fig. 3-26 describes the clock distribution and sub-circuits used in the TX. The external differential clock is received with a coupling capacitor and a 50  $\Omega$  termination resistor, followed by a 4-stage active PPF to generate the quadrature clock. The quadrature clock is then processed through a DCDL, which performs DCC and QEC before being buffered and fed into the 4:1 MUX. To generate the eighth-rate clock, it passes through a clock divider composed of two latches before being distributed to the 8:4 FFE MUX, re-timer, PRBS generator, and transition encoder.

Fig. 3-27 presents the simulated operating range of the active PPF used to generate the quadrature clock. The active PPF operates as an inverter-based oscillator that frequency locks to the injected clock and generates four quadrature phase clocks. Inside the PPF cells, the clocks at each node undergo constructive interference, reducing quadrature phase error.



Fig. 3-26. Block diagram of the clock distribution and sub-circuits.



Fig. 3-27. Simulated quadrature clock phase difference of 4-stage PPF.



Fig. 3-28. Simulated operating range of (a) DCC and (b) QEC DCDL with 5-bit DCC/QEC control code.

The frequency lock range varies depending on the number of stages [20]. In this design, a 4-stage PPF is implemented, and simulations confirmed an operating range from 6.5 GHz to 18 GHz.

Fig. 3-28 presents the simulation results demonstrating the capability of the DCC/QEC DCDL for correcting duty cycle errors and quadrature phase errors in the quarter-rate clocking architecture. As seen in Fig. 3-26, the DCC DCDL operates with a 5-bit code, where increasing CODEDCC[4:0] weakens the pull-up strength of the first inverter and strengthens its pull-down strength, increasing the clock duty cycle. In contrast, the QEC DCDL adjusts the phase delay by reducing both pull-up and pull-down strengths as CODEQEC[4:0] increases. Simulations at 12.5 GHz confirm that the DCC DCDL can adjust the duty cycle by approximately 13% and the QEC DCDL can adjust the phase delay by approximately 15 ps.

## 3.5 MEASUREMENT RESULTS

The proposed TX is fabricated using a 28-nm bulk CMOS technology, with the die photograph shown in Fig. 3-29(a). The core area of the TX is 0.045 mm<sup>2</sup>. Fig. 3-29(b) presents the measurement setup, where an MP1800A serves as the clock source, providing a differential clock to the chip via RF probes. The TX output is measured using an RF probe connected to an MSOV334A oscilloscope.

Fig. 3-30 illustrates the waveform obtained by setting the quadrature clock pattern as the data pattern and verifying it at 72-Gb/s with DCC/QEC calibration. The quadrature phase error is reduced from 3.6 ps to 0.3 ps through the DCC/QEC DCDL.



Fig. 3-29. (a) Die photograph. (b) Measurement setup.



Fig. 3-30. Measured waveform using quadrature clock patterns at 36-Gbaud: (a) before and (b) after DCC/QEC calibration.

Fig. 3-31 presents the TX's PAM-4 signal transmission, where ZQ calibration is performed to maintain 50 $\Omega$ impedance matching between the driver Ron and the termination resistor [21]. This impedance control yields an RLM of 0.99, a critical metric for vertical margin in PAM-4 signaling. According to the IEEE 802.3 standard, an RLM of at least 0.96 is required for PAM-4 operation [8].



Fig. 3-31. Measured PAM-4 RLM at 20-Gb/s with PAM-4 ZQ calibration used in [21].
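For reference, RLM can be computed from the four measured levels as sketched below. The sketch uses the level-separation-mismatch formulation commonly quoted from IEEE 802.3, which is assumed here to match the definition used in this work. The example voltages are hypothetical, not measured data.

```python
def rlm(levels):
    """Level separation mismatch ratio of a PAM-4 signal.

    levels: the four mean signal levels in any order (volts).
    Returns 1.0 for perfectly uniform spacing.
    """
    v0, v1, v2, v3 = sorted(levels)
    v_mid = (v0 + v3) / 2.0
    es1 = (v1 - v_mid) / (v0 - v_mid)   # normalized lower inner level
    es2 = (v2 - v_mid) / (v3 - v_mid)   # normalized upper inner level
    return min(3 * es1, 3 * es2, 2 - 3 * es1, 2 - 3 * es2)

ideal = rlm([0.0, 0.1, 0.2, 0.3])       # uniform 0.3 V swing
skewed = rlm([0.0, 0.09, 0.2, 0.3])     # one inner level 10 mV low
```

In this formulation a single inner level that is off by 10 mV in a 0.3 V swing already drops RLM to about 0.8, which is why driver impedance calibration matters for low-swing PAM-4.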



Fig. 3-32. (a) Measured pulse responses and (b) insertion losses derived from measured pulse responses at 40-Gbaud without ESD and 36-Gbaud with ESD.

Fig. 3-32 presents the pulse responses of the TX with and without ESD protection. The channel insertion loss derived from them through FFT analysis is -4.79 dB at 20 GHz without ESD and -6.14 dB at 18 GHz with ESD.
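One way to derive insertion loss from a pulse response is to divide the spectrum of the measured pulse by that of an ideal 1-UI pulse. The sketch below illustrates this on synthetic data only: the channel is an assumed single-pole low-pass with a 20 GHz corner, not the measured TX output, and all names and parameters are hypothetical.

```python
import numpy as np

def insertion_loss_db(pulse_in, pulse_out, dt, freq):
    """Loss in dB at the FFT bin nearest to freq, from two sampled pulses."""
    spec_in = np.fft.rfft(pulse_in)
    spec_out = np.fft.rfft(pulse_out)
    freqs = np.fft.rfftfreq(len(pulse_in), dt)
    k = int(np.argmin(np.abs(freqs - freq)))
    return 20.0 * np.log10(np.abs(spec_out[k]) / np.abs(spec_in[k]))

BAUD = 40e9                      # 40-Gbaud symbol rate
SAMPLES_PER_UI = 32
dt = 1.0 / (BAUD * SAMPLES_PER_UI)
n = 64 * SAMPLES_PER_UI          # 64-UI capture window

# Ideal 1-UI rectangular pulse.
pulse_in = np.zeros(n)
pulse_in[SAMPLES_PER_UI:2 * SAMPLES_PER_UI] = 1.0

# Synthetic channel: first-order low-pass with an assumed 20 GHz corner.
F3DB = 20e9
alpha = 1.0 - np.exp(-2.0 * np.pi * F3DB * dt)
pulse_out = np.zeros(n)
acc = 0.0
for i, x in enumerate(pulse_in):
    acc += alpha * (x - acc)
    pulse_out[i] = acc

loss_dc = insertion_loss_db(pulse_in, pulse_out, dt, 0.0)
loss_nyquist = insertion_loss_db(pulse_in, pulse_out, dt, BAUD / 2)
```

The ratio is only meaningful away from the spectral nulls of the ideal pulse, which fall at multiples of the baud rate. The Nyquist frequency lies well clear of the first null.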

Fig. 3-33 presents the eye diagrams of the TX with and without ESD protection under a PRBS-31 pattern, evaluating the impact of PWPE activation. Without PWPE, a sufficiently open eye diagram could not be achieved, whereas enabling PWPE effectively improves the eye opening, demonstrating its capability to increase data bandwidth without reducing signal swing. The TX without ESD protection achieves an 80-Gb/s eye diagram, whereas the ESD-protected TX exhibits parasitic capacitance effects, limiting the measured eye diagram to 72-Gb/s. The measured eye openings are 27 mV / 0.16 UI at 80-Gb/s and 31 mV / 0.17 UI at 72-Gb/s.

Fig. 3-33. Measured 80-Gb/s eye diagram without ESD: with PWPE (a) off and (b) on. Measured 72-Gb/s eye diagram with ESD: with PWPE (c) off and (d) on.

Fig. 3-34 presents the pulse response measurements after connecting 80-inch and 120-inch cables. The insertion loss at the Nyquist frequency (14 GHz) for a 56-Gb/s PAM-4 data rate is -7.3 dB for the 80-inch cable and -11.4 dB for the 120-inch cable.



Fig. 3-34. (a) Measured pulse responses and (b) insertion losses derived from measured pulse responses at 28-Gbaud with 80-inch and 120-inch cables.

Fig. 3-35(a) and (b) show the eye diagrams when both FFE and PWPE are disabled, resulting in completely closed eyes. Fig. 3-35(c) and (e) demonstrate that enabling only FFE opens the eye with the 80-inch cable. With the 120-inch cable, however, where the channel loss is higher, the LVSTL driver experiences excessive swing reduction, leading to insufficient SNR. Fig. 3-35(d) and (f) show that by reducing the FFE de-emphasis coefficient and utilizing PWPE, an adequate signal swing is maintained while still achieving channel equalization, successfully opening the eye diagram.

Fig. 3-36 presents the power breakdown measured at an 80-Gb/s data rate, detailing power consumption by each circuit block based on simulation results. The 4:1 MUX consumed the highest power at 133.9 mW, followed by the clock distribution network at 43 mW, the 8:4 FFE MUX at 33.5 mW, and the pattern generator and encoders at 31.7 mW.

Table 3-2 provides a performance comparison with state-of-the-art TXs, demonstrating that the proposed TX achieves competitive data rates per pin with reasonable energy efficiency.



Fig. 3-35. Measured 56-Gb/s PAM-4 eye diagrams: (a) 80-inch and (b) 120-inch cables without FFE and PWPE, (c) 80-inch cable with FFE only, (d) 80-inch cable with FFE & PWPE, (e) 120-inch cable with FFE only, and (f) 120-inch cable with FFE & PWPE.



Fig. 3-36. Measured power breakdown based on simulation at 80-Gb/s.

Table 3-2
Performance Summary Table of the State-of-the-Art TX

|                                 | [14]                  | [21]                  | [16]                  | [22]                  | [23]                  | [24]                  | This Work             |
|---------------------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|
| Technology                      | 10nm                  | 65nm                  | 14nm                  | 40nm                  | 40nm                  | 28nm                  | 28nm                  |
| Supply (V)                      | 0.85/1.0/1.5          | 1.0/0.6               | 1.2                   | 1.0/1.2               | 1.0/0.6               | 1.2                   | 1.0/0.6               |
| Data rate per pin<br>(Gb/s/pin) | 112                   | 28                    | 64                    | 56                    | 32                    | 60                    | 80                    |
| Signaling                       | Differential<br>PAM-4 | Single-ended<br>PAM-4 | Differential<br>PAM-4 | Differential<br>PAM-4 | Single-ended<br>PAM-4 | Single-ended<br>PAM-4 | Single-ended<br>PAM-4 |
| Driver type                     | CML                   | VM<br>(LVSTL)         | CML<br>(tailless)     | VM<br>(SST)           | VM<br>(PN-over-NP)    | VM<br>(LVSTL)         | VM<br>(LVSTL)         |
| Equalization                    | 8-tap                 | 2-tap<br>asymmetric   | 3-tap                 | 4-tap                 | Capacitive<br>Peaking | 2-tap                 | 4-tap +<br>PWPE       |
| Clock source                    | On-chip PLL           | External              | External              | On-chip PLL           | External              | External              | External              |
| Output swing (V)                | 1.0                   | 0.3                   | 1.0                   | 1.0                   | 0.3                   | 0.6                   | 0.3                   |
| RLM                             | 0.99                  | 0.99                  | 0.99                  | 0.98                  | N/A                   | 0.98                  | 0.99                  |
| Energy efficiency<br>(pJ/bit)   | 1.88                  | 0.64                  | 1.3                   | 3.89                  | 0.51                  | 1.67                  | 3.1                   |
| Area (mm<sup>2</sup>)           | 0.088                 | 0.03                  | 0.048                 | 0.56                  | 0.005                 | N/A                   | 0.045                 |

## 3.6 CONCLUSION

A single-ended voltage-mode 80-Gb/s PAM-4 TX is demonstrated using a 28-nm CMOS process. The proposed TX incorporates two types of equalizers, a 4-tap reconfigurable FFE and PWPE, to enhance channel bandwidth and signal integrity. To implement high-speed pre-emphasis, a transition encoder is employed in the front-end datapath to encode transition information, mitigating issues such as internal ISI and data skew with the main data, which commonly occur in high-speed pre-emphasis operation. By combining PWPE and 4-tap segmented FFE, the TX minimizes SNR degradation in low-swing driver topologies. This approach is expected to be highly effective for high-speed memory interfaces utilizing LVSTL-based TXs.

# CHAPTER 4 Design of a 4x100-Gb/s/pin POD Transmitter With Quadrature Clock Error Corrector Using Pre-Coded-Data Pattern

In a quarter-rate clocking architecture, mismatches in the clock distribution process induce duty cycle errors and quadrature phase errors in the quadrature clock. These errors not only affect the timing margin during parallel-to-serial data conversion, but more critically, they degrade the eye quality of the final serialized data, reducing the horizontal eye opening. The impact of these errors is particularly pronounced in DRAM interfaces employing single-ended signaling, where the lack of common-mode noise rejection makes them more susceptible to signal degradation compared to differential signaling applications.

To address this issue, this work proposes a pre-coded data-based detection and correction technique for single-ended signaling TX, which identifies clock duty cycle errors and quadrature phase errors projected onto the TX output. Unlike conventional correction methods that rely on internal clock nodes, the proposed approach offers broader correction coverage. Additionally, this work analyzes the limitations of clock pattern-based correction techniques traditionally used in differential signaling drivers, highlighting their challenges when applied to single-ended signaling drivers, and presents a method to overcome these constraints.

## 4.1 DCC/QEC PRIOR ARTS

A widely used and straightforward method for DCC and QEC in quarter-rate clocking architectures is illustrated in Fig. 4-1, where previous works [22], [25], [26] have implemented low-pass filtering to extract clock characteristics, which are then compared using a comparator for correction. To remove comparator input offset, an auto-zeroing comparator is typically used. However, this approach relies on a key assumption that the clocks used for duty cycle comparison, such as in-phase clock (CK0) and reverse in-phase clock (CK180) or quadrature-phase clock (CK90) and reverse quadrature-phase clock (CK270), maintain a strict correlation in their duty cycles. For example, if CK0 has a duty cycle of 45%, then CK180 must have 55%, ensuring their average remains 50% for proper correction. If this correlation is violated, such as when CK0 is at 46% and CK180 is at 50%, the DCC will adjust both to 48%, resulting in incorrect duty cycle correction.



Fig. 4-1. DCC/QEC prior art in ref [22].
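The limitation of complementary-clock comparison can be illustrated with a deliberately simplified model. It assumes that the loop shifts the two duty cycles in opposite directions until they are equal, so that both settle at their initial mean. The function name is hypothetical and the model ignores comparator offset.

```python
def complementary_dcc(duty_ck, duty_ckb):
    """Settled duty cycle (%) and residual error (%) of a DCC loop that
    only equalizes a clock and its complement."""
    settled = (duty_ck + duty_ckb) / 2.0
    return settled, settled - 50.0

correlated = complementary_dcc(45.0, 55.0)    # duties sum to 100%
uncorrelated = complementary_dcc(46.0, 50.0)  # correlation violated
```

In the first case both clocks settle at 50%. In the second they settle at 48%, leaving a 2% error that the loop cannot observe because the two clocks already match.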

To prevent this, another method involves comparing the duty cycle of each clock to a half-VDD reference. However, this approach is susceptible to process variations and comparator offset errors, as half-VDD is typically generated using a resistor divider, which is prone to mismatch. Additionally, any clock load mismatch, deterministic capacitive coupling, or long-term variations such as negative bias temperature instability (NBTI) and positive bias temperature instability (PBTI) can introduce residual correction errors in the final output. The XOR-based duty cycle detection also faces challenges due to limited operating bandwidth and asymmetric rising/falling transitions, making QEC unreliable when DCC does not operate correctly.

Another approach, illustrated in Fig. 4-2, involves pre-defining half-rate or full-rate clock patterns, such as '1100' or '0011' for half-rate and '1010' or '0101' for full-rate, and detecting duty cycle or quadrature phase errors by analyzing the output node [27]. This method offers broader correction coverage as it considers all clock distortions along the entire clock and data path, from the clock source to the TX output. However, if the duty cycle correlation between complementary clocks is violated, the correction accuracy is compromised.



Fig. 4-2. Another DCC/QEC prior art in ref [5].

Interestingly, while clock distortions remain in differential drivers, their differential output eye diagram appears less affected due to the cancellation of duty cycle errors, a phenomenon not observed in single-ended drivers. Before analyzing this behavior, it is necessary to examine how quadrature clock phase error affects data multiplexing in a 4:1 MUX.

Fig. 4-3 illustrates the block diagram of a 1-UI pulse generator and 4:1 MUX, along with a timing diagram demonstrating the impact of quadrature clock phase errors. The 1-UI clock pulse is generated through an AND operation between quadrature clock pairs, such as CK270 and CK0 or CK0 and CK90. The clock pulses with 25% duty cycle drive the parallel data D4[0:3] into the 4:1 MUX, generating the serialized OUT. The transition timing of OUT is determined by whether OUT transitions from 0 to 1 or 1 to 0, as indicated by the colored clock edges. When quadrature phase errors exist, the serialized OUT eye diagram exhibits deterministic jitter and a reduced eye opening.

Fig. 4-3. Block diagram of (a) pulse generator and (b) 4:1 MUX. (c) Timing diagram of the 4:1 MUX with quadrature clock phase error.
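The effect of a phase error on the 1-UI pulses can be reproduced with a small behavioral model. The sketch below assumes ideal 50% duty cycle clocks with a 4-UI period and forms each pulse as the AND of adjacent quadrature phases. The function names and the 0.1 UI error are illustrative.

```python
import numpy as np

def clock(t, rise_ui, duty=0.5, period_ui=4.0):
    """0/1 clock that rises at rise_ui and stays high for duty * period."""
    return (((t - rise_ui) % period_ui) < duty * period_ui).astype(int)

def pulse_widths(phase_err_ui=(0.0, 0.0, 0.0, 0.0), n=4000):
    """Widths (in UI) of the four 1-UI pulses over one clock period.

    phase_err_ui: timing error of CK0, CK90, CK180, CK270 in UI.
    Pulse k is the AND of clock k-1 and clock k (pulse 0 = CK270 & CK0).
    """
    t = np.arange(n) * (4.0 / n)
    nominal_rise = (0.0, 1.0, 2.0, 3.0)
    ck = [clock(t, r + e) for r, e in zip(nominal_rise, phase_err_ui)]
    return [float(np.sum(ck[k - 1] & ck[k])) * 4.0 / n for k in range(4)]

ideal = pulse_widths()
ck0_late = pulse_widths((0.1, 0.0, 0.0, 0.0))  # CK0 delayed by 0.1 UI
```

Delaying CK0 shortens the pulse it opens and lengthens the pulse it closes by the same amount, which appears at OUT as deterministic jitter on the corresponding bit boundaries.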

Fig. 4-4 presents the timing diagram of the driver output when the average duty cycle of the complementary clocks falls below 50%, resulting in incorrect DCC operation. In differential drivers, although CK0 and CK180 are corrected to a lower-than-ideal duty cycle, their differential output effectively cancels the duty cycle error, resulting in a relatively clean eye diagram without excessive jitter. However, in single-ended drivers, the clock duty cycle error is directly projected onto the output, significantly reducing the eye opening.

Fig. 4-4. Timing diagram of the driver outputs in incorrect DCC operation.

Fig. 4-5 demonstrates a scenario where the clock duty cycle remains ideal at 50%, but a phase error persists. Using the '1010' and '0101' full-rate clock patterns, the QEC method of [5] appears to successfully restore the eye in the differential output. In the single-ended output, however, residual jitter and phase-induced eye distortion remain visible, indicating incomplete correction.

Fig. 4-5. Timing diagram of the driver outputs in incorrect QEC operation.

Fig. 4-6 presents the simulated eye diagrams of differential and single-ended outputs for the two previously mentioned cases. As discussed earlier, even if distortion remains in the clock, quadrature phase errors are not observed in the differential output, whereas they are clearly visible in the single-ended output.

Fig. 4-6. Simulated eye diagrams of (a) differential output and (b) single-ended output with remaining duty cycle error. Simulated eye diagrams of (c) differential output and (d) single-ended output with remaining quadrature error.

Specific data patterns are used instead of simple clock patterns (e.g., 1010 or 0101) because such alternating sequences make it difficult to detect quadrature phase errors. As noted in [5], the QEC method based on 1010 and 0101 patterns at the output node relies on the assumption that the complementary clock phases (CK0 and CK180, or CK90 and CK270) always shift in the same direction. Fig. 4-7 presents a timing diagram illustrating how the output average value changes under 1010 and 0101 patterns when a clock phase error occurs. When CK0 undergoes a phase shift to the right, the average voltage of the 1010 pattern decreases, while that of the 0101 pattern increases. However, if CK180 also experiences a rightward phase shift, the same output behavior is observed: the 1010 average decreases and the 0101 average increases. Therefore, in datapaths such as a 4:1 MUX that lack cross-coupled inverters, a phase shift occurring independently on either CK0 or CK180 cannot be uniquely identified using 1010 or 0101 patterns, as both cases yield identical average output trends. To cover scenarios in which phase errors occur independently in individual clocks, specific data patterns capable of defining quadrature clock phase differences are needed instead of conventional clock patterns for phase correction.



Fig. 4-7. Timing diagram illustrating changes in output average values of 1010 and 0101 patterns under clock phase shift conditions.
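The ambiguity can be made concrete with a first-order model. The sketch assumes 50% duty cycle clocks, so that pulse k opens on the rising edge of clock k and closes on the falling edge of clock k-1, and it assumes the output is high only while a pulse carrying a '1' is active. Names and the 0.1 UI shift are hypothetical.

```python
def pulse_widths(edge_err):
    """Width (UI) of each 1-UI pulse given per-clock timing errors (UI)
    for CK0, CK90, CK180, CK270."""
    return [1.0 + edge_err[k - 1] - edge_err[k] for k in range(4)]

def pattern_average(pattern, edge_err):
    """Normalized average output level of a repeating 4-bit pattern."""
    widths = pulse_widths(edge_err)
    return sum(int(bit) * w for bit, w in zip(pattern, widths)) / 4.0

DELTA = 0.1
CK0_SHIFT = (DELTA, 0.0, 0.0, 0.0)     # only CK0 delayed
CK180_SHIFT = (0.0, 0.0, DELTA, 0.0)   # only CK180 delayed

avg_1010 = [pattern_average("1010", e) for e in (CK0_SHIFT, CK180_SHIFT)]
avg_0101 = [pattern_average("0101", e) for e in (CK0_SHIFT, CK180_SHIFT)]
```

Both error cases give the same pair of averages (about 0.475 for 1010 and 0.525 for 0101), so the two patterns alone cannot tell which clock has moved.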

## 4.2 PRE-CODED-DATA PATTERN BASED DCC/QEC

To address the issues associated with conventional DCC/QEC methods, we propose a pre-coded data pattern-based DCC/QEC technique. Unlike traditional approaches that rely on clock patterns, this method utilizes pre-coded data patterns to detect duty cycle distortion and quadrature phase error directly at the TX output. As a result, this technique provides full coverage of all clock distortions and is applicable to single-ended drivers, overcoming the limitations of existing correction methods.

### 4.2.1 Principle of DCC

To correct duty cycle distortion, the proposed approach does not compare individual clock phases (CK0 to CK270) but instead establishes an internal reference to detect and correct the duty cycle error. As illustrated in the timing diagram of Fig. 4-8, the fundamental concept can be explained using CK0 as an example.

When Pattern 1 "11111100" is applied to the TX, the output passes through a low-pass filter (LPF), theoretically yielding OUTLPF with an average voltage corresponding to a 75% duty cycle. This voltage is then converted into a digital code, CODE\_PTRN1, by an ADC. The common-mode level of the output depends on the termination and may require additional correction, which is detailed in Section 4.2.3; this section presents the concept in a simplified manner for intuitive understanding.

Fig. 4-8. Timing diagram for duty cycle error detection principle.

According to the 4:1 MUX operation principle described in Fig. 4-3, Pattern 1 transitions to '1' on the rising edge of CK0 and transitions to '0' on its falling edge. Similarly, when Pattern 2 "00001100" is applied, the output waveform exhibits a 25% duty cycle, and the corresponding average voltage is converted into CODE\_PTRN2. This pattern, like Pattern 1, transitions to '1' on the rising edge of CK0 and to '0' on its falling edge.

CODE\_D is obtained by subtracting CODE\_PTRN2 from CODE\_PTRN1 and corresponds to waveform ③. To further refine the duty cycle correction, an additional calculation is performed by subtracting CODE\_PTRN2 from CODE\_D, yielding CODE\_DD, which corresponds to waveform ⑤.

For clarity, waveform ④ is presented as a time-shifted version of waveform ② and serves only to aid in understanding. When the duty cycle of CK0 increases, waveform ④ increases proportionally, while waveform ⑤ decreases accordingly. Since the sum of waveforms ④ and ⑤ remains constant, the DCC is performed to match the two values. The same process is applied to CK90, CK180, and CK270 by shifting the data patterns, ensuring DCC across all quadrature clock phases.

### 4.2.2 Principle of QEC

Fig. 4-9 describes the principle of quadrature phase error detection through a timing diagram. To perform QEC, Pattern 3 "11001100" is applied, generating waveform ⑥, which is then digitized into CODE\_PTRN3\_CK0. Similarly, after applying Pattern 4 "01000100", the resulting waveform ⑦ is converted into CODE\_PTRN4\_CK0.

By subtracting CODE\_PTRN4\_CK0 from CODE\_PTRN3\_CK0, waveform ⑧ is obtained, which is defined as CODE\_Q\_CK0. The same process is repeated for CK90, CK180, and CK270, yielding CODE\_Q\_CK90, CODE\_Q\_CK180, and CODE\_Q\_CK270, respectively.

The average value of the four CODE\_Q values is defined as CODE\_Q\_AVG, representing the ideal phase relationship among the quadrature clocks. Each CODE\_Q\_CK corresponds to the deviation in phase for each quadrature clock. QEC is performed by adjusting each quadrature clock phase so that CODE\_Q\_CK matches CODE\_Q\_AVG, ensuring optimal quadrature phase alignment.



Fig. 4-9. Timing diagram for quadrature phase error detection principle.
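The nominal LPF outputs of the four pre-coded patterns follow directly from their ones density. The sketch below computes these ideal averages, normalized to the driver swing and ignoring any duty cycle or phase error. The helper name is hypothetical.

```python
def ideal_average(pattern):
    """Fraction of '1' bits, i.e. the normalized LPF output without errors."""
    return pattern.count("1") / len(pattern)

PATTERNS = {
    "Pattern 1": "11111100",
    "Pattern 2": "00001100",
    "Pattern 3": "11001100",
    "Pattern 4": "01000100",
}
avg = {name: ideal_average(bits) for name, bits in PATTERNS.items()}

# Quantities formed by the DCC and QEC arithmetic in the error-free case.
code_dd_nominal = avg["Pattern 1"] - 2 * avg["Pattern 2"]  # compared with Pattern 2
code_q_nominal = avg["Pattern 3"] - avg["Pattern 4"]       # per-phase QEC value
```

Both derived quantities equal 0.25 when no error is present, so any deviation from that value at the ADC output can be attributed to duty cycle or phase error rather than to the patterns themselves.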

### 4.2.3 DCC/QEC Implementation

Fig. 4-10 presents the overall loop of the proposed DCC/QEC system in a block diagram. A finite state machine (FSM) generates pre-coded data patterns, which are transmitted to the TX for serialization and then OUT is delivered to the RX side. During this process, the low-pass filtered OUTLPF is fed into a 10-bit ADC, converting it into a digital code. Based on the principles described earlier, the DCC/QEC DCDL then adjusts the duty cycle and phase errors of the quadrature clock.

Fig. 4-10. Block diagram of the overall loop of the proposed DCC/QEC.

Fig. 4-11 presents the detailed flowchart of the DCC algorithm, along with the required clock patterns for each step. The first step in DCC execution is determining the low voltage level of the driver swing. Since POD drivers do not generate full-swing outputs due to the presence of pull-up termination resistors, the initial data pattern 00000000 is applied, and the corresponding digital code is stored as CODE\_0. Next, using the pre-coded data patterns for each clock phase, the values corresponding to Pattern 1 are stored in CODE\_PTRN1, while those for Pattern 2 are stored in CODE\_PTRN2.

The OUTLPF voltage for CODE\_PTRN1 can be expressed as

$$V_{CODE\_PTRN1} = V_{CODE0} + (VDD - V_{CODE0}) \times (0.75 + 0.5 \times \Delta duty). \tag{4.1}$$

Similarly, the OUTLPF voltage for CODE\_PTRN2 is given by

$$V_{CODE\_PTRN2} = V_{CODE0} + (VDD - V_{CODE0}) \times (0.25 + 0.5 \times \Delta duty). \tag{4.2}$$

CODE\_DD is defined as

$$CODE\_DD = CODE\_PTRN1 - 2 \times CODE\_PTRN2 + 2 \times CODE0. \tag{4.3}$$

The corresponding voltage representation of CODE\_DD is

$$V_{CODE\_DD} = V_{CODE0} + (VDD - V_{CODE0}) \times (0.25 - 0.5 \times \Delta duty). \tag{4.4}$$



Fig. 4-11. Flow chart of DCC.

Since Equations (4.2) and (4.4) contain opposite signs for  $\Delta$duty, the FSM adjusts the DCC DCDL code until the two CODE values converge.
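Equations (4.1) to (4.4) can be exercised with a small behavioral model. The sketch below works in voltages rather than ADC codes and uses a bang-bang update with a fixed duty step per DCDL code. The supply, low level, step size, and loop structure are illustrative assumptions, not the implemented FSM.

```python
VDD = 1.0       # hypothetical supply (V)
V_CODE0 = 0.4   # hypothetical POD low level (V)
SWING = VDD - V_CODE0

def v_ptrn1(delta_duty):
    return V_CODE0 + SWING * (0.75 + 0.5 * delta_duty)   # Eq. (4.1)

def v_ptrn2(delta_duty):
    return V_CODE0 + SWING * (0.25 + 0.5 * delta_duty)   # Eq. (4.2)

def v_dd(delta_duty):
    # Eq. (4.3) applied to voltages; should reproduce Eq. (4.4).
    return v_ptrn1(delta_duty) - 2 * v_ptrn2(delta_duty) + 2 * V_CODE0

def run_dcc(delta_duty, step=0.005, max_iter=100):
    """Step the duty error until V_PTRN2 and V_DD agree within half a step."""
    for _ in range(max_iter):
        error = v_ptrn2(delta_duty) - v_dd(delta_duty)   # equals SWING * delta_duty
        if abs(error) <= SWING * step / 2:
            break
        delta_duty -= step if error > 0 else -step
    return delta_duty

residual_pos = run_dcc(0.04)    # clock initially 4% too wide
residual_neg = run_dcc(-0.03)   # clock initially 3% too narrow
```

The difference between the two compared values is proportional to  $\Delta$duty and independent of V_CODE0, so the comparison works without knowing the absolute low level of the POD output.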

Fig. 4-12 describes the detailed flowchart of QEC and the patterns (Pattern 3 and Pattern 4) required for QEC execution. When QEC is initiated, CODE\_PTRN3 and CODE\_PTRN4 are generated from Pattern 3 and Pattern 4, respectively, and this process is repeated for CK0 through CK270. The difference between CODE\_PTRN3 and CODE\_PTRN4 for each CK phase yields CODE\_Q, which can be expressed as

$$V_{CODE\_Q} = V_{CODE0} + (VDD - V_{CODE0}) \times (0.25 + \Delta phase). \tag{4.5}$$

Where  $\Delta$ phase represents the phase error as the difference between the rising edges of the current clock phase and the next quadrature clock phase. The sum of  $\Delta$ phase across CK0 to CK270 must be zero.

The corresponding voltage for the average CODE\_Q value is

$$V_{CODE\_Q\_AVG} = V_{CODE0} + (VDD - V_{CODE0}) \times 0.25. \tag{4.6}$$

Each CODE\_Q value for each CK phase is compared against CODE\_Q\_AVG, and the QEC DCDL code is adjusted accordingly to perform QEC.



Fig. 4-12. Flow chart of QEC.
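The QEC detection step of Equations (4.5) and (4.6) can be sketched in the same way. The model below assumes that  $\Delta$phase for each clock is the difference between the edge errors of the next phase and the current phase, normalized to the same period as the 0.25 term. The voltages and edge errors are hypothetical.

```python
VDD = 1.0       # hypothetical supply (V)
V_CODE0 = 0.4   # hypothetical POD low level (V)
SWING = VDD - V_CODE0

def v_code_q(delta_phase):
    return V_CODE0 + SWING * (0.25 + delta_phase)        # Eq. (4.5)

def qec_detect(edge_err):
    """Return (V_CODE_Q_AVG, per-phase deviation) for the edge errors of
    CK0, CK90, CK180, CK270."""
    delta_phase = [edge_err[(k + 1) % 4] - edge_err[k] for k in range(4)]
    v_q = [v_code_q(d) for d in delta_phase]
    v_avg = sum(v_q) / 4.0                               # Eq. (4.6)
    return v_avg, [v - v_avg for v in v_q]

# Example: CK90 late by 1% of the period, CK270 early by 0.5%.
v_avg, deviation = qec_detect((0.0, 0.01, 0.0, -0.005))
```

Because the four  $\Delta$phase terms telescope to zero, the average always equals the error-free value, which lets it serve as the reference without any external calibration.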

### 4.2.4 Analog-Digital Co-simulation

The DCC calibration sequence is illustrated in Fig. 4-13. First, CODE0 is applied to observe the V<sub>OL</sub> level. Then, a dedicated pattern is injected for CK0 duty cycle detection, followed by a similar pattern for CK180. Based on the detection results, the DCDL codes for CK0 and CK180 are adjusted. This process is repeated until the DCC calibration for CK0 and CK180 is completed. Afterward, the same DCC procedure is performed for CK90 and CK270.

Fig. 4-14 shows the QEC calibration sequence. Phase detection patterns are first applied to CK0 to store the CK0 phase information. Subsequently, the phases of CK90, CK180, and CK270 are sequentially measured. Once the phase values for all clocks are acquired, a reference code (CODE\_Q\_AVG) is generated, and the QEC DCDL codes are updated accordingly. This sequence is repeated to complete the QEC calibration.

To verify the operation of the DCC/QEC loop, a co-simulation was performed, integrating both the digital and analog components. Figs. 4-15 and 4-16 present the co-simulation results, demonstrating the progressive reduction of clock duty cycle error and quadrature phase error as each stage of the correction loop is executed.



Fig. 4-13. Duty cycle error correction calibration sequence.



Fig. 4-14. Quadrature phase error correction calibration sequence.



Fig. 4-15. Simulation result of DCC loop.



Fig. 4-16. Simulation result of QEC loop.

## 4.3 CIRCUITS IMPLEMENTATION

Fig. 4-17 presents the top-level block diagram of the 4-channel TX. The clock distribution network is designed to support a wide range of data rates by employing a dual-path architecture: a high-frequency (HF) path and a low-frequency (LF) path. The HF path generates HF quadrature-phase clocks using an active PPF and a CML-to-CMOS converter, while the LF path utilizes a two-stage CMOS latch-based frequency divider to generate LF quadrature-phase clocks. The generated clocks are distributed across 4 channels, where each channel drives its respective data path.

Fig. 4-17. Top-level block diagram of the 4-channel TX.

The data path includes a PRBS-31 generator for driver verification and a pre-code generator that delivers DCC/QEC patterns. The 16-bit parallel data is converted into 24-bit parallel data through a 2-bit to 3-bit binary-to-thermometer decoder. The data is then processed sequentially through a re-timer, an 8:4 MUX, and a 4:1 MUX, before being transmitted to the receiver (RX) via the driver output.
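The 16-bit to 24-bit expansion can be illustrated with a short Python sketch: eight 2-bit PAM-4 symbols are each mapped to a 3-bit thermometer code. The bit ordering (least significant symbol first, thermometer bits filled from index 0) and the function names are assumptions made for illustration; the actual ordering in the data path is not stated here.

```python
def bin2therm(sym):
    """Map a 2-bit symbol (0..3) to a 3-bit thermometer code."""
    if not 0 <= sym <= 3:
        raise ValueError("symbol must be in 0..3")
    # Bit i is set when the symbol value exceeds i: 0->000, 1->100, 2->110, 3->111
    return [1 if sym > i else 0 for i in range(3)]


def encode_word(word16):
    """Split a 16-bit word into eight 2-bit symbols and thermometer-encode each."""
    if not 0 <= word16 < (1 << 16):
        raise ValueError("expected a 16-bit value")
    bits = []
    for i in range(8):
        sym = (word16 >> (2 * i)) & 0b11   # assumed LSB-first symbol order
        bits.extend(bin2therm(sym))
    return bits                            # 8 symbols x 3 bits = 24 bits


print(encode_word(0b1110_0100))  # symbols 0, 1, 2, 3 followed by four zeros
```

Each thermometer bit can then drive one equally weighted driver segment, so the number of active segments equals the symbol value.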

To enable DCC/QEC across all 4 channels using a single controller, the output of each channel is sequentially connected to an LPF through a switch. After the correction process is completed for all channels, all switches are disabled to prevent interference.

Fig. 4-18 illustrates the schematic of the 4:1 MUX and the main driver. The 4:1 MUX follows a conventional structure, while the main driver adopts a single-ended stacked POD source-series termination (SST) driver. To minimize power consumption, the number of 4:1 MUX instances is reduced by optimizing the driver segmentation. The FFE is integrated into the driver as a 2-tap de-emphasis equalizer, allowing de-emphasis operation after a three-stage delay without pulse width adjustments. The equalization function is switched on and off by the EQ EN signal. The pull-up/down driver strengths are controlled independently using 4-bit ZQ codes, and the equalizer provides a maximum emphasis gain of 2.2 dB.

Fig. 4-18. Schematics of the 4:1 MUX and single-ended driver with equalizer.
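To give a feel for what a 2.2-dB emphasis gain means in tap weights, the following Python sketch converts the gain into a post-cursor weight. It rests on an assumed textbook definition that is not taken from this design: a 2-tap FIR with normalized taps c0 = 1 - a and c1 = -a, and emphasis gain defined as the ratio of the transition amplitude (c0 + a) to the settled amplitude (c0 - a). The function names are hypothetical.

```python
import math


def emphasis_gain_db(post_tap):
    """Emphasis gain for normalized taps c0 = 1 - a, c1 = -a (0 <= a < 0.5)."""
    return 20 * math.log10(1 / (1 - 2 * post_tap))


def post_tap_for_gain(gain_db):
    """Invert the relation above: post-cursor weight a for a given gain in dB."""
    ratio = 10 ** (gain_db / 20)       # transition amplitude / settled amplitude
    return (1 - 1 / ratio) / 2


a = post_tap_for_gain(2.2)
print(f"post-cursor weight ~ {a:.3f}, check: {emphasis_gain_db(a):.2f} dB")
```

Under this definition, 2.2 dB corresponds to a post-cursor weight of roughly 11% of the total driver strength.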

Fig. 4-19 presents the schematic and simulation results of the DCC/QEC DCDL. The DCC and QEC each use a 6-bit digital delay code to adjust the clock phase. The DCC DCDL allows duty cycle control from 33% to 66% at a 7 GHz clock frequency, while the QEC DCDL enables phase delay adjustments of up to 20 ps.
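The quoted ranges imply an average step size per DCDL code, which the short Python calculation below estimates. It assumes that the 6-bit code covers each range in 63 uniform steps; the real delay curve may be nonlinear, so these are averages only, and the helper name is hypothetical.

```python
def lsb(total_range, n_bits=6):
    """Average step size when a range is covered by 2**n_bits codes."""
    return total_range / ((1 << n_bits) - 1)


period_ps = 1e3 / 7.0                       # 7-GHz clock period in ps
dcc_range_ps = (0.66 - 0.33) * period_ps    # pulse-width range for 33 %..66 % duty
dcc_lsb_ps = lsb(dcc_range_ps)
qec_lsb_ps = lsb(20.0)                      # QEC range of up to 20 ps

print(f"DCC: {dcc_lsb_ps:.3f} ps per code, QEC: {qec_lsb_ps:.3f} ps per code")
```

Under the uniform-step assumption this gives about 0.75 ps per DCC code (about 0.52% of the 7-GHz period) and about 0.32 ps per QEC code.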



Fig. 4-19. (a) Schematic and (b) simulation results of the DCC/QEC DCDL.

## 4.4 MEASUREMENT RESULTS

The proposed TX is fabricated using a 28-nm bulk CMOS process. The output signal is measured using RF probes, cables, an N7010A active termination adapter, and an MSOV334A oscilloscope.

Fig. 4-20 presents the measured results of the quadrature clock patterns at 7 GHz, before and after DCC/QEC, for channels A to D, corresponding to a 28-Gbaud data rate. The expanded view of channel D shows that the quadrature phase error is reduced from 6.8 ps to 0.3 ps, while the duty cycle is corrected from 53.8% to 50.4%.
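To put these numbers in perspective, the following Python calculation expresses the measured errors relative to the unit interval (UI). It uses only the figures quoted above together with the quarter-rate relation (four symbols per clock period) and two bits per PAM-4 symbol; the variable and function names are illustrative.

```python
clock_ghz = 7.0
baud_g = 4 * clock_ghz            # quarter-rate clocking: 28 Gbaud
bit_rate_g = 2 * baud_g           # PAM-4 carries 2 bits per symbol: 56 Gb/s
ui_ps = 1e3 / baud_g              # unit interval in ps
period_ps = 1e3 / clock_ghz       # clock period in ps


def phase_error_ui(err_ps):
    """Quadrature phase error as a fraction of one UI."""
    return err_ps / ui_ps


def duty_error_ps(duty_percent):
    """Pulse-width deviation from an ideal 50 % duty cycle, in ps."""
    return (duty_percent - 50.0) / 100.0 * period_ps


print(f"UI = {ui_ps:.2f} ps")
print(f"phase error: {phase_error_ui(6.8):.3f} UI -> {phase_error_ui(0.3):.4f} UI")
print(f"duty error: {duty_error_ps(53.8):.2f} ps -> {duty_error_ps(50.4):.2f} ps")
```

The 6.8-ps phase error corresponds to about 0.19 UI at 28 Gbaud, which is why it visibly narrows the eye before correction.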

Fig. 4-21 illustrates the measured 22-GHz full-rate clock pattern, confirming the absence of deterministic jitter. The RMS jitter is measured at 440 fs.

Fig. 4-22 shows the 56-Gb/s PAM-4 eye diagram measured using a PRBS-31 pattern, before and after DCC/QEC. The quadrature clock phase error before DCC/QEC reduces the eye opening, which improves significantly after the DCC/QEC loop is applied.



Fig. 4-20. Measured 7-GHz quadrature clock patterns at 28 Gbaud for channels A to D.



Fig. 4-21. Measured 22-GHz full-rate clock pattern.



Fig. 4-22. Measured 56-Gb/s PAM-4 eye diagram with PRBS-31 pattern (a) before DCC/QEC and (b) after DCC/QEC.

The TX operates at VDD = 1.0 V, achieving an energy efficiency of 0.99 pJ/b per channel at 56 Gb/s.

Fig. 4-23 presents the measured 4x100-Gb/s PAM-4 eye diagram at VDD = 1.2 V. The corresponding energy efficiency is 1.25 pJ/b.



Fig. 4-23. Measured 4-channel 100-Gb/s PAM-4 eye diagram.



Fig. 4-24. Measured 128-Gb/s PAM-4 eye diagram at VDD = 1.4 V.

To evaluate the maximum achievable data rate of the proposed transmitter, an additional measurement is performed with an increased supply voltage. A 128-Gb/s PAM-4 eye diagram is successfully observed at a VDD of 1.4 V, with an energy efficiency of 2.4 pJ/b. The corresponding eye diagram is presented in Fig. 4-24.
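The reported energy efficiencies translate directly into power, since pJ/b multiplied by Gb/s gives mW. The Python calculation below applies this to the three operating points; it assumes each efficiency figure applies to a single lane at the stated data rate, and the dictionary keys are illustrative labels.

```python
def power_mw(pj_per_bit, gbps):
    """Power in mW from energy per bit (pJ/b) and data rate (Gb/s)."""
    return pj_per_bit * gbps


operating_points = {
    "56 Gb/s at 1.0 V": (0.99, 56),
    "100 Gb/s at 1.2 V": (1.25, 100),
    "128 Gb/s at 1.4 V": (2.4, 128),
}

for label, (eff, rate) in operating_points.items():
    print(f"{label}: {power_mw(eff, rate):.1f} mW per lane")
```

Under that assumption the lane power rises from about 55 mW at 56 Gb/s to 125 mW at 100 Gb/s and about 307 mW at 128 Gb/s, reflecting both the higher data rate and the higher supply voltage.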

Fig. 4-25 provides the micrograph of the fabricated chip, where the area per channel is 0.066 mm<sup>2</sup>. Fig. 4-26 illustrates the power consumption breakdown based on simulation results and measured power data. The clock distribution consumes the largest share of the power, while the MUX and pre-driver circuits also account for a significant portion of the total.



Fig. 4-25. Micrograph of the TX.



Fig. 4-26. Power breakdown per channel at 100-Gb/s.

Table 4-1 compares the proposed TX with state-of-the-art designs that incorporate DCC/QEC. Although a direct numerical comparison of DCC/QEC performance is difficult, the proposed TX is the only PAM-4 design in the table whose DCC/QEC covers the data path as well as the clock path, and the only one that does so in a single-ended driver.

Table 4-1
Performance Summary Table of the TX with DCC/QEC

|                                 | [5]                      | [25]                 | [22]                 | [3]                  | [26]                 | This Work                 |        |
|---------------------------------|--------------------------|----------------------|----------------------|----------------------|----------------------|---------------------------|--------|
| Technology                      | 65nm                     | 28nm                 | 40nm                 | 10nm                 | 5nm                  | 28nm                      |        |
| Signaling                       | NRZ                      | PAM-4                | PAM-4                | PAM-4/6              | PAM-4                | PAM-4                     |        |
| Driver type                     | Differential<br>VM       | Differential<br>CML  | Differential<br>VM   | Differential<br>CML  | Differential<br>VM   | Single-ended VM           |        |
| Data rate per pin<br>(Gb/s/pin) | 8                        | 25                   | 56                   | 112                  | 58                   | 56                        | 100    |
| Output swing                    | 0.3Vppd                  | 0.4Vppd              | 1.0Vppd              | 1.0Vppd              | 0.9Vppd              | 0.5Vpp                    | 0.6Vpp |
| Phase error detection method    | Clock pattern            | Clock path detection | Clock path detection | Clock path detection | Clock path detection | Pre-Coded-Data<br>Pattern |        |
| DCC/QEC<br>Coverage             | Clock path<br>+Data path | Clock path           | Clock path           | Clock path           | Clock path           | Clock path<br>+Data path  |        |
| RLM                             | -                        | 0.97                 | 0.98                 | 0.99                 | 0.98                 | 0.96                      |        |
| Energy efficiency (pJ/bit)      | 1.1                      | 2.87                 | 3.89                 | 1.88                 | 0.9                  | 0.99                      | 1.25   |
| Area (mm<sup>2</sup>)           | -                        | 0.21                 | 0.56*                | 0.088*               | 0.082*               | 0.066**                   |        |

\* PLL included. \*\* 4-channel TX total area.

## 4.5 CONCLUSION

In this TX design, we implement a single-ended PAM-4 voltage-mode TX, achieving an energy efficiency of 1.25 pJ/b at 4x100-Gb/s and 0.99 pJ/b at 56-Gb/s. The proposed DCC/QEC technique utilizes pre-coded-data patterns to detect duty cycle errors and quadrature phase errors at the output port. This approach enables the correction of clock distortions not only in the clock path but also in the data path, addressing clock distortion that may arise throughout the transmission chain.

Unlike conventional methods that rely on internal clock nodes, this technique extends clock error detection to the output port, making it applicable to both differential and single-ended drivers. Furthermore, since the proposed DCC/QEC method supports multi-channel detection, it is particularly beneficial for applications that require multiple channels and extensive clock distribution networks, such as DRAM interfaces.

## **CHAPTER 5 Conclusion**

This dissertation proposes two high-speed PAM-4 TX architectures to enhance data transmission rate and efficiency in memory interfaces. To address the increasing memory bottleneck caused by the growing performance gap between processors and memory, this study analyzes key challenges in single-ended low-voltage signaling and high-speed clocking architectures, presenting effective solutions.

First, in an LVSTL-based PAM-4 TX, the limited voltage margin makes it challenging to maintain sufficient SNR. Conventional de-emphasis techniques mitigate ISI but excessively reduce signal amplitude, degrading SNR. To overcome this, a PWPE-based TX is proposed, which incorporates an auxiliary driver that supplies additional current only during transient signal transitions. This enables high-frequency equalization without DC amplitude loss. Furthermore, a 4-tap reconfigurable FFE scheme is introduced for precise equalization control. Fabricated in a 28-nm CMOS process, the proposed TX achieves an eye height of 27 mV, an eye width of 0.16 UI, and an RLM of 0.99 at 80 Gb/s, with an energy efficiency of 3.06 pJ/b and an area of 0.045 mm<sup>2</sup>.

Second, quarter-rate clocking in high-speed TXs introduces quadrature-phase errors, which do not occur in half-rate clocking. Conventional correction methods rely on intermediate clock nodes or differential outputs, which are ineffective in single-ended drivers. To address this, a pre-coded-data pattern-based DCC and QEC method is proposed. Unlike conventional methods, this approach detects and corrects clock phase errors directly at the driver output, compensating for residual phase errors throughout the entire clock path and data path. The proposed 4-channel POD PAM-4 TX, implemented in 28-nm CMOS, achieves 1.25 pJ/b energy efficiency at 100-Gb/s and 0.99 pJ/b at 56-Gb/s, with a per-channel footprint of 0.066 mm<sup>2</sup>.

In conclusion, this study addresses two key challenges in high-speed memory interfaces and validates the effectiveness of the proposed solutions through experimental results. The PWPE-based TX improves SNR in LVSTL environments, while the data pattern-based DCC/QEC technique effectively corrects clock phase errors in single-ended drivers. These contributions provide design guidelines for next-generation high-speed DRAM interfaces, with improved efficiency and reliability. The proposed techniques are also expected to be applicable to other high-speed data transmission interfaces.

# **Bibliography**

- [1] IEEE, "ISSCC 2024 press kit," *IEEE International Solid-State Circuits Conference* (ISSCC), San Francisco, CA, USA, Feb. 2024. [Online]. Available: https://isscc.org/press-kit/
- [2] A. Gholami, Z. Yao, S. Kim, C. Hooper, M. W. Mahoney, and K. Keutzer, "AI and Memory Wall," *IEEE Micro*, vol. 44, no. 3, pp. 33–39, May–Jun. 2024, doi: 10.1109/MM.2024.3373763.
- [3] J. Kim *et al.*, "A 224-Gb/s DAC-Based PAM-4 Quarter-Rate Transmitter with 8-Tap FFE in 10-nm FinFET," *IEEE J Solid-State Circuits*, vol. 57, no. 1, pp. 6–20, 2022, doi: 10.1109/JSSC.2021.3108969.
- [4] Y. Shin, Y. Jo, J. Kim, J. Lee, J. Kim, and J. Choi, "28.5 A 900μW, 1-4GHz Input-Jitter-Filtering Digital-PLL-Based 25%-Duty-Cycle Quadrature-Clock Generator for Ultra-Low-Power Clock Distribution in High-Speed DRAM Interfaces," in *Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC)*, San Francisco, CA, USA, Feb. 2023, pp. 408–410, doi: 10.1109/ISSCC42615.2023.10067283.
- [5] Y. H. Song, H. W. Yang, H. Li, P. Y. Chiang, and S. Palermo, "An 8-16 Gb/s, 0.65-1.05 pJ/b, Voltage-mode transmitter with analog impedance modulation equalization and sub-3 ns power-state transitioning," *IEEE J Solid-State Circuits*, vol. 49, no. 11, pp. 2631–2643, 2014, doi: 10.1109/JSSC.2014.2353795.
- [6] T. M. Hollis et al., "An 8-Gb GDDR6X DRAM Achieving 22 Gb/s/pin with Single-Ended PAM-4 Signaling," *IEEE J Solid-State Circuits*, vol. 57, no. 1, pp. 224–235, 2022, doi: 10.1109/JSSC.2021.3104093.
- [7] JEDEC Solid State Technology Association, *JESD239: Graphics Double Data Rate (GDDR7) SGRAM*, Mar. 2024. [Online]. Available: https://www.jedec.org/standards-documents/docs/jesd239
- [8] IEEE Standard for Ethernet, IEEE Std 802.3<sup>TM</sup>-2018 (Revision of IEEE Std 802.3-2015), Jun. 2018. [Online]. Available: https://standards.ieee.org/standard/802\_3-2018.html
- [9] X. Zheng et al., "A 50-112-Gb/s PAM-4 transmitter with a fractional-spaced FFE in 65-nm CMOS," IEEE J Solid-State Circuits, vol. 55, no. 7, pp. 1864–1876, Jul. 2020, doi: 10.1109/JSSC.2020.2987712.
- [10] C. Cai et al., "A 1.4-Vppd 64-Gb/s PAM-4 Transmitter with 4-Tap Hybrid FFE Employing Fractionally-Spaced Pre-Emphasis and Baud-Spaced De-Emphasis in 28-nm CMOS," in Proc. Eur. Solid-State Circuits Conf. (ESSCIRC), Grenoble, France, Sep. 2021, pp. 527–530, doi: 10.1109/ESSCIRC53450.2021.9567818.
- [11] M. S. Chen and C. K. K. Yang, "A low-power highly multiplexed parallel PRBS generator," in *Proc. IEEE Custom Integrated Circuits Conf. (CICC)*, San Jose, CA, USA, Sep. 2012, pp. 1–4, doi: 10.1109/CICC.2012.6330664.
- [12] T. O. Dickson, H. A. Ainspan, and M. Meghelli, "A 1.8pJ/b 56Gb/s PAM-4 transmitter with fractionally spaced FFE in 14nm CMOS," in *Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC)*, San Francisco, CA, USA, Feb. 2017, pp. 118–119, doi: 10.1109/ISSCC.2017.7870289.
- [13] W. Bae, H. Ju, K. Park, J. Han, and D. K. Jeong, "A Supply-Scalable-Serializing Transmitter with Controllable Output Swing and Equalization for Next-Generation Standards," *IEEE Transactions on Industrial Electronics*, vol. 65, no. 7, pp. 5979– 5989, 2018, doi: 10.1109/TIE.2017.2779439.
- [14] J. Kim *et al.*, "A 224-Gb/s DAC-Based PAM-4 Quarter-Rate Transmitter with 8-Tap FFE in 10-nm FinFET," *IEEE J Solid-State Circuits*, vol. 57, no. 1, pp. 6–20, Jan. 2022, doi: 10.1109/JSSC.2021.3108969.
- [15] Z. Wang et al., "An Output Bandwidth Optimized 200-Gb/s PAM-4 100-Gb/s NRZ Transmitter with 5-Tap FFE in 28-nm CMOS," *IEEE J Solid-State Circuits*, vol. 57, no. 1, pp. 21–31, Jan. 2022, doi: 10.1109/JSSC.2021.3109562.
- [16] Z. Toprak-Deniz et al., "A 128-Gb/s 1.3-pJ/b PAM-4 Transmitter with Reconfigurable 3-Tap FFE in 14-nm CMOS," *IEEE J Solid-State Circuits*, vol. 55, no. 1, pp. 19–26, Jan. 2020, doi: 10.1109/JSSC.2019.2939081.
- [17] J. Q. Wang et al., "7.1 A 2.69pJ/b 212Gb/s DSP-Based PAM-4 Transceiver for Optical Direct-Detect Application in 5nm FinFET," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), San Francisco, CA, USA, Feb. 2024, pp. 124–126, doi: 10.1109/ISSCC49657.2024.10454275.
- [18] M. Cusmai et al., "7.2 A 224Gb/s sub pJ/b PAM-4 and PAM-6 DAC-Based Transmitter in 3nm FinFET," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), San Francisco, CA, USA, Feb. 2024, pp. 126–128, doi: 10.1109/ISSCC49657.2024.10454558.
- [19] D. Pfaff et al., "7.3 A 224Gb/s 3pJ/b 40dB Insertion Loss Transceiver in 3nm FinFET CMOS," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), San Francisco, CA, USA, Feb. 2024, pp. 128–130, doi: 10.1109/ISSCC49657.2024.10454537.
- [20] K. H. Kim, P. W. Coteus, D. Dreps, S. Kim, S. V. Rylov, and D. J. Friedman, "A 2.6mW 370MHz-to-2.5GHz open-loop quadrature clock generator," in *Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC)*, San Francisco, CA, USA, Feb. 2008, pp. 458–459, doi: 10.1109/ISSCC.2008.4523255.
- [21] Y. U. Jeong, H. Park, C. Hyun, J. H. Chae, S. H. Jeong, and S. Kim, "A 0.64-pJ/Bit 28-Gb/s/Pin High-Linearity Single-Ended PAM-4 Transmitter with an Impedance-Matched Driver and Three-Point ZQ Calibration for Memory Interface," *IEEE J Solid-State Circuits*, vol. 56, no. 4, pp. 1278–1287, Apr. 2021, doi: 10.1109/JSSC.2020.3042240.
- [22] P. J. Peng, Y. T. Chen, S. T. Lai, and H. E. Huang, "A 112-Gb/s PAM-4 Voltage-Mode Transmitter with Four-Tap Two-Step FFE and Automatic Phase Alignment Techniques in 40-nm CMOS," *IEEE J Solid-State Circuits*, vol. 56, no. 7, pp. 2123–2131, Jul. 2021, doi: 10.1109/JSSC.2020.3038818.
- [23] J. H. Park et al., "A 32Gb/s/pin 0.51 pJ/b Single-Ended Resistor-less Impedance-Matched Transmitter with a T-Coil-Based Edge-Boosting Equalizer in 40nm CMOS," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), San Francisco, CA, USA, Feb. 2023, pp. 410–412, doi: 10.1109/ISSCC42615.2023.10067552.
- [24] J. Kim *et al.*, "A 60-Gb/s/pin single-ended PAM-4 transmitter with timing skew training and low power data encoding in mimicked 10nm class DRAM process," in *Proc. IEEE Custom Integrated Circuits Conf. (CICC)*, Newport Beach, CA, USA, Apr. 2022, pp. 1–4, doi: 10.1109/CICC53496.2022.9772814.
- [25] Y. T. Lin, T. W. Xu, and W. Z. Chen, "A 50 Gb/s PAM-4 Transmitter with Feedforward Equalizer and Background Phase Error Calibration," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 68, no. 8, pp. 2820–2824, 2021, doi: 10.1109/TCSII.2021.3068457.
- [26] Y. Perelman *et al.*, "A 116-Gb/s PAM4 0.9-pJ/b Transmitter With Eight-Tap FFE in 5-nm FinFET," *IEEE J Solid-State Circuits*, vol. 59, no. 7, pp. 2260–2271, 2024, doi: 10.1109/JSSC.2024.3351372.

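### Abstract in Korean (English Translation)

# Design of 4x100Gb/s Single-Ended PAM-4 Voltage-Mode Transmitter for Memory Interface

With the rapid advancement of AI, cloud services, and machine learning technologies, demand for high-performance computing (HPC) is growing quickly, and the performance requirements for DRAM, the main memory, are rising accordingly. However, the performance gap between processors and memory widens every year, and as a result memory acts as the bottleneck of overall system performance. To address this problem and extend the bandwidth of memory interfaces, this dissertation proposes two single-ended transmitter (TX) architectures based on four-level pulse amplitude modulation (PAM-4).

Because PAM-4 reduces the voltage margin to one third of that of NRZ signaling, it is difficult to secure a sufficient signal-to-noise ratio (SNR) in a low-voltage swing termination logic (LVSTL) environment. In addition, as the data rate increases, inter-symbol interference (ISI) in the channel also increases. De-emphasis can compensate for ISI and improve SNR, but it simultaneously reduces the signal amplitude. In LVSTL structures in particular, the limited swing means that the DC gain loss caused by de-emphasis can work against securing SNR.

The first transmitter, designed to address this, is an LVSTL-based low-swing single-ended structure that combines a 4-tap reconfigurable feed-forward equalizer (FFE) with pulse width pre-emphasis (PWPE) to compensate for ISI caused by channel loss and to secure SNR. By minimizing the use of de-emphasis coefficients and compensating the remaining ISI through PWPE, swing degradation is suppressed and transmission reliability is secured. The transmitter was fabricated in a 28-nm CMOS process and achieves an eye height of 27 mV, an eye width of 0.16 UI, an RLM of 0.99, an energy efficiency of 3.06 pJ/b, and a chip area of 0.045 mm² at 80 Gb/s.

Building on this foundation, and aiming at structural simplification and performance improvement, a second transmitter based on pseudo open drain (POD) was developed. The POD structure allows a larger signal swing, so that a simple 2-tap de-emphasis alone provides sufficient ISI compensation, and it reduces the load capacitance at the output stage to enable high-speed operation. In addition, to resolve the duty cycle error and quadrature phase error that arise in a quarter-rate clocking structure, a new automatic phase correction technique is proposed that directly detects and corrects the errors at the driver output using pre-coded data patterns. The final transmitter has a 4-channel structure, was fabricated in a 28-nm CMOS process, and achieves an energy efficiency of 1.25 pJ/b at 4×100 Gb/s and 0.99 pJ/b at 4×56 Gb/s, with an area of 0.066 mm² per channel.

Keywords: feed-forward equalization, PAM-4 transmitter, voltage mode, single-ended, LVSTL, PWPE, duty cycle correction, quadrature phase error correction, POD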

## **List of Publications**

## **International Journal Papers**

- [1] <u>Jae-Koo Park</u>, Dae-Won Rho, Seung-Jae Yang, and Woo-Young Choi, "An 80-Gb/s/pin Single-Ended Voltage-Mode PAM-4 Transmitter With a Pulse Width Pre-Emphasis and a 4-Tap FFE in 28-nm CMOS," *IEEE Journal of Solid-State Circuits*, Vol. 60, No. 2, pp.519-527, Feb. 2025.
- [2] <u>Jae-Koo Park</u>, Min-Hyeok Seong, Kihun Kim, Jae-Ho Lee, and Woo-Young Choi, "A Quadrature Biasing Method Based on Slope Detection for Si Mach-Zehnder Modulators," *IEEE Photonics Journal*, Accepted (In press).
- [3] Dae-Won Rho, <u>Jae-Koo Park</u>, Yongjin Ji, and Woo-Young Choi, "A 32-Gb/s Si Micro-Ring Modulator Transmitter with an Integrated Code-Based Temperature Controller," *IEEE/OSA Journal of Lightwave Technology*, (under review).

## **International Conference Presentations**

- [1] <u>Jae-Koo Park</u>, Dae-Won Rho, and Woo-Young Choi, "An 88-Gb/s/pin Single-Ended PAM-4 Transmitter in 28-nm CMOS with Duty-Cycle-Error and Quadrature-Phase-Error Correction Using Pre-Coded Data Patterns," *in Proc. IEEE Int. New Circuits and Systems Conference (NEWCAS)*, Paris, France, 22-25 Jun. 2025.
- [2] Dae-Won Rho, <u>Jae-Koo Park</u>, Yongjin Ji, Seung-Jae Yang, and Woo-Young Choi, "A 4λ× 50-Gb/s Si Photonic WDM Transmitter with Code-Based Wavelength Calibration and Locking," *Optical Fiber Communication Conference (OFC) 2025*, San Francisco, USA, 30 Mar.–3 Apr. 2025.
- [3] Woo-Young Choi, Dae-Won Rho, <u>Jae-Koo Park</u>, Seung-Jae Yang, Hae-Ho Lee, and Yongjin Ji, "Invited paper: Si Photonic Ring-Resonator-Based WDM Transceivers," *Proceedings of the 30th Asia and South Pacific Design Automation Conference (ASP-DAC)*, Tokyo, Japan, 20-23 Jan. 2025.
- [4] Dae-Won Rho, <u>Jae-Koo Park</u>, Seung Jae Yang, and Woo-Young Choi, "A 80Gb/s/pin Single-Ended PAM-4 Transmitter With an Edge Boosting Auxiliary Driver and a 4-Tap FFE in 28-nm CMOS," 2023 IEEE Asian Solid-State Circuits Conference (A-SSCC), Hainan, China, 5-8 Nov. 2023.
- [5] Dae-Won Rho, <u>Jae-Koo Park</u>, Seung Jae Yang, and Woo-Young Choi, "A 40Gb/s/pin Single Ended Transmitter with Output pad Network for Memory Interface Application in 28-nm CMOS," *International SoC Design Conference Chip Design Contest (ISOCC-CDC)*, Jeju, Korea, 25-28 Oct. 2023.

#### **Patents**

- [1] <u>Jae-Koo Park</u> and Woo-Young Choi, "전송 회로를 포함하는 반도체 장치 및 이의 동작 방법" [Semiconductor device including a transmission circuit and operating method thereof], Korea Patent (Pending), Application No. 10-2025-0050423. Apr. 17, 2025.
- [2] <u>Jae-Koo Park</u>, Dae-Won Rho, and Woo-Young Choi, "DATA TRANSMISSION CIRCUIT, SYSTEM INCLUDING THE SAME, AND DATA TRANSMISSION METHOD," USA Patent (Pending), Application No. 18/821,126. Aug. 30, 2024.
- [3] <u>Jae-Koo Park</u>, Dae-Won Rho, and Woo-Young Choi, "광 변조기의 온도 제어 장치 및 이를 포함하는 광 링크 장치" [Temperature control apparatus for an optical modulator and optical link apparatus including the same], Korea Patent (Pending), Application No. 10-2024-0129429. Sep. 24, 2024.
- [4] <u>Jae-Koo Park</u>, Dae-Won Rho, and Woo-Young Choi, "데이터 송신 회로, 이를 포함하는 시스템 및 데이터 송신 방법" [Data transmission circuit, system including the same, and data transmission method], Korea Patent (Pending), Application No. 10-2023-0154569. Nov. 9, 2023.
- [5] <u>Jae-Koo Park</u>, Dae-Won Rho, and Woo-Young Choi, "전압 모드 송신기의 성능 개선을 위한 엣지 부스팅 보조 드라이버와 그 구조" [Edge-boosting auxiliary driver for improving the performance of a voltage-mode transmitter and its structure], Korea Patent (Pending), Application No. 10-2023-0117633. Sep. 5, 2023.