that was supposed to run at 600 Mbps but it didn't
To read out the data from the column-parallel ADCs quickly, I had to design a serial interface with multiple high-speed LVDS data links, which calls for a fast driver.
I wanted to avoid reading out the SRAM in parallel with the data conversion process so as to reduce the overall ADC noise as much as possible. To fit within the system's timing requirements, the data readout therefore had to be as fast as possible. I estimated that links of about 500-600 Mbps should suffice for reading out the 13 kbit on-chip SRAM via 16 serial LVDS pairs. It's worth mentioning that the design described here was put together in about a week, not counting some earlier architecture exploration. So, just as a side note, don't expect miracles in what follows; the Tx turned out to be just "good enough".
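The readout-time budget behind that estimate can be sketched in a few lines. This is only a rough illustration: it assumes "13 kbit" means 13,000 bits, that the bits are split evenly over the 16 links, and that there is no framing or training overhead. None of these details are stated above.

```python
# Rough readout-time estimate for the SRAM over parallel serial links.
# Assumptions (not from the post): 13 kbit = 13,000 bits, even split across
# links, no framing/training overhead.
SRAM_BITS = 13_000
NUM_LINKS = 16


def readout_time_s(bit_rate_bps, sram_bits=SRAM_BITS, links=NUM_LINKS):
    """Time to shift out the whole SRAM when every link runs at bit_rate_bps."""
    bits_per_link = sram_bits / links
    return bits_per_link / bit_rate_bps


for rate in (250e6, 500e6, 600e6):
    print(f"{rate / 1e6:.0f} Mbps per link -> {readout_time_s(rate) * 1e6:.2f} us")
```

Under these assumptions each link carries about 812 bits, so the whole readout takes on the order of one to a few microseconds depending on the link rate.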
Architecture
My initial strategy was to try to implement a textbook design, hence the classic four-switch dual current mirror driver with the typical CMFB scheme using a high-ohmic resistive divider. It turned out that, because of the 1.2 V supply voltage of this process (0.13 um), I could not bring the current sources into saturation with proper margins without extensive optimization, for which I had virtually no time. Thus, I ended up using the following mastermind scheme:
The transistors PM4 and PM6 act as switchable current sources and thus should theoretically improve the LVDS voltage swing headroom. These are switched using a passive switched capacitor scheme which uses the NFETs (NM5, NM6, NM7 and NM39) as well as two MIM capacitors. The top plates of the MIM caps are connected to vbias, while their bottom plates are driven by the complementary data line signals d and d_b. This forms the dual complementary switchable current sources, whose gate voltages span between vbias and VDD.
Switched current source
Theoretically this should all work very well, except that the Cgs of the current source dampens the gate switching. This forces the use of a large feedback (pump) capacitor, which in turn introduces even more parasitics and lowers the bandwidth. The theoretical maximum swing on the gate of PM4 and PM6 is determined by the damping factor:
$$\Delta V = \frac{C_{p}}{C_{p}+C_{fb}} V_{pwr}$$
Following that relation, I estimated that a cap of about 550 fF should suffice to keep the damping factor low. However, even assuming that one takes care of the parasitics and damping around PM4 and PM6, there is a secondary issue: the speed of the pump switches and the bias generation.
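To get a feel for the capacitive divider, here is a minimal sketch that evaluates the ratio $C_{p}/(C_{p}+C_{fb})$ for the 550 fF pump capacitor. The parasitic values swept for $C_{p}$ are hypothetical placeholders, since the actual Cgs is not given here. Reading one minus that ratio as the fraction of the drive that reaches the gate is an interpretation, not something stated above.

```python
# Capacitive divider between the pump cap C_fb and the gate parasitic C_p.
# C_FB is the value quoted in the text; the C_p sweep is purely hypothetical.
C_FB = 550e-15  # F


def damping_factor(c_p, c_fb):
    """Ratio C_p / (C_p + C_fb) from the relation above."""
    return c_p / (c_p + c_fb)


for c_p in (25e-15, 50e-15, 100e-15):  # hypothetical C_gs + wiring parasitics
    d = damping_factor(c_p, C_FB)
    print(f"C_p = {c_p * 1e15:.0f} fF: damping = {d:.3f}, remaining = {1 - d:.3f}")
```

The sweep mainly shows the trade-off described above: a larger pump capacitor lowers the damping factor, at the price of more bottom-plate parasitics to drive.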
Bias dynamics
Normally when designing a biasing network, one tries to keep the currents in the "bias" and "conditioning" branches low, as they are kind of "wasted" and do no other work than generating a bias voltage. In the current scheme, however, the bias voltage generation and distribution exhibits fast dynamics. It therefore either has to be buffered (without any offset, which involves more circuitry) or, more simply, the current in the bias branch itself has to be cranked up so it can charge the large feedback caps before the LVDS eye has settled. The good news is that after initial settling only a small charge is lost from the capacitor, used for charging and discharging the Cgs parasitics of PM4/6. Assuming that the parasitics steal about 10% of the charge on Cfb, we can check whether a current of about 128 uA should suffice to quickly replenish Cfb, and whether that Class A solution would be good enough for the purpose.
$$ C_{replenish} = 0.1 \times 500\,fF = 50\,fF $$
$$ \text{Absolute voltage when fully charged} = V_{dd} - V_{t} = 700\,mV $$
$$ \text{Deprived voltage is 10 %} \approx 70\,mV $$
$$ \text{Quiescent current needed to replenish the cap within 200 ps:} $$
$$ V = \frac{I \times t}{C} \Leftrightarrow I = \frac{V \times C}{t} = \frac{0.07\,[V] \times 500 \cdot 10^{-15}\,[F]}{200 \cdot 10^{-12}\,[s]} = 175\,[\mu A]$$
Strictly, these numbers give 175 uA rather than the roughly 125 uA I originally arrived at, but a current of that order at 1.2 V is not ridiculously high, so I let it be. Here's the switching response, which roughly matches expectations:
The bottom plot shows the gates of the current sources, the mid-plot shows the replenishing current, and the top plot shows the LVDS line voltage at 500 Mbps. Hmm, the settling doesn't look particularly impressive, but at this point I had to move on as the tapeout deadline was fast approaching.
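The replenish estimate from this section is easy to reproduce numerically. The sketch below uses only the numbers quoted above (500 fF, 700 mV, 10 % loss, 200 ps) and the relation I = V·C/t. With these inputs the required current evaluates to 175 uA. The second function turns the question around and asks how long a given bias current, here the 128 uA mentioned earlier, would need to restore the same charge.

```python
# Replenish-current estimate for the pump capacitor, using the values quoted
# in the text. Only the relation I = V * C / t is used.
C_FB = 500e-15      # F, feedback (pump) capacitor used in the estimate
V_FULL = 0.7        # V, Vdd - Vt when fully charged
LOSS = 0.10         # fraction of the charge assumed lost to parasitics
T_WINDOW = 200e-12  # s, time allowed for replenishing


def replenish_current(delta_v, c, t):
    """Constant current needed to restore delta_v on capacitor c within t."""
    return delta_v * c / t


def replenish_time(delta_v, c, i):
    """Time a constant current i needs to restore delta_v on capacitor c."""
    return delta_v * c / i


delta_v = LOSS * V_FULL
i_needed = replenish_current(delta_v, C_FB, T_WINDOW)
t_at_128ua = replenish_time(delta_v, C_FB, 128e-6)

print(f"current for a 200 ps window: {i_needed * 1e6:.0f} uA")
print(f"time needed at 128 uA: {t_at_128ua * 1e12:.0f} ps")
```

Both results treat the bias branch as an ideal constant current source, which a real Class A branch only approximates.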
Common-mode feedback
The CMFB is perhaps the most challenging part of this architecture because of the many unknown factors, such as the load capacitance and the additional parasitics via the compensation MIM cap Cc. Not knowing the exact pad + PCB track capacitance was another factor which could screw up the stability of the loop. While there are several possible CMFB loop stabilization approaches, I decided to use an indirect Miller compensation with a zero-cancellation resistor, which also limits the feedforward current. In that scheme, to a first-order approximation, there is a non-dominant pole at:
$$p_{nd} = \frac{g_{m8}}{C_{gs4} + C_{ds0}}$$
Also from $C_{c}$ and $R_{z}$:
$$p_{c} = \frac{-1}{R_{z}\,C_{gs4}C_{c}/(C_{gs4}+C_{c})}$$
And because of the nulling resistor you also get a cancellation zero:
$$z_{c} = \frac{-1}{R_{z}C_{c}}$$
which cancels the output pole and adds phase lead. Here is the response of the system obtained via AC SPICE simulations:
The plot at the bottom shows the loop's response without compensation. As you might note, the compensation does a pretty good job of keeping the system at 60 deg phase margin, without significant phase roll-off even beyond the 0 dB magnitude level, implying that it should hold up well across PVT corners. Which it does, except that this simulation does not include parasitics, which kind of ruin the PM of the system (read further).
While there are probably more efficient compensation approaches, such as a proper (buffered) indirect compensation with e.g. a source follower, which would cancel the feedforward current completely and allow the use of a much smaller compensation capacitor, these take more time to design. Hence the approach here: use a large direct Miller compensation and not care about area. As seen later in the text, this approach comes at a bandwidth cost due to the parasitics on the bottom plate of the MIM cap.
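To see how the first-order pole and zero expressions in this section relate to each other, they can be evaluated numerically. Every component value in the sketch below is a hypothetical placeholder, since no gm, Rz, Cc or Cgs values are given here. The only point is the relative placement: Rz sees the series combination of $C_{gs4}$ and $C_{c}$ for the pole and $C_{c}$ alone for the zero.

```python
import math

# All values below are hypothetical placeholders, NOT taken from the design.
GM8 = 1e-3      # S, transconductance of device 8
C_GS4 = 200e-15  # F
C_DS0 = 50e-15   # F
C_C = 2e-12      # F, compensation capacitor
R_Z = 2e3        # ohm, nulling resistor


def series(c1, c2):
    """Series combination of two capacitors."""
    return c1 * c2 / (c1 + c2)


def to_hz(omega):
    """Magnitude of an angular frequency (rad/s) expressed in Hz."""
    return abs(omega) / (2 * math.pi)


p_nd = GM8 / (C_GS4 + C_DS0)            # non-dominant pole
p_c = -1.0 / (R_Z * series(C_GS4, C_C))  # pole from R_z and the series caps
z_c = -1.0 / (R_Z * C_C)                 # cancellation zero

for name, w in (("p_nd", p_nd), ("p_c", p_c), ("z_c", z_c)):
    print(f"{name}: {to_hz(w) / 1e6:.1f} MHz")
```

Because the series combination is always smaller than $C_{c}$, the zero sits below $p_{c}$ in frequency regardless of the values chosen, which is what lets it contribute phase lead before that pole takes it back.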
The common-mode range is dictated by the Vds saturation margins of the switchable current source, as well as the input range of the CMFB error amplifier. For the current design it was estimated to span between 0.6 and 0.9 V. Most modern FPGA receivers can detect such LVDS common-mode voltages. The LVDS swing was tuneable between 50 mV and 300 mV peak-to-peak.
Load model
I've used the ESD structures in the pads in combination with 2 pF of (extra) internal capacitance, as well as about 6 pF for the PCB track and FPGA pads. As we shall see from the measurements, these values might have been a bit on the optimistic side. Here's the load model used for design purposes:
Doubling the far-end capacitance might have been a better idea.
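A crude way to see what doubling the far-end capacitance would do is a single-pole RC estimate. The sketch below is an assumption-heavy simplification, not the load model itself. It lumps the 2 pF and 6 pF into one capacitor per line, takes each line as seeing half of a 100 ohm differential termination, and ignores the ESD capacitance, the track as a transmission line, and the driver's output impedance.

```python
import math

# Lumped single-pole approximation of one LVDS line (assumed half-circuit).
R_HALF = 50.0   # ohm, half of a 100 ohm differential termination (assumption)
C_NEAR = 2e-12  # F, extra internal capacitance from the text
C_FAR = 6e-12   # F, PCB track + FPGA pads from the text


def rise_time_10_90(r_ohm, c_farad):
    """10-90 % rise time of a single-pole RC: t_r = ln(9) * R * C."""
    return math.log(9) * r_ohm * c_farad


def bit_period(rate_bps):
    """Duration of one bit at the given data rate."""
    return 1.0 / rate_bps


for c_far in (C_FAR, 2 * C_FAR):
    t_r = rise_time_10_90(R_HALF, C_NEAR + c_far)
    for rate in (250e6, 600e6):
        share = t_r / bit_period(rate)
        print(f"C_far = {c_far * 1e12:.0f} pF, {rate / 1e6:.0f} Mbps: "
              f"t_r = {t_r * 1e9:.2f} ns ({share:.0%} of a bit period)")
```

In this simplified picture the rise time scales linearly with the total capacitance, so any underestimate of the far-end load directly eats into the bit period at the higher rates.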
Measurements
So! It underperformed greatly, but luckily didn't fail! It was designed for 600 Mbps but achieved only 250 Mbps. So what happened? Here are a few theories:
— inefficient architecture + design: the Class A bias driver used by the switched current source "pump" capacitors is a rather inefficient approach. A better implementation would use a proper switched-cap buffer, which should significantly improve current source switching and hence settling.
— last minute change: Yes, these are always risky! Dummy fill was deliberately blocked inside the LVDS core because of worries about parasitic effects. However, Dongbu returned the layout with concerns about overetching of a few logic gates inside the Tx. Rushing to submit a new GDSII, I removed all dummy fill blockers under the feedback caps without running parasitic extraction. This added a bit of parasitic capacitance between the bottom plate of the pump capacitors and the substrate. Apparently this extra capacitance greatly reduced the bandwidth of the switched current sources. Oops!
As a result, due to asymmetric loading of the control lines, the common-mode of the P and N lines drifted, thus closing the eye of the data line. In principle this shouldn't have mattered as the LVDS Rx in the FPGA is AC-coupled anyway, but the FPGA deserialization block just couldn't lock the training pattern at the designed speeds...
— more parasitics: while I didn't really observe any CMFB instability, some common-mode ringing showed up at high speeds, which is probably the reason why the ISERDES failed to lock. The loop's PM was reduced by the extra parasitics on the Miller compensation:
— even more undervalued parasitics: to reduce the body effect on the switches and help improve substrate noise isolation I decided to put the N-FET switches under a local DNW and bridge the bulk-source. Doing this adds an extra reverse-biased junction at the tail bias node, which lies exactly within the feedback loop, reducing phase margin. I knew about that effect, so I decided to run some ballpark junction capacitance estimations by fetching some parameters from the SPICE models and using the typical square law model:
$$C_{tot} = \frac{A_{d} C_{j}}{(1+V/V_{o})^{mj}} + \frac{P_{d} C_{jsw}}{(1+V/V_{o})^{mjsw}}$$
where $A_{d}$ and $P_{d}$ are the junction area and perimeter, $V$ is the reverse bias, $V_{o}$ is the built-in potential, and $mj$ and $mjsw$ are process-specific grading coefficients which, along with $C_{j}$ and $C_{jsw}$, I fetched from the SPICE models. This led to the following result:
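As an illustration of how such an estimate is evaluated, here is a minimal sketch of the standard area-plus-sidewall junction capacitance expression. The parameter values are generic placeholders chosen for illustration only. They are not the values from the actual SPICE models and the output is not the result referred to above.

```python
# Reverse-biased junction capacitance: area term plus sidewall term.
# All parameter values are hypothetical placeholders, NOT model-card values.
CJ = 1e-3     # F/m^2, zero-bias area capacitance (assumed)
CJSW = 1e-10  # F/m, zero-bias sidewall capacitance (assumed)
V0 = 0.7      # V, built-in potential (assumed)
MJ = 0.4      # area grading coefficient (assumed)
MJSW = 0.3    # sidewall grading coefficient (assumed)


def junction_cap(area_m2, perim_m, v_rev):
    """Return (area term, sidewall term) in farads at reverse bias v_rev."""
    c_area = area_m2 * CJ / (1.0 + v_rev / V0) ** MJ
    c_side = perim_m * CJSW / (1.0 + v_rev / V0) ** MJSW
    return c_area, c_side


side = 10e-6  # m, a 10 x 10 um junction as in the text
c_area, c_side = junction_cap(side * side, 4 * side, v_rev=0.6)
print(f"area term: {c_area * 1e15:.1f} fF, sidewall term: {c_side * 1e15:.1f} fF")
print(f"total: {(c_area + c_side) * 1e15:.1f} fF")
```

A useful cross-check is that at zero bias the expression must collapse to $A_{d} C_{j} + P_{d} C_{jsw}$. If a hand calculation lands far below that bound for a modest reverse bias, a unit or exponent slip is a likely culprit.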
Computing 5 femtofarads for a 10 x 10 um diode? I must have lost my mind: this result is very, very wrong, and the actual value must be larger. At that point, however, the deadline was approaching so quickly that I simply had to keep ignoring things. Here are some waveforms run at 100 Mbps using a 100 ohm termination resistance built into the FPGA receiver:
Physical overview:
The layout is rather huge, as can be seen from the layout diagram. I removed all active circuitry under the pump and Miller caps so as to reduce parasitics. However, the last-minute dummy generation sadly filled all of the area below them.
Nevertheless, the driver does just fine up to 250 Mbps and allows me to run and evaluate the ADCs at full speed, although it would have been nicer not to have to convert and read out at the same time. I hope this mixture of thoughts helps someone who decides to try this architecture.