Optimizing FPGAs for Power: A Full-Frontal Attack

Posted by DesignSpark admin

Power has become a primary factor in the ever-important search for the “perfect” FPGA for a given design. Power management is critical in most applications. Some standards specify maximum power per card or per system. As such, designers must consider power much earlier in the design flow than ever before—often starting with the selection of an FPGA.

Reducing the power consumption of the FPGA simplifies board design: it lowers the demands on the supply rails, simplifies power supply design and thermal management, and eases the requirements on the power distribution planes. Low power also contributes to longer battery life and higher system reliability, since cooler-running systems last longer.

Power Challenges

With each generation of process technology, transistors are becoming smaller and smaller in accordance with Moore’s Law. This phenomenon has the unfortunate side effect of incurring more leakage within each transistor, which leads to higher static power consumption—that is, the amount of current an FPGA draws when not operating. Increased FPGA performance drives the clock rate higher, which leads to higher levels of dynamic power. Where static power is driven by transistor leakage current, dynamic power is based on the switching frequency in the programmable logic and I/Os. Exacerbating both types of power consumption, FPGAs are growing in capacity with each product generation. More logic means more leakage and more transistors operating at higher speeds per device.
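The relationship between the two kinds of power can be sketched with the usual first-order CMOS model, in which static power is leakage current times supply voltage and dynamic power is activity × capacitance × voltage² × frequency. This model is a textbook approximation, not something the article specifies, and every number below is hypothetical and chosen only to show the trend.

```python
def dynamic_power_w(activity, capacitance_f, vdd_v, clock_hz):
    """First-order CMOS switching power: activity * C * V^2 * f."""
    return activity * capacitance_f * vdd_v ** 2 * clock_hz


def total_power_w(leakage_a, vdd_v, activity, capacitance_f, clock_hz):
    """Static power (leakage current times supply) plus dynamic power."""
    static_w = leakage_a * vdd_v
    return static_w + dynamic_power_w(activity, capacitance_f, vdd_v, clock_hz)


# Hypothetical values, not taken from any data sheet.
base = total_power_w(leakage_a=0.5, vdd_v=1.0, activity=0.125,
                     capacitance_f=20e-9, clock_hz=200e6)
faster = total_power_w(leakage_a=0.5, vdd_v=1.0, activity=0.125,
                       capacitance_f=20e-9, clock_hz=400e6)
print(f"base: {base:.2f} W, doubled clock: {faster:.2f} W")
```

In this model, doubling the clock rate doubles only the dynamic term, while the static term is paid whether or not the logic is switching.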

Because of these issues, designers must be more aware of their power supply and thermal-management issues earlier in their design cycles. Slapping a heat sink over a device may not adequately resolve these issues. Instead, designers must look for opportunities to reduce the logic in the design.

Let’s take a look at some guidelines that will help you understand what type of action to take at various points in the design cycle to reduce the power consumption of an FPGA design. Clearly, having a thorough understanding of these issues early in the design process will yield the greatest reward.

Figure 1 illustrates different points in the design cycle, from FPGA selection through low-power design techniques.

7 Series Process Technology

During FPGA selection, carefully consider the process technology, which helps you identify the leakage and performance of the device. The Xilinx® 7 series FPGAs are based on the 28 HPL (28-nanometer High-Performance, Low-Power) process, covering the high-performance space while also enabling significant power reduction (see cover story, Xcell Journal Issue 76). Choosing devices built on the lower-leakage HPL process eliminates the need for complex and expensive static-power-management schemes in an FPGA design.

FPGAs built on the 28 HP process have no performance advantage over 7 series FPGAs, and some competing FPGAs carry the severe penalty of more than twice the static power, which makes leakage harder to reduce.

Figure 2 shows a holistic power reduction approach for the 7 series family, which has half the overall power consumption of prior-generation, 40-nm FPGA devices.

Designers can choose a larger FPGA for purposes of development and later migrate to a smaller one in their production line. Choosing a smaller FPGA will not only bring down the cost but will also reduce the power consumption of the system.

All 7 series FPGAs are based on a unified architecture. This unified architecture enables easy upward and downward migration across different FPGA devices and families in the Xilinx 7 series portfolio. Refer to the “7 Series Migration Guide” (UG429) when considering design migration from Virtex®-6 or Spartan®-6 devices to or between 7 series device families.

Xilinx Stacked-Silicon Interconnect Technology

For larger systems, designers often choose multiple FPGAs. This type of architecture frequently requires the delicate and difficult task of moving data at rather high speeds among the various FPGAs. Choosing the larger 7 series FPGAs, such as the XC7V1500T and XC7V2000T devices, which are created using Xilinx stacked-silicon interconnect technology, can circumvent this issue. Simply stated, this SSI technology uses multiple dice residing on a silicon interposer that provides tens of thousands of connections among them, to create a single large device. One benefit of stacked-silicon interconnect technology is the reduction in maximum static power compared with a similar-size device on a standard monolithic die.

Stacked-silicon interconnect technology also provides a significant reduction in I/O interconnect power. Compared with having multiple FPGAs on a board, SSI technology boasts a reduction of I/O interconnect power by 100x (bandwidth/W) over an equivalent interface built with I/Os and transceivers. This dramatic reduction is due to all connections being built on-chip rather than having the power required to drive the signals off chip, enabling incredibly high speed and low power.

Enhanced Options for Voltage Scaling

Xilinx 7 series FPGAs offer significant voltage-scaling options.

The 7 series FPGAs offer an extended (E) temperature range (0–100°C) option for both -3 and -2L devices. Due to the headroom in the 28 HPL process, the -2LE devices can operate at either 1.0V or 0.9V. These devices are referred to as -2L (1.0V) and -2L (0.9V). The -2L devices operating at 1.0V have the same speed-grade performance as the -2I and -2C devices, but with much lower static power. The -2L devices operating at 0.9V have performance similar to the -1I and -1C devices, but with even lower static power and lower dynamic power.

At 0.9V, the voltage drop alone in these devices offers a static power reduction of around 30 percent. The voltage drop would also reduce performance, but Xilinx screens these -2L (0.9V) devices for speed and a tighter leakage specification. This screening method yields a 55 percent reduction in power at worst-case process compared with the standard-speed-grade devices.

By choosing a -2L family device, you can obtain additional savings on dynamic power. Because dynamic power is proportional to the square of VCCINT (VCCINT²), a 10 percent reduction in VCCINT provides a reduction of roughly 20 percent in dynamic power.
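The arithmetic behind that figure is easy to check. The sketch below assumes only that dynamic power scales with the square of the core supply voltage; the function name is illustrative.

```python
def dynamic_power_saving(v_old, v_new):
    """Fractional dynamic power saving when the core supply drops from
    v_old to v_new, assuming dynamic power scales with voltage squared."""
    return 1.0 - (v_new / v_old) ** 2


saving = dynamic_power_saving(1.0, 0.9)
print(f"{saving:.0%}")  # prints 19%
```

A drop from 1.0V to 0.9V leaves 0.9² = 0.81 of the original dynamic power, a saving of 19 percent, which is commonly rounded to 20 percent.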

Power Estimation Tools

There is an extensive choice of tools available in the market today to help designers evaluate the thermal and supply requirements of an FPGA design throughout the development cycle.

Figure 3 shows the Xilinx tools available at each stage of the FPGA development cycle.

At the outset of the design cycle, the XPower Estimator (XPE) spreadsheet provides an early estimation of power consumption even before the predesign and preimplementation phases of a project. XPE assists with architecture evaluation and device selection and helps in selecting the appropriate power supply and thermal-management components that may be required for the application.

The PlanAhead™ software estimates the design power distribution at the RTL level. Designers can specify the device operating environment, the I/O properties and the default activity rates for the design using constraints or by using the GUI. The PlanAhead software then reads the HDL code to estimate the design resources needed and reports the estimated power from a statistical analysis of the activity of each resource. With its access to more-detailed information about the design intent, the RTL power estimator should be more accurate than the XPower Estimator spreadsheet and less accurate than the post-place-and-route analysis done with the XPower Analyzer.

XPower Analyzer (XPA) is a tool dedicated to the power analysis of placed-and-routed designs. It provides a comprehensive GUI that allows a detailed analysis of the power consumed as well as thermal information for the specified operating conditions.

You can toggle between two different views to identify the power consumed either by type of blocks (clock trees, logic, signals, I/Os or hard IP such as block RAMs or DSP blocks) or over the design hierarchy. Both of these views enable you to perform a detailed power analysis. They provide a very efficient method for locating the blocks or parts of the design that are the hungriest in terms of power, thereby identifying places to begin power optimization efforts.

Software Power Optimization

You can optimize designs using block RAMs for power by minimizing the number of simultaneously active block RAM ports. This optimization, enabled with the -power yes option in XST, modifies the decomposition of RAM or ROM descriptions that span multiple block RAMs. The optimization adjusts address lines as well as port-enable and write-enable control signals to minimize the number of active block RAM ports at each clock cycle, while ensuring that your design meets timing constraints.
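To see why the decomposition matters, consider a hypothetical 16K x 16 memory built from sixteen block RAMs. The sketch below is not XST's algorithm, which the article does not describe; it only contrasts two ways of splitting such a memory and counts how many blocks must be enabled for a single access. The memory size, the block shapes and the function names are all assumptions made for illustration.

```python
def block_enables(address, words_per_block, num_blocks):
    """Depth-wise decomposition: enable only the block that holds `address`."""
    selected = address // words_per_block
    if not 0 <= selected < num_blocks:
        raise ValueError("address outside the memory")
    return [index == selected for index in range(num_blocks)]


def active_blocks(decomposition, address):
    """Number of block RAMs that must be enabled to serve one access."""
    if decomposition == "width":
        # 16 blocks of 16K x 1: every block supplies one bit of every word.
        return 16
    if decomposition == "depth":
        # 16 blocks of 1K x 16: the upper address bits pick a single block.
        return sum(block_enables(address, words_per_block=1024, num_blocks=16))
    raise ValueError("unknown decomposition")


print(active_blocks("width", 5000), active_blocks("depth", 5000))
```

The depth-wise split needs an address decoder in front of the enables and a multiplexer on the outputs, which is why the article notes that the optimization still has to meet timing constraints.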

Next, force the most power-efficient mapping of block RAMs regardless of the impact on performance. Use the block_power2 option to the ram_style constraint when you know that the timing paths related to this memory are not critical. Savings range from 15 percent to 75 percent.

Also, use the Area Optimization mode in XST. This option minimizes the number of resources your design will use. Note that when optimizing for area, performance may suffer.

An additional tactic is to enable activity-aware optimizations, another way of saying intelligent gating. These algorithms analyze the logic equations to detect, for each clock cycle, the sourcing registers that do not contribute to the result. The software then uses the abundant clock-enable (CE) resources available in the FPGA logic to create fine-grained gating signals that neutralize useless switching activity. You control this intelligent clock and data gating with the map -power high option. A total core dynamic power reduction in excess of 15 percent is possible, and in most cases the additional gating logic does not affect performance.
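The effect of clock-enable gating on switching activity can be illustrated with a small behavioral model. This is only a sketch of the principle, not the algorithm the tools use: it counts the bit flips at the output of one register, first when the register captures its input on every cycle and then when a clock enable lets it update only on cycles where its value is actually consumed. The data pattern and the enable pattern are hypothetical.

```python
def count_toggles(samples, enables=None, width=8):
    """Count bit flips at a register output over a sequence of clock cycles.

    samples: value on the D input at each cycle.
    enables: optional per-cycle clock-enable flags; None means always enabled.
    """
    mask = (1 << width) - 1
    q = 0          # register contents after configuration
    toggles = 0
    for cycle, d in enumerate(samples):
        if enables is None or enables[cycle]:
            next_q = d & mask
            toggles += bin(q ^ next_q).count("1")
            q = next_q
    return toggles


data = [0xFF, 0x00, 0xFF, 0x00]
used = [True, False, False, True]   # downstream logic needs cycles 0 and 3 only

print(count_toggles(data))          # ungated register
print(count_toggles(data, used))    # register gated by its clock enable
```

With this pattern the gated register flips half as many bits, and everything driven by that register sees correspondingly less activity.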

Another way to design for power is to use capacitance-aware optimizations. There are two main techniques:

• Group clock loads: This process reorganizes the placement of synchronous elements (such as flip-flops or DSP blocks) to minimize the reach of each clock net. When you place clock loads along a minimum number of horizontal or vertical clock spines, the software can disable unused branches in the clock region. This reduces both the clock resources and buffering requirements, which saves core dynamic power. This process is controlled by the map -power on option.

• Group data loads: This algorithm minimizes the total wire length in your design while ensuring that you meet performance requirements. Grouping data loads saves power because dynamic power increases with the fanout and the type and length of routing structures you use. The grouping algorithm, likewise enabled with the map -power on option, achieves power reduction by placing related logic closer together.

The ISE® Design Suite features predefined goals and strategies that are already tuned to enable power optimization at synthesis, map and place-and-route levels. This approach may be a good alternative to using nondefault constraint settings of all synthesis constraints. However, running this option can add some delay time on various paths.

Finally, Xilinx implementation tools automatically shut off unused transceivers, phase-locked loops, digital clock managers and I/Os. In 7 series devices, Xilinx has also added power gating of unused block RAM. Leakage in block RAM occurs only in blocks that you are using for a particular design, and not for all block RAMs on the device. Power is routed in the device to the instantiated block RAM only, and disabled for the unused block RAMs.

Low-Power Design Techniques

There are many tips and techniques that designers can explore to lower the power of an FPGA design. One of the first options is to use dedicated hardware blocks rather than implementing the same logic in CLBs. To reduce power, you must look for opportunities to reduce the logic in the design. This will allow you to use as small a device as possible and reduce static power consumption.

Using dedicated hard-IP blocks is one of the most important ways to lower both static and dynamic power, as well as to easily meet timing. Hard IP lowers static power because the total transistor count is less than an equivalent component with CLB logic.

As a general rule, you should attempt to infer resources as much as possible. You can steer the inferred resources individually, or as a group, toward the FPGA fabric or silicon resource via attributes in the code or within a constraint file. You can also leverage the Xilinx CORE Generator™ tool to customize the dedicated hardware for instantiating a specific resource.

Moreover, you can employ unused hard IP cleverly for other tasks that may not be obvious. DSP48 slices serve many logic functions such as multipliers, adders/accumulators, wide logic comparators, shifters, pattern matchers and counters. You can use block RAMs as state machines, math functions, ROMs and wide logic lookup tables (LUTs).

Best Use of Control Signals

The use of control signals (signals that control synchronous elements such as clock, set, reset and clock enable) can affect device density, utilization and performance. Following a few guidelines will help you keep the power impact to a minimum.

First, avoid using both a set and a reset on a register or latch. The flip-flops in Xilinx FPGAs can support both asynchronous and synchronous reset and set controls. However, the underlying flip-flop can natively implement only one set, reset, preset or clear at a time. Coding for more than one of these functions in the RTL code will result in the implementation of one condition using the SR port of the flip-flop and the other conditions in fabric logic, thus using more FPGA resources.

If one of the conditions is synchronous and the other is asynchronous, the asynchronous condition will be the one that gets implemented using the SR port and the synchronous condition in fabric logic. In general, it is best to avoid more than one set/reset/preset/clear condition. Furthermore, only one attribute for each group of four flip-flops in a slice determines if the SR ports of flip-flops are synchronous or asynchronous.

In addition, use active-high control signals. The control ports on registers are active high. Using active-low resets in an FPGA design is not recommended. Active-low signals use more lookup tables because they require inversion before they can directly drive the control port of a register. This inversion must be done with an LUT and thus takes up an LUT input.

Hence, active-low control signals may lead to longer runtimes and result in poor device utilization, which will affect timing and power.

Use active-high control signals wherever possible in the HDL code or instantiated components. When it’s impossible to control a control signal’s polarity within the design, you should invert the signal in the top-level hierarchy of the code. The I/O logic can absorb the inverter that’s inferred without using any additional FPGA logic or routing, thereby resulting in better utilization, performance and power.

Unnecessary Use of Sets or Resets

Unnecessary sets and resets in the code can prevent the inference of shift register LUTs (SRLs), LUT RAMs, block RAMs and other logic structures that might otherwise be inferred. Although designers may find it awkward, many circuits can be made to self-reset, or simply do not need a reset. For example, no reset is required when a circuit is only used for initialization of the register, because register initialization occurs automatically upon completion of configuration.

By reducing the use of unnecessary sets or resets, and with greater device utilization, designers can achieve better placement, improved performance and reduced power.

For more information on reset, please refer to http://issuu.com/xcelljournal/docs/xcell_journal_issue_76/44?viewMode=magazine&mode=embed.

Another area that demands attention if you are serious about lowering power is clock and block activity. You should take full advantage of the BUFGMUX, BUFGCE and BUFHCE clock buffers to gate an entire clock domain for power reduction. These buffers can pause the clock in an entire clock region. Likewise, for applications that only need to pause the clock on small areas of the design, use the clock-enable pin of the FPGA registers.

Designs that spread across multiple clock regions utilize more clocking resources and hence consume more power. Whenever possible, place any intermittently used logic in a single clock region (Figure 5). This helps to reduce power. While the tools will attempt this automatically, some designs may require manual effort to achieve this.

Figure 5 – Where possible, place any intermittently used logic in a single clock region.

Another important technique is to limit data motion (Figure 6). Instead of moving operands around the FPGA, move only the results. Using fewer and shorter buses leads to less capacitance, faster operation and less power consumption. Designers should also take care during floorplanning when assigning the pinout and placing the corresponding logic.

Figure 6 – Limit data motion; instead of moving operands around the FPGA, move only the results.

Partial Reconfiguration for Lower Static Power

One way to reduce static power is to simply use a smaller device. With partial reconfiguration, designers can essentially time-slice an FPGA and run parts of their design independently. The design then requires a much smaller device because not every part of the design is needed 100 percent of the time.

Partial reconfiguration has the potential to reduce dynamic power as well as static power. For example, many designs must run very fast, but that maximum performance might only be needed a small percentage of the time. To save power, designers can use partial reconfiguration to swap out a high-performance design with a low-power version of the same design—instead of designing for maximum performance 100 percent of the time. You can switch back to the high-performance design when the system needs it.
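A rough way to judge whether such a swap is worthwhile is to weight each version's power by the fraction of time it is loaded. The sketch below is a simple duty-cycle average under stated assumptions: the wattages are hypothetical, and the optional overhead term stands in for the average cost of the reconfiguration itself, which the article does not quantify.

```python
def average_power_w(high_perf_w, low_power_w, high_perf_fraction,
                    swap_overhead_w=0.0):
    """Time-weighted average power when two versions of a design share one region.

    high_perf_fraction: share of time (0.0 to 1.0) spent in the fast version.
    swap_overhead_w: hypothetical average power spent on reconfiguration.
    """
    if not 0.0 <= high_perf_fraction <= 1.0:
        raise ValueError("high_perf_fraction must be between 0 and 1")
    return (high_perf_fraction * high_perf_w
            + (1.0 - high_perf_fraction) * low_power_w
            + swap_overhead_w)


# Hypothetical example: peak performance is needed a quarter of the time.
always_fast = average_power_w(4.0, 1.0, 1.0)
swapped = average_power_w(4.0, 1.0, 0.25)
print(f"always fast: {always_fast:.2f} W, swapped: {swapped:.2f} W")
```

The less often the high-performance version is needed, the closer the average falls to the low-power figure, provided the swaps are infrequent enough that their overhead stays small.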

This principle can also apply to I/O standards, specifically when a design does not need a high-power interface all the time. LVDS is a high-power interface, regardless of activity, due to the high DC currents required to power it. Designers can use partial reconfiguration to change the I/O from LVDS to a low-power interface, such as LVCMOS, at times when the design does not need the highest performance, and then switch back to LVDS when the system requires high-speed transmissions.

Making the best use of timing constraints is also important in low-power design. If you are operating in a temperature-controlled environment, remember that you can derate the part in order to meet timing. Be certain to only constrain the part to the maximum specified clock rate. Indicating that a faster clock rate is to be used does not generate a better design! Typically, it will use more fabric resources due to reduced resource sharing, more logic/registers duplication, more routing and fewer inferences of FPGA dedicated features. All of these can significantly impact dynamic power.

I/O power has become a major contributor to total power. Some designs draw as much as 50 percent of the total power from I/Os, especially in memory-intensive systems.

Programmable slew rate and drive strength can lower the dynamic power of the I/O drivers. While many designers prefer fast differential I/O, not every interface requires it. There are standards such as HSLVDCI that can save considerable power in FPGA-to-FPGA communications and in lower-speed memory interfaces.

All Xilinx 7 series devices offer programmable slew rate and drive strength. Xilinx FPGAs sport digitally controlled impedance (DCI) technology, which can also be tristated. DCI eliminates termination power during memory write from the FPGA, so the device consumes termination power only during the read.

The 7 series devices incorporate a user-programmable referenced receiver power mode for HSTL and SSTL. You control these two programmable power modes on an I/O-by-I/O basis, which helps you reduce DC power by making trade-offs between power and performance.

Transceiver Power Picture

Xilinx has optimized the 7 series FPGA transceivers for high performance and low jitter. These transceivers offer several low-power operating features, giving designers fine-grained control over the trade-off between power and performance.

In the 7 series FPGAs, the shared LC phase-locked loop can save a lot of power. For four-lane designs with an identical line rate (XAUI, for example), you can use a quad PLL (instead of an individual-channel PLL) to save power. Similarly, in some cases, because a PLL can run at both higher and lower rates within the range, it is better to select a lower operating range so as to save power.

You can also enable individual TX/RXPOWERDOWN options. PLL power down can be enabled in the lowest-power mode (say, in a system D3 state, which is mostly used in PCIe® systems).

Each Stage of the Cycle

Understanding and implementing power-sensitive design techniques before you perform coding is the single largest mechanism for reducing system power. Using the various Xilinx tools at the appropriate stage of the design cycle will also help you meet power specifications, and provides the board designer with information on selecting the number, type and size of the necessary power supplies. Xilinx 7 series FPGAs provide unprecedented power efficiency through use of process technology and architectural design.

Many of the tips explained in this article are described in the FPGA Power Optimization training course. More information on Xilinx courses is available at www.xilinx.com/training.

by Chandra Sekar Balakrishnan

Solutions Development Engineer

Xilinx, Inc.

Chandra Sekar Balakrishnan is a solutions development engineer in the Xilinx Global Training Solutions team. Chandra Sekar develops content for Xilinx training courses. His areas of expertise are FPGA design and the embedded domain.

Prior to joining Xilinx in May 2010, he was a project leader at Acme Technologies (Chennai, India), where he worked for more than eight years on projects that implemented FPGA-based embedded systems for Japanese semiconductor organizations.

Reach him at cbalakr@xilinx.com.