Abstract
We evaluate and improve the performance of a hardware/software architecture designed to handle a wide range of image processing tasks. The system architecture is based on a Field Programmable Gate Array (FPGA) co-processor and a host computer. A LabVIEW application on the PC controls image acquisition and video capture from an industrial camera, and data are exchanged over a USB2.0 channel. The FPGA accelerator is based on an Altera Cyclone II chip and is designed as a system-on-a-programmable-chip (SOPC) with an embedded Nios II core. The SOPC integrates the CPU, on-chip and external memory, the communication channel and the image processing logic. Standard transfer protocols are used, and various frame sizes are accommodated by the hardware/software logic. A comparison with other solutions is given and a range of applications is discussed.
Keywords: Hardware/software co-design; Image processing; FPGA; Embedded processor
1. Introduction
The traditional hardware implementation of image processing uses DSPs or Application Specific Integrated Circuits (ASICs). However, the pursuit of higher speed and lower cost has shifted solutions toward Field Programmable Gate Arrays (FPGAs), whose inherent parallelism yields better performance. When an application requires real-time processing, such as video or television signal processing or robotic manipulation, the specifications are very strict and FPGAs meet them better. Computationally demanding functions such as filtering, motion estimation, two-dimensional Discrete Cosine Transforms (2D DCTs) and Fast Fourier Transforms (FFTs) are also better optimized on FPGAs. With more hardware multipliers, larger memory capacity and higher system integration, FPGAs easily outperform conventional DSPs. Combining computer-based imaging applications with FPGA-based parallel accelerators requires a software/hardware interface for high-speed transfers. The present system is a typical mixed hardware/software design: a host computer runs a LabVIEW imaging application equipped with a camera and frame grabber, while on the other side an Altera FPGA development board runs the image filters and other system components. Image data are transferred at high speed over USB2.0. The various hardware parts and the control logic on the FPGA board are interconnected by an embedded Nios II processor, with USB2.0 serving as the communication channel.
2. Design tools overview
Designing DSP systems on FPGAs often involves both high-level algorithm development tools, such as MATLAB, and hardware description languages. Third-party intellectual property (IP) cores implementing typical DSP functions or high-speed communication protocols can also be used. In our application we use model-based design tools such as The MathWorks Simulink to build the DSP design; the generated HDL code is then synthesized together with the other hardware design files in Quartus II.
SOPC Builder resides as a tool within the Quartus environment; its purpose is to integrate Nios II with external hardware logic and standard peripherals. SOPC Builder provides an interface fabric that interconnects Nios II with external memory, the filters, and the host computer.
3. Modeling and implementation of the filter design
The main goal of this work is to evaluate the performance of the host/co-processor architecture for image processing, including the performance of the embedded Nios II and of the USB2.0 channel between the host computer and the FPGA board. The capabilities of existing FPGA devices may limit the image processing tasks that can be implemented. For this purpose we built a typical image processing application targeting the FPGA co-processor, consisting of a noise filter and an edge detector. Noise reduction and edge detection are two elementary processes used in many machine vision applications, such as object recognition, medical imaging, lane detection in next-generation automotive technology, people tracking and control systems.
We model noise filtering and edge detection using the Altera DSP Builder libraries in Simulink; an example of this procedure can be found in [11]. Noise reduction is applied with a Gaussian 3×3 kernel, while edge detection uses typical Prewitt or Sobel filters. These functions can be combined in series to perform edge detection after noise reduction. Fig. 1 shows the block diagram of the filter design.
(Fig. 1. Block diagram of the filter design)
Apart from the noise and edge filter blocks, there is an intermediate logic block that coordinates the Nios II data and control paths with the timing of the filter modules. This intermediate hardware fabric follows the Avalon interface [12]. The interface cannot be simulated in the Simulink environment and is inserted into the embedded system as a Verilog file. Our Avalon implementation consists of a 16-bit data input and output path, the corresponding read and write control signals, and a control interface that selects between the intermediate Gaussian-filtered output and the edge-detector output. Data input and output are buffered through FIFO registers with the help of logic blocks. Each received image frame is stored in an external SDRAM buffer and converted into a 16-bit data stream suitable for Nios II operations; Nios II coding is discussed in Sections 5 and 6. Incoming pixels pass through a simple 2D digital Finite Impulse Response (FIR) convolution filter that operates on the grayscale intensities of neighboring pixels in a 3×3 region. The line-buffering principle is shown in Fig. 2.
(Fig. 2)
We assume an image size of 640×480 pixels. The same line-buffer circuit provides buffering for both filters. If the frame size changes, the design must be modified and re-compiled. The number of delay blocks depends on the kernel size, while the delay-line depth depends on the number of pixels per line. Deep delay lines are implemented as dedicated RAM blocks on the chip and therefore do not consume FPGA logic elements. Fig. 3 shows, from left to right, the original image, the Gaussian-filtered image and the edge-filtered image.
(Fig. 3)
4. Embedded system design
The co-processor described above was implemented as a set of components in an embedded system built around a Nios II processor, which here handles data streaming. Such designs are often the basis of industrial as well as academic projects. Once installed within the synthesis software, the Nios II processor becomes a library component in Quartus.
DSP Builder converts the model-based design into HDL code suitable for integration with other hardware components. Through the synthesis software, the filter can easily be integrated into the SOPC and combined with Nios II. The Nios II soft core together with other modules forms a complete system, including external memory controllers, DMA channels and a custom IP core for high-speed USB communication. A VGA controller can output the final result to a screen. Many such peripheral functions are available as open-source IP cores or as evaluation cores from Altera and third-party companies.
USB2.0 high-speed connectivity is added to the FPGA mother board through a daughter card. As a system-level solution, the daughter card can be plugged into any Altera board featuring a Santa Cruz peripheral connector. It provides a USB2.0 transceiver based on the CY7C68000 PHY, and a UTMI-compliant USB controller IP core integrates the USB function with the Nios II system. The practical performance of the IP core is evaluated in Section 8. Fig. 4 shows the FPGA design flow; Fig. 5 shows the FPGA development board and the image acquisition setup.
(Fig. 4. FPGA design flow)
(Fig. 5)
5. Nios soft-core programming
After the Nios II system is configured, its instruction code is downloaded. The Nios code, written in C, serves a twofold purpose: (a) it controls hardware operations, such as DMA transfers between hardware units, and offers a programming interface for handling data channels through API commands such as ''open'', ''read'', ''write'' and ''close''; (b) it allows the system to perform simple software operations on the input data instead of using dedicated hardware. For example, Nios instruction code can be used to convert image arrays into appropriate one-dimensional data streams.
6. Activity flow
The flow of software and hardware activity in this mixed architecture can be outlined as follows: (a) the image stream is transferred from the host computer to the FPGA board over the high-speed USB2.0 serial bus; the application programming interface that moves data in and out over USB is described in the next section. (b) Embedded DMA operations transfer the data from memory to the Nios data path and on to the hardware logic via the Avalon interface. (c) The data stream is processed by the hardware accelerator. (d) The hardware logic filters the image data, which are then transferred back to memory by DMA. (e) The final result is output to the VGA digital-to-analog channel, a peripheral of the Nios processor supported by DMA transfers. However, a digital-to-analog converter for the VGA port is not always available, so a practical alternative is to return the data to the host computer over USB for further processing or simply for display. It should be noted that this design is not merely a black-box implementation for one particular application; it represents a design methodology that can be customized for a wide range of applications.
7. Host-based setup and application
On the host side, a PC-based application implements part of a vision system suitable for a spectrum of industrial applications. The setup comprises a Windows XP Pentium IV computer with a high-speed USB2.0 controller and an NI 1408 PCI frame grabber. The host application is a LabVIEW virtual instrument that controls image acquisition and performs initial processing of the captured images. Fig. 6 shows the LabVIEW control panel on the PC.
(Fig. 6. LabVIEW control panel)
The frame grabber can support up to five industrial cameras performing different tasks. In our system a CCD camera captures full frames of 640×480 pixels in black and white, which are reduced to 320×240 pixels after acquisition; the smaller data volume is easier to transfer continuously. The LabVIEW host application communicates with the USB interface using API functions and a Dynamic Link Library (DLL). An advantage of LabVIEW is its integrated vision library, which enables fast manipulation or preprocessing of image data. When the FPGA board has received a complete image array, the system feeds it to the filters, and the filtered data are sent to the buffer module of the VGA controller.
8. Evaluation of the system performance
The image capture setup described above was tested by sending test data to measure USB receive and transmit performance between the PC and the FPGA board. We used a payload of 307,200 bytes in each direction between the host and the target board. With version 1.2 of the IP core and the Nios HAL driver, the receive rate reached 65 Mbit/s and the transmit rate about 80 Mbit/s; in full-speed mode the transfer rate is 9 Mbit/s.
9. Comparison with other systems
Below we compare our design with other image processing solutions in terms of performance and flexibility. To obtain comparison data, we implemented alternative solutions and ran a series of experiments, designing filters of different computational complexity. The results, measured against a Pentium IV computer with 512 MB of RAM, are shown in Fig. 7.
(Fig. 7)
10. Conclusions
This article presented a design that combines a host computer with an FPGA co-processor and studied its image processing performance. The design represents a methodology that can be used for a wide range of custom applications. It is based on an FPGA programmable device configured with an embedded Nios processor.
Design and evaluation of a hardware/software FPGA-based system for fast image processing
J.A. Kalomiros, J. Lygouras
Abstract
We evaluate the performance of a hardware/software architecture designed to perform a wide range of fast image processing tasks. The system architecture is based on hardware featuring a Field Programmable Gate Array (FPGA) co-processor and a host computer. A LabVIEW host application controlling a frame grabber and an industrial camera is used to capture and exchange video data with the hardware co-processor via a high-speed USB2.0 channel, implemented with a standard macrocell. The FPGA accelerator is based on an Altera Cyclone II chip and is designed as a system-on-a-programmable-chip (SOPC) with the help of an embedded Nios II software processor. The SOPC system integrates the CPU, external and on-chip memory, the communication channel and typical image filters appropriate for the evaluation of the system performance. Measured transfer rates over the communication channel and processing times for the implemented hardware/software logic are presented for various frame sizes. A comparison with other solutions is given and a range of applications is also discussed.
Keywords: Hardware/software co-design; Image processing; FPGA; Embedded processor
1. Introduction
The traditional hardware implementation of image processing uses Digital Signal Processors (DSPs) or Application Specific Integrated Circuits (ASICs). However, the growing need for faster and cost-effective systems is triggering a shift to Field Programmable Gate Arrays (FPGAs), where the inherent parallelism results in better performance [1,2]. When an application requires real-time processing, like video or television signal processing or real-time trajectory generation of a robotic manipulator, the specifications are very strict and are better met when implemented in hardware [3–5]. Computationally demanding functions like convolution filters, motion estimators, two-dimensional Discrete Cosine Transforms (2D DCTs) and Fast Fourier Transforms (FFTs) are better optimized when targeted on FPGAs [6,7]. Features like embedded hardware multipliers, an increased number of memory blocks and system-on-a-chip integration enable video applications in FPGAs that can outperform conventional DSP designs [2,8].
On the other hand, solutions to a number of imaging problems are more flexible when implemented in software rather than in hardware, especially when they are not computationally demanding or when they need to be executed only sporadically in the overall process. Moreover, some hardware components are hard to re-design and transfer onto an FPGA board from scratch when they are already a functional part of a computer-based system. Such components are frame grabbers and multiple-camera systems already installed as part of an imaging application or other robotic control equipment.
Following the above considerations, we conclude that it is often necessary to integrate components from an already-installed computer-based imaging application dedicated to some automation system with FPGA-based accelerators that exploit the low-level parallelism inherent in hardware structures. Thus a critical need arises for an embedded software/hardware interface that allows high-bandwidth communication between the host application and the hardware accelerators.
In this paper we apply and evaluate the performance of an example mixed hardware/software design that includes, on the one side, a host computer running a National Instruments (NI) LabVIEW imaging application, equipped with a camera and a frame grabber, and on the other side an Altera FPGA board [9] running an image filter hardware accelerator and other system components. The communication channel transferring image data from the host computer to the hardware board is a high-speed USB2.0 port implemented by means of an embedded macrocell. The various hardware parts and peripherals on the FPGA board are controlled and interconnected by a Nios-II embedded soft processor. As a result of this evaluation one can explore the range of applications suitable for a host/co-processor architecture including an embedded Nios-II processor and utilizing a USB2.0 communication channel.
In the following, we first give a short account of the tools we used for system design. We also present an overview of the particular image filtering application we embedded in the FPGA chip for the evaluation of the host/co-processor system architecture. We describe the modular interconnection of the different system parts and assess the performance of the system. We examine the speed and frame-size limits of such a design when it is dedicated to image processing. Finally, we compare our mixed host/co-processor USB-based design against other architectures and other communication media.
2. Design tools overview
The design of a DSP system with FPGAs often utilizes both high-level algorithm development tools and hardware description language (HDL) tools. It can also make use of third-party intellectual property (IP) cores implementing typical DSP functions or high-speed communication protocols [1]. In our application we use model-based design tools like The MathWorks Simulink (based on MathWorks' MATLAB) with the libraries of Altera's DSP Builder. DSP Builder uses the model design to produce and synthesize HDL code, which can then be integrated with other hardware design files within a synthesis tool, like the Quartus II development environment. In the present work, we designed image filter components using DSP Builder libraries and the resulting blocks were integrated with the rest of the system in Quartus' System-On-a-Programmable-Chip (SOPC) Builder.
SOPC-Builder design software resides as a tool in the Quartus environment. Its purpose is to integrate an embedded software processor like Altera's Nios-II with hardware logic and custom or standard peripherals within an overall system, often called a System-On-a-Programmable-Chip (SOPC). SOPC-Builder provides an interface fabric in order to interconnect the Nios-II processing path with embedded and external memory, the filter co-processors, other peripherals and the channels of communication with the host computer.

Nios-II applications were written in ANSI C and were compiled and downloaded to the FPGA board by means of Altera's Nios II Integrated Development Environment (IDE), a tool dedicated to assembling code for Nios processors. The purpose of the Nios-II applications is to control processing and data streaming between the components of the system and its peripherals.

On the host side one may develop a control application by means of any suitable language like C. We use LabVIEW software by National Instruments Corporation [10], which provides a very flexible platform for image acquisition, image processing and industrial control.
3. Modeling and implementation of the filter design
The main target of this work is to evaluate the performance of a host/co-processor architecture including an embedded Nios-II processor and utilizing a communication channel between host and hardware board, like a USB2.0 channel. The task-logic performed by the embedded accelerator can be any image function within the limitations of existing FPGA devices. For our purpose we built a typical image-processing application in order to target the FPGA co-processor. It consists of a noise filter followed by an edge-detector. Noise reduction and edge detection are two elementary processes required for most machine vision applications, like object recognition, medical imaging, lane detection in next-generation automotive technology, people tracking, control systems, etc.

We model noise and edge filtering using the Altera DSP Builder libraries in Simulink. An example of this procedure can be found in [11]. Noise reduction is applied with a Gaussian 3 × 3 kernel, while edge detection is designed using typical Prewitt or Sobel filters. These functions can be applied combined in series to achieve edge detection after noise reduction. The main block diagram of our filter accelerator is shown in Fig. 1. Apart from the noise and edge filter blocks, there is also a block representing the intermediate logic between the Nios-II data and control paths and our filter task logic. Such intermediate hardware fabric follows a specific protocol referred to as the Avalon interface [12]. This interface cannot be modeled in the Simulink environment and is rather inserted in the system as a Verilog file. Design examples implementing the Avalon protocol can be found in Altera reference designs and technical reports [13]. In brief, our Avalon implementation consists of a 16-bit data-input and output path, the appropriate Read and Write control signals and a control interface that allows for selection between the intermediate output from the Gauss filter and the output from the edge detector. Data input and output to and from the task-logic blocks is implemented with the help of Read and Write instances of a 4800-byte FIFO register.

Each image frame, when received by the hardware board, is loaded into an external SDRAM memory buffer and converted into an appropriate 16-bit data stream by means of Nios-II instruction code. Data transfer between external memory buffers and the Nios-II data bus is achieved through Direct Memory Access (DMA) operations controlled by appropriate instruction code for the Nios-II soft processor. Nios-II code flow for this system is discussed in Sections 5 and 6.
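For reference, the kernels involved have the following familiar form (the paper does not list its exact Gaussian weights; the binomial approximation below is a common choice):

\[
G = \frac{1}{16}\begin{pmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{pmatrix}, \qquad
P_x = \begin{pmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{pmatrix}, \qquad
P_y = \begin{pmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{pmatrix}
\]

where \(P_x\) and \(P_y\) are the Prewitt horizontal and vertical gradient kernels.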
(Fig. 1)
Incoming pixels are processed by means of a simple 2D digital Finite Impulse Response (FIR) filter convolution kernel, working on the grayscale intensities of each pixel's neighbors in a 3 × 3 region. Image lines are buffered through delay-lines producing primitive 3 × 3 cells where the filter kernel applies. The line-buffering principle is shown in Fig. 2. A z^-1 delay block produces a neighboring pixel in the same scan line, while a z^-640 delay block produces the neighboring pixel in the previous image scan line.

We assume an image size of 640 × 480 pixels. The line-buffer circuit is implemented in the same manner for both noise and edge filters. Frame resolution is incorporated in the line-buffer diagram as a hardware built-in parameter: if a change in frame size is required we need to re-design and re-compile. The number of delay blocks depends on the size of the convolution kernel, while delay-line depth depends on the number of pixels in each line. Each incoming pixel is at the center of the mask and the line buffers produce the neighboring pixels in adjacent rows and columns. Delay lines of considerable depth are implemented as dedicated RAM blocks in the FPGA chip and do not consume logical elements.
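A software reference model may clarify what the delay-line hardware computes per pixel. The C sketch below is our own illustration, not code from the actual design; in the hardware version the nine multiply-adds of the inner loops are unrolled into parallel pipelined logic, one output per clock.

#include <stdint.h>

/* Reference model of the 3x3 convolution performed by the hardware.
 * WIDTH matches the 640-pixel scan line that fixes the delay-line depth. */
#define WIDTH  640
#define HEIGHT 480

/* Binomial Gaussian weights (illustrative; see the kernel note above). */
static const int16_t kernel[3][3] = {
    {1, 2, 1},
    {2, 4, 2},
    {1, 2, 1}
};

void gauss3x3(const uint8_t in[HEIGHT][WIDTH], uint8_t out[HEIGHT][WIDTH])
{
    for (int y = 1; y < HEIGHT - 1; y++) {
        for (int x = 1; x < WIDTH - 1; x++) {
            int32_t acc = 0;
            /* Weighted sum of the 3x3 neighborhood; in hardware the
             * line buffers expose these nine pixels simultaneously. */
            for (int ky = -1; ky <= 1; ky++)
                for (int kx = -1; kx <= 1; kx++)
                    acc += kernel[ky + 1][kx + 1] * in[y + ky][x + kx];
            out[y][x] = (uint8_t)(acc >> 4);   /* normalize by 16 */
        }
    }
}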
(Fig. 2)
After line buffering, pipelined adders and embedded multipliers calculate the convolution result for each central pixel. Fig. 3 shows the model-design for the implementation of the 3 × 3 Gauss kernel calculations. As shown in Fig. 3, model-based design transfers the necessary arithmetic into a parallel digital structure in a straightforward manner. Logic-consuming calculations, like multiplications, are implemented using the dedicated multipliers available in medium-scale Altera FPGAs, like the Cyclone II chip.
(Fig. 3)
When the two filters work in combination, the output of the Gaussian kernel is input to a 3 × 3 Sobel or Prewitt edge detection kernel. First, the kernel pixels are produced using a line-buffer block identical to the one in Fig. 2. The edge detector is composed of horizontal and vertical components. Fig. 4 shows the model-design for the Prewitt kernel calculations of the horizontal image gradient; a similar implementation is used for the vertical gradient. For simplicity we combine horizontal and vertical edge detection filtering by simply adding the corresponding magnitudes. A binary image is produced by thresholding the result by means of a comparator. These final steps are shown in Fig. 5. Fig. 6 shows an input and the successive outputs of the hardware co-processor for a 640 × 480 pixel image.
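In equation form, the simplification just described computes

\[
|G| \approx |G_x| + |G_y|, \qquad
B(x,y) = \begin{cases} 1, & |G(x,y)| \ge T \\ 0, & \text{otherwise} \end{cases}
\]

where \(T\) is the comparator threshold, instead of the exact magnitude \(\sqrt{G_x^2 + G_y^2}\), which would require a hardware square root.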
(Fig. 4)
(Fig. 5)
(Fig. 6)
4. Embedded system design
The co-processor parts described above were implemented as components of an embedded system controlled by a Nios-II processor. The Nios-II software CPU, which is used here for data streaming control, is often the basis for industrial as well as academic projects. It can be used in its evaluation version along with the tools for assembling and downloading instruction code [14]. Once installed within the synthesis software, the Nios processor becomes integrated as a library component in Quartus' SOPC Builder tool.
DSP-Builder converts the model-based design into HDL code appropriate for integration with other hardware components. The filter is readily recognized by the synthesis software as a System-on-a-Programmable-Chip (SOPC) module and can be integrated within a Nios-II system with suitable hardware fabric. Other modules that are necessary for a complete system are the Nios-II soft processor, external memory controllers, DMA channels, and a custom IP peripheral for high-speed USB communication with the host. A VGA controller can be added in order to monitor the result on an external screen. Many such peripheral functions can be found as open-source custom HDL Intellectual Property (IP) or as evaluation cores provided by Altera or third-party companies.
USB2.0 high-speed connectivity is added to the FPGA board by means of a daughter-card by System Level Solutions (SLS) Corporation [15]. It can be added to any Altera board featuring a Santa-Cruz peripheral connector. This daughter-card provides an extension based on the CY7C68000 USB2.0 PHY transceiver. A USB2.0 IP core compliant with the USB 2.0 Transceiver Macrocell Interface (UTMI) protocol allows integration of the USB function with the Nios-II system. We tested evaluation versions of the IP core and present practical transmit and receive rates in Section 8. The FPGA chip along with the embedded Nios-II processor is always the slave device in the communication via the USB channel, while the host computer is always the master device.
A block diagram of the hardware system implemented on the FPGA board is shown in Fig. 7. The channel to the host computer is also shown.
The embedded system is assembled by means of the SOPC-Builder tool of the synthesis software, by selecting library components and defining their interconnection. After being generated by SOPC Builder, the system can be inserted as a block in a schematic file for synthesis and fitting. The only additional components that are necessary are PLLs for Nios and memory clocking. After we synthesize and simulate the design by means of the tools described in Section 2, we target a Cyclone II EP2C35F672C6 FPGA chip incorporated on a development board manufactured by Altera Corporation. This board along with the peripheral USB2.0 extension is shown in the picture of Fig. 8. The board also features external memory and several typical peripheral circuits.
5. Nios programming
After device configuration, we assemble and download instruction code for the Nios-II processor, according to the specifications of Nios' Hardware Abstraction Layer (HAL). The interested reader can look for details in the reference manuals for Nios-II and its dedicated Integrated Development Environment (Nios-II IDE) [16]. Nios code is originally written in ANSI C and has a twofold purpose:
(a) It controls hardware operations, like DMA transfers between hardware
units. It also offers a programming interface for handling data channels, with API commands like ‘‘open’’, ‘‘read’’, ‘‘write’’ and ‘‘close’’.
(b) It allows the system to perform simple software operations on the input
data instead of using dedicated hardware stages for such processing. For example, Nios instruction code can be used to convert image arrays
into appropriate one-dimensional data streams.
Nios instruction code is downloaded to on-chip memory. Our code for the particular design opens the USB port and waits to read the transmitted pixel array. It then controls pixel streaming from input to final output as described in the following section.
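A minimal sketch of this control loop under the HAL's POSIX-style file API (''open'', ''read'', ''write'', ''close'') is given below. The device name "/dev/usb0" and the process_frame() helper are our own illustrative placeholders; the actual names depend on the SOPC Builder configuration.

#include <fcntl.h>
#include <unistd.h>

/* One 8-bit 640 x 480 frame = 307,200 bytes. */
#define FRAME_BYTES (640 * 480)

static unsigned char frame[FRAME_BYTES];

/* Placeholder: in the real system, DMA streams the buffer through
 * the Avalon-attached filter logic (see Section 6). */
static void process_frame(unsigned char *buf) { (void)buf; }

int main(void)
{
    /* Hypothetical HAL device name for the USB core. */
    int usb = open("/dev/usb0", O_RDWR);
    if (usb < 0)
        return 1;

    for (;;) {
        int got = 0;                        /* wait for a full pixel array */
        while (got < FRAME_BYTES) {
            int n = read(usb, frame + got, FRAME_BYTES - got);
            if (n <= 0)
                break;
            got += n;
        }
        process_frame(frame);               /* filter the frame */
        write(usb, frame, FRAME_BYTES);     /* return the result to the host */
    }
}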
6. Activity flow
The flow of software and hardware activity that supports the operation of a system designed according to the mixed architecture detailed in this work can be outlined as follows:
(a) An image stream is transferred from the host computer to the hardware
board for processing through the high-speed USB2.0 communication channel. As described in the next section, the host application communicates with the USB port using Application Programming Interface (API) calls for data input and output.
(b) According to Nios processor instructions, embedded DMA hardware
operations transfer data from memory to the Nios data path and into the hardware task logic by means of the Avalon interface.
(c) The data stream is processed through the hardware accelerator.
(d) DMA operations stream the filtered output result from the hardware task-logic back to external memory (see Fig. 7; a code sketch of these DMA operations is given at the end of this section).
(e) The final result is output to the on-board VGA digital-to-analog channel, which is peripheral to the Nios-II processor and is supported by embedded DMA hardware transfers. However, a digital-to-analog converter for a VGA port is not always implemented on a development board, so a possible alternative is for the resulting binary image to be channeled back to the host computer via the USB connection for further processing or simply for display.

The assessment of the design performance that is presented in Section 8 includes all the above activity steps. It is important to note that the above system is not merely a black-box custom design implemented for a particular application, but represents a design methodology that can be used for a wide range of custom applications. The technical details of the operations abstracted above are well documented for the user of the particular development platform, so that every aspect of the design can be tested for repeatability. A more detailed analysis of the system development techniques is, however, out of the scope of the present article.
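As a concrete illustration of steps (b) and (d), the sketch below posts one DMA receive and one DMA transmit through the Altera HAL DMA channel API. It is a simplified sketch under our own assumptions: the channel names "/dev/dma_0" and "/dev/dma_1" are hypothetical, and a real design would route the transfers to the filter's Avalon ports and use interrupts rather than busy-waiting.

#include "sys/alt_dma.h"   /* Altera HAL DMA channel API */

static volatile int tx_done = 0;
static volatile int rx_done = 0;

/* Completion callbacks invoked by the HAL when each transfer ends. */
static void on_tx(void *handle)             { (void)handle; tx_done = 1; }
static void on_rx(void *handle, void *data) { (void)handle; (void)data; rx_done = 1; }

/* Stream one frame buffer through the filter and collect the result. */
int stream_frame(void *src, void *dst, unsigned int len)
{
    alt_dma_txchan tx = alt_dma_txchan_open("/dev/dma_0");  /* hypothetical */
    alt_dma_rxchan rx = alt_dma_rxchan_open("/dev/dma_1");  /* hypothetical */
    if (tx == NULL || rx == NULL)
        return -1;

    tx_done = rx_done = 0;

    /* Post the receive first so filtered data has somewhere to land,
     * then push the source buffer toward the task logic. */
    if (alt_dma_rxchan_prepare(rx, dst, len, on_rx, NULL) < 0)
        return -1;
    if (alt_dma_txchan_send(tx, src, len, on_tx, NULL) < 0)
        return -1;

    while (!tx_done || !rx_done)
        ;   /* busy-wait; an interrupt-driven design would sleep here */

    return 0;
}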
7. Host-based setup and application
On the host side a vision system is implemented, appropriate for a spectrum of industrial applications. The host computer is a Windows XP Pentium IV machine featuring an on-board high-speed USB2.0 controller and an NI 1408 PCI frame grabber. The host application is a LabVIEW virtual instrument (VI) that controls the frame grabber and performs initial processing of the captured images (see Fig. 9). The frame grabber can support up to five industrial cameras performing different tasks. In our system the VI application captures a full frame sized 640 × 480 pixels from a CCIR analog B&W CCD camera (Samsung BW-2302). It may then reduce the image resolution, applying an image extraction function, down to 320 × 240 pixels. It can produce even smaller frames in order to trade size for transfer rate.
The LabVIEW host application communicates with the USB interface using API functions and a Dynamic Link Library (DLL) and transmits a numeric array out of the captured frame. An advantage of using LabVIEW as a basis for developing the host application is that it includes a VISION library able to perform fast manipulation of image data or a preprocessing of the image if necessary.
When the reception of the image array is completed at the hardware board end, the system loads the image data into the filter co-processor and sends the output to the VGA controller via SRAM memory (see Fig. 7). Alternatively, the output can be sent back to the host application by means of a ''write'' Nios command. The procedure is repeated with the next captured frame.
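For orientation, the host-side calls behind the LabVIEW ''Call Library Function'' nodes would resemble the C interface sketched below. Every identifier here (usb_open, usb_write, usb_read, usb_close) is a hypothetical stand-in; the actual exports of the SLS-supplied DLL are not documented in this paper.

#include <stdint.h>

/* Hypothetical DLL exports -- illustrative names only, not the real
 * SLS driver interface. LabVIEW binds to such functions through its
 * Call Library Function node. */
int  usb_open(int port);
int  usb_write(int handle, const uint8_t *buf, int len);
int  usb_read(int handle, uint8_t *buf, int len);
void usb_close(int handle);

/* Send one serialized frame to the board and read the filtered
 * result back over the same channel. */
int exchange_frame(const uint8_t *pixels, uint8_t *result, int len)
{
    int h = usb_open(0);
    if (h < 0)
        return -1;
    usb_write(h, pixels, len);
    usb_read(h, result, len);
    usb_close(h);
    return 0;
}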
8. Evaluation of the system performance
The above setup was tested with various capture rates and frame resolutions. Using several test versions of the SLS USB2.0 megafunction, we measured receive (rx) and transmit (tx) throughput between the host PC and the target hardware system. We use a payload of 307,200 bytes for both directions. We find that, using the Nios II HAL driver, the latest evaluation version 1.2 of the IP core transfers in high-speed operation 65 Mbits per second in receive mode and about 80 Mbps in transmit mode. In full speed the transfer rate is 9 Mbps.

However, the data transfer rate from the host computer to the hardware board is only one factor that affects the performance of an image processing system designed according to a host/co-processor architecture, like the one studied here. There are also software issues to be taken into account, both at the host end and at the Nios-II embedded processor side. For example, frame capturing and serialization prior to transferring are factors that limit frame rate in video applications. On the other hand, the Nios-II embedded processor controls the data flow following instruction code downloaded to embedded memory, as described in Section 5. So, the overall performance of the system depends on the fine-tuning of all these factors. The LabVIEW software allows for an efficient handling of array structures and also possesses image grabbing and vision tools that reduce processing time on the host side. Besides the above software limitations, there are also hardware issues related to an integrated System-on-a-Programmable-Chip, like the time needed for Direct Memory Access (DMA) transactions between units. The performance of the hardware board is divided into the processing rates of the hardware filter co-processor and the performance of the rest of the system, like external memory buffers and the interconnect fabric. This second factor adds an overhead depending on memory clocking and the structure of the interconnect units. We evaluate the performance of the proposed architecture taking into account, and measuring when possible, the following delay times:
(a) Time to grab an image frame and serialize it.
(b) Transfer time over the USB2.0 channel.
(c) Nominal time needed by the co-processor filter in order to process the image frame.
(d) Overhead time needed for data flow and control in the integrated hardware system.
Table 1 summarizes the response time of the above operations and reports frame rates for two typical frame sizes. As a whole the system results in a practical and stable video rate of 20 frames per second at an image resolution of 320 × 240 pixels. Similarly, larger frames with dimensions 640 × 480 pixels can be transferred and processed at a rate of approximately seven frames per second. When the board is clocked at 100 MHz the hardware image filter processes a 640 × 480 pixel frame in a minimum of 3.1 ms, while DMA transfers and other control flow add approximately 30 ms in order to transfer image data between units. The frame is transferred from the host computer to the hardware board in approximately 26 ms. Other possible latencies include proper data manipulation by the Nios-II instruction code and depend on the frame size.
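As a rough cross-check of these figures (our arithmetic, not reported in the paper): a full 640 × 480 8-bit frame is 307,200 bytes, so at the measured channel rates the raw transfer times are

\[
t_{tx} \approx \frac{307{,}200 \times 8\ \text{bits}}{80\ \text{Mbit/s}} \approx 31\ \text{ms}, \qquad
t_{rx} \approx \frac{307{,}200 \times 8\ \text{bits}}{65\ \text{Mbit/s}} \approx 38\ \text{ms},
\]

which, together with the roughly 30 ms of DMA overhead and 3.1 ms of filtering quoted above, is consistent with the measured rate of about seven 640 × 480 frames per second once host-side grabbing and serialization are included.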
In conclusion, delays are divided between software and hardware procedures on both sides of the system. The hardware filter co-processor does not substantially delay the overall system performance. Host and Nios-II processing time and transfer rates over the communication channel can be a bottleneck for large frames. Since open-core USB2.0 embedded technology is still developing, one may expect it soon to make even better use of the available bandwidth. Table 2 summarizes the hardware requirements for the overall system we implemented in this study (see also Fig. 7). The table reports the number of logic elements and memory bits needed to implement the functions presented above in a medium FPGA chip, the Altera Cyclone II EP2C35F672C6. The full potential of this chip is 33,000 logic elements and 480,000 memory bits. The table also reports clock frequencies for the soft processor and the external DDR2 memory. Clocks were implemented by means of two Phase-Locked Loops (PLLs).
(Fig. 7)
9. Comparison with other systems
In the following, we present a comparison with other image processing solutions, in terms of performance and flexibility. In order to establish some numerical comparison between the presented architecture and a purely software solution, we performed a series of experiments. We synthesized designs with variations of the basic filter operations, introducing different degrees of computational complexity. We also implemented in hardware a Sum of Absolute Differences (SAD) algorithm for dense depth map calculations, which is based on correlation operations and is much more computationally intensive than simple convolution. We compared the results with the same algorithms implemented in software and running on a Pentium IV processor at 3 GHz with 512 MB RAM. Typical results are presented in Fig. 10 and summarized in Table 3. The software results were attained by programming the corresponding procedures analytically in NI's LabVIEW language, using optimized library functions for array processing. We used pre-captured AVI video sequences and processed each frame with the same image functions as in the hardware version of the algorithm. Frame resolutions of 320 × 240 and 640 × 480 pixels were assessed separately, since they need different transfer times to our hardware co-processor.
Table 3 shows frame rates for processing video files of both resolutions as a function of computational complexity. In the first column of Table 3, the corresponding image processing operation is given. Computational complexity in the second column is measured as multiples of n × N, where n is the size of the convolution kernel, equal to 3 × 3, and N is a total of 320 × 240 image pixels. In the case of the SAD algorithm, complexity is considered to be the product n × N × D, where n is the size of the comparison window applied on an image of N pixels, for a total disparity range of D pixels [17]. Our hardware version of this algorithm, implemented as an FPGA co-processor, will be published in a future article. The third and fourth columns of Table 3 give the total frame rate for each operation when implemented in pure software and in our proposed architecture, respectively. Let us note that the frames per second (fps) value in the case of the SAD algorithm refers to a single transmitted frame instead of the stereo pair, in order to have a result comparable with the other values. For both stereo images the whole cycle of transmitting, processing and projecting on a monitor screen is 14 fps (resolution 320 × 240), when working with the current version of the SLS USB2.0 megafunction.
(Table 3. Frame-rate comparison between PC-based implementations and host/co-processor designs for a number of image processing operations)
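Written out, the two complexity measures used in Table 3 are

\[
C_{\text{conv}} = n \cdot N = (3 \times 3) \cdot (320 \times 240), \qquad
C_{\text{SAD}} = n \cdot N \cdot D,
\]

so, for example, a single 3 × 3 convolution over a 320 × 240 frame costs 9 × 76,800 = 691,200 multiply-accumulate operations, and SAD multiplies this further by the disparity range D.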
The last column of Table 3 gives the total logic elements required for mapping each algorithm on a Cyclone II EP2C35F672 Altera FPGA. The required hardware resources increase with an increasing number of stages in the hardware pipeline. However, resources do not necessarily increase with increasing computational complexity. The reason is that the same hardware structure can in principle be adequate for both small and large frames, since the same parallel computations are needed per pixel and per clock cycle, as pixels stream into the system pipeline. Frames of increased resolution only need deeper line buffers, as shown in Fig. 2. Hence, smaller frames need fewer dedicated memory bits than larger frames. On-chip memory requirements for the implementations of Table 3 vary between 185,000 and 230,000 memory bits out of a total of 480,000. The hardware implementations in Table 3 require a nominal processing time of 0.77 ms for every 320 × 240 frame and 3.1 ms for every 640 × 480 frame (see also Table 1). SAD needs twice that time in order to process a stereo pair. Additional times for data transferring and control are as discussed in Section 8.
Fig. 10 presents the above results in diagram form. In Fig. 10a the processing results for a video file with frame resolution 320 × 240 are presented; our hardware/software co-design results are shown with triangles. In Fig. 10b our hardware processing results are shown with squares and refer to a frame resolution of 640 × 480. The rhombs at the lower part of both diagrams represent frame rates achieved by pure software running on the host computer.
It can be seen that the performance of the host PC without the hardware co-processor is good in the case of simple processes but, as expected, it falls rapidly with increasing computational complexity. The performance of our hardware/software mixed architecture is almost constant as complexity increases, because in all cases processing is completed in a single pass, at clock rate. A slight decrease is due to pre-processing steps, according to the type of operation. However, the performance of our system is dependent on frame size, as shown by the corresponding curves in Fig. 10a and b. The reduction of frame rate in the case of increased resolution is clearly attributed to the increased payload during USB transfers and DMA transactions from board memory to the FPGA system.
By better optimizing the software algorithms or by adopting faster processors in the future, the performance of the software system can be enhanced; however, the gap will still persist with increasing computational demands.
The bottom line of the above analysis is clearly that the suitability of the proposed mixed architecture is application dependent. It is better justified in the case of algorithms with increased computational complexity, where the PC alone cannot respond at a reasonable frame rate. Keeping a moderate frame resolution helps our design to respond at a decent rate, which, however, is still less than true video rate.
Depending on the system requirements, other approaches may also be used. Using a video input card like the Altera DC Video-TVP5146N, one can input video data to the FPGA and perform a range of image functions, like simple filtering, color processing, recasting to different video formats, compression, etc. [18]. In this case one can bypass the computer-based frame grabber and avoid the use of a host PC. This has the principal cost of losing the flexibility inherent in the software part of the system, as well as losing the PC-based control over the cameras and frame grabber, but the result is an all-hardware system that excels in performance. Processing rate is in principle limited by the frame grabber, which in the case of an NTSC composite input signal is 30 fps, while for PAL video input it is 25 fps.
Similarly, processing at video rate may be obtained using the ASIC approach, which usually incorporates all parts in an all-hardware custom system. Image processing implementations of medium and high complexity appear in the literature, and most of them can manipulate images in real time [19,20]. They can reach a frame rate of 30 frames per second or even more in some custom systems. In some cases they can also incorporate a CCD interface [21]. However, ASIC implementations are complex and expensive systems that suffer in terms of flexibility. They have a slow design cycle and are certainly far from the plug-and-play approach adopted in the present article.
Pure DSP-based designs support thousands of MIPS and are comparable in performance with desktop PCs, even running at slower clock speeds. They also consume much less power. Being purely software-based, they are much more flexible than hardware implementations. In addition, the control-flow part of an algorithm is better mapped to software than to hardware. Moreover, they can easily incorporate video input and output channels with little additional hardware [22]. However, just as is the case with ordinary serial computers, they cannot perform computationally demanding tasks at a high video rate [23].
The use of FPGAs in hardware implementations can partly bridge the flexibility gap, because FPGAs are re-programmable and have a rapid design cycle. Compared with DSPs, systems implemented using FPGAs are more efficient, especially for data-flow algorithms with minimal control processing. However, building with FPGAs, we also have to implement the protocols and interfaces for video I/O in hardware.
In the host/co-processor architecture applied in our design, the FPGA absorbs the sections of the algorithm that are costly in software, while Nios-II executes flow control. Although Nios is a medium-performance processor, it can greatly help by implementing the control path, as well as simple pre-processing and post-processing steps. In our system, an application can further be partitioned between the host computer and the hardware board. The host part interfaces with a frame grabber and controls peripheral equipment. It can also execute parts of the algorithm that are software-friendly.
Hardware task logic operates as an accelerator of any computationally demanding function and can be added to the SOPC system as a library component. It can also be re-used, shared and imported in any SOPC system that follows the same design platform. These characteristics testify to the high level of flexibility inherent in this design.
As shown by the discussion above, the presented system performs better than a general-purpose computer or a pure DSP design in the case of heavy computational tasks, since the hardware task logic is very fast. It is less efficient than a pure hardware system, since such systems can perform at 30 frames per second or more. However, it is more flexible and has a clear potential to evolve. The USB2.0 transfer rate will certainly be enhanced in future versions. In addition, this particular IP core complies with Altera's OpenCore Plus program [24], which offers free evaluation and simulation of a megafunction. Finally, the proliferation of USB2.0 communication ports in modern computers and video equipment makes the proposed channel a suitable commodity choice when building a hardware/software architecture based on an FPGA co-processor.
10. Conclusions
The design of a general hardware/software system based on a host computer and an FPGA co-processor was presented in this article. The system performance was studied in the case of image processing algorithms and represents a design methodology that can be used for a wide range of custom applications. It is based on a hardware board featuring a medium FPGA device, which is configured as a System-On-a-Programmable-Chip, with a Nios-II software processor in the system control path. Other hardware components are local and on-chip memory and a UTMI-compliant macrocell by SLS Corporation, allowing fast communication with a host computer.
An integrated system controlled by a Nios-II software processor provides significant flexibility for the transfer and processing of image data. An optimum choice of hardware/software features may accelerate the system performance. Partitioning a machine vision application between a host computer and a hardware co-processor may solve a number of problems and can be appealing in an academic or industrial environment where compactness and portability of the system are not of primary importance.
The processing and transfer rates reported in Tables 1 and 3 can be sufficient for a spectrum of academic or industrial vision applications. Obviously, the system is better justified in the case of very demanding image processing computations that fail to perform at a reasonable rate on a personal computer. Such tasks are point pattern matching for biometric applications [25] or block matching for finding correspondences between the images of a stereo pair [17]. The latter application usually depends on many cameras and has significant computational demands. Cameras and frame grabbers are better controlled by a computer with custom software, since image grabbing and transmission protocols are not easily transferred into hardware designs. On the other hand, the real-time calculation of disparities from two or more cameras is better performed by hardware [26–28].
Alternatively the host/co-processor Nios-II based architecture can be used for implementing and testing in hardware a variety of image-processing academic designs, given its modular flexibility and potential for using open source code. It can also be used to manage image processing in a number of educational applications and student machine-vision exercises [29].
Future work can include more tests of complex algorithms, or comparisons of the presented architecture with a future high-throughput Ethernet channel or a PCI channel for host/co-processor communication. Also, it is certainly meaningful to work towards better optimization of the USB2.0 communication channel. Next-generation USB2.0 macrocells and Nios soft processors can further increase the range of applications of the proposed design.
Acknowledgements
This work was conducted in communication with the support team of System Level Solutions Corporation, who provided successive evaluation versions of the USB2.0 IP core for testing. We especially thank software manager Tejas Vaghela for his constant support and advice.