ORGFX - a Wishbone compatible Graphics Accelerator for the OpenRISC processor

Per Lenander
Mälardalen University, Robotics program
Västerås, Sweden
per.lenander.swe@gmail.com

Anton Fosselius
Mälardalen University, Robotics program
Västerås, Sweden
anton.fosselius@gmail.com

August 20, 2012

Abstract

Modern embedded systems such as cellphones or medical instrumentation use increasingly complex graphical interfaces. Currently there are no widely used open hardware solutions to accelerate embedded graphical applications. This thesis presents the ORSoC graphics accelerator (ORGFX), an open hardware graphics accelerator that can be used with programmable hardware. A standalone software implementation is provided to support rapid development of accelerated applications. The accelerator is able to render 2D, 3D and vector graphics.

The example implementation of the ORGFX is integrated with the OpenRISC Reference Platform System on Chip version 2 (ORPSoCv2). The final implementation runs on a Xilinx FPGA at 50 MHz, and provides accelerated graphics output from an HDMI port. An extensive software driver and a set of utilities to ease development for the graphics accelerator are provided along with the hardware. The software implementation of the accelerator uses the same API as the hardware drivers, making it possible to quickly develop applications for the accelerator without access to a physical platform.

The final implementation trades performance for platform independence and generality. The component can be integrated with any CPU or memory chip and works alongside a custom display core that renders the output to an external screen. The software drivers can be run bare metal or modified to run on an operating system. All of the hardware and software developed in this project is provided as open source under the GNU Lesser General Public License (LGPL), and can be downloaded from www.opencores.com.
The authors hope that future releases will be integrated as a standard component into the OpenRISC Reference Platform System on Chip.

Keywords: Embedded Computer Graphics, OpenRISC, FPGA, Vector Graphics

Contents

1 Introduction
    1.1 Background
    1.2 Thesis structure
2 Related works
3 Concepts
    3.1 Introduction to graphics
        3.1.1 Rasterized graphics
        3.1.2 Vector graphics
        3.1.3 Framebuffer
        3.1.4 Textures
        3.1.5 Sprites
        3.1.6 Fonts
        3.1.7 Glyph
        3.1.8 Triangulation
    3.2 Hardware terminology
        3.2.1 FPGA technology
        3.2.2 Hardware description languages
        3.2.3 IP Cores
        3.2.4 System-on-Chip
        3.2.5 Hard and soft CPUs
        3.2.6 OpenRISC
        3.2.7 Wishbone bus
        3.2.8 ORPSoCv2
    3.3 Vector Fonts
        3.3.1 TrueType fonts
        3.3.2 PostScript fonts
        3.3.3 OpenType fonts
        3.3.4 FreeType
    3.4 Linux and free Software
        3.4.1 GPL
        3.4.2 LGPL
        3.4.3 Linux
        3.4.4 Drivers
        3.4.5 DirectFB
        3.4.6 X-Server
        3.4.7 KMS
        3.4.8 Direct Rendering Infrastructure
        3.4.9 Direct Rendering Manager
4 Requirements
5 Design
    5.1 Display control
        5.1.1 Render target
        5.1.2 Device coordinate system
        5.1.3 Texture coordinate system
    5.2 Control interface
    5.3 2D engine features
        5.3.1 Color depth modes and variable resolution
        5.3.2 Rectangles
        5.3.3 Lines
        5.3.4 Triangles
        5.3.5 Clipping
        5.3.6 Coloring
        5.3.7 Color keying
        5.3.8 Alpha blending
    5.4 3D engine features
        5.4.1 Transformations
        5.4.2 Interpolation
        5.4.3 Z-buffer culling
    5.5 Vector engine features
        5.5.1 Path theory
        5.5.2 Shape implementation
        5.5.3 Alternative approaches
    5.6 Software
        5.6.1 Bus interface
        5.6.2 Surfaces
        5.6.3 Meshes
        5.6.4 Fonts
6 HDL implementation
    6.1 Development board
        6.1.1 Video Ram
        6.1.2 Display core
        6.1.3 HDMI converter
    6.2 Architecture
        6.2.1 OpenRISC CPU
        6.2.2 System-on-Chip
        6.2.3 Wishbone interfaces
        6.2.4 Pipeline
        6.2.5 Transformation processor
        6.2.6 Rasterizer
        6.2.7 Interpolation
        6.2.8 Clipping
        6.2.9 Fragment processor: coloring
        6.2.10 Fragment processor: vector rendering
        6.2.11 Blender
        6.2.12 Renderer
7 Software integration
    7.1 Bare metal driver
        7.1.1 Basic functionality
        7.1.2 Extended API
        7.1.3 Advanced API – Tilesets and bitmap fonts
        7.1.4 Advanced API – Vector fonts
        7.1.5 Advanced API – 3D
    7.2 Utilities
        7.2.1 Sprite maker utility
        7.2.2 Bitmap font maker utility
        7.2.3 Mesh maker utility
        7.2.4 Vector font maker utility
8 Testing and validation
    8.1 Algorithmic validation
    8.2 Hardware validation
    8.3 Software validation
    8.4 System validation
9 Results
    9.1 Performance
    9.2 Benchmarking
10 Future work
    10.1 Textures
    10.2 Bandwidth issues
    10.3 8/24/32 bpp
    10.4 Alpha from memory
    10.5 Precision issues
    10.6 Platform specific optimizations
    10.7 Other bus implementations
    10.8 Linux driver
11 Conclusions
A Appendix A, ORGFX Specification
B Appendix B, Enhanced VGA/LCD Specification

1 Introduction

There is a growing demand for graphical user interfaces in modern embedded systems. The end-user wants an easy-to-use graphical interface in everything from cellphones to machine interfaces. A big part of the embedded market uses open source software and some have begun to work with open hardware. While there are several processors for embedded systems available today with accelerated graphics, few offer open source graphics drivers.
To make it possible to build a more open system, this thesis shows how a pipelined fixed-point graphics accelerator can be implemented. While building a graphics accelerator from scratch is a non-trivial task that requires some time and knowledge, it is far from impossible. With the introduction of the Field Programmable Gate Array (FPGA)¹, it became possible to create hardware by writing a few lines of code in a Hardware Description Language (HDL)². As FPGA chips become cheaper and faster, they are more commonly used in embedded systems. The logical next step for an embedded system with an FPGA is to integrate not only open source software but also open hardware into the design.

This thesis presents the ORSoC Graphics Accelerator (ORGFX), an open hardware component with open source drivers capable of rendering 2D, vector and simple 3D graphics. See appendix A for a technical specification of the final result of this thesis.

1.1 Background

Open source software is steadily gaining ground as more and more companies start to see the benefits that it brings. Something that is still relatively unknown is open hardware. If a large set of Intellectual Property (IP) cores that provide common functionality (such as memory interfaces, Ethernet connectors, USB connectors and so on) were open source, it would greatly increase the speed at which a System-on-Chip (SoC) can be developed.

The Swedish company ORSoC attempts to increase the amount of open hardware available on the market by maintaining and developing the open source community OpenCores.org and the open source processor OpenRISC. Recently ORSoC has released a development board with an Altera Cyclone IV FPGA and some standard connectors. The board is intended to demonstrate the capabilities of the OpenRISC processor. The OpenRISC processor can run the Linux operating system, and even supports some basic framebuffer rendering. However, it does not currently support any graphics accelerators.
Due to the low clock speed of the FPGA (and thus of the OpenRISC processor), doing graphics in software is very slow. Instead, it is a good idea to use specialized accelerated FPGA components to obtain high resolution graphics.

The goal of this thesis was to build an open source graphics acceleration component that can be connected to a CPU, such as the OpenRISC processor. Beyond the basic capabilities of copying images from video memory to the framebuffer, the core should be able to perform triangular and rectangular color fill and line drawing operations. Additionally, the core provides acceleration of vector graphics. By setting up a few parameters, the CPU can offload a lot of serial computation to dedicated hardware running one or more pipelines. This allows the CPU to spend more time on other calculations.

1.2 Thesis structure

After a brief review of related works and documentation of similar systems in section 2, some basic concepts and terms used throughout the paper are explained in section 3. The actual body of the thesis is separated into several sections: first the requirements on the system are presented in section 4, then the theory and overall design of the solution are described in section 5. Finally, details about the hardware implementation are explained in section 6. Section 7 presents the software drivers and various utilities developed for the hardware implementation. The environment used for testing and validation is explained in section 8, and the results of the tests are presented in section 9. Finally, future work is presented in section 10 and the authors' conclusions are discussed in section 11.

2 Related works

Several open hardware display cores have been developed and are available on OpenCores, but very few open source graphics accelerators. The VGA/LCD Controller core (see appendix B), which is used in the example implementation of ORGFX, only supports displaying a section of memory on a monitor.
The core has no accelerated drawing operations, and the accelerator cores currently found on OpenCores can only handle simpler 2D operations like line and rectangle drawing.

Modern consumer level graphics cards have a large number of highly configurable Graphics Processing Units (GPU, as opposed to CPU). These processors are designed to perform a large number of similar calculations in parallel, such as coloring and shading a set of pixels. Traditionally these processors are specialized in performing graphics calculations, but there are many other applications that could benefit from data centric parallel computation.

¹ More info about FPGAs is found in section 3.2.1.
² More info about HDLs is found in section 3.2.2.

One way to provide acceleration for a given calculation is to write a small program (called a shader) for the GPU using one of the common graphics APIs, OpenGL or DirectX. In this way advanced calculations adjusted for parallel computing can be performed by the graphics card, leaving the main processor(s) free. Graphics card developer Nvidia has released many articles on how to use their hardware to accelerate calculations of different kinds. In their book GPU Gems 3 [7], they describe a way to accelerate rendering of vector graphics on the GPU using programmable shaders.

Some non-graphics uses of GPU hardware are: physics calculations³, medical simulations such as Folding@Home⁴, encryption and decryption software ([7], chapter 36) and much more. All the benefits of using parallel hardware for GPU computations do of course apply to the more generic FPGA technology.

On the embedded market, ARM Holdings has its own GPU architecture called Mali⁵, while Imagination Technologies has its slightly more powerful PowerVR⁶. Nvidia has also made an effort to reach the embedded market with its Tegra platform⁷.
All three vendors have support for Open Graphics Library for Embedded Systems (OpenGL ES⁸), Open Computing Language (OpenCL⁹) and Open Vector Graphics (OpenVG¹⁰). The disadvantage of all these implementations and many other similar ones is that both the hardware and the software drivers are proprietary; there is no way for a user to make changes to the hardware or software itself, only to configure it using the provided interfaces.

The Open Graphics Project (OGP¹¹) is an FPGA-based open hardware graphics card that has been under development for a few years. The goal of the project is to build an open source graphics card for desktop computers with full OpenGL support. The card is based on an FPGA and is connected to the computer through the PCI port. A full software emulation has been implemented, but the hardware development has stalled. OGP released a development board in 2010 named Open Graphics Development 1 (OGD1).

Another FPGA-based project of interest is the proprietary TurboVG. TurboVG is a vector graphics accelerator that implements hardware acceleration for OpenVG [3]. While there is not much information available about the system yet, it is a recently developed product with features very similar to those presented in this thesis.

3 Concepts

This section contains a quick reference and background to some core concepts used frequently in this thesis. It outlines the basic knowledge that the reader should have on these subjects to fully understand the rest of the thesis.

3.1 Introduction to graphics

This section introduces the graphics-related terms used in this thesis.

3.1.1 Rasterized graphics

All displays used in the computer industry work at discrete resolutions. This means that if you look closely enough you will be able to see the pixels, the smallest image elements on the screen¹². If a screen works at 640x480 pixels at 16 bits per pixel (bpp), that means that there are 640 horizontal pixels by 480 vertical, where each pixel color is described by 16 bits of data.
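As a worked example of these numbers, the short Python sketch below computes how much memory one 640x480 frame at 16 bpp occupies and where a given pixel sits in a linear pixel buffer. The function names and the row-major layout are assumptions made for illustration; the text above does not prescribe them.

```python
def framebuffer_bytes(width, height, bpp):
    """Size in bytes of one frame with the given resolution and color depth."""
    return width * height * bpp // 8


def pixel_offset(x, y, width, bpp):
    """Byte offset of pixel (x, y), assuming a row-major layout (an assumption)."""
    return (y * width + x) * bpp // 8


print(framebuffer_bytes(640, 480, 16))  # 614400 bytes, i.e. 600 KiB
print(pixel_offset(0, 1, 640, 16))      # 1280, the start of the second row
```

A single frame at this resolution and depth therefore needs 600 KiB, and each additional buffer of the same size adds the same amount again.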
A common way to store images is as bitmaps, which are simply pixel buffers. Though these image buffers can be scaled and rotated, the result is usually very jagged. While this can be improved by using different filters to smooth the scaled image, the result is either jagged or blurry, especially at lower resolutions (see figure 1). Images consisting of a discrete number of pixels are referred to as rasterized graphics. See the next section for a different way to store images that overcomes the drawbacks of rasterized graphics.

3.1.2 Vector graphics

The concept of vector graphics is that instead of storing every single pixel of an image, a mathematical formula that describes the various shapes in the image is stored. Since displays are still made of pixel arrays, the vector images still have to be rasterized when they are actually drawn to the screen. However, since the image is described by vectors, it can be scaled and rotated before rasterization without any loss of detail.

³ Physics for Nvidia hardware: http://www.geforce.com/hardware/technology/physx
⁴ http://folding.stanford.edu/
⁵ http://www.arm.com/products/multimedia/mali-graphics-hardware/index.php
⁶ http://www.imgtec.com/powervr/powervr-graphics.asp
⁷ http://www.nvidia.com/object/tegra.html
⁸ https://www.khronos.org/opengles/
⁹ http://www.khronos.org/opencl/
¹⁰ http://www.khronos.org/openvg/
¹¹ http://wiki.opengraphics.org/
¹² As a side note: modern TV sets usually make use of hardware smoothing algorithms to somewhat hide this fact. This is the reason that hooking up your computer to a TV instead of a monitor can produce a blurry image.

Figure 1: A rasterized image in original size, scaled and finally rotated. No filtering was used when scaling and rotating.

Figure 2: A vector image. Notice that the image has infinite detail even when scaled, and no pixelation artefacts are visible.

One common use of vector graphics is TrueType Fonts, fonts that can be rendered to smooth text at any resolution.
For an example, see figure 2. Another common use of vector graphics is Adobe Flash, which also demonstrates how vector graphics can be used for animation through simple transformations of one image or shape, instead of storing multiple images. By saving the state of the vector control points for a few keyframes, the computer can generate the intermediate frames. The process is known as tweening, short for inbetweening, and allows for very smooth animations using small amounts of data.

3.1.3 Framebuffer

A framebuffer is a buffer that stores the content that will be written to the screen. The central processing unit (CPU) or the graphics processing unit (GPU) writes data to the framebuffer. The display hardware that runs the screen then reads from the framebuffer and writes its content to the screen.

A common problem with framebuffers is that the CPU/GPU cannot write to the framebuffer while the display hardware reads from it, and that the display hardware may render the image to the screen before the CPU/GPU has finished drawing it. To avoid flickering and delays, double buffering is often implemented. With double buffering, there are two or more framebuffers, where one of the buffers is read by the display hardware and the other is written to by the CPU/GPU. The two buffers are then swapped when the CPU/GPU has finished drawing.

3.1.4 Textures

A texture is an array of data that is used to store image data. Usually a texture is two dimensional, but both one dimensional and three dimensional textures can be useful in graphics calculations. A single element of a 2D texture is referred to as a pixel or texel, while a single element of a 3D texture is called a voxel. Textures are commonly used to store not only color data, but also normal maps or bump maps (used for two different shading techniques that are outside the scope of this thesis).

When this thesis report mentions textures, the term always refers to 2D bitmaps containing color data (raster images).
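The double buffering scheme from section 3.1.3 can be illustrated with a small Python sketch. The class and method names are hypothetical, and plain lists stand in for video memory; it is a simplified model under those assumptions, not the ORGFX driver API.

```python
class DoubleBuffer:
    """Toy model of double buffering with two equally sized pixel buffers."""

    def __init__(self, width, height, clear_color=0):
        self.width = width
        self.height = height
        # buffers[front] is scanned out by the display; the other one is drawn to.
        self.buffers = [[clear_color] * (width * height) for _ in range(2)]
        self.front = 0

    @property
    def back(self):
        """Index of the buffer that is currently being drawn to."""
        return 1 - self.front

    def set_pixel(self, x, y, color):
        """Draw into the back buffer only; the visible image is untouched."""
        self.buffers[self.back][y * self.width + x] = color

    def visible_pixel(self, x, y):
        """What the display hardware would read from the front buffer."""
        return self.buffers[self.front][y * self.width + x]

    def swap(self):
        """Make the finished back buffer visible and reuse the old front buffer."""
        self.front = self.back


fb = DoubleBuffer(4, 4)
fb.set_pixel(1, 2, 0xFFFF)
before = fb.visible_pixel(1, 2)  # still the clear color: the drawing is hidden
fb.swap()
after = fb.visible_pixel(1, 2)   # the new pixel appears only after the swap
print(before, after)
```

Because drawing only ever touches the back buffer, the display never reads a half-finished frame; the cost is the memory for a second buffer.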
3.1.5 Sprites

The term sprites is used in this thesis for images stored in device memory. A sprite can refer to an image or a particular part of an image. An image that is a collection of sprites is often referred to as a sprite sheet. The term is often used when referring to animated 2D characters in video games.

3.1.6 Fonts

A font is a collection of signs, letters or symbols that can be drawn to form a word or expression. There are two common types of fonts: bitmap fonts, where every "shape" is stored as an image, and the more common vector fonts. Vector fonts are represented by mathematical formulas instead of rasterized images. For more info see section 3.3.

3.1.7 Glyph

A font is a collection of glyphs, where each glyph represents one character, symbol or shape (for example, the letter 'D' or the symbol '@'). In bitmap fonts those glyphs are stored as small images, while in vector fonts they are stored as a collection of outlines.

3.1.8 Triangulation

In computer graphics the term tessellation describes filling a shape with smaller sub-shapes. Filling a geometric shape (such as a vector outline) with triangles is called triangulation.

3.2 Hardware terminology

This section gives a brief introduction to some hardware terms used in this thesis.

3.2.1 FPGA technology

The Field Programmable Gate Array (FPGA) consists of logic units that can be connected together to form a complex circuit. A Hardware Description Language (HDL) is used to describe how those logic units are connected. FPGA technology has become increasingly popular since it was invented in the 1980s. Some of the reasons for its popularity are that it can solve legacy and component shortage issues and can reduce the time-to-market when developing new products. The two largest FPGA developers today are Xilinx and Altera.
3.2.2 Hardware description languages

The two most common hardware description languages (HDLs) used today are Verilog HDL (Verilog) and Very High Speed Integrated Circuit HDL (VHDL). The industry standard in Europe is VHDL, which is based on Ada, while Verilog, with its C-style syntax, is the preferred language outside of Europe. Stephen Bailey presents a comprehensive summary of the differences between the two languages in a white paper from 2003 [1]. This project uses Verilog, since all the surrounding components have Verilog implementations while only a few of them have VHDL implementations.

The major difference from ordinary programming languages is that statements written in HDL are executed in parallel rather than sequentially. This allows for higher data throughput on an FPGA than on a CPU, even if the FPGA runs at a lower frequency.

3.2.3 IP Cores

An HDL component is called a core or an IP. IP is short for semiconductor intellectual property core; such a component is most commonly called an IP core or IP block. Three types of IP cores exist: hard cores, firm cores and soft cores. The hard and firm cores are outside the scope of this thesis. A soft IP core is a component created in a hardware description language that can be synthesized to run on an FPGA.

3.2.4 System-on-Chip

An FPGA-based System-on-Chip, or SoC for short, is an embedded system built up from several IP cores and tied together with "FPGA-glue". An SoC usually has some sort of CPU that controls the system. Both Altera and Xilinx provide tools to create SoCs from packaged IP cores without writing a single line of code.

3.2.5 Hard and soft CPUs

In System-on-Chip solutions, there are two different possibilities when it comes to choosing a central processor for the system: hard and soft CPUs. A hard CPU is an integrated part of the FPGA chip that cannot be changed (usually an ARM-based CPU). Hard CPUs are becoming more common in newer FPGA chips.
A soft CPU, on the other hand, is described by HDL, and can thus be implemented on any FPGA that has enough gates. The performance and logic usage of a soft CPU can be impacted by other factors, such as the availability of internal memory blocks or hardware multipliers. While soft CPUs are more versatile (any number of them can be added to an FPGA design), hard CPU implementations can provide much higher performance.

3.2.6 OpenRISC

OpenRISC¹³ is an open source 32-bit Harvard architecture soft Reduced Instruction Set Computer (RISC) CPU IP core. The OpenRISC project was started by Damjan Lampret, the founder of opencores.com. The implementation used in this thesis was an OR1200¹⁴ from opencores.com. In addition to supporting the newlib and uClibc C implementations, the processor has been supported by the Linux operating system since kernel version 3.1.

The OpenRISC implementation comes with a full compiler suite, both for building bare metal programs that run directly on the CPU (or32-elf tool chain) and for building Linux applications (or32-linux tool chain). In addition to this, there is a fully compatible simulator that emulates the entire CPU.

3.2.7 Wishbone bus

Wishbone is an example of "FPGA-glue" that is used to tie several cores together with a unified model of communication. The Wishbone bus¹⁵ is an open bus protocol used in many open source designs, including the OpenRISC processor. It can handle variable data and address widths, the most common being 32 bits for both, and supports both reading and writing. Later revisions of the bus support reading and writing in bursts, allowing higher bandwidth usage.

3.2.8 ORPSoCv2

The OpenRISC Reference Platform System-on-Chip Version 2 (ORPSoCv2) is an SoC which integrates an OpenRISC CPU, a VGA/LCD driver and several other useful components. This SoC is used in the implementation phase of this thesis in order to test and validate the ORGFX core.
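To make the bus model from section 3.2.7 more concrete, the following Python sketch models a single classic Wishbone read or write as seen by a slave. The signal names (CYC, STB, WE, ADR, DAT, ACK) follow the Wishbone specification, but the class, its immediate acknowledge and the register addresses are simplifying assumptions for illustration and do not describe the ORGFX register map.

```python
class WishboneSlave:
    """Toy 32-bit Wishbone slave backed by a dictionary of registers."""

    def __init__(self):
        self.regs = {}

    def cycle(self, cyc, stb, we, adr, dat_i=0):
        """Evaluate one bus cycle and return (ack, dat_o).

        The slave only responds when both CYC and STB are asserted.
        WE selects a write (True) or a read (False).
        """
        if not (cyc and stb):
            return False, 0
        if we:
            self.regs[adr] = dat_i & 0xFFFFFFFF  # keep the data 32 bits wide
            return True, 0
        return True, self.regs.get(adr, 0)


def wb_write(slave, adr, data):
    """Master-side helper: perform one single write and return the ACK flag."""
    ack, _ = slave.cycle(cyc=True, stb=True, we=True, adr=adr, dat_i=data)
    return ack


def wb_read(slave, adr):
    """Master-side helper: perform one single read; None if not acknowledged."""
    ack, data = slave.cycle(cyc=True, stb=True, we=False, adr=adr)
    return data if ack else None


slave = WishboneSlave()
wb_write(slave, 0x04, 0x12345678)  # 0x04 is a hypothetical register address
print(hex(wb_read(slave, 0x04)))
```

A real slave may take several clock cycles before asserting ACK, and burst transfers add further signals; both are left out of this sketch.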
3.3 Vector Fonts

A vector font is a font that is described by outlines instead of discrete images. These outlines are represented by a set of points that are connected with lines or Bézier curves. A more detailed explanation of Bézier curves is given in section 5.5. The main advantage of vector fonts over bitmap fonts is that they can be scaled, rotated or otherwise transformed without any loss of detail.

What makes vector fonts difficult to render is the interaction of several outlines. One outline contained within another can signify that the shape has a hole in it. Thus, the entire glyph must be considered as a whole instead of handling the outlines one by one. In addition to this, many font formats use clever tricks such as implicit points to reduce the file size. An example of an implicit point can be found between points one and two in figure 3, where the point without a number is an implicit point.

3.3.1 TrueType fonts

The TrueType font (TTF) format is one of the most common font formats. The format stores points that can form lines or quadratic Bézier curves. The points can be either on-line points or off-line points. If two on-line points are stored after each other, a line will be drawn between them. If an on-line point is followed by an off-line point, the next point is checked. If the next point is an on-line point, a Bézier curve will be drawn between the first and last point with the middle point as a control point. However, if both the second and third points are off-line points, there exists an implicit point between them that has to be calculated with the midpoint formula:

$$x = \frac{x_0 + x_1}{2}, \qquad y = \frac{y_0 + y_1}{2}$$

The TTF format does not store any information about whether a shape is filled or whether the shape is a hole in another shape. Therefore all the shapes have to be analysed before deciding which pixels to fill. The order in which the points in a shape are defined indicates what type of shape it is.
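The point rules above can be sketched in Python as follows. The function takes one closed outline as a list of (x, y, on_line) tuples and inserts the implicit points using the midpoint formula. The tuple layout and the function name are assumptions made for this illustration and are not taken from the FreeType or ORGFX APIs.

```python
def expand_implicit_points(outline):
    """Insert an on-line midpoint between each pair of consecutive off-line points.

    outline: list of (x, y, on_line) tuples describing one closed outline.
    """
    expanded = []
    count = len(outline)
    for i, (x0, y0, on0) in enumerate(outline):
        # The outline is closed, so the last point is followed by the first.
        x1, y1, on1 = outline[(i + 1) % count]
        expanded.append((x0, y0, on0))
        if not on0 and not on1:
            # Two off-line points in a row imply an on-line point halfway between.
            expanded.append(((x0 + x1) / 2, (y0 + y1) / 2, True))
    return expanded


outline = [(0, 0, True), (4, 0, False), (6, 2, False), (4, 4, True)]
for point in expand_implicit_points(outline):
    print(point)
```

After this step every off-line point sits between two on-line points, so each one acts as the control point of exactly one quadratic Bézier curve, while two neighbouring on-line points form a straight line.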
A clockwise defined shape indicates a filled shape and a counter-clockwise defined shape indicates a hole. Figure 3 shows the order in which the two shapes in the letter "D" are defined. The outer shape is defined in clockwise order and is therefore filled, while the second shape is defined in counter-clockwise order and is therefore not filled. The TTF definition calls the rule that decides which pixels to draw the "winding rule". The winding rule states that a point is filled as long as a line from the point towards infinity does not cross an equal number of clockwise and counter-clockwise defined outlines (footnote 16).

Footnote 13: http://opencores.org/openrisc,or1200
Footnote 14: http://opencores.org/svnget,or1k?file=/trunk/or1200/doc/openrisc1200_spec.pdf
Footnote 15: http://opencores.org/opencores,wishbone

3.3.2 PostScript fonts

PostScript fonts are similar to TTF fonts, but they use cubic Bézier curves instead of quadratic Bézier curves. The current implementation of the ORGFX does not support cubic Bézier curves, and therefore there is no hardware acceleration for PostScript fonts.

3.3.3 OpenType fonts

OpenType is an extension of TrueType fonts, and an OpenType font can be stored in two modes, either as a TTF font or as a PostScript font. OpenType fonts are not supported by ORGFX because, as with PostScript fonts, they require support for cubic Bézier curves.

3.3.4 FreeType

FreeType is an open source library for handling fonts. FreeType has support for TTF, PostScript, OpenType and a wide range of other more or less common font formats. In this thesis FreeType is used to read the TTF files and extract the points of the glyphs.

3.4 Linux and free Software

According to the Free Software Foundation an application is defined as free if: The users have the freedom to run, copy, distribute, study, change and improve the software (footnote 17). When software is released to the public it often comes with a free software license that explains what you are allowed to do with the software.
There exist several free software licenses, some more restrictive than others. The most common open source license today is the GNU General Public License (GPL).

3.4.1 GPL

The GNU General Public License (GPL) is a free software license that allows for editing and redistribution, as long as the original author gets credit for his or her work and the changes to the code, as well as all new code that is integrated with the original code, are released to the community.

3.4.2 LGPL

The GNU Lesser General Public License (LGPL) is a less strict version of the GPL. It allows a project to include LGPL code without having to release the whole project under the LGPL. However, modifications to the LGPL code itself must still be released to the community. Some companies prefer the LGPL because it integrates more easily with proprietary code.

3.4.3 Linux

Linux is a free Unix-like operating system used almost everywhere, from dishwashers to supercomputers. The operating system gains popularity each year and has lately had a breakthrough on the cellphone market with Google's Android. Linux has long been popular among developers and scientists, but has only recently found its way to the common user.

3.4.4 Drivers

A driver is a piece of software that tells the operating system how to handle a physical device. In Linux there are two types of drivers: kernel space drivers and user space drivers. A kernel space driver is compiled as an extension or "module" to the Linux kernel. The driver is commonly loaded during boot, but can also be loaded on the fly at runtime. User space drivers are simpler to implement but lack the ability to utilize interrupts and other kernel features.

3.4.5 DirectFB

DirectFB is a hardware abstraction layer that allows for hardware acceleration on embedded Linux systems. DirectFB can be run on top of the standard Linux framebuffer driver and add hardware acceleration. Each feature in DirectFB has a software implementation to allow for full compatibility without hardware acceleration.
DirectFB is popular among embedded systems with limited hardware.

Footnote 16: More details on the TTF font format can be found in the specification at https://developer.apple.com/fonts/TTRefMan/
Footnote 17: http://www.gnu.org/philosophy/free-sw.html

Figure 3: A visualisation of the glyph 'D' from a TTF font. Dots are explicit on-line points, crosses are off-line points and circles are implicit on-line points.

3.4.6 X-Server

The X-server is the standard graphics manager for Linux and Unix. It provides a unified graphics API, allowing the same source code to be compiled on different computers with different hardware. It has been under active development since 1984.

3.4.7 KMS

Kernel Mode Setting (KMS) is used to set the screen resolution and color depth in kernel space. This has the benefit that the screen mode can be set during early boot, which allows graphics during boot and a smoother integration of virtual terminals. The KMS can be accessed at the same time as the DRM.

3.4.8 Direct Rendering Infrastructure

Direct Rendering Infrastructure (DRI) is a framework used by the X-Server to interface directly with graphics hardware. One of the benefits of DRI is that it allows for a faster OpenGL implementation in X.

3.4.9 Direct Rendering Manager

The Direct Rendering Manager (DRM) implements the communication interface to the hardware. It is the DRM that reads and writes registers on the graphics accelerator. The DRI uses the DRM to get access to the hardware.

4 Requirements

The goal of the ORGFX project was to develop a generic open source graphics accelerator that can be used in modern embedded systems. The accelerator should be able to provide 2D and vector graphics rendering without adding substantial load on the host CPU. It should be possible to integrate the accelerator with a standard open source CPU through a standard bus interface.
The target platform was the OpenRISC processor and the Wishbone bus, though the device was implemented to be generic enough to adapt to other bus interfaces. To meet the rendering requirements a basic feature set was constructed:

• 2D engine: Color fill (rectangle), draw line, render texture (memory copy).
• Vector engine: Quadratic Bézier curves. Filled quadratic Bézier shapes.

For the 2D engine, filling areas with color and copying memory are basic features expected of any graphics accelerator. In addition to the above, a few simple blending and clipping operations need to be supported (alpha blending for half-transparent draws and colorkeying for rendering images with transparency). The vector engine features were requested by ORSoC, and can potentially be used to render vector graphics and vector fonts.

To make the ORGFX easy to use, a stable and efficient software layer is needed. The software layer should enable both detailed control of the hardware and more complex functions that reduce the number of API calls needed to perform common tasks. To make it possible to use images and fonts, the API should either support direct loading of such files or provide conversion tools for them. The graphics accelerator has no hard real-time requirements or framerate requirements.

5 Design

The main focus of the ORGFX core was to provide a graphics interface for the OpenRISC processor, but the component itself is platform independent and can be connected to anything with a Wishbone bus interface. With minor changes the core can be adapted to another bus interface. During the development of the ORGFX a 3D engine was added.

This section outlines some design concepts common to all features, and describes the theory behind each feature in the feature set. The focus is to explain the purpose of each feature and present one or several algorithms that can be used to implement the feature. Each feature is considered individually, but parallels to other similar features are drawn when possible.
Section 6 proceeds to explain how the architecture of ORGFX is implemented to realize all these features.

5.1 Display control

The ORGFX core is not a display component. It works by interfacing with some form of video memory and writing graphics primitives to it. The core has no interface that can produce VGA, HDMI or any other form of display signals.

Figure 4: The right-handed device coordinate system. The Z-axis points into the surface, so the higher the Z-value the farther a point is from the "camera".

To display the information written by the graphics accelerator, a certified core from OpenCores called the VGA/LCD controller (footnote 18) was used (see appendix B for the specification). No changes were made to the original design of the VGA/LCD core.

A side note: hardware acceleration of multiple layers is a common technique used in early game consoles to minimize overdraw. While there is no technical limitation in the graphics accelerator that prevents several hardware layers, the design of the display controller does not allow it. With a display controller able to handle multiple layers, the ORGFX core could provide this functionality without any modifications.

5.1.1 Render target

The areas in memory that the ORGFX core can access and render pixels to are known as render targets. In the special case that the render target is the same buffer that will be drawn to screen, it is sometimes called the framebuffer. The ORGFX core supports switching back and forth between different render targets. These memory areas can be of any resolution (limited by the hardware to a maximum size of 65536 by 65536 pixels). Double buffering can be achieved by simply alternating between two framebuffers, rendering to one while showing the other on screen.

5.1.2 Device coordinate system

ORGFX uses a right-handed device coordinate system that is based on screen coordinates for simplicity. Each unit is one pixel, and the origin is placed in the top left corner of the surface.
The X-axis increases when moving to the right, and the Y-axis increases when moving down. Finally, the Z-axis points into the surface. In other words, the far end of the depth buffer is at the largest possible positive Z-value, and the point closest to the viewer is at the largest possible negative Z-value. This is visualized in figure 4.

Internally, coordinates are handled as fixed point numbers, with the default precision being a 16 bit integer part and a 16 bit fractional part. The fixed point architecture makes the implementation of the device much simpler and smaller than if floating point numbers were used, at the cost of precision.

5.1.3 Texture coordinate system

Another important coordinate system is the texture coordinate system. Textures in ORGFX are just 2D images, so there is no depth coordinate. The X and Y axes are renamed U and V (a standard graphics convention for textures), but have the same direction (the top-left corner is the origin, and one U unit is one pixel in the image). Unlike the regular coordinate system, texture coordinates are only stored as 16 bit integers, without a fractional part.

Footnote 18: http://opencores.org/project,vga_lcd

Algorithm 1 Rasterization algorithm for rectangles
    for y = p0.y to p1.y do
        for x = p0.x to p1.x do
            Put pixel (x, y)
        end for
    end for

5.2 Control interface

The ORGFX device has a set of registers that hold the device state and influence how the core operates. To keep a consistent device state even when the device is busy, register writes are stored in a circular First In First Out (FIFO) queue. Since all drawing operations most likely take at least a few clock cycles, it is important to prevent the device state from changing during an operation. By storing writes in the FIFO and only allowing the FIFO to be read from when the device is not busy, the device can be kept in a stable state.
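The queueing behaviour can be modelled in software with a small ring buffer. The sketch below is only an illustration of the idea: the depth of eight entries, the type names and the function names are assumptions, and the real queue is implemented in hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define FIFO_DEPTH 8u   /* assumed depth, chosen only for the example */

/* One queued register write: target address and value. */
typedef struct { uint32_t addr, data; } reg_write;

typedef struct {
    reg_write slots[FIFO_DEPTH];
    unsigned head;    /* index of the oldest queued write */
    unsigned count;   /* number of queued writes */
} write_fifo;

/* Queue a register write. Returns false when the FIFO is full, in which
   case the caller has to retry later. */
bool fifo_push(write_fifo *f, uint32_t addr, uint32_t data)
{
    if (f->count == FIFO_DEPTH)
        return false;
    unsigned tail = (f->head + f->count) % FIFO_DEPTH;
    f->slots[tail].addr = addr;
    f->slots[tail].data = data;
    f->count++;
    return true;
}

/* Take the oldest write out of the queue. Nothing is released while the
   device is busy, so the register state stays fixed during an operation. */
bool fifo_pop(write_fifo *f, bool device_busy, reg_write *out)
{
    if (device_busy || f->count == 0)
        return false;
    *out = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

Writes leave the queue in the order they were made, and only between operations, which is what keeps the state seen by a running operation consistent.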
To prevent this FIFO from overflowing during very long drawing operations, the software should only write to the FIFO if it is not full. The ORGFX contains a large number of registers that can be written to. Of all the registers on the ORGFX, only the control register can start drawing operations. The ORGFX is put in a busy state when drawing.

5.3 2D engine features

The 2D engine can draw various graphics primitives and perform memory copy operations from either the texture memory or the framebuffer. The basic feature set contains support for:

• 16 bit color depth mode
• Variable resolution
• Acceleration of rectangle, line and triangle raster operations
• Acceleration of memory copy operations
• Saving textures to video memory
• Clipping/Scissoring
• Alpha blending and colorkeying

All rendering operations apply to the current render target, which can either be a texture in memory or the visible screen. The graphics accelerator does not differentiate between rendering to a texture and rendering to the screen.

5.3.1 Color depth modes and variable resolution

Color depth modes and variable resolution can cause several problems. The color depth of a surface ties in closely with how the display controller interprets the data in memory. The way that pixels align to memory addresses can further complicate supporting different color depth modes. Finally, too large a resolution and color depth on the framebuffer can lead to bandwidth issues.

Internally, the device only knows of the current render target and its size (additional render targets have to be stored in software). A render target is represented as a base memory address and the width and height in pixels. The size of the render target is needed so that operations do not write outside of the current render target, and additionally so that the correct stride is applied (since surfaces are stored serially, one "row" of a surface will have a different memory offset depending on the width).
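The role of the width as a stride can be made concrete with a short address calculation. This is a sketch under assumptions: it uses the 16 bit color depth mode (two bytes per pixel), rows stored one after another without padding, and hypothetical names for the structure and function.

```c
#include <stdint.h>

#define BYTES_PER_PIXEL 2u   /* 16 bit color depth mode */

/* A render target as described above: base address plus size in pixels. */
typedef struct {
    uint32_t base;
    uint32_t width, height;
} render_target;

/* Compute the memory address of pixel (x, y). Returns 0 without touching
   *addr when the pixel lies outside the render target, otherwise stores
   the address and returns 1. */
int pixel_address(const render_target *rt, uint32_t x, uint32_t y,
                  uint32_t *addr)
{
    if (x >= rt->width || y >= rt->height)
        return 0;
    /* Each row starts width pixels after the previous one. */
    *addr = rt->base + (y * rt->width + x) * BYTES_PER_PIXEL;
    return 1;
}
```

For a 640 pixel wide target, moving one row down advances the address by 1280 bytes, while the same step in a 320 pixel wide texture advances it by 640 bytes.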
There is no real limit to the size a render target can have, other than the size of the registers holding the width and height values.

The color depth affects how the pixel data is packed in memory. Using 16 bits for each pixel gives a decent color range and is kind on memory bandwidth. A 24 bit color mode does not tile well in a 16/32 bit memory, so supporting 24 bit color depth requires additional logic for alignment and memory management. One way to implement 24 bit color mode is to use 32 bit mode and ignore the last 8 bits. This method is not supported by the display driver and is therefore not used.

5.3.2 Rectangles

Filling rectangles of pixels in video memory is accomplished by iterating over each pixel and writing it to memory. The rasterization algorithm is presented in Algorithm 1 and illustrated in figure 5.

Figure 5: Rasterization of a rectangle between p0 and p1. All pixels in the rectangle have to be traversed.

Figure 6: Image of a circle with eight octants and how octants 2 to 8 can be transformed into the first octant.

5.3.3 Lines

The ORGFX core implements a line drawing module capable of drawing a line between two arbitrary points. The current implementation is based on Bresenham's line algorithm [2]. This particular algorithm was chosen for its iterative nature, which makes it easy to implement on an FPGA. Algorithm 2 describes the flow of the algorithm. This algorithm only works for the first octant. The input is therefore transformed to the first octant, then calculated, and finally transformed back to the original octant. One example of this is when the Y axis increases faster than the X axis (second octant): the X and Y axes are then switched, the line is calculated, and the axes are switched back. The table below and figure 6 show how the different octants are transformed. See figure 7 for an example of a line drawn using the algorithm.
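The transform, calculate and transform back scheme can be sketched in C as below. The sketch is a software illustration only: the callback type, the `point_log` recorder and the function names are hypothetical, and the pixel output stands in for the hardware pipeline. The inner loop uses the same error update as the first-octant algorithm, applied to a line that has been moved to start at the origin.

```c
#include <stddef.h>

/* Hypothetical pixel sink used in place of the real "put pixel" stage. */
typedef void (*put_pixel_fn)(int x, int y, void *ctx);

#define LOG_CAPACITY 64
typedef struct {
    int x[LOG_CAPACITY], y[LOG_CAPACITY];
    size_t count;
} point_log;

/* Example sink: records every pixel so the result can be inspected. */
void log_pixel(int x, int y, void *ctx)
{
    point_log *rec = (point_log *)ctx;
    if (rec->count < LOG_CAPACITY) {
        rec->x[rec->count] = x;
        rec->y[rec->count] = y;
        rec->count++;
    }
}

void draw_line(int x0, int y0, int x1, int y1, put_pixel_fn put, void *ctx)
{
    /* Transform to the first octant: negate axes, then switch X and Y. */
    int dx = x1 - x0, dy = y1 - y0;
    int negate_x = dx < 0, negate_y = dy < 0;
    if (negate_x) dx = -dx;
    if (negate_y) dy = -dy;
    int switch_xy = dy > dx;
    if (switch_xy) { int t = dx; dx = dy; dy = t; }

    /* First-octant Bresenham from (0, 0) to (dx, dy), with 0 <= dy <= dx. */
    int error = dx - 2 * dy;
    int y = 0;
    for (int x = 0; x <= dx; x++) {
        /* Transform back: undo the switch first, then the negations. */
        int ox = switch_xy ? y : x;
        int oy = switch_xy ? x : y;
        if (negate_x) ox = -ox;
        if (negate_y) oy = -oy;
        put(x0 + ox, y0 + oy, ctx);

        if (error < 0) {
            y++;
            error += 2 * dx - 2 * dy;
        } else {
            error -= 2 * dy;
        }
    }
}
```

Because only additions, subtractions and comparisons are used, the loop body maps well to an iterative hardware implementation.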
Octant            1   2   3   4   5   6   7   8
Switch X and Y        X   X           X   X
Negate X                  X   X   X   X
Negate Y                          X   X   X   X

An alternative line drawing algorithm is presented by Rokne in [8], usually known as Xiaolin Wu's line algorithm. It speeds up the rasterization by a factor of 4 compared to Bresenham, and also allows anti-aliased lines. However, due to the structure of the pipeline, the ORGFX core would not become significantly faster unless parallel pipelines were added.

Algorithm 2 Rasterization algorithm for lines (Bresenham)
    ∆x ← p1.x − p0.x
    ∆y ← p1.y − p0.y
    ε ← ∆x − 2 ∗ ∆y
    y ← p0.y
    for x = p0.x to p1.x do
        Put pixel (x, y)
        if ε < 0 then
            y ← y + 1
            ε ← ε + 2 ∗ ∆x − 2 ∗ ∆y
        else
            ε ← ε − 2 ∗ ∆y
        end if
    end for

Figure 7: Example rasterization of a line using Bresenham.

5.3.4 Triangles

Another feature of the ORGFX graphics accelerator is triangle rendering. Two different algorithms were considered.

A triangle can be described by three lines connected by three points. The equations of the lines can be calculated from the three points. Once the line equations are known, it is possible to iterate over the pixel spans between the lines. While this algorithm (footnote 19) always iterates over the least number of pixels possible, it is not without problems. Because the algorithm uses the slope of the lines, there will be problems when the slope is very small (subpixel differences). In an early prototype of the algorithm, rendering artefacts appeared, and the algorithm was therefore discarded.

An alternative approach is presented in [6] and expanded on in [10] (pages 5-7). Whether a given pixel is inside a triangle can be determined by evaluating the position of the pixel relative to the three edges of the triangle: the pixel is inside the triangle if it lies on the inner side of all three edges.
\[ \mathit{edge}_0 = -(p_{2y} - p_{1y})(x - p_{1x}) + (p_{2x} - p_{1x})(y - p_{1y}) \]
\[ \mathit{edge}_1 = -(p_{0y} - p_{2y})(x - p_{2x}) + (p_{0x} - p_{2x})(y - p_{2y}) \]
\[ \mathit{edge}_2 = -(p_{1y} - p_{0y})(x - p_{0x}) + (p_{1x} - p_{0x})(y - p_{0y}) \]

The sign of the result denotes on which side of the edge a point is located. With the equations above, if all edge functions are positive, the pixel is fully inside the triangle (see figure 8). For the full algorithm, see Algorithm 3.

The main disadvantage of the algorithm is one of speed: it has to iterate over every pixel in a rectangle, where only some of the pixels (at most half) are actually rendered. The problem is illustrated in figure 9a. A simple speed-up can be added to the algorithm to lessen the problem somewhat (figure 9b). If the algorithm is iterating over the body of a triangle and hits a "no-draw" pixel, no more pixels will be drawn in this row, and the rest of the row can be skipped. As can be observed, this approach still adds a lot of overhead compared to the ideal case presented in the first algorithm.

With the second algorithm, barycentric coordinates (see section 5.4.2) can be calculated from the edge functions and the triangle area.

Footnote 19: http://joshbeam.com/articles/triangle_rasterization/

Figure 8: Visual representation of the triangle edge functions. The sign of the function for each pixel indicates if the pixel is inside the triangle or not. Picture from [10].

Figure 9: A demonstration of how the speed-up technique for drawing triangles leads to iterating over fewer pixels. a) All pixels in the rectangle have to be traversed. b) Using the speed-up technique, many pixels can be skipped (filled).
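The edge test and the row-skip speed-up can be sketched together in C. This is an illustrative software version, not the hardware implementation: the `point2` type, the callback type and the function names are hypothetical, integer pixel coordinates are assumed, and the triangle is assumed to be defined in the order that makes the edge functions positive on the inside.

```c
#include <stddef.h>

typedef struct { int x, y; } point2;

/* Hypothetical pixel sink; may be NULL if only the pixel count is needed. */
typedef void (*put_pixel_fn)(int x, int y, void *ctx);

int min3(int a, int b, int c) { int m = a < b ? a : b; return m < c ? m : c; }
int max3(int a, int b, int c) { int m = a > b ? a : b; return m > c ? m : c; }

/* Edge function for the edge from a to b, evaluated at (x, y). It has the
   same form as edge0, edge1 and edge2 above. */
long edge(point2 a, point2 b, int x, int y)
{
    return -(long)(b.y - a.y) * (x - a.x) + (long)(b.x - a.x) * (y - a.y);
}

/* Rasterize the triangle p0, p1, p2 and return the number of pixels drawn. */
int fill_triangle(point2 p0, point2 p1, point2 p2, put_pixel_fn put, void *ctx)
{
    int drawn = 0;
    int xmin = min3(p0.x, p1.x, p2.x), xmax = max3(p0.x, p1.x, p2.x);
    int ymin = min3(p0.y, p1.y, p2.y), ymax = max3(p0.y, p1.y, p2.y);

    for (int y = ymin; y <= ymax; y++) {
        int inside_seen = 0;
        for (int x = xmin; x <= xmax; x++) {
            long e0 = edge(p1, p2, x, y);
            long e1 = edge(p2, p0, x, y);
            long e2 = edge(p0, p1, x, y);
            if (e0 > 0 && e1 > 0 && e2 > 0) {
                if (put) put(x, y, ctx);
                drawn++;
                inside_seen = 1;
            } else if (inside_seen) {
                break;   /* left the triangle: skip the rest of this row */
            }
        }
    }
    return drawn;
}
```

The early exit is valid because a triangle is convex, so each row crosses it in at most one continuous span.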
Algorithm 3 Rasterization algorithm for triangles
    xmin ← min(p0x, p1x, p2x)
    ymin ← min(p0y, p1y, p2y)
    xmax ← max(p0x, p1x, p2x)
    ymax ← max(p0y, p1y, p2y)
    for y = ymin to ymax do
        for x = xmin to xmax do
            edge0 ← −(p2y − p1y)(x − p1x) + (p2x − p1x)(y − p1y)
            edge1 ← −(p0y − p2y)(x − p2x) + (p0x − p2x)(y − p2y)
            edge2 ← −(p1y − p0y)(x − p0x) + (p1x − p0x)(y − p0y)
            if edge0 > 0 and edge1 > 0 and edge2 > 0 then
                Put pixel (x, y)
            end if
        end for
    end for

Figure 10: 1. Texture, 2. Source, 3. Render target, 4. Clip, 5. Destination

Implementing triangles might seem out of the scope of the 2D engine at first, but by implementing triangles using barycentric coordinates, much of the groundwork for the 3D engine and the vector engine is already finished. See section 5.4.2 for further discussion.

5.3.5 Clipping

All pixels generated by the various raster operations are checked against a clipping rectangle (see number 4 in figure 10). If a pixel falls outside the clipping rectangle it is not rendered, and is discarded from the pipeline. This technique is sometimes known as scissoring, and can be enabled or disabled with a flag. Any pixels that fall outside of the active render target (see number 3 in figure 10) are always discarded, regardless of whether clipping is enabled or not. This prevents drawing operations on one buffer from spilling over into another buffer.

5.3.6 Coloring

Once a pixel coordinate has been generated (by a rectangle, line or triangle draw operation), the next step is to decide what color the pixel should have. There are several possible ways to do this:

• Use a flat color for the entire shape.
• Generate a color based on a gradient. This is expanded on in section 5.4.2.
• Fetch a color from texture memory.

The last technique is sometimes referred to as bit block transfer, or blitting.
When this technique is applied to an entire rectangle, it can be used to copy an image from one place to another in memory. In practice, one would store images somewhere in memory, then fetch them and draw them to the render target as needed. This covers the Acceleration of memory copy operations feature.

A sprite can be loaded into the texture memory by the CPU. When the sprite is in the texture memory, the ORGFX can draw it to the active render target by copying it pixel by pixel in hardware. This approach requires less CPU time compared to drawing the sprite pixel by pixel every frame. This covers the Saving textures to video memory feature. A comparison of the three rendering modes is shown in figure 11.

Figure 11: Three different color rendering modes: a) Flat. b) Interpolated gradient. c) Textured.

Figure 12: The same image rendered without colorkeying and with colorkeying. Both images are rendered against a white background.

5.3.7 Color keying

The term color keying refers to rendering images with transparent patches to screen, as many 2D video games do. This technique consists of picking a specific color in the image to be the color key. Whenever a pixel of this color is encountered it is considered to be fully transparent and is discarded. For an example, see figure 12. This method only applies to operations using the textured coloring method described in the previous section.

5.3.8 Alpha blending

A more complex form of transparency can be achieved through alpha blending [9]. By providing an alpha value between zero and one, the active pixel can be drawn as fully transparent, fully opaque or something in between. In practice, this is achieved by sampling the background color from the target pixel and mixing this with the pixel to be drawn:

\[ \mathit{alpha} = \mathit{alpha}_{global} \cdot \mathit{alpha}_{pixel} \]
\[ \mathit{color}_{out} = \mathit{color}_{in} \cdot \mathit{alpha} + \mathit{color}_{target} \cdot (1 - \mathit{alpha}) \]

where alpha is a value between 0 (transparent) and 1 (opaque).
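As a worked example, the blend can be written in integer arithmetic for a 16 bit pixel. The sketch below makes two assumptions that are not stated above: the pixel layout is RGB565 (5 bits red, 6 bits green, 5 bits blue), and alpha values are stored as 8 bit integers where 255 represents 1.0. The function names are hypothetical.

```c
#include <stdint.h>

/* Blend one color channel: in * alpha + target * (1 - alpha), with alpha
   given as 0..255 and the result rounded to the nearest integer. */
unsigned blend_channel(unsigned in, unsigned target, unsigned alpha)
{
    return (in * alpha + target * (255u - alpha) + 127u) / 255u;
}

/* Blend an incoming RGB565 pixel over the pixel already in the target. */
uint16_t alpha_blend_rgb565(uint16_t in, uint16_t target,
                            unsigned alpha_global, unsigned alpha_pixel)
{
    /* alpha = alpha_global * alpha_pixel, rescaled back to 0..255 */
    unsigned alpha = (alpha_global * alpha_pixel + 127u) / 255u;

    unsigned r = blend_channel((in >> 11) & 0x1fu, (target >> 11) & 0x1fu, alpha);
    unsigned g = blend_channel((in >> 5) & 0x3fu, (target >> 5) & 0x3fu, alpha);
    unsigned b = blend_channel(in & 0x1fu, target & 0x1fu, alpha);

    return (uint16_t)((r << 11) | (g << 5) | b);
}
```

With both alpha values at 255 the incoming pixel is returned unchanged, and with a global alpha of 0 the target pixel is returned unchanged, matching the two extremes of the formula.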
If alpha blending is disabled the pixel is passed on unmodified. The alpha value can be interpolated over a triangle to create gradients (see section 5.4.2). If this function is turned off (interpolation is disabled on triangle draws) then the per-pixel alpha is set to 1. The global alpha parameter is a separate value that sets the overall alpha of an entire drawing primitive and is applied to all pixels if blending is enabled. The interpolated alpha only applies to triangle and curve renders. For an example of the result of an alpha blending operation, see figure 13.

Figure 13: The same image rendered with different global alpha values (from left to right: alpha = 100%, alpha = 70%, alpha = 30%). The interaction with the background text shows how the alpha settings change the blending. The image is also colorkeyed.

5.4 3D engine features

The 3D engine in the ORGFX is designed to support the following features:

• Hardware vector transformations
• Interpolation
• Depth buffer culling

These features are discussed in detail in this section.

5.4.1 Transformations

When working with large 3D objects built from a set of points, a common operation is to apply a matrix multiplication to all the points, creating a common transformation. The equation in its simplest form is as follows:

\[ \mathit{point}_{out} = \mathit{Transformation} \cdot \mathit{point}_{in} \]

This can for example represent how an object is rotated, by applying a simple rotation kernel:

\[ \begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = \begin{bmatrix} \cos(\alpha) & -\sin(\alpha) & 0 \\ \sin(\alpha) & \cos(\alpha) & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} \]

This transformation rotates the input point by α around the Z-axis. By extending the 3x3 transformation matrix to a 3x4 matrix, it is possible to not only rotate, but also translate a point in the same step. Expanded, the generic calculation looks like this:

\[ \begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = \begin{bmatrix} aa & ab & ac & tx \\ ba & bb & bc & ty \\ ca & cb & cc & tz \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \]

The components aa through cc describe the combined scaling and rotation, while the vector (tx, ty, tz) describes the translation.
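A software version of the 3x4 transformation can be sketched with the 16.16 fixed point format used for device coordinates. The type names, macros and function names below are hypothetical, and the sketch evaluates the products one after another, whereas hardware could compute them in parallel.

```c
#include <stdint.h>

typedef int32_t fix16;                        /* 16.16 fixed point value */
#define FIX_ONE ((fix16)65536)                /* 1.0 in 16.16 */
#define INT_TO_FIX(n) ((fix16)((n) * 65536))

/* Multiply two 16.16 numbers using a 64 bit intermediate product. */
fix16 fix_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * b) / 65536);
}

typedef struct { fix16 x, y, z; } vec3;

/* Rows are (aa ab ac tx), (ba bb bc ty), (ca cb cc tz). */
typedef struct { fix16 m[3][4]; } mat3x4;

/* Apply rotation/scaling and translation to one point in a single step. */
vec3 transform_point(const mat3x4 *t, vec3 p)
{
    vec3 out;
    out.x = fix_mul(t->m[0][0], p.x) + fix_mul(t->m[0][1], p.y)
          + fix_mul(t->m[0][2], p.z) + t->m[0][3];
    out.y = fix_mul(t->m[1][0], p.x) + fix_mul(t->m[1][1], p.y)
          + fix_mul(t->m[1][2], p.z) + t->m[1][3];
    out.z = fix_mul(t->m[2][0], p.x) + fix_mul(t->m[2][1], p.y)
          + fix_mul(t->m[2][2], p.z) + t->m[2][3];
    return out;
}
```

For example, a matrix with rows (0, -1, 0, 10), (1, 0, 0, 0) and (0, 0, 1, 0) rotates a point 90 degrees around the Z-axis and then moves it 10 units along X, so the point (1, 2, 3) ends up at (8, 1, 3).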
Expanding the expression gives the following equations:

\[ \begin{aligned} x' &= aa \cdot x + ab \cdot y + ac \cdot z + tx \\ y' &= ba \cdot x + bb \cdot y + bc \cdot z + ty \\ z' &= ca \cdot x + cb \cdot y + cc \cdot z + tz \end{aligned} \]

At this point it is a good idea to step back and consider the cost. For each point in a 3D model (which can easily contain thousands of points), the same set of multiplications and additions has to be performed. This common operation is a severe load on the CPU, so a large leap in performance can be gained by moving it to hardware. Additionally, in hardware the parallel nature of the FPGA can be used to perform the entire transformation in a fraction of the clock cycles needed by the CPU.

5.4.2 Interpolation

This section expands on the triangle drawing theory from section 5.3.4. The ORGFX is designed to have hardware accelerated bilinear interpolation of triangles. This is achieved by calculating the barycentric coordinates [6][10] of each pixel rendered. The barycentric coordinates indicate how close each pixel is to each corner of the triangle. Each factor is in the range between 0 and 1, and the sum of all three factors is always 1.

Once the barycentric coordinates have been calculated, they can be used to interpolate many different variables for the triangle. The one most interesting for the 3D engine is interpolated depth. The user sets the depth value for each corner of the triangle, and the factors are used to get a smooth interpolation of the depth value over the entire triangle.
Recall the edge calculations described in the triangle rasterization algorithm:

\[ e_0(x, y) = -(p_{2y} - p_{1y})(x - p_{1x}) + (p_{2x} - p_{1x})(y - p_{1y}) \]
\[ e_1(x, y) = -(p_{0y} - p_{2y})(x - p_{2x}) + (p_{0x} - p_{2x})(y - p_{2y}) \]
\[ e_2(x, y) = -(p_{1y} - p_{0y})(x - p_{0x}) + (p_{1x} - p_{0x})(y - p_{0y}) \]

Additionally, the signed area of the triangle is needed:

\[ A_\Delta = \frac{1}{2} \left( (p_{1x} - p_{0x})(p_{2y} - p_{0y}) - (p_{2x} - p_{0x})(p_{1y} - p_{0y}) \right) \]

As described in the papers mentioned, the barycentric coordinate factors can be calculated with the formula below:

\[ \mathit{factor}_0 = \frac{e_0(x, y)}{2 A_\Delta}, \qquad \mathit{factor}_1 = \frac{e_1(x, y)}{2 A_\Delta}, \qquad \mathit{factor}_2 = \frac{e_2(x, y)}{2 A_\Delta} \]

Since \( \mathit{factor}_0 + \mathit{factor}_1 + \mathit{factor}_2 = 1 \), one of the divisions can be omitted:

\[ \mathit{factor}_2 = 1 - \mathit{factor}_0 - \mathit{factor}_1 \]

Using these factors it is simple to interpolate many different variables for a given pixel in a triangle using the formula below:

\[ z' = \mathit{factor}_0 \cdot z_0 + \mathit{factor}_1 \cdot z_1 + \mathit{factor}_2 \cdot z_2 \]

In this example the depth value at a given pixel in the triangle is interpolated from the depth at each control point and the calculated factors. The same calculation can be used to find the interpolated texture coordinates, alpha value and color of a pixel. For an example of the interpolation technique in action, see figure 11b (interpolated colors) and 11c (interpolated texture coordinates).

5.4.3 Z-buffer culling

When drawing shapes with different depths in a 3D environment, the order in which objects are drawn becomes important. A shape that is "farther away" from the viewer than a shape already drawn to screen may end up overwriting the first shape. To prevent this behaviour, a separate buffer containing depth values is held in memory. Whenever a pixel is being drawn to the render target, the depth value of the pixel is compared to the current depth at that point. If the depth is less than the current value, the depth is updated and the pixel is rendered. If the depth being rendered is greater than the current value (the pixel is behind an object in the scene) the pixel is discarded.
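The interpolation and the depth test can be combined in a short C sketch. It uses floating point for readability, whereas the device works in fixed point; the type and function names are hypothetical, degenerate triangles (zero area) are assumed to be rejected by the caller, and a pixel with a depth equal to the stored value is discarded, which is an assumption since that case is not specified above.

```c
#include <stddef.h>

typedef struct { int x, y; } point2;

/* Edge function for the edge from a to b, same form as e0, e1 and e2. */
double edge(point2 a, point2 b, int x, int y)
{
    return -(double)(b.y - a.y) * (x - a.x) + (double)(b.x - a.x) * (y - a.y);
}

/* Interpolate a per-corner value (depth, alpha, a color channel, ...) at
   pixel (x, y) using the barycentric factors. */
double interpolate(point2 p0, point2 p1, point2 p2,
                   double v0, double v1, double v2, int x, int y)
{
    /* Twice the signed area, i.e. the denominator 2 * A of the factors. */
    double area2 = (double)(p1.x - p0.x) * (p2.y - p0.y)
                 - (double)(p2.x - p0.x) * (p1.y - p0.y);
    double f0 = edge(p1, p2, x, y) / area2;
    double f1 = edge(p2, p0, x, y) / area2;
    double f2 = 1.0 - f0 - f1;            /* third division omitted */
    return f0 * v0 + f1 * v1 + f2 * v2;
}

/* Depth test against a width * height buffer of stored depth values.
   Returns 1 if the pixel should be rendered (and updates the buffer),
   0 if it should be discarded. */
int depth_test(double *zbuf, int width, int height, int x, int y, double z)
{
    if (x < 0 || y < 0 || x >= width || y >= height)
        return 0;
    double *stored = &zbuf[(size_t)y * (size_t)width + (size_t)x];
    if (z < *stored) {
        *stored = z;
        return 1;
    }
    return 0;
}
```

In a rasterizer, `interpolate` would be called with the three corner depths for every pixel that passes the edge test, and the result passed to `depth_test` before the pixel is written.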
This feature is vital when rendering any form of complex 3D graphics scene. For an example of why the depth buffer is needed, see figure 14.

Figure 14: A camera is looking at two objects: a box and a person behind the box. The box (1) is rendered first, and the person (2) is rendered second. Without Z-buffer culling, the result is as seen on the left (the person appears to be in front of the box). With the correct culling active, the parts of the person behind the box fail the depth test and are discarded. In the right image, the person appears to be behind the box, even though the box was rendered first.

5.5 Vector engine features

The vector engine is designed to perform rasterization of filled quadratic Bézier shapes. This section discusses what features the vector engine needs in order to rasterize those shapes.

5.5.1 Path theory

The main advantage of vector graphics is that objects can be rendered with infinite detail. Instead of storing an image as an array of pixels, shapes are described using parametric curves. The theory of these parametric curves was developed in 1959 by Paul de Casteljau, and later popularized and patented by Pierre Bézier. The main use for the curves was to describe hulls of cars in CAD programs. Their use has expanded considerably since then, and today Bézier curves are also used to describe scale-invariant fonts and vector graphics. A few common vector graphics formats include PostScript, PDF, Flash and SVG. The most widely used format for vector fonts is TrueType fonts (TTF). Bézier curves are also used to describe interpolated change, for example when describing animations.

Because Bézier curves are described as a series of points, it is possible to perform transformations such as rotations and scaling before the curve is rasterized, without any loss of detail.
The formulas for linear, quadratic and cubic Bézier curves are presented below, where $t \in [0, 1]$:

Linear: $B_{P_0,P_1}(t) = (1 - t)P_0 + tP_1$
Quadratic: $B_{P_0,P_1,P_2}(t) = (1 - t)B_{P_0,P_1}(t) + tB_{P_1,P_2}(t)$
Cubic: $B_{P_0,P_1,P_2,P_3}(t) = (1 - t)B_{P_0,P_1,P_2}(t) + tB_{P_1,P_2,P_3}(t)$

The same recursive pattern can be expanded further to get Bézier curves of any higher degree. For some example Bézier curves, see figure 15.

Figure 15: The top image shows a quadratic Bézier curve starting at p0 and ending at p2, where the curvature is adjusted by p1. The bottom image shows a cubic Bézier curve, starting at p0 and ending at p3, where the curvature is adjusted by p1 and p2.

One notable disadvantage of Bézier curves is their inability to describe a perfect circle or a circle arc. Because of this, most systems capable of drawing vector graphics with Bézier curves have a special case for drawing circular shapes. This feature was not considered for the ORGFX graphics accelerator because of time constraints. The Bézier curve formula can be extended to describe surfaces instead of curves, allowing for scale-invariant three-dimensional shapes.

5.5.2 Shape implementation

ORGFX only supports one particular case of Bézier curves: filled quadratic Bézier shapes. This feature is enough to describe all quadratic and cubic vector fonts, with the correct preparations. Quadratic Bézier curves are parametric curves, and such a curve can also be described as a second degree implicit curve (often referred to as a conic section). Quote from a paper by C.Loop[5]:

Claim: Any rational quadratic parametric curve has an implicit form that is a projected image of the algebraic curve $f(u, v) = u^2 - v$

The mathematical proof for this claim is outside of the scope of this thesis, but can be found in the paper. What it means is that by interpolating the coordinates u and v over a rasterized triangle, the values can be tested against the formula. If $f(u, v) < 0$ then the pixel is inside the curve, otherwise it is outside[5]. See figure 16 for an example of this. Note that the use of the term texture space refers to the way that Loop et al implement this rendering technique using texture coordinates on a programmable GPU, and is not connected to the use of textures or texture coordinates in this thesis.

Figure 16: The canonical quadratic curve element (left), a triangle formed by the control points of a quadratic Bézier curve (right). Image from C.Loop[5].

This approach to shape rendering can easily be implemented on top of the interpolation module previously described in section 5.4.2. For examples of filled shapes, see figure 17.

5.5.3 Alternative approaches

The first attempt at implementing Bézier curves consisted of making a parallel implementation of de Casteljau's algorithm. It was fairly easy to find the correct coordinates of any point on the Bézier curve in linear time. The difficult part was to find the correct step size of the interpolation variable. In fact, depending on the arrangement of the control points, it is entirely possible that the "correct" step size is not constant over the curve. Tests and experiments with the algorithm show that if the step size is too small there will be significant overdraw, which leads to significant rendering artefacts when alpha blending is enabled. If the step size is too big, the Bézier curve will have gaps in it. The gaps can be reduced somewhat by either drawing lines between the calculated points (for a Bézier curve) or by filling triangles (for a Bézier shape). The problem is that this still leads to jagged shapes. Because of these accuracy problems, this approach to drawing Bézier shapes was dropped in favour of the method described by Loop[5].

Figure 17: Above are two different ways to render the same Bézier shape within the bounds of the triangle defined by the three control points.
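The inside/outside test from section 5.5.2 can be illustrated with a small software sketch in C. It is not taken from the ORGFX sources: the type point2 and the functions edge() and bezier_inside() are hypothetical names, floating point is used instead of the fixed point arithmetic of the hardware, and the triangle winding is ignored. The sketch computes Barycentric weights for a point, interpolates the corner values (0, 0), (1/2, 0) and (1, 1) from figure 16, and evaluates $f(u, v) = u^2 - v$.

```c
#include <assert.h>

/* Hypothetical helper type, not part of the ORGFX driver. */
typedef struct { float x, y; } point2;

/* Edge function: twice the signed area of the triangle (a, b, p). */
static float edge(point2 a, point2 b, point2 p)
{
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

/*
 * Returns 1 if p lies inside the control triangle (p0, p1, p2) and on
 * the f(u, v) < 0 side of the quadratic curve, otherwise 0.
 * p0 and p2 are the end points, p1 is the control point.
 */
int bezier_inside(point2 p0, point2 p1, point2 p2, point2 p)
{
    float area2 = edge(p0, p1, p2);
    float w0, w1, w2, u, v;

    assert(area2 != 0.0f);             /* degenerate triangles are not handled */

    /* Barycentric weights of p with respect to p0, p1 and p2. */
    w0 = edge(p1, p2, p) / area2;
    w1 = edge(p2, p0, p) / area2;
    w2 = 1.0f - w0 - w1;
    if (w0 < 0.0f || w1 < 0.0f || w2 < 0.0f)
        return 0;                      /* outside the triangle */

    /* Interpolate the corner values (0,0), (1/2,0) and (1,1). */
    u = w1 * 0.5f + w2 * 1.0f;
    v = w2 * 1.0f;

    return (u * u - v) < 0.0f;
}
```

For the control points (0, 0), (2, 0) and (4, 4) the curve is y = x²/4, so the point (2, 1.5) passes the test while (2, 0.5), which lies inside the triangle but on the other side of the curve, does not.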
5.6 Software

In addition to the hardware design of the device, a functional software layer is needed to properly interact with the device. This section explains the basic design of how the software communicates with the device, as well as the data structures used to abstract some graphics operations. For the example implementation, the software runs on a 32-bit OpenRISC processor with no operating system.

5.6.1 Bus interface

The example implementation of the software assumes that the device is connected to the CPU data bus, and thus can be accessed by writing to and reading from specific memory addresses. The data bus is shared with many other devices, so the software layer must know the base address of the device, in addition to the address offset of the specific register to be accessed. Below is an example of how this can be implemented in C using defines for the specific addresses and a macro for mapping memory. After these declarations follows example usage of how to write a value to a register and read from a register:

    #define GFX_BASEADDR 0xB8000000 /* Bus Address to GFX */
    #define GFX_STATUS   (GFX_BASEADDR + 0x04)
    #define GFX_COLOR0   (GFX_BASEADDR + 0x84)
    #define REG32(add)   *((volatile unsigned int *)(add))
    ...
    REG32(GFX_COLOR0) = 0xf800;
    status = REG32(GFX_STATUS);

For a full list of registers and their addresses, refer to the ORGFX device specifications in appendix A. For consistency, all registers defined in software have the same name as their hardware counterpart.

5.6.2 Surfaces

Since the hardware itself only knows the address and size of the current render target and active texture, the software must keep track of many such surface objects to be able to switch between them freely. The bare minimum information needed for this is the base address of the surface and the width and height in pixels.
With the following structure, these parameters can be stored together:

    struct orgfx_surface
    {
        unsigned int addr;
        unsigned int w;
        unsigned int h;
    };

By passing this structure to a bind function, the correct values can be loaded to the hardware. By design decision, it is up to the user to manage the surface structure.

5.6.3 Meshes

A mesh is nothing more than a collection of triangles drawn around the same origin point. Each triangle can be thought of as a face that contains three vertices and three texture coordinates. It is relatively common that those coordinates are shared by other faces too, so it is possible to save a lot of space by just storing the indices that each face uses.

    typedef struct orgfx_point2
    {
        float x, y;
    } orgfx_point2;

    typedef struct orgfx_point3
    {
        float x, y, z;
    } orgfx_point3;

    typedef struct orgfx_face
    {
        unsigned int p1, p2, p3;
        unsigned int uv1, uv2, uv3;
        unsigned int color1, color2, color3;
    } orgfx_face;

    typedef struct orgfx_mesh
    {
        unsigned int numVerts;
        orgfx_point3 *verts;
        unsigned int numUvs;
        orgfx_point2 *uvs;
        unsigned int numFaces;
        orgfx_face *faces;
    } orgfx_mesh;

5.6.4 Fonts

Vector fonts can be described as a set of glyphs, each consisting of a number of Bézier shapes that form the curved exterior and a number of triangles that fill the interior of the shape. This way of describing vector fonts is designed with the implementation of hardware Bézier shapes in mind (see section 5.5.2). Each Bézier shape representation needs three 2D points describing the shape, as well as a flag indicating if the shape should be filled as inside or outside (see figure 17).
    typedef struct Bezier_write
    {
        orgfx_point2 start;
        orgfx_point2 control;
        orgfx_point2 end;
        int fillInside;
    } Bezier_write;

    typedef struct Triangle_write
    {
        orgfx_point2 p0;
        orgfx_point2 p1;
        orgfx_point2 p2;
    } Triangle_write;

    typedef struct Glyph
    {
        int advance_x;
        int index;
        int bezier_n_writes;
        Bezier_write *bezier;
        int triangle_n_writes;
        Triangle_write *triangle;
    } Glyph;

    typedef struct orgfx_vector_font
    {
        int index_list_size;
        Glyph **index_list;
        int size;
        Glyph *glyph;
    } orgfx_vector_font;

To be able to support Unicode fonts the software layer uses wide character strings. This makes it possible to write strings that contain letters not included in the basic 128-character ASCII set. This includes characters such as åäö, and other alphabets such as the Arabic, the Cyrillic and the Chinese character sets. Below is a piece of example code in C that shows how wide character strings can be used. Note that constant wide strings must be prefaced with a capital L.

    #include <wchar.h>
    wchar_t wide_string[] = L"This is a wide string";

6 HDL implementation

The hardware implementation of the algorithms from the previous section is presented here. Before diving into the architecture of the ORGFX device, the development board used for the implementation and several important IP cores used are presented.

6.1 Development board

A Digilent ATLYS development board (see figure 18) was used during this thesis. The ATLYS board has a Xilinx Spartan 6 FPGA and 1 Gbit of DDR2 SDRAM. The board has four HDMI ports, two USB ports, Ethernet and audio connectors, some push buttons, several LEDs and switches, as well as a GPIO port. For more information about the board see the Atlys™ Board Reference Manual (footnote 20).
For the purpose of the ORGFX implementation, the only components actually needed on the board are the FPGA, the memory and the HDMI connector (including the surrounding integrated circuit logic). All the modules on the FPGA run on a 50 MHz clock.

6.1.1 Video RAM

There is only one larger RAM chip on the Atlys board (128 MB in size), so the RAM is shared between the CPU and the graphics accelerator. The graphics accelerator can easily switch to using a different memory because of the generic wishbone interface. A dedicated graphics memory may allow for larger resolution and better performance.

6.1.2 Display core

The display driver used in this project is the Enhanced VGA/LCD controller (footnote 21). This component is connected to the system with a Wishbone revB.3 data bus and is widely used in other projects (for example, it is the main display core used in ORPSoCv2). The specification for this core is provided in Appendix B.

6.1.3 HDMI converter

Since the display controller core generates VGA signals, some modifications have to be made before the signal can be forwarded to the HDMI port. The VGA signal passes through another core that interfaces directly with an HDMI converter chip present on the Atlys board.

6.2 Architecture

The ORSoC Graphics Accelerator core is designed to reduce CPU load by undertaking expensive graphical operations. The core has a pipeline structure so that it performs several pixel operations in series in an efficient manner. For some simpler operations, some steps in the pipeline are skipped completely for a shorter operation latency (for example, the blending step is not needed if the rendered pixel has no transparency). See figure 19 for an overview of the various submodules.

Footnote 20: http://www.digilentinc.com/Data/Products/ATLYS/Atlys_rm.pdf
Footnote 21: http://opencores.org/project,vga_lcd

Figure 18: Picture of the Digilent ATLYS development board.
While all operations use the same pipeline (see section 6.2.4), several steps are skipped or simplified when only doing 2D operations. This modular pipeline architecture was chosen for several reasons:

• Several actions (mostly operations on individual pixels) can be queued, trading low latency for a higher throughput.
• Several similar operations can be combined into one module and modified through flags. This can reduce the size of the final core since logic can be reused.
• It is easy to add new pipeline stages that do additional operations. For example, a stage for tessellation, or a stage for per-pixel lighting.
• With a solid coordination and buffering mechanism, parts of the pipeline can be parallelized for highly improved performance (footnote 22).
• Each module can be developed, simulated and verified individually, making it easier to localize bugs in the system.

Footnote 22: The bottleneck in such a system will most likely be the bandwidth of the bus accessing the memory.

6.2.1 OpenRISC CPU

The reference implementation makes use of an OpenRISC soft processor running at 50 MHz. All of the FPGA cores are connected to the CPU through a 32-bit wishbone bus interface. The CPU controls the ORGFX component by setting registers through the bus. For more information about the software running on the CPU, see section 7.

6.2.2 System-on-Chip

The ORGFX core was verified by being integrated in an ORPSoCv2 system. The main OpenRISC processor communicates with the accelerator through the wishbone interconnect. The ORPSoCv2 design contains a memory controller, the display core and the HDMI adapter core. In addition to this the SoC provides many debug interfaces such as Ethernet, JTAG UART and a USB controller.

Figure 19: Picture showing an overview of the ORGFX pipeline. The bold downwards arrows represent the main flow of data "downstream". Acknowledgement signals are sent back "upstream". The wishbone reader and wishbone writer interfaces are connected to video memory through a wishbone connection.

6.2.3 Wishbone interfaces

The main control interface of the ORGFX core is a 32-bit wishbone slave. In the reference implementation, this bus interface is connected to the data bus of the OpenRISC processor, allowing the CPU read and write access to the device's registers. All accelerated operations are initiated by writing to certain bits in the main control register on the ORGFX device. The core has two wishbone master interfaces, one that can initiate reads from memory and one that can initiate writes to memory. The two were kept separate to keep the internal wishbone logic simple. Both the wishbone revB.3 and the newer revB.4 specifications define burst read and write modes, to decrease the overhead of reading and writing larger blocks of information. None of these modes are used in ORGFX due to limited time for implementation, but the feature is a possible source of optimization.

6.2.4 Pipeline

The ORGFX core uses a pipelined architecture to speed up operation. An overview of the pipeline can be seen in figure 19. Each module in the pipeline communicates with acknowledge and write signals. A module will not assert write to the next module unless it receives an acknowledgement first (or if the module was previously in a ready state, in which case the downstream pipeline is empty). All acknowledgement and write signals are always exactly one clock tick long, to prevent triggering multiple instances of the same instruction. Each module in the pipeline may hold the upstream pipeline for several clock ticks. For example, the rasterizer will prevent incoming raster instructions until all the pixels for the current operation are generated. When the rasterizer is ready for new data, it will send an acknowledgement upstream.
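The write/acknowledge handshake described above can be mimicked with a small software model in C. This is only a sketch under simplifying assumptions and not a description of the actual HDL: a single stage with a constant, hypothetical latency is modelled, and the names stage, stage_tick() and simulate() are made up for the illustration. The upstream side only raises write when the stage started out ready or has sent an acknowledgement, and both signals last exactly one tick.

```c
#include <assert.h>

/* Hypothetical software model of one pipeline stage, not ORGFX HDL. */
typedef struct {
    int ready;      /* 1 while the stage can accept a new item */
    int remaining;  /* clock ticks left for the item in progress */
    int latency;    /* ticks needed per item (assumed constant) */
} stage;

/*
 * Advance the stage by one clock tick. write_in is the one-tick write
 * pulse from upstream. The return value is the acknowledge pulse, which
 * is high only on the tick where the stage finishes its current item.
 */
int stage_tick(stage *s, int write_in)
{
    if (s->remaining > 0) {
        s->remaining--;
        if (s->remaining == 0) {
            s->ready = 1;
            return 1;
        }
        return 0;
    }
    if (write_in && s->ready) {
        s->ready = 0;
        s->remaining = s->latency;
    }
    return 0;
}

/* Push a number of items through one stage and count the ticks used. */
int simulate(int items, int latency)
{
    stage s = { 1, 0, latency };
    int can_write = 1;   /* the downstream stage starts out ready */
    int sent = 0, done = 0, ticks = 0;

    assert(items >= 0 && latency > 0);
    while (done < items) {
        /* Upstream only writes after an acknowledgement (or at start). */
        int write = (sent < items) && can_write;
        if (write) {
            sent++;
            can_write = 0;
        }
        if (stage_tick(&s, write)) {
            done++;
            can_write = 1;
        }
        ticks++;
    }
    return ticks;
}
```

In this model every item costs one tick for the write pulse plus the stage latency, so simulate(2, 3) needs 8 ticks. A stage with a long latency, such as a rasterizer generating many pixels, holds the upstream side for correspondingly longer.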
To keep a consistent device state, once the pipeline is in operation all wishbone writes to the device are queued up in a FIFO until the current operation is complete. Variables that are unique to the current pixel are buffered at each step of the pipeline, while variables that are constant over one operation (such as the currently active texture) are stored in global registers accessible by every pipeline stage that needs them.

6.2.5 Transformation processor

As can be seen in the design of the 2D raster features (rectangle, line and triangle), all of the features operate on points. These points can be transformed to mimic exploring a 3D space, projected on a 2D canvas. The transformation processor is designed to handle translation, scaling and rotation of the control points used by the raster operations. It is implemented as a single matrix multiplication, where the matrix can be loaded to the device through twelve registers. As can be seen in the pipeline overview, this module is not actually part of the main pipeline, but it provides input to the rasterizer. Every point rendered will be affected by this transformation if the transformation processor is currently active. It is possible to disable it to draw 2D shapes (in this case, the provided points are forwarded instead of transformed).

6.2.6 Rasterizer

The rasterizer module initiates the rendering of rectangle, line and triangle primitives. When it receives a command to start an operation, it follows the algorithms described in the design section to generate pixels one by one. The module will hold the upstream pipeline until the entire shape has been rendered (every generated pixel has been acknowledged). The rasterizer has two submodules to handle the more complex rendering processes: one for Bresenham lines and one for triangles. The module itself has a state machine controlling its behaviour, as can be seen in figure 20.
Starting in the Wait state, the module moves to one of the other states once a signal to start an operation arrives. The Line and Rect states are very straightforward, and simply generate pixels until the operation is finished, then return to the Wait state. The triangle rendering is slightly more complex, going through a preparation state (Triangle Prep) and alternating between the Triangle and Triangle Write states to generate pixels. This is because unlike the line or rectangle operation, the algorithm has to examine the generated pixel to see if it is actually inside the triangle, or if it should be discarded. It should be noted here that the output of the rasterizer can go either to the interpolation pipeline or directly to the clipping module. Which path the pixel takes depends on whether any interpolation operation is active. This includes:

• Gradient coloring of triangles
• Textured triangles
• Triangles with depth coordinates
• Interpolated alpha

In other words, everything listed in section 5.4.2.

Figure 20: Picture of the rasterizer state machine

6.2.7 Interpolation

The division and interpolation modules form a separate pipeline that can be skipped entirely for simple rendering operations such as rectangles. Interpolated variables are only supported for triangle rendering (footnote 23). As mentioned in the design section, the formula to calculate the Barycentric coordinates of each triangle corner that will be used for interpolation is as follows:

$factor_0 = \frac{e_0(x, y)}{2A_\Delta}$
$factor_1 = \frac{e_1(x, y)}{2A_\Delta}$
$factor_2 = 1 - factor_0 - factor_1$

Both the edge functions and the triangle area are calculated in the triangle rasterizer. The hardware division is implemented as two pipelined division modules, one for $factor_0$ and one for $factor_1$. In the interpolation module (footnote 24), $factor_2$ is calculated, and all three factors are used to calculate the depth, alpha, texture coordinate and color of the point (not all of these values have to be used).
The values are calculated by multiplying the supplied base values of each corner point with the associated factor:

$z' = factor_0 \cdot z_0 + factor_1 \cdot z_1 + factor_2 \cdot z_2$

The calculations for the other values are similar (texture coordinates are two calculations, one for u and one for v).

6.2.8 Clipping

As mentioned before, the clipping module can take input either directly from the rasterizer, or from the interpolation pipeline. Three forms of clipping/culling are performed in the clipping module:

• Clipping against the target size: Any attempted pixel draws that fall outside of the target are discarded. This operation is always performed.
• Clipping against the clip rect: An arbitrary clipping rectangle can be set. Any pixel falling outside of it will be discarded. This clipping operation can be turned on and off by setting a flag in the control register.
• Depth buffer culling: The z-value of the pixel drawn is compared to the z-value at the target pixel. If the depth value of the pixel is lower (farther away) than the target, the pixel is discarded. This operation requires that a depth buffer is bound, and that the z-buffer is enabled by setting a flag in the control register.

When depth buffer culling is activated, the depth buffer has to be accessed. The clipping module does this by calling the wishbone reader interface through an arbiter (since only one of the three modules connected to the reader can access it at any given moment, see figure 19). Much like the current render target, the depth buffer is represented by a base address and a width and height. The depth buffer represents the depth of each pixel with a 16 bit value (footnote 25). Any time at least one of the conditions for clipping is met, the pixel is discarded and the module immediately sends an acknowledgement upstream. If none of the enabled clipping conditions are met, the pixel is passed on to the fragment processor for coloring.
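The three clipping conditions above can be summarised in a short C sketch. It is a software illustration only, under several assumptions that are not stated in the text: the names clip_state and pixel_survives() are hypothetical, the clip rectangle is taken to include its upper left corner and exclude its lower right corner, depth values are treated as unsigned 16 bit numbers, and the depth buffer is read directly from an array instead of through the wishbone reader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software view of the state used by the clipping module. */
typedef struct {
    int width, height;        /* render target size in pixels */
    int clip_enabled;         /* flag in the control register */
    int clip_x0, clip_y0;     /* clip rect corner, inclusive (assumed) */
    int clip_x1, clip_y1;     /* clip rect corner, exclusive (assumed) */
    int zbuffer_enabled;      /* flag in the control register */
    const uint16_t *zbuffer;  /* width * height depth values, or NULL */
} clip_state;

/* Returns 1 if the pixel is passed on downstream, 0 if it is discarded. */
int pixel_survives(const clip_state *s, int x, int y, uint16_t z)
{
    assert(s != NULL);

    /* Clipping against the target size is always performed. */
    if (x < 0 || y < 0 || x >= s->width || y >= s->height)
        return 0;

    /* Optional clipping against the clip rect. */
    if (s->clip_enabled &&
        (x < s->clip_x0 || y < s->clip_y0 ||
         x >= s->clip_x1 || y >= s->clip_y1))
        return 0;

    /* Optional depth culling: a lower z is farther away and discarded. */
    if (s->zbuffer_enabled && s->zbuffer != NULL &&
        z < s->zbuffer[y * s->width + x])
        return 0;

    return 1;
}
```

Ordering the tests this way also guarantees that the depth buffer is only indexed with coordinates that lie inside the target.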
Footnote 23: It was decided not to support interpolated values for lines (not often useful) or rectangles (can be achieved by drawing two triangles) because it would mean adding more division units.
Footnote 24: Also known as the CUVZ module, as it calculates Color, UV-coordinates and Z (depth).
Footnote 25: In other words, at 16 bit color depth, the render target and the depth buffer will have the same dimensions and take up an equal amount of memory.

Figure 21: Picture of the fragment processor state machine

6.2.9 Fragment processor: coloring

The fragment processor adds color to the pixels generated by the rasterizer (the ones that are not discarded by the clipping module). This can be done using one of several sources:

1. A flat color residing in the main color register.
2. An interpolation of several colors from all the three color registers (one color for each corner of a triangle).
3. Textured, using texture coordinates U and V generated by either the rasterizer or the interpolation pipeline.

Which coloring mode is used is defined in a global register, and is constant over the drawing of each graphics primitive. Flat colors can be used for all graphics primitives. Since the color is constant over an entire operation, the fragment processor fetches the color from a global register. Gradient coloring is only available for triangles. Here, the fragment processor fetches the calculated color from the interpolation pipeline. This color is the linear combination of three global color registers and the interpolation factors for each corner of the triangle. The textured coloring mode is available for rectangles and triangles (footnote 26). This mode requires access to texture memory through the wishbone reader. The address where the fragment processor looks for the color is calculated from the base texture address and from the U and V texture coordinates. These coordinates are either generated by the rasterizer (for rectangles) or by the interpolation pipeline (for triangles).
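As an illustration of the texture address calculation, a minimal C sketch is given here. The exact memory layout is not spelled out in this section, so the sketch assumes a tightly packed, row-major texture and integer texel coordinates; the function name texel_address() and its parameters are hypothetical and not part of the driver.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Byte address of the texel at (u, v), assuming a row-major texture
 * without padding between rows (an assumption, not taken from the
 * ORGFX specification).
 */
uint32_t texel_address(uint32_t tex_base, uint32_t tex_width,
                       uint32_t u, uint32_t v, uint32_t bytes_per_pixel)
{
    assert(u < tex_width);   /* u must stay inside the current row */
    return tex_base + (v * tex_width + u) * bytes_per_pixel;
}
```

With a 64 pixel wide texture at 16 bit color depth and base address 0x1000, the texel (3, 2) would be found at 0x1106 under these assumptions. With U and V both zero the function returns the base address, which corresponds to always fetching the first pixel of the texture.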
One additional feature handled by the fragment processor is colorkeying. As mentioned in the design section, colorkeying only really makes sense if textured mode is used. If colorkeying is enabled and the fetched pixel matches the colorkey, the fragment processor discards the pixel instead of pushing it downstream. A flowchart for the fragment processor state machine can be seen in figure 21.

6.2.10 Fragment processor: vector rendering

Finally, the fragment processor handles the rendering of filled Bézier shapes, implementing the rendering of vector graphics described in section 5.5. As stated in the shape implementation section, quoting a paper by C.Loop[5]:

Claim: Any rational quadratic parametric curve has an implicit form that is a projected image of the algebraic curve $f(u, v) = u^2 - v$

The u and v parameters here should not be confused with the texture coordinates U and V; they are not related in the ORGFX implementation. Instead, the factors are renamed:

$f(bezierFactor_0, bezierFactor_1) = bezierFactor_0^2 - bezierFactor_1$

As can be seen in the left triangle in figure 16, the different coordinates at each corner ($[0, 0]$, $[\frac{1}{2}, 0]$ and $[1, 1]$) represent corner values for $[bezierFactor_0, bezierFactor_1]$. When the ORGFX device is sent a command to start a Bézier shape operation, it is handled exactly as an interpolated triangle draw. Each pixel in a rectangular bounding box around the triangle is generated and tested by the rasterizer, and the pixels that fall inside of the triangle are passed on to the interpolation pipeline. In the interpolation pipeline, the Barycentric coordinates are used to calculate $[bezierFactor_0, bezierFactor_1]$ by interpolating between the corner values.
The fragment processor is presented with the actual value of $[bezierFactor_0, bezierFactor_1]$ at the generated pixel, and from this calculates the result of the equation:

$f(bezierFactor_0, bezierFactor_1) = bezierFactor_0^2 - bezierFactor_1$

If $f(bezierFactor_0, bezierFactor_1) < 0$ then the pixel is inside the curve, otherwise it is outside. The fragment processor is provided with a flag that decides if the curve should be filled inside or outside (see figure 17). If the shape should be filled outside, the condition is $f(bezierFactor_0, bezierFactor_1) \geq 0$ instead. If the pixel passes the test it is colored as usual, but if the test fails the pixel is discarded. A pixel in a textured Bézier shape that passes the test can still be discarded in the colorkeying step, if this feature is enabled.

Footnote 26: Technically lines will also work, but since no texture coordinates are generated, the fragment processor will always fetch the first pixel of the texture.

Figure 22: Picture of the blender state machine

6.2.11 Blender

The purpose of the Blender is to calculate the combined color, based on the color provided by the fragment processor, the color at the target pixel and the alpha value. This module implements the transparency feature described in section 5.3.8. Alpha blending is an optional feature that can be turned off, which will save some memory bandwidth. There are two components to the alpha value: the global alpha (fetched from a global register since it is constant over a primitive) and the pixel alpha. The pixel alpha is only used if triangle interpolation is active, and enables interpolating between different alpha values over a single primitive. If interpolation is not active, the fragment processor sets the pixel alpha to no transparency. All alphas are stored as 8 bit fixed point values (0 integer bits, 8 fractional bits), where 0 represents full transparency and 255 represents no transparency.
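The 8 bit fixed point alpha representation can be illustrated with a short C sketch. It multiplies the two alpha components and blends a single color channel with the target, shifting each fixed point product right by 8 bits. The function names combine_alpha() and blend_channel() are hypothetical, and the sketch assumes 8 bit color channels for simplicity, which need not match the channel widths of the pixel format used by the hardware.

```c
#include <assert.h>

/*
 * Combine pixel alpha and global alpha. Both are 0.8 fixed point
 * values, so the 16 bit product is shifted right by 8 bits.
 */
unsigned combine_alpha(unsigned alpha_fragment, unsigned alpha_global)
{
    assert(alpha_fragment <= 255u && alpha_global <= 255u);
    return (alpha_fragment * alpha_global) >> 8;
}

/*
 * Blend one color channel of the fragment with the target pixel.
 * 8 bit channels are an assumption made for this sketch.
 */
unsigned blend_channel(unsigned c_fragment, unsigned c_target, unsigned alpha)
{
    assert(c_fragment <= 255u && c_target <= 255u && alpha <= 255u);
    return (c_fragment * alpha + c_target * (255u - alpha)) >> 8;
}
```

One property of this scheme that the sketch makes visible is that the result is always rounded down: two fully opaque alphas combine to 254 rather than 255, and blending a channel value of 200 at alpha 255 yields 199.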
The combined alpha is calculated with the following formula:

$alpha = alpha_{fragment} \cdot alpha_{global}$

The final alpha is right shifted by 8 bits to account for the fixed point multiplication. The blender fetches the color of the target pixel from the render target, then calculates the final color of the pixel:

$color_{out,r} = color_{fragment,r} \cdot alpha + color_{target,r} \cdot (255 - alpha)$
$color_{out,g} = color_{fragment,g} \cdot alpha + color_{target,g} \cdot (255 - alpha)$
$color_{out,b} = color_{fragment,b} \cdot alpha + color_{target,b} \cdot (255 - alpha)$

The final value is right shifted by 8 bits to account for the fixed point multiplication. A flowchart for the state machine in the blender can be seen in figure 22.

6.2.12 Renderer

The rendering module calculates the address of the target pixel and the bitmask to write the color value to memory without affecting adjacent pixels. These values are then sent to the wishbone write interface for processing. If the depth buffer is enabled and the current pixel passed the clipping stage, the depth of the pixel must be written to the z-buffer so it can be compared in later operations. In other words: if depth is enabled, the renderer will perform two memory writes, one to the actual target pixel and one to the depth buffer. One of the more notable optimizations discussed in future works (section 10) is bandwidth usage optimization. The renderer would be the correct place to implement a write queue to process burst writes.

7 Software integration

In this section the Hardware/Software interface is explained.

7.1 Bare metal driver

The term bare metal refers to when the OpenRISC processor is running C-code or assembly instructions directly without having an operating system active. This mode is very useful for testing and debugging, since it removes several layers of complexity. All the driver components are written in ANSI C, without any platform specific functions or macros.
The exact implementation of the driver depends on how the ORGFX device is connected to the OpenRISC processor. The reference implementation developed alongside the component assumes that ORGFX is mapped to memory and all registers can be written to and read from directly, without any caching. The bare metal driver is written in several layers of increasing complexity, with the lower layers being ideal for debugging individual instructions and the higher layers giving the application programmer an API that is easier to use. The higher level APIs usually perform more writes to the device than is strictly needed, but they ensure a more stable device state.

7.1.1 Basic functionality

• orgfx.h
• orgfx.c
• orgfx_regs.h

The basic functionality layer handles all communication with the device itself (with each additional layer only adding convenience functions that use the basic functionality). Communication with the device over the wishbone bus is performed by a simple macro:

    #define REG32(add) *((volatile unsigned int *)(add))

This macro can be used both to read from and to write to memory. The actual memory addresses of each register and specific pin numbers are stored in orgfx_regs.h. These match the hardware parameters defined in gfx_params.v. orgfx_regs.h also defines the base address of the device on the CPU data bus. The design of the basic driver functionality is minimalistic, with each function call doing as few operations as possible. To perform more complex tasks, the user of the API will have to call several functions in sequence, while keeping track of the current device state. Three things are needed to initialize the driver:

1. A call to orgfx_init() to initialize the driver with the base video memory address.
2. A call to orgfx_vga_set_videomode() to initialize the VGA/LCD module.
3. A call to orgfx_init_surface() to get a rendering target.
The third function returns a struct orgfx_surface, which contains information about a render target or texture. To perform drawing on the new target, it has to be bound as the currently active render target, by using the orgfx_bind_rendertarget() function. A render target can be of any resolution or aspect ratio, but the first one should be set to the same resolution as the video mode (it will represent the screen). The driver makes no attempt to hold on to render targets; it is entirely up to the user to keep track of them. Each additional render target is allocated memory sequentially by incrementing a memory offset inside the driver.

When the device is properly initialized, the user can start making drawing calls to have pixels appear on the screen. While it is possible to set pixels individually using the orgfx_set_pixel() function, this function does not have any hardware acceleration. After setting the drawing color with orgfx_set_color() or orgfx_set_colors(), the user can perform accelerated drawing operations with the following primitives:

    orgfx_rect(x0, y0, x1, y1)
    orgfx_line(x0, y0, x1, y1)
    orgfx_triangle(x0, y0, x1, y1, x2, y2, interpolation)
    orgfx_curve(x0, y0, x1, y1, x2, y2, fill, inside)

These functions draw simple rectangles, Bresenham lines[2], triangles (with or without interpolated colors), and quadratic Bézier shapes. An important thing to note is that all point coordinates are defined in fixed point notation. For convenience, the define FIXEDW can be used to create valid coordinates this way. For example, to draw a rectangle from point (10,15) to (20,25) one would write:

    orgfx_rect(FIXEDW*10, FIXEDW*15, FIXEDW*20, FIXEDW*25);

To do more interesting things than drawing flat rectangles, textures need to be loaded to the device.
A texture is essentially another render target, so orgfx_init_surface() and orgfx_bind_rendertarget() have to be called to allocate the new texture. Once the texture is bound, any of the above drawing operations can be used to fill the pixels, like with any render target. Usually the user wants to load a prepared image though, which is most easily achieved by calling orgfx_memcpy() with a memory buffer and its size. This function is intended to accept the generated output of the sprite maker utility (see section 7.2.1).

To draw the texture, it has to be bound as a texture using the orgfx_bind_tex0() function, and texturing has to be enabled with the orgfx_enable_tex0() function. Once texturing is enabled, it will be used instead of the regular color for the drawing primitives. To only draw certain sections of a texture, the user can set the source rect with the orgfx_srcrect() function. This will add an offset to the texture in orgfx_rect() calls. The source rect is reset each time a new texture is bound.

Drawing textured triangles and Bézier shapes is slightly more complex. For this, a texture coordinate has to be set for each control point with a call to orgfx_uv(). For this to work, the triangle function has to be called with the interpolate parameter set to one (texture coordinates will be interpolated between the triangle control points). One more thing that should be noted about triangles is that they must be defined in clockwise order. Any triangles defined in the wrong order will be discarded in hardware, and the same holds true for Bézier shapes.

Colorkeying can be applied to any texture draws by using orgfx_enable_colorkey and orgfx_set_colorkey. Any time a texture read matches the colorkey, the current pixel is discarded. The ORGFX alpha blending functionality can be used with the functions orgfx_enable_alpha and orgfx_set_alpha.
Take care when using alpha blending together with interpolated triangles, since alpha values will be set for each control point and interpolated over the primitive. The resulting per-pixel alpha will be multiplied by the global alpha as described in section 5.3.8. The alpha value sent to the device consists of four parts, arranged as follows:

    Bit #      Description
    [31:24]    Point 0 alpha
    [23:16]    Point 1 alpha
    [15:8]     Point 2 alpha
    [7:0]      Global alpha

For example, calling orgfx_set_alpha with an alpha of 0xff8000ff would mean that p0 is opaque, p1 has half transparency, p2 is transparent and the global alpha is set to opaque.

The ORGFX device has one major function that supports 3D rendering: orgfx_triangle3d(). This function works exactly like the regular triangle function, but allows shapes with depth to be rendered. To make full use of this feature, the user can create and bind a depth buffer to perform depth culling. The buffer itself is created the same way as render targets and textures: with orgfx_init_surface(). There are three functions related to depth buffer culling:

    orgfx_bind_zbuffer()
    orgfx_enable_zbuffer()
    orgfx_clear_zbuffer()

First, the depth buffer has to be bound. It is up to the user to ensure that the bound z-buffer is of the same resolution as the render target. Once depth culling is enabled, any writes that pass the culling stage will overwrite the depth buffer. This means that once the user wants to draw a new frame, the depth buffer should first be cleared. Finally, the user can activate the hardware accelerated 3D transformations of the ORGFX device with orgfx_enable_transform() and orgfx_set_transformation_matrix().

7.1.2 Extended API

• orgfx_plus.h
• orgfx_plus.c

While all the functionality of the graphics card can be accessed with the basic driver, it is fairly difficult to keep track of the device state and keep it consistent.
The extended API is intended to improve this and encapsulates some of the more complex functionality in convenient functions. One major change from the basic API is that surfaces are tracked internally by the driver, and the user gets an integer ID that is used for binding the surface.

The extended driver is initialized by a call to orgfxplus_init(). The function initializes the graphics card, sets the video resolution and allocates the screen surface. Additionally, depending on the flags passed to the function, it automatically allocates surfaces for double buffering and depth buffering. The function returns an integer number that is used to refer to the screen surface (always -1).

When double buffering is activated, the driver keeps track of which surface is currently the active buffer. The user can switch between active buffers with the orgfxplus_flip() function. If depth buffering is activated, the driver automatically binds the depth buffer, but does not enable the z-buffer culling.

To initialize a surface and load an image into it with one function call, use orgfxplus_init_surface(). The function takes the width and height of the surface and a pixel buffer to copy into it, and returns an ID referring to the allocated surface. The number of surfaces that can be allocated is static and can be changed before compiling the driver.

Figure 23: Example bitmap font. The characters are placed at regular intervals in a 16 by 16 grid.

Since a new syntax for handling surfaces is used, the extended API has two new functions for binding the render target and the currently active texture: orgfxplus_bind_rendertarget() and orgfxplus_bind_tex0().

The new syntax also changes the way that sprites are rendered: the orgfxplus_draw_surface() and orgfxplus_draw_surface_section() functions bind the supplied texture, enable texturing and draw the image to screen. The second function also sets the source rect, causing only part of the image to be rendered.
All of the functions in the basic API can be used alongside the extended API; the extended API simply provides an easier way to initialize and handle surfaces.

7.1.3 Advanced API – Tilesets and bitmap fonts

Files:
• orgfx_tileset.h
• orgfx_tileset.c
• orgfx_bitmap_font.h
• orgfx_bitmap_font.c

A relatively common way to handle sprites is to store multiple sprites in the same image file, and only draw part of the image when a sprite is requested. It is possible to get this functionality from the basic driver (by setting the source rect before drawing), or by using the orgfxplus_draw_surface_section() function from the extended API. Both these methods require that the user provide the source rect every time a sprite is drawn.

The tileset library provides a simple wrapper around this. By storing an array of orgfx_sprite_rect structs, the user can draw sprites with a call to the orgfx_draw_tile() function, providing a tileset pointer and the index of the sprite to be drawn. The tileset library uses the extended API syntax for handling surfaces.

Bitmap fonts are a special case of tilesets. By providing an image of the entire ASCII character set, the user can render text to the screen with only one function call. Figure 23 shows an example bitmap font. To enable the user to write special characters such as åäö, wide character strings are used. The syntax for writing text using a loaded bitmap font is as follows (note the L that denotes the text as a wide character string):

    orgfx_put_bitmap_text(&font, x0, y0, L"Some example text");

Since writing the specification for a bitmap font by hand can be quite tedious, a utility to automate the process is provided. See section 7.2.2.

7.1.4 Advanced API – Vector fonts

Files:
• orgfx_vector_font.h
• orgfx_vector_font.c

Vector fonts are much more versatile than bitmap fonts. Since the glyphs are stored as vectors, they can be scaled up or down without loss of detail.
In addition to this, the points can be arbitrarily translated, scaled and rotated. The relevant functions are:

orgfx_make_vector_font
orgfx_init_vector_font
orgfx_put_vector_text

For more information on how to actually generate the internal data structures needed to render vector fonts, see section 7.2.4.

Figure 24: The same mesh rendered in wireframe, colored triangles and textured mode.

7.1.5 Advanced API – 3D

Files:
• orgfx_3d.h
• orgfx_3d.c

The basic driver allows for hardware accelerated transformations of points and rendering of triangles in 3D. By calling the correct functions a depth buffer can be initialized and used to prevent triangles far away from overwriting closer triangles. This is quite far from a manageable 3D interface though, so a convenience driver for displaying 3D models is provided.

The main object of the 3D interface is the orgfx_mesh struct. Besides storing all the points in the model, and information about how they form triangles, the mesh struct contains a set of transformation variables. The translation, rotation and scale variables can be adjusted to move and manipulate the transformation matrix of the mesh. The mesh can be rendered with the provided transformations by calling orgfx3d_draw_mesh(). The function allows for rendering the mesh with filled triangles or as a wireframe, using lines (see figure 24).

Since the basic driver is only capable of loading a prepared transformation matrix, the 3D API provides simple functions to create and transform matrices. Meshes can be generated from Wavefront .obj files with the mesh maker utility (see section 7.2.3).

7.2 Utilities

While developing the graphics accelerator we implemented some tools to make it easier to manage the project.

7.2.1 Sprite maker utility

A small application that converts an image into a header file that can be included in the project when compiled. The application generates an array of color values that can be loaded as a sprite.
The application has support for reading common image file formats such as bmp, png and jpg (for a full list, see the supported file formats of the SDL_image library). 8-, 16- and 32-bit output is supported, and can be changed by passing a command line argument to the program (by default, the output is adjusted for 16-bit color mode). The resulting output header file, which is named after the input, can be included in a program using the extended bare metal driver. The easiest way to use the sprite is to use the generated initialize function defined in the header file.

7.2.2 Bitmap font maker utility

Another application generates the data structures necessary to load bitmap fonts with very little effort. It takes an image and a grid spacing as input, and automatically generates offsets for all the glyphs in the font. The font generated by the program has 256 characters arranged according to the ASCII charset, as seen in figures 25 and 26.

The application has support for reading common image file formats such as bmp, png and jpg (for a full list, see the supported file formats of the SDL_image library). 8-, 16- and 32-bit output is supported, and can be changed by passing a command line argument to the program (by default, the output is adjusted for 16-bit color mode). Both vertical and horizontal grid spacing are set to 32 pixels by default, but this can be changed through command line arguments.

The resulting output header file, which is named after the input, can be included in a program using the bare metal and font driver. The easiest way to use the bitmap font is to use the generated initialize function defined in the header file.

Figure 25: The ASCII table. Each number from 0 to 127 refers to a character. The numbers 0 to 31 cannot be printed.

Figure 26: The extended ASCII table. Each number from 128 to 255 refers to a character, mostly special characters not included in the basic table.
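The glyph offsets that the bitmap font maker generates can be pictured with a small sketch. It assumes that the glyphs are laid out row by row in a grid 16 cells wide (as in figure 23), indexed by character code, with a caller-supplied grid spacing. The glyph_rect type and the glyph_cell function are hypothetical names used for illustration only; they do not reproduce the utility's actual output format or the driver's own rect struct.

```c
/* Hypothetical rectangle type; the driver's own sprite rect struct may differ. */
typedef struct {
    int x0, y0;   /* top-left corner of the glyph cell, in pixels */
    int x1, y1;   /* bottom-right corner (exclusive), in pixels */
} glyph_rect;

enum { GRID_COLUMNS = 16 };   /* assumption: 16 by 16 grid holding 256 glyphs */

/* Returns the grid cell of character `code` (0..255), assuming a row-major
 * layout. Codes above 255 would fall outside a 16 by 16 image. */
static glyph_rect glyph_cell(unsigned code, int spacing_x, int spacing_y)
{
    glyph_rect r;
    int column = (int)(code % GRID_COLUMNS);
    int row    = (int)(code / GRID_COLUMNS);

    r.x0 = column * spacing_x;
    r.y0 = row * spacing_y;
    r.x1 = r.x0 + spacing_x;
    r.y1 = r.y0 + spacing_y;
    return r;
}
```

For the character 'A' (code 65) and the default spacing of 32 pixels, this gives column 1 and row 4, that is, the cell from (32, 128) to (64, 160). A rect of this kind is what would be used as the source rect when drawing a single glyph from the font image.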
Figure 27: A font rendered by the software implementation of the ORGFX. Bézier curves are single colored while the triangles are interpolated between the current color and black.

7.2.3 Mesh maker utility

The mesh maker utility loads 3D objects and generates a header file that can be used by the advanced 3D API. Currently the utility only supports Wavefront .obj files that contain only triangles. Any polygons with more than three vertices will be discarded, so all polygons in the model must be converted to triangles prior to running the utility. The application supports loading texture coordinates for each vertex, allowing for textured meshes.

The resulting output header file, which is named after the input, can be included in a program using the bare metal 3D API. The easiest way to use the mesh is to use the generated initialize function defined in the header file.

7.2.4 Vector font maker utility

The font maker is an application that converts a .ttf file to a format that the graphics card can handle. It outputs a .h file that can be included in a project to enable the graphics accelerator's vector font capabilities.

TTF is a font format that stores a set of explicit points to describe an outline. The points connect to each other and form shapes. The converter finds all explicit vector points in a .ttf file and then calculates the implicit points. At the same time it checks where the glyph contours end. The points are then sent to a Delaunay triangulation function, based on the work of V. Domiter and B. Zalik [4] and implemented by M. Green and T. Åhlén (see footnote 27). The generated header file contains two lists for each glyph, one to store Bézier writes and one to store triangle writes. The algorithm is confirmed to work with a development font (see figure 27).
The following assumptions are made:
• The initial shape in the glyph is a filled shape.
• Any shape that is defined outside of the previously filled shape is also a filled shape.
• All shapes that collide with the previous filled shape are holes in that shape.

This algorithm does not work with fonts that begin with a hole and then later add the filled shape.

8 Testing and validation

This section describes the testing and validation processes used in this project. Since the ORGFX core is a very complex system spanning both software and hardware, it is important that each subsystem is properly validated, both separately and in their interaction.

8.1 Algorithmic validation

All of the rasterization and rendering logic was implemented in C code to validate its function prior to the Verilog implementation. Using Simple DirectMedia Layer (SDL, see footnote 28) as a graphical backend with a "put pixel" interface greatly increased the speed at which prototypes could be developed. This validation step was performed to confirm that the chosen algorithms worked the way they were supposed to, and to identify possible problems with them. In addition, since the software implementation was designed to use the same API as the hardware implementation, applications using the accelerator can be developed and tested faster.

Footnotes:
27 http://code.google.com/p/poly2tri/
28 http://www.libsdl.org

8.2 Hardware validation

Once the algorithm itself was verified, it was implemented in hardware. This hardware had to be verified in turn, due to the increased complexity of parallel computations and issues introduced by timing. Icarus Verilog (iverilog) is an open source Verilog simulation tool that can also be used as a synthesis tool. This tool is used to build test benches; a test bench is a small script containing simulated input. The test bench is compiled together with the corresponding HDL code, and the resulting simulation generates a dump file that can be viewed in a wave viewer.
In this project we have used the open source tool GTKWave. The output from the test bench is analysed in GTKWave to see if we get the correct output for the given input. Each module has its own test bench based on iverilog. There is also a test bench simulating the pipeline as a whole system. This verifies that the modules are properly connected and interact correctly.

The test benches verify that the implementation is logically correct. They do not detect any timing errors that can occur when the code gets synthesized/mapped onto the device. It can be hard to verify some graphical operations with a test bench, and while it is not a perfect debug environment, it is a lot better than just doing a visual inspection of what shows up on the screen.

8.3 Software validation

The software of the ORGFX component needs to be verified both separately and together with the hardware. Thankfully the interface between the two is very simple, and just consists of fixed width memory writes. The bare metal driver is verified by a script that runs a test application and checks that the output is correct for the given input. The script verifies that the API works as intended. The software is also tested and verified by visual inspection of the output on the synthesized hardware.

8.4 System validation

The system test is based on iverilog and the bare metal drivers. A script binds the system test bench and bare metal drivers together and checks that the correct output is delivered according to the input. This test verifies that the software and hardware are compatible and give the correct output. However, running the Verilog code through iverilog will not guarantee that the hardware actually works on the device. Additional considerations such as fitting, the availability of specialized hardware on the FPGA and routing delays can affect the performance and function of the ORGFX.

9 Results

The ORSoC Graphics Accelerator is an FPGA core with 2D, 3D and vector drawing capabilities.
The use of a graphics accelerator releases CPU time that can be put to better use than putting pixels on the screen. The current implementation is very generic and platform independent but still manages to run a demo of all its features smoothly on a 50 MHz OpenRISC processor. If more time is spent on optimization for the specific platform, the ORGFX will work even better. The best way to improve performance is to implement hardware with a dedicated graphics RAM.

9.1 Performance

This project has aimed to build a generic graphics accelerator, and the focus has been on implementing new features rather than optimizing the implementation for the current development platform. The limiting factor on the development board is how the accelerator accesses the RAM. There is no dedicated memory for the graphics accelerator and there is no texture cache implemented on the graphics accelerator. Ultimately, tests show that the main bottleneck is the bandwidth of the Wishbone bus and memory access. The memory bandwidth has to be shared with the VGA core, and if the ORGFX uses too much memory bandwidth the VGA core is unable to keep up, making the output picture unstable.

The performance of applications depends on how rendering is handled in software. One common technique is to clear and redraw the entire scene each frame. This expends a lot of bandwidth, and a smooth framerate (above 25 frames per second) will not be possible. Getting a smooth framerate is not as problematic if only the parts of the screen that change are redrawn. Scrolling scenes can be implemented by moving the VGA read pointer instead of redrawing the entire screen. It is more difficult to achieve a smooth framerate for 3D rendering than for 2D, since moving the camera usually forces a complete redraw of the scene.

9.2 Benchmarking

The ORGFX core takes up 10000 slice LUTs (calculated using Xilinx ISE 13.4).
The longest timing path of the core on the Atlys board is 16.076 ns, allowing for a core speed of 62.205 MHz. The current implementation is able to display a smooth rendering of a rotating 3D mesh with 90 faces. The ORGFX can display roughly 5.1 million pixels per second (simple pixel-by-pixel rectangle rendering). This is compared to roughly 0.5 million pixels per second rendered by the 50 MHz CPU (also simple pixel-by-pixel rectangle rendering). This is roughly a tenfold increase in performance. It should also be noted that the CPU is free to perform other operations during the hardware rendering. More complex operations should yield an even greater improvement in performance, since the ORGFX pipeline has specialized hardware for transformations, coloring and texturing.

10 Future work

The design and implementation of the ORSoC graphics accelerator presented in this thesis is just a proof of concept, and many things could be worked on to improve both the function and performance of the device. This section lists a number of areas that should have future work dedicated to them.

10.1 Textures

To make it possible to interpolate from one image to another, more texture banks need to be added. Currently only one (Tex0) is implemented. If several image sources are available, this also opens up the possibility to add new interesting features to the device such as bump mapping, normal mapping or decals. Of course, more textures on the same surface means more memory reads per pixel, which leads to the next point of improvement.

10.2 Bandwidth issues

The current implementation suffers from bandwidth limitations and unoptimized use of bandwidth. The same pixel in a texture may be read multiple times, causing a large overhead in the communication with the video memory. There are two relatively simple ways to improve performance here. The first is to implement an internal texture cache for each of the textures or buffers, which could save several clock cycles per pixel operation.
Another way to reduce the problems introduced by the limited bandwidth is to optimize the Wishbone access by using block reads and writes, described in the revB.4 Wishbone bus specification.

10.3 8/24/32 bpp

Another desired feature is to have proper support for 8-, 24- and 32-bit color depth modes. The current implementation only has full support for 16-bit color depth mode. This feature is closely entangled with how the display controller is implemented, since the ORGFX device has to write pixels to memory in the same format that the display controller reads.

10.4 Alpha from memory

The current implementation supports setting the transparency of a drawing primitive either globally or through interpolation. Colorkeying does implement a form of per-pixel transparency loaded from memory, but it would be desirable to have full alpha support for each pixel. This would of course further increase bandwidth usage.

10.5 Precision issues

The choice to use fixed point arithmetic in ORGFX was based on fast development time and low logic complexity (which in turn translates to less logic usage on the FPGA). It does introduce two problems however:
1. The device has trouble processing extremely large or extremely small numbers.
2. There is an inevitable loss of precision due to the calculations. In some cases this may be visible to the user in the form of jagged textures or triangle edges not matching perfectly.

While the issues could be reduced by increasing the number of bits used for the fixed point arithmetic, that would in turn lead to greater bandwidth usage. The most desirable solution would be a full floating point unit (FPU) to process the calculations, but that could be extremely costly in terms of FPGA logic usage and adds an entire level of complexity.

10.6 Platform specific optimizations

The current implementation suffers from performance issues, some of which could possibly be overcome by adding optimizations specific to a particular development board or FPGA circuit.
While this may gain some speed or reduce the size of the IP core, it would reduce the number of platforms that the device can be implemented on. The ORGFX implementation was specifically designed to be as generic as possible so it can be loaded onto any FPGA device. It is even possible to change the display controller and the master CPU without any changes to the ORGFX component. There is only one non-generic part of the current design: the Wishbone bus interface.

10.7 Other bus implementations

The ORGFX graphics accelerator would benefit from support for common FPGA data buses like Altera's Avalon bus, used for the Nios II soft core processor, or the CoreConnect PLB bus that is used with the Xilinx soft core processor MicroBlaze. Expanding the number of available communication interfaces has two advantages:
1. It makes it possible to integrate the component in older SoC designs with minimal effort.
2. It makes it possible to use the SoC design tools provided by the larger FPGA vendors (Altera has SOPC Builder/Qsys for example). This can greatly increase the speed of designing larger systems.

10.8 Linux driver

The possibility of implementing a Linux driver was studied during the research phase of this thesis. It was concluded that it would be most convenient to implement a DirectFB driver or use the bare metal drivers and write to the hardware through memory mapping. This is an easy way to add Linux support for the graphics card, but it requires that programs have their graphics API ported to the DirectFB/ORGFX API to gain graphics acceleration. Due to the complexity of the task and the limited time, a Linux driver was never implemented. A DirectFB and/or DRI/DRM driver might be included in future releases.

11 Conclusions

The ORSoC graphics accelerator is a fully functional 2D and 3D graphics accelerator for embedded systems, with additional support for hardware accelerated vector graphics.
While the device uses technology a few years behind current high end graphics accelerators, it is one of the few truly open alternatives, since all hardware, software and documentation are available under the LGPL. The aim to make the implementation as generic and platform independent as possible has led to some concessions on performance, but the modular design allows for a lot of expansions. A future implementation of ORGFX optimized for a target platform and configured with multiple pipelines and a texture cache would lead to large improvements in performance.

Code written for the ORGFX API can, with the help of the software implementation, be verified without access to the graphics hardware. This allows interested peers to become developers for the ORGFX without having to buy expensive hardware. By using the provided utilities, developers can quickly integrate media into their ORGFX applications.

The ORGFX as it is can be used for static or low framerate graphics applications on embedded systems, such as human-machine interfaces (HMI). The authors of this thesis hope that ORGFX can be used as a base platform to build additional functionality for open hardware graphics, and that future performance optimizations can make the platform viable for high framerate graphics on embedded platforms.

References

[1] S. Bailey. Comparison of VHDL, Verilog and SystemVerilog. 2003.
[2] J. E. Bresenham. Algorithm for computer control of a digital plotter. IBM Systems Journal, 4(1):25–30, 1965.
[3] S.-H. Chen, H.-M. Lin, C.-C. Hsieh, C.-T. Huang, J.-J. Liou, and Y.-C. Chung. TurboVG: a hw/sw codesigned multi-core OpenVG accelerator for vector graphics applications with embedded power profiler. In Proceedings of the 16th Asia and South Pacific Design Automation Conference, ASPDAC '11, pages 97–98, Piscataway, NJ, USA, 2011. IEEE Press.
[4] V. Domiter and B. Zalik. Sweep-line algorithm for constrained Delaunay triangulation.
International Journal of Geographical Information Science, 22(4):449–462, 2008.
[5] C. Loop and J. Blinn. Resolution independent curve rendering using programmable graphics hardware. ACM Trans. Graph., 24:1000–1009, July 2005.
[6] K. Mcallister. Triangle rasterization, 2007.
[7] H. Nguyen. GPU Gems 3. Addison-Wesley Professional, first edition, 2007.
[8] J. G. Rokne, B. Wyvill, and X. Wu. Fast line scan-conversion. ACM Trans. Graph., 9(4):376–388, Oct. 1990.
[9] A. R. Smith. Alpha and the history of digital compositing. In Microsoft Technical Memo 7, 1995.
[10] W. Zhang and I. Majdandzic. Fast triangle rasterization using irregular z-buffer on CUDA. 2010.

A Appendix A, ORGFX Specification

ORSoC Graphics accelerator Specification
Per Lenander, Anton Fosselius
August 20, 2012

Revision history

Rev.  Date       Author           Description
1.0   23/3/2012  Per Lenander     Initial draft and basic functionality
2.0   4/6/2012   Per Lenander     Advanced functionality (vector, 3D etc)
3.0   20/8/2012  Anton Fosselius  Fixed typos

Contents

1 Introduction
  1.1 Features
  1.2 IP Core directory structure
2 Architecture
  2.1 Overview
  2.2 Concepts
  2.3 Coordinate precision
  2.4 Instruction FIFO
  2.5 Pipeline
  2.6 Description of core modules
    2.6.1 Wishbone slave
    2.6.2 Transformation processor
    2.6.3 Rasterizer
    2.6.4 Clipper
    2.6.5 Fragment processor
    2.6.6 Blender
    2.6.7 Wishbone arbiter
    2.6.8 Wishbone master read
    2.6.9 Renderer
    2.6.10 Wishbone master write
3 IO Ports
4 Registers
  4.1 Control Register (CONTROL)
  4.2 Status Register (STATUS)
  4.3 Alpha (ALPHA)
  4.4 Colorkey register (COLORKEY)
  4.5 Target base address Register (TARGET_BASE)
  4.6 Target size width Register (TARGET_SIZE_X)
  4.7 Target size y Register (TARGET_SIZE_Y)
  4.8 Texture 0 Base Register (TEX0_BASE)
  4.9 Texture 0 size x Register (TEX0_SIZE_X)
  4.10 Texture 0 size y Register (TEX0_SIZE_Y)
  4.11 Source Pixel position 0 x Register (SRC_P0_X)
  4.12 Source Pixel position 0 y Register (SRC_P0_Y)
  4.13 Source Pixel position 1 Register (SRC_P1_X)
  4.14 Source Pixel position 1 Register (SRC_P1_Y)
  4.15 Destination Pixel position Register (DEST_X)
  4.16 Destination Pixel position Register (DEST_Y)
  4.17 Destination Pixel position Register (DEST_Z)
  4.18 Matrix coefficient registers
  4.19 Clip Pixel position 0 x Register (CLIP_P0_X)
  4.20 Clip Pixel position 0 y Register (CLIP_P0_Y)
  4.21 Clip Pixel position 1 x Register (CLIP_P1_X)
  4.22 Clip Pixel position 1 y Register (CLIP_P1_Y)
  4.23 Color Registers (COLOR0-2)
  4.24 Texture coordinate Registers (U0-2 and V0-2)
  4.25 Depth buffer Register (ZBUFFER_BASE)
5 Operation
  5.1 Draw pixel
  5.2 Fill rect
  5.3 Line
  5.4 Triangle
  5.5 Curve
6 Clocks
7 Driver interface
  7.1 newlib
    7.1.1 orgfx_init
    7.1.2 orgfx_vga_set_videomode
    7.1.3 orgfx_vga_set_vbara
    7.1.4 orgfx_vga_set_vbarb
    7.1.5 orgfx_vga_bank_switch
    7.1.6 orgfx_init_surface
    7.1.7 orgfx_bind_rendertarget
    7.1.8 orgfx_enable_cliprect
    7.1.9 orgfx_cliprect
    7.1.10 orgfx_srcrect
    7.1.11 orgfx_set_pixel
    7.1.12 orgfx_memcpy
    7.1.13 orgfx_set_color
    7.1.14 orgfx_set_colors
    7.1.15 orgfx_rect
    7.1.16 orgfx_line
    7.1.17 orgfx_line3d
    7.1.18 orgfx_triangle
    7.1.19 orgfx_triangle3d
    7.1.20 orgfx_curve
    7.1.21 orgfx_uv
    7.1.22 orgfx_enable_tex0
    7.1.23 orgfx_bind_tex0
    7.1.24 orgfx_enable_zbuffer
    7.1.25 orgfx_bind_zbuffer
    7.1.26 orgfx_clear_zbuffer
    7.1.27 orgfx_enable_alpha
    7.1.28 orgfx_set_alpha
    7.1.29 orgfx_enable_colorkey
    7.1.30 orgfx_set_colorkey
    7.1.31 orgfx_enable_transform
    7.1.32 orgfx_set_transformation_matrix
  7.2 Extended newlib
    7.2.1 orgfxplus_init
    7.2.2 orgfxplus_init_surface
    7.2.3 orgfxplus_bind_rendertarget
    7.2.4 orgfxplus_bind_tex0
    7.2.5 orgfxplus_flip
    7.2.6 orgfxplus_clip
    7.2.7 orgfxplus_fill
    7.2.8 orgfxplus_line
    7.2.9 orgfxplus_triangle
    7.2.10 orgfxplus_curve
    7.2.11 orgfxplus_draw_surface
    7.2.12 orgfxplus_draw_surface_section
    7.2.13 orgfxplus_colorkey
    7.2.14 orgfxplus_alpha
  7.3 Bitmap Fonts
    7.3.1 orgfx_make_bitmap_font
    7.3.2 orgfx_put_text
  7.4 Vector Fonts
    7.4.1 orgfx_make_vector_font
    7.4.2 orgfx_init_vector_font
    7.4.3 orgfx_put_vector_char
    7.4.4 orgfx_put_vector_text
  7.5 3D API
    7.5.1 Transformations
    7.5.2 orgfx3d_make_mesh
    7.5.3 orgfx3d_mesh_texture_size
    7.5.4 orgfx3d_draw_mesh
  7.6 Linux
  7.7 Software emulation
  7.8 Utilities
    7.8.1 Sprite maker utility
    7.8.2 Bitmap font maker utility
    7.8.3 Mesh maker utility
    7.8.4 Vector font maker utility
8 Programming examples

1 Introduction

The ORSoC graphics accelerator allows the user to do advanced vector rendering and 2D blitting to a memory area. The core supports operations such as drawing textures, lines and curves, and filling rectangular and triangular areas with color. This IP core is designed to integrate with the OpenRISC processor through a Wishbone bus interface. The core itself has no means of displaying the rendered information; for this purpose it can work alongside a display component, such as the enhanced VGA/LCD IP core found on OpenCores.

1.1 Features

• 32-bit Wishbone bus interface
• Integrates with enhanced VGA/LCD IP core
• Support for 16 bit color depth
• Support for variable resolution
• Acceleration of line operations
• Acceleration of rectangle and triangle rasterization
• Acceleration of memory copy operations
• Textures can be saved to video memory
• Vector transformation and rasterization
• Clipping/Scissoring
• Alpha blending and colorkeying
• Filled Bézier curves
• Bitmap Fonts
• Vector Fonts (ttf)
• Interpolation of colors
• UV-Mapping
• Transformation (scaling and rotation)
• 3D model support (.obj model files built using 3rd degree polygons, i.e. triangles)
• Z-Buffer (triangles drawn in depth order)
• Requires around 10000 Slice LUTs (Xilinx ISE 13.4)

1.2 IP Core directory structure

An overview of the contents of the IP core source folder can be found in figure 1.

2 Architecture

2.1 Overview

A topology of how the ORGFX is connected to the VGA driver and the OpenRISC core is shown in figure 2. The ORGFX has three Wishbone interfaces: one read/write port that is used to communicate with the host CPU, one read port that reads depth, texture and alpha blending information from the RAM, and one write port that writes pixel information to the RAM.

Figure 1: Directory structure of the ORSoC graphics accelerator.

Figure 2: Overview of the ORPSoCv2's Wishbone interconnection.

Figure 3: 1. Texture, 2. Source, 3. Render target, 4. Clip, 5. Destination

2.2 Concepts

This section describes a few basic terms used in this document.

Video memory – The ORGFX component writes pixels one by one to an external memory, usually an SDRAM or DDR RAM chip. The CPU should also have access to this memory space to be able to write pixels directly (the easiest way to load textures).

Render target – The render target, defined by the target base and size registers, describes the area to which all operations render pixels. It is possible to change the base address and size, enabling render-to-texture and double buffering.

Surface/Texture – Any memory area that can be rendered to, including the render target, is considered a surface. A surface is defined by its base address and size. There are two main surfaces that the ORGFX device handles: the render target and the currently active texture. Swapping between different textures has to be done in software. The operation of setting the current render target or texture is referred to as binding.

Source, Destination and Clip rectangles – There are three sets of rectangles that affect rendering, each described by two points. The first point sets the beginning of the rectangle, while the second point sets the pixel after the end of the rectangle. This way, a rectangle exactly filling the screen would be (0,0,640,480) at 640x480 resolution. See figure 3.

Source rectangle – The source rectangle defines what pixels should be read from a texture during textured operations. The points are defined in the coordinates of the currently bound texture. This way sections of a texture can be drawn (good for tile maps or bitmap fonts).

Destination rectangle – The destination rectangle defines where operations such as draw pixel and draw line will draw pixels, in the coordinates of the render target.

Clip rectangle – The clip rectangle defines an area within the current render target which is valid to draw to. Any pixels outside this rectangle are discarded in the rasterization step. Pixels outside of the render target are automatically discarded.

Z-buffer – The depth buffer, or Z-buffer, is a surface containing z coordinate information. This can be used to draw graphics primitives in depth-correct order.

2.3 Coordinate precision

The ORGFX core supports variable coordinate precision through two parameters, point_width and subpixel_width. Both parameters default to a width of 16 bits. Target size, clip and source rects are defined as point_width bit integers. Destination points are defined as fixed point numbers, with point_width bits of integer precision and subpixel_width bits of fractional precision. Internally many calculations are done with fixed point logic.

Figure 4: Picture of the ORGFX pipeline

2.4 Instruction FIFO

All Wishbone writes sent to the slave interface pass through an instruction FIFO. If the device is in the busy state (when the pipeline is active) the instruction will be queued instead of being applied immediately. This is important to take into account when reading from registers, since an operation that changes the register being read might still be queued. To find out if the device is busy, poll the status register and check if the busy bit is high.

2.5 Pipeline

The ORGFX core uses a pipelined architecture to speed up operation. An overview of the pipeline can be seen in figure 4. The modules in the pipeline communicate through acknowledge (ack) and write signals. A module will not assert write to the next module unless it receives an ack first (or if the module was previously in a ready state, in which case the downstream pipeline is empty). All ack and write signals are always exactly one clock tick long, to prevent triggering multiple instances of the same instruction. Each module in the pipeline may hold the upstream pipeline for several clock ticks.
For example, the rasterizer will block incoming raster instructions until all the pixels for the current operation are generated. When the rasterizer is ready for new data, it will send an ack upstream.

2.6 Description of core modules

2.6.1 Wishbone slave

The Wishbone slave handles all communication from the main OpenRISC processor (or other master CPU). This component holds all the registers, and the instruction FIFO that sets them. This component can be in one of two states: busy or wait. It enters the busy state when a pipeline operation is initialized, and returns to the wait state when the operation is finished. Operations can be initialized by writing to the control register (see section 4).

2.6.2 Transformation processor

The transformation processor handles rotations and scaling.

2.6.3 Rasterizer

The rasterizer generates pixel coordinates from points for several different operations.

2.6.4 Clipper

The clipper discards a generated pixel if clipping is enabled and the pixel is out of bounds. Pixels outside of the target area are always discarded.

2.6.5 Fragment processor

The fragment processor adds color to the pixel generated by the rasterizer. If texturing is disabled, a color supplied from the color register is used. If texturing is enabled, the u v coordinates supplied by the rasterizer are used to fetch a pixel from the active texture. If colorkeying is enabled and the fetched color matches the color key, the current pixel is discarded.

2.6.6 Blender

The blender module performs alpha blending if this is enabled. The module fetches the color of the pixel that the current operation will write to, and mixes the value of the target color and the color from the fragment processor using the following algorithm:

    \alpha = \alpha_{global} \cdot \alpha_{pixel}
    color_{out} = color_{in} \cdot \alpha + color_{target} \cdot (1 - \alpha)

where \alpha is a value between 0 (transparent) and 1 (opaque). If alpha blending is disabled the pixel is passed on unmodified.
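The blending equation above can be sketched in C for a single 8-bit color channel. This is only a software illustration, not the hardware implementation: the helper name `blend_channel`, the choice of 8-bit alpha values (0x00 to 0xff, matching the fields of the ALPHA register) and the integer rounding are assumptions made for this example.

```c
#include <stdint.h>

/* Hypothetical helper: blends one 8-bit color channel in software.
 * alpha_global and alpha_pixel range from 0x00 (transparent)
 * to 0xff (opaque). */
static uint8_t blend_channel(uint8_t color_in, uint8_t color_target,
                             uint8_t alpha_global, uint8_t alpha_pixel)
{
    /* alpha = alpha_global * alpha_pixel, rescaled to 0..255 */
    uint32_t alpha = ((uint32_t)alpha_global * alpha_pixel) / 255u;

    /* color_out = color_in * alpha + color_target * (1 - alpha) */
    uint32_t out = (color_in * alpha + color_target * (255u - alpha)) / 255u;

    return (uint8_t)out;
}
```

With both alpha values at 0xff the incoming color replaces the target color, and with a global alpha of 0x00 the target color is left unchanged.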

The alpha value can be interpolated over a triangle to create gradients. If this function is turned off (interpolation is disabled on triangle draws) then \alpha_{pixel} is set to 1.

2.6.7 Wishbone arbiter

Since two parts of the pipeline (the fragment processor and the blender) need to access video memory, the arbiter makes certain only one of them can access the reader at once. The blender has the highest priority.

2.6.8 Wishbone master read

The Wishbone reader handles all reads from video memory.

2.6.9 Renderer

The renderer calculates the memory address of the target pixel.

2.6.10 Wishbone master write

The Wishbone writer handles all writes to the video memory.

3 IO Ports

The core has three Wishbone interfaces:

• Wishbone slave – connects to the data bus of the OpenRISC processor. In the case of ORPSoC, this bus is connected through an arbiter. Supports standard Wishbone communication, but no burst modes.
• Wishbone master read-only – connects to a video memory port with read access. Used for fetching textures and during blending.
• Wishbone master write-only – connects to a video memory port with write access. Used for rendering pixels to the framebuffer.

The core provides an interrupt that can be connected to the interrupt pins on the or1200 CPU (in the supplied orpsoc_top it is connected to or1200_pic_ints[9]). For this interrupt to trigger, the correct bits in the control register have to be set.

4 Registers

    Name            Addr   Width  Access  Description
    CONTROL         0x00   32     RW      Control register
    STATUS          0x04   32     R       Status register
    ALPHA           0x08   32     RW      Global alpha register
    COLORKEY        0x0c   32     RW      Colorkey register
    TARGET_BASE     0x10   32     RW      Render target base
    TARGET_SIZE_X   0x14   32     RW      Render target width
    TARGET_SIZE_Y   0x18   32     RW      Render target height
    TEX0_BASE       0x1c   32     RW      Texture 0 base
    TEX0_SIZE_X     0x20   32     RW      Texture 0 width
    TEX0_SIZE_Y     0x24   32     RW      Texture 0 height
    SRC_P0_X        0x28   32     RW      Source pixel 0 x
    SRC_P0_Y        0x2c   32     RW      Source pixel 0 y
    SRC_P1_X        0x30   32     RW      Source pixel 1 x
    SRC_P1_Y        0x34   32     RW      Source pixel 1 y
    DEST_X          0x38   32     RW      Destination pixel x
    DEST_Y          0x3c   32     RW      Destination pixel y
    DEST_Z          0x40   32     RW      Destination pixel z
    AA              0x44   32     RW      Transformation matrix coefficient
    AB              0x48   32     RW      Transformation matrix coefficient
    AC              0x4c   32     RW      Transformation matrix coefficient
    TX              0x50   32     RW      Transformation matrix coefficient
    BA              0x54   32     RW      Transformation matrix coefficient
    BB              0x58   32     RW      Transformation matrix coefficient
    BC              0x5c   32     RW      Transformation matrix coefficient
    TY              0x60   32     RW      Transformation matrix coefficient
    CA              0x64   32     RW      Transformation matrix coefficient
    CB              0x68   32     RW      Transformation matrix coefficient
    CC              0x6c   32     RW      Transformation matrix coefficient
    TZ              0x70   32     RW      Transformation matrix coefficient
    CLIP_P0_X       0x74   32     RW      Clip pixel 0 x
    CLIP_P0_Y       0x78   32     RW      Clip pixel 0 y
    CLIP_P1_X       0x7c   32     RW      Clip pixel 1 x
    CLIP_P1_Y       0x80   32     RW      Clip pixel 1 y
    COLOR0          0x84   32     RW      Color 0
    COLOR1          0x88   32     RW      Color 1
    COLOR2          0x8c   32     RW      Color 2
    U0              0x90   32     RW      Texture coordinate 0 (u)
    V0              0x94   32     RW      Texture coordinate 0 (v)
    U1              0x98   32     RW      Texture coordinate 1 (u)
    V1              0x9c   32     RW      Texture coordinate 1 (v)
    U2              0xa0   32     RW      Texture coordinate 2 (u)
    V2              0xa4   32     RW      Texture coordinate 2 (v)
    ZBUFFER_BASE    0xa8   32     RW      Depth buffer base address

Each register is described in detail in the following sections, including the purpose of each bit in the register. The default value given for each register is the value set when the device receives a reset signal.
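As a rough illustration of how the register map above could be expressed in driver code, the sketch below lists a few of the byte offsets from the table and derives the offset of a matrix coefficient from its row and column. The `ORGFX_REG_` names, the helper functions and the base address used in the usage note are hypothetical; only the offsets are taken from the table.

```c
#include <stdint.h>

/* Byte offsets copied from the register table (subset shown). */
enum orgfx_reg_offset {
    ORGFX_REG_CONTROL       = 0x00,
    ORGFX_REG_STATUS        = 0x04,
    ORGFX_REG_ALPHA         = 0x08,
    ORGFX_REG_COLORKEY      = 0x0c,
    ORGFX_REG_TARGET_BASE   = 0x10,
    ORGFX_REG_TARGET_SIZE_X = 0x14,
    ORGFX_REG_TARGET_SIZE_Y = 0x18,
    ORGFX_REG_DEST_X        = 0x38,
    ORGFX_REG_DEST_Y        = 0x3c,
    ORGFX_REG_DEST_Z        = 0x40,
    ORGFX_REG_MATRIX_AA     = 0x44, /* first matrix coefficient */
    ORGFX_REG_COLOR0        = 0x84,
    ORGFX_REG_ZBUFFER_BASE  = 0xa8
};

/* Absolute address of a register, given the (platform specific)
 * base address the core is mapped at. */
static uint32_t orgfx_reg_addr(uint32_t base, uint32_t offset)
{
    return base + offset;
}

/* The twelve coefficients AA..TZ are laid out row by row, four
 * 32-bit registers per row: row 0 = AA AB AC TX, row 1 = BA BB BC TY,
 * row 2 = CA CB CC TZ. */
static uint32_t orgfx_matrix_reg(unsigned row, unsigned col)
{
    return ORGFX_REG_MATRIX_AA + 4u * (row * 4u + col);
}
```

For example, with a made-up base address of 0xb8000000 the status register would be found at `orgfx_reg_addr(0xb8000000u, ORGFX_REG_STATUS)`, and `orgfx_matrix_reg(2, 3)` gives the offset of TZ.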

4.1 Control Register (CONTROL)

    Bit #     Access  Description
    [31:20]   -       Reserved
    [19]      W       Transform point
    [18]      W       Forward point
    [17:16]   RW      Active point
    [15:14]   -       Reserved
    [13]      W       Bézier inside shape
    [12]      W       Interpolation
    [11]      W       Curve write
    [10]      W       Triangle write
    [9]       W       Line write
    [8]       W       Rect write
    [7]       -       Reserved
    [6]       RW      Z-buffer enable
    [5]       RW      Clipping enable
    [4]       RW      Colorkey enable
    [3]       RW      Blending enable
    [2]       RW      Texture0 enable
    [1:0]     RW      Color depth

    Default value: 0x00

Color depth is defined as follows:

    Mode  Color depth
    00    8 bit
    01    16 bit
    10    24 bit (not supported)
    11    32 bit

The active point is defined as follows:

    Mode  Point id
    00    p0
    01    p1
    10    p2
    11    -

The operations Forward point and Transform point read the current values of the destination registers and store the x, y, z values in the register of the active point inside the device.

4.2 Status Register (STATUS)

    Bit #     Access  Description
    [31:16]   R       Current FIFO size
    [15:1]    R       Reserved
    [0]       R       Busy pin (high when busy)

    Default value: -

4.3 Alpha (ALPHA)

    Bit #     Access  Description
    [31:24]   RW      Point 0 alpha
    [23:16]   RW      Point 1 alpha
    [15:8]    RW      Point 2 alpha
    [7:0]     RW      Global alpha

    Default value: 0xffffffff

The global alpha value is used in all rendering when alpha blending is enabled. 0xff is full opacity, while 0x00 is full transparency (nothing rendered). When interpolation of triangles is activated, the point alpha values are used to find an interpolated alpha value for each pixel. This value is then multiplied with the global alpha before being used for blending.

4.4 Colorkey register (COLORKEY)

    Bit #     Access  Description
    [31:0]    RW      Colorkey

    Default value: 0x00

By setting a colorkey, certain pixels in a texture can be discarded in the fragment stage, providing a hard transparency. Depending on the color depth, a mask is applied to the color. Using 8 bit color, only the 8 least significant bits in the colorkey will be compared with the texture color during the check. The colorkey enable bit in the control register must be set to enable this functionality.
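The masked colorkey comparison described above could look roughly as follows in software. The text only states the mask for 8 bit color; the masks used here for 16 and 32 bit color are assumptions extrapolated from that rule, and all names in the sketch are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Color depth encodings from CONTROL[1:0]. */
enum { DEPTH_8BPP = 0x0, DEPTH_16BPP = 0x1, DEPTH_32BPP = 0x3 };

/* Mask applied to both the colorkey and the texture color. */
static uint32_t colorkey_mask(unsigned depth_mode)
{
    switch (depth_mode) {
    case DEPTH_8BPP:  return 0x000000ffu; /* stated in the text   */
    case DEPTH_16BPP: return 0x0000ffffu; /* assumed by analogy   */
    default:          return 0xffffffffu; /* assumed for 32 bit   */
    }
}

/* Returns true when the fetched texel matches the colorkey and
 * would therefore be discarded in the fragment stage. */
static bool colorkey_discards(uint32_t texel, uint32_t colorkey,
                              unsigned depth_mode)
{
    uint32_t mask = colorkey_mask(depth_mode);
    return (texel & mask) == (colorkey & mask);
}
```

Under these assumptions, two colors that differ only above bit 7 match in 8 bit mode but not in 16 bit mode.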

4.5 Target base address Register (TARGET_BASE)

    Bit #     Access  Description
    [31:2]    RW      Video memory address
    [1:0]     -       Unused

    Default value: 0x00

4.6 Target size x Register (TARGET_SIZE_X)

    Bit #     Access  Description
    [31:0]    RW      Integer width

    Default value: 0x00

4.7 Target size y Register (TARGET_SIZE_Y)

    Bit #     Access  Description
    [31:0]    RW      Integer height

    Default value: 0x00

4.8 Texture 0 Base Register (TEX0_BASE)

    Bit #     Access  Description
    [31:2]    RW      Video memory address
    [1:0]     -       Unused

    Default value: 0x00

4.9 Texture 0 size x Register (TEX0_SIZE_X)

    Bit #     Access  Description
    [31:0]    RW      Integer width

    Default value: 0x00

4.10 Texture 0 size y Register (TEX0_SIZE_Y)

    Bit #     Access  Description
    [31:0]    RW      Integer height

    Default value: 0x00

4.11 Source Pixel position 0 x Register (SRC_P0_X)

    Bit #     Access  Description
    [31:0]    RW      Integer x pos

    Default value: 0x00

The source pixels are used to define a specific area in a texture to draw.

4.12 Source Pixel position 0 y Register (SRC_P0_Y)

    Bit #     Access  Description
    [31:0]    RW      Integer y pos

    Default value: 0x00

4.13 Source Pixel position 1 x Register (SRC_P1_X)

    Bit #     Access  Description
    [31:0]    RW      Integer x pos

    Default value: 0x00

4.14 Source Pixel position 1 y Register (SRC_P1_Y)

    Bit #     Access  Description
    [31:0]    RW      Integer y pos

    Default value: 0x00

4.15 Destination Pixel position Register (DEST_X)

    Bit #     Access  Description
    [31:16]   RW      Signed integer part
    [15:0]    RW      Fractional part

    Default value: 0x00

The active point field in the control register decides which point register inside the device receives the destination values. Points are pushed to the device by setting the forward or transform bit in the control register.
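To make the fixed point layout of the destination registers concrete, the following sketch converts between ordinary numbers and the 16.16 format (signed integer part in bits [31:16], fractional part in bits [15:0]). It assumes the default subpixel width of 16 bits; the function names are hypothetical and not part of the driver.

```c
#include <stdint.h>

#define SUBPIXEL_WIDTH 16 /* default subpixel width (assumed here) */

/* Whole pixel coordinate -> 16.16 register value. */
static uint32_t dest_from_int(int pixel)
{
    /* Cast before shifting so negative values are well defined. */
    return (uint32_t)pixel << SUBPIXEL_WIDTH;
}

/* Fractional coordinate -> 16.16 register value (truncates). */
static uint32_t dest_from_double(double value)
{
    return (uint32_t)(int32_t)(value * (double)(1 << SUBPIXEL_WIDTH));
}

/* Signed integer part of a 16.16 register value. */
static int32_t dest_int_part(uint32_t raw)
{
    int32_t hi = (int32_t)(raw >> SUBPIXEL_WIDTH);
    if (hi >= 0x8000)   /* sign bit of the 16-bit integer part */
        hi -= 0x10000;
    return hi;
}
```

For instance, the coordinate 1.5 is encoded as 0x00018000, and the integer coordinate -1 as 0xffff0000.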

4.16 Destination Pixel position Register (DEST_Y)

    Bit #     Access  Description
    [31:16]   RW      Signed integer part
    [15:0]    RW      Fractional part

    Default value: 0x00

4.17 Destination Pixel position Register (DEST_Z)

    Bit #     Access  Description
    [31:16]   RW      Signed integer part
    [15:0]    RW      Fractional part

    Default value: 0x00

4.18 Matrix coefficient registers

The matrix coefficients are defined in the following way:

    M = \begin{pmatrix} AA & AB & AC & TX \\ BA & BB & BC & TY \\ CA & CB & CC & TZ \end{pmatrix}

Each coefficient has a register, where the bits are defined as:

    Bit #     Access  Description
    [31:16]   RW      Signed integer part
    [15:0]    RW      Fractional part

The default matrix is set to no scaling, no rotation and no translation:

    M_{default} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}

4.19 Clip Pixel position 0 x Register (CLIP_P0_X)

    Bit #     Access  Description
    [31:0]    RW      Integer x

    Default value: 0x00

4.20 Clip Pixel position 0 y Register (CLIP_P0_Y)

    Bit #     Access  Description
    [31:0]    RW      Integer y

    Default value: 0x00

4.21 Clip Pixel position 1 x Register (CLIP_P1_X)

    Bit #     Access  Description
    [31:0]    RW      Integer x

    Default value: 0x00

4.22 Clip Pixel position 1 y Register (CLIP_P1_Y)

    Bit #     Access  Description
    [31:0]    RW      Integer y

    Default value: 0x00

4.23 Color Registers (COLOR0-2)

    Bit #     Access  Description
    [31:0]    RW      Color bits

    Default value: 0x00

There are several color modes available (set in the control register):

    Mode          Format
    32bpp         [31:24] is the alpha channel, [23:16] is R, [15:8] is G and [7:0] is B
    16bpp         [15:11] is R, [10:5] is G and [4:0] is B
    8bpp gray     [7:0] sets the R, G and B values
    8bpp palette  [7:0] sets the color index in the palette

Currently only 16 bit color depth is fully supported.

4.24 Texture coordinate Registers (U0-2 and V0-2)

    Bit #     Access  Description
    [31:0]    RW      Coordinate bits (integer)

    Default value: 0x00

4.25 Depth buffer Register (ZBUFFER_BASE)

    Bit #     Access  Description
    [31:2]    RW      32-bit word base address
    [1:0]     -       Ignored

    Default value: 0x00

This register holds the base address of the depth buffer.
The depth buffer operations use TARGET_SIZE_X and TARGET_SIZE_Y for the size of the depth buffer (it is assumed that the render target and the depth buffer are of the same size).

5 Operation

All hardware accelerated operations draw pixels to the currently active surface (defined by the target base and size registers). These operations are all affected by clip_p0 and clip_p1. No pixels that fall outside the clipping rectangle will be rendered.

5.1 Draw pixel

Input needed: dest_p0, color0

The ORGFX has no hardware support for writing a single pixel to the video memory. However, it is possible to draw a line, rect or curve with the size of one pixel. The software API makes it possible to draw a pixel by writing directly to the memory (this is the most efficient way). Since the video memory can hold both the framebuffer and textures, the same operation can be used to draw an arbitrary pixel to the screen and to load a texture into video memory.

5.2 Fill rect

Input needed: ctrl, dest_p0, dest_p1, color0, [src_p0, src_p1]

Fill rect fills the rectangular area between the pixels dest_p0 and dest_p1 with color. If texturing is enabled, the color is taken from the active texture in the area between src_p0 and src_p1. This operation is hardware accelerated, and is activated by setting the Rect write bit in the control register.

5.3 Line

Input needed: ctrl, dest_p0, dest_p1, color0

Line draws a line between the pixels dest_p0 and dest_p1 with color. This operation is hardware accelerated.

5.4 Triangle

Input needed: ctrl, dest_p0, dest_p1, dest_p2, color0, [color1, color2, u0, v0, u1, v1, u2, v2]

Triangle draws the pixels in the triangle formed by dest_p0, dest_p1 and dest_p2. The triangle can be colored with either a flat color, a gradient or a texture. Gradient or textured coloring requires the interpolation bit to be set in the control register.
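The sketch below shows one plausible sequence of register writes for a flat colored triangle, inferred from the register descriptions in section 4: each point is written to DEST_X and DEST_Y and then forwarded to the active point, after which the color is set and the Triangle write bit is raised. The write order, the macro and function names and the mock `reg_write32` (which only records writes instead of touching hardware) are all assumptions; the real driver may order or combine these writes differently.

```c
#include <stdint.h>

/* Register offsets and CONTROL bits taken from section 4. */
#define REG_CONTROL 0x00u
#define REG_DEST_X  0x38u
#define REG_DEST_Y  0x3cu
#define REG_COLOR0  0x84u

#define CTRL_COLOR_DEPTH_16  (0x1u << 0)
#define CTRL_TRIANGLE_WRITE  (1u << 10)
#define CTRL_ACTIVE_POINT(n) ((uint32_t)(n) << 16)
#define CTRL_FORWARD_POINT   (1u << 18)

/* Mock bus: records writes so the sequence can be inspected. */
struct reg_write { uint32_t offset; uint32_t value; };
static struct reg_write write_log[32];
static int write_count;

static void reg_write32(uint32_t offset, uint32_t value)
{
    if (write_count < 32) {
        write_log[write_count].offset = offset;
        write_log[write_count].value = value;
        write_count++;
    }
}

/* Load one destination point (16.16 fixed point) and forward it
 * to point register p0, p1 or p2. */
static void push_point(unsigned id, int x, int y)
{
    reg_write32(REG_DEST_X, (uint32_t)x << 16);
    reg_write32(REG_DEST_Y, (uint32_t)y << 16);
    reg_write32(REG_CONTROL, CTRL_COLOR_DEPTH_16 |
                             CTRL_ACTIVE_POINT(id) |
                             CTRL_FORWARD_POINT);
}

/* Flat colored triangle: three points, one color, then trigger. */
static void draw_flat_triangle(int x0, int y0, int x1, int y1,
                               int x2, int y2, uint32_t color)
{
    push_point(0, x0, y0);
    push_point(1, x1, y1);
    push_point(2, x2, y2);
    reg_write32(REG_COLOR0, color);
    reg_write32(REG_CONTROL, CTRL_COLOR_DEPTH_16 | CTRL_TRIANGLE_WRITE);
}
```

On real hardware `reg_write32` would be replaced by a memory-mapped write to the core's base address plus the offset; because writes pass through the instruction FIFO, no busy polling is shown between them.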

5.5 Curve

Input needed: ctrl, dest_p0, dest_p1, dest_p2, color0, [color1, color2, u0, v0, u1, v1, u2, v2]

Draws a filled quadratic Bézier curve with dest_p0 as start, dest_p1 as control point and dest_p2 as end. For this operation to work, the interpolation bit must be set in the control register.

6 Clocks

The entire component uses the same clock domain.

7 Driver interface

The ORSoC graphics accelerator offers three different APIs to code against: two bare metal libraries for coding directly against the processor, and a Linux kernel module. The extended bare metal interface is a wrapper around the basic bare metal API, and makes coding easier by reducing the number of calls. The drawback is less control over the graphics card.

7.1 newlib

The basic library is provided in orgfx.h and orgfx.c. The bare metal library declares a structure that can hold surfaces (both framebuffers and textures). Many functions take a pointer to one of these structures.

    struct orgfx_surface
    {
        unsigned int addr;
        unsigned int w;
        unsigned int h;
    };

7.1.1 orgfx_init

Description: orgfx_init must be called first for the other orgfx functions to work properly.

    void orgfx_init(unsigned int memoryArea);

7.1.2 orgfx_vga_set_videomode

Description: Sets the video mode: width, height and bpp.

    void orgfx_set_videomode(unsigned int width,
                             unsigned int height,
                             unsigned char bpp);

7.1.3 orgfx_vga_set_vbara

Description: Assigns a memory address to "Video Base Address Register A".

    void orgfx_vga_set_vbara(unsigned int addr);

7.1.4 orgfx_vga_set_vbarb

Description: Assigns a memory address to "Video Base Address Register B".

    void orgfx_vga_set_vbarb(unsigned int addr);

7.1.5 orgfx_vga_bank_switch

Description: Switches the framebuffer.

    void orgfx_vga_bank_switch();

7.1.6 orgfx_init_surface

Description: Initializes a surface and returns a control structure for it.
This function increments an internal video memory stack pointer, so each surface will be allocated after the previous one in memory (starting at the memoryArea set by orgfx_init). There is currently no memory management in place to recycle surface memory once it is no longer in use. The first surface initialized will point to the same memory that the video controller reads from, so it should be initialized with the width and height of the screen.

    struct orgfx_surface orgfx_init_surface(unsigned int width,
                                            unsigned int height);

7.1.7 orgfx_bind_rendertarget

Description: Binds a surface as the active render target. This function must be called before any drawing operations can be performed.

    void orgfx_bind_rendertarget(struct orgfx_surface *surface);

7.1.8 orgfx_enable_cliprect

Description: Enables/disables clipping.

    inline void orgfx_enable_cliprect(unsigned int enable);

7.1.9 orgfx_cliprect

Description: Sets the clipping rect. No pixels will be drawn outside of this rect (useful for restricting draws to a specific area of the render target). orgfx_bind_rendertarget will reset the clipping rect to the size of the surface.

    inline void orgfx_cliprect(unsigned int x0,
                               unsigned int y0,
                               unsigned int x1,
                               unsigned int y1);

7.1.10 orgfx_srcrect

Description: Sets the source rectangle that will be used by texturing operations. This allows for only drawing a small part of a texture. orgfx_bind_tex0 will reset this to the size of the texture.

    inline void orgfx_srcrect(unsigned int x0,
                              unsigned int y0,
                              unsigned int x1,
                              unsigned int y1);

7.1.11 orgfx_set_pixel

Description: Sets the pixel at coordinate x,y to color. This is done in software by direct memory writes. This operation is not affected by the clipping rect!

    inline void orgfx_set_pixel(int x,
                                int y,
                                unsigned int color);

7.1.12 orgfx_memcpy

Description: Copies memory from the processor to the video memory. Size is in 32-bit words. This function is intended to work with the output array of the sprite converter utility to load images into memory. Remember to bind a texture as the render target first!

    void orgfx_memcpy(unsigned int mem[],
                      unsigned int size);

7.1.13 orgfx_set_color

Description: Sets the current drawing color (for flat coloring).

    inline void orgfx_set_color(unsigned int color);

7.1.14 orgfx_set_colors

Description: Sets all the current drawing colors (for gradient coloring).

    inline void orgfx_set_colors(unsigned int color0,
                                 unsigned int color1,
                                 unsigned int color2);

7.1.15 orgfx_rect

Description: Draws a rect from (x0,y0) to (x1,y1) and fills it with the current drawing color. If texturing is enabled, the current texture will be drawn instead.

    inline void orgfx_rect(int x0,
                           int y0,
                           int x1,
                           int y1);

7.1.16 orgfx_line

Description: Draws a line from (x0,y0) to (x1,y1) with the current drawing color. If texturing is enabled, the first pixel of the current texture will be drawn instead.

    inline void orgfx_line(int x0, int y0,
                           int x1, int y1);

7.1.17 orgfx_line3d

Description: Draws a line from (x0,y0,z0) to (x1,y1,z1) with the current drawing color. If texturing is enabled, the first pixel of the current texture will be drawn instead.

    inline void orgfx_line3d(int x0, int y0, int z0,
                             int x1, int y1, int z1);

7.1.18 orgfx_triangle

Description: Draws a filled triangle spanned by the points (x0,y0), (x1,y1) and (x2,y2). The order of the points is important, since triangles calculated to be counter clockwise will be discarded (backface culling).
The interpolate flag indicates if flat coloring or interpolated coloring should be used. The interpolate flag must be enabled if interpolated alpha, texture coordinates or depth buffer culling is desired (flat coloring can be obtained by setting all three color registers to the same color).

    inline void orgfx_triangle(int x0, int y0,
                               int x1, int y1,
                               int x2, int y2,
                               unsigned int interpolate);

7.1.19 orgfx_triangle3d

Description: This function works the same way as the triangle function, but the Z-values are also set.

    inline void orgfx_triangle3d(int x0, int y0, int z0,
                                 int x1, int y1, int z1,
                                 int x2, int y2, int z2,
                                 unsigned int interpolate);

7.1.20 orgfx_curve

Description: Draws a quadratic Bézier curve between the points (x0,y0) and (x2,y2) with the control point (x1,y1). The three points form a triangle. The inside flag determines if the inside or the outside of the curve is filled within the triangle.

    inline void orgfx_curve(int x0, int y0,
                            int x1, int y1,
                            int x2, int y2,
                            unsigned int inside);

7.1.21 orgfx_uv

Description: Sets the three texture coordinates used in textured triangle renders.

    inline void orgfx_uv(unsigned int u0, unsigned int v0,
                         unsigned int u1, unsigned int v1,
                         unsigned int u2, unsigned int v2);

7.1.22 orgfx_enable_tex0

Description: Enables or disables texturing.

    void orgfx_enable_tex0(unsigned int enable);

7.1.23 orgfx_bind_tex0

Description: Binds a surface as the current texture. Will reset the source rect.

    void orgfx_bind_tex0(struct orgfx_surface *surface);

7.1.24 orgfx_enable_zbuffer

Description: Enables or disables reads and writes to the depth buffer. Requires that a depth buffer is bound.

    void orgfx_enable_zbuffer(unsigned int enable);

7.1.25 orgfx_bind_zbuffer

Description: Binds the depth buffer. This surface should have the same resolution as the render target.

    void orgfx_bind_zbuffer(struct orgfx_surface *surface);

7.1.26 orgfx_clear_zbuffer

Description: Clears the depth buffer.

    void orgfx_clear_zbuffer();

7.1.27 orgfx_enable_alpha

Description: Enables or disables alpha blending.

    void orgfx_enable_alpha(unsigned int enable);

7.1.28 orgfx_set_alpha

Description: Sets the alpha blending value.

    void orgfx_set_alpha(unsigned int alpha);

7.1.29 orgfx_enable_colorkey

Description: Enables or disables colorkey.

    void orgfx_enable_colorkey(unsigned int enable);

7.1.30 orgfx_set_colorkey

Description: Sets the colorkey color.

    void orgfx_set_colorkey(unsigned int colorkey);

7.1.31 orgfx_enable_transform

Description: Enables or disables hardware accelerated transformation of points.

    void orgfx_enable_transform(unsigned int enable);

7.1.32 orgfx_set_transformation_matrix

Description: Sets the 3 by 4 transformation matrix used in hardware.

    void orgfx_set_transformation_matrix(int aa, int ab, int ac, int tx,
                                         int ba, int bb, int bc, int ty,
                                         int ca, int cb, int cc, int tz);

7.2 Extended newlib

The extended library is provided in orgfx_plus.h and orgfx_plus.c, but orgfx.c also has to be compiled for it to work. Instead of using surface structs directly, the extended API hides surface management by returning id tags for each surface. The screen surface (defined by id -1) is handled as a single surface, even when double buffering is enabled. The driver defines the number of available surfaces (not counting the screen) with a static define. Change this if the default value is too low for your application.

There are no 3D functions in this API. For the more advanced 3D functionality (meshes, depth buffering etc.), see the advanced API.

7.2.1 orgfxplus_init

Description: Initializes the screen with the supplied video mode and returns an id for the screen. The only supported bpp is 16. Double buffering and depth buffering can be enabled (and the appropriate buffers will be allocated). The depth buffer is allocated with the same size as the screen. There is no support in the driver to allocate more than one depth buffer.

    int orgfxplus_init(unsigned int width,
                       unsigned int height,
                       unsigned char bpp,
                       unsigned char doubleBuffering,
                       unsigned char zbuffer);

7.2.2 orgfxplus_init_surface

Description: Unlike the basic API, this function both initializes a surface and loads a prepared image to it in one function call. The return value is an id that can be used to bind the surface. It changes render target during operation, but switches back to the last render target on completion. Since the screen(s) are already initialized by a call to init, they do not need to be loaded using this function.

    int orgfxplus_init_surface(unsigned int width,
                               unsigned int height,
                               unsigned int mem[]);

7.2.3 orgfxplus_bind_rendertarget

Description: Binds a surface as the current render target.

    void orgfxplus_bind_rendertarget(int surface);

7.2.4 orgfxplus_bind_tex0

Description: Binds a surface as the current active texture.

    void orgfxplus_bind_tex0(int surface);

7.2.5 orgfxplus_flip

Description: Swaps which buffer to draw on when using double buffering. Needs to be called once before anything shows up on screen!

    void orgfxplus_flip();

7.2.6 orgfxplus_clip

Description: Sets the current clipping rect. This is reset to the size of the new render target when orgfxplus_bind_rendertarget is called.
    inline void orgfxplus_clip(unsigned int x0,
                               unsigned int y0,
                               unsigned int x1,
                               unsigned int y1,
                               unsigned int enable);

7.2.7 orgfxplus_fill
Description: Draws a rectangle to the current render target with a flat color.

    void orgfxplus_fill(int x0, int y0,
                        int x1, int y1,
                        unsigned int color);

7.2.8 orgfxplus_line
Description: Draws a line from (x0,y0) to (x1,y1) to the current render target with a flat color.

    void orgfxplus_line(int x0, int y0,
                        int x1, int y1,
                        unsigned int color);

7.2.9 orgfxplus_triangle
Description: Draws a triangle between the points (x0,y0), (x1,y1) and (x2,y2) and fills it with a color.

    void orgfxplus_triangle(int x0, int y0,
                            int x1, int y1,
                            int x2, int y2,
                            unsigned int color);

7.2.10 orgfxplus_curve
Description: Draws a quadratic Bézier curve from (x0,y0) to (x2,y2) with the control point (x1,y1). Uses flat coloring.

    void orgfxplus_curve(int x0, int y0,
                         int x1, int y1,
                         int x2, int y2,
                         unsigned int inside,
                         unsigned int color);

7.2.11 orgfxplus_draw_surface
Description: Draws a texture to the current render target.

    void orgfxplus_draw_surface(int x0, int y0,
                                unsigned int surface);

7.2.12 orgfxplus_draw_surface_section
Description: Draws a section of a texture defined by src0, src1 to the current render target.

    void orgfxplus_draw_surface_section(int x0, int y0,
                                        unsigned int srcx0, unsigned int srcy0,
                                        unsigned int srcx1, unsigned int srcy1,
                                        unsigned int surface);

7.2.13 orgfxplus_colorkey
Description: Sets the colorkey color and enables or disables the use of the colorkey.
    void orgfxplus_colorkey(unsigned int colorkey,
                            unsigned int enable);

7.2.14 orgfxplus_alpha
Description: Sets the alpha value and enables or disables the use of alpha blending.

    void orgfxplus_alpha(unsigned int alpha,
                         unsigned int enable);

7.3 Bitmap Fonts

Note that bitmap fonts can be generated with the bitfontmaker utility. This utility generates an initialization function that calls the orgfx_make_bitmap_font function and returns a valid font.

7.3.1 orgfx_make_bitmap_font
Creates an orgfx_bitmap_font from an image. glyphSpacing is the space in pixels between two glyphs in the string, and spaceWidth is the width of the space character.

    orgfx_bitmap_font orgfx_make_bitmap_font(orgfx_tileset *glyphs,
                                             unsigned int glyphSpacing,
                                             unsigned int spaceWidth);

7.3.2 orgfx_put_text
Puts the text "str" on the screen with the specified "font" at position x0,y0.

    void orgfx_put_text(orgfx_font *font,
                        int x0, int y0,
                        const wchar_t *str);

Note the use of wide strings (which enables the use of special characters such as åäö). Example usage:

    orgfx_put_text(&font, x0, y0, L"Some example text");

7.4 Vector Fonts

Note that vector fonts can be generated with the fonter utility. This utility generates an initialization function that calls the orgfx_make_vector_font and orgfx_init_vector_font functions and returns a valid font.

7.4.1 orgfx_make_vector_font
Creates an orgfx_vector_font from a series of glyphs.

    orgfx_vector_font orgfx_make_vector_font(Glyph *glyphlist,
                                             int size,
                                             Glyph **glyphindexlist,
                                             int glyphindexlistsize);

7.4.2 orgfx_init_vector_font
Initializes the font for use. Needs to be called to set the index list.
    int orgfx_init_vector_font(orgfx_vector_font font);

7.4.3 orgfx_put_vector_char
Prints one glyph from the font with the current transformation matrix. If the glyph is not supported in the font the function will return without doing anything.

    void orgfx_put_vector_char(orgfx_vector_font *font,
                               wchar_t text);

7.4.4 orgfx_put_vector_text
Prints a string of characters using a vector font. This function sets the transformation matrix from the offset, scale and rotation parameters, then makes a series of calls to orgfx_put_vector_char.

    void orgfx_put_vector_text(orgfx_vector_font *font,
                               orgfx_point3 offset,
                               orgfx_point3 scale,
                               orgfx_point3 rotation,
                               const wchar_t *str,
                               unsigned int color);

7.5 3D API

There are two major parts of the 3D API, one is the transformation matrix interface and the other is the 3D mesh interface.

7.5.1 Transformations
By setting the transformation matrix the ORGFX core can perform hardware accelerated transformations for every point sent to it, causing significantly less overhead than if this was done in software. The relevant functions are listed below:

    orgfx_matrix orgfx3d_identity(void);
    orgfx_matrix orgfx3d_rotateX(orgfx_matrix mat, float rad);
    orgfx_matrix orgfx3d_rotateY(orgfx_matrix mat, float rad);
    orgfx_matrix orgfx3d_rotateZ(orgfx_matrix mat, float rad);
    orgfx_matrix orgfx3d_scale(orgfx_matrix mat, orgfx_point3 s);
    orgfx_matrix orgfx3d_translate(orgfx_matrix mat, orgfx_point3 t);
    inline void orgfx3d_set_matrix(orgfx_matrix mat);

7.5.2 orgfx3d_make_mesh
Initializes a mesh with the necessary arrays generated by the meshmaker utility.
    orgfx_mesh orgfx3d_make_mesh(orgfx_face *faces,
                                 unsigned int nFaces,
                                 orgfx_point3 *verts,
                                 unsigned int nVerts,
                                 orgfx_point2 *uvs,
                                 unsigned int nUvs);

7.5.3 orgfx3d_mesh_texture_size
This should be called only once for each mesh that will be using texture coordinates. Since the ORGFX device uses pixel coordinates the UV coordinates must be updated with the size of the used texture.

    void orgfx3d_mesh_texture_size(orgfx_mesh *mesh,
                                   unsigned int width,
                                   unsigned int height);

7.5.4 orgfx3d_draw_mesh
This function draws the mesh to screen, using the supplied translation, rotation and scale vectors to set the transformation matrix. If filled is set to zero, the mesh will be drawn as a colored wireframe. If filled is set to one and textured to zero, the mesh will be drawn with interpolated colors (the mesh format currently does not support materials). If filled is set to one and textured is also set to one, the mesh will be textured using interpolated uv texture coordinates.

    void orgfx3d_draw_mesh(orgfx_mesh *mesh,
                           orgfx_point3 translation,
                           orgfx_point3 rotation,
                           orgfx_point3 scale,
                           int filled,
                           int textured);

7.6 Linux

The current version of the core does not have a Linux driver.

7.7 Software emulation

The entire device has a software implementation to make it easier to write applications for the device. The orgfx_sw.c file replaces the orgfx.c and orgfx_plus.c files, and renders pixels as they would be rendered by the graphics accelerator, but on a PC. The software implementation uses SDL as the backend.

7.8 Utilities

7.8.1 Sprite maker utility
A small application that converts an image into a header file that can be included in the project when compiled. The application generates an array of color values that can be loaded as a sprite.
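As an illustration of what one 16-bit color value in such an array might look like, the sketch below packs an 8-bit-per-channel pixel into 16 bits. It assumes the common RGB565 layout, which is consistent with the ff00ff to f81f colorkey pair quoted in the programming example, but the function name `sketch_rgb888_to_rgb565` is hypothetical and the utility's actual conversion (for example its rounding) may differ.

```c
#include <stdint.h>

/* Pack 8-bit R, G, B channels into one 16-bit value, assuming RGB565:
 * 5 bits red (bits 15..11), 6 bits green (bits 10..5), 5 bits blue (bits 4..0).
 * The low bits of each channel are simply truncated. */
uint16_t sketch_rgb888_to_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

Under this assumption the pink background color ff00ff maps to 0xf81f, the value passed to orgfxplus_colorkey in the example program.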
The application has support for reading common image file formats such as bmp, png and jpg (for a full list, see the supported file formats of the SDL_image library). 8-, 16- and 32-bit output is supported, and can be changed by passing a command line argument to the program (by default, the output is adjusted for 16 bit color mode). The resulting output header file, which is named after the input, can be included in a program using the extended bare metal driver. The easiest way to use the sprite is to use the generated initialize function defined in the header file.

7.8.2 Bitmap font maker utility
Another application generates the data structures necessary to load bitmap fonts with very little effort. It takes an image and a grid spacing as input, and automatically generates offsets for all the glyphs in the font. The font generated by the program has 256 characters arranged according to the ASCII charset, as seen in figures 5 and 6.

Figure 5: The ASCII table. Each number from 0 to 127 refers to a character. The numbers 0 to 31 cannot be printed.

Figure 6: The extended ASCII table. Each number from 128 to 255 refers to a character, mostly special characters not included in the basic table.

Figure 7: A font rendered by the software implementation of the ORGFX. Bézier curves are single colored while the triangles are interpolated between current color and black.

The application has support for reading common image file formats such as bmp, png and jpg (for a full list, see the supported file formats of the SDL_image library). 8-, 16- and 32-bit output is supported, and can be changed by passing a command line argument to the program (by default, the output is adjusted for 16 bit color mode). Both vertical and horizontal grid spacing are set to 32 pixels by default, but this can be changed through command line arguments. The resulting output header file, which is named after the input, can be included in a program using the bare metal and font driver.
The easiest way to use the bitmap font is to use the generated initialize function defined in the header file.

7.8.3 Mesh maker utility
The mesh maker utility loads 3D objects and generates a header file that can be used by the advanced 3D API. Currently the utility supports only Wavefront .obj files that contain only triangles (3rd order polygons). Any higher order polygons will be discarded, so all polygons in the model must be converted to triangles prior to running the utility. The application supports loading texture coordinates for each vertex, allowing for textured meshes. The resulting output header file, which is named after the input, can be included in a program using the bare metal 3D API. The easiest way to use the mesh is to use the generated initialize function defined in the header file.

7.8.4 Vector font maker utility
The font maker is an application that converts a .TTF file to a format that the graphics accelerator can handle. It outputs a .h file that can be included in a project to enable the graphics accelerator's vector font capabilities. The converter finds all explicit vector points in a TTF file, then calculates the implicit points and checks where the glyph contours end. The points are then sent to a Delaunay triangulation function based on the work of V. Domiter and B. Zalik and implemented by M. Green and T. Åhlén (poly2tri, http://code.google.com/p/poly2tri/). The generated header file contains two lists for each glyph, one to store Bézier writes and one to store triangle writes. The rendered result can be seen in figure 7.

8 Programming examples

The following piece of code shows how to use the extended interface for a bare metal implementation on the ORPSoCv2 platform. Bahamut_cc.png.h is a 186 by 248 pixel image with a pinkish background (rgb code ff00ff, or f81f in 16 bit).
The header file is generated by the sprite maker utility at 16 bit color depth.

    #include "orgfx_plus.h"
    #include "Bahamut_cc.png.h"

    int main(void)
    {
        int i;

        // Initialize screen to 640x480-16@60
        // No double buffering, no depth buffer
        int screen = orgfxplus_init(640, 480, 16, 0, 0);

        // Initialize dragon sprite
        int bahamut_sprite = orgfxplus_init_surface(186, 248, Bahamut_cc);

        // Activate colorkeying
        orgfxplus_colorkey(0xf81f, 1);

        // Clear screen, white color
        orgfxplus_fill(0, 0, 640, 480, 0xffff);

        // Draw a few lines with different colors
        orgfxplus_line(200, 100, 10, 10, 0xf000);
        orgfxplus_line(200, 100, 351, 31, 0x0ff0);
        orgfxplus_line(200, 100, 121, 231, 0x00f0);
        orgfxplus_line(200, 100, 321, 231, 0xf00f);

        // Draw the dragon at different alpha settings
        orgfxplus_alpha(64, 1);
        orgfxplus_draw_surface(100, 100, bahamut_sprite);
        orgfxplus_alpha(128, 1);
        orgfxplus_draw_surface(120, 102, bahamut_sprite);
        orgfxplus_alpha(255, 1);
        orgfxplus_draw_surface(140, 104, bahamut_sprite);

        while (1);
    }

More example programs are supplied with the implementation in the sw/examples directory.

B Appendix B, Enhanced VGA/LCD Specification

VGA/LCD Core v2.0 Specifications
Author: Richard Herveille
rherveille@opencores.org
Document rev. 1.2
March 20, 2003

OpenCores Enhanced VGA/LCD Core Datasheet 3/20/2003

Revision History
Rev.
0.1 0.1a Date 10/04/01 20/04/01 Author Richard Herveille Richard Herveille 0.2 21/05/01 Richard Herveille 0.3 0.4 28/05/01 03/06/01 Richard Herveille Richard Herveille 0.4a 04/06/01 Richard Herveille 0.5 15/07/01 Richard Herveille 0.6 31/07/01 Richard Herveille 0.7 10/19/01 Richard Herveille 0.8 28/01/02 Richard Herveille 1.0 28/03/02 Richard Herveille 1.1 1.2 20/04/02 18/03/03 Richard Herveille Richard Herveille www.opencores.org Description First Draft Changed proposal to specifications Added Appendix A Extended Register Specifications First official release Added OpenCores logo Changed Chapter 1, Introduction Finished Chapter 2, IO ports Finished Chapter 3, Registers Extended Chapter 4, Operation Changed Chapter 5, Architecture Added Appendix B Fixed some inconsistencies. Changed all references to address related subjects (core fix & documentation fix). Added Appendix C Fixed some minor typing errors in the document (credits: Rudolph Usselmann) Added Color Lookup Table bank switching. Added embedded CLUT section. Revised horizontal & vertical timing section. Added Power-on-Reset description. Changed CBSE & VBSE bits functionality. Added Bank Switch Section. Added VGA & CLUT section to Appendix B. Changed introduction page. Major VGA/LCD Core changes; core v2.0. Changed Manual to reflect core changes. Removed all references to external CLUT v2.0 core has CLUT internally. Fixed some typos. Added 32bpp mode. Added Bandwidth Issues section. Expanded Bandwidth Issues section. Added Hardware Cursor sections. Added Table of Contents. Added Appendix-D. Changed Architecture section. Changed Operation section. Changed introduction page. Changed table headers. Added OpenCores logo to page header. Revised entire document. Changed VGA timing section. Added support for WISHBONE revB.3 Synchronous Registered Feedback Cycles. 
Table of contents

INTRODUCTION
IO PORTS
  2.1 CORE PARAMETERS
  2.2 WISHBONE SYSCON INTERFACE CONNECTIONS
  2.3 WISHBONE SLAVE INTERFACE CONNECTIONS
  2.4 WISHBONE MASTER INTERFACE CONNECTIONS
  2.5 VGA PORT CONNECTIONS
REGISTERS
  3.1 REGISTERS LIST
  3.2 ACCESSING RESERVED ADDRESS LOCATIONS
  3.3 CONTROL REGISTER [CTRL]
  3.4 STATUS REGISTER [STAT]
  3.5 HORIZONTAL TIMING REGISTER [HTIM]
  3.6 VERTICAL TIMING REGISTER [VTIM]
  3.7 HORIZONTAL AND VERTICAL LENGTH REGISTER [HVLEN]
  3.8 VIDEO BASE ADDRESS [VBARA] [VBARB]
  3.9 HARDWARE CURSOR BASE ADDRESS [C0BAR] [C1BAR]
  3.10 HARDWARE CURSOR (X,Y) REGISTER [C0XY] [C1XY]
  3.11 HARDWARE CURSOR COLOR REGISTERS [C0CR] [C1CR]
  3.12 8BPP PSEUDO COLOR LOOKUP TABLE [PCLT]
OPERATION
  4.1 VIDEO TIMING
    4.1.1 HORIZONTAL VIDEO TIMING
    4.1.2 VERTICAL VIDEO TIMING
    4.1.3 COMBINED VIDEO FRAME TIMING
  4.2 PIXEL COLOR GENERATION
    4.2.1 COLOR PROCESSOR INTERNALS
    4.2.2 ADDRESS GENERATOR
    4.2.3 DATA BUFFER
    4.2.4 COLORIZER
    4.2.5 COLOR LOOKUP TABLE
  4.3 HARDWARE CURSORS
    4.3.1 INTRODUCTION
    4.3.2 CURSOR PATTERNS
    4.3.3 TURNING OFF 3D SUPPORT
    4.3.4 CURSOR PROCESSOR INTERNALS
    4.3.5 ADDRESS GENERATOR
    4.3.6 CURSOR BUFFER
    4.3.7 CURSOR0/CURSOR1 PROCESSOR
  4.4 BANK SWITCHING
    4.4.1 INTRODUCTION
    4.4.2 HOST NOTES
    4.4.3 SEQUENCE
  4.5 BANDWIDTH ISSUES
    4.5.1 INTRODUCTION
    4.5.2 CALCULATIONS
    4.5.3 EXAMPLES
ARCHITECTURE
  5.1 COLOR LOOKUP TABLE
  5.2 CURSOR BASE REGISTERS
  5.2 CURSOR BUFFERS
  5.3 CURSOR PROCESSOR
  5.4 COLOR PROCESSOR
  5.5 LINE FIFO
  5.6 VIDEO MEMORY BASE REGISTERS
  5.7 VIDEO TIMING GENERATOR
  5.8 WISHBONE MASTER INTERFACE
  5.9 WISHBONE SLAVE INTERFACE
VGA MODES
  A.1 VERTICAL TIMING INFORMATION COMMON VGA MODES
  A.2 HORIZONTAL TIMING INFORMATION COMMON VGA MODES
TARGET DEPENDENT IMPLEMENTATIONS
CORE STRUCTURE
DESIGN NOTES
  D.1 INTRODUCTION
  D.2 VGA_CURPROC

1 Introduction

Features
• CRT and LCD display support
• Separate VSYNC/HSYNC and combined CSYNC synchronization signals
• Composite BLANK signal
• User programmable video timing
• User programmable video resolutions
• User programmable video control signals polarization levels
• 32bpp, 24bpp and 16bpp color modes
• 8bpp grayscale and 8bpp pseudo-color modes
• Supports video- and/or color-lookuptable bank switching during vertical retrace
• Support for up to two hardware cursors
• Per cursor user selectable resolutions, 32x32 pixels and 64x64 pixels
• Alpha blending support for 3D cursors
• Triple display support
• 32bit WISHBONE RevB.3 compliant Slave and Master interfaces
• Operation from a wide range of input clock frequencies
• Static synchronous design
• Full synthesizability

General Description
The OpenCores Enhanced VGA/LCD Controller Core provides VGA capabilities for embedded systems. It supports both CRT and LCD displays with user programmable resolutions and video timings, thus providing compatibility with almost all available LCD and CRT displays. The core supports a number of color modes, including 32bpp, 24bpp, 16bpp, 8bpp grayscale, and 8bpp pseudo color.

The video memory is located outside the primary core, thus providing the most flexible memory solution possible. It can be located on-chip or off-chip, shared with the system's main memory (VGA on demand) or be dedicated to the VGA system. The color lookup table is located inside the core, to reduce memory bandwidth requirements and to provide higher throughput. Image data is fetched automatically via the WISHBONE Master interface, making this an ideal "program-and-forget" video solution.

More demanding video applications, like streaming video or video games, can benefit from the video-bank-switching function. Flicker and cluttered images are reduced by automatically switching between video-memory pages and/or color lookup tables on each vertical retrace.

The optional hardware cursors provide additional flexibility through two 32x32 16bpp or 64x64 4bpp hardware generated cursors. The two cursors can be displayed at the same time.
Typically, one is for the GUI and one for user applications. Cursor patterns are stored in an off-screen portion of the video memory or, if accessible by the core, in the main memory and are automatically loaded into internal buffers to reduce memory bandwidth requirements. Moving the cursors on the screen is as simple as changing a single register.

The core can interrupt the host on each horizontal and/or vertical sync pulse. The horizontal, vertical, and composite synchronization polarization levels, as well as the blanking polarization level are programmable by software.

[Figure: Core overview]

2 IO ports

2.1 Core Parameters

    Parameter          Type     Default  Description
    ARST_LVL           Bit      1'b0     Asynchronous reset level
    LINE_FIFO_AWIDTH   Integer  7        Line FIFO size

2.1.1 ARST_LVL
The asynchronous reset level can be set to either active high (1'b1) or active low (1'b0).

2.1.2 LINE_FIFO_AWIDTH
The line FIFO size can be altered by changing the amount of address bits the FIFO logic should use. The line FIFO depth (amount of entries) can be calculated as follows:

$\text{entries} = 2^{\text{LINE\_FIFO\_AWIDTH}}$

2.2 WISHBONE Syscon Interface Connections

    Port       Width  Direction  Description
    wb_clk_i   1      Input      Master clock input
    wb_rst_i   1      Input      Synchronous active high reset
    rst_i      1      Input      Asynchronous reset
    wb_inta_o  1      Output     Interrupt request signal

2.2.1 wb_clk_i
All internal WISHBONE logic is registered to the rising edge of the [wb_clk_i] clock input. The frequency range over which the core can operate depends on the technology used and the pixel clock needed; [wb_clk_i] may not be slower than the pixel clock [clk_p_i].

2.2.2 wb_rst_i
The active high synchronous reset input [wb_rst_i] forces the core to restart. All internal registers are preset and all state-machines are set to an initial state.
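As a small worked example of the line FIFO depth formula from section 2.1.2, the sketch below evaluates the number of entries for a given address width. The helper name `sketch_line_fifo_entries` is hypothetical and only restates the formula; it is not part of the core or any driver.

```c
/* Number of line FIFO entries for a given LINE_FIFO_AWIDTH (address bits).
 * Assumes awidth is smaller than the bit width of unsigned long. */
unsigned long sketch_line_fifo_entries(unsigned int awidth)
{
    return 1UL << awidth;
}
```

With the default LINE_FIFO_AWIDTH of 7 this gives 128 entries, and each additional address bit doubles the depth.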
2.2.3 rst_i
The asynchronous reset input [rst_i] forces the core to restart. All internal registers are preset and all state-machines are set to an initial state. The reset level, either active high or active low, is set by the ARST_LVL parameter.

rst_i is not a WISHBONE-compatible signal. It is primarily provided for FPGA implementations. Using [rst_i] instead of [wb_rst_i] can result in lower cell usage and higher performance, because most FPGAs provide a dedicated asynchronous reset path. Use either [rst_i] or [wb_rst_i]. Hardcode the unused reset input to a negated state.

The core requires a power-on reset, allowing all internal registers to propagate to a known state. The power-on reset must be held asserted until all clocks are stable. When all clocks are stable the reset signal must remain asserted for at least 3 clock cycles of the slowest available clock [clk_p_i].

2.2.4 wb_inta_o
The interrupt request output is asserted when the core needs service from the host system.

2.3 WISHBONE Slave Interface Connections

    Port       Width  Direction  Description
    wbs_adr_i  12     Input      Lower address bits
    wbs_dat_i  32     Input      Slave Data bus input
    wbs_dat_o  32     Output     Slave Data bus output
    wbs_sel_i  4      Input      Byte select signals
    wbs_we_i   1      Input      Write enable input
    wbs_stb_i  1      Input      Strobe signal/Core select input
    wbs_cyc_i  1      Input      Valid bus cycle input
    wbs_ack_o  1      Output     Bus cycle acknowledge output
    wbs_err_o  1      Output     Bus cycle error output

2.3.1 wbs_adr_i
The address array input [wbs_adr_i] is used to pass a binary coded address to the core. The most significant bit is at the higher number of the array.

2.3.2 wbs_dat_i
The data array input [wbs_dat_i] is used to pass binary data from the current WISHBONE Master to the core. All data transfers are 32bit wide.

2.3.3 wbs_dat_o
The data array output [wbs_dat_o] is used to pass binary data from the core to the current WISHBONE Master.
All data transfers are 32bit wide.

2.3.4 wbs_sel_i
The byte select array input [wbs_sel_i] indicates where valid data is placed on the [wbs_dat_i] input array during writes to the core, and where it is expected on the [wbs_dat_o] output array during reads from the core. The core requires all accesses to be 32bit wide [wbs_sel_i(3:0) = ‘1111’b].

2.3.5 wbs_we_i
When asserted, the write enable input [wbs_we_i] indicates whether the current bus cycle is a read or a write cycle. The signal is asserted during write cycles and negated during read cycles.

2.3.6 wbs_stb_i
The strobe input [wbs_stb_i] is asserted when the core is being addressed. The core only responds to WISHBONE cycles when [wbs_stb_i] is asserted, except for the [wb_rst_i] and [rst_i] reset signals, which always receive a response.

2.3.7 wbs_cyc_i
When asserted, the cycle input [wbs_cyc_i] indicates that a valid bus cycle is in progress. The logical AND function of [wbs_cyc_i] and [wbs_stb_i] indicates a valid transfer cycle to/from the core.

2.3.8 wbs_ack_o
When asserted, the acknowledge output [wbs_ack_o] indicates the normal termination of a valid bus cycle.

2.3.9 wbs_err_o
When asserted, the error output [wbs_err_o] indicates an abnormal termination of a bus cycle. The [wbs_err_o] output signal is asserted when the host tries to access the controller’s internal registers not using 32-bit aligned data; i.e. when [wbs_sel_i(3:0)] is unequal to ‘1111’b.
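The slave-side rules from sections 2.3.4 to 2.3.9 can be summarized in a small behavioural model. The sketch below is only an illustration under stated assumptions: the enum and the function `sketch_wbs_classify` are hypothetical names, the model ignores clocking and timing entirely, and the core's RTL remains the authoritative description.

```c
#include <stdint.h>

/* Hypothetical outcome of one sampled slave access. */
enum sketch_wb_result { SKETCH_WB_IDLE, SKETCH_WB_ACK, SKETCH_WB_ERR };

/* Classify an access from the sampled wbs_cyc_i, wbs_stb_i and wbs_sel_i values. */
enum sketch_wb_result sketch_wbs_classify(unsigned cyc, unsigned stb, uint8_t sel)
{
    if (!(cyc && stb))
        return SKETCH_WB_IDLE;   /* cyc AND stb is false: no valid transfer cycle */
    if ((sel & 0xF) != 0xF)
        return SKETCH_WB_ERR;    /* not a full 32-bit access: wbs_err_o */
    return SKETCH_WB_ACK;        /* normal termination: wbs_ack_o */
}
```

A bus master model could use such a check to reject byte or half-word register accesses before they reach the core.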
2.4 WISHBONE Master Interface Connections

    Port       Width  Direction  Description
    wbm_adr_o  32     Output     Address bus output
    wbm_dat_i  32     Input      Data bus input
    wbm_sel_o  4      Output     Byte select signals
    wbm_we_o   1      Output     Write enable output
    wbm_stb_o  1      Output     Strobe signal
    wbm_cyc_o  1      Output     Valid bus cycle output
    wbm_cti_o  3      Output     Cycle type identifier output
    wbm_bte_o  2      Output     Burst type extensions output
    wbm_ack_i  1      Input      Bus cycle acknowledge input
    wbm_err_i  1      Input      Bus cycle error input

2.4.1 wbm_adr_o
The address array output [wbm_adr_o] is used to pass a binary coded address from the core to the external video memory. The most significant bit is at the higher number of the array.

2.4.2 wbm_dat_i
The data array input [wbm_dat_i] is used to pass binary data from the external video memory to the core. All data transfers are 32bit wide.

2.4.3 wbm_sel_o
The byte select array output [wbm_sel_o] indicates where valid data is expected on the [wbm_dat_i] input array. The core supports 32-bit wide accesses only [wbm_sel_o(3:0) = ‘1111’b].

2.4.4 wbm_we_o
When asserted, the write enable output [wbm_we_o] indicates whether the current bus cycle is a read or a write cycle. The core only reads from the external memory; therefore, [wbm_we_o] is always negated (‘0’).

2.4.5 wbm_stb_o
The strobe output [wbm_stb_o] is asserted when the core wants to read from the external video memory.

2.4.6 wbm_cyc_o
The cycle output [wbm_cyc_o] is asserted when the core wants to read from the external video memory.

2.4.7 wbm_cti_o
The Wishbone revB.3 cycle type identifier output [wbm_cti_o] gives compliant slaves additional information about the current cycle. The vga core supports the Registered Feedback Cycles introduced in the Wishbone revB.3 specs. The core supports ‘Classic’ and ‘Incrementing Burst’ transfers.
The table below shows the values [wbm_cti_o] can take; any other value should be considered a core error.

wbm_cti_o   Meaning
000b        Wishbone Classic (i.e. revB.2) transfer
010b        Incrementing burst transfer
111b        End-of-Burst

2.4.8 wbm_bte_o
The Wishbone revB.3 burst type extension output [wbm_bte_o] gives compliant slaves additional information about the requested burst. The vga core only supports linear incrementing bursts. Therefore [wbm_bte_o] is always 2’b00.

2.4.9 wbm_ack_i
When asserted, the acknowledge input [wbm_ack_i] indicates the normal termination of a valid bus cycle.

2.4.10 wbm_err_i
When asserted, the error input [wbm_err_i] indicates an abnormal termination of a bus cycle. When the [wbm_err_i] signal is asserted, the core stops the current transfer. After [wbm_err_i] has been asserted, the state of the core is undefined.

2.5 VGA Port Connections

Port          Width   Direction   Description
clk_p_i       1       Input       Pixel Clock
hsync_pad_o   1       Output      Horizontal Synchronization Pulse
vsync_pad_o   1       Output      Vertical Synchronization Pulse
csync_pad_o   1       Output      Composite Synchronization Pulse
blank_pad_o   1       Output      Blank signal
r_pad_o       8       Output      Red Color Data
g_pad_o       8       Output      Green Color Data
b_pad_o       8       Output      Blue Color Data

2.5.1 clk_p_i
All internal video logic is registered to the rising edge of the [clk_p_i] clock input. The frequency range over which the core can operate depends on the technology used and the pixel clock needed; [clk_p_i] may not be faster than the WISHBONE clock [wb_clk_i].

2.5.2 hsync_pad_o
The horizontal synchronization pulse is asserted when the raster scan ray needs to return to the start position (the left side of the screen).

2.5.3 vsync_pad_o
The vertical synchronization pulse is asserted when the raster scan ray needs to return to the vertical start position (the top of the screen).
2.5.5 csync_pad_o
The composite synchronization pulse is a combined horizontal and vertical synchronization signal.

2.5.6 blank_pad_o
The blank output is asserted when no image is projected onto the screen, i.e. during the back porch, the synchronization pulses, and the front porch.

2.5.7 r_pad_o, g_pad_o, b_pad_o
Red, green, and blue pixel data: the RGB lines contain invalid data while the BLANK signal [blank_pad_o] is asserted.

3 Registers

3.1 Registers List

Name    wbs_adr_i[11:0]   Width   Access   Description
CTRL    0x000             32      R/W      Control Register
STAT    0x004             32      R/W      Status Register
HTIM    0x008             32      R/W      Horizontal Timing Register
VTIM    0x00C             32      R/W      Vertical Timing Register
HVLEN   0x010             32      R/W      Horizontal and Vertical Length Register
VBARa   0x014             32      R/W      Video Memory Base Address Register A
VBARb   0x018             32      R/W      Video Memory Base Address Register B
        0x01C-0x02C       32      R/W      reserved
C0XY    0x030             32      R/W      Cursor0 X,Y Register
C0BAR   0x034             32      R/W      Cursor0 Base Address Register
        0x038-0x03C       32      R/W      reserved
C0CR    0x040-0x05C       32      R/W      Cursor0 Color Registers
        0x060-0x06C       32      R/W      reserved
C1XY    0x070             32      R/W      Cursor1 X,Y Register
C1BAR   0x074             32      R/W      Cursor1 Base Address Register
        0x078-0x07C       32      R/W      reserved
C1CR    0x080-0x09C       32      R/W      Cursor1 Color Registers
        0x0A0-0x7FC       32      R/W      reserved
PCLT    0x800-0xFFC       32      R/W      8bpp Pseudo Color Lookup Table

3.2 Accessing Reserved Address Locations
It is not allowed to access reserved memory locations. No error is generated when these addresses are accessed; all transfers are terminated normally. Write accesses are ignored, read accesses return all zeros.
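The Registers List in section 3.1 maps naturally onto a set of offset constants in a driver header. The following is a minimal sketch of that idea, not code from the core's distribution: the `VGA_*` macro names and the `vga_offset_is_reserved` helper are hypothetical and chosen here for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offsets on wbs_adr_i[11:0], copied from the Registers List (3.1).
 * The VGA_* names are illustrative, not taken from the core's sources. */
#define VGA_CTRL   0x000u  /* Control Register */
#define VGA_STAT   0x004u  /* Status Register */
#define VGA_HTIM   0x008u  /* Horizontal Timing Register */
#define VGA_VTIM   0x00Cu  /* Vertical Timing Register */
#define VGA_HVLEN  0x010u  /* Horizontal and Vertical Length Register */
#define VGA_VBARA  0x014u  /* Video Memory Base Address Register A */
#define VGA_VBARB  0x018u  /* Video Memory Base Address Register B */
#define VGA_C0XY   0x030u  /* Cursor0 X,Y Register */
#define VGA_C0BAR  0x034u  /* Cursor0 Base Address Register */
#define VGA_C0CR   0x040u  /* first Cursor0 Color Register (0x040-0x05C) */
#define VGA_C1XY   0x070u  /* Cursor1 X,Y Register */
#define VGA_C1BAR  0x074u  /* Cursor1 Base Address Register */
#define VGA_C1CR   0x080u  /* first Cursor1 Color Register (0x080-0x09C) */
#define VGA_PCLT   0x800u  /* 8bpp Pseudo Color Lookup Table (0x800-0xFFC) */

/* Hypothetical helper: returns 1 if a word-aligned offset lies in one of the
 * reserved ranges of the Registers List, 0 otherwise. */
static int vga_offset_is_reserved(uint32_t off)
{
    /* The slave interface only accepts 32-bit accesses in a 4 KB window. */
    assert((off & 0x3u) == 0u && off <= 0xFFCu);
    return (off >= 0x01Cu && off <= 0x02Cu) ||
           (off >= 0x038u && off <= 0x03Cu) ||
           (off >= 0x060u && off <= 0x06Cu) ||
           (off >= 0x078u && off <= 0x07Cu) ||
           (off >= 0x0A0u && off <= 0x7FCu);
}
```

A driver could use such a check in debug builds to catch accesses that the core would silently ignore, since reserved locations return zeros and generate no error.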
3.3 Control Register [CTRL]

Bit #    Access  Description
31:26    R/W     reserved
25       R/W     HC1R, Hardware Cursor1 Resolution
                 0: 32x32 pixel mode
                 1: 64x64 pixel mode
24       R/W     HC1E, Hardware Cursor1 Enable
                 0: Hardware Cursor1 disabled
                 1: Hardware Cursor1 enabled
23:22    R/W     reserved
21       R/W     HC0R, Hardware Cursor0 Resolution
                 0: 32x32 pixel mode
                 1: 64x64 pixel mode
20       R/W     HC0E, Hardware Cursor0 Enable
                 0: Hardware Cursor0 disabled
                 1: Hardware Cursor0 enabled
19:16    R/W     reserved
15       R/W     BL, Blanking Polarization Level
                 0: Positive
                 1: Negative
14       R/W     CSL, Composite Synchronization Pulse Polarization Level
                 0: Positive
                 1: Negative
13       R/W     VSL, Vertical Synchronization Pulse Polarization Level
                 0: Positive
                 1: Negative
12       R/W     HSL, Horizontal Synchronization Pulse Polarization Level
                 0: Positive
                 1: Negative
11       R/W     PC, 8-bit Pseudo Color
                 0: 8-bit grayscale
                 1: 8-bit pseudo color
10,9     R/W     CD, Color Depth
                 11: 32 bits per pixel
                 10: 24 bits per pixel
                 01: 16 bits per pixel
                 00: 8 bits per pixel
8,7      R/W     VBL, Video memory Burst Length
                 11b: 8 cycles
                 10b: 4 cycles
                 01b: 2 cycles
                 00b: 1 cycle
6        R/W     CBSWE, CLUT Bank Switching Enable
                 0: Color lookup table bank switching disabled
                 1: Color lookup table bank switching enabled
5        R/W     VBSWE, Video Bank Switching Enable
                 0: Video memory bank switching disabled
                 1: Video memory bank switching enabled
4        R/W     CBSIE, CLUT Bank Switch Interrupt Enable
                 0: Color lookup table bank switching interrupt disabled
                 1: Color lookup table bank switching interrupt enabled
3        R/W     VBSIE, Video Bank Switch Interrupt Enable
                 0: Video memory bank switching interrupt disabled
                 1: Video memory bank switching interrupt enabled
2        R/W     HIE, HSync Interrupt Enable
                 0: Horizontal synchronization pulse interrupt disabled
                 1: Horizontal synchronization pulse interrupt enabled
1        R/W     VIE, VSync Interrupt Enable
                 0:
Vertical synchronization pulse interrupt disabled
                 1: Vertical synchronization pulse interrupt enabled
0        R/W     VEN, Video Enable
                 0: Video system disabled
                 1: Video system enabled

Reset Value: 0x00000000

3.3.1 BL
The Blanking Polarization Level defines the voltage level of the blank output [blank_pad_o] when the blank signal is asserted. When BL is cleared (‘0’), [blank_pad_o] is at a high voltage level when the blank signal is asserted and at a low voltage level when the blank signal is negated (i.e. blank is active high). When BL is set (‘1’), [blank_pad_o] is at a low voltage level when the blank signal is asserted and at a high voltage level when the blank signal is negated (i.e. blank is active low).

3.3.2 CBSIE
When the CLUT Bank Switch Interrupt Enable bit is set (‘1’) and a bank switch is requested, the host is interrupted. The Bank Switch interrupt is independent of the CLUT Bank Switch Enable bit setting. Setting this bit while the CLUT Bank Switch Interrupt Pending (CBSINT) flag is set generates an interrupt. Clearing this bit while CBSINT is set disables the interrupt request, but does not clear the interrupt pending flag.

3.3.3 CBSWE
When the CLUT Bank Switch Enable bit is set (‘1’) and a complete video frame has been read into the line buffer, the core switches between the two available color lookup tables located at the memory addresses that are set in the CLUT Memory Base Address register. The Active CLUT Memory Page (ACMP) flag reflects the current active color lookup table. The core automatically clears this bit after the bank switch. Software should set this bit each time a bank switch is desired.

3.3.4 CD
The Color Depth bits define the number of bits per pixel (bpp): 8, 16, 24, or 32 bits per pixel.
CD    Color Depth
00b   8bpp
01b   16bpp
10b   24bpp
11b   32bpp

3.3.5 CSL
The Composite Sync Polarization Level defines the voltage level of the composite synchronization output [csync_pad_o] when the composite sync signal is asserted. When CSL is cleared (‘0’), [csync_pad_o] is at a high voltage level when the composite sync signal is asserted and at a low voltage level when the composite sync signal is negated (i.e. csync is active high). When CSL is set (‘1’), [csync_pad_o] is at a low voltage level when the composite sync signal is asserted and at a high voltage level when the composite sync signal is negated (i.e. csync is active low).

3.3.6 HC0E
When the Hardware Cursor0 Enable bit is set (‘1’), the first hardware cursor will be displayed. When it is cleared (‘0’), the hardware cursor will be removed. To avoid corrupted images, displaying and removing the hardware cursor is synchronous to the vertical retrace; i.e. the cursor will be displayed/removed in the next video frame. All related registers should be set to their corresponding values before enabling the cursor.

3.3.7 HC1E
When the Hardware Cursor1 Enable bit is set (‘1’), the second hardware cursor will be displayed. When it is cleared (‘0’), the hardware cursor will be removed. To avoid corrupted images, displaying and removing the hardware cursor is synchronous to the vertical retrace; i.e. the cursor will be displayed/removed in the next video frame. All related registers should be set to their corresponding values before enabling the cursor.

3.3.8 HC0R
The Hardware Cursor0 Resolution bit sets the pattern size and the color depth for the first hardware cursor. When HC0R is set (‘1’), hardware cursor0 is set for a resolution of 64x64x4bpp. When HC0R is cleared (‘0’), hardware cursor0 is set for a resolution of 32x32x16bpp. It may not be changed while the cursor is being displayed.
To change the cursor’s Resolution bit, first turn off the cursor by clearing the Hardware Cursor0 Enable bit, then change the cursor’s Resolution bit value, (re)write the cursor’s Base Address register to load the new cursor pattern, and finally re-enable the cursor by setting the Hardware Cursor0 Enable bit. To avoid displaying corrupted cursors, wait for a vertical sync interrupt after clearing the Hardware Cursor0 Enable bit.

3.3.9 HC1R
The Hardware Cursor1 Resolution bit sets the pattern size and the color depth for the second hardware cursor. When HC1R is set (‘1’), hardware cursor1 is set for a resolution of 64x64x4bpp. When HC1R is cleared (‘0’), hardware cursor1 is set for a resolution of 32x32x16bpp. It may not be changed while the cursor is being displayed. To change the cursor’s Resolution bit, first turn off the cursor by clearing the Hardware Cursor1 Enable bit, then change the cursor’s Resolution bit value, (re)write the cursor’s Base Address register to load the new cursor pattern, and finally re-enable the cursor by setting the Hardware Cursor1 Enable bit. To avoid displaying corrupted cursors, wait for a vertical sync interrupt after clearing the Hardware Cursor1 Enable bit.

3.3.10 HIE
When the Horizontal Interrupt Enable bit is set (‘1’) and a horizontal interrupt is pending, the host system is interrupted. Setting this bit while the Horizontal Interrupt Pending (HINT) flag is set generates an interrupt. Clearing this bit while HINT is set disables the interrupt request but does not clear the interrupt pending flag.

3.3.11 HSL
The Horizontal Sync Polarization Level defines the voltage level of the horizontal synchronization output [hsync_pad_o] when the horizontal sync signal is asserted.
When HSL is cleared (‘0’), [hsync_pad_o] is at a high voltage level when the horizontal sync signal is asserted and at a low voltage level when the horizontal sync signal is negated (i.e. hsync is active high). When HSL is set (‘1’), [hsync_pad_o] is at a low voltage level when the horizontal sync signal is asserted and at a high voltage level when the horizontal sync signal is negated (i.e. hsync is active low).

3.3.12 PC
When in 8bpp mode, the pixel data can be used as black and white information (256 grayscales) or as an index to a color lookup table (pseudo color mode). When the PC bit is set (‘1’), the core operates in pseudo color mode and the pixel data is used to read the color data from the CLUT. When the PC bit is cleared (‘0’), the pixel data is placed on the red, green, and blue outputs, effectively producing a black and white image with 256 different grayscales.

3.3.13 VBSIE
When the Video Bank Switch Interrupt Enable bit is set (‘1’) and a bank switch is requested, the host is interrupted. The Bank Switch interrupt is independent of the Video Bank Switch Enable bit setting. Setting this bit while the Video Bank Switch Interrupt Pending (VBSINT) flag is set generates an interrupt. Clearing this bit while VBSINT is set disables the interrupt request but does not clear the interrupt pending flag.

3.3.14 VBSWE
When the Video Bank Switch Enable bit is set (‘1’) and a complete video frame has been read into the line buffer, the core switches between the two available video pages located at the memory addresses set in the Video Memory Base Address (VBAR) registers. The Active Video Memory Page (AVMP) flag reflects the current active video page. The core automatically clears this bit after the bank switch. Software should set this bit each time a bank switch is desired.

3.3.15 VBL
The Video Burst Length bits define the number of transfers during a single block read access to the video memory: 1 (single access), 2, 4, or 8 accesses per block read.
The core will perform multiple consecutive block reads; the total number of accesses during a read is therefore always a multiple (i.e. one or more) of the Video Burst Length.

VBL   Burst length
00b   1 transfer
01b   2 transfers
10b   4 transfers
11b   8 transfers

3.3.16 VEN
The video circuit is disabled when the Video Enable bit is cleared (‘0’). The video circuit is enabled when the Video Enable bit is set (‘1’). This bit must be cleared before changing any register contents. After (re)programming all registers, this bit may be set.

3.3.17 VIE
When the Vertical Interrupt Enable bit is set (‘1’) and a vertical interrupt is pending, the host system is interrupted. Setting this bit while the Vertical Interrupt Pending (VINT) flag is set generates an interrupt. Clearing this bit while VINT is set disables the interrupt request but does not clear the interrupt pending flag.

3.3.18 VSL
The Vertical Sync Polarization Level defines the voltage level of the vertical synchronization output [vsync_pad_o] when the vertical sync signal is asserted. When VSL is cleared (‘0’), [vsync_pad_o] is at a high voltage level when the vertical sync signal is asserted and at a low voltage level when the vertical sync signal is negated (i.e. vsync is active high). When VSL is set (‘1’), [vsync_pad_o] is at a low voltage level when the vertical sync signal is asserted and at a high voltage level when the vertical sync signal is negated (i.e. vsync is active low).
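The CTRL bit assignments described in section 3.3 can be captured as C masks so that a control word is built from named fields rather than magic numbers. This is a hedged sketch: the bit positions come from the Control Register table, but the `VGA_CTRL_*` names and the `vga_ctrl_value` helper are hypothetical and only illustrate one way to compose the value.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions from the Control Register table; names are illustrative. */
#define VGA_CTRL_VEN     (1u << 0)   /* Video Enable */
#define VGA_CTRL_VIE     (1u << 1)   /* VSync Interrupt Enable */
#define VGA_CTRL_HIE     (1u << 2)   /* HSync Interrupt Enable */
#define VGA_CTRL_VBSIE   (1u << 3)   /* Video Bank Switch Interrupt Enable */
#define VGA_CTRL_CBSIE   (1u << 4)   /* CLUT Bank Switch Interrupt Enable */
#define VGA_CTRL_VBSWE   (1u << 5)   /* Video Bank Switching Enable */
#define VGA_CTRL_CBSWE   (1u << 6)   /* CLUT Bank Switching Enable */
#define VGA_CTRL_VBL(x)  (((uint32_t)(x) & 0x3u) << 7)  /* burst length code */
#define VGA_CTRL_CD(x)   (((uint32_t)(x) & 0x3u) << 9)  /* color depth code */
#define VGA_CTRL_PC      (1u << 11)  /* 8-bit Pseudo Color */
#define VGA_CTRL_HSL     (1u << 12)  /* hsync active low when set */
#define VGA_CTRL_VSL     (1u << 13)  /* vsync active low when set */
#define VGA_CTRL_CSL     (1u << 14)  /* csync active low when set */
#define VGA_CTRL_BL      (1u << 15)  /* blank active low when set */
#define VGA_CTRL_HC0E    (1u << 20)  /* Hardware Cursor0 Enable */
#define VGA_CTRL_HC0R    (1u << 21)  /* Hardware Cursor0 Resolution */
#define VGA_CTRL_HC1E    (1u << 24)  /* Hardware Cursor1 Enable */
#define VGA_CTRL_HC1R    (1u << 25)  /* Hardware Cursor1 Resolution */

/* Hypothetical helper: combine the two-bit CD and VBL codes with flag bits.
 * cd:  0 = 8bpp, 1 = 16bpp, 2 = 24bpp, 3 = 32bpp
 * vbl: 0 = 1, 1 = 2, 2 = 4, 3 = 8 transfers per burst */
static uint32_t vga_ctrl_value(unsigned cd, unsigned vbl, uint32_t flags)
{
    assert(cd <= 3u && vbl <= 3u);
    return VGA_CTRL_CD(cd) | VGA_CTRL_VBL(vbl) | flags;
}
```

Because VEN must be cleared before other register contents are changed, a driver would typically write the word once without `VGA_CTRL_VEN`, program the remaining registers, and then write it again with `VGA_CTRL_VEN` set.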
3.4 Status Register [STAT]

Bit #    Access  Description
31:25    R       reserved
24       R       HC1A, Hardware cursor1 available
23:21    R       reserved
20       R       HC0A, Hardware cursor0 available
19:18    R       reserved
17       R       ACMP, Active CLUT Memory Page
16       R       AVMP, Active Video Memory Page
15:8     R       reserved
7        R/W     CBSINT, CLUT Bank Switch Interrupt Pending
6        R/W     VBSINT, Bank Switch Interrupt Pending
5        R/W     HINT, Horizontal Interrupt Pending
4        R/W     VINT, Vertical Interrupt Pending
3:2      R/W     reserved
1        R/W     LUINT, Line FIFO Under-Run Interrupt Pending
0        R/W     SINT, System Error Interrupt Pending

Reset Value: 0x00000000 ~ 0x00110000

3.4.1 ACMP
The Active CLUT Memory Page flag is cleared (‘0’) when the active color lookup table is CLUT0; it is set (‘1’) when the active color lookup table is CLUT1. This flag is cleared when the Video Enable bit is cleared. Refer to the CLUT Base Address register for more information on CLUT0 and CLUT1.

3.4.2 AVMP
The Active Video Memory Page flag is cleared (‘0’) when the active memory page is located at Video Base Address A (VBARa); it is set (‘1’) when the active memory page is located at Video Base Address B (VBARb). This flag is cleared when the Video Enable bit is cleared.

3.4.3 CBSINT
The CLUT Bank Switch Interrupt Pending flag is set (‘1’) when all video data from the current active memory page has been translated into pixel colors by the currently active color lookup table. When the CBSIE bit is set (‘1’) and CBSINT is asserted, the host system is interrupted. Software must clear the interrupt by writing a (‘0’) to this bit.

3.4.4 HC0A
The Hardware Cursor0 Available bit is a hard coded flag that is set (‘1’) when Hardware Cursor0 is available and cleared (‘0’) when Hardware Cursor0 is not available.

3.4.5 HC1A
The Hardware Cursor1 Available bit is a hard coded flag that is set (‘1’) when Hardware Cursor1 is available and cleared (‘0’) when Hardware Cursor1 is not available.
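The STAT bit layout in the table above lends itself to a few small decoding helpers. The sketch below assumes the status word has already been read from the core; the `VGA_STAT_*` names and the three helper functions are hypothetical and are not part of any driver shipped with the core.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions from the Status Register table; names are illustrative. */
#define VGA_STAT_SINT    (1u << 0)   /* System Error Interrupt Pending */
#define VGA_STAT_LUINT   (1u << 1)   /* Line FIFO Under-Run Interrupt Pending */
#define VGA_STAT_VINT    (1u << 4)   /* Vertical Interrupt Pending */
#define VGA_STAT_HINT    (1u << 5)   /* Horizontal Interrupt Pending */
#define VGA_STAT_VBSINT  (1u << 6)   /* Video Bank Switch Interrupt Pending */
#define VGA_STAT_CBSINT  (1u << 7)   /* CLUT Bank Switch Interrupt Pending */
#define VGA_STAT_AVMP    (1u << 16)  /* Active Video Memory Page */
#define VGA_STAT_ACMP    (1u << 17)  /* Active CLUT Memory Page */
#define VGA_STAT_HC0A    (1u << 20)  /* Hardware cursor0 available */
#define VGA_STAT_HC1A    (1u << 24)  /* Hardware cursor1 available */

#define VGA_STAT_INT_MASK (VGA_STAT_SINT | VGA_STAT_LUINT | VGA_STAT_VINT | \
                           VGA_STAT_HINT | VGA_STAT_VBSINT | VGA_STAT_CBSINT)

/* Returns only the interrupt-pending flags of a status word. */
static uint32_t vga_stat_pending(uint32_t stat)
{
    return stat & VGA_STAT_INT_MASK;
}

/* Returns 0 when the page at VBARa is active, 1 when the page at VBARb is. */
static unsigned vga_stat_active_video_page(uint32_t stat)
{
    return (stat & VGA_STAT_AVMP) ? 1u : 0u;
}

/* Returns 1 if hardware cursor 0 or 1 is implemented, 0 otherwise. */
static int vga_stat_cursor_available(uint32_t stat, unsigned cursor)
{
    assert(cursor <= 1u);
    return (stat & (cursor ? VGA_STAT_HC1A : VGA_STAT_HC0A)) != 0u;
}
```

A double-buffering loop could call `vga_stat_active_video_page` to decide which of the two frame buffers is safe to draw into.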
3.4.6 HINT
The Horizontal Interrupt Pending flag is set (‘1’) when the horizontal synchronization pulse [hsync_pad_o] is asserted. When the HIE bit is set (‘1’) and HINT is asserted, the host system is interrupted. Software must clear the interrupt by writing a (‘0’) to this bit.

3.4.7 LUINT
The Line FIFO Under-Run Interrupt Pending flag is set (‘1’) when pixels are read from the Line FIFO while it is empty. This can be caused by a locked bus, reading from an illegal video memory address, or too few entries in the FIFO. When LUINT is asserted, the host system is interrupted. Software must clear the interrupt by writing a (‘0’) to this bit. The Line FIFO Under-Run Interrupt is a non-maskable interrupt.

3.4.8 SINT
The System Error Interrupt Pending flag is set (‘1’) when [wbm_err_i] is asserted during a read from the video memory. When SINT is asserted, the host system is interrupted. Software must clear the interrupt by writing a (‘0’) to this bit. The System Error Interrupt is a non-maskable interrupt.

3.4.9 VBSINT
The Video Bank Switch Interrupt Pending flag is set (‘1’) when all video data from the current active memory page has been read. When the VBSIE bit is set (‘1’) and VBSINT is asserted, the host system is interrupted. Software must clear the interrupt by writing a (‘0’) to this bit.

3.4.10 VINT
The Vertical Interrupt Pending flag is set (‘1’) when the vertical synchronization pulse [vsync_pad_o] is asserted. When the VIE bit is set (‘1’) and VINT is asserted, the host system is interrupted. Software must clear the interrupt by writing a (‘0’) to this bit.

3.5 Horizontal Timing Register [HTIM]

Bit #    Access  Description
31:24    R/W     Thsync, Horizontal synchronization pulse width
23:16    R/W     Thgdel, Horizontal gate delay time
15:0     R/W     Thgate, Horizontal gate time

Reset Value: 0x00000000

3.5.1 Thsync
The horizontal synchronization pulse width, measured in pixels -1.
Example: Thsync = 5, hsync length = 6 pixels

3.5.2 Thgdel
The horizontal gate delay width, measured in pixels -1.
Example: Thgdel = 12, gate delay = 13 pixels

3.5.3 Thgate
The horizontal gate width, measured in pixels -1.
Example: Thgate = 799, gate length = 800 pixels

The horizontal gate width is dependent on the programmed Video memory Burst Length [VBL] and the Color Depth [CD]. It must be divisible by the burst length and the number of pixels per memory access; see the table below for more information.

CD    (Thgate + 1) divisible by:
00b   4 * VBL
01b   2 * VBL
10b   4/3 * VBL
11b   1 * VBL

3.6 Vertical Timing Register [VTIM]

Bit #    Access  Description
31:24    R/W     Tvsync, vertical synchronization pulse width
23:16    R/W     Tvgdel, vertical gate delay time
15:0     R/W     Tvgate, vertical gate time

Reset Value: 0x00000000

3.6.1 Tvsync
The vertical synchronization pulse width, measured in horizontal lines -1.
Example: Tvsync = 5, vsync length = 6 lines

3.6.2 Tvgdel
The vertical gate delay time, measured in horizontal lines -1.
Example: Tvgdel = 2, gate delay = 3 lines

3.6.3 Tvgate
The vertical gate width, measured in horizontal lines -1.
Example: Tvgate = 479, gate length = 480 lines

3.7 Horizontal and Vertical Length Register [HVLEN]

Bit #    Access  Description
31:16    R/W     Thlen, horizontal length
15:0     R/W     Tvlen, vertical length

Reset Value: 0x00000000

3.7.1 Thlen
The total horizontal line time, measured in pixels -1.
Example: Thlen = 1023, line length = 1024 pixels

3.7.2 Tvlen
The total vertical frame time, measured in horizontal lines -1.
Example: Tvlen = 599, frame length = 600 lines

3.8 Video Base Address [VBARa] [VBARb]

Bit #    Access  Description
31:2     R/W     VBA, Video Base Address
1:0      R       Always zero

Reset Value: 0x00000000

3.8.1 Video Base Address
The Video Base Address register defines the starting point of the video memory. The image is stored in consecutive memory locations, starting at this address. The byte memory location of a pixel can be calculated as follows:

    Adr = ((Y * Thgate) + X) * bytes_per_pixel;

The core supports memories with burst capabilities. Burst transfers of 1, 2, 4, and 8 accesses are supported. The lower address bits must reflect the value entered in the Video Memory Burst Length bits as shown in the table below, where an ‘x’ represents a don’t care value.

VBL   VBAR[4:0]*
00b   xxx00b
01b   xx000b
10b   x0000b
11b   00000b

3.9 Hardware Cursor Base Address [C0BAR] [C1BAR]

Bit #    Access  Description
31:10    R/W     CBA, Cursor Base Address
9:0      R       Always zero

Reset Value: 0x00000000

3.9.1 Cursor Base Address
The Cursor Base Address register defines the starting point of the cursor pattern to use. The cursor pattern is stored in consecutive memory locations, starting at this address.

3.10 Hardware Cursor (X,Y) Register [C0XY] [C1XY]

Bit #    Access  Description
31:16    R/W     CY, Cursor Y location
15:0     R/W     CX, Cursor X location

Reset Value: 0x00000000

3.10.1 CY
The cursor’s upper left pixel’s vertical position relative to the upper left corner of the image. CY is always positive, i.e. a larger value means moving the cursor down the screen. A smaller value means moving the cursor up the screen.

3.10.2 CX
The cursor’s upper left pixel’s horizontal position relative to the upper left corner of the image. CX is always positive, i.e. a larger value means moving the cursor to the right of the screen. A smaller value means moving the cursor to the left of the screen.
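The "minus one" encoding of the timing registers (sections 3.5 to 3.7), the pixel address formula in 3.8.1, and the C0XY/C1XY layout in 3.10 can be sketched as a few packing helpers. All function names below are hypothetical. One assumption is labeled in the code: the formula's "Thgate" is read here as the visible line width in pixels, not the register value (which stores the width minus one).

```c
#include <assert.h>
#include <stdint.h>

/* HTIM and VTIM share one layout: [31:24] sync width, [23:16] gate delay,
 * [15:0] gate width. Arguments are real counts; the registers store count - 1. */
static uint32_t vga_pack_timing(uint32_t sync, uint32_t gdel, uint32_t gate)
{
    assert(sync >= 1u && sync <= 256u);
    assert(gdel >= 1u && gdel <= 256u);
    assert(gate >= 1u && gate <= 65536u);
    return ((sync - 1u) << 24) | ((gdel - 1u) << 16) | (gate - 1u);
}

/* HVLEN: [31:16] Thlen, [15:0] Tvlen, both stored as count - 1. */
static uint32_t vga_pack_hvlen(uint32_t hlen, uint32_t vlen)
{
    assert(hlen >= 1u && hlen <= 65536u);
    assert(vlen >= 1u && vlen <= 65536u);
    return ((hlen - 1u) << 16) | (vlen - 1u);
}

/* Byte offset of pixel (x, y) from the video base address, following
 * Adr = ((Y * Thgate) + X) * bytes_per_pixel.
 * Assumption: width is the visible line width in pixels. */
static uint32_t vga_pixel_offset(uint32_t x, uint32_t y,
                                 uint32_t width, uint32_t bytes_per_pixel)
{
    assert(x < width);
    return ((y * width) + x) * bytes_per_pixel;
}

/* C0XY / C1XY: [31:16] CY, [15:0] CX. */
static uint32_t vga_pack_cursor_xy(uint32_t cx, uint32_t cy)
{
    assert(cx <= 0xFFFFu && cy <= 0xFFFFu);
    return (cy << 16) | cx;
}
```

With the example values from the text (6-pixel hsync, 13-pixel gate delay, 800-pixel gate), `vga_pack_timing(6, 13, 800)` yields the HTIM word with Thsync = 5, Thgdel = 12, and Thgate = 799.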
3.11 Hardware Cursor Color Registers [C0CR] [C1CR]

Bit #    Access  Description
31:16    R/W     Color data (odd numbered color register)
15:0     R/W     Color data (even numbered color register)

Reset Value: 0x00000000

3.11.1 Cursor Color Register
The Cursor Color registers define the cursor colors for 64x64x4bpp cursor mode, which is enabled when the Hardware Cursor Resolution bit is set (‘1’). In this mode each cursor pixel uses 4 bits. The 4 bits are used in a lookup table fashion to select a single color register from a total of 16. The 16 color registers are mapped to 8 addresses, where the 16 LSBs store an even-numbered color register (i.e. 0, 2, 4, etc.) and the 16 MSBs store an odd-numbered color register (i.e. 1, 3, 5, etc.).

Address Cursor0   Address Cursor1   Bit 31:16           Bit 15:0
0x028             0x058             Color Register 1    Color Register 0
0x02C             0x05C             Color Register 3    Color Register 2
0x030             0x060             Color Register 5    Color Register 4
0x034             0x064             Color Register 7    Color Register 6
0x038             0x068             Color Register 9    Color Register 8
0x03C             0x06C             Color Register 11   Color Register 10
0x040             0x070             Color Register 13   Color Register 12
0x044             0x074             Color Register 15   Color Register 14

Reset Value: undefined

These registers are available only when the dedicated hardware cursor is implemented, i.e. C0CR is available when hardware cursor0 is available, and C1CR is available when hardware cursor1 is available. Whether or not a hardware cursor is implemented can be checked via the Status register. When a hardware cursor is not implemented the memory locations are reserved and the rules for accessing reserved memory locations apply.

Note: The contents of these registers are undefined after a reset.

3.12 8bpp Pseudo Color Lookup Table [PCLT]

3.12.1 Color Lookup Table
The color lookup table is mapped into the core’s address range. It can be accessed (read and write) via the WISHBONE Slave interface, starting at address 0x800.
See section 4.2.5 Color Lookup Table for more information.

4 Operation

4.1 Video Timing

4.1.1 Horizontal Video Timing

[Figure: horizontal video timing diagram showing Thsync, Thgdel, Thgate, and Thlen]

4.1.1.1 Thsync
The Horizontal Synchronization Time is the duration of the horizontal synchronization pulse, measured in pixel clock ticks.

4.1.1.2 Thgdel
The Horizontal Gate Delay Time is the duration of the time between the end of the horizontal synchronization pulse and the start of the horizontal gate, measured in pixel clock ticks. The image can be shifted left/right over the screen by modifying Thgdel. In video timing diagrams, this is mostly referred to as the back porch.

4.1.1.3 Thgate
The Horizontal Gate Time is the duration of the visible area of a video line, measured in pixel clock ticks. In video timing diagrams, this is mostly referred to as the active time.

4.1.1.4 Thlen
The Horizontal Length Time is the duration of a complete video line, from the start of the horizontal synchronization pulse till the start of the next horizontal synchronization pulse, measured in pixel clock ticks.

4.1.2 Vertical Video Timing

[Figure: vertical video timing diagram showing Tvsync, Tvgdel, Tvgate, and Tvlen]

4.1.2.1 Tvsync
The Vertical Synchronization Time is the duration of the vertical synchronization pulse, measured in horizontal lines.

4.1.2.2 Tvgdel
The Vertical Gate Delay Time is the duration of the time between the end of the vertical synchronization pulse and the start of the vertical gate, measured in horizontal lines. The image can be shifted up/down the screen by modifying Tvgdel. In video timing diagrams, this is mostly referred to as the back porch.

4.1.2.3 Tvgate
The Vertical Gate Time is the duration of the visible area of a video frame, measured in horizontal lines. In video timing diagrams, this is mostly referred to as the active time.
4.1.2.4 Tvlen
The Vertical Length Time is the duration of a complete video frame, from the start of the vertical synchronization pulse till the start of the next vertical synchronization pulse, measured in horizontal lines.

4.1.3 Combined Video Frame Timing

[Figure: combined video frame timing, with Thsync, Thgdel, Thgate, and Thlen along the total horizontal image size and Tvsync, Tvgdel, Tvgate, and Tvlen along the total vertical image size; the visible area starts at pixel (0,0)]

The video frame is composed of Tvlen video lines, each Thlen pixels long. The logical AND function of the horizontal gate and the vertical gate defines the visible area; the rest of the image is blanked.

4.2 Pixel Color Generation

4.2.1 Color Processor Internals

[Figure: Color Processor block diagram with the Address Generator (ADR_O), Data Buffer (DAT_I), Colorizer block, and CLUT; the RGB output goes to the Cursor Processor or Line FIFO]

The Color Processor, together with the WISHBONE Master interface and the Line FIFO, handles the pixel color generation. The internal structure of the Color Processor, including parts of the WISHBONE Master interface, is shown in the figure above.

4.2.2 Address Generator
The address generator is part of the WISHBONE Master interface. It generates the video memory addresses, performs video memory bank switching, and keeps track of the number of pixels to read. When all pixels are read, the video memory bank is switched, the video memory offset (i.e. the pixel counter) is reset and, when enabled, the bank switch interrupt is generated. The bank switch interrupt depends only on the number of pixels read, i.e. it has no fixed timing relation to the horizontal or vertical synchronization pulses.

4.2.3 Data Buffer
The data buffer temporarily stores the data read from the video memory. It can contain 16 32-bit entries. The system tries to keep the data buffer at least half full.
The data is read from the video memory by a consecutive address burst; i.e. [wbm_cab_o] is asserted. The burst length is determined by the Video memory Burst Length [VBL] bits in the control register. It is possible that multiple burst accesses are executed within a single access cycle. All data is stored consecutively, and all available bits are used independent of color depth. In 8bpp mode, a 32-bit word stores 4 pixels; in 16bpp mode it stores 2 pixels, in 24bpp mode 1 1/3 pixels, and in 32bpp mode 1 pixel.

4.2.4 Colorizer
The colorizer translates the data stored in the data buffer into colors (see the examples below). The table below shows the Data Buffer contents used in the examples. Only 8 out of the 16 possible entries are shown. The buffer is read from the top to the bottom, i.e. 0x01234567 is the first data read, 0x89abcdef is the second, etc.

Data Buffer contents
0x01234567
0x89abcdef
0x01234567
0x89abcdef
0x01234567
0x89abcdef
0x01234567
0x89abcdef

4.2.4.1 32bpp example.
In 32-bits-per-pixel mode, the lower 24 bits carry the pixel data. The upper 8 bits are ignored; they can be used for Z-buffer, alpha channel, stencil buffer, or similar purposes. The table below shows the RGB values generated from the sample data in the Data Buffer. Only the first 4 pixels are shown.

Color Data    R      G      B
0x01234567    0x23   0x45   0x67
0x89abcdef    0xab   0xcd   0xef
0x01234567    0x23   0x45   0x67
0x89abcdef    0xab   0xcd   0xef

4.2.4.2 24bpp example.
In 24-bits-per-pixel mode, the RGB values are generated as shown in the following sequence: Da(31:8), Da(7:0)Db(31:16), Db(15:0)Dc(31:24), Dc(23:0). The table below shows the RGB values generated from the sample data in the Data Buffer.
Color Data   R      G      B
0x012345     0x01   0x23   0x45
0x6789ab     0x67   0x89   0xab
0xcdef01     0xcd   0xef   0x01
0x234567     0x23   0x45   0x67

4.2.4.3 TrippleDisplay mode
The system is capable of driving up to three different displays at the same time. The system operates in TrippleDisplay mode when it is set up for 24bpp mode, but each of the three colors contains grayscale information for a single display.

4.2.4.4 16bpp example.
In 16-bits-per-pixel mode, the upper 16 bits carry the data for the first pixel and the lower 16 bits carry the data for the second pixel. The 24-bit RGB data is extracted from the 16-bit color data as follows:

    R(7:0) = color_data(15:11), 000b
    G(7:0) = color_data(10:5), 00b
    B(7:0) = color_data(4:0), 000b

The table below shows the RGB values generated from the sample data in the Data Buffer. Only the first 4 pixels are shown.

Color Data   R      G      B
0x0123       0x00   0x24   0x18
0x4567       0x40   0xac   0x38
0x89ab       0x88   0x34   0x58
0xcdef       0xc8   0xbc   0x78

4.2.4.5 8bpp grayscale example.
In 8-bits-per-pixel grayscale mode, the color data for each of the three colors are equal. The information stored in one byte is sent to all three colors, effectively producing a black-and-white image with 256 grayscales. The table below shows the RGB values generated from the sample data in the Data Buffer. Only the first 4 pixels are shown.

Color Data   R      G      B
0x01         0x01   0x01   0x01
0x23         0x23   0x23   0x23
0x45         0x45   0x45   0x45
0x67         0x67   0x67   0x67

4.2.4.6 8bpp pseudo-color example.
In 8-bits-per-pixel pseudo-color mode, the color data represents an offset in the internal color lookup table (CLUT). The CLUT contains the RGB color information. This way it is possible to generate an image with 256 different colors with minimal memory requirements.

    R = clut_data_out(23:16)
    G = clut_data_out(15:8)
    B = clut_data_out(7:0)

The table below shows the CLUT addresses for the first 4 pixels.
Color Data   CLUT offset
0x01         0x01
0x23         0x23
0x45         0x45
0x67         0x67

4.2.5 Color Lookup Table
The color lookup table (or CLUT) is a 512x24 bit single-clock synchronous static random access memory divided into two separate CLUTs of 256x24 bit each. One of them is accessed by the colorizer, depending on the Active CLUT Memory Page [ACMP] flag in the Status register. When the ACMP flag is cleared (‘0’), CLUT0 is accessed. When the ACMP flag is set (‘1’), CLUT1 is accessed. The CLUT memory is mapped into the core’s address range. It can be externally accessed (read and write) via the WISHBONE Slave interface, starting at address 0x800. CLUT0 is located at memory range 0x800 to 0xBFC, CLUT1 at 0xC00 to 0xFFC. All external accesses to the CLUT are 32-bit, but the CLUT itself is only 24 bits wide. The top-most bits [31:24] are ignored for write accesses and are always zero for read accesses.

4.3 Hardware Cursors

4.3.1 Introduction
The Enhanced VGA/LCD Core provides up to two hardware cursors. Whether a cursor is implemented, and which one, is decided by the system designer. The core takes two definition parameters (VGA_HWC0 and VGA_HWC1) as input. The define statements are located in the “vga_defines.v” file. If both definition parameters are undefined, no logic is generated for the hardware cursors. If a definition parameter is defined, logic for the appropriate cursor is generated. Cursor0 is normally used to provide the arrow pointer in GUI applications and operating systems. Cursor1 has no pre-assigned purpose; it can be used to provide some form of user cursor in a pop-up window. Off-screen memory in the frame buffer or, if accessible by the core, system memory is used to provide the locations where the patterns for both cursors are stored.
This allows each cursor to be displayed and used without altering the main display image stored in the frame buffer. The hardware takes care of selecting between the cursor and the image. The Cursor Base Address register determines the cursor’s pattern location. Each cursor may have multiple patterns stored in memory, making it possible to change a cursor’s appearance by simply writing a different pattern address to the appropriate Base Address register.

4.3.2 Cursor Patterns

The amount of memory allocated for each cursor pattern is 16Kbit. The cursor resolution is user-selectable: either 32x32 pixels at 16bpp color depth, or 64x64 pixels at 4bpp color depth. The cursor pattern is stored in consecutive memory locations, starting at the address set by the cursor’s Base Address register. Each address location contains data for multiple cursor pixels: 2 pixels in 32x32 pixel mode and 8 pixels in 64x64 pixel mode.

4.3.2.1 32x32 Pixel Mode

In 32x32 pixel mode, each pixel has a 16-bit color depth, divided into a selection bit and either a 15-bit cursor color (32K colors) or an 8-bit alpha channel. The MSB selects between cursor color mode and alpha channel mode. The alpha channel is used to generate transparent pixels or 3D effects (see Pattern Color Data).

4.3.2.2 64x64 Pixel Mode

In 64x64 pixel mode, each pixel has a 4-bit color depth. The 4 bits are used as a lookup table index to select one of the 16 available Cursor Color registers. Each Color register contains a 16-bit value that has the same features as the cursor pattern data in 32x32 pixel mode, i.e. 1 selection bit and 15 color bits or an 8-bit alpha channel.

4.3.2.3 Cursor Pixel Data

Each cursor pixel is represented by a 16-bit color value. The MSB selects between color mode and Alpha/Transparency mode.
bit 15   bit 14:8      bit 7:0      Mode
0        Color Data (bits 14:0)     Color mode
1        always zero   Alpha Data   Alpha / Transparency mode

In color mode the LSBs represent a 15-bit RGB value, resulting in 32K colors. The 32K cursor colors are generated by equally distributing the color information over the RGB components, i.e. 5 bits for red, 5 bits for green, and 5 bits for blue. Internally the 5-bit R, G, and B values are extended to 8 bits; the lower 3 bits of each color are set to zero.

In Alpha/Transparency mode, the LSBs are divided into two sections. The first section (bits 14:8) is reserved and should always be read and written as zero. The second section (bits 7:0) represents an 8-bit alpha value. The alpha value is a crossfader setting between the image pixel value and the black level (RGB = 0). Alpha is normally defined as a value between 0 and 1, where 0 = ‘00’hex and 1 = ‘FF’hex. Setting the Alpha value to 0 results in the black level being displayed. Setting the Alpha value to 1 results in the image pixel being displayed. Any value between 0 and 1 results in a linear mix between the image pixel value and black. This can be used to add a shadow effect to a cursor, thus creating 3D cursors. The image below shows how to create the 3D cursor from the Redwood scheme.

[Figure: 3D cursor from the Redwood scheme. Corner coordinates: (0,0), (0,31)/(0,63), (31,0)/(63,0), (31,31)/(63,63). Legend: Transparent (0 < Alpha < 1), Gray, White, Black, Transparent (Alpha = 1).]

4.3.3 Turning off 3D support

The alpha-blending logic requires a considerable amount of resources. Therefore, the ability to turn off the 3D support has been provided. When 3D support is turned off, the cursor processor ignores the alpha data and generates a transparent pixel. Instead of a shadow effect, the image pixel is displayed. This behavior guarantees that both 3D and non-3D cursors are displayed correctly.

3D cursor support is enabled when VGA_HWC_3D is defined.
3D cursor support is disabled when VGA_HWC_3D is undefined.

4.3.4 Cursor Processor Internals

[Figure: Cursor Processor internals. Blocks: Address Generator (ADR O, DAT I), two Cursor Buffers, Cursor0 Processor, Cursor1 Processor. RGB data enters from the Color Processor and leaves to the Line FIFO.]

The Cursor Processor handles the hardware cursors together with the WISHBONE Master interface. The internal structure of the Cursor Processor, including parts of the WISHBONE Master interface, is shown in the figure above. If a cursor is not implemented, its processor is a pass-through function. The above schematic still applies, but no logic is generated for that cursor.

4.3.5 Address Generator

The address generator is part of the WISHBONE Master interface. When copying a cursor into one of the cursor buffers, it generates the memory addresses and writes the data read into the buffers. The cursor processors issue a cursor read request to the address generator when their corresponding Cursor Base address [C0BAR][C1BAR] is written to. When the WISHBONE Master finishes reading the current video frame, it honors one cursor read request. The cursor data is read in one continuous stream before the start of the next frame.

If both cursors need to be reloaded, one is reloaded before the next frame and its cursor read request is negated. The second cursor read request is not honored; it remains asserted. When the WISHBONE Master finishes reading the new frame, it honors the second cursor read request.

Cursor0 has a higher priority than Cursor1. When both cursors need to be reloaded, Cursor0 is reloaded first. This implies that continuously reloading Cursor0 results in Cursor1 never being reloaded. However, this situation should never occur during normal operation.

4.3.6 Cursor Buffer

The cursor buffers are 512x32 bit synchronous static random access memories.
The address generator writes a copy of the cursor pattern into the cursor buffer whenever the cursor base address [C0BAR][C1BAR] is written to.

4.3.7 Cursor0/Cursor1 Processor

The two cursor processors are the intelligent part of the cursor system. Each cursor processor handles one cursor. It keeps track of the raster-scan position, determines whether or not the cursor pattern should be updated and whether or not the cursor should be displayed, and generates the cursor colors including the alpha mixing.

4.4 Bank switching

4.4.1 Introduction

The bank switching system is implemented as a double buffering scheme, also known as a Ping-Pong system. The core reads pixel information from one memory bank while the second bank is being filled. When the second bank has been filled, the host sets the Video Bank Switch Enable bit [VBSE] and/or the Color Lookup Table Bank Switch Enable bit [CBSE]. The core continues reading the current bank until the entire frame has been read. It then switches to the second bank and starts reading the new frame. The core automatically resets the VBSE and CBSE bits to avoid accidentally switching back to the previous bank. A Video Bank Switch Interrupt is generated when the core switches between the two video memory banks, and a CLUT Bank Switch Interrupt is generated when the core switches between the two Color Lookup Tables.

4.4.2 Host notes

The host should not set the VBSE or CBSE bits until all frame information has been written to the video memory. The host system should wait for the Bank Switch Interrupt before filling the previous memory bank.

4.4.3 Sequence

1) Fill video bank0.
2) Fill video bank1.
3) Set VBSE, CBSE, BSIE.
4) Wait for interrupt.
5) Fill video bank0.
6) Set VBSE, CBSE.
7) Wait for interrupt.
8) Fill video bank1.
9) Set VBSE, CBSE.
10) Go to step 4.

4.5 Bandwidth Issues

4.5.1 Introduction

Video displays are real-time devices. The video data stream needs to be generated uninterrupted, or images will be corrupted. The VGA_LCD core provides some flexibility through the use of internal FIFOs, including the large dual-clocked LineFIFO. Still, the average bandwidth required by the video must be met.

4.5.2 Calculations

The required video bandwidth can be calculated using the following formula:

$$BW_{video} = H_{pix} \cdot V_{lin} \cdot F_{refr} \quad (\mathrm{pps})$$

Hpix = number of visible horizontal pixels (Thgate)
Vlin = number of visible vertical lines (Tvgate)
Frefr = refresh rate (Hz)

For example, a standard VGA display with 640*480 visible pixels and a refresh rate of 60Hz requires a bandwidth of BW = 640 * 480 * 60 = 18.4 Mpixels_per_sec (Mpps). An SVGA display with 1024*768 pixels and a 75Hz refresh rate requires 59Mpps. Note that this number also represents the pixel-clock frequency, because only one pixel is displayed at a time.

The required host bus bandwidth depends on the number of bits per pixel, as shown in the next formula:

$$BW_{required} = BW_{video} \cdot N_{bits\_per\_pixel} \quad (\mathrm{bps})$$

Using the previous examples we can calculate the following table:

Color depth   640*480 @60Hz   1024*768 @75Hz
32bpp         590Mbps         1.9Gbps
24bpp         443Mbps         1.4Gbps
16bpp         295Mbps         944Mbps
8bpp          147Mbps         472Mbps

The host bus occupation is dependent on the total host bus bandwidth, the initial memory latency, the memory access/acknowledge latency, and the programmed video burst length.
It can be calculated as follows:

$$O_{bus} = \frac{BW_{required}}{BW_{bus}} \cdot 100\%$$

BWbus = host bus bandwidth (Mbps)

Or more detailed:

$$O_{bus} = \frac{BW_{required}}{F_{bus} \cdot N_{bus}} \cdot \frac{Mlat_{initial} + VBL \cdot Mlat_{acc}}{VBL} \cdot 100\%$$

Fbus = host bus frequency (Hz)
Nbus = host bus width (bits)
Mlat(initial) = initial video memory latency (clk cycles)
Mlat(acc) = video memory access latency (clk cycles)
VBL = Video Burst Length

4.5.3 Examples

4.5.3.1 Example 1

Assume the following system: 200MHz, 32-bit host system using SDRAMs as video memory, running at half the bus frequency, displaying a 1024*768 image @75Hz 24bpp.

Fbus = 200MHz
Nbus = 32-bit
BWrequired = 1.4Gbps
Mlat(initial) = 6 (2 * CAS-latency of 3)
Mlat(acc) = 2 (single cycle bursts at half the bus frequency)
Video_burst_length = 4

Total host bus occupation = 77.4%

4.5.3.2 Example 2

Assume a system with an average memory bandwidth of 250MBps displaying an 800*600 image @60Hz 16bpp.

BWrequired = 461Mbps
BWbus = 2Gbps

Total host bus occupation = 23%

4.5.3.3 Example 3

Assume the following system: 30MHz, 32-bit host system using SRAMs as video memory, displaying a 320*240 image @60Hz 8bpp.

Fbus = 30MHz
Nbus = 32-bit
BWrequired = 37Mbps
Mlat(initial) = 1 (access selector)
Mlat(acc) = 2 (address setup)
Video_burst_length = 8

Total host bus occupation = 8.2%

Note that these numbers are for reading only. The video memory needs to be filled in order to be able to display something. Depending on the application, filling the video memory can require a considerable amount of bandwidth too.
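The three examples above can be reproduced with a few lines of arithmetic. The following Python sketch is only an illustration of the formulas in section 4.5.2; the function names are hypothetical and are not part of the core or its drivers. It uses the unrounded required bandwidth, which is why Example 1 yields 77.4% rather than the figure obtained from the rounded 1.4Gbps.

```python
def required_bandwidth_bps(h_pix, v_lin, f_refr_hz, bits_per_pixel):
    """BWrequired = Hpix * Vlin * Frefr * Nbits_per_pixel, in bits per second."""
    return h_pix * v_lin * f_refr_hz * bits_per_pixel


def bus_occupation_simple(bw_required_bps, bw_bus_bps):
    """Obus = BWrequired / BWbus * 100%."""
    return bw_required_bps / bw_bus_bps * 100.0


def bus_occupation_detailed(bw_required_bps, f_bus_hz, n_bus_bits,
                            mlat_initial, mlat_acc, vbl):
    """Detailed form: accounts for memory latencies and the video burst length."""
    # Average number of bus clock cycles spent per transferred word.
    cycles_per_word = (mlat_initial + vbl * mlat_acc) / vbl
    return bw_required_bps / (f_bus_hz * n_bus_bits) * cycles_per_word * 100.0


# Example 1: 200MHz, 32-bit bus, 1024*768 @75Hz 24bpp, burst length 4.
example1 = bus_occupation_detailed(
    required_bandwidth_bps(1024, 768, 75, 24), 200e6, 32, 6, 2, 4)

# Example 2: 250MBps (2Gbps) average memory bandwidth, 800*600 @60Hz 16bpp.
example2 = bus_occupation_simple(
    required_bandwidth_bps(800, 600, 60, 16), 250e6 * 8)

# Example 3: 30MHz, 32-bit bus, 320*240 @60Hz 8bpp, burst length 8.
example3 = bus_occupation_detailed(
    required_bandwidth_bps(320, 240, 60, 8), 30e6, 32, 1, 2, 8)

for name, value in (("Example 1", example1),
                    ("Example 2", example2),
                    ("Example 3", example3)):
    print(f"{name}: {value:.1f}% host bus occupation")
```

As in the text, these figures cover read traffic only; any bandwidth needed to fill the video memory comes on top.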
5 Architecture

[Figure: core architecture block diagram. Blocks: WISHBONE SLAVE Interface (from host), Timing Registers, Control Register, Status Register, Video Memory Base Registers, Cursor Base Registers, Cursor (x,y) Registers, Video Timing Generator (HSYNC, VSYNC, CSYNC, BLANK), WISHBONE MASTER Interface (to video memory), Color Lookup Table, Cursor Buffers, Color Processor, Cursor Processor, Line FIFO (outputs R(7:0), G(7:0), B(7:0)). Other signals: DAC clock, LCD clock, wb_inta_o.]

5.1 Color Lookup Table

The Color Lookup Table (or CLUT) is a 512x24 bit single clock synchronous static random access memory divided into two separate CLUTs of 256x24 bit each. Each color lookup table contains a 24-bit RGB value for each entry. The color processor uses 8bpp pseudo color data as an address input to the color lookup table. The output from the color lookup table is the RGB data for the current pixel.

5.2 Cursor Base Registers

The Cursor Base registers contain the starting address of the current cursors. Each cursor is 32x32 pixels large. Each pixel is always in 16bpp color mode. Therefore, 512 address locations are required to store a single cursor. A cursor is stored consecutively, starting at pixel (0,0) representing the upper left corner of the cursor, then continuing through pixels (0,1) to (0,31), then (1,0) to (1,31), etc. A cursor can be located anywhere in memory as long as the memory is accessible by the VGA_LCD core and it starts at a cursor boundary, i.e. the lower 10 address bits must be zero.

5.2 Cursor Buffers

The cursor buffers are 512x32 bit single clock synchronous static memories. Each buffer contains a copy of the current cursor pattern. The core reads the cursor patterns from the external memory and stores them in the cursor buffers, thus avoiding having to read them every frame. The core copies a cursor pattern whenever the Cursor Base Address register is written to.
This also makes it possible to display a cursor that differs from the one currently stored in the external memory. After the pattern in memory has been changed, simply rewriting the same address to the Cursor Base Address register is enough to read the new cursor data and display the new cursor.

5.3 Cursor Processor

The cursor processor translates the stored cursor pattern into a visible cursor. It manages the cursor location and determines the pixel information for the current pixel (image or cursor), including cursor transparency and alpha blending.

5.4 Color Processor

The Color Processor translates the received pixel data to RGB color information. In 32-bit and 24-bit color mode, this is a pass-through function. In 16-bit color mode this is a linear translation: 5-bit Red, 6-bit Green, and 5-bit Blue. In 8-bit grayscale mode the same data is placed on the red, green, and blue color outputs, effectively generating a black-and-white image. In 8-bit pseudo color mode the received pixel data is sent through the internal color lookup table.

5.5 Line FIFO

The dual-clocked Line FIFO ensures a continuous data stream towards the VGA or LCD display and a correct transfer from the WISHBONE clock domain to the VGA clock domain.

5.6 Video Memory Base Registers

The Video Memory Base registers contain the starting addresses of the external video memory banks.

5.7 Video Timing Generator

The Video Timing Generator generates the horizontal synchronization pulse [hsync_pad_o], the vertical synchronization pulse [vsync_pad_o], the corresponding interrupt signals [HINT] and [VINT], the composite synchronization pulse [csync_pad_o], the blanking signal [blank_pad_o] and the read request to the Line FIFO.

5.8 Wishbone Master Interface

The WISHBONE Master interface manages all accesses to the external memory. It consists of a number of interacting state machines.
The color processor and the cursor processor issue requests to the WISHBONE Master. The WISHBONE Master interface then generates the memory addresses for the image and the cursors.

5.9 Wishbone Slave Interface

The WISHBONE Slave interface manages all accesses to user readable/writeable registers.

Appendix A VGA Modes

This appendix describes some common VGA modes.

A.1 Vertical Timing Information

Common VGA Modes

Mode  Resolution  Refresh  Line width  Sync pulse   Back porch   Active time   Front porch  Frame total
                  rate     (usec)      usec   lin   usec   lin   usec    lin   usec   lin   usec    lin
QVGA  320x240     60 Hz
VGA   640x480     60 Hz    31.78       63     2     953    30    15382   484   285    9     16683   525
VGA   640x480     72 Hz    26.41       79     3     686    26    12782   484   184    7     13735   520
SVGA  800x600     56 Hz    28.44       56     1     568    20    17177   604          -1*   17775   625
SVGA  800x600     60 Hz    26.40       106    4     554    21    15945   604          -1*   16579   628
SVGA  800x600     72 Hz    20.80       125    6     436    21    12563   604   728    35    13853   666

• The Active Time includes 4 overscan borderlines. Some timing tables include these in the back and front porch.
• (*) When the Active Time is increased, it passes the rising edge of the vsync signal, hence the -1 Front Porch.

A.2 Horizontal Timing Information

Common VGA Modes

Mode  Resolution  Refresh  Pixel clock  Sync pulse    Back porch  Active time  Front porch  Line total
                  rate     (MHz)        usec    pix   (pix)       (pix)        (pix)        (pix)
QVGA  320x240     60 Hz
VGA   640x480     60 Hz    25.175       3.81    96    45          646          13           800
VGA   640x480     72 Hz    31.5         1.27    40    125         646          21           832
SVGA  800x600     56 Hz    36           2       72    125         806          21           1024
SVGA  800x600     60 Hz    40           3.2     128   85          806          37           1056
SVGA  800x600     72 Hz    50           2.4     120   61          806          53           1040

• The Active Time includes 6 overscan borderlines. Some timing tables include these in the back and front porch.

Partially taken from Jere Makela, Software Design for a Video Conversion Equipment. Master’s Thesis, Helsinki University of Technology.
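The two timing tables in this appendix are related: the line width in A.1 is the line total from A.2 divided by the pixel clock, and the refresh rate follows from the line total and the frame total. The Python sketch below is a hedged consistency check on the tabulated values, not part of the core; the tuple layout and the function names are assumptions made for illustration.

```python
# Values copied from the tables in A.1 and A.2:
# (mode, pixel clock MHz, sync pix, back porch pix, active pix,
#  front porch pix, line total pix, frame total lines)
MODES = [
    ("VGA 640x480 @60Hz", 25.175, 96, 45, 646, 13, 800, 525),
    ("VGA 640x480 @72Hz", 31.5, 40, 125, 646, 21, 832, 520),
    ("SVGA 800x600 @56Hz", 36.0, 72, 125, 806, 21, 1024, 625),
    ("SVGA 800x600 @60Hz", 40.0, 128, 85, 806, 37, 1056, 628),
    ("SVGA 800x600 @72Hz", 50.0, 120, 61, 806, 53, 1040, 666),
]


def line_time_us(line_total_pix, pixel_clock_mhz):
    """Duration of one scan line in microseconds (pixels / MHz)."""
    return line_total_pix / pixel_clock_mhz


def refresh_rate_hz(line_total_pix, frame_total_lines, pixel_clock_mhz):
    """Frames per second: pixel clock divided by pixels per frame."""
    return pixel_clock_mhz * 1e6 / (line_total_pix * frame_total_lines)


for mode, clk, sync, back, active, front, total, lines in MODES:
    # The four horizontal segments should add up to the line total.
    segments_ok = (sync + back + active + front) == total
    print(f"{mode}: line {line_time_us(total, clk):.2f} usec, "
          f"refresh {refresh_rate_hz(total, lines, clk):.2f} Hz, "
          f"segments sum to line total: {segments_ok}")
```

For the 640x480 @60Hz row, for instance, 800 pixels at 25.175 MHz gives the 31.78 usec line width listed in A.1.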
Appendix B Target Dependent Implementations

The parts of the system that may be target dependent for FPGA implementations, and are always target dependent for ASIC implementations, are the dual clocked RAM block for the Line FIFO and the single clock RAM blocks for the color lookup table and the cursor buffers. The RAM blocks are instantiated by the generic_spram.v and generic_dpram.v files. These files contain an FPGA-synthesizable model that has been tested with Exemplar’s LeonardoSpectrum and Synplicity’s Synplify for Altera (FLEX, ACEX, APEX) and Xilinx devices (Virtex, Virtex-E, Spartan-II). They also contain modules for some ASIC technologies. The technology is set by a define statement in the vga_defines.v file.

    `define VENDOR_FPGA             use FPGA (Xilinx and Altera) synthesizable model
    `define VENDOR_ARTISAN          use Artisan memories
    `define VENDOR_VIRTUALSILICON   use VirtualSilicon memories
    . . .

Check the generic_spram.v and generic_dpram.v files for more information.
Appendix C Core Structure

[Figure: module hierarchy of the core. The name/file pairs shown in the diagram are listed below.]

Name                 File
VGA                  vga_enh_top.v
WISHBONE             vga_wb_slave.v
CLUT                 vga_csm_pb.v
clut_mem             generic_spram.v
Line FIFO            vga_fifo_dc.v
fifo_dc_mem          generic_dpram.v
WISHBONE             vga_wb_master.v
Pixel Generator      vga_pgen.v
CLUT switch Fifo     vga_fifo.v
Data Fifo            vga_fifo.v
RGB Fifo             vga_fifo.v
Color Processor      vga_colproc.v
Cursor Processors    vga_curproc.v
Timing Generator     vga_tgen.v
Horizontal Timing    vga_vtim.v
Vertical Timing      vga_vtim.v
SyncPulseCounter     ro_cnt.v
GateDelayCounter     ro_cnt.v
GateCounter          ro_cnt.v
LengthCounter        ro_cnt.v
counter (x4)         ud_cnt.v

Appendix D Design Notes

D.1 Introduction

This section contains flow and timing diagrams of the core’s internal blocks. The diagrams are provided for reference only. They are intended to provide a better understanding of the internal signal flow. They are not intended to serve as a detailed step-through discussion of the core’s internals.

D.2 vga_curproc

This section shows the signal flow inside the cursor processor blocks. The letters in the data busses are intended to make the data flow easier to follow. They represent signals that are related to each other and have a common timing spec; for example, cbuf_a-A represents address-A into the cursor buffer, and cbuf_q-A is the cursor buffer’s output at address-A.
[Figure: vga_curproc signal-flow timing diagram. Signal groups: clk; idat/wreq with delayed copies (didat, ddidat, dddidat); inbox signals (xcnt, inbox x, xdone, ycnt, inbox y, inbox, dinbox, ddinbox); cursor buffer access signals (cbuf a, cbuf q); cursor 64x64 pixels signals (cc adr, cc dat i); cursor 32x32 pixels signals; image data; r, g, b, alpha and delayed dr, dg, db, dalpha; RGB generation; wreq generation (store1, store2, wreq).]