CiNIC – Calpoly Intelligent NIC

A Thesis Presented to the Faculty of California Polytechnic State University, San Luis Obispo

In Partial Fulfillment of the Requirements for the Degree Master of Science in Electrical Engineering

By James D. Fischer
June 2001

Authorization for Reproduction of Master's Thesis

I hereby grant permission for the reproduction of this thesis, in whole or in part, without further authorization from me.

Signature: (James D. Fischer)        Date:

Approval Page

Title: CiNIC – Calpoly Intelligent NIC
Author: James D. Fischer
Date Submitted: June 26, 2001

Dr. James G. Harris, Advisor
Dr. Mei-Ling Liu, Committee Member
Dr. Fred W. DePiero, Committee Member

Abstract

CiNIC – Calpoly Intelligent NIC
by James D. Fischer

The Calpoly intelligent NIC (CiNIC) Project offloads TCP/IP stack processing from an i686-based (Pentium) host system running either Microsoft Windows 2000 or Redhat Linux 7.1 onto a dedicated EBSA-285 co-host running the ARM/Linux operating system. The goals of this thesis are threefold: 1) implement a software development environment for the EBSA-285 platform; 2) implement a low-level system environment that can boot an ARM/Linux kernel on the EBSA-285 board; and 3) implement performance instrumentation that gathers latency data from the CiNIC architecture during network send operations. The first half of this thesis presents a brief history of the CiNIC Project and an overview of the hardware and software architectures that comprise the CiNIC platform, along with an overview of the software development tools for the ARM/Linux environment on the EBSA-285. The second half of the thesis develops the performance instrumentation technique Ron J. Ma and I developed to gather latency data from the CiNIC platform during network send operations. Samples of the performance data are provided to demonstrate the capabilities of this instrumentation technique.

Acknowledgements

I would like to thank the following individuals and companies for their invaluable support throughout this project. Paul Rudnick, John Dieckman, Carl First, Charles Chen, Anand Rajagopalan and others at 3Com Corporation generously provided equipment, financial support, technical support and moral support to Cal Poly's Network Performance Research Group over the years. Jim Medeiros and Pete Hawkins of Ziatech/Intel Corporation volunteered their time and knowledge of PCI-based systems design. Adaptec Corporation generously donated a SCSI-RAID card to the CiNIC project.

I would particularly like to thank Dr. Jim Harris, my thesis advisor, mentor, and friend, for his seemingly inexhaustible enthusiasm, knowledge, and support. Dr. Mei-Ling Liu, Dr. Hugh Smith, and the student members of Cal Poly's Network Performance Research Group – bright, creative, stimulating, and thoroughly enjoyable individuals – have contributed more to my Cal Poly education than they can possibly imagine. In particular I would like to thank Ron J. Ma for assisting me in adapting my original instrumentation technique for use in the new CiNIC architecture. And finally, I extend my heartfelt thanks to my family for their boundless support and love.
Table of Contents

List of Tables
List of Figures
Chapter 1  Introduction
Chapter 2  CiNIC Architecture
  2.1 Historical Overview
  2.2 Hardware Architecture
    2.2.1 Host Computer System
    2.2.2 Intel 21554 PCI-PCI Nontransparent Bridge
    2.2.3 Intel EBSA-285 StrongARM / 21285 Evaluation Board
    2.2.4 3Com 3c905C Fast Ethernet Network Card
    2.2.5 Promise Ultra66 PCI-EIDE Hard Disk Adaptor
    2.2.6 Maxtor 531DX Ultra 100 Hard Drive
    2.2.7 Intel EBSA-285 SDRAM Memory Upgrade
    2.2.8 FuturePlus Systems FS2000 PCI Analysis Probe
  2.3 Software Architecture
    2.3.1 Integrating CiNIC with Microsoft Windows 2000
    2.3.2 Integrating CiNIC with Redhat Linux 7.x
    2.3.3 Service Provider Example: Co-host Web Caching
Chapter 3  Development / Build Environment
  3.1 Overview of the Software Infrastructure
  3.2 Creating an i686-ARM Tool Chain
    3.2.1 Caveat Emptor
    3.2.2 Getting Started
    3.2.3 Creating the i686-ARM Development Tools with an i686 host
    3.2.4 Creating the i686-ARM C Library with an i686 host
    3.2.5 Creating the ARM-ARM Tool Chain on an i686 host
    3.2.6 Copying the ARM-ARM Tool Chain to the EBSA-285
    3.2.7 Rebuild the ARM-ARM Tool Chain on the EBSA-285
  3.3 System Environment
    3.3.1 EBSA-285 BIOS
    3.3.2 Booting Linux on the EBSA-285
    3.3.3 Linux Bootstrap Sequence
    3.3.4 Linux Operating System Characteristics
    3.3.5 Synchronizing the System Time with the Current Local Time
  3.4 User Environment
    3.4.1 Building applications with the i686-ARM cross tool chain
    3.4.2 Installing from the i686 host to the EBSA-285's file system
    3.4.3 Rebuilding everything with the ARM-ARM tool chain
Chapter 4  Instrumentation
  4.1 Background
  4.2 Performance Metrics
    4.2.1 Framework for Host Performance Metrics
  4.3 Stand-alone Performance Instrumentation
    4.3.1 Theory of Operation
    4.3.2 IPv4 Fragmentation of UDP Datagrams in Linux
  4.4 Host / Co-host Combined Platform Performance Instrumentation
    4.4.1 Theory of Operation
    4.4.2 CiNIC Architecture Polling Protocol Performance Testbed
    4.4.3 CiNIC Host Socket Performance Testbed
    4.4.4 Secondary PCI Bus "Posted Write" Phenomenon
Chapter 5  Performance Data and Analysis
  5.1 Stand-alone Performance – Microsoft Windows 2000 (SP1)
  5.2 Stand-alone Performance – Linux 2.4.3 Kernel on i686 Host
  5.3 Stand-alone Performance – Linux 2.4.3 Kernel on EBSA-285
  5.4 CiNIC Architecture – Polling Protocol Performance Testbed
    5.4.1 CiNIC Polling Protocol TCP / IPv4 Performance
    5.4.2 CiNIC Polling Protocol UDP / IPv4 Performance
Chapter 6  Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work
Bibliography
Appendix A  Acronyms
Appendix B  RFC 868 Time Protocol Servers
Appendix C  EBSA-285 SDRAM Upgrade

List of Tables

Table 2-1  Description of the host computer system

List of Figures

Figure 2-1  Anaconda testbed
Figure 2-2  Intel's EBSA-285 replaces 3Com Anaconda as co-host
Figure 2-3  The 21554 bridges two distinct PCI bus domains
Figure 2-4  CiNIC hardware architecture
Figure 2-5  Schematic diagram of the CiNIC hardware architecture
Figure 2-6  Intel 21554 nontransparent PCI-PCI bridge
Figure 2-7  Primary PCI bus's view of the 21554 and the secondary PCI bus
Figure 2-8  Secondary PCI bus's view of the 21554 and the primary PCI bus
Figure 2-9  Original Windows 2000 software architecture
Figure 2-10  Microsoft Windows Open System Architecture [37]
Figure 2-11  "WOSA friendly" Windows 2000 software architecture
Figure 2-12  Redhat Linux 7.x networking architecture
Figure 2-13  Overview of the CiNIC architecture
Figure 2-14  Example of a web caching proxy on a non-CiNIC host
Figure 2-15  CiNIC-based web caching service provider
Figure 3-1  Some on-line help resources for Linux software developers
Figure 3-2  Using an i686 host to create the i686-ARM tool chain
Figure 3-3  Creating the installation directory for the i686-ARM tool chain
Figure 3-4  Configure options for the i686-ARM binutils and gcc packages
Figure 3-5  Copying the ARM/Linux headers into the i686-ARM directory tree
Figure 3-6  Configure options for the GNU glibc C library
Figure 3-7  Adding the i686-ARM tool chain path to the 'PATH' environment variable
Figure 3-8  Using an i686 host to create an ARM-ARM tool chain
Figure 3-9  Initial configure options when building the ARM-ARM tool chain
Figure 3-10  Create a mount point on the i686 host for the EBSA-285's file system
Figure 3-11  /etc/fstab entry for /ebsa285fs mount point on i686 host
Figure 3-12  Copy the ARM-ARM tool chain to the EBSA-285's file system
Figure 3-13  Create symlinks after installing from i686 host to EBSA-285's file system
Figure 3-14  Using the ldconfig command to update the EBSA-285's dynamic linker
Figure 3-15  Create ldconfig on the i686 host and install it on the EBSA-285
Figure 3-16  Recreation of the native ARM-ARM tool chain
Figure 3-17  Configure options when rebuilding the ARM-ARM tool chain
Figure 3-18  EBSA-285 jumper positions for programming the flash ROM [24]
Figure 3-19  EBSA-285 flash ROM configuration
Figure 3-20  The platform used by Eric Engstrom to port Linux onto the EBSA-285
Figure 3-21  ARM/Linux BIOS's bootp sequence
Figure 3-22  Linux kernel bootp sequence
Figure 3-23  Bash environment variables for common i686-ARM configure options
Figure 3-24  Configure step for apps that are used during the Linux boot sequence
Figure 3-25  Configure step for apps that are not used during the Linux boot sequence
Figure 3-26  Copying the /arm-arm directory to the EBSA-285's root directory
Figure 3-27  Create symlinks after installing from i686 host to EBSA-285's file system
Figure 3-28  Bash environment variables for the ARM-ARM configure options
Figure 4-1  "Stand-alone" host performance instrumentation
Figure 4-2  Contents of the payload buffer for stand-alone measurements
Figure 4-3  Write to unimplemented BAR on 3c905C NIC corresponds to time T0
Figure 4-4  Wire arrival time T1
Figure 4-5  Wire exit time T2
Figure 4-6  "Spreadsheet friendly" contents of test results file
Figure 4-7  IP4 fragmentation of "large" UDP datagram
Figure 4-8  IP4 datagram send sequence for UDP datagrams > 1 MTU
Figure 4-9  PCI bus write burst of IP datagram N where |IP datagram N| < |start marker|
Figure 4-10  Logic analyzer payload "miss"
Figure 4-11  CiNIC architecture instrumentation configuration
Figure 4-12  Well-defined data payload for the combined platform
Figure 4-13  Write to unimplemented BAR triggers the logic analyzer
Figure 4-14  Wire arrival time T1,1 on primary PCI bus
Figure 4-15  Wire arrival time T2,1 on secondary PCI bus
Figure 4-16  Wire exit time T1,2 on primary PCI bus
Figure 4-17  Wire exit time T2,2 on secondary PCI bus
Figure 4-18  Wire arrival time T2,3 on secondary PCI bus
Figure 4-19  Wire exit time T2,4 on secondary PCI bus
Figure 4-20  Data path time line
Figure 4-21  CiNIC polling protocol performance testbed
Figure 4-22  CiNIC architecture host socket performance testbed
Figure 4-23  Start marker appears first on secondary PCI bus
Figure 5-1  Windows 2000 TCP/IP4 on i686 host, payload sizes: 32 to 16K bytes
Figure 5-2  Windows 2000 TCP/IP4 on i686 host, payload sizes: 512 to 65K bytes
Figure 5-3  Windows 2000 UDP/IP4 on i686 host, payload sizes: 512 to 65K bytes
Figure 5-4  Linux 2.4.3 TCP/IP4 on i686 host, payload sizes: 64 to 9K bytes
Figure 5-5  Linux 2.4.3 TCP/IP4 on i686 host, payload sizes: 512 to 65K bytes
Figure 5-6  Linux 2.4.3 UDP/IP4 on i686 host, payload sizes: 64 to 9K bytes
Figure 5-7  Linux 2.4.3 UDP/IP4 on i686 host, payload sizes: 512 to 65K bytes
Figure 5-8  Linux 2.4.3-rmk2 TCP/IP4 on EBSA285, payload sizes: 64 to 4.3K bytes
Figure 5-9  Linux 2.4.3-rmk2 TCP/IP4 on EBSA285, payload sizes: 512 to 50K bytes
Figure 5-10  Linux 2.4.3-rmk2 UDP/IP4 on EBSA285, payload sizes: 64 to 9K bytes
Figure 5-11  Linux 2.4.3-rmk2 UDP/IP4 on EBSA285, payload sizes: 512 to 31K bytes
Figure 5-12  CiNIC wire entry times
Figure 5-13  Linux 2.4.3 TCP/IP4 on CiNIC, polling protocol, wire entry latencies
Figure 5-14  Linux 2.4.3 TCP/IP4 on CiNIC, polling protocol, wire entry latencies
Figure 5-15  Partial plot of CiNIC wire entry times
Figure 5-16  Linux 2.4.3 TCP/IP4 on CiNIC, polling protocol, wire entry latencies
Figure 5-17  CiNIC wire exit times
Figure 5-18  Linux 2.4.3 TCP/IP4 on CiNIC, polling protocol, wire exit latencies
Figure 5-19  Linux 2.4.3 TCP/IP4 on CiNIC, polling protocol, wire exit latencies
Figure 5-20  Partial plot of CiNIC wire exit times
Figure 5-21  Linux 2.4.3 TCP/IP4 on CiNIC, polling protocol, wire exit latencies
Figure 5-22  21554 overhead on wire entry and exit times
Figure 5-23  Linux 2.4.3 TCP/IP4 on CiNIC, polling protocol, 21554 overhead
Figure 5-24  Socket transfer time
Figure 5-25  Linux 2.4.3 TCP/IP4 on CiNIC, polling protocol, testapp send times (PCI2)
Figure 5-26  Linux 2.4.3 UDP/IP4 on CiNIC, polling protocol, wire entry latencies
Figure 5-27  Linux 2.4.3 UDP/IP4 on CiNIC, polling protocol, wire entry latencies
Figure 5-28  Linux 2.4.3 UDP/IP4 on CiNIC, polling protocol, wire entry latencies
Figure 5-29  Linux 2.4.3 UDP/IP4 on CiNIC, polling protocol, wire exit latencies
Figure 5-30  Linux 2.4.3 UDP/IP4 on CiNIC, polling protocol, wire exit latencies
Figure 5-31  Linux 2.4.3 UDP/IP4 on CiNIC, polling protocol, wire exit latencies
Figure 5-32  Linux 2.4.3 UDP/IP4 on CiNIC, polling protocol, 21554 overheads
Figure 5-33  Linux 2.4.3 UDP/IP4 on CiNIC, polling protocol, testapp send times (PCI2)
Figure 6-1  CiNIC development cycle

Chapter 1  Introduction

The Calpoly intelligent Network Interface Card (CiNIC) Project introduces a networking architecture that offloads network-related protocol stack processing from an Intel i686 host system onto a dedicated ARM/Linux co-host attached to the i686 system. Placing these processing tasks on a dedicated co-host relieves the host system of the myriad peripheral tasks that go along with managing the flow of information through a networking protocol stack – e.g., servicing interrupts from the network card, managing port connection requests, error detection and recovery, managing flow control, etc. Placing network-related protocol stack processing on a dedicated co-host also allows one to implement low-cost, "value-added" service providers such as firewall protection, encryption, information caching, QOS, etc., on the dedicated co-host with little or no performance hit on the host system.

The goals of this thesis are threefold:

• Implement a software development environment for the co-host platform. To this end, I discuss the two GNU software development tool chains I implemented to support ARM/Linux software development for the EBSA-285 co-host platform.

• Implement a system environment on the co-host that is capable of bootstrapping an ARM/Linux kernel on the co-host when power is applied to the co-host. To this end, I discuss my implementation of the ARM/Linux BIOS for the EBSA-285.

• Implement performance instrumentation that gathers latency data from the CiNIC architecture during network send operations. To this end, I discuss the logic analysis / systems approach I developed to gather latency data from the individual components that comprise the CiNIC platform – i.e., the host system performance by itself and the EBSA-285 co-host system performance by itself – and the combined host/co-host CiNIC platform. Samples of the performance data obtained with this instrumentation technique are provided for the purpose of demonstrating the capabilities of this measurement technique.

The remainder of this thesis is organized as follows. A brief historical perspective of the CiNIC project is provided in Chapter 2, along with an overview of the project's current hardware and software architectures. Chapter 3 covers the CiNIC Project's software infrastructure – i.e., the i686-ARM and ARM-ARM software development tool chains, the ARM/Linux BIOS being used to boot a Linux kernel on the EBSA-285, and a brief overview of the bootstrap process. The performance instrumentation technique used to capture performance data from the CiNIC platform during network sends is discussed in Chapter 4. The performance metrics used by the Cal Poly Network Performance Research Group are also mentioned in Chapter 4. Samples of the performance data are provided in Chapter 5 for the purpose of demonstrating the capabilities of the aforementioned instrumentation technique. Chapter 6 presents the thesis's conclusions and proposes some areas for future work. Appendix A contains a list of acronyms that are used throughout this document.
Information on RFC-868 Time Servers is provided in Appendix B. And finally, Appendix C provides information on adding SDRAM to an EBSA-285.

NOTE: A CD-ROM disc accompanies this thesis. This disc contains the source code and executable programs that accompany this work. The disc also contains "README" files with additional / last-minute information that is not presented herein.

Chapter 2  CiNIC Architecture

2.1 Historical Overview

The first efforts at incorporating a co-host for network operation were directed towards off-loading the NIC driver to a co-host. Maurico Sanchez's I2O development work was the project's first effort [49]. He ported a 3Com 3c905x NIC device driver from the host machine to the I2O platform. Subsequent performance measurements indicated that there was a significant latency penalty paid for this arrangement, due to the overhead associated with the I2O message handling.

The seeds of the CiNIC project were sown in the summer of 1999 when the CPNPRG received two "Anaconda" development boards from 3Com Corporation. Each Anaconda contained an Intel StrongARM (SA-110) CPU, an Intel 21285 Core Logic chip, two Hurricane NICs (3Com's 10/100 Mbps Ethernet ASIC), a custom DMA engine, and 2 MB of RAM (see Figure 2-1). With these boards, Angel Yu began experimenting with the idea of transplanting a host machine's TCP/IP protocol stack onto a parasitic "network appliance" plugged into the host's PCI bus [56]. The hope was that by offloading the host machine's low-level networking tasks onto a dedicated "co-host" we could, among other things, free up a significant number of CPU clock cycles on the host system.

Figure 2-1  Anaconda testbed (host computer system on the PCI bus; the 3Com Anaconda carries the 21285 ASIC, SA-110, RAM, and two NICs attached to the LAN)

Yu's work yielded some valuable results, but one unexpected result was the realization that the Anaconda was not particularly well suited for the task at hand. Its limited memory capacity made it almost impossible to fit the required software items within it – i.e., the Anaconda microcode, the TCP/IP protocol stack, an operating system of some sort, data processing and storage areas, etc. It was not well suited as a development platform in that there was no "easy" way to reload our software into the Anaconda's RAM during the boot process (e.g., after a software "crash"). Also, the "mapmem" technique we were using to integrate the Anaconda's RAM into the "virtual machine" environment of a Microsoft Windows NT 4.0 host was problematic, at best.

Despite the aforementioned difficulties with the Anaconda, we genuinely liked the board's architectural concept. We felt that Intel's StrongARM CPU and 21285 Core Logic chip set were particularly well suited to performing TCP/IP stack operations on a dedicated co-host. Furthermore, if we could increase the amount of RAM on the co-host, we could perform not only TCP/IP stack operations but "value added" tasks as well – e.g., web caching, QOS, encryption, firewall support, etc.

So late in the fall of 1999 we began searching for a development platform to replace the Anaconda and came across the Intel EBSA-285 – a single-board computer with the form factor of a PCI add-in card that serves as an evaluation platform for Intel's StrongARM CPU and 21285 Core Logic chip set [25]. The EBSA-285 board had all of the features we were looking for in an ARM-based development platform and more.
The addition of an EBSA-285 board to the CiNIC project allowed us to make additional progress toward the goal of offloading the TCP/IP stack onto a dedicated co-host. Of course, the incorporation of an EBSA-285 into the CiNIC project did not immediately solve all our problems. In fact, it quickly became apparent that the EBSA-285’s built-in boot loader / monitor / debugging software (a.k.a., “Angel”) was less than optimal for our development efforts. What we really needed was a fullblown operating system environment on the EBSA-285 itself that would facilitate our development efforts rather than impede them. Fortunately, Russell King, a systems programming guru over in Great Britain [29], had ported the Linux operating system to a variety of ARM-based single-board computers by this time, including the EBSA285. Thus began the CPNPRG’s effort to implement the Linux operating system – and ultimately a complete development environment – on our EBSA-285. Win2K or Linux on the Host Computer System Primary PCI Linux on the EBSA-285 21285 SA-110 Figure 2-2 Intel’s EBSA-285 replaces 3Com Anaconda as co-host 6 The decision to implement the Linux operating system on the EBSA-285 was a move in the right direction, but it introduced a new and significant problem: we now needed a way to isolate the Linux operating system on the EBSA-285 from the operating system environment on the host machine (which would be either Windows 2000 or Linux). Without an isolation layer between the host and co-host operating system environments, the two operating systems would battle each other for control of the host machine’s PCI bus. Such a conflict would apparently throw the entire system into a state of chaos, and therefore had to be resolved. Thankfully, the task of isolating the host and co-host operating systems had a straightforward solution: the Intel 21554 Nontransparent PCI-to-PCI Bridge [23]. The 21554 is specifically designed to bridge two distinct PCI bus domains (a.k.a., “processor domains”). Win2K or Linux on the Host Computer System Primary PCI 21554 Secondary PCI Linux on the EBSA-285 21285 SA-110 Figure 2-3 The 21554 bridges two distinct PCI bus domains The addition of a 21554 bridge seemed a perfect solution to the problem of isolating the co-host’s OS from the primary PCI bus, but its incorporation introduced yet another significant architectural change to the CiNIC project. Specifically, the incorporation of a 21554 bridge into the CiNIC architecture implied: a) The EBSA-285 must now reside on a secondary PCI bus system, and, 7 b) The secondary PCI bus system must somehow be connected to the host system’s PCI backplane via the 21554 nontransparent bridge, and, c) A host/co-host communications protocol would need to be created to pass information back and forth between the host and co-host processor domains via the 21554 bridge. Nevertheless, the benefits of having the 21554 were obvious and more than justified its incorporation into the project. Ihab Bishara, in his senior project [3], addressed points a) and b) above by locating a 21554 evaluation board that suited our needs perfectly. 2.2 Hardware Architecture The current hardware architecture of the CiNIC project is comprised of the following components: a PC-based host computer system, an Intel 21554 evaluation board, an Intel EBSA-285 evaluation board, a 3Com 3c905C network interface card [NIC], a Promise Ultra66 EIDE card and a Maxtor 2R015H1 15GB hard drive. As an option, a second NIC may be attached to the system as well. 
These components are shown pictorially in Figure 2-4 and schematically in Figure 2-5. The instrumentation components consist of two FuturePlus Systems FS2000 PCI bus probe cards, a Hewlett-Packard 16700A Logic Analysis mainframe, and three Hewlett-Packard 16555D State/Timing modules. (The HP16700A Logic Analysis system and its three 16555D State/Timing modules are not shown.) 8 Host PC Intel EBSA-285 Secondary PCI host bridge slot on the 21554 Intel 21554 & Secondary PCI FuturePlus FS2000 PCI Probe 3Com 3c905C NIC Promise Ultra 66 EIDE Card ( not installed as shown ) Figure 2-4 CiNIC Hardware Architecture Host Computer System Primary PCI 21554 Op tional Secondary PCI EIDE 21285 EBSA 285 NIC NIC LAN Internet Hard Disk Figure 2-5 Schematic diagram of the CiNIC hardware architecture 2.2.1 Host Computer System The host computer, whose DNS name is Hydra, is a Dell Dimension XPS T450 desktop computer whose characteristics of interest are itemized in Table 2-1. This machine was purchased specifically for the CiNIC project, and has five 32-bit PCI slots on its motherboard instead of the usual four. 9 Table 2-1 Description of the host computer system Component Characteristics CPU Intel Pentium-III (Katami); stepping 03; 450 MHz; 16KB L1 instruction cache; 16KB L1 data cache; 512 KB L2 cache; “fast CPU save and restore” feature enabled Primary PCI backplane PCI BIOS revision 2.10; 5 expansion slots; Device list: Host bridge: Intel Corporation 00:00.0 440BX/ZX - 82443BX/ZX Host bridge (rev 03) PCI bridge: Intel Corporation 00:01.0 440BX/ZX - 82443BX/ZX AGP bridge (rev 03) ISA bridge: Intel Corporation 00:07.0 82371AB PIIX4 ISA (rev 02) IDE interface: Intel Corporation 00:07.1 82371AB PIIX4 IDE (rev 01) USB Controller: Intel Corporation 00:07.2 82371AB PIIX4 USB (rev 01) Bridge: Intel Corporation 00:07.3 82371AB PIIX4 ACPI (rev 02) Multimedia audio controller: Yamaha Corporation 00:0c.0 YMF-724F [DS-1 Audio Controller] (rev 03) Ethernet controller: 3Com Corporation 00:0d.0 3c905C-TX [Fast Etherlink] (rev 74) Bridge: Digital Equipment Corporation 00:10.0 DECchip 21554 (rev 01) VGA compatible controller: ATI Technologies Inc 01:00.0 3D Rage Pro AGP 1X/2X (rev 5c) Memory 128 MB 2.2.2 Intel 21554 PCI-PCI Nontransparent Bridge The Intel 21554 is a nontransparent PCI-PCI bridge that performs PCI bridging functions for embedded and intelligent I/O (I20) applications [23]. Unlike a transparent PCI-to-PCI bridge, the 21554 is specifically designed to provide bridging support between two distinct PCI bus domains (a.k.a., “processor domains”) as shown in Figure 2-6. 10 Host 21554 Primary PCI Bus EBSA-285 Secondary PCI Bus Figure 2-6 Intel 21554 nontransparent PCI-PCI bridge Briefly, the 21554 passes PCI transactions between the primary and secondary PCI buses by translating primary PCI bus addresses into secondary PCI bus addresses and vice versa. Because the 21554 is nontransparent, it presents itself to the primary PCI bus as a single PCI device. Consequently, the host bridge on the primary PCI bus is unaware of the presence of the secondary PCI bus on the “downstream” side of the 21554 as shown in Figure 2-7: Host 21554 EBSA-285 Single PCI Device Primary PCI Bus Secondary PCI Bus Figure 2-7 Primary PCI bus’s view of the 21554 and the secondary PCI bus Likewise, the host bridge on the secondary PCI bus sees the 21554 as a single PCI device, and is unaware of the presence of the primary PCI bus on the “upstream” side of the 21554 as shown in Figure 2-8. 
Host 21554 EBSA-285 Single PCI Device Primary PCI Bus Secondary PCI Bus Figure 2-8 Secondary PCI bus’s view of the 21554 and the primary PCI bus 11 The opacity of the 21554 allows two separate host bridges on two PCI busses to operate more-or-less independently of each other, while simultaneously providing a communications path that facilitates the transfer of data between the primary and secondary PCI address domains. In addition to the 21554’s PCI-PCI isolation feature, the 21554 can also provide PCI support services on the secondary PCI bus. For example, the 21554 can be configured as the PCI bus arbiter for up to nine PCI devices on the secondary PCI bus. See [3] and [23] for complete coverage of the feature set and capabilities of the Intel 21554. 2.2.3 Intel EBSA-285 StrongARM / 21285 Evaluation Board The Intel EBSA-285 is a single-board computer with the form-factor of a “standard long” PCI add-in card. It is an evaluation board for the Intel SA-110 “StrongARM” microprocessor and Intel 21285 Core Logic chip set. The board can be configured either as a PCI add-in card or as a PCI host bridge. In the CiNIC architecture, the EBSA-285 is configured for host bridge mode (a.k.a., “central function” mode) [24], and is therefore installed in the “host” slot on the Intel 21554 evaluation board [8] (see also: Figure 2-4). When configured for host bridge operation, the EBSA-285 may optionally be configured as the arbiter for the secondary PCI bus. However, when the EBSA-285 is configured for arbitration, its LED indicators and flash ROM image selector switch are unavailable for use. Since we wanted to use both the LEDs and the flash ROM image selector in the CiNIC project, the job of arbitration on the secondary PCI bus was delegated to the 21554. For complete coverage of the feature set and capabilities of the EBSA-285, as well as a discussion of the hardware 12 configuration choices for the EBSA-285 as it pertains to the CiNIC project, refer to [3], [24] and [25]. 2.2.4 3Com 3c905C Fast Ethernet Network Card 3Com’s 3c905C network card (p/n: 3CSOHO100-TX) provides connectivity between the secondary PCI bus and a 10 Mbps Ethernet or 100 Mbps Fast Ethernet network. The card features a built-in “link auto-negotiation” feature that transparently configures the card for use with either a 10-Mbps or 100-Mbps Ethernet network. 2.2.5 Promise Ultra66 PCI-EIDE Hard Disk Adaptor The Promise Ultra66 EIDE card provides connectivity between the secondary PCI bus and an EIDE hard drive. The Promise Ultra66 EIDE card was chosen because it is supported by the ARM/Linux BIOS [30] we are using to boot the EBSA-285. 2.2.6 Maxtor 531DX Ultra 100 Hard Drive The Maxtor 531DX Ultra 100 hard drive (p/n: 2R015H1) is a “higher-reliability” single head / single platter hard drive. The drive has an Ultra ATA/100 interface that can transfer up to 100 megabytes per second. The drive also utilizes Maxtor’s Silent StoreTM technology for “whisper-quiet” acoustic performance. The 531DX Ultra 100 product line targets entry-level systems and consumer electronics applications [34]. 2.2.7 Intel EBSA-285 SDRAM Memory Upgrade The Intel EBSA-285 comes from the factory with 16 MB of SDRAM installed. This is a sufficient amount of RAM for booting a Linux kernel and running a few programs “on top of” a command shell, but not much more. Software packages 13 comprised of one or more “large” binary images – e.g., the GNU C library (glibc), the Linux kernel, the GNU C and C++ compilers (gcc and g++, respectively), etc. 
– generally cannot be built on an EBSA-285 with only 16 MB of RAM installed. For this reason, we decided to increase the amount of SDRAM on the EBSA-285 by adding an additional 128 MB DIMM to the existing 16 MB DIMM, to give us a total RAM capacity of 144MB. See Appendix C for the particulars of this upgrade. 2.2.8 FuturePlus Systems FS2000 PCI Analysis Probe The FuturePlus Systems FS2000 PCI Analysis Probe and extender card performs three functions [13]: • The first is to act as an extender card, extending a module up 3 inches from the motherboard. • The second is to provide test points very near (less than one inch) from the PCI gold finger card edge to measure the power and signal fidelity. • The third is to provide a complete interface between any PCI add-in slot and HP Logic Analyzers. The Analysis Probe interface connects the signals from the PCI Local bus to the logic analyzer inputs. The FS2000 card comes with inverse assembly software that is loaded onto the HP 16700A logic analyzer. This software converts captured PCI bus signals into human-readable textual descriptions of the PCI transactions captured by the analyzer. When a PCI add-in card is installed in the FS2000 card edge connector, JP2 should have the jumper connected across pins 1 and 2. If no card is installed in the FS2000 card edge connector, JP2 should have the jumper connected across pins 2 and 3. For complete information regarding the possible jumper configurations on the FS2000 card, see page 9 of the FS2000 Users Manual [13]. 14 2.3 Software Architecture From the very beginning, the ability to use the CiNIC with a variety of operating system environments was a design goal of the CiNIC project. To this end, we focused our efforts on integrating the CiNIC co-host into two different operating system environments: Microsoft Windows 2000 and Redhat Linux. Microsoft’s Windows 2000 was chosen because it is representative of the family of “32-bit” Microsoft Windows operating systems that are currently the predominant operating system on Intel x86-based desktop PCs. Redhat Linux (or more specifically, the Linux operating system in general) was chosen primarily because it is an open source operating system – i.e., the source code for both the operating system and the software tools that are needed to build the operating system are freely available via the Internet. Off-loading the TCP/IP protocol stack from the host machine onto the EBSA285 co-host was just one of jobs we had in mind for the EBSA-285 platform. Another of the design goals of the CiNIC project was extensibility. Specifically, we wanted the EBSA-285 to provide value-added services coincident to the offloaded TCP/IP protocol stack. For example, the TCP/IP stack on the EBSA-285 might be configured to support tasks such as web caching, encryption, firewall protection, traffic management, QOS support, and so on. 2.3.1 Integrating CiNIC with Microsoft Windows 2000 Integrating the CiNIC co-host into the Windows 2000 operating system environment is (not surprisingly) a non-trivial task. The CPNPRG’s original design, described in Peter Huang’s master’s thesis [22], recommends replacing the existing Winsock 2 15 DLL (WS2_32.dll) with two custom software packages: a CiNIC-specific Winsock 2 DLL and a host-CiNIC communication interface / device driver. 
Win2K Host CiNIC co-host User-layer Application Linux Device Driver Custom Winsock DLL ( ws2_32.dll ) EBSA-285 Communication Interface Linux Socket Calls TCP/IP Stack W2k Device Driver Host Communication Interface Linux OS Shared Memory NIC Driver Figure 2-9 Original Windows 2000 software architecture This is essentially a “brute force” approach in that it completely eliminates Windows’ own networking subsystem and replaces it with a CiNIC-specific subsystem. Clearly, this design is not feasible for the “real world” since it renders useless all non-CiNIC network devices – whether these be virtual devices (e.g., “loopback” devices) or real (e.g., Token Ring, X.25, etc.). Fortunately, the task of integrating the CiNIC into the Windows 2000 environment is solvable by other means. The Windows Sockets 2 Architecture (a.k.a., the Windows Open System Architecture [WOSA]) is Microsoft’s networking infrastructure for their 32-bit Windows operating systems. The WOSA defines a pair of service provider interfaces (SPI) at the bottom edge of Microsoft’s Winsock 2 DLL that developers can use to integrate vendor-defined transport and/or name space providers into the existing Win32 networking subsystem. 16 Figure 2-10 Microsoft Windows Open System Architecture [37] The WOSA gives user-layer applications simultaneous access to multiple service providers (i.e., protocol stacks and/or service providers) via Microsoft’s Winsock 2 DLL. The Win32 application simply asks the Winsock 2 DLL to enumerate the list of available service providers, and then the application connects itself to the desired provider. Microsoft’s Winsock 2 SDK also includes a Win32 applet called sporder.exe that exports a set of functions (from a DLL named sporder.dll) that device vendors can use to reorder the list of available service providers. Software installation tools can leverage the sporder.exe interface to programmatically reorder the available service providers to suit their device’s needs. Of course, end users can also run the sporder.exe applet to establish, for example, a particular TCP/IP protocol stack as the default TCP/IP provider if more than one such stack is present on the system. Taking the WOSA feature set into consideration, a “new and improved” design for the CiNIC software architecture would probably look something like Figure 2-11. Specifically, it would: a) leave Microsoft’s Winsock DLL in place, and, 17 b) create a CiNIC-specific transport service provider DLL that would interface with the Winsock 2 SPI, and c) use the sporder.exe applet and sporder.dll library to make the CiNIC transport service the default transport service provider. Host User-layer Application CiNIC co-host Microsoft’s Winsock 2 DLL ( ws2_32.dll ) CiNIC-specific TCP/IP trans port service DLL Host Communication Interface Linux Device Driver EBSA-285 Communication Interface Linux OS Linux Socket Calls TCP/IP Stack Shared Memory NIC Driver Figure 2-11 “WOSA friendly” Windows 2000 software architecture 2.3.2 Integrating CiNIC with Redhat Linux 7.x Figure 2-12 shows the pre-CiNIC architecture of a typical Redhat Linux 7.x host. In very general terms, when a user-layer application wishes to send (or receive) data over a network connection, it does so by calling the appropriate functions in a hostspecific (i.e., custom) “socket” library. The functions in the socket library call the appropriate functions in the host’s TCP/IP protocol stack, which then perform whatever processing is necessary on the user-defined payload. 
And finally, the TCP/IP stack utilizes the NIC's device driver routines to send the data out on the wire. Network receive operations are, in essence, performed in the reverse order.

Figure 2-12  Redhat Linux 7.x networking architecture (user-space application; kernel-space system calls, socket API, TCP/IP stack, NIC device driver; and the NIC)

With the CiNIC architecture, the host's own networking subsystem is effectively replaced by the networking subsystem on the parasitic co-host. In other words, the functional blocks comprising the shaded region of Figure 2-12 – i.e.,

• the low-level interface between the System Calls block and the top of the TCP/IP Stack block,
• the entire TCP/IP protocol stack,
• the NIC's device driver, and
• the NIC itself

are in essence transplanted from the host system to the EBSA-285 co-host, as shown in Figure 2-13.

Figure 2-13  Overview of the CiNIC architecture (host side: application, socket API, system calls, Protocol block, hostmem_p module; co-host side: ebsamem_p module, Protocol Handlers, service providers, system calls, TCP/IP stack, NIC device driver, NIC; the two sides are joined by the 21554)

Of course, a certain amount of "glue logic" is required to support such a transplant. Specifically, any host-side system call that formerly made calls into the host-side TCP/IP protocol stack must now be redesigned to communicate with the networking subsystem on the co-host, and vice versa. As shown in Figure 2-13, this logic is implemented as two new functional blocks on both the host and co-host sides – i.e., a Protocol block and an xxxmem_p module block.

The two xxxmem_p module blocks – i.e., hostmem_p on the host side and ebsamem_p on the co-host side – are a complementary pair of device drivers, designed and implemented by Mark McClelland [35], that establish a low-level communications pathway between the primary and secondary PCI busses via the Intel 21554 PCI-to-PCI nontransparent bridge. The host and co-host Protocol blocks were designed and implemented by Rob McCready [36]. These two Protocol blocks implement the networking data path as follows. The host-side Protocol block intercepts (a.k.a., "hijacks") all network-related system calls made by the host-side Socket API. The host-side Protocol block reroutes these system calls to its complementary block on the co-host side, the Protocol Handlers block, using the PCI-based communications channel established by Mark McClelland's hostmem_p and ebsamem_p modules. Note that in addition to interfacing with the underlying system calls on the co-host side, the Protocol Handlers block on the co-host also provides a mechanism for shimming user-defined service providers into the networking data path. The idea here is that these custom service providers will perform value-added processing on the data as it leaves and/or enters the system.

2.3.3 Service Provider Example: Co-host Web Caching

As an example of the service provider shim mentioned in the previous section, assume for the moment that someone wants to implement a web caching utility. On a non-CiNIC host, the architecture for a web caching utility might look something like Figure 2-14.

Figure 2-14  Example of a web caching proxy on a non-CiNIC host (Netscape and the web caching proxy in user space, connected through the loopback interface 127.0.0.1, with the NIC in kernel space)

With this design, a user manually configures her web browser – e.g., Netscape Navigator – to use a custom web caching proxy. (It is assumed that the web caching proxy is running on the same machine as the web browser software.)
In such a design, the web browser would probably establish a communications channel between itself and the proxy via some sort of loopback connection. Consequently, when the browser sends a web page request to the network it is intercepted by the web caching proxy before it reaches the network. The proxy checks to see if the requested web page is currently stored in the local system's web page cache. If it is, the proxy then checks whether the locally cached page is current or not. If the requested page is not present in the local cache, or a newer version of the requested web page exists, the web caching proxy retrieves the requested web page and stores it in the system's local cache. Finally, the proxy copies the web page from the local cache to the browser software, where it is then displayed. This design was, in fact, implemented by Jenny Huang in her Master's Thesis. Refer to [21] for the details of her implementation.

With the CiNIC architecture, the proxy server would be replaced with a web caching service provider on the co-host side. The service provider would be shimmed into the data path via the Protocol Handlers block as shown in Figure 2-13 above, and schematically would probably look something like Figure 2-15. As shown in Figure 2-15, HTTP requests from the host are demultiplexed from the host's outbound network traffic. The filtered-out HTTP requests are then passed to the web caching service provider, which performs the required cache lookup, cache refresh, cache update, or whatever. Note too that the output from the web caching service provider could theoretically be piped into yet another service provider – e.g., a content filter that blocks access to specific web sites (e.g., a "Net Nanny" type content filter for use in elementary schools).

Figure 2-15  CiNIC-based web caching service provider (host's outbound traffic, HTTP DMUX, web caching service with a level 1 cache on a RAM disk and a level 2 cache on a hard disk, MUX, non-HTTP traffic)

Chapter 3  Development / Build Environment

3.1 Overview of the Software Infrastructure

This chapter discusses the software infrastructure of the CiNIC project. Specifically, I discuss my work creating the i686-ARM and ARM-ARM versions of the GNU software development tool chains to facilitate systems-level and application-level software development for (and in) the ARM/Linux environment on the EBSA-285. I then describe my implementation of Russell King's ARM/Linux basic I/O subsystem [BIOS] that boots the ARM/Linux kernel on the EBSA-285. The chapter ends with a brief description of the Redhat 7.x / System V user environment I created on the EBSA-285.

3.2 Creating an i686-ARM Tool Chain

3.2.1 Caveat Emptor

Creating an i686-ARM (pronounced "i686 cross ARM") version of the GNU software development tool chain [GNU/SDTC] is supposed to be a relatively straightforward process – at least, that's the claim made by the folks that create the packages that comprise the GNU tool chain. My experiences with creating a functional i686-ARM tool chain were problematic at best – particularly when building the GNU C compiler, gcc, and the GNU C library, glibc. To give some idea of just how problematic this was for me, I worked full time for more than three months before I finally succeeded at building my first, fully functional, i686-ARM cross tool chain. Unfortunately, this seems to be the usual state of affairs for neophyte developers of the GNU tool chains.
So if you are feeling brave and want to try building a GNU software development tool chain for yourself, don’t be surprised when you run into major problems doing this; you are definitely not alone. With regard to building the GNU tool chains, I would like to pass along the following comments and suggestions. • Use only a high-end computer platform to build the sources. By “high-end” I mean a computer that has a very high speed CPU (preferably an SMP machine with multiple “very high speed CPUs”!), lots of RAM, and lots of free disk space. The CPNPRG purchased a high-end Dell Precision 420 workstation specifically for the task of building (and, rebuilding, and rebuilding, etc.) the GNU tool chains and Linux kernel. This machine has two 750 MHz Pentium-III CPUs, 512 MB of RAM, and two 18 GB hard drives and is perfect for the task at hand. Specifically, this particular Dell workstation is able to build the entire GNU tool chain in about one hour. Prior to having this Dell machine, the build process took more than 6 hours from start to finish. This is an altogether intolerable (and frustrating!) situation because a particular build sequence may run for a few hours before crashing, and depending on the reason for the crash you might need to restart the entire build from scratch.) • Do not hesitate to use the appropriate newsgroups and mailing lists to ask for help when you feel hopelessly stuck. The GNU home page (www.gnu.org) has links to the GNU C compiler and GNU C library home pages, and from there you can access many useful help resources (e.g., online documents, links to mailing lists, etc.). • Become familiar with a scripting language such as Bash or Perl. The steps one must take to create a complete tool chain are next to impossible to perform by typing the required commands by hand at a shell command prompt. It is much easier – and therefore more efficient – to write the required command sequences into a text file (using a text editor like vim or emacs) and then to have a scripting 25 language read those commands from the text file and execute them on your behalf. 3.2.2 Getting Started Before one can develop software on an ARM platform, one apparently needs an ARM-compatible software development tool chain – i.e., an ARM-compatible assembler, an ARM-compatible C compiler, an ARM-compatible linker / loader, and so on. Note, however, that the only way to build an ARM-compatible tool chain is to have an existing ARM-compatible tool chain! Fortunately, the GNU folks have solved this “chicken or the egg” conundrum by designing their tool chain so that one can build a “cross” tool chain that runs on a non-ARM host – e.g., an Intel i686 host – but produces executable programs that run on an ARM host. This cross tool chain, then, is used to create a “native” ARM-compatible tool chain – i.e., a tool chain that can be used directly within the ARM environment. Each software package in the GNU/SDTC contains a “read me” file named INSTALL that provides instructions for configuring, building, and installing the software in that particular package. (This file is typically located in the top-level directory of the package’s source tree.) So the starting point, when building these GNU software packages, is reading and understanding the information in the INSTALL file. Be aware, too, that the INSTALL file typically contains GNU-specific jargon that seems unintelligible to neophyte package builders. 
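To make the scripting suggestion above concrete, and to show the configure / build / install cycle that each package's INSTALL file describes, the fragment below sketches the general shape such a driver script can take for a single GNU package. It is only a minimal sketch: the script name, source path, and directory layout are hypothetical, and the package-specific configure arguments (such as those shown later in Figures 3-4 and 3-6) still have to be supplied for each package.

    #!/bin/bash
    # build-pkg.sh -- minimal driver for one GNU package (hypothetical name/layout)
    # Usage: ./build-pkg.sh <path-to-unpacked-source-tree>
    set -e                    # stop at the first command that fails

    SRC=$1                    # e.g., $HOME/src/some-gnu-package
    cd "$SRC"

    # 1. Configure.  Add the --build / --host / --target / --prefix options
    #    called for by this package's INSTALL file.
    ./configure

    # 2. Build the package.
    make

    # 3. Install into the directory named by --prefix at configure time.
    make install

Capturing these steps in one small script per package makes it painless to restart a build from scratch after a failure, which – as noted above – happens often.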
So to fully understand how to configure, build, and install the software in a GNU software package, you must also understand the idioms used by the GNU software developers. Three GNU-derived jargon terms one must understand before attempting to build a GNU software package – and in particular the packages that comprise the GNU/SDTC – are build machine, host machine, and target machine. These three terms are used at the very beginning of the build process during the software configuration phase. Note that if you incorrectly specify any of these machine types during the configuration phase, one of two outcomes will occur: a) the build process will fail, or b) the resulting executables will not work. The following descriptions, then, are intended as supplements to the descriptions found in the GNU documents:

• The build machine is the machine you are using "right now" to convert source code files into executable images (or libraries, etc.).

• The build machine creates executables that run on a particular host machine. Note that the host machine need not be the same machine type as the build machine type. For example, an i686 machine can create executable images (e.g., a C compiler, a game, a Linux kernel, etc.) for a PowerPC machine. So the build machine, in this case, is specified as i686, and the host machine is specified as PowerPC.

• Assume the build machine creates an executable image that is used on a specific host machine. Assume further that the executable image is some sort of source code translator. (For example, an i686 build machine creates a C compiler (e.g., GNU gcc) for a PowerPC host machine.) When the source code translator is used on the host machine, it produces executable images that run on a target machine. (For example: The build machine creates a C compiler that runs on a PowerPC host machine. When the C compiler runs on the PowerPC, it translates source code files into executable images that run on a StrongARM target machine.)

In addition to specifying the build, host, and target machine types, one must also specify, during the configuration process, the installation path(s) for the package that's being built. For example, after the build machine creates the executable images for the GNU C compiler, these executable images must be installed on the host machine. The directories in which these executable images are to be installed on the host machine must be specified during the configure process.

For additional information on the creation and installation of the software packages that comprise the GNU/SDTC, the Internet is an invaluable resource. The starting point for GNU-related software packages should probably be the GNU web site. A complete list of Linux-related "how to" documents can be found at the Linux Documentation Project and the Tucows websites, just to mention a few. There are also a number of Linux-related newsgroups on Usenet that provide peer-to-peer support for Linux users. A brief list of some of these online resources is provided in Figure 3-1.

    World Wide Web
        http://www.gnu.org/
        http://www.linuxdoc.org/
        http://howto.tucows.com/
    Usenet Newsgroups
        news://news.calpoly.edu/gnu.gcc.help
        news://news.calpoly.edu/comp.os.linux.development.apps

Figure 3-1  Some on-line help resources for Linux software developers

3.2.3 Creating the i686-ARM Development Tools with an i686 host

As shown in Figure 3-2, an i686 build machine will be used to create executable images of the GNU/SDTC (from the GNU/SDTC sources).
The resulting GNU/SDTC images will subsequently be used on an i686 host machine to generate executable images that run on a StrongARM target machine. In other words, the goal is to create an i686-ARM cross compiler – i.e., a compiler that runs on an i686 machine, but creates executable images that run on a StrongARM machine.

Figure 3-2 Using an i686 host to create the i686-ARM tool chain (Step #1: build = i686, host = i686, target = ARM)

The starting point in this process is the GNU binutils package. This package creates the “binary utilities” that are commonly found in any software development tool chain – e.g., a linker, an assembler, a librarian, etc. Note that you DO NOT want to install the i686-ARM tools in the same directories that currently contain the host’s versions of the GNU/SDTC. If you inadvertently overwrite the native i686-i686 tool chain with the i686-ARM tool chain, you will lose the ability to create programs that run on an i686 host! So during the configuration step (see the INSTALL file that accompanies the binutils package), be sure to specify an installation path that is outside the directory tree that currently holds the host’s own software development tool chain. I recommend creating a directory named “/i686-arm” in the root of the host’s file system, and then specifying this directory as the installation directory when configuring the software packages that comprise the GNU/SDTC. I also recommend setting the permissions on this directory to 777 – i.e., all users have read, write, and execute access to this directory, as shown in Figure 3-3. Without these permissions, one must be logged in as the super user “root” to perform the installation step, and installing the i686-ARM binaries as root is about the best way to really mess up the host system if, for some reason, the installation procedure decides to overwrite the host’s own tool chain. (Note that if you are not logged in as the super user root, you won’t have the necessary permissions to overwrite the system’s tool chain. So if you are not logged in as root, and you perform an installation step that tries to overwrite the host’s own tool chain, the install step will simply fail, and will not damage the host’s tool chain.)

NOTE The i686-ARM tool chain will eventually be used to create an ARM-ARM tool chain. This tool chain must be (temporarily) installed on the i686 host during the build process, so while we’re making directories, we might as well also create a directory called “/arm-arm” in which to install the ARM-ARM tool chain (see Figure 3-3).

$ su
Password: <root’s password here>
# mkdir -m 777 /i686-arm
# mkdir -m 777 /arm-arm
# exit
$

Figure 3-3 Creating the installation directory for the i686-ARM tool chain

Assuming a directory named “/i686-arm” has been created on the i686 host as shown in Figure 3-3, Figure 3-4 shows the applicable arguments one should use when configuring the binutils package.

% ./configure \
    --build=i686-pc-linux-gnu \
    --host=i686-pc-linux-gnu \
    --target=arm-ebsa285-linux-gnu \
    --prefix=/i686-arm/usr --exec-prefix=/i686-arm/usr ...

Figure 3-4 Configure options for the i686-ARM binutils and gcc packages

After configuring the binutils package, build and install the package according to the instructions in the INSTALL file. Following the creation and installation of the binutils package, the next step in the process is the creation of an i686-ARM C compiler.
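Before starting on gcc, it is worth a quick check that the cross binutils really landed under /i686-arm and answer to their prefixed names; a confusing failure later in the gcc build is often traceable to this step. The version strings shown below are illustrative only.

$ /i686-arm/usr/bin/arm-ebsa285-linux-gnu-as --version | head -1
GNU assembler 2.10.1
$ /i686-arm/usr/bin/arm-ebsa285-linux-gnu-ld --version | head -1
GNU ld version 2.10.1 (with BFD 2.10.1)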
After reading the INSTALL file that accompanies the GNU gcc source package, use the same configure settings shown in Figure 3-4 when configuring the GNU gcc package. After configuring, build and install the gcc sources as instructed by the gcc package’s INSTALL file. 3.2.4 Creating the i686-ARM C Library with an i686 host Creating an i686-ARM version of the GNU C compiler does not also create the standard C library. So the goal at this point is to use the i686-ARM cross compiler to build a C library that is compatible with the i686-ARM C compiler. The i686-ARM C compiler and i686-ARM standard C library can then be used in conjunction with the i686-ARM binutils programs to create, on the i686 host, executable programs that run on a StrongARM target machine. In order to create the i686-ARM C library, the i686-ARM C compiler must have access to the header files for an ARM/Linux kernel. This requirement may seem strange at first, but in fact it makes good sense. Note that these header files describe in complete detail the architecture of the underlying hardware. Without this information, it would be virtually impossible to create a C library that is compatible with the EBSA-285. Since the header files for the Linux kernel do not yet exist (or perhaps do exist but are in some incomplete state), they apparently must be created at this point. Fortunately, the task of building a Linux kernel and/or the kernel’s header 31 files does not depend on the existence of a standard C library. So the Linux kernel and its header files can be built even if a standard C library does not yet exist. NOTE The accompanying CD-ROM contains a Bash shell script and additional documentation that demonstrates how to use the i686-ARM C compiler to create the ARM/Linux kernel header files. After creating the ARM/Linux kernel header files, the files must be copied from the Linux kernel source tree to the /usr/include directory in the “/i686-arm” tree as shown in Figure 3-5. $ cp -a /<path_to_ARM/Linux_sources>/include/linux \ > /i686-arm/usr/include Figure 3-5 Copying the ARM/Linux headers into the i686-ARM directory tree Now that the i686-ARM tool chain has access to the ARM/Linux header files, one can create an i686-ARM compatible C library. Keep in mind that the i686-ARM C library must contain ARM object code – i.e., machine code that executes on an ARM CPU. The i686-ARM C library must not contain object code that executes on an i686 CPU. When performing the software configuration step for the GNU glibc C library package, then, the host machine type must be specified as ARM and not i686 (see Figure 3-6). And as always, do not forget to specify the “/i686-arm” directory as the root installation directory for the library! % ./configure \ --build=i686-pc-linux-gnu \ --host=arm-ebsa285-linux-gnu \ --target=arm-ebsa285-linux-gnu \ --prefix=/i686-arm/usr/ --exec-prefix=/i686-arm/usr ... Figure 3-6 Configure options for the GNU glibc C library 32 3.2.5 Creating the ARM-ARM Tool Chain on an i686 host At this point the /i686-arm directory contains an i686-ARM tool chain that can (at least in theory!) create executable images that run on an ARM processor in the Linux environment. Keep in mind that many of the ARM-ARM tool chain executables in the /i686-arm directory have the prefix “arm-ebsa285-linux-gnu-” attached to them – e.g., arm-ebsa285-linux-gnu-gcc, arm-ebsa285-linux-gnu-ld, etc. So if you want to build ARM executable images, you must use the arm-ebsa285-linux-gnu-* executables to do so. 
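At this point a one-file test program is the quickest way to confirm that the cross compiler, assembler, linker, and C library actually cooperate and produce ARM executables. The file(1) output shown below is abbreviated and illustrative.

$ cat > hello.c << 'EOF'
#include <stdio.h>
int main(void)
{
    printf("Hello from the StrongARM\n");
    return 0;
}
EOF
$ /i686-arm/usr/bin/arm-ebsa285-linux-gnu-gcc -o hello hello.c
$ file hello
hello: ELF 32-bit LSB executable, ARM, version 1, dynamically linked ...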
You should also put the /i686-arm/usr/bin directory in your shell’s PATH environment variable so that the cross tools can be found during the build process. Figure 3-7 shows how to do this if you are using the Bash shell.

$ PATH="/i686-arm/usr/bin:$PATH"

Figure 3-7 Adding the i686-ARM tool chain path to the ‘PATH’ environment variable

In practice, the i686-ARM tool chain can build most of the software packages one would find on a “typical” Linux host. There are, however, a few software packages that cannot be built with the i686-ARM tool chain (or with any cross tool chain, for that matter). These packages must be built on the same host that will ultimately execute them. A typical example is where a package’s build sequence creates one or more executable images that test the capabilities of the underlying system (e.g., checking for overflow errors during integer calculations). If this package were created with the i686-ARM cross compiler, the compiler would generate ARM executables (not i686 executables), and these ARM executables clearly will not run on the i686 host that’s performing the build. Such a package, then, must be built natively – i.e., the compiling and linking must be performed on the same platform that will ultimately execute the application program(s) that are being created.

The next logical step, then, is to create a “native” ARM-ARM tool chain – i.e., a tool chain that is installed on an ARM platform (not an i686 platform), that runs on an ARM platform (not on an i686 platform), and that creates executable images for an ARM platform.

Figure 3-8 Using an i686 host to create an ARM-ARM tool chain (Step #2: build = i686, host = ARM, target = ARM)

The native tool chain is initially created on an i686 host using the i686-ARM tool chain (see Figure 3-8). The i686-ARM-generated ARM-ARM tool chain is then installed on the EBSA-285, where it is used to rebuild the entire ARM-ARM tool chain. (See the documentation that comes with the GNU C compiler for complete information on this process. This documentation also describes a test procedure one can use to verify the correctness [to some degree] of the C compiler that is built using the i686-ARM-generated C compiler on the ARM platform.)

The steps one takes to build ARM-ARM versions of the GNU binutils, gcc, and glibc packages using the i686-ARM cross compiler are essentially the same as for the i686-ARM versions of these tools – i.e., the software is configured, built, and installed, in that order. The configure step is a bit different, though, because one now wants to create executables (on the i686 build machine) that will execute on an ARM host. Furthermore, the resulting ARM-ARM tool chain should not be installed in the same directory tree as the i686-ARM tool chain, as this would replace the i686-ARM tool chain with the ARM-ARM tool chain. The applicable settings when configuring for ARM-ARM versions of the GNU binutils, gcc, and glibc packages are shown in Figure 3-9.

% ./configure \
    --build=i686-pc-linux-gnu \
    --host=arm-ebsa285-linux-gnu \
    --target=arm-ebsa285-linux-gnu \
    --prefix=/arm-arm/usr --exec-prefix=/arm-arm/usr ...

Figure 3-9 Initial configure options when building the ARM-ARM tool chain

3.2.6 Copying the ARM-ARM Tool Chain to the EBSA-285

After building the ARM-ARM tool chain on the i686 host, the ARM-ARM tool chain will be installed on the i686 host in the “/arm-arm” subdirectory. The contents of this subdirectory must now be copied over to the EBSA-285’s file system.
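The copy described next assumes that the NFS server on pictor exports the EBSA-285’s file system to the i686 build host as well as to the EBSA-285 itself. The details of pictor’s NFS configuration are outside the scope of this chapter, but the relevant /etc/exports entry on pictor looks roughly like the following sketch (the host names and export options are illustrative):

/ebsa285fs    ebsa285(rw,no_root_squash)    fornax(rw,no_root_squash)

# exportfs -ra              (as root on pictor: re-read /etc/exports)
# showmount -e pictor       (verify that /ebsa285fs is being exported)

With the export in place, the copy itself proceeds as follows.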
The easiest way to do this is to create a subdirectory on the i686 host on which to mount the EBSA-285’s file system. Assuming the EBSA-285’s file system resides on a host named pictor, and assuming the NFS server on pictor is configured so that multiple hosts may mount the EBSA-285’s file system, create a directory named /ebsa285fs on the i686 host as shown in Figure 3-10. 35 $ su Password: <root’s password> # mkdir -m 777 /ebsa285fs Figure 3-10 Create a mount point on the i686 host for the EBSA-285’s file system While still logged in on the i686 host as the super user ‘root’, edit the file /etc/fstab and add the line shown in Figure 3-11 (if this line is not already in the file, of course). NOTE The text in Figure 3-11 is actually one long line, and not two separate lines as shown. So when typing the text into the /etc/fstab file on the i686 host, be sure to type all of the text in Figure 3-11 on a single line. Also, there is no space between the comma that follows the word noauto and the beginning of the word users on the second line – i.e., this sequence should be typed as ‘…,exec,noauto,users,async,…’. pictor:/ebsa285fs /ebsa285fs nfs rw,suid,dev,exec,noauto, users,async,rsize=8192,wsize=8192,nfsvers=2,hard 0 0 Figure 3-11 /etc/fstab entry for /ebsa285fs mount point on i686 host While still logged in on the i686 host as the super user ‘root’, mount the EBSA-285’s file system on the i686 host and copy the ARM-ARM tool chain from the i686 host to the EBSA-285’s file system as shown in Figure 3-12. # mount /ebsa285fs # cp -a /arm-arm/* /ebsa285fs/ # cp -a /arm-arm/.[^.]* /ebsa285fs/ Figure 3-12 Copy the ARM-ARM tool chain to the EBSA-285’s file system The final step in the installation process is the creation of some symbolic links on the EBSA-285’s file system. Note that some packages hard code the name of the installation directory into the executable programs the package creates (e.g., the GNU 36 gcc package does this). Since the install directory was specified as /arm-arm/<whatever> during the configure step, the next step is to: a) create a directory named /arm-arm in the root of the EBSA-285’s file system, and b) create some appropriate symlinks within that directory, as shown in Figure 3-27. # # # # > > > > # cd /ebsa285fs if [ ! -d arm-arm ]; then mkdir arm-arm; fi cd arm-arm for item in $(ls ..); do if [ -d "../${item}" ]; then ln -s ../${item} ${item} fi done rm -f arm-arm Figure 3-13 Create symlinks after installing from i686 host to EBSA-285’s file system Now that the ARM-ARM tool chain is installed on the EBSA-285’s file system, you probably need to log on to the EBSA-285 as the super user ‘root’ (e.g., via a telnet session) and use the ldconfig command to update the necessary links and cache files for the GNU run-time (a.k.a., “dynamic”) linker (see Figure 3-14). [jfischer@fornax jfischer]$ telnet ebsa285 Trying 192.168.24.96... Connected to ebsa285. Escape character is '^]'. Cal Poly Network Performance Research Group Kernel 2.4.3-rmk2 on an armv4l ebsa285 login: root Password: [root@ebsa285: /root]$ ldconfig Figure 3-14 Using the ldconfig command to update the EBSA-285’s dynamic linker If the ldconfig command does not yet exist on the EBSA-285, create it on the i686 host using the i686-ARM cross compiler, and then copy the ldconfig executable image from the i686 host to the /usr/bin/ directory on the EBSA-285’s file system (see Figure 3-15). 
37 # # # # # # cd <path_to_ldconfig_sources> ./configure <etc...> make mount /ebsa285fs cp -a ./ldconfig /ebsa285fs/usr/bin/ umount /ebsa285fs Figure 3-15 Create ldconfig on the i686 host and install it on the EBSA-285 3.2.7 Rebuild the ARM-ARM Tool Chain on the EBSA-285 Believe it or not, the last step in the creation of the ARM-ARM tool chain is to rebuild the entire tool chain using the i686-ARM-generated ARM-ARM tool chain (i.e., using the tool chain that is currently installed in the “/arm-arm/usr/bin” subdirectory on the EBSA-285’s file system (see Figure 3-16). ARM→ARM Sources 3 Step #3 B: ARM H: ARM T: ARM EBSA-285 Figure 3-16 Recreation of the native ARM-ARM tool chain Figure 3-17 shows the applicable configuration settings for building the GNU binutils, gcc, and glibc packages on the ARM with the i686-ARM-generated ARMARM tool chain. Note that the build, target, and host machines are all three specified as ARM machines in this case. Also note that the install step should install the ARMARM binaries in the /usr/bin/ subdirectory this time, not in the /arm-arm/usr/bin/ subdirectory. 38 % ./configure \ --build=arm-ebsa285-linux-gnu \ --host=arm-ebsa285-linux-gnu \ --target=arm-ebsa285-linux-gnu \ --prefix=/usr --exec-prefix=/usr ... Figure 3-17 Configure options when rebuilding the ARM-ARM tool chain 3.3 3.3.1 System Environment EBSA-285 BIOS When the EBSA-285 ships from the factory, the first two banks of its Flash ROM contain the Angel boot / monitor / debugging utility (bank 0) and a self-test utility (bank 1). The Angel debugger is designed for use with ARM Ltd.’s Software Development Toolkit [ARM/SDT] and works quite well in this capacity. Unfortunately, the Angel boot loader is not designed to function as a boot loader for the Linux operating system. To remedy this situation, Russell King (“the” ARM/Linux guru) created a custom, Linux-based BIOS for the EBSA-285 that is “Linux kernel friendly.” King’s BIOS can either replace the Angel boot loader outright, or it can be installed along side Angel in the EBSA’s Flash ROM. The ARM/Linux BIOS [A/L-BIOS] is open source and available from King’s FTP site [30]. The documentation that accompanies the A/L-BIOS sources describes how to configure and build the A/L-BIOS for your specific needs. Since the A/L-BIOS executes on the EBSA-285, the BIOS’s sources (apparently) must be compiled with the i686-ARM cross compiler or the native ARM-ARM compiler. Before building the A/L-BIOS you must edit the Makefile and specify the Flash ROM bank number that you will be loading the A/L-BIOS into. And, of course, after you create the A/L-BIOS image you must load the image into the Flash ROM 39 bank number specified in the Makefile. If you don’t do this, the A/L-BIOS will not work. The documentation that accompanies the sources discusses this requirement (and others) in detail. After creating an executable image of the A/L-BIOS, the next step is to store this image in the EBSA-285’s Flash ROM. Prepare the EBSA-285 for this procedure by performing the following steps: q If the EBSA is powered on, remove its power source and wait a few seconds for the power supplies to completely die down. q Remove the EBSA from its host and lay the board down on a staticfree surface. q Record the positions of jumpers J9, J10, and J15 pins 4-5-6 (see Figure 3-18 for the locations of these three jumpers). q Reconfigure the EBSA’s jumpers as shown in Figure 3-18. 
Specifically, the EBSA-285’s must be configured for “add-in” card mode (jumpers on J9 and J10 pins 1-2), and the SA-110 microprocessor must be held in “blank programming mode” by placing a jumper on J15 pins 5-6 [24]. J15 (5-6) J9 J10 Figure 3-18 EBSA-285 jumper positions for programming the Flash ROM [24] 40 After configuring the EBSA’s jumpers as shown in Figure 3-18, make certain the MS-DOS host is turned off. If the MS-DOS host is currently turned on, turn it off and wait a few seconds for the power supplies to die off. Install the EBSA-285 in any available PCI slot in the MS-DOS host machine and then turn on the DOS host. Make certain the host boots the MS-DOS operating system. NOTE The Flash Management Utility [FMU] software will not work if the host machine boots a non-MS-DOS operating system (e.g., Linux or Microsoft Windows). NOTE The FMU software is installed when the ARM/SDT is installed. So you may need to install the ARM/SDT software in order to have the FMU software available to you. When the DOS prompt appears, navigate to the directory where the FMU software is installed. Note that the FMU application is a 32-bit, protected mode DOS program, and not a 16-bit “real mode” DOS program. Therefore, you must invoke the DOS/4GW 32-bit DOS protected mode emulation utility dos4gw.exe before running the FMU application. For example: C:\FMU> dos4gw fmu [enter] // {blah blah text here...} FMU> The dos4gw command starts the 32-bit protected mode DOS emulator, which in turn launches the FMU application program, FMU.EXE. When the “FMU>” command prompt appears, you can get a listing of the FMU command set by typing a question mark ‘?’ and pressing the ENTER key. To learn how each of FMU commands is used, refer to the chapter titled “Flash Management Utility” (chapter 7) in [24]. A brief description of some of these 41 commands is provided in the next few paragraphs. The FMU’s List command lists the Flash ROM image numbers that currently have programs in them (and, by omission, it also identifies the images that currently do not have programs in them). If you want to replace an existing image with a new image, you must first delete the existing image with the FMU’s “Delete <image-number>” command. The FMU’s Program command is explained in [24], but the following example shows how one might use this command. This example replaces the contents of image #5 with a BIOS image that is stored on a floppy disk: FMU> delete 5 [enter] Do you really want to do this (y/N)? yes Deleting flash blocks: 5 Scanning Flash blocks for usage FMU> program 5 ARM/Linux-BIOS A:\bios 5 Writing A:\bios into flash block 5 // yadda yadda, lotsa output here... ####################################### Scanning Flash blocks for usage [enter] FMU> When the “FMU>” command prompt reappears, the programming stop is complete. Exit the FMU by issuing the Quit command at the “FMU>” command prompt. Turn off the MS-DOS host and wait a few seconds for the power supplies to die out. Remove the EBSA-285 from the DOS host, lay it down on a static-free surface, and return the EBSA-285’s jumpers to their original positions. Before reinstalling the EBSA-285 in its original host, be sure to set the EBSA’s Flash Image Selector switch (see Figure 3-19) to the image number that 42 holds the new BIOS. For example, if the new BIOS was programmed into image #5, set the EBSA’s Image Selector Switch so that it points at the number 5 (see below). 
Flash ROM Layout 0 Angel Debugger 1 Diagnostics 2 < unused > 3 < unused > 4 < unused > 5 ARM/Linux BIOS . .. C < unused > D < unused > E < unused > F < unused > Figure 3-19 EBSA-285 Flash ROM configuration 3.3.2 Booting Linux on the EBSA-285 When the A/L-BIOS boots on the EBSA-285, it performs a power-on system test [POST]. During the POST, the A/L-BIOS tries to determine whether any “bootworthy” devices are connected to the system. In particular, the A/L-BIOS checks to see if an IDE hard drive is connected to the EBSA-285. If a hard drive is indeed connected, the A/L-BIOS tries to load and boot a Linux kernel from the hard drive. If this fails, the A/L-BIOS then tries to locate a NIC. If a NIC is found, the A/L-BIOS tries using the BOOTP and TFTP protocols to download a Linux kernel via the network. If this fails, the BIOS attempts to download a Linux kernel via the EBSA285’s serial port. If this fails, the A/L-BIOS panics and everything comes to a screeching halt. 43 When Eric Engstrom began the task of porting Linux to the EBSA-285 [11], he utilized the A/L-BIOS’s ability to download a Linux kernel via the EBSA’s serial port as shown in Figure 3-20. Using this configuration, he was able to get his first “test” kernels up and running on the EBSA-285. Actually, Engrstom’s first test kernels were not Linux kernels at all; they were simple little 3-line “hello world” programs. Nevertheless, the A/L-BIOS dutifully downloaded and launched these programs as if they were “the real thing.” Of course, once these test programs ran to completion they would simply terminate and the EBSA would hang. Regardless, this was indeed a major milestone for Engstrom and the entire CiNIC project. CiNIC v0.1a EBSA-285 Hello, World! 21554 Host PC Hydra RS-232C Serial Connection Fornax Figure 3-20 The platform used by Eric Engstrom to port Linux onto the EBSA-285 3.3.3 Linux Bootstrap Sequence It is a well-known fact that the Linux operating system will not successfully bootstrap itself if it does not have access to a “root” file system. So the next step in the evolutionary process of porting Linux to the EBSA-285 was creating the EBSA’s root file system, and, more importantly, making this file system accessible to the EBSA285. To do this, Engstrom and I leveraged information we found in [31], [39], [52], and [54] to set up a diskless “NFS/root” file system for the EBSA-285. In a nutshell, we: 44 • Attached a 3Com 3c905C NIC to the secondary PCI bus (i.e., to the 21554 evaluation board) and then connected the NIC to the lab’s Ethernet LAN. • Created a “bare bones” root file system for the EBSA in a subdirectory of the root file system owned by the host machine pictor • Installed and configured the Network File System [NFS] protocol on pictor so that the EBSA’s could access its root file system on pictor via an Ethernet LAN. • Installed and configured the BOOTP [54] and TFTP [52] protocols on pictor. (Support for these protocols is already built into the A/L-BIOS.) With these protocols in place, the A/L-BIOS could ask “who am I?” and receive its IP4 address in reply. The A/L-BIOS can now use the TFTP protocol to download a Linux kernel from pictor (see Figure 3-21 below). • Configured and built an ARM/Linux kernel with NFS/root file system support built into it, and installed this kernel image on pictor in accordance with [31], [39], [52], and [54]. Pictor 100 Mbps Fast Ethernet EBSA-285 21554 3c905C 1 • DNS name • IP4 address • Host PC BIOS’s bootp request ( Who am I ? 
) 2 /tftproot/kernel tftp download request: 3 “get /tftproot/kernel” 4 EBSA285 Linux kernel Figure 3-21 ARM/Linux BIOS’s bootp sequence 45 After the BIOS downloads the Linux kernel from the host machine pictor, it decompresses the kernel image and then hands control of the EBSA-285 over to the Linux kernel. From this point on, the A/L-BIOS no longer controls the EBSA-285. As the Linux kernel comes to life on the EBSA-285, it eventually reaches a point where it must mount and use its root file system. But before this can happen, the kernel must first determine the EBSA-285’s IP4 address. It does this by sending out a BOOTP (i.e., “who am I?”) request. The BOOTP server pictor hears this request and sends back two pieces of information: the EBSA’s IP4 address and the name of the directory (relative to pictor’s own file system) that contains the EBSA’s root file system (see Figure 3-22 below). With these items of information in hand, the Linux kernel sends an NFS mount request back to pictor, asking it to mount the specified directory (“/ebsa285fs” in this case). If pictor complies with this mount request (and it generally does), the EBSA/Linux kernel has its root file system and can continue its bootstrap sequence. Otherwise the EBSA/Linux kernel panics and halts. 46 Pictor 100 Mbps Fast Ethernet EBSA-285 21554 3c905C bootp request ( Who am I ? ) 1 • DNS name • IP4 address • /ebsa285fs Host PC 2 nfs mount request: 3 mount pictor:/ebsa285fs Time Line 100 Mbps Fast Ethernet EBSA-285 21554 EBSA285 3c905C Host PC Figure 3-22 Linux kernel bootp sequence 3.3.4 Linux Operating System Characteristics A complete description of the evolution of the Linux operating system environment on the EBSA-285 is provided in Eric Engstrom’s senior project report [11]. At the time of this writing, the operating system environment on the EBSA-285 is (more-orless) the same System V setup used by Redhat Linux 7.x distributions. The Linux kernel is explicitly configured to support the creation and use of multiple RAM disks to provide “disk-like” support for Jenny Huang’s web caching research [21]. 47 3.3.5 Synchronizing the System Time with the Current Local Time As is often the case with small, single-board computer systems, the EBSA- 285 does not have an on-board real-time clock chip (e.g., such as the Intel 82801AA LPC Interface Controller – a CMOS chip with battery backup – that maintains, among other things, the current local time in many desktop PCs). So when Linux boots on the EBSA-285 it default initializes its system clock to the “zero epoch” time of 12:00:00.0 midnight, January 1, 1970. Keep in mind that that Linux’s system clock is a software pseudo-clock that Linux creates and manages, whose time value resides in system RAM. Consequently, whenever the EBSA-285 reboots, the current system time value is lost. If Linux is unable to synchronize its system clock with the current local time, it cannot properly manage some of its most important subsystems – particularly its file systems. Furthermore, a number of application programs (e.g., the GNU make utility) do not function properly when the system time is not properly synchronized with the current local time. So in order for Linux to function correctly on the EBSA285, its system clock must somehow be synchronized with the current local time during the boot sequence. 
One way to accomplish this is to log on to the EBSA-285 as the system administrator – i.e., as the super-user ‘root’ – and manually synchronize the system time with the current local time via the date command. This should be done immediately after Linux boots on the EBSA-285, before any other use of the system, in order to minimize the possibility of corruption within the EBSA-285’s time-sensitive subsystems. Of course, manually setting the system clock like this is a somewhat risky solution to the clock synchronization issue because it assumes the 48 system administrator will always be on hand immediately after Linux boots. If this is not the case and a system- or application-layer process happens to use a timedependent resource before the system time is synchronized to the current local time, the EBSA-285 can (and typically will) become unstable. An example of an application-layer process that is highly dependent on the current system time is the make utility. If the current system time is not coordinated with the time stamps that appear on the underlying file system, the make utility will exhibit a variety of failure modes – e.g., failing to recompile stale source files, linking stale object modules with new object modules, etc. Another example of an application-layer process that requires some degree of accuracy in the system clock is the Concurrent Versioning System (CVS) utility. This utility is used by a group of programmers who collectively work on (i.e., make modifications to) the source files that comprise a given software package. If a user happens to use the CVS utility check out, or check back in, or edit, etc., before the system time is synchronized with the current local time, the entire CVS database can become hopelessly corrupted (i.e., the CVS database for that particular software package would probably need to be recreated from scratch). Assuming the system clock is somehow synchronized with the current local time during the boot sequence, the system clock’s long-term stability now becomes an issue of concern. Because the system clock is a software-based clock it does not exhibit the same degree of long-term stability and/or accuracy as a hardware-based real-time clock. Consequently, the system clock time always drifts away from the current local time after a synchronization operation. Furthermore, the system clock’s 49 drift rate is heavily influenced by the amount of hardware and/or software interrupt activity on the Linux host, so the magnitude of the drift rate varies over time. Because the system time inevitably drifts away from the current local time, the system administrator must periodically re-execute the date command (or some similar command) to bring the system clock back into synchronization with the current local time. Since manual resynchronization of the system clock is a tedious and error prone procedure, a preferred solution would be one that automatically obtains the current local time from an external time reference and then resynchronizes the local system time to the time value obtained from the external reference. Fortunately, a number of time synchronization protocols exist that perform exactly this function. Of these protocols, perhaps the simplest is the RFC 868 Time Protocol [42]. A host configured as an RFC-868 Time Protocol server responds to incoming service requests by sending back a 32-bit, RFC-defined value that represents the number of seconds that have elapsed since January 1, 1900. 
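Because Unix systems count time from January 1, 1970 rather than 1900, an RFC-868 client must subtract the offset between the two epochs (2,208,988,800 seconds, a constant given in RFC 868 itself) before the returned value can be used as a Unix time. A quick check of the arithmetic, using a server value invented for the example:

$ rfc868=3200342400                     # example value returned by a time server
$ echo $(( rfc868 - 2208988800 ))       # equivalent Unix time (seconds since 1970)
991353600

The result, 991353600, corresponds to 00:00:00 UTC on June 1, 2001.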
The requestor – i.e., the RFC-868 client – can then use this value to set its internal clock(s). On Linux hosts, the rdate command is an RFC 868-based utility that obtains the current local time from a remote RFC-868 Time Protocol server via a TCP/IPbased network connection. The rdate command is easily integrated into Linux’s boot scripts (e.g., the /etc/rc.d/rc.sysinit script) for the purpose of automatically synchronizing the system time with the current local time during the boot sequence. The task of periodically resynchronizing the system clock with the current local time can be accomplished with the help of Linux’s cron utility. The files 50 /sbin/syncsysclock and /etc/cron.hourly/local.cron on the accompanying CD-ROM show how boot-time and periodic synchronization is performed on our EBSA-285. NOTE The clockdiff utility can be used to observe the amount of drift in the EBSA’s system clock over time, relative to other Linux hosts, with a resolution of 1 millisecond (0.001 second). For additional information on network-based time synchronization and/or system timing issues, see [38] and [41]. 3.4 User Environment The evolution of the initial user environment on the EBSA-285 is documented in Eric Engstrom’s senior project report [11]. Engstrom’s initial user environment was intentionally minimal, providing only enough functionality to support the most basic tasks – e.g., logging on to the EBSA-285, listing the contents of a directory, changing to a new directory, and so on and not much else. There were no fullfeatured command shells, no editors, no file system utilities, etc.; these were yet to be created and installed on the EBSA-285. Since I was the one with the most experience using the i686-ARM cross development tool chain, I began the long, tedious process of building and installing the software packages that comprise a “typical” Linux user environment on the EBSA-285 – e.g., command shells, file system utilities, system configuration files, networking support, etc. For the most part, I used the source packages that came with the Redhat 7.0 and 7.1 distributions when creating the EBSA-285’s user environment. Apparently, then, the EBSA-285’s user environment is essentially the same “System V” user environment found on Redhat 7.x workstations (with the possible exception that there is currently no support for the X-windows environment on the EBSA-285). 51 3.4.1 Building applications with the i686-ARM cross tool chain Initially, I used the i686-ARM cross development tool chain to build the applications and utilities one typically finds on a Redhat 7.x workstation. The steps required to use the i686-ARM tool chain are not always obvious, and experimentation is generally needed before one succeeds at building a software package with it. In some cases, one must edit a package’s Makefile(s) before the build process will work. As is the case when building the GNU software development tool chain, one must be extremely careful when specifying the build and install options during the package configuration phase. In my experience, the use of command shell environment variables helps minimize the possibility of incorrectly specifying these options. Figure 3-23 shows some of the bash environment variables I used when configuring software packages on the i686 host, before building and installing them on the i686 host with the i686-ARM cross tool chain. 
export arm_arm_root="/arm-arm" export common_options="\ --build=i686-pc-linux-gnu \ --host=arm-ebsa285-linux-gnu \ --target=arm-ebsa285-linux-gnu \ --sysconfdir=$(arm_arm_root)/etc \ --sharedstatedir=$(arm_arm_root)/usr/share \ --localstatedir=$(arm_arm_root)/var \ --mandir=$(arm_arm_root)/usr/share/man \ --infodir=$(arm_arm_root)/usr/share/info \ " export prefix_bin="\ --prefix=$(arm_arm_root)/bin \ --sbindir=$(arm_arm_root)/sbin \ --exec-prefix=$(arm_arm_root)/bin \ " export prefix_usr="\ --prefix=$(arm_arm_root)/usr \ --sbindir=$(arm_arm_root)/sbin \ --exec-prefix=$(arm_arm_root)/usr \ " Figure 3-23 Bash environment variables for common i686-ARM configure options 52 The arm_arm_root environment variable specifies the root directory on the i686 host where the resulting ARM-compatible executables will be installed (i.e., after they are created with the i686-ARM cross tool chain). Note that the contents of this directory tree will eventually be copied en masse over to the EBSA-285’s file system on Pictor. WARNING If you choose to use the environment variables shown in Figure 3-23, make sure the arm_arm_root environment variable is initialized as shown. Do not omit the definition for the arm_arm_root environment variable, and do not define it as the root directory ‘/’. Doing either of these will potentially wipe out the i686 applications of the same name on the build host, rendering the i686 build host inoperable in the process! The common_options environment variable shown in Figure 3-23 specifies the build and install options that are more-or-less common to all of the packages being built with the i686-ARM cross tool chain. The prefix_bin environment variable shown in Figure 3-23 specifies the build and install options that are common to source packages that create application programs that are typically used during the Linux boot sequence. See the /sbin and /bin directories to find out which programs are generally required during the Linux kernel boot sequence. For example, Figure 3-24 shows how the common_options and prefix_bin environment variables are used when configuring the software package that creates the /sbin/insmod program – a program that is typically used during the Linux boot sequence to load kernel-loadable modules. 53 % ./configure \ $(common_options) \ $(prefix_bin) \ ... # other package options here Figure 3-24 Configure step for apps that are used during the Linux boot sequence The prefix_usr environment variable shown in Figure 3-23 specifies the build and install options that are common to source packages that create programs that typically are not required during the Linux boot sequence (e.g., the emacs editor). For example, Figure 3-25 shows how the common_options and prefix_usr environment variables are used when configuring the software package that creates the /usr/bin/emacs text editor program – a program that is typically not used during the Linux boot sequence. % ./configure \ $(common_options) \ $(prefix_usr) \ ... # other package options here Figure 3-25 Configure step for apps that are not used during the Linux boot sequence 3.4.2 Installing from the i686 host to the EBSA-285’s file system After building these packages with the cross development tool chain, the packages must be temporarily installed on the i686 host. Assuming one uses the same bash environment variables shown in Figure 3-23, performing the ‘make install’ step on the i686 host installs the ARM executables in the /arm-arm subdirectory on the i686 host. 
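Assuming the variables from Figure 3-23 are defined in the current shell, one complete pass for a single package then looks roughly like the sketch below. GNU sed is used purely as an example of a utility that is not needed during the boot sequence, and its version number is a placeholder; note that ${name} is the Bash syntax that expands a variable on the command line.

$ tar xzf sed-3.02.tar.gz
$ cd sed-3.02
$ ./configure ${common_options} ${prefix_usr}
$ make
$ make install                          # installs under /arm-arm/usr on the i686 host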
Following the i686 install step, the contents of the /arm-arm directory must be copied to the EBSA-285’s file system. As mentioned in §3.2.6, the easiest way to 54 do this is to use the NFS feature to mount the EBSA-285’s file system on the i686 host, and then simply copy everything from the /arm-arm directory to the /ebsa285fs directory as shown in Figure 3-26. # mount /ebsa285fs # cp -a /arm-arm/* /ebsa285fs/ # cp -a /arm-arm/.[^.]* /ebsa285fs/ Figure 3-26 Copying the /arm-arm directory to the EBSA-285’s root directory The final step in the installation process is (once again) the creation of some symbolic links on the EBSA-285’s file system. As mentioned in §3.2.6, some packages hard code the name of the installation directory into the executable programs the package creates. Since the install directory was specified as /arm-arm/<whatever>, some programs will not work correctly if they are not installed in this specific directory tree on the EBSA-285’s file system. The quick and easy fix for problem is to create a directory named /arm-arm in the root of the EBSA-285’s file system and create some appropriate symlinks within that directory, as shown in Figure 3-27. % % % % > > % cd /ebsa285fs mkdir arm-arm cd arm-arm for item in $(ls ..); do [ -d "../${item}" ] && ln -s ../${item} ${item} done rm -f arm-arm Figure 3-27 Create symlinks after installing from i686 host to EBSA-285’s file system 3.4.3 Rebuilding everything with the ARM-ARM tool chain Eventually, everything that is built with the i686-ARM cross tool chain should be rebuilt natively on the EBSA-285 platform using the native ARM-ARM tool 55 chain. This should be done as soon as possible to maximize the stability of the runtime applications on the EBSA-285 platform. Note that cross compiling sometimes yields ARM executables that are less stable than ARM executables that are built natively. So I recommend using the i686-ARM cross compiler to build only what is absolutely necessary in §3.4.2 – i.e., the utilities that are commonly used when building software packages (e.g., the GNU make utility, the Perl scripting language, the GNU sed and awk utilities, the GNU fileutils package, etc.). When building a software package on the EBSA-285 with the native ARMARM tool chain, one uses essentially the same configure options that were used when building the same package on the i686 host with the i686-ARM tool chain. The only difference is that the arm_arm_root environment variable is now undefined (or defined to the empty string “”) as shown in Figure 3-28. 
export arm_arm_root=""
export common_options="\
  --build=i686-pc-linux-gnu \
  --host=arm-ebsa285-linux-gnu \
  --target=arm-ebsa285-linux-gnu \
  --sysconfdir=$(arm_arm_root)/etc \
  --sharedstatedir=$(arm_arm_root)/usr/share \
  --localstatedir=$(arm_arm_root)/var \
  --mandir=$(arm_arm_root)/usr/share/man \
  --infodir=$(arm_arm_root)/usr/share/info \
"
export prefix_bin="\
  --prefix=$(arm_arm_root)/bin \
  --sbindir=$(arm_arm_root)/sbin \
  --exec-prefix=$(arm_arm_root)/bin \
"
export prefix_usr="\
  --prefix=$(arm_arm_root)/usr \
  --sbindir=$(arm_arm_root)/sbin \
  --exec-prefix=$(arm_arm_root)/usr \
"

Figure 3-28 Bash environment variables for the ARM-ARM configure options

Chapter 4 Instrumentation

When developing a new hardware / software architecture, a natural question that arises is “what are the performance characteristics of the new architecture?” The answers to this question are particularly important if the new architecture represents an evolutionary or revolutionary change to some preexisting architecture. Such is the case with the CiNIC Project. This chapter describes the performance instrumentation techniques I created, with the assistance of Ron J. Ma [33], to gather latency data from the CiNIC architecture during network send operations. There are two implementations of this instrumentation technique: “stand-alone” and “CiNIC.” The stand-alone implementation gathers performance data from the host system by itself, or from the EBSA-285 co-host system by itself. The stand-alone system does not gather performance data from the combined host/co-host CiNIC platform. The CiNIC instrumentation technique, on the other hand, gathers performance data only from the combined host/co-host CiNIC platform. Samples of the performance data obtained with these two instrumentation techniques are provided in Chapter 5.

4.1 Background

Historically, the CPNPRG has relied on techniques such as time stamping and profiling to harvest performance data from a particular hardware / software platform at runtime. For example, Maurico Sanchez used these techniques to glean CPU utilization and throughput performance data from his I2O development efforts [49]. Sanchez also wanted to obtain latency data for the I2O platform, but was ultimately unable to do so, primarily because of restrictions that were imposed on him by the Novell NetWare environment he was using. This difficulty was another example of the problems that spawned the idea of using a logic analyzer to passively gather latency data from a host system via its PCI bus, an idea that was ultimately implemented with great success – albeit on platforms other than Sanchez’s I2O platform [12], [32], [57]. An overview of this technique is provided in §4.3.

Following the conclusion of Sanchez’s I2O development work, the CPNPRG obtained the source code for an experimental Win32 IPv6 protocol stack from Microsoft Research (a.k.a., “msripv6”). Peter Xie instrumented this code with software time stamps to determine its performance characteristics [56], and I subsequently used the logic analysis measurement method described in [12] to obtain latency data for the same msripv6 protocol stack. Unfortunately, Xie and I were gathering our performance data from two different machines, and the performance characteristics we were measuring were also quite different. Consequently, the correlation between his measurement data and mine was not particularly good.
To remedy this situation, Bo Wu and I set about the task of integrating Xie’s work with mine so that we could simultaneously harvest performance data from the msripv6 protocol stack via: a) the software timestamps Xie and Wu had placed in the msripv6 stack, and b) the logic analysis measurement method described in [12]. This approach imposed a tight correlation on the time stamp data gathered by the instrumented msripv6 stack, the timestamp data gathered by the test application, and the latency times observed by the logic analyzer on the host’s PCI bus. The combined data from 58 this experiment revealed some interesting characteristics of the msripv6 stack that had previously gone undetected [12]. The integration of the host and co-host platforms into the CiNIC architecture marks the current evolution of the logic analysis / systems approach to gathering latency data from a host’s PCI bus in response to network send operations at the application layer. Ron J. Ma and I worked together to extend the logic analyzer’s measurement capability to gather latency data from two different PCI busses – i.e., the primary PCI bus on the host system, and the secondary PCI bus on the co-host – in response to network send transactions from the host system. This work was performed in parallel with the development efforts of Rob McCready and Mark McClellend (§2.3.2), who at the time were still working on the host/co-host communication infrastructure. Because this infrastructure was only partially available to us, Ma and I had to devise a proverbial “plan B” host/co-host communications protocol we could use in the interim to send data from the host system down to the EBSA-285 co-host and out to the NIC ([33] and §4.4.2). This simple protocol allowed Ma and I to meet the goal of extending the logic analyzer’s measurement capability to gather latency data from two different PCI busses – i.e., the primary PCI bus on the host system, and the secondary PCI bus on the co-host – in response to network send transactions from the host system. 4.2 Performance Metrics At present, the CPNPRG is currently interested in determining latency, throughput, and CPU utilization characteristics for the pre- and post-CiNIC architectures. In this thesis, I am interested in obtaining PCI-level wire entry and exit latencies for: 59 • The host system Hydra as a “stand-alone” platform (i.e., before its TCP/IP stack processing is off-loaded onto the EBSA-285 co-host), and, • The EBSA-285 co-host as a “stand-alone” platform (i.e., the EBSA-285 is operating independently of the host system, and plays no part in the host system’s networking transactions), and, • The combined host / co-host CiNIC architecture that off-loads the host system’s TCP/IP stack processing onto the EBSA-285 co-host. 4.2.1 Framework for Host Performance Metrics Before one begins the task of obtaining data for some particular performance characteristic, one must first define fundamental concepts such as ‘metrics’ and ‘measurement methodologies’ that allow us to speak clearly about measurement issues [41]. These concepts must be clearly defined for each performance characteristic being measured, and must include some discussion of the measurement uncertainties and errors involved in the quantity being measured and the proposed measurement process. Clearly, these definitions must be derived from standardized metrics and measurement methodologies to obtain results that are meaningful to the scientific community at large. 
In the realm of IP performance metrics, Vern Paxson’s “Framework for IP Performance Metrics” (RFC 2330) is the de facto framework for implementing performance metrics for IP-related traffic on the Internet at large. Specifically, Paxson’s document is: “[a] memo that attempts to define a general framework for particular metrics to be developed by the IETF’s IP Performance Metrics effort, begun by the Benchmarking Methodology Working Group (BMWG) of the Operational Requirements Area, and being continued by the IP Performance Metrics Working Group (IPPM) of the Transport Area” [41]. 60 While this memorandum deals specifically with IP performance issues for the Internet at large, we (the CPNPRG) feel Paxson’s measurement framework (or portions thereof) are also useful for gathering performance data within an endpoint node. To this end, the CPNPRG has adopted Vern Paxson’s “Framework for IP Performance Metrics” (RFC 2330) as a reference document for the measurement methodologies employed by the group. 4.3 Stand-alone Performance Instrumentation This section describes the instrumentation technique used to measure the “stand alone” performance characteristics of either the host machine hydra, or the EBSA285 co-host. This instrumentation technique is not used to gather performance data concurrently from the combined host / co-host CiNIC architecture. (Section 4.4 below discusses the instrumentation technique used with the CiNIC architecture.) NOTE In section 4.3, the term “system under test” [SUT] refers to either the host machine hydra, or the EBSA-285 co-host. It does not refer to the combined host / co-host CiNIC architecture. 4.3.1 Theory of Operation The instrumentation configuration shown in Figure 4-1 below is essentially the same configuration described in [12], with the exception that there is no longer a connection between the host computer’s parallel port and the logic analyzer. A “new-andimproved” PCI-based triggering mechanism is now being used to trigger the logic analyzer in lieu of the parallel port triggering mechanism described in [12]. 61 System Under Test [ SUT ] PCI Bus NIC HP 16700A Logic Analyzer LAN Figure 4-1 “Stand-alone” host performance instrumentation To quickly review the measurement technique described in [12], a test application, running on the SUT, allocates storage for a “payload buffer” of some desired size. The test application then initializes the buffer’s contents with a “well formed” payload value. The specific content of the payload buffer after initialization depends upon the transport protocol being used. For TCP transactions, the test application initializes the payload buffer with the content shown in Figure 4-2. 0 AAAAAAAAAAAA ******** EEEEEEEEEEEE TCP / UDP payload Payload buffer Figure 4-2 Contents of the payload buffer for stand-alone measurements The TCP-based payload buffer always begins with a contiguous sequence of twelve “A” characters (a.k.a., the “start marker” characters), and it always ends with a contiguous sequence of twelve “E” characters (a.k.a., the “stop marker” characters). The bytes in the middle of the payload buffer – i.e., between the start and stop marker characters – are initialized with asterisk characters ‘*’. 
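As a concrete illustration, the layout of a 200-byte TCP payload buffer can be sketched with a couple of shell commands. The real test application builds the buffer in C; this sketch is only meant to make the byte counts explicit.

$ payload=$(printf 'A%.0s' {1..12}; printf '*%.0s' {1..176}; printf 'E%.0s' {1..12})
$ echo -n "$payload" | wc -c            # 12 + 176 + 12 = 200 bytes
200
$ echo "$payload" | cut -c1-16          # the buffer begins with the 12-byte start marker
AAAAAAAAAAAA****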
This arrangement ensures: a) the start marker characters are the first characters of the TCP payload to reach the 62 PCI bus (and therefore the network media), and b) the stop marker characters are the last characters of the TCP payload to reach the PCI bus, in response to the test application sending the payload buffer contents to a remote host, via a TCP connection, using the SUT’s networking subsystem. For UDP-based send transactions in the Linux environment, initialization of the payload buffer requires some special processing to ensure the start marker characters are the first characters of the UDP payload to appear on the PCI bus, and the stop marker characters are the last characters of the UDP payload to appear on the PCI bus. Section 4.3.2 below explains in detail why this special processing is required on Linux hosts, and shows where to place the start and stop markers within the UDP payload buffer. The goal here is to arrange the contents of the UDP payload buffer so that the start and stop marker characters arrive on the PCI bus with the same sequence shown in Figure 4-2, in response to the test application sending the contents of the payload buffer to a remote host, via a UDP connection, using the SUT’s networking subsystem. Referring now to Figure 4-3 below, the newly developed PCI-based logic analyzer triggering mechanism works as follows. A test application, running on the SUT, writes a known 32-bit value to an unimplemented base address register [BAR]. This unimplemented BAR can belong to any PCI function that is connected to the SUT’s PCI bus. For this project, the value 0xFEEDFACE is written to the 3rd BAR on a 3Com 3c905C NIC. The logic analyzer’s triggering subsystem is manually configured to detect this write transaction on the PCI bus, and the analyzer uses this 63 event as its “time = 0 seconds” reference time t0 for all subsequent PCI transactions it observes on the PCI bus (see Figure 4-3). System Under Test [ SUT ] PCI Bus 0xFEEDFACE t0 BAR NIC HP 16700A Logic Analyzer Figure 4-3 Write to unimplemented BAR on 3c905C NIC corresponds to time t 0 From this point on, the instrumentation technique used to gather performance data from the SUT is virtually the same as described in [12]. So immediately after the test application triggers the logic analyzer by writing a known value to an unimplemented BAR, the test application sends the contents of the payload buffer to a remote host’s discard port ([43], [47]) via the network. The payload is sent using socket calls that are native to the operating system environment that’s controlling the SUT. After a bit of processing within the SUT’s TCP/IP protocol stack, the payload buffer’s start marker characters reach the PCI bus on their way to the NIC (see Figure 4-4). The analyzer detects the arrival of these characters on the PCI bus and records this event as the “wire arrival time” t1 (formerly referred to as the “start marker” time in [12]). In other words, time t1 represents the processing latency associated with moving the first bytes of the payload buffer from the application layer, down through the SUT’s networking subsystem, onto the PCI bus, and out to the NIC. 64 System Under Test [ SUT ] PCI Bus “AAAA” t NIC 0 t HP 16700A Logic Analyzer 1 LAN Figure 4-4 Wire arrival time t 1 In a similar manner, when the payload buffer’s stop marker sequence arrives at the PCI bus (see Figure 4-5), the analyzer detects and records this event as the “wire exit time” t2 (formerly referred to as the “stop marker” time in [12]). 
Time t2, then, represents the processing latency associated with moving the last bytes of the payload buffer from the application layer, down through the SUT’s networking subsystem, onto the PCI bus, and out to the NIC. System Under Test [ SUT ] PCI Bus “EEEE” t NIC t HP 16700A Logic Analyzer LAN 0 t 1 2 Figure 4-5 Wire exit time t 2 After the logic analyzer records the wire entry and exit times t1 and t2, the test application (on the SUT) downloads these values from the analyzer for postprocessing purposes. In general, the test application repeats the aforementioned steps multiple times for a given payload size. For example, the test application might: a) allocate a buffer for a 200-byte payload, and b) initialize the buffer’s contents for a TCP connection 65 (see Figure 4-2), and c) send the buffer’s contents a total of thirty times to a remote host’s discard port via a TCP connection. Each of these send operations generates a {t1, t2} data pair, and the resulting collection of thirty {t1,t2} data pairs comprise a sample set of latency times for a 200-byte TCP payload. The test application sorts the t1 and t2 values in the sample set and determines the minimum, maximum, and median values for these latency times. The resulting min, max, and median t1 and t2 times are then written to a text file in such a way that the file’s contents have a “spreadsheet-friendly” input format (see Figure 4-6). For example, importing the results files into Microsoft Excel and then plotting the median latency times for t1 and t2 created the X-Y scatter plots found in Chapter 5. After the test application collects and processes a sample set of latency times for a given payload size, it can (if programmed to do so) perform the aforementioned steps again on different payload sizes. Protocol Server Trials Start size Stop size Step size Bad data Timeouts Date/Time File name OS/Version Size 512 1024 1536 2048 2560 . . . : : : : : : : : : : : TCP/IPv4 Send -- Blocking 192.168.24.65 30 512 65536 512 0 4 Mon Apr 16 13:07:39 2001 LT4SB_512-65536-512_2001-04-16_13-07-39.dat Red Hat Linux release 7.0 (Guinness) [2.4.3-ipv6] |-------- t1 --------| min median max 20.024 21.916 24.136 23.816 24.608 25.472 24.840 27.088 29.272 25.912 27.112 28.848 26.072 27.080 28.224 |-------- t2 --------| min median max 23.848 25.728 27.952 31.496 32.268 33.136 43.008 45.236 47.440 53.888 55.216 85.624 91.512 92.816 93.784 Figure 4-6 “Spreadsheet friendly” contents of test results file 66 4.3.2 IPv4 Fragmentation of UDP Datagrams in Linux When the Linux kernel’s IP4 sub-layer receives a UDP datagram, it checks to see whether the resulting IP datagram will fit within a single network frame. (Recall that the maximum transmission unit [MTU] of the underlying network media determines the size of a network frame.) If the resulting IP datagram will indeed fit within a single network frame, no further processing is required by the IP4 sub-layer and the datagram is sent on its way. On the other hand, if the size of a UDP datagram is such that the resulting IP datagram will exceed the capacity of a single network frame, the IP4 sub-layer must fragment the UDP datagram into a collection of smaller datagrams whose sizes are compatible with the MTU of the underlying network media. When fragmenting a “large” UDP datagram into multiple IP datagrams, Linux’s IP4 sub-layer fragments the UDP datagram from front to back as shown in Figure 4-7. 
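The fragmentation behavior described in this section is easy to observe directly: running tcpdump on the SUT, or on any host on the same LAN segment, with a filter that matches only fragmented IP packets shows each large UDP send leaving the machine as a train of fragments, in whatever order the IP4 sub-layer emits them. A sketch (the interface name is illustrative):

# match any IP packet whose fragment offset is non-zero or whose
# More Fragments bit is set, i.e., every fragment of a fragmented datagram
$ tcpdump -n -i eth0 'ip[6:2] & 0x3fff != 0'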
4.3.2 IPv4 Fragmentation of UDP Datagrams in Linux

When the Linux kernel’s IP4 sub-layer receives a UDP datagram, it checks to see whether the resulting IP datagram will fit within a single network frame. (Recall that the maximum transmission unit [MTU] of the underlying network media determines the size of a network frame.) If the resulting IP datagram will indeed fit within a single network frame, no further processing is required by the IP4 sub-layer and the datagram is sent on its way. On the other hand, if the size of a UDP datagram is such that the resulting IP datagram will exceed the capacity of a single network frame, the IP4 sub-layer must fragment the UDP datagram into a collection of smaller datagrams whose sizes are compatible with the MTU of the underlying network media. When fragmenting a “large” UDP datagram into multiple IP datagrams, Linux’s IP4 sub-layer fragments the UDP datagram from front to back as shown in Figure 4-7. Depending on the length of the UDP datagram, the size of the last IP datagram (IP Datagram n in Figure 4-7) will be less than or equal to the MTU of the underlying network media. The remaining IP datagrams – i.e., IP Datagram 1 through IP Datagram n-1 – are all the same size, and that size is determined directly by the MTU of the underlying network media.

Figure 4-7 IP4 fragmentation of “large” UDP datagram

After fragmenting the UDP datagram into a collection of network-compatible IP datagrams, we have observed that the IP4 sub-layer sends the IP datagrams to the network in “back-to-front” fashion – i.e., IP Datagram n is the first to be sent, followed by IP Datagram n-1 and so on, so that the last IP datagram sent to the network is IP Datagram 1. For UDP payloads, then, the start and stop marker character sequences should be positioned at the beginning of IP Datagram n, and at the end of IP Datagram 1, respectively (see Figure 4-8 below). This placement scheme will ensure the start marker characters are the first characters of the UDP payload to appear on the PCI bus, as a result of the IP4 sub-layer sending IP Datagram n to the network first. Likewise, placing the stop marker at the end of IP Datagram 1 will ensure the stop marker characters are the last characters to reach the PCI bus, as a result of the IP4 sub-layer sending IP Datagram 1 to the network last.

Figure 4-8 IP4 datagram send sequence for UDP datagrams > 1 MTU
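The placement rule amounts to a small offset calculation. The sketch below is a simplified illustration under stated assumptions, not the thesis's buffer-initialization code: it assumes 12-byte markers, that the first fragment carries the 8-byte UDP header plus data, and that every fragment except the last carries the MTU minus the 20-byte IP header, rounded down to a multiple of 8 as IPv4 fragmentation requires. The small-final-fragment special case discussed next is deliberately omitted.

/*
 * Illustrative sketch (not the thesis's code) of the UDP marker-placement
 * rule of section 4.3.2: put the 12-byte start marker at the beginning of
 * the *last* IP fragment and the 12-byte stop marker at the end of the
 * *first* IP fragment.
 */
#include <stddef.h>
#include <string.h>

#define MARKER_LEN 12
#define UDP_HDR    8

/* Bytes of the UDP datagram carried by each non-final fragment (assumed). */
static size_t frag_data_bytes(size_t mtu)
{
    return ((mtu - 20) / 8) * 8;
}

/* Place the markers inside a UDP payload buffer of 'len' bytes. */
static void place_markers(char *payload, size_t len, size_t mtu)
{
    size_t frag_data = frag_data_bytes(mtu);
    size_t dgram_len = len + UDP_HDR;                /* UDP header + payload */

    if (dgram_len <= frag_data) {                    /* no fragmentation */
        memset(payload, 'A', MARKER_LEN);
        memset(payload + len - MARKER_LEN, 'E', MARKER_LEN);
        return;
    }

    /* Offset (within the payload buffer) of the last fragment's first byte. */
    size_t last_frag_off = ((dgram_len - 1) / frag_data) * frag_data - UDP_HDR;

    memset(payload + last_frag_off, 'A', MARKER_LEN);             /* start marker */
    memset(payload + (frag_data - UDP_HDR) - MARKER_LEN,          /* stop marker  */
           'E', MARKER_LEN);
}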
Having established a general-purpose algorithm for placing the start and stop markers within a large UDP payload, one must now check for special cases. The most obvious special case is where the last IP datagram (IP Datagram n in Figure 4-8) is too small to hold the entire 12-character start marker sequence. For this case, I chose to place the start marker at the beginning of the previous IP datagram – i.e., IP Datagram n-1 – rather than trying to split the start marker across two separate IP datagrams. The reasoning behind this design decision is as follows.

The start marker is a contiguous sequence of twelve ASCII “A” characters – i.e., AAAAAAAAAAAA. (For a detailed description of why the start [and stop] marker is 12 bytes long, see [12].) Assume for the moment that the length of IP Datagram n’s payload section is exactly eleven bytes. These eleven bytes will typically be grouped together into three 32-bit “DWORD” values before being sent across the PCI bus to the NIC. (Note that one of the three 32-bit DWORDs will have an unused byte in it, as there are eleven bytes-worth of payload being transferred to the NIC, and a total of twelve bytes are available within the three DWORDs.) Assuming these three DWORDs are encapsulated within a “typical” IP datagram with a header length of 20 octets (i.e., 5 DWORDs), the entire IP datagram (header + payload) occupies a total of 31 bytes, or 8 DWORDs. So when the payload section of IP Datagram n is no greater than eleven bytes, a maximum of eight DWORDs will be transferred across the PCI bus when sending this datagram to the NIC.

On PCI bus systems that support burst-mode memory write transactions – and this is the case for both the host machine Hydra and the EBSA-285 co-host – these eight DWORDs are written to the NIC using a burst-mode memory write transaction. Such a transaction is comprised of a single PCI address phase followed immediately by a contiguous sequence of PCI data phases.

Figure 4-9 PCI bus write burst of IP Datagram n where |IP Datagram n| < |start marker|

Let t1 represent the time required to transfer IP Datagram n across the PCI bus to the NIC during a burst-mode write transaction. As shown below, time t1 is approximately 273 nanoseconds:

    t1 = (1 address phase + 8 DWORDs × 1 data phase / DWORD) × 30.3 ns / phase ≈ 273 ns        [4.1]

Recalling that IP Datagram n does not contain any start marker characters when the length of IP Datagram n is less than the length of the start marker, the logic analyzer will not detect the write transaction that sends IP Datagram n to the NIC.

Figure 4-10 Logic analyzer payload “miss”

This “payload miss” by the logic analyzer introduces a slight measurement error into the UDP payload transfer latency measurements that is inversely proportional to the size of the UDP datagram. The worst-case error, therefore, occurs when a UDP datagram is fragmented into two IP datagrams, and IP Datagram 2 contains fewer than |start marker| characters (see Figure 4-10). Since the underlying network media is in our case an Ethernet LAN, IP Datagram 1 contains exactly 1,492 octets (or 373 DWORDs). Assuming the entire contents of IP Datagram 1 are sent to the NIC in a single PCI burst-mode write transaction, the time t2 required to send IP Datagram 1 to the NIC is approximately 11.3 microseconds:

    t2 = (1 address phase + 373 DWORDs × 1 data phase / DWORD) × 30.3 ns / phase ≈ 11.3 µs      [4.2]

The time interval recorded by the logic analyzer is then taken to be the time required to send the entire UDP datagram to the NIC, even though it is really only the time required to send the IP Datagram 1 component – i.e., the recorded interval does not include the time spent sending the IP Datagram 2 component of the UDP datagram to the NIC.

    % error = t1 / (t1 + t2) × 100 = 242 ns / (242 ns + 11.3 µs) × 100 ≈ 2.1 %                  [4.3]

For our purposes, this is an acceptable upper bound on this measurement error. Of course, this measurement error can be avoided altogether by choosing sizes for the UDP payloads such that the resulting IP datagrams have the length of an MTU. Therefore, we do not incur any measurement error due to this artifact of IP and UDP payload size.
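The burst-transaction estimates in equations 4.1 through 4.3 can be reproduced with a few lines of arithmetic. The sketch below simply restates those equations (one 30.3 ns phase per address or data transfer on a 33 MHz PCI bus, wait states ignored); note that equation 4.3 uses 242 ns for t1, so the percentage printed here (about 2.4 %) comes out slightly larger than the 2.1 % quoted in the text.

/* Restates the burst-mode timing estimates of equations 4.1 - 4.3. */
#include <stdio.h>

#define PCI_CLK_NS 30.3   /* one phase on a 33 MHz PCI bus */

static double burst_ns(unsigned dwords)
{
    return (1 + dwords) * PCI_CLK_NS;       /* address phase + data phases */
}

int main(void)
{
    double t1 = burst_ns(8);                /* 8-DWORD final fragment, eq. 4.1   */
    double t2 = burst_ns(373);              /* 373-DWORD first fragment, eq. 4.2 */

    printf("t1 = %.0f ns\n", t1);           /* ~273 ns */
    printf("t2 = %.1f us\n", t2 / 1000.0);  /* ~11.3 us */
    printf("worst-case error = %.1f %%\n",  /* ~2.4 %; eq. 4.3 quotes 2.1 %
                                               using 242 ns for t1 */
           100.0 * t1 / (t1 + t2));
    return 0;
}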
4.4 Host / Co-host Combined Platform Performance Instrumentation

The instrumentation used to gather latency data from the combined host / co-host CiNIC platform is shown schematically in Figure 4-11 below. The data path between the host system and the NIC on the co-host system is logically broken down into three sub-paths. The first data flow sub-path is between the host computer system and the Intel 21554 PCI-to-PCI Nontransparent Bridge [21554] – i.e., the outbound data flow across the primary PCI bus. The second data flow sub-path is from the 21554 to the system RAM on the EBSA-285 co-host – i.e., the inbound data flow on the secondary PCI bus. The third and final data flow sub-path is between the EBSA-285 and the NIC on the secondary PCI bus – i.e., the outbound data flow on the secondary PCI bus. The logic analyzer’s job, in this case, is to gather latency data from these three data flow sub-paths.

NOTE: In section 4.4, the term “system under test” [SUT] refers to the combined host / co-host CiNIC architecture.

Figure 4-11 CiNIC architecture instrumentation configuration (host computer system, primary PCI, 21554, secondary PCI, EBSA-285 [21285, SA-110], NIC, LAN, HP 16700A logic analyzer)

4.4.1 Theory of Operation

A test application, running on the host system, allocates storage for a “payload buffer” of some desired size. The test application then initializes the buffer’s contents with essentially the same “well formed” payload as described in section 4.3, with the exception that the TCP or UDP payload content is now encapsulated within a pair of 4-character “P” and “Q” sequences as shown in Figure 4-12.

    PPPP  AAAAAAAAAAAA  ********  EEEEEEEEEEEE  QQQQ      (TCP / UDP payload within the payload buffer)

Figure 4-12 Well-defined data payload for the combined platform

This new payload arrangement is required because we have observed that UDP transactions under Linux are formatted as described in §4.3.2 above.
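A sketch of how such a buffer might be filled is shown below. It is only an illustration of the layout in Figure 4-12, not the project's test application; the 4-byte “P”/“Q” wrappers and 12-byte “A”/“E” markers follow the figure, the filler byte '*' is assumed, and for large UDP payloads the fragment-aware marker placement of §4.3.2 would still apply.

/*
 * Illustration of the combined-platform payload layout of Figure 4-12
 * (not the thesis's test application): the TCP/UDP payload is wrapped in
 * 4-byte "P" and "Q" sequences, with the 12-byte start ("A") and stop
 * ("E") markers bracketing the filler bytes.
 */
#include <stdlib.h>
#include <string.h>

#define WRAP_LEN   4      /* "PPPP" and "QQQQ"              */
#define MARKER_LEN 12     /* "AAAAAAAAAAAA" / "EEEEEEEEEEEE" */

/* Allocate and fill a payload buffer of 'len' bytes; caller frees it. */
static char *make_payload(size_t len)
{
    if (len < 2 * (WRAP_LEN + MARKER_LEN))
        return NULL;

    char *buf = malloc(len);
    if (buf == NULL)
        return NULL;

    memset(buf, '*', len);                                      /* filler        */
    memset(buf, 'P', WRAP_LEN);                                 /* leading PPPP  */
    memset(buf + WRAP_LEN, 'A', MARKER_LEN);                    /* start marker  */
    memset(buf + len - WRAP_LEN - MARKER_LEN, 'E', MARKER_LEN); /* stop marker   */
    memset(buf + len - WRAP_LEN, 'Q', WRAP_LEN);                /* trailing QQQQ */
    return buf;
}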
After initializing the payload buffer, the test application writes a known 32-bit value to an unimplemented base address register [BAR] on the primary PCI bus. For this project, the value 0xFEEDFACE is written to the 3rd BAR on a 3Com 3c905C NIC connected to the primary PCI bus.

NOTE: This NIC corresponds to the generic “PCI” device block shown connected to the primary PCI bus in Figure 4-13. The 3Com 3c905C NIC on the primary PCI bus is not used to send data to, or receive data from, the network. This NIC’s sole purpose in this case is to provide an unimplemented BAR for the aforementioned BAR write transaction.

The logic analyzer’s triggering subsystem is manually configured to detect this BAR write transaction when it occurs on the primary PCI bus, and the analyzer uses this event as the “time = 0 seconds” reference times t1,0 and t2,0 for all subsequent PCI transactions on both the primary and secondary PCI busses (see Figure 4-13).

Figure 4-13 Write to unimplemented BAR triggers the logic analyzer

Immediately after the test application triggers the logic analyzer by writing a known value to an unimplemented BAR on the primary PCI bus, the test application sends the contents of the payload buffer to a remote host’s discard port via the network [43], [47]. The payload is sent from the test application on the host using socket calls that are native to the operating system environment that’s controlling the host machine. More specifically, the socket API “seen” by the test application on the host is virtually the same API for both the CiNIC and stand-alone architectures. Of course, the networking subsystem within the operating system proper will generally need to be modified in some (perhaps radical) way in order to support the CiNIC architecture (e.g., see [36]).

After a bit of processing within the host operating system (in response to the network send operation by the test application), the four “P” characters at the head of the payload buffer eventually reach the primary PCI bus on their way to the 21554 (see Figure 4-14). The analyzer detects the arrival of these characters on the primary PCI bus and records this event as the “wire arrival time” t1,1 for the primary PCI bus. In other words, time t1,1 represents the processing latency associated with moving the first bytes of the payload buffer from the application layer, down through the host’s operating system, and onto the primary PCI bus.

Figure 4-14 Wire arrival time t1,1 on primary PCI bus

When the payload buffer’s leading edge characters reach the 21554 bridge, the 21554 stores these characters within its 256-byte downstream posted write buffer [23]. As soon as it is able, the 21554 dumps the contents of its posted write buffer onto the secondary PCI bus. At this point, the four “P” characters at the head of the payload buffer reach the secondary PCI bus on their way to the system RAM on the EBSA-285 (via the 21285) as shown in Figure 4-15. The analyzer detects the arrival of these characters on the secondary PCI bus and records this event as the “wire arrival time” t2,1 for the secondary PCI bus. In other words, time t2,1 represents the data processing latency introduced by the 21554 bridge as it moves the first bytes of the payload buffer from the primary PCI bus onto the secondary PCI bus.

Figure 4-15 Wire arrival time t2,1 on secondary PCI bus

After a bit more processing within the host operating system (in response to the aforementioned network send operation by the test application), the four “Q” characters at the end of the payload buffer eventually reach the primary PCI bus on their way to the 21554 (see Figure 4-16). The analyzer detects the arrival of these characters on the primary PCI bus and records this event as the “wire exit time” t1,2 for the primary PCI bus. In other words, time t1,2 represents the processing latency associated with moving the last bytes of the payload buffer from the application layer, down through the host’s operating system, and onto the primary PCI bus.

Figure 4-16 Wire exit time t1,2 on primary PCI bus

When the payload buffer’s trailing edge characters reach the 21554 bridge, the 21554 again stores these characters within its 256-byte downstream posted write buffer [23]. As soon as it is able, the 21554 dumps the contents of its posted write buffer onto the secondary PCI bus. At this point, the four “Q” characters at the tail end of the payload buffer reach the secondary PCI bus on their way to the system RAM on the EBSA-285 (via the 21285) as shown in Figure 4-17. The analyzer detects the arrival of these characters on the secondary PCI bus and records this event as the “wire exit time” t2,2 for the secondary PCI bus. In other words, time t2,2 represents the data processing latency introduced by the 21554 bridge as it moves the last bytes of the payload buffer from the primary PCI bus onto the secondary PCI bus.
Figure 4-17 Wire exit time t2,2 on secondary PCI bus

After some processing by the Linux operating system on the co-host, the twelve “A” characters near the head of the payload buffer eventually reach the secondary PCI bus on their way to the NIC (see Figure 4-18). The analyzer detects the arrival of these characters on the secondary PCI bus and records this event as the “wire arrival time” t2,3 on the secondary PCI bus. In other words, time t2,3 represents the processing latency introduced by the TCP/IP stack on the co-host as it moves the first bytes of the payload buffer from the system RAM on the EBSA-285 to the NIC on the secondary PCI bus.

Figure 4-18 Wire arrival time t2,3 on secondary PCI bus

After a bit more processing within the Linux operating system on the co-host, the twelve “E” characters near the end of the payload buffer eventually reach the secondary PCI bus on their way to the NIC (see Figure 4-19). The analyzer detects the arrival of these characters on the secondary PCI bus and records this event as the “wire exit time” t2,4 on the secondary PCI bus. In other words, time t2,4 represents the processing latency introduced by the TCP/IP stack on the co-host as it moves the last bytes of the payload buffer from the system RAM on the EBSA-285 to the NIC on the secondary PCI bus.

Figure 4-19 Wire exit time t2,4 on secondary PCI bus

Figure 4-20 Data path time line (BAR write “trigger” at t1,0 / t2,0; the host writes the data payload to the 21554 [t1,1 ... t1,2 on the primary PCI bus]; the 21554 writes the data payload to shared memory via the 21285 [t2,1 ... t2,2 on the secondary PCI bus]; a socket call on the EBSA sends the data payload from shared memory to the NIC [t2,3 ... t2,4])
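For post-processing, the seven timestamps above reduce to a handful of interval calculations; these are the intervals plotted in Chapter 5. The sketch below is only an illustrative summary, not the project's analysis code, and assumes all times are in microseconds relative to the BAR-write trigger.

/*
 * Illustrative summary of the combined-platform timestamps and the derived
 * intervals plotted in Chapter 5 (not the project's analysis code).
 */
typedef struct {
    double t1_0, t1_1, t1_2;          /* primary PCI: trigger, PPPP, QQQQ   */
    double t2_0, t2_1, t2_2;          /* secondary PCI: trigger, PPPP, QQQQ */
    double t2_3, t2_4;                /* secondary PCI: AAAA, EEEE (to NIC) */
} cinic_sample;

typedef struct {
    double host_entry;    /* T1 = t1_1 - t1_0: host app/OS, leading edge    */
    double bridge_entry;  /* T2 = t2_1 - t1_1: 21554 latency, leading edge  */
    double cohost_entry;  /* T3 = t2_3 - t2_1: EBSA app/OS, leading edge    */
    double host_exit;     /*      t1_2 - t1_0: host app/OS, trailing edge   */
    double bridge_exit;   /*      t2_2 - t1_2: 21554 latency, trailing edge */
    double cohost_exit;   /*      t2_4 - t2_2: EBSA app/OS, trailing edge   */
} cinic_intervals;

static cinic_intervals derive_intervals(const cinic_sample *s)
{
    cinic_intervals v;
    v.host_entry   = s->t1_1 - s->t1_0;
    v.bridge_entry = s->t2_1 - s->t1_1;
    v.cohost_entry = s->t2_3 - s->t2_1;
    v.host_exit    = s->t1_2 - s->t1_0;
    v.bridge_exit  = s->t2_2 - s->t1_2;
    v.cohost_exit  = s->t2_4 - s->t2_2;
    return v;
}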
4.4.2 CiNIC Architecture Polling Protocol Performance Testbed

As mentioned in §4.1, Rob McCready and Mark McClelland (§2.3.2) had not yet completed the host/co-host communication protocol when Ma and I began working on the new logic analysis measurement tool (i.e., the tool that would gather latency data from both the primary and secondary PCI busses). Consequently, Ma and I had to devise a “plan B” communications protocol that would allow us to send data from the host system down to the EBSA-285 co-host and then out to the NIC on the secondary PCI bus. This protocol is basically a stop-and-wait protocol that is built around a set of state bits that the host and co-host sides manipulate and poll to determine the current state of the protocol (e.g., the host has finished writing data down to the co-host; the co-host has finished sending the data to the NIC, etc.). Ma’s senior project report [33] describes this protocol in detail, so only a brief description of the polling protocol testbed is provided here.

Figure 4-21 below is a schematic representation of the hardware / software architecture of the polling testbed for the CiNIC architecture. The three data paths from Figure 4-11 are also shown on Figure 4-21 along with a fourth data path 0 (zero). In a nutshell, the client-2.1 application triggers the logic analyzer by writing to an unimplemented BAR on a PCI device that’s connected to the primary PCI bus (path 0). Immediately thereafter, the client-2.1 application copies a block of data down to the shared memory on the EBSA-285 (paths 1 and 2). The testapp program on the EBSA-285 detects the data block’s arrival and copies it from shared memory out to the NIC via a socket send operation (path 3).

Figure 4-21 CiNIC Polling protocol performance testbed (host side: client-2.1 application, pcitrig.o and hostmem_p.o kernel modules, /dev/hostmem, PCI device with unimplemented BAR; 21554 and HP 16700A on the PCI busses; co-host side: 21285, shared memory, ebsamem_p.o kernel module, /dev/ebsamem, testapp, socket API, TCP/IP stack, NIC device driver, NIC)
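The state bits themselves are defined in Ma's report [33]; purely as an illustration of the stop-and-wait idea, a handshake over two flags in the shared memory region might look like the sketch below. The flag names, offsets, and layout are assumptions (not the project's actual protocol), and real code across two bridges would also have to worry about write ordering and caching.

/*
 * Generic stop-and-wait polling handshake over shared memory, given only as
 * an illustration of the idea described above; the real protocol and its
 * state bits are defined in Ma's report [33].
 */
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

#define DATA_READY 0   /* host -> co-host: "a new payload is in shared memory" */
#define DATA_SENT  1   /* co-host -> host: "payload has been sent to the NIC"  */

struct shared_region {
    volatile uint32_t flag[2];
    uint8_t           payload[64 * 1024];
};

/* Host side: copy one payload down and wait for the co-host to finish. */
static void host_send(struct shared_region *shm, const void *buf, size_t len)
{
    memcpy(shm->payload, buf, len);
    shm->flag[DATA_SENT]  = 0;
    shm->flag[DATA_READY] = 1;              /* hand the payload to the co-host */
    while (shm->flag[DATA_SENT] == 0)
        ;                                   /* poll until the co-host is done  */
}

/* Co-host side: wait for a payload, push it out the NIC, then acknowledge. */
static void cohost_loop(struct shared_region *shm, int sock, size_t len)
{
    for (;;) {
        while (shm->flag[DATA_READY] == 0)
            ;                               /* poll for the next payload */
        shm->flag[DATA_READY] = 0;
        send(sock, shm->payload, len, 0);   /* corresponds to data path 3 */
        shm->flag[DATA_SENT] = 1;           /* tell the host we are finished */
    }
}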
4.4.3 CiNIC Host Socket Performance Testbed

The performance testbed for the CiNIC “Host Socket” configuration is shown in Figure 4-22. This is the current CiNIC design and implementation, which was completed in June 2001.

Figure 4-22 CiNIC architecture host socket performance testbed (host side: client-2.01 application, socket API, protocol layer, pcitrig.o and hostmem_p.o kernel modules, /dev/hostmem, PCI device with unimplemented BAR; 21554 and HP 16700A on the PCI busses; co-host side: 21285, shared memory, ebsamem_p.o kernel module, /dev/ebsamem, protocol handlers, TCP/IP stack, NIC device driver, NIC)

4.4.4 Secondary PCI Bus “Posted Write” Phenomenon

While working on the instrumentation for the CiNIC architecture polling protocol, Ma and I noticed a phenomenon where the logic analyzer would detect the presence of the four “P” characters on the secondary PCI bus before they appeared on the primary PCI bus, as shown in Figure 4-23a. Recall that the data path is from the host to the 21554 across the primary PCI bus (Figure 4-23b), and then from the 21554 to the 21285 (and ultimately onto the shared memory) via the secondary PCI bus (Figure 4-23c). For a moment there, we thought we might have discovered a heretofore unknown quantum-like behavior in the 21554 bridge – i.e., data was showing up on the secondary PCI side of the 21554 before it had been written to the primary PCI side. Talk about a performance boost!

Figure 4-23 Start marker appears first on secondary PCI bus

What we were seeing, of course, was not some strange quantum effect but a standard PCI behavior. Taking it step by step: the test application on the host machine Hydra copies the contents of its defined payload buffer down to the shared memory region on the EBSA-285. At the hardware level, this copy operation transfers the payload buffer’s contents from the upstream side of the 21554 to the downstream side of the 21554, then to the 21285 host bridge, and finally to the shared memory region on the EBSA-285. An application on the EBSA-285 then copies the data from the shared memory region to the NIC via a socket connection. When this write completes, the test application begins the next iteration of the “host ⇒ EBSA ⇒ NIC” send cycle by once again copying the contents of its payload buffer down to the shared memory region on the EBSA-285.

At this point, however, the 21554 bridge is faced with a dilemma because: a) the 21554 has no way of knowing whether the 21285 host bridge has completed the previous write operation to shared memory, and b) the 21554 has just received data (from the current “host ⇒ EBSA ⇒ NIC” send cycle) that is to be written to the same shared memory region as the previous write operation. So before the 21554 can send the new data to the 21285 (i.e., to the shared memory region via the 21285), the 21554 must first ensure the 21285 host bridge has written all of the data from the previous memory write transaction out to shared memory. The 21554 does this by performing a read transaction from the same region of shared memory that was the target of the previous write transaction. According to [51]:

“Before permitting a read to occur on the target bus the bridge designer must first flush all posted writes to their destination memory targets. A device driver can ensure that all memory data has been written to its device by performing a read from the device. This will force the flushing of all posted write buffers in bridges that reside between the master executing the read and the target device before the read is permitted to complete.”

So the initial set of “P” characters on the secondary PCI bus (Figure 4-23a) is an artifact of the 21554 bridge chip performing a dummy read from shared memory to ensure the 21285 has finished writing its posted write data out to shared memory. Following this read, the 21554 asserts its “target ready” line and allows the host to send it the four “P” characters via the primary PCI bus (Figure 4-23b). Next, the 21554 transfers the “P” characters across the secondary PCI bus to the 21285 (Figure 4-23c), which ultimately writes the “P” characters into shared memory.
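The same principle from [51] shows up in device driver code: a read from the target device forces every intervening bridge to flush its posted write buffers. The fragment below is a generic illustration of that idiom, not code from the CiNIC drivers; the device pointer and register offset are placeholders.

/*
 * Generic illustration of the posted-write flush idiom described in [51];
 * not code from the CiNIC drivers.  'regs' is assumed to point at a
 * memory-mapped device region, and STATUS_REG is a placeholder offset.
 */
#include <stdint.h>

#define STATUS_REG 0x0     /* placeholder: any readable register will do */

static void write_then_flush(volatile uint32_t *regs, unsigned reg, uint32_t val)
{
    regs[reg] = val;               /* memory write may sit in a posted buffer */
    (void)regs[STATUS_REG];        /* read back: forces every bridge between
                                      the CPU and the device to flush its
                                      posted writes before the read completes */
}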
Chapter 5 Performance Data and Analysis

Using the instrumentation technique described in Chapter 4, I obtained performance data for a number of test cases. Performance data for earlier versions of the Linux kernel, and for Microsoft Windows NT 4.0, is presented in my senior project report [12]. Samples of the performance data obtained with this instrumentation technique are provided here solely for the purpose of demonstrating the capabilities of this measurement technique.

The charts in the following subsections represent the latency characteristics of a particular system under test [SUT] in response to a series of network send operations from the application layer. There are two types of SUTs: “stand-alone” systems and the combined host/co-host CiNIC system. During the stand-alone tests, the SUT was either an i686 host or the EBSA-285, but not both. For the combined tests, the SUT was the entire CiNIC architecture – i.e., the host computer system, the 21554 PCI-PCI nontransparent bridge, and the EBSA-285 co-host (see Figure 2-4).

Each chart’s title contains the following information items: the name and version of the operating system that was running on the SUT during the tests, the transport protocol tested (either TCP or UDP), the version of the Internet protocol tested (either IPv4 or IPv6), and the range of payloads tested. The range of payloads tested is shown in the format min-max-step/count, where min is the minimum (or “starting”) payload size, max is the largest (or “ending”) payload size, step is the payload increment size, and count is the number of times each payload size was sent. Each chart contains two traces. The lower trace represents the median wire arrival time [41] – i.e., the median time when the first byte of the data payload appeared on a particular PCI bus relative to the “time zero” PCI trigger signal (see §§4.3 and 4.4). The upper trace represents the median wire exit time [41] – i.e., the median time when the last byte of the data payload reached a particular PCI bus relative to a “time zero” trigger signal.

For purely informational purposes, certain figures on the following pages contain a pair of charts that plot the same set of data values. The top chart provides a linear plot of the data and the bottom provides a semi-log plot of the data. This was done because linear plots often reveal performance characteristics that go unnoticed on semi-log plots, and vice versa.

5.1 Stand-alone Performance – Microsoft Windows 2000 (SP1)

The TCP/IP4 and UDP/IP4 performance data shown in this section were obtained using the stand-alone performance instrumentation described in §4.3. The SUT was a dual Pentium-III host with 512 MB of RAM and a 3Com 3c905C NIC. The operating system environment was Microsoft Windows 2000 with service pack 1 (SP1) installed. Blocking-type sockets were used for all network connections.

Figure 5-1 Windows 2000 TCP/IP4 on i686 host, payload sizes: 32 to 16K bytes [charts: “Fornax: Windows 2000 TCP / IP4 32-16384-32 / 10”, linear and semilog, latency (usec) vs. payload size (bytes)]
Figure 5-2 Windows 2000 TCP/IP4 on i686 host, payload sizes: 512 to 65K bytes [charts: “Fornax: Windows 2000 TCP / IP4 512-65536-512 / 30”, linear and semilog]
Figure 5-3 Windows 2000 UDP/IP4 on i686 host, payload sizes: 512 to 65K bytes [charts: “Fornax: Windows 2000 UDP / IP4 512-65536-512 / 30”, linear and semilog]

5.2 Stand-alone Performance – Linux 2.4.3 Kernel on i686 Host

The TCP/IP4 and UDP/IP4 performance data shown in this section were obtained using the stand-alone performance instrumentation described in §4.3. The SUT was the i686 PC named Hydra. As mentioned in section 2.2.1, Hydra is a uniprocessor host with a 450 MHz Pentium-III CPU. The operating system was Redhat Linux 7.1 with a Linux 2.4.3 kernel. Blocking-type sockets were used for all network connections.
Figure 5-4 Linux 2.4.3 TCP/IP4 on i686 host, payload sizes: 64 to 9K bytes [charts: “Hydra: Linux 2.4.3 TCP / IP4 64-9216-32 / 30”, linear and semilog, latency (usec) vs. payload size (bytes)]
Figure 5-5 Linux 2.4.3 TCP/IP4 on i686 host, payload sizes: 512 to 65K bytes [charts: “Hydra: Linux 2.4.3 TCP / IP4 512-65536-512 / 30”, linear and semilog]
Figure 5-6 Linux 2.4.3 UDP/IP4 on i686 host, payload sizes: 64 to 9K bytes [charts: “Hydra: Linux 2.4.3 UDP / IP4 64-9216-32 / 30”, linear and semilog]
Figure 5-7 Linux 2.4.3 UDP/IP4 on i686 host, payload sizes: 512 to 65K bytes [charts: “Hydra: Linux 2.4.3 UDP / IP4 512-65536-512 / 30”, linear and semilog]

5.3 Stand-alone Performance – Linux 2.4.3 Kernel on EBSA-285

The TCP/IP4 and UDP/IP4 performance data shown in this section were obtained using the stand-alone performance instrumentation described in §4.3. The SUT was an EBSA-285 with 144 MB of RAM and a 3Com 3c905C NIC. The EBSA-285 was configured for host bridge operation. The operating system environment was (approximately) Redhat Linux 7.1 with a Linux 2.4.3-rmk2 kernel. The “-rmk2” suffix on the Linux kernel version identifies the second patch that Russell M. King has created for version 2.4.3 of the ARM/Linux kernel. Blocking-type sockets were used for all network connections.
Figure 5-8 Linux 2.4.3-rmk2 TCP/IP4 on EBSA285, payload sizes: 64 to 4.3K bytes [charts: “EBSA285: Linux 2.4.3-rmk2 TCP / IP4 64-9216-32 / 30”, linear and semilog, latency (usec) vs. payload size (bytes)]
Figure 5-9 Linux 2.4.3-rmk2 TCP/IP4 on EBSA285, payload sizes: 512 to 50K bytes [charts: “EBSA285: Linux 2.4.3-rmk2 TCP / IP4 512-65536-512 / 30”, linear and semilog]
Figure 5-10 Linux 2.4.3-rmk2 UDP/IP4 on EBSA285, payload sizes: 64 to 9K bytes [charts: “EBSA285: Linux 2.4.3-rmk2 UDP / IP4 64-9216-32 / 30”, linear and semilog]
Figure 5-11 Linux 2.4.3-rmk2 UDP/IP4 on EBSA285, payload sizes: 512 to 31K bytes [charts: “EBSA285: Linux 2.4.3-rmk2 UDP / IP4 512-65536-512 / 30”, linear and semilog]

5.4 CiNIC Architecture – Polling Protocol Performance Testbed

The TCP/IP4 and UDP/IP4 performance data shown in this section were obtained using the combined host / co-host CiNIC performance instrumentation described in §4.4. The SUT was the i686 host PC named Hydra and the EBSA-285 co-host. The operating system environment on the host PC was Redhat Linux 7.1. The operating system environment on the EBSA-285 was (approximately) Redhat Linux 7.1 with a Linux 2.4.3-rmk2 kernel. The “-rmk2” suffix on the Linux kernel version identifies the second patch that Russell M. King has created for version 2.4.3 of the ARM/Linux kernel. Blocking-type sockets were used for all network connections.

Recall that the polling protocol testbed being used writes the data payload directly down to the shared memory on the co-host. It does not use any socket calls to transfer data down to the co-host. A custom test application on the co-host detects the host-to-shared-memory write operation, whereupon it copies the payload contents from shared memory to the NIC via a co-host-side socket call.

5.4.1 CiNIC Polling Protocol TCP / IPv4 Performance

The stacked area charts shown in Figure 5-13 and Figure 5-14 show the cumulative wire entry times across the entire CiNIC data path, as depicted in Figure 5-12. Time T1 (= t1,1 – t1,0) is the host application / OS wire entry latency – i.e., the time interval between the PCI trigger event and the arrival of the data payload’s leading edge on the primary PCI bus. Time T2 (= t2,1 – t1,1) is the latency added by the 21554 as it copies the payload’s leading edge from the primary PCI bus to the secondary PCI bus.
Time T3 (= t2,3 – t2,1) is the processing overhead of the test application and Linux OS on the EBSA-285 – i.e., the time interval between the arrival of the payload in shared memory and the arrival of the payload’s leading edge on the secondary PCI bus as a result of the test app sending the payload from shared memory to the NIC via a socket call.

Figure 5-12 CiNIC wire entry times (time line: BAR write “trigger” at t1,0 / t2,0; T1 ends when the host writes the data payload to the 21554; T2 ends when the 21554 writes the data payload to shared memory via the 21285; T3 ends when a socket call on the EBSA sends the data payload from shared memory to the NIC)

Figure 5-13 Linux 2.4.3 TCP/IP4 on CiNIC, Polling protocol, Wire entry latencies [stacked area charts of t(2,3)-t(2,1), t(2,1)-t(1,1), t(1,1)-t(1,0), linear and semilog, small payload sizes]
Figure 5-14 Linux 2.4.3 TCP/IP4 on CiNIC, Polling protocol, Wire entry latencies [stacked area charts, linear and semilog, large payload sizes]

In Figure 5-13 and Figure 5-14, the elapsed time plot T3 (t2,3 – t2,1) is so large that it completely swamps out the elapsed time plots T1 (t1,1 – t1,0) and T2 (t2,1 – t1,1). The time plots T1 and T2 are therefore re-plotted without T3 in Figure 5-16 so as to reveal the characteristics of these two time intervals.

Figure 5-15 Partial plot of CiNIC wire entry times
Figure 5-16 Linux 2.4.3 TCP/IP4 on CiNIC, Polling protocol, Wire entry latencies [T1 and T2 only, linear charts for small and large payload sizes]

The stacked area charts shown in Figure 5-18 and Figure 5-19 show the cumulative wire exit times across the entire CiNIC data path, as depicted in Figure 5-17. Time T1 (= t1,2 – t1,0) is the host application / OS wire exit latency – i.e., the time interval between the PCI trigger event and the arrival of the data payload’s trailing edge on the primary PCI bus. Time T2 (= t2,2 – t1,2) is the latency added by the 21554 as it copies the payload’s trailing edge from the primary PCI bus to the secondary PCI bus. Time T3 (= t2,4 – t2,2) is the processing overhead of the test application and Linux OS on the EBSA-285 – i.e., the time interval between the arrival of the payload in shared memory and the arrival of the payload’s trailing edge on the secondary PCI bus as a result of the test app sending the payload from shared memory to the NIC via a socket call.

Figure 5-17 CiNIC wire exit times (time line: BAR write “trigger” at t1,0 / t2,0; T1 ends when the host writes the data payload to the 21554; T2 ends when the 21554 writes the data payload to shared memory via the 21285; T3 ends when a socket call on the EBSA sends the data payload from shared memory to the NIC)

Figure 5-18 Linux 2.4.3 TCP/IP4 on CiNIC, Polling protocol, Wire exit latencies [stacked area charts of t(2,4)-t(2,2), t(2,2)-t(1,2), t(1,2)-t(1,0), linear and semilog, small payload sizes]
Figure 5-19 Linux 2.4.3 TCP/IP4 on CiNIC, Polling protocol, Wire exit latencies [stacked area charts, linear and semilog, large payload sizes]

In Figure 5-18 and Figure 5-19, the elapsed time plot T3 (t2,4 – t2,2) is so large that it completely swamps out the elapsed time plots T1 (t1,2 – t1,0) and T2 (t2,2 – t1,2). The time plots T1 and T2 are therefore re-plotted without T3 in Figure 5-21 so as to reveal the characteristics of these two time intervals.

Figure 5-20 Partial plot of CiNIC wire exit times

Figure 5-21 Linux 2.4.3 TCP/IP4 on CiNIC, Polling protocol, Wire exit latencies [T1 and T2 only, linear charts for small and large payload sizes]

Figure 5-23 shows the overhead introduced on the wire entry and exit times by the Intel 21554 PCI-PCI nontransparent bridge, as depicted in Figure 5-22.

Figure 5-22 21554 overhead on wire entry and exit times

Some points of interest on these graphs are:
• The 21554’s downstream posted write FIFO is 256 bytes long. This fact is clearly visible on the wire exit time plot (i.e., the upper plot), where the wire exit latency across the 21554 does not begin to increase until the payload size reaches approximately 256 bytes.
• The wire exit latency plot levels off at about 2.5 usec (≈ 83 PCI clock cycles at 33 MHz) for payload sizes of approximately 512 bytes and larger. This sounds reasonable because the 21554 requires a minimum of 256 bytes ÷ 4 bytes per bus cycle × 1 PCI bus cycle / 33 MHz ≈ 1.94 usec to empty all 256 bytes of its downstream posted write FIFO (see the short check below).
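A quick numeric check of that estimate, restating only the arithmetic already given in the bullet above:

/* Back-of-envelope check of the 21554 FIFO-drain estimate quoted above:
 * 256 bytes moved 4 bytes per PCI clock at 33 MHz, versus the observed
 * ~2.5 usec plateau. */
#include <stdio.h>

int main(void)
{
    const double pci_clock_hz  = 33e6;
    const double fifo_bytes    = 256.0;
    const double bytes_per_clk = 4.0;

    double drain_s      = (fifo_bytes / bytes_per_clk) / pci_clock_hz;  /* 64 clocks */
    double plateau_clks = 2.5e-6 * pci_clock_hz;

    printf("FIFO drain time: %.2f us\n", drain_s * 1e6);       /* ~1.94 us */
    printf("2.5 us plateau : %.1f PCI clocks\n", plateau_clks); /* ~82.5, i.e. the
                                                                   ~83 clocks quoted */
    return 0;
}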
Figure 5-23 Linux 2.4.3 TCP/IP4 on CiNIC, Polling protocol, 21554 overhead [wire entry overhead t(2,1)-t(1,1) and wire exit overhead t(2,2)-t(1,2) vs. payload size, linear charts]

Figure 5-25 shows the time required for the co-host-side socket to transfer the entire data payload from shared memory to the NIC, as depicted in Figure 5-24.

Figure 5-24 Socket transfer time

Figure 5-25 Linux 2.4.3 TCP/IP4 on CiNIC, Polling protocol, testapp send times (PCI2) [send duration t(2,4)-t(2,3), shared memory to NIC on PCI bus 2, vs. payload size]

5.4.2 CiNIC Polling Protocol UDP / IPv4 Performance

The UDP performance graphs in this section have the same sequence as the TCP performance graphs in the previous section – i.e., wire entry, partial wire entry, wire exit, partial wire exit, and so on.

Figure 5-26 Linux 2.4.3 UDP/IP4 on CiNIC, Polling protocol, Wire entry latencies [stacked area charts of t(2,3)-t(2,1), t(2,1)-t(1,1), t(1,1)-t(1,0), linear and semilog, small payload sizes]
Figure 5-27 Linux 2.4.3 UDP/IP4 on CiNIC, Polling protocol, Wire entry latencies [stacked area charts, linear and semilog, large payload sizes]
Figure 5-28 Linux 2.4.3 UDP/IP4 on CiNIC, Polling protocol, Wire entry latencies [T1 and T2 only, linear charts]
Figure 5-29 Linux 2.4.3 UDP/IP4 on CiNIC, Polling protocol, Wire exit latencies [stacked area charts of t(2,4)-t(2,2), t(2,2)-t(1,2), t(1,2)-t(1,0), linear and semilog, small payload sizes]
Figure 5-30 Linux 2.4.3 UDP/IP4 on CiNIC, Polling protocol, Wire exit latencies [stacked area charts, linear and semilog, large payload sizes]
Figure 5-31 Linux 2.4.3 UDP/IP4 on CiNIC, Polling protocol, Wire exit latencies [T1 and T2 only, linear charts]
Figure 5-32 Linux 2.4.3 UDP/IP4 on CiNIC, Polling protocol, 21554 overheads [wire entry and exit overheads vs. payload size, linear charts]
Figure 5-33 Linux 2.4.3 UDP/IP4 on CiNIC, Polling protocol, testapp send times (PCI2) [send duration t(2,4)-t(2,3), shared memory to NIC on PCI bus 2, vs. payload size]

Chapter 6 Conclusions and Future Work

6.1 Conclusions

The first half of this thesis presented a brief history of the Calpoly intelligent NIC (CiNIC) Project, and described the hardware and software architectures that comprise the CiNIC platform. This was followed by a discussion of the i686-ARM and ARM-ARM development tools I implemented to support software development for (and within) the ARM/Linux environment on the EBSA-285 co-host.

In the second half of the thesis I described the performance instrumentation techniques I created, with the assistance of Ron J. Ma [33], to gather latency data from the CiNIC architecture during network send operations. There are two implementations of this instrumentation technique: “stand-alone” and “CiNIC.” The stand-alone implementation gathers performance data either from the host system by itself, or from the EBSA-285 co-host system by itself. The CiNIC instrumentation technique gathers performance data from the combined host/co-host CiNIC platform. Samples of the performance data obtained with these two instrumentation techniques are provided in Chapter 5.

As part of this development effort, Ma and I devised a makeshift communications protocol that sends data from the host system down to the EBSA-285 co-host and out to the NIC on the secondary PCI bus.
The results of our tests with the makeshift data transfer protocol showed that, at this stage of development, the CiNIC platform is about 50% slower than the stand-alone i686 host Hydra.

6.2 Future Work

Some ideas for future work based on this project are:
• Upon completion of the host / co-host communications infrastructure (see [35] and [36]), use the instrumentation technique described herein to gather performance data for that system. Specifically, use host-side socket calls instead of the memory mapping technique used in the polling protocol [33] to write data down to the EBSA-285 co-host and out to the NIC.
• Using empirical data gathered with the instrumentation technique described herein, create a performance model for the CiNIC architecture. This model should be derived from the performance metrics described in Vern Paxson’s “Framework for IP Performance Metrics” document [41].
• Use the performance model to identify potential optimization points within the CiNIC architecture’s data path.
• Employing the cyclic development strategy shown in Figure 6-1, make incremental improvements to the CiNIC architecture.

Figure 6-1 CiNIC development cycle (research results, system design, implementation, testing, performance instrumentation, analysis)

Bibliography

[1] American National Standards Institute, “Programming Languages – C.” ISO/IEC 9899:1999(E), 2nd edition, December 1, 1999.
[2] American National Standards Institute, “Programming Languages – C++.” ISO/IEC 14882:1998(E), September 1, 1998.
[3] D. Barlow, “The Linux GCC HOWTO.” May 1999, <http://howto.tucows.com/>
[4] I. Bishara, “Data Transfer Between WinNT and Linux Over a NonTransparent Bridge.” Senior project, California Polytechnic State University, San Luis Obispo (June 2000).
[5] Borland International, “Service applications.” Software: Borland C++ Builder 5.0 Development Guide: Programming with C++ Builder, 2000.
[6] D. P. Bovet and M. Cesati, “Understanding the Linux Kernel.” O’Reilly, 2001. ISBN 0-596-00002-2.
[7] Y. Chen, “The Performance of a Client-Side Web Caching System.” Master’s thesis, California Polytechnic State University, San Luis Obispo (April 2000).
[8] D. Comer, “Internetworking with TCP/IP: Principles, Protocols, and Architectures.” 4th ed., v1, Prentice-Hall, 2000. ISBN 0-13-018380-6.
[9] Digital Equipment Corp., “Digital Semiconductor 21554 PCI-to-PCI Bridge Evaluation Board.” User’s guide number EC-R93JA-TE, February 22, 2000 (preliminary document).
[10] D. Dougherty and A. Robbins, “sed & awk.” 2nd ed., Unix Power Tools series, O’Reilly, 1997. ISBN 1-56592-225-5.
[11] E. Engstrom, “Running ARM/Linux on the EBSA-285.” Senior project, California Polytechnic State University, San Luis Obispo (Spring 2001).
[12] J. Fischer, “Measuring TCP/IP Performance in the PC Environment: A Logic Analysis / Systems Approach.” Senior project, California Polytechnic State University, San Luis Obispo (June 2000).
[13] FuturePlus Systems Corp., “FS2000 Users Manual.” Rev. 3.5, 1998.
[14] FuturePlus Systems Corp., “PCI Analysis Probe and Extender Card FS2000 – Quick Start Instructions.” Rev. 3.4, 1998.
[15] Hewlett-Packard Corp., “HP 16500C/16501A Logic Analysis System: Programmer's Guide.” Publication number 16500-97018, December 1996.
[16] Hewlett-Packard Corp., “HP 16550A 100-MHz State / 500-MHz Timing Logic Analyzer: Programmer's Guide.” Publication number 16500-97000, May 1993.
[17] Hewlett-Packard Corp., “HP 16550A 100-MHz State / 500-MHz Timing Logic Analyzer: User’s Reference.” Publication number 16550-97001, May 1993.
[18] Hewlett-Packard Corp., “HP 16554A, HP 16555A, and HP 16555D State/Timing Logic Analyzer: User’s Reference.” Publication number 16555-97015, February 1999.
[19] Hewlett-Packard Corp., “Installation Guide: HP 16600 Series, HP 16700A, HP 16702A Measurement Modules.” Publication number 16700-97010, 1999.
[20] Hewlett-Packard Corp., “Remote Programming Interface (RPI) For the HP 16700 Logic Analysis System.” Version 3-1-99, 1999.
[21] Y. Huang, “Web Caching Performance Comparison for Implementation on a Host (Pentium III) and a Co-host (EBSA-285).” Master’s thesis, California Polytechnic State University, San Luis Obispo (June 2001).
[22] P. Huang, “Design and Implementation of the CiNIC Software Architecture on a Windows Host.” Master’s thesis, California Polytechnic State University, San Luis Obispo (February 2001).
[23] Intel Corporation, “21554 PCI-to-PCI Bridge for Embedded Applications.” Hardware reference manual 278091-001, September 1998.
[24] Intel Corporation, “StrongARM** EBSA-285 Evaluation Board.” Reference manual 278136-001, October 1998.
[25] Intel Corporation, “21285 Core Logic for SA-110 Microprocessor.” Datasheet number 278115-001, September 1998.
[26] Intel Corporation, “SA-110 Microprocessor.” Technical reference manual 278058-001, September 1998.
[27] Intel Corporation, “PCI and uHAL on the EBSA-285.” Application note 278204-001, October 1998.
[28] Intel Corporation, “Memory Initialization on the EBSA-285.” Application note 278152-001, October 1998.
[29] R. King, “The ARM Linux Project.” 09-June-2001, <http://www.arm.linux.org.uk/>
[30] R. King, “ARM/Linux FTP site.” 09-June-2001, <ftp.arm.linux.org.uk/pub/armlinux/source/>
[31] A. Kostyrka, “NFS-Root mini-howto.” v8.0, 8 Aug 1997, <http://howto.tucows.com/>
[32] Y.G. Liang, “Network Latency Measurement for I2O Architecture by Logic Analyzer Instrumentation Technique.” Senior project, California Polytechnic State University, San Luis Obispo (June 1998).
[33] R. Ma, “Instrumentation Development on the Host PC / Intel EBSA-285 Co-host Network Architecture.” Senior project, California Polytechnic State University, San Luis Obispo (June 2001).
[34] Maxtor Corp., “531DX Ultra DMA 100 5400 RPM: Quick Specs.” Document number 42085, rev. 2, May 7, 2001.
[35] M. McClelland, “A Linux PCI Shared Memory Device Driver for the Cal Poly Intelligent Network Interface Card.” Senior project, California Polytechnic State University, San Luis Obispo (June 2001).
[36] R. McCready, “Design and Development of CiNIC Host/Co-host Protocol.” Senior project, California Polytechnic State University, San Luis Obispo (June 2001).
[37] Microsoft Corp., “Simultaneous Access to Multiple Transport Protocols,” Windows Sockets Version 2. Microsoft Developer Network, Platform SDK online reference, January 2001.
[38] D. Mills, “Network Time Protocol (Version 3).” Network Working Group, RFC 1305, 120 pages (March 1992).
[39] R. Nemkin, “Diskless-HOWTO.” v1.0, 13 May 1999, <http://howto.tucows.com/>
[40] C. Newham and B. Rosenblatt, “Learning the bash shell.” 2nd ed., O’Reilly, 1998. ISBN 1-56592-347-2.
[41] V. Paxson, “Framework for IP Performance Metrics.” Informational RFC 2330, 40 pages (May 1998).
[42] T. Pepper, “Linux Network Device Drivers for PCI Network Interface Cards Using Embedded EBSA-285 StrongARM Processor Systems.” Senior project, California Polytechnic State University, San Luis Obispo (June 1999).
[43] J. Postel, “Discard Protocol.” RFC 863, 1 page (May 1983).
[44] J. Postel, “Internet Protocol.” RFC 791, 45 pages (September 1981).
[46] B. Quinn and D. Shute, "Windows Sockets Network Programming." Reading: Addison-Wesley, 1996.
[47] J. Reynolds and J. Postel, "Assigned Numbers." Memo RFC 1340, 139 pages (July 1992).
[48] A. Rubini, "Linux Device Drivers." O'Reilly, 1998. ISBN 1-56592-292-1.
[49] M. Sanchez, "Iterations on TCP/IP – Ethernet Network Optimization." Master's thesis, California Polytechnic State University, San Luis Obispo (June 1999).
[50] R. L. Schwartz and T. Christiansen, "Learning Perl." 2nd ed., UNIX Programming series, O'Reilly, 1997. ISBN 1-56592-284-0.
[51] T. Shanley and D. Anderson, "PCI System Architecture." PC System Architecture Series, 4th ed. Reading: Mindshare, 1999. ISBN 0-201-30974-2.
[52] K. Sollins, "The TFTP Protocol (Revision 2)." Internet standard, RFC 1350, 11 pages (July 1992).
[53] L. Wall, T. Christiansen and R. L. Schwartz, "Programming Perl." 2nd ed., O'Reilly, 1996. ISBN 1-56592-149-6.
[54] W. Wimer, "Clarifications and Extensions for the Bootstrap Protocol." Proposed standard, RFC 1532, 23 pages (October 1993).
[55] B. Wu, "An Interface Design and Implementation for a Proposed Network Architecture to Enhance the Network Performance." Master's thesis, California Polytechnic State University, San Luis Obispo (June 2000).
[56] P. Xie, "Network Protocol Performance Evaluation of IPv6 for Windows NT." Master's thesis, California Polytechnic State University, San Luis Obispo (June 1999).
[57] A. Yu, "Windows NT 4.0 Embedded TCP/IP Network Appliance: A Proposed Design to Increase Network Performance." Master's thesis, California Polytechnic State University, San Luis Obispo (December 1999).

Appendix A Acronyms

Acronym      Definition
A/L-BIOS     ARM/Linux BIOS
ARM          Advanced RISC Microprocessor
BAR          Base Address Register
CiNIC        Calpoly Intelligent NIC
CPNPRG       Cal Poly Network Performance Research Group
DIMM         Dual In-line Memory Module
DLL          Dynamic Link Library
DNS          Domain Name Service
DUT          Device Under Test
EBSA         Evaluation Board – StrongARM
EIDE         Enhanced IDE
FMU          Flash Management Utility (p/o the SDT)
FTP          File Transfer Protocol
GNU/SDTC     GNU Software Development Tool Chain
HAL          Hardware Abstraction Layer
HUT          Host Under Test
I2O          Intelligent I/O
IDE          Integrated Drive Electronics
IETF         Internet Engineering Task Force
IP           Internet Protocol (versions 4 and 6)
JTAG         Joint Test Action Group
KB           Kilobyte
LED          Light Emitting Diode
MB           Megabyte
MBPS         Megabits per second
NFS          Network File System
NIC          Network Interface Card
OS           Operating System
PCI          Peripheral Component Interconnect
POST         Power On System (Self) Test
QOS          Quality of Service
RAID         Redundant Array of Inexpensive Disks
RPM          Redhat Package Manager
SCSI         Small Computer System Interface
SDRAM        Synchronous Dynamic Random Access Memory
SDT          Software Development Toolkit (ARM Ltd.)
SUT          System Under Test
TCP          Transmission Control Protocol
UDP          User Datagram Protocol
uHAL         Micro HAL
WOSA         Windows Open System Architecture

Appendix B RFC 868 Time Protocol Servers

B.1. RFC 868 Time Protocol Servers

The Linux operating system on the EBSA-285 can use the rdate command only when the EBSA-285 is connected to a network that has one or more RFC 868 Time Protocol servers on it. Therefore, the system administrator needs to configure at least one of the networked workstations as a server for the RFC 868 Time Protocol. This appendix describes how this is done.

NOTE Redundancy is generally a good thing when dealing with system-critical services.
So the system administrator should, in my opinion, configure at least three machines to act as RFC 868 Time Protocol servers. These machines should preferably be a mixture of both Linux and Windows 2000 hosts so that if one set of hosts is temporarily unavailable (e.g., a Windows upgrade temporarily disables the RFC 868 time service on the Win2K hosts), the remaining RFC 868 time servers will still be available to the EBSA-285.

B.1.1. RFC 868 Support on Linux

The Linux hosts in the CPNPRG lab use the xinetd daemon to manage (on a host-by-host basis) incoming connection requests for the so-called "well-known" network services such as telnet, FTP, finger, the RFC 868 Time Protocol, etc. On Redhat 7.x systems there exists a directory named /etc/xinetd.d/ that contains a configuration file for each of the well-known network services the xinetd daemon is responsible for. The xinetd daemon reads these configuration files when it starts, and thereby learns what it should do when it receives an incoming connection request for a particular network service.

To enable support for the RFC 868 Time Protocol on our Redhat 7.x Linux hosts I had to manually create two configuration files named time and time-udp in each host's /etc/xinetd.d/ directory. This was done in accordance with the instructions found in the xinetd.conf(5) man page (a representative sketch of such a configuration file is shown below). After creating the time and time-udp configuration files I restarted each host's xinetd daemon so that the daemon would re-read its configuration files and thereby enable support for the RFC 868 Time Protocol. To restart the xinetd daemon on a Redhat 7.x Linux host, log on as the super user 'root' and issue the xinetd daemon's restart command as shown in Figure B-1:

[jfischer@server: /temp]$ su -
Password:
[root@server: /root]$ /etc/rc.d/init.d/xinetd restart

Figure B-1 Restarting the xinetd daemon on a Linux host

To test whether the xinetd restart enabled support for the RFC 868 Time Protocol on a particular Linux host, log on to any other Linux host (i.e., any host other than the RFC 868 server) and issue the command shown in Figure B-2:

[jfischer@xyz: /temp]$ rdate -p server
[server] Wed Apr 25 02:47:09 2001
[jfischer@xyz: /temp]$

Figure B-2 Testing the xinetd daemon with the rdate command

where 'server' is the DNS name (e.g., pictor) or IP address of the RFC 868 server whose xinetd daemon you just restarted. The RFC 868 server should send back a reply that looks something like the one on line 2 of Figure B-2 (i.e., the line that starts with the string "[server]").

B.1.2. RFC 868 Support on Windows 2000 Workstations

As far as I know, Microsoft's Windows 2000 for Workstations operating system does not ship with an RFC 868 Time Protocol service. Since the RFC 868 Time Protocol is fairly trivial, I decided to "roll my own" Windows 2000 service application that implements the TCP variant of the RFC 868 Time Protocol. Recall that a Win2K "service application" is a user-layer application that: a) does not have a "display window," and b) takes service requests from client applications, processes those requests, and returns some information to the client applications. Service applications typically run in the background, without much user input. A web, FTP, or e-mail server is an example of a service application [5]. I used Borland C++ Builder, version 5, build 2195, Service Pack 1, to write the Win2K service application.
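Returning briefly to the Linux configuration of section B.1.1: the time and time-udp files follow the attribute = value format documented in xinetd.conf(5). The exact files used on the CPNPRG hosts are not reproduced in this thesis; the entry below is only a representative sketch of what the TCP variant might look like (the UDP variant would use a different id, along with socket_type = dgram, protocol = udp, and wait = yes).

service time
{
    type        = INTERNAL
    id          = time-stream
    socket_type = stream
    protocol    = tcp
    user        = root
    wait        = no
    disable     = no
}

Because the time service is internal to xinetd, no external server program needs to be specified; xinetd answers RFC 868 requests itself once an entry of this form is enabled and the daemon is restarted as shown in Figure B-1.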
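It may also help to note what any RFC 868 server (Linux or Windows) actually sends: a single 32-bit, big-endian integer giving the number of seconds since 00:00:00 on 1 January 1900; in the TCP variant the server writes that value and closes the connection as soon as a client connects [45]. The following sketch is written in portable C with BSD-style sockets purely to illustrate that wire format. It is not the Borland C++ Builder service application found on the accompanying CD-ROM, and it omits the error handling and privilege management a real service would need (binding to port 37 requires administrative rights).

/*
 * Illustrative sketch only: a minimal RFC 868 Time Protocol
 * responder (TCP variant) using BSD-style sockets.  It sends the
 * current time as a 32-bit, big-endian count of seconds since
 * 00:00:00 on 1 January 1900, then closes the connection.
 */
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define RFC868_PORT  37
#define EPOCH_DELTA  2208988800UL  /* seconds from 1900-01-01 to 1970-01-01 */

int main(void)
{
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };

    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(RFC868_PORT);   /* port 37 requires root */

    if (srv < 0 || bind(srv, (struct sockaddr *)&addr, sizeof addr) < 0)
        return 1;
    listen(srv, 5);

    for (;;) {
        uint32_t rfc868_time;
        int client = accept(srv, NULL, NULL);

        if (client < 0)
            continue;

        /* RFC 868 time = Unix time + seconds between the two epochs */
        rfc868_time = htonl((uint32_t)(time(NULL) + EPOCH_DELTA));
        (void)write(client, &rfc868_time, sizeof rfc868_time);
        close(client);
    }
}

Because both servers speak the same four-byte protocol, the rdate test shown in Figure B-2 should work unchanged against either the Linux hosts or the Windows 2000 service described next.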
The source code and executable image for the Win2K service can be found on the accompanying CD-ROM in the directory named \CiNIC\software\Win2k\RFC-868\.

B.1.2.1. Installing the Windows 2000 RFC 868 Service Application

To install the RFC 868 service application on a Windows 2000 workstation, start by copying the file TimeServerTcp.exe from the accompanying CD-ROM to the \WINNT\system32\ subdirectory on the Windows 2000 host. Then open a Win2K command prompt window and type the following at the command prompt:

C:\> %windir%\system32\TimeServerTcp.exe /INSTALL

After executing the install command shown above, you should be able to start the service application by opening the Administrative Tools > Services applet (via the Windows Control Panel), selecting the RFC 868 Time Service (TCP) entry in the list of available services (see Figure B-3), and clicking the Start Service button on the Services window toolbar.

Figure B-3 Windows 2000 "Services" applet showing the RFC 868 Time Service

NOTE The RFC 868 time service is designed to start automatically when Windows boots. If for some reason this (or any other) service fails to start, the system administrator can manually start, pause, stop, and restart any Win2K service via the Administrative Tools > Services applet in the Windows Control Panel.

B.1.2.2. Removing the Windows 2000 RFC 868 Service Application

To remove the RFC 868 Time Service from the Services applet, open the Administrative Tools > Services applet (via the Windows Control Panel) and make sure the RFC 868 Time Service (TCP) is stopped. Then open a command prompt window and type the command:

W:\> %windir%\system32\TimeServerTcp.exe /UNINSTALL

This should remove the RFC 868 Time Service from the Win2K services database.

Appendix C EBSA-285 SDRAM Upgrade

It is important to note that the EBSA-285 has very specific requirements for the SDRAM DIMMs it uses. Not all SDRAM DIMMs will work with the EBSA-285. For example, we tried transplanting some SDRAM DIMMs from various PC hosts into the EBSA-285 and found that the EBSA-285 would not boot with these DIMMs installed. The exact specifications for the EBSA-285 SDRAM DIMMs are located in sections 2.3.1, 8.8.7 (particularly tables 8-1 and 8-2), and appendix A.5 of the EBSA-285 reference manual [24].

The ARM/Linux BIOS [30] we are using to boot the EBSA-285 also has some specific requirements for the SDRAM DIMMs. Note that the BIOS must be able to detect and initialize the SDRAM DIMMs during the EBSA-285 POST sequence; otherwise, the EBSA-285 will not boot. We ran into this very problem with versions 1.9 and earlier of the ARM/Linux BIOS, which failed to detect and initialize some EBSA-compatible DIMMs even when they were properly installed on the EBSA-285. Versions 1.10 and 1.11 of the ARM/Linux BIOS are noticeably better at detecting and initializing SDRAM DIMMs, but they still have trouble with certain DIMM configurations. For example, after installing two EBSA-compatible 128 MB DIMMs on the EBSA-285, the 1.10 and 1.11 ARM/Linux BIOSs both failed to detect and initialize these DIMMs; consequently, the EBSA-285 would not boot. However, when we installed one 128 MB DIMM in the EBSA's first (i.e., "top") DIMM socket and the original 16 MB DIMM in the second (i.e., "bottom") DIMM socket, the 1.10 and 1.11 BIOSs both successfully detected and initialized the DIMMs.
Consequently, we are now using a 128 MB + 16 MB SDRAM DIMM configuration on the EBSA-285, giving us a total SDRAM capacity of 144 MB. This quantity of RAM has proven to be more than sufficient for development and debugging purposes, and for building "large" software packages on the EBSA-285. The particulars of the 128 MB SDRAM DIMM part we are currently using are itemized in Table C-1:

Table C-1 EBSA-285 and ARM/Linux v1.1x BIOS compatible 128 MB SDRAM

Item                  Description
Vendor                Viking Components
                      30200 Avenida de las Banderas
                      Rancho Santa Margarita, CA 92688
                      1.800.338.2361
Part number           H6523
Nomenclature          128 MB SDRAM DIMM
                      PC100 compliant
                      168-pin
                      64-bit
                      unbuffered
                      3.3 Volt switching levels
                      4M x 64 array configuration
Received / Installed  December 2000