========================================
 User's Guide for GPCIe Development Kit
========================================

===============================
  1. Contents of the Kit
===============================

GPCIe is a PCI Express IP core developed by K&F Computing Research
Co. (hereafter KFCR). It provides a simple interface to the backend
logic designed by the user. Combining GPCIe with the backend logic,
the user can easily implement an interface to other PCI Express
devices without detailed knowledge of the PCI Express protocol.

The development kit includes the following three logic designs (all
logic designs are provided as VHDL sources):

  1) Host Interface Bridge (HIB) at the topmost layer of GPCIe design
     hierarchy. It provides a simple interface to the backend logic
     designed by the user.

  2) GPCIe core, which implements the Transaction layer, the Data Link
     layer, and the PHY MAC sub layer defined by the PCI Express
     Specification, as well as the "Application layer" built over
     these three layers. PCI configuration registers and DMA
     controllers are built in this layer.

  3) PHY, which implements the PHY PCS and PHY PMA sub layers using
     the embedded Gigabit transceivers of Altera's FPGA devices.


The development kit also includes a reference design (i.e. a sample
logic) to show usage of HIB, as well as its device driver and control
library which run on Linux OS.

The kit contains the following items:

gpciepkg/
  00readme                       -- This file.
  00readme-j                     -- Japanese translation of this file.
  00license                      -- License agreement of this kit.
  00license-j                    -- Japanese translation of 00license.
  doc/                           -- User's guide and other documents.
  hib.vhd                        -- Logic design of HIB.
  gpcie.vhd                      -- Logic design of GPCIe core.
  phy.vhd                        -- Logic design of PHY.
  ifpga_{agx,s2gx}{8,4}.vhd      -- A reference design.
  synth/                         -- Files used for synthesis of the reference design (.qpf, .qsf, .sdc).
  Makefile                       -- A makefile to generate hib.vhd, gpcie.vhd, and phy.vhd from VHDL source template.
  templates/                     -- VHDL source template.
  scripts/                       -- Utilities to install/uninstall the HIB control software.
  include/                       -- Header files for HIB control library (for Linux).
  lib/                           -- HIB control library (for Linux).
  driver/                        -- Source code of the HIB device driver (for Linux).
  hibutil/                       -- Source code of the HIB control library (for Linux).
  sample/                        -- A sample program to show usage of HIB control library.


Section 2 gives a brief overview of GPCIe. Section 3 shows basic usage
of GPCIe with HIB and the embedded transceivers. Section 4 describes
advanced usage, such as controlling the GPCIe engine directly without
HIB. Section 5 describes the hib, gpcie, and phy VHDL entities in
detail.

Hereafter, all file locations are given as paths relative to the top
directory of the development kit, gpciepkg/.


==============
 2. Overview
==============


Supported PCI Express device type
----------------------------------

GPCIe operates as an Endpoint. It does not operate as a Switch or a
Root Port (of a Root Complex).


Supported FPGA devices
----------------------

We have tested GPCIe on Altera's Arria GX and Stratix II GX.

It is also designed to operate on Altera's FPGA devices without
embedded Gbit transceivers (e.g. Cyclone II, III, Stratix II), when
combined with external PHY chips. However, we have not tested such
configurations yet.

List of supported FPGA devices:
-------------------------------------------------------------------------
PCIe revision  Lane width       PIPE I/F        Arria GX    StratixII GX
-------------------------------------------------------------------------
Gen1.0
(2.5Gb/s)       x8              128b@125MHz     **          **
                x4              64b@125MHz      **          **
                x1              16b@125MHz      **          **
Gen2.0
(5.0Gb/s)       x8              128b@250MHz     -           *
                x4              64b@250MHz      -           *
                x1              16b@250MHz      -           *
-------------------------------------------------------------------------
**:Supported.    *:Will be supported soon.
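
The PIPE I/F column can be cross-checked against the serial line rate.
Gen1 and Gen2 links use 8b/10b encoding, so each lane carries 8 payload
bits per 10 line bits, and the payload rate of the link should equal
the PIPE bus width times the PIPE clock. The following C sketch does
this arithmetic. The function names are illustrative only and are not
part of the kit.

```c
#include <stdint.h>

/* Payload rate of the PIPE interface in Mbit/s: bus width x clock. */
uint64_t pipe_mbps(uint64_t width_bits, uint64_t clock_mhz)
{
    return width_bits * clock_mhz;
}

/* Payload rate of the serial link in Mbit/s after 8b/10b encoding. */
uint64_t link_payload_mbps(uint64_t lanes, uint64_t line_rate_mbps)
{
    return lanes * line_rate_mbps * 8u / 10u;
}
```

For Gen1 x8, pipe_mbps(128, 125) and link_payload_mbps(8, 2500) should
agree, and likewise for the other rows of the table. This is the raw
payload rate before any packet or protocol overhead.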


Structure of the Design
-----------------------

GPCIe consists of three entities, namely, hib, gpcie, and phy125.

Entity hib is located at the topmost layer, which provides a simple
interface to the backend logic designed by the user.

Entity gpcie implements the Transaction layer, the Data Link layer,
and the PHY MAC sub layer defined by the PCI Express Specification, as
well as the "Application layer" built over these three layers. PCI
configuration registers and DMA controllers are built in this layer.

Entity phy125 implements the PHY PCS and PHY PMA sub layers using the
embedded Gigabit transceivers of Altera's FPGA devices.  This entity is
not used when external PHY chips are used.  In such a case, PIPE
interface of entity gpcie is directly connected to the PHY chips.


  <Backend logic designed by the user>
        |
        | HIB interface
        | (local interface defined by GPCIe)
        |
  <Entity hib>
        |
        | Application interface
        | (local interface defined by GPCIe)
        |
  <Entity gpcie>
    Application layer (PCI configuration registers and DMA controllers)
    Transaction layer
    Data Link layer
    PHY layer (PHY MAC sub layer)
        |
        | PIPE interface
        | (interface defined by the PCI Express Specification)
        |
  <Entity phy125> or <External PHY chip>
    PHY layer (PHY PCS and PHY PMA sub layers)
        |
        | PCI Express serial interface
        | (interface defined by the PCI Express Specification)
        |
   PCI Express device


====================
 3. Basic Usage
====================

In this section, basic usage of GPCIe with HIB and the embedded
transceivers is shown. For usage without HIB, and usage with external
PHY chips, see the next section.

In order to use GPCIe from a logic designed by the user (hereafter
backend logic), entity hib needs to be instantiated. The backend logic
communicates with the host computer (i.e. an upstream PCI Express
device) via the HIB interface. HIB bridges data transfer via the PCI
Express link and that via the HIB interface.


                  PCI Express         HIB interface
  Host computer  <-------------> HIB <------------------> Backend logic


Data transfer from the host computer to HIB is performed by Programmed
I/O (PIO) write. Data transfer from HIB to the host computer is
performed by Direct Memory Access (DMA) write.  Software to control
these transfers is included in the development kit. Usage of the
software will be described later.

Data transfer between HIB and the backend logic is performed using
four signals (hib_we, hib_data, backend_we, backend_data) synchronized
to a 125MHz clock clk_out.  The backend writes to HIB using data bus
backend_data and its enable signal backend_we.  HIB writes to the
backend using data bus hib_data and its enable signal hib_we.

  Write from the backend to HIB:
    clk_out       __~~__~~__~~__~~__~~__~~__~~
    backend_we    ______~~~~~~~~~~~~~~~~______
    backend_data        <d0><d1><d2><d3>

  Write from HIB to the backend:
    clk_out       __~~__~~__~~__~~__~~__~~__~~
    hib_we        ______~~~~~~~~~~~~~~~~______
    hib_data            <d0><d1><d2><d3>

The backend cannot insert any delay during a write burst from HIB.
The backend must receive all data whenever hib_we is asserted.

HIB has an internal buffer to temporarily store data written by the
backend. HIB sends the contents of the buffer to the host computer when
it receives a DMA write request from the host.  The backend needs to
take care that the buffer does not overflow, since HIB does not
check for overflow by itself.  By default, the size of the buffer is set
to 1k words (8k bytes for x4, 16k bytes for x8).
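
The byte sizes quoted above follow from the width of the data bus,
which is NLANE * 16 bits (see the port list of entity hib in section
5), and from the buffer depth, which is given as a power of two
(generic TXBUF_DEPTH). A small C sketch of the arithmetic, with
illustrative function names that are not part of the kit:

```c
#include <stddef.h>

/* Bytes per HIB word: hib_data and backend_data are NLANE*16 bits wide. */
size_t hib_word_bytes(unsigned nlane)
{
    return (size_t)nlane * 16u / 8u;
}

/* Capacity in bytes of a buffer that holds 2^depth words. */
size_t hib_buf_bytes(unsigned nlane, unsigned depth)
{
    return ((size_t)1 << depth) * hib_word_bytes(nlane);
}
```

With the default depth of 10, hib_buf_bytes(4, 10) corresponds to the
x4 case and hib_buf_bytes(8, 10) to the x8 case. A backend that keeps
its own count of words written since the last DMA write can compare it
against 2^depth to avoid overflow.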

DMA controllers and a PIO write controller reside in HIB. These
controllers can be accessed via HIB local registers mapped to the PCI
Base Address Register 0 (BAR0) space.

The host computer performs PIO write onto the BAR2 space. HIB is
designed so that it can achieve high transfer speed, if the
page-attribute of the BAR2 space region in use is set to
write-combining mode. See the source code templates/hibctl.vhd for
implementation detail.

You can find a sample logic at ./ifpga_{agx,s2gx}{8,4}.vhd, which shows
the actual usage of HIB. In the following, we describe how to
synthesize the sample logic, and how to control it from the host
computer.


---------------
Logic Synthesis
---------------

Use Altera's Quartus II for logic synthesis. Synthesis using other
tools may be possible, but has not been tested.

You can find a Quartus Project File (.qpf) and a Quartus Settings File
(.qsf) at:

  ./synth/ifpga_{agx,s2gx}{8,4}.qpf
  ./synth/ifpga_{agx,s2gx}{8,4}.qsf,

with which you can synthesize a sample logic

  ./ifpga_{agx,s2gx}{8,4}.vhd 

to obtain an SRAM Object File (.sof). This file can be used to configure
KFCR's evaluation boards AGX8 and S2GX8.

Note that only two VHDL source files:

  ./hib.vhd
  ./ifpga_{agx,s2gx}{8,4}.vhd

are used for the synthesis. Although HIB internally uses entities gpcie
and phy125, gpcie.vhd and phy.vhd are not necessary, because their
contents are already included in hib.vhd for the user's convenience.


-------------------------
HIB Controlling Software
-------------------------

The development kit includes software to control HIB from the host
computer. The software consists of two parts: the HIB device driver and
the HIB control library. The installation procedure and usage of the
software are described in this section.

Note : The software is only for Linux OS, and other platforms are
currently not supported.  However, this DOES NOT imply that the design of
HIB is Linux OS dependent. HIB is designed independently of any specific
OS, and can be controlled from platforms other than Linux, if
appropriate software is provided.


Software Installation
---------------------

In order to install the software, run ./scripts/install.csh and
follow its instructions.

  kawai@localhost[1]>./scripts/install.csh
  -----------------------------------------------
   Host Interface Bridge (HIB) software package
   installation program.
  -----------------------------------------------
  
  How many HIBs are you installing?: 1
  
  Confirm your choice.
    number of HIBs you are installing : 1
  Are they correct? (y/n): y
  
  -------------------------------
  Preparing for installation...
  -------------------------------
  
  ...
  
  gcc -O0 -g -I. -I../include -o hibtest hibtest.c hibutil.c -lm
  gcc -O0 -g -I. -I../include -o lsgrape lsgrape.c hibutil.c -lm
  
  done


Note that a complete source tree of the Linux kernel is required for
successful installation.


Device Driver Configuration
---------------------------

Every time the host computer is restarted, the HIB device driver needs
to be loaded into the Linux kernel. In order to do this, change
directory to ./driver/, and run make installmodule (you need root
permission).

  [root@localhost driver]# make installmodule
  ./install0.csh
  
  -- install module hibdrv --
  hibdrv: 1 HIB(s) found.
  
  rm -f /dev/hibdrv[0-9]
  
  /sbin/insmod -f hibdrv.ko
  mknod /dev/hibdrv0 c 253 0
  
  chgrp wheel /dev/hibdrv0
  
  chmod 666 /dev/hibdrv0
  crw-rw-rw- 1 root wheel 253, 0 Jul  9 12:59 /dev/hibdrv0
  -- done --

This should load the HIB device driver hibdrv into the kernel.
You can use the command /sbin/lsmod to check the driver status. The output
of the command should have a line that contains the word 'hibdrv'.

  kawai@localhost[2]>lsmod
  Module                  Size  Used by
  hibdrv                 39608  0
  ...                    ...    ...

Once the device driver is properly loaded, software running in
userland can access HIB via the driver.


Functionality Test
------------------

A command ./hibutil/hibtest can be used to check functionality of the
HIB installed into the system. Run hibtest without argument to show
its usage:

  kawai@localhost[3]>./hibtest
  usage: ./hibtest <test_program_ID>
     0) show contents of config & HIB-local registers [devid]
     1) reset DMA and FIFO [devid]
     2) clear HIB-internal FIFO [devid]
     3) show DMA status [devid]
     4) read config register <addr> [devid]
     5) write config register <addr> <val> [devid]
     6) read HIB local registers mapped to BAR0 <addr> [devid]
     7) write HIB local registers mapped to BAR0 <addr> <val> [devid]
     8) read backend memory space mapped to BAR1 <addr> [devid]
     9) write backend memory space mapped to BAR1 <addr> <val> [devid]
    10) check DMA read/write function <size> <sendfunc> [devid] (host <-> HIB)
    11) measure DMA performance <sendfunc> [devid] (host <-> HIB)
    12) measure DMA write performance [devid] (host <- HIB; bypass internal FIFO)
    13) measure DMA read performance <sendfunc> [devid] (host -> HIB; bypass internal FIFO)
    14) reset backend [devid]
    15) raw PIO r/w & DMA r/w [devid]
    16) measure DMA performance with multiple HIBs <sendfunc>  <# of hibs> (host <-> HIBs internal FIFO)
    17) measure DMA write performance with multiple HIBs <# of hibs> [devid offset] (host <- HIBs; bypass internal FIFO)
    18) measure DMA read performance with multiple HIBs <sendfunc> <# of hibs> [devid offset] (host -> HIBs; bypass internal FIFO)
    19) erase configuration ROM (EPCS64) [devid]
    20) write .rpd to configuration ROM (EPCS64) <rpd-file> [devid]
    21) read configuration ROM ID (0x10:EPCS1 0x12:EPCS4 0x14:EPCS16 0x16:EPCS64) [devid]
    22) set pipeline clock frequency to (PCI-X_bus_freq * N / M) <N> <M> [devid]


Run hibtest with argument 0 to show contents of the PCI configuration registers:

  kawai@localhost[4]>./hibtest 0
  ## hib0:
  protocol : PCIe
  link width negotiated : x8
              supported : x8
  link speed negotiated : 2.5 Gb/s
             supported  : 2.5 Gb/s
  max payload size negotiated : 128 byte
                   supported  : 256 byte
  max read request size : 256 byte
  
  configuration register:
  0x00000000: 0x0e701b1a
  0x00000004: 0x00100007
  0x00000008: 0xff000001
  0x0000000c: 0x00000008
  0x00000010: 0xdf608008 0xdf608000
  0x00000014: 0xdf610008 0xdf610000
  0x00000018: 0xdf600008 0xdf600000
  0x0000001c: 0x00000000 0x00000000
  0x00000020: 0x00000000
  0x00000024: 0x00000000
  0x00000028: 0x00000000
  0x0000002c: 0x0e701b1a
  0x00000030: 0x00000000
  0x00000034: 0x00000080
  0x00000038: 0x00000000
  0x0000003c: 0x000000ff
  PCI Express Capability Register:
  0x00000080: 0x00110010
  0x00000084: 0x00000001
  0x00000088: 0x00001000
  0x0000008c: 0x00000481
  0x00000090: 0x00810000

Run hibtest 10 10 1 to test loopback transfer.  This will transmit 10
* 8 bytes of data from the host computer. HIB receives the data, and then
sends it back to the host computer. The host computer compares the data
transmitted and received, and reports the result.

  kawai@localhost[5]>./hibtest 10 10 1
  
  # check hib[0] DMA read/write (host <-> HIB internal FIFO)
  
  size 10
  
  # hib[0] PIO write, and then DMA write (host <-> HIB internal FIFO)
  clear DMA buf...
  DMA read size: 10 words (80 bytes)
  will dmar...
  
  rbuf[0000]: 0x1111111111111111  wbuf[0000]: 0x1111111111111111
  rbuf[0001]: 0x2222222222222222  wbuf[0001]: 0x2222222222222222
  rbuf[0002]: 0x3333333333333333  wbuf[0002]: 0x3333333333333333
  rbuf[0003]: 0x4444444444444444  wbuf[0003]: 0x4444444444444444
  rbuf[0004]: 0x5555555555555555  wbuf[0004]: 0x5555555555555555
  rbuf[0005]: 0x6666666666666666  wbuf[0005]: 0x6666666666666666
  rbuf[0006]: 0x123456789abc0006  wbuf[0006]: 0x123456789abc0006
  rbuf[0007]: 0x123456789abc0007  wbuf[0007]: 0x123456789abc0007
  rbuf[0008]: 0x123456789abc0008  wbuf[0008]: 0x123456789abc0008
  rbuf[0009]: 0x123456789abc0009  wbuf[0009]: 0x123456789abc0009
  ---- transfer size reached ----
  rbuf[0010]: 0x123456789abc000a  wbuf[0010]: 0xfedcba987654000a
  rbuf[0011]: 0x123456789abc000b  wbuf[0011]: 0xfedcba987654000b
  done
   10 words (80 bytes).
  OK

Run hibtest 12 to measure performance of the DMA write (write from HIB
to the host) transfer.

  kawai@localhost[6]>./hibtest 12
  
  # hib[0] DMA write (host <- HIB)
  size: 1024 DMA write: 1.562367 sec  512.043597 MB/s
  size: 2048 DMA write: 1.101087 sec  726.554697 MB/s
  size: 4096 DMA write: 0.857353 sec  933.104598 MB/s
  size: 8192 DMA write: 0.739353 sec  1082.027209 MB/s
  size: 16384 DMA write: 0.680854 sec  1174.995203 MB/s
  size: 32768 DMA write: 0.651100 sec  1228.690060 MB/s

Run hibtest 13 1 to measure performance of the PIO write (write from
the host to HIB) transfer.

  kawai@localhost[7]>./hibtest 13 1
  
  # hib[0] PIO write (host -> HIB)
  size: 64 PIO write: 2.037641 sec  392.610858 MB/s
  size: 128 PIO write: 1.233335 sec  648.647763 MB/s
  size: 256 PIO write: 0.822831 sec  972.253211 MB/s
  size: 512 PIO write: 0.639186 sec  1251.591587 MB/s
  size: 1024 PIO write: 0.620417 sec  1289.455073 MB/s
  size: 2048 PIO write: 0.620460 sec  1289.365885 MB/s
  size: 4096 PIO write: 0.620398 sec  1289.495211 MB/s
  size: 8192 PIO write: 0.620425 sec  1289.438721 MB/s
  size: 16384 PIO write: 0.620416 sec  1289.457550 MB/s

For usage of hibtest other than that shown above, see the source code
./hibutil/hibtest.c.


MTRR Configuration
------------------

The host computer performs PIO write onto the BAR2 space. HIB is
designed so that it can achieve high transfer speed, if the
page-attribute of the BAR2 space region in use is set to
write-combining mode. If the mode is not set, the speed would be
reduced to 20% or lower of the peak.

In order to set the mode of the BAR2 space to write-combining mode,
run ./scripts/setmtrr.csh (You need the root permission).

  [root@localhost driver]# ./setmtrr.csh
  
  Searching for HIB(s)... Found 0 PCI-X HIB(s). Found 1 PCIe HIB(s).
  Found 1 HIB(s) in total.
  
  Trying to set 1 MTRR(s)...
      echo "base=0xdf600000 size=0x1000 type=write-combining" > /proc/mtrr
  Done.
  
  current setting of MTRRs:
  reg00: base=0x00000000 (   0MB), size=2048MB: write-back, count=1
  reg01: base=0x80000000 (2048MB), size=1024MB: write-back, count=1
  reg02: base=0x100000000 (4096MB), size=200704MB: write-back, count=1
  reg03: base=0x200000000 (8192MB), size=1024MB: write-back, count=1
  reg04: base=0xdf600000 (3574MB), size=   4KB: write-combining, count=1

The output should include a line containing
"base=0xAAAAAAAA (XXXXMB), size=   4KB: write-combining",
where AAAAAAAA denotes the start address of the BAR2 space of HIB.
The value can be checked by hibtest 4 18:

  kawai@localhost[8]>../hibutil/hibtest 4 18
  hib[0] config 0x00000018: 0xdf600008
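
In a PCI memory BAR, the low four bits are attribute flags rather than
address bits (bit 0: I/O space indicator, bits 2-1: type, bit 3:
prefetchable). This is why the register reads 0xdf600008 while the MTRR
base is 0xdf600000. A minimal C sketch of the decoding, with
illustrative function names that are not part of the HIB control
library:

```c
#include <stdint.h>

/* Base address of a 32-bit memory BAR: clear the four attribute bits. */
uint32_t bar_mem_base(uint32_t bar)
{
    return bar & 0xFFFFFFF0u;
}

/* Nonzero if the BAR maps I/O space instead of memory space (bit 0). */
int bar_is_io(uint32_t bar)
{
    return (int)(bar & 0x1u);
}

/* Nonzero if the memory region is marked prefetchable (bit 3). */
int bar_is_prefetchable(uint32_t bar)
{
    return (int)((bar >> 3) & 0x1u);
}
```

Passing the value printed by hibtest 4 18 to bar_mem_base() gives the
address to look for in the list of MTRRs.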

MTRR may not be set up to write-combining mode if, for example, all 8
existing MTRRs are already assigned to other PCI devices, or the total
size of the main memory exceeds 4GB.  Depending on the chipset, this
problem may be avoided (e.g. by setting I/O remapping of the main
memory to addresses higher than 4GB, or setting memory hole granularity
to a larger value). Refer to the manual of your chipset or
motherboard.

Running hibtest 13 1 before and after MTRR configuration, you can see
improvement of the PIO write performance:

  Before MTRR configuration (x8)

  kawai@localhost[9]>./hibtest 13 1
  
  # hib[0] PIO write (host -> HIB)
  size: 64 PIO write: 7.319836 sec  109.292068 MB/s
  size: 128 PIO write: 6.857664 sec  116.657799 MB/s
  size: 256 PIO write: 6.597888 sec  121.250922 MB/s
  size: 512 PIO write: 6.458101 sec  123.875423 MB/s
  size: 1024 PIO write: 6.404411 sec  124.913905 MB/s
  size: 2048 PIO write: 6.397210 sec  125.054514 MB/s
  size: 4096 PIO write: 6.387041 sec  125.253617 MB/s
  size: 8192 PIO write: 6.390173 sec  125.192230 MB/s
  size: 16384 PIO write: 6.384816 sec  125.297269 MB/s


  After MTRR configuration (x8)

  kawai@localhost[10]>./hibtest 13 1
  
  # hib[0] PIO write (host -> HIB)
  size: 64 PIO write: 2.037641 sec  392.610858 MB/s
  size: 128 PIO write: 1.233335 sec  648.647763 MB/s
  size: 256 PIO write: 0.822831 sec  972.253211 MB/s
  size: 512 PIO write: 0.639186 sec  1251.591587 MB/s
  size: 1024 PIO write: 0.620417 sec  1289.455073 MB/s
  size: 2048 PIO write: 0.620460 sec  1289.365885 MB/s
  size: 4096 PIO write: 0.620398 sec  1289.495211 MB/s
  size: 8192 PIO write: 0.620425 sec  1289.438721 MB/s
  size: 16384 PIO write: 0.620416 sec  1289.457550 MB/s


HIB Control Library Usage
--------------------------

The HIB control library provides an API to handle data transfer between
the host computer and HIB. In order to use the library, include the
header file ./include/hibutil.h in your own source code (written in
C or C++), and link ./lib/libhib.a.

Descriptions of some important functions provided by the library are
given below. For usage of other functions, see the source code
./hibutil/hibutil.c.

    Hib* hib_openMC(int devid)
      Obtains access permission of a HIB that has device ID 'devid'.
      If the HIB is already obtained by another process, this function
      blocks.

      A device ID 'devid' is a small integer uniquely assigned to each
      HIB. When n HIBs are installed in the system, device ID 0 to n-1
      are used.

      hib_openMC() returns a pointer to a variable of type 'Hib'. The
      variable stores information necessary to manage the HIB device
      opened. Some API functions require the pointer as their argument
      (cf. hib_dmawMC).

    void hib_closeMC(int devid)
      Releases access permission of a HIB that has device ID 'devid',
      so that other processes can obtain it.

    void hib_piowMC(int devid, int size, UINT64 *buf)

      The host computer writes data stored in the main memory to a HIB
      that has device ID 'devid'. The size of the data is given by 'size'
      (in 8-byte units), and the start address is given by 'buf'.

      As the buffer pointed by 'buf', you can specify a memory region
      allocated by usual methods, such as an array of type UINT64
      statically allocated, or a region dynamically allocated with
      malloc().

    void hib_start_dmawMC(int devid, int size, UINT64 *buf)

      The host computer sends a DMA-write request to a HIB that has
      device ID 'devid', which will kick off a data transfer from the
      HIB to the host. The size of the data is given by 'size' (in 8-byte
      units), and the address of the receiving buffer is given by 'buf'.

      Note that you cannot specify an arbitrary memory region as the
      receiving buffer. Only the memory region 'h->dmaw_buf', or
      'h->dmaw_buf+offset', can be used as 'buf'. Here, 'h' denotes a
      pointer to a variable of type Hib returned by hib_openMC(), and
      the value 'offset+size' should not exceed 32k bytes.  The address
      pointed to by 'h->dmaw_buf' is a contiguous memory region allocated
      inside the Linux kernel space which is mapped to the user space.

      In order to store data received from the HIB into a buffer in a
      usual memory region, such as a statically allocated array, or a
      region dynamically allocated with malloc(), you need to copy the
      data from 'h->dmaw_buf' to the buffer.

    int hib_finish_dmawMC(int devid)
      The host computer waits for completion of a DMA write transfer
      started by 'hib_start_dmawMC()'.

    UINT32 hib_config_readMC(int devid, UINT32 addr)
      Reads the value of the PCI Configuration Register at address 'addr'
      of a HIB that has device ID 'devid'.

    void hib_config_writeMC(int devid, UINT32 addr, UINT32 value)
      Writes 'value' to the PCI Configuration Register at address
      'addr' of a HIB that has device ID 'devid'.

    UINT32 hib_mem_readMC(int devid, UINT32 addr)
      Reads the value of the HIB Local Register at address 'addr' of a HIB
      that has device ID 'devid'. See ./templates/hibctl.vhd for the
      address map of the Local Registers.

    void hib_mem_writeMC(int devid, UINT32 addr, UINT32 value)
      Writes 'value' to the HIB Local Register at address
      'addr' of a HIB that has device ID 'devid'.
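
The buffer restriction described for hib_start_dmawMC() implies a copy
step after each DMA write when the data is wanted in an ordinary
buffer. The following is a minimal sketch of that step, not code from
the kit. The function name hib_copy_out() and the constant
DMAW_BUF_BYTES are hypothetical. The sketch assumes that 'offset' and
'size' are both counted in 8-byte words and that the 32k-byte limit
applies to their sum, and it uses uint64_t where the library uses
UINT64. In a real program 'dma_buf' would be 'h->dmaw_buf'.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed size of the DMA-write buffer (32k bytes, per the text above). */
#define DMAW_BUF_BYTES (32u * 1024u)

/*
 * Copy 'size' 8-byte words, starting 'offset' words into the DMA buffer,
 * to an ordinary user buffer 'dst'. Returns 0 on success, or -1 if the
 * requested region would extend past the assumed 32k-byte limit.
 */
int hib_copy_out(const uint64_t *dma_buf, size_t offset, size_t size,
                 uint64_t *dst)
{
    const size_t max_words = DMAW_BUF_BYTES / sizeof(uint64_t);

    if (offset > max_words || size > max_words - offset) {
        return -1; /* region does not fit in the DMA buffer */
    }
    memcpy(dst, dma_buf + offset, size * sizeof(uint64_t));
    return 0;
}
```

The copy should be made only after hib_finish_dmawMC() has returned,
since the DMA buffer contents are not complete before that point.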

You can find an example application program at
'./sample/loopback.c', which shows usage of the HIB control
library. It performs a simple loopback transfer: it transmits 10 * 8
bytes of data from the host computer. HIB receives the data, and then sends
it back to the host computer. The host computer compares the data
transmitted and received, and reports the result.

  kawai@localhost[9]>./loopback
  0x0000  sent : 0x123456789abc0000    received : 0x123456789abc0000  OK
  0x0001  sent : 0x123456789abc0001    received : 0x123456789abc0001  OK
  0x0002  sent : 0x123456789abc0002    received : 0x123456789abc0002  OK
  0x0003  sent : 0x123456789abc0003    received : 0x123456789abc0003  OK
  0x0004  sent : 0x123456789abc0004    received : 0x123456789abc0004  OK
  0x0005  sent : 0x123456789abc0005    received : 0x123456789abc0005  OK
  0x0006  sent : 0x123456789abc0006    received : 0x123456789abc0006  OK
  0x0007  sent : 0x123456789abc0007    received : 0x123456789abc0007  OK
  0x0008  sent : 0x123456789abc0008    received : 0x123456789abc0008  OK
  0x0009  sent : 0x123456789abc0009    received : 0x123456789abc0009  OK


======================
 4. Advanced Usage
======================

In this section, advanced usage of GPCIe, such as usage without
HIB and usage with external PHY chips, is shown.

4.1 Updating the Source Code
============================

The source code of GPCIe is split into multiple VHDL files in the
'templates/' directory. For the user's convenience, all files which entity
'hib' relies on are packed into a single file 'hib.vhd'. Similarly,
files necessary for entities 'gpcie' and 'phy125' are packed into
'gpcie.vhd' and 'phy.vhd', respectively.

After modifying the source code, change directory to 'gpciepkg' and
run 'make'. The modification will be reflected in 'hib.vhd',
'gpcie.vhd', and 'phy.vhd'.


4.2 Usage with External PHY Chip
================================

In order to use external PHY chips instead of the embedded Gbit
transceivers, you need to modify entity 'hib' defined in
'templates/hibtop.vhd'.

Entity 'hib' internally uses instances of three entities: 'hibctl',
'gpcie', and 'phy125'.

  hib  --+-- hibctl
         |
         +-- gpcie
         |              
         +-- phy125

You need two modifications for these instances. First, remove the
instance of 'phy125', and also remove the PIPE interface connection
between 'phy125' and 'gpcie'. The instance 'phy125' is a wrapper for
the embedded transceivers that implements PHY PCS and PHY PMA layers.
These layers are realized by the external PHY chips, and thus 'phy125'
is not necessary.

Next, connect the PIPE interface of the instance 'gpcie' to that of
the PHY chips. To do this, you need to hardwire I/O pins of the PHY
chips and the FPGA device, and then connect the I/O pins to the port
of the instance 'hibctl'.

4.3 Direct Handling of GPCIe Engine
===================================

Although HIB provides a simple interface to the backend logic, it
cannot take full advantage of GPCIe functions. For example,
in order to:

  . implement PIO read/write transfer with address, byte enable, and
    wait control,

  . assign all Base Address Registers (BAR0..5) to arbitrary purposes, or,

  . use multiple DMA channels (8 channels at max),

you need to handle GPCIe engine directly from the backend logic.  For
this purpose, instantiate entity 'gpcie' (which is defined in
'gpcie.vhd') in your design.


================================
 5. Details of the VHDL Entities
================================

VHDL entities 'hib', 'gpcie', and 'phy125' have various generic
parameters and I/O ports. In the following, descriptions of some
important generic parameters and all I/O ports of these entities are
given.

The default values of generic parameters of entity 'gpcie' are
optimized for HIB. You may override them depending on your design
requirements.


---------------------
Details of Entity HIB
---------------------

entity hib is
  generic (
    DEVICE        : string := "Arria GX"; -- Target FPGA device. Should be set to "Arria GX" or "Stratix II GX".
    NLANE         : integer;              -- Lane width of the PCI Express link. Should be set to 4 or 8.
    PIOWBUF_DEPTH : integer := 8;         -- Depth of the PIO write buffer. Default value 8 denotes 256 (= 2^8) words.
    TXBUF_DEPTH   : integer := 10;        -- Depth of the backend_data receiving buffer. Default value 10 denotes 1024 (= 2^10) words.

    USE_CLK32     : integer := 1          -- Should be set to 1 whenever possible. You may set this value to 0 if you cannot
                                          -- supply clk32 input. Then HIB tries to boot without using clk32, at the risk of malfunction.
  );
  port (
    phy_linkup    : out std_logic;                             -- Asserted when the PCIe link training in the PHY layer is completed.
    dl_linkup     : out std_logic;                             -- Asserted when the PCIe link initialization in the Data Link layer is completed.
    clk32         : in  std_logic;                             -- A clock input used to generate timing for power on reset signal and
                                                               -- transceiver calibration. The clock frequency can be any value
                                                               -- in the range of 10MHz-125MHz.
    clk100_ext    : in  std_logic;                             -- A 100MHz differential input for Gbit transceiver reference clock.
    mperst        : in  std_logic;                             -- An active low reset signal.

    --
    -- PCI Express Serial Interface
    --
    rx_in         : in  std_logic_vector(NLANE-1 downto 0);    -- Input from the PCI Express high-speed serial receiver port.
    tx_out        : out std_logic_vector(NLANE-1 downto 0);    -- Output to the PCI Express high-speed serial transmitter port.

    wake          : out std_logic;                             -- Not used.
    clk_out       : out std_logic;                             -- A 125MHz clock output generated in the PHY PCS layer based on 'clk100_ext' input.
                                                               -- The interface to the backend is synchronized to this clock.
    --
    -- Interface to the Backend Logic
    --
    hib_we        : out std_logic;                             -- Write enable for 'hib_data', which is driven by HIB.
    hib_data      : out std_logic_vector(NLANE*16-1 downto 0); -- Data output to the backend logic.
    backend_we    : in  std_logic;                             -- Write enable for 'backend_data', which is driven by the backend logic.
    backend_data  : in  std_logic_vector(NLANE*16-1 downto 0); -- Data input from the backend logic.
    reset_backend : out std_logic;                             -- Active high reset output to the backend logic.
    board_info    : in  std_logic_vector(31 downto 0)          -- Initial value of a mailbox register 'board_info'. This register can be
                                                               -- read/written by the host computer, and be used by the backend logic
                                                               -- for an arbitrary purpose.
  );
end hib;


-----------------------
Details of Entity GPCIE
-----------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
use work.gpciepkg.all;  -- Use gpciepkg library, which is defined in gpcie.vhd.

entity gpcie is
  generic (
    NLANE             : integer range 1 to 8 := 8; -- Lane width of the PCI Express link. Should be set to 4 or 8.
    NDMACH            : integer range 0 to 7 := 2; -- Number of DMA channels (8 channels at max).

    MAX_READ_REQ_SIZE : integer := 256; -- Max. read request size, in bytes.
    MAX_PAYLOAD       : integer := 256; -- Max. payload size supported, in bytes.
                                        -- The max. payload size actually used is determined through negotiation with the upstream device.

    CA_PH_VC0_INIT    : integer := 16;  -- Depth of the Rx Flow Control buffer (posted, header) in TLP units,
                                        -- i.e., the default value 16 means that at most 16 Transaction-layer packets (TLPs) can be buffered.
    CA_PD_VC0_INIT    : integer := 64;  -- Depth of the Rx Flow Control buffer (posted, data) in 16-byte units.

    CA_NPH_VC0_INIT   : integer := 2;   -- Depth of the Rx Flow Control buffer (non-posted, header) in TLP units.
    CA_NPD_VC0_INIT   : integer := 16;  -- Depth of the Rx Flow Control buffer (non-posted, data) in 16-byte units.

    CA_CH_VC0_INIT    : integer := 2;   -- Depth of the Rx Flow Control buffer (completion, header) in TLP units.
    CA_CD_VC0_INIT    : integer := 16;  -- Depth of the Rx Flow Control buffer (completion, data) in 16-byte units.

    CL_PH_VC0_INIT    : integer := 16;  -- Depth of the Tx Flow Control buffer (posted, header) in TLP units.
    CL_PD_VC0_INIT    : integer := 64;  -- Depth of the Tx Flow Control buffer (posted, data) in 16-byte units.

    CL_NPH_VC0_INIT   : integer := 2;   -- Depth of the Tx Flow Control buffer (non-posted, header) in TLP units.
    CL_NPD_VC0_INIT   : integer := 16;  -- Depth of the Tx Flow Control buffer (non-posted, data) in 16-byte units.

    CL_CH_VC0_INIT    : integer := 2;   -- Depth of the Tx Flow Control buffer (completion, header) in TLP units.
    CL_CD_VC0_INIT    : integer := 16;  -- Depth of the Tx Flow Control buffer (completion, data) in 16-byte units.

    CFG_VENDOR_ID_INIT           : std_logic_vector(15 downto 0) := x"1b1a";      -- Vendor ID of KFCR. Do not modify (see the license agreement).
    CFG_DEVICE_ID_INIT           : std_logic_vector(15 downto 0) := x"0e70";      -- Device ID. Default value 0E70h is the one KFCR assigned to HIB.
    CFG_REVISION_ID_INIT         : std_logic_vector( 7 downto 0) := x"01";        -- Revision ID.
    CFG_CLASS_CODE_INIT          : std_logic_vector(23 downto 0) := x"ff0000";    -- PCI class code.

    -- Initial values of PCI Base Address Registers 0-5 and the expansion ROM base address register.
    CFG_BAR0_INIT                : std_logic_vector(31 downto 0) := x"ffff8008";  -- 32kB, prefetchable, 32-bit address, memory space.
    CFG_BAR1_INIT                : std_logic_vector(31 downto 0) := x"fffff008";  --  4kB, prefetchable, 32-bit address, memory space.
    CFG_BAR2_INIT                : std_logic_vector(31 downto 0) := x"ffff8008";  -- 32kB, prefetchable, 32-bit address, memory space.
    CFG_BAR3_INIT                : std_logic_vector(31 downto 0) := x"00000000";  -- Not used.
    CFG_BAR4_INIT                : std_logic_vector(31 downto 0) := x"00000000";  -- Not used.
    CFG_BAR5_INIT                : std_logic_vector(31 downto 0) := x"00000000";  -- Not used.
    CFG_BAR_ROM_INIT             : std_logic_vector(31 downto 0) := x"00000000";  -- Not used.
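    -- Worked example (added for illustration; it assumes the values follow the standard PCI BAR
    -- sizing convention, with which the documented defaults are consistent):
    --   x"ffff8008": bit 0 = 0 (memory space), bits 2..1 = "00" (32-bit address), bit 3 = 1 (prefetchable).
    --                Size = not(x"ffff8000") + 1 = x"00008000" = 32768 bytes = 32kB.
    --   x"fffff008": same attribute bits.
    --                Size = not(x"fffff000") + 1 = x"00001000" =  4096 bytes =  4kB.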

    CFG_SUB_VENDOR_ID_INIT       : std_logic_vector(15 downto 0) := x"1b1a";      -- Sub vendor ID.
    CFG_SUB_DEVICE_ID_INIT       : std_logic_vector(15 downto 0) := x"0e70";      -- Sub device ID.
    CFG_INT_PIN_INIT             : std_logic_vector( 7 downto 0) := x"00"         -- Interrupt pin in use. Should be set to 0, since
                                                                                  -- GPCIe currently does not support interrupt signal.
  );

  port (
    phy_linkup      : out std_logic;  -- Asserted when the PCIe link training in the PHY layer is completed.
    dl_linkup       : out std_logic;  -- Asserted when the PCIe link initialization in the Data Link layer is completed.
    
    clk             : in  std_logic;  -- A 125MHz clock input generated in the PHY PCS layer.
                                      -- All I/O ports including PIPE interface are synchronized to this clock.
    rstn            : in  std_logic;  -- An active low reset signal.

    --
    -- PIPE Interface
    --
    phystatus       : in  std_logic;
    powerdown       : out std_logic_vector(1 downto 0);
    txdetectrx      : out std_logic;
    txdata          : out std_logic_vector(NLANE*16-1 downto 0);
    txdatak         : out std_logic_vector(NLANE*2-1 downto 0);
    txelecidle      : out std_logic_vector(NLANE-1 downto 0);
    txcompl         : out std_logic_vector(NLANE-1 downto 0);
    rxpolarity      : out std_logic_vector(NLANE-1 downto 0);
    rxdata          : in  std_logic_vector(NLANE*16-1 downto 0);
    rxdatak         : in  std_logic_vector(NLANE*2-1 downto 0);
    rxvalid         : in  std_logic_vector(NLANE-1 downto 0);
    rxelecidle      : in  std_logic_vector(NLANE-1 downto 0);
    rxstatus        : in  std_logic_vector(NLANE*3-1 downto 0);

    --
    -- Application Interface
    --

    -- Target (Slave) read/write Interface
    slv_readreq     : out std_logic;                             -- Read request. The read will start right at the clock cycle when 'slv_accept' is asserted.
    slv_writereq    : out std_logic;                             -- Write request. The write will start right at the clock cycle when 'slv_accept' is asserted.
    slv_accept      : in  std_logic;                             -- Accept for read/write request.
    slv_read        : out std_logic;                             -- When this is asserted, the backend logic should supply data to 'slv_datain' in the next clock cycle.
    slv_write       : out std_logic;                             -- Indicates data is present on 'slv_dataout'.
    slv_bar         : out std_logic_vector(6 downto 0);          -- Base address space from/to which current transaction is reading/writing.
    slv_addr        : out std_logic_vector(63 downto 0);         -- Local address from/to which current transaction is reading/writing.
    slv_bytevalid   : out std_logic_vector(NLANE*2-1 downto 0);  -- Byte enable of 'slv_dataout'. Valid only for write transaction.
    slv_bytecount   : out std_logic_vector(11 downto 0);         -- Remaining byte count of current transaction.
    slv_dataout     : out std_logic_vector(NLANE*16-1 downto 0); -- Data output from GPCIe.
    slv_datain      : in  std_logic_vector(NLANE*16-1 downto 0); -- Data input to GPCIe.

    -- Initiator (Master) read/write Interface
    ms_wrchannel      : out std_logic_vector(NDMACH-1 downto 0);   -- DMA channel currently occupying 'ms_wrdata', the data path for DMA write.
    ms_write          : out std_logic;                             -- When this is asserted, the backend logic should supply data to 'ms_wrdata' in the next clock cycle.
    ms_wraddr         : out std_logic_vector(31 downto 0);         -- Local address which current DMA write transaction is reading from.
    ms_wrdata         : in  std_logic_vector(NLANE*16-1 downto 0); -- Data input to GPCIe, used for DMA write.

    ms_rdchannel      : out std_logic_vector(NDMACH-1 downto 0);   -- DMA channel currently occupying 'ms_rddata', the data path for DMA read.
    ms_read           : out std_logic;                             -- Indicates data is present on 'ms_rddata'.
    ms_rdaddr         : out std_logic_vector(31 downto 0);         -- Local address which current DMA read transaction is writing to.
    ms_rddata         : out std_logic_vector(NLANE*16-1 downto 0); -- Data output from GPCIe, used for DMA read.

    -- DMA Controller Interface
    --    An independent set of interface signals is provided for each DMA channel n,
    --    where n is a channel ID in 0..NDMACH-1.
    --    For example, the 'dma_control' signal for the n-th channel can be accessed as 'dma_control(n)(6 downto 0)'.
    --    The two-dimensional array types used to declare such signals are defined in the package 'gpciepkg'.
    --       
    dma_control       : in  each7b(NDMACH-1 downto 0);             -- Signals to control DMA transfer.
                                                                   --   dma_control(n)(0) : Write enable for dma_paddrlow_in(n).
                                                                   --   dma_control(n)(1) : Write enable for dma_paddrhigh_in(n).
                                                                   --   dma_control(n)(2) : Write enable for dma_laddr_in(n).
                                                                   --   dma_control(n)(3) : Write enable for dma_size_in(n).
                                                                   --   dma_control(n)(4) : Write enable for dma_param(n).
                                                                   --   dma_control(n)(5) : A single pulse input starts a new DMA transfer.
                                                                   --   dma_control(n)(6) : A single pulse input aborts the DMA transfer in progress.
    dma_param         : in  each16b(NDMACH-1 downto 0);            -- Signals to set DMA-transfer parameters.
                                                                   --   dma_param(n)(8) : Direction of the transfer.
                                                                   --                     0:read (read from the host computer)  1:write (write to the host computer)
    dma_status        : out each4b(NDMACH-1 downto 0);             -- Signals to show DMA status.
                                                                   --   dma_status(n)(3) : 0:a transfer is in progress.  1:no transfer is in progress.
    dma_fifocnt       : in  each13b(NDMACH-1 downto 0);            -- For DMA write : The number of data bytes the backend logic can supply to GPCIe.
                                                                   -- For DMA read  : The number of data bytes the backend logic can receive from GPCIe.

    dma_paddrlow_in   : in  each32b(NDMACH-1 downto 0);            -- Lower 32 bits of the PCI address at which a DMA transfer starts.
    dma_paddrhigh_in  : in  each32b(NDMACH-1 downto 0);            -- Upper 32 bits of the PCI address at which a DMA transfer starts.
    dma_laddr_in      : in  each32b(NDMACH-1 downto 0);            -- Local address at which a DMA transfer starts.
    dma_size_in       : in  each32b(NDMACH-1 downto 0);            -- Size of a DMA transfer, in bytes.

    dma_paddrlow_out  : out each32b(NDMACH-1 downto 0);            -- Lower 32 bits of the PCI address at which a DMA transfer is in progress.
    dma_paddrhigh_out : out each32b(NDMACH-1 downto 0);            -- Upper 32 bits of the PCI address at which a DMA transfer is in progress.
    dma_laddr_out     : out each32b(NDMACH-1 downto 0);            -- Local address at which a DMA transfer is in progress.
    dma_size_out      : out each32b(NDMACH-1 downto 0)             -- Remaining byte count of a DMA transfer in progress.
  );

end gpcie;
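
The DMA controller signals listed above can be driven by a small state
machine. The following is a minimal sketch, not part of the kit: the
entity 'dma_kick' and all of its port names are hypothetical, and it
serves a single channel n. It assumes that the element type of
'dma_control', 'dma_param' and 'dma_status' is std_logic_vector, that
the backend logic holds dma_paddrlow_in(n), dma_paddrhigh_in(n),
dma_laddr_in(n) and dma_size_in(n) stable, that all five write enables
may be asserted in the same clock cycle, that the start pulse may
follow one cycle later, and that the bits of dma_param(n) not
documented here may be left at 0. This document confirms none of these
timing assumptions, so check them against the reference design.

library ieee;
use ieee.std_logic_1164.all;

entity dma_kick is  -- Hypothetical helper, not part of the kit.
  port (
    clk     : in  std_logic;                      -- Same 125MHz clock as 'gpcie'.
    rstn    : in  std_logic;                      -- Active low reset (treated as synchronous here).
    start   : in  std_logic;                      -- Request from the backend logic to start a transfer.
    to_host : in  std_logic;                      -- 0:read from the host computer  1:write to the host computer.
    status  : in  std_logic_vector(3 downto 0);   -- Connect to dma_status(n).
    control : out std_logic_vector(6 downto 0);   -- Connect to dma_control(n).
    param   : out std_logic_vector(15 downto 0);  -- Connect to dma_param(n).
    busy    : out std_logic                       -- High while dma_status(n)(3) reports a transfer in progress.
  );
end dma_kick;

architecture rtl of dma_kick is
  type state_t is (IDLE, LOAD, KICK);
  signal state : state_t := IDLE;
begin
  busy <= not status(3);

  process (clk)
  begin
    if rising_edge(clk) then
      control <= (others => '0');        -- Every control bit is a one-cycle pulse.
      if rstn = '0' then
        state <= IDLE;
        param <= (others => '0');
      else
        case state is
          when IDLE =>
            -- Start only when no transfer is in progress.
            if start = '1' and status(3) = '1' then
              param    <= (others => '0');  -- Assumption: undocumented bits stay 0.
              param(8) <= to_host;          -- Direction of the transfer.
              state    <= LOAD;
            end if;
          when LOAD =>
            control(4 downto 0) <= "11111";  -- Write enables for address, size and parameter registers.
            state <= KICK;
          when KICK =>
            control(5) <= '1';               -- Single pulse that starts the DMA transfer.
            state <= IDLE;
        end case;
      end if;
    end if;
  end process;
end rtl;

With this helper, the write enables are asserted for one clock cycle
and the start pulse follows in the next cycle. The abort input
dma_control(n)(6) is never asserted by this sketch.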


------------------------
Details of Entity PHY125
------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;

entity phy125 is
  generic (
    DEVICE       : string := "Arria GX"; -- Target FPGA device. Should be set to "Arria GX" or "Stratix II GX".
    NLANE        : integer;              -- Lane width of the PCI Express link. Should be set to 4 or 8.
    USE_CLK32    : integer := 1          -- Should be set to 1 whenever possible. You may set this value to 0 if you cannot
                                         -- supply the clk32 input. In that case, 'phy125' tries to initialize the
                                         -- transceiver at boot time without using clk32, at the risk of malfunction.

  );
  port (
    cal_blk_clk             : in    std_logic;    -- A clock input used for transceiver calibration.
                                                  -- The clock frequency can be any value in the range of 10MHz-125MHz.
    clk32                   : in    std_logic;    -- A clock input used to generate the timing of the power-on reset signal.
                                                  -- The clock frequency can be any value in the range of 10MHz-125MHz.
    clk100                  : in    std_logic;    -- A 100MHz differential input for the Gbit transceiver reference clock.
    clk125out               : out   std_logic;    -- A 125MHz clock output generated in the PHY PCS layer based on 'clk100' input.
                                                  -- PIPE interface is synchronized to this clock.
    clk125plllock           : out   std_logic;    -- Asserted when internal PLL is locked and clock output from 'clk125out' becomes stable.
    rstn                    : in    std_logic;    -- An active low reset signal.

    --
    -- PCI Express Serial Interface
    --
    rx_in                   : in    std_logic_vector(NLANE-1 downto 0);   -- Input from the PCI Express high-speed serial receiver port.
    tx_out                  : out   std_logic_vector(NLANE-1 downto 0);   -- Output to the PCI Express high-speed serial transmitter port.

    --
    -- PIPE Interface
    --
    wake                    : out   std_logic;
    phystatus               : out   std_logic;
    powerdown               : in    std_logic_vector(1 downto 0);
    txdetectrx              : in    std_logic;
    txdata                  : in    std_logic_vector(NLANE*16-1 downto 0);
    txdatak                 : in    std_logic_vector(NLANE*2-1 downto 0);
    txelecidle              : in    std_logic_vector(NLANE-1 downto 0);
    txcompl                 : in    std_logic_vector(NLANE-1 downto 0);
    rxpolarity              : in    std_logic_vector(NLANE-1 downto 0);
    rxdata                  : out   std_logic_vector(NLANE*16-1 downto 0);
    rxdatak                 : out   std_logic_vector(NLANE*2-1 downto 0);
    rxvalid                 : out   std_logic_vector(NLANE-1 downto 0);
    rxelecidle              : out   std_logic_vector(NLANE-1 downto 0);
    rxstatus                : out   std_logic_vector(NLANE*3-1 downto 0)
  );
end phy125;
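
Since 'clk125out' is stable only while 'clk125plllock' is asserted,
logic clocked by 'clk125out' is commonly held in reset until the PLL
has locked. The following is a minimal sketch of such a reset
generator, not part of the kit: the entity 'core_reset' and the port
'core_rstn' are hypothetical, and this document does not state whether
HIB already does something equivalent internally. The reset is
asserted asynchronously and released synchronously to 'clk125out'.

library ieee;
use ieee.std_logic_1164.all;

entity core_reset is  -- Hypothetical helper, not part of the kit.
  port (
    clk125out     : in  std_logic;  -- From phy125.
    clk125plllock : in  std_logic;  -- From phy125.
    rstn          : in  std_logic;  -- Active low external reset.
    core_rstn     : out std_logic   -- Active low reset for logic clocked by 'clk125out'.
  );
end core_reset;

architecture rtl of core_reset is
  signal sync : std_logic_vector(1 downto 0) := "00";
begin
  process (clk125out, rstn, clk125plllock)
  begin
    if rstn = '0' or clk125plllock = '0' then
      sync <= "00";                  -- Assert reset immediately.
    elsif rising_edge(clk125out) then
      sync <= sync(0) & '1';         -- Release reset after two clock edges.
    end if;
  end process;

  core_rstn <= sync(1);
end rtl;

In this sketch 'core_rstn' could, for example, drive the 'rstn' input
of 'gpcie' when 'gpcie' and 'phy125' are instantiated directly rather
than through HIB.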


--------------------------------------------------------------------------------------------------
Contact address for questions and bug reports:
K&F Computing Research Co. (support@kfcr.jp)
