DPDK Series 26: Cache Management

1. The usefulness of Cache

To be honest, I never really wanted to write about this topic, mainly because it touches so many things. Even without DPDK you cannot get around the Cache: it shows up in computer architecture, in operating systems, in memory-oriented frameworks, and in multi-threaded false sharing. It is an unavoidable subject, and the usual treatments tend to blur together.
So the focus here is not the theory of the Cache itself; there are more than enough books and online resources for that. Instead we analyze how DPDK uses the Cache, in other words, what the Cache is used for in DPDK:
1. To reduce contention on shared memory and locks between cores, which ultimately also improves read and write speed (a sketch of this point follows the list).
2. To improve read and write speed directly by keeping hot data close to the CPU.
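As a concrete illustration of the first point, DPDK's mempool keeps a small per-lcore object cache so that most allocations and frees never touch the shared pool. A minimal sketch, assuming an already initialized EAL; the pool name and sizes are purely illustrative:

#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Illustrative sizes only. */
#define NB_MBUFS   8192
#define CACHE_SIZE 256   /* per-lcore cache of mbufs */

static struct rte_mempool *
create_pkt_pool(void)
{
	/* Each lcore keeps up to CACHE_SIZE mbufs in its own cache; only when
	 * that cache over- or under-flows does the lcore fall back to the
	 * shared pool, so cores rarely contend with each other. */
	return rte_pktmbuf_pool_create("pkt_pool", NB_MBUFS, CACHE_SIZE,
				       0, RTE_MBUF_DEFAULT_BUF_SIZE,
				       rte_socket_id());
}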

2. Cache processing in DPDK

1. Support for the Cache
Huge page memory: fewer, larger pages mean fewer TLB misses and a better hit rate (see the initialization sketch after this list).
DDIO (Intel Data Direct I/O): the NIC DMAs packet data directly into the last-level cache, skipping the round trip through main memory.
TLB: the TLB combined with huge pages is another way to raise the effective Cache hit rate.
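Hugepages themselves are reserved at the operating-system level (for example via nr_hugepages); the application only initializes the EAL, which then maps its memory from those hugepages. A minimal sketch with illustrative core and memory arguments; adjust them to your own machine:

#include <stdlib.h>
#include <rte_eal.h>
#include <rte_debug.h>

int
main(int argc, char **argv)
{
	/* Illustrative EAL arguments: run on cores 0-1 and reserve 512 MB of
	 * hugepage-backed memory on socket 0. */
	char *eal_argv[] = { argv[0], "-l", "0-1", "--socket-mem", "512" };
	int eal_argc = 5;

	(void)argc;
	if (rte_eal_init(eal_argc, eal_argv) < 0)
		rte_exit(EXIT_FAILURE, "Cannot init EAL\n");

	/* ... set up mempools, ports, lcores ... */
	return 0;
}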

2. Prefetch instructions
Prefetching is normally done in hardware at the Cache level, at most steered by the OS, and is not exposed to application code. As the technology developed, however, CPUs began to expose software prefetch instructions, and DPDK uses them to get data loaded into the Cache ahead of time and so improve execution efficiency. Note that doing this in software requires a well-thought-out strategy; a clumsy imitation of what the hardware does can easily make things worse.
Prefetch instructions are ultimately assembly instructions, though compilers and libraries also provide wrapped intrinsics and higher-level APIs.
In DPDK, to keep the clock cycles matched with data processing, that is, to run at maximum efficiency, the data being worked on must already be in the Cache, otherwise performance degrades severely. The prefetch instruction is one means of making that hit happen; DPDK has other means too, but the goal is always the same: keep instruction execution and data availability in step.
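As a rough illustration of the idea (my own sketch, not taken from any particular driver), a burst-processing loop can ask for the next packet's data while it is still working on the current one; process_one() here is a hypothetical handler:

#include <rte_mbuf.h>
#include <rte_prefetch.h>

/* Hypothetical per-packet handler; a real application would parse or forward here. */
static void
process_one(struct rte_mbuf *m)
{
	rte_pktmbuf_free(m);
}

static void
process_burst(struct rte_mbuf **pkts, uint16_t nb_pkts)
{
	for (uint16_t i = 0; i < nb_pkts; i++) {
		/* Start pulling the next packet's data into the cache while the
		 * CPU is still busy with the current packet. */
		if (i + 1 < nb_pkts)
			rte_prefetch0(rte_pktmbuf_mtod(pkts[i + 1], void *));

		process_one(pkts[i]);
	}
}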

3. Cache consistency handling in DPDK
Modern computers are multi-core or multi-CPU. How does DPDK handle different cores accessing the same cached data, in other words, how does it keep the data consistent? The solution is simple and blunt: give each core its own data. Each core reads and writes only its own queues and structures, so in the common case there is no conflict. The price is that consistency of genuinely shared state has to be handled by design: avoid sharing wherever possible, and where you cannot, fall back on locks or on the hardware coherence protocols.
In addition, to keep conflicts to a minimum, structures are aligned to the Cache Line (the smallest unit the Cache manages) when they are allocated. In other words, data owned by different cores should never end up packed into the same Cache Line; this is the false-sharing problem, and the sketch below shows the usual per-core layout.
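A minimal sketch of that layout, applying __rte_cache_aligned to a hypothetical per-core statistics array so that no two cores' counters ever share a Cache Line:

#include <stdint.h>
#include <rte_common.h>   /* __rte_cache_aligned */
#include <rte_lcore.h>    /* rte_lcore_id(), RTE_MAX_LCORE */

/* Hypothetical per-core counters; the alignment pads every element out to a
 * full Cache Line, so one core's writes never invalidate another core's line. */
struct core_stats {
	uint64_t rx_pkts;
	uint64_t tx_pkts;
} __rte_cache_aligned;

static struct core_stats stats[RTE_MAX_LCORE];

static inline void
count_rx(void)
{
	/* Each core only ever touches its own element. */
	stats[rte_lcore_id()].rx_pkts++;
}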
There are then several kinds of coherence protocols, such as directory-based protocols and bus-snooping protocols; they are not expanded on here, and you can look them up if you are interested.

3. Source code analysis

Let's take a look at the Cache-related parts of the DPDK source code, starting with the ring:

//Ring
struct rte_ring {
	/*
	 * Note: this field kept the RTE_MEMZONE_NAMESIZE size due to ABI
	 * compatibility requirements, it could be changed to RTE_RING_NAMESIZE
	 * next time the ABI changes
	 */
	char name[RTE_MEMZONE_NAMESIZE] __rte_cache_aligned; /**< Name of the ring. */
	int flags;               /**< Flags supplied at creation. */
	const struct rte_memzone *memzone;
			/**< Memzone, if any, containing the rte_ring */
	uint32_t size;           /**< Size of ring. */
	uint32_t mask;           /**< Mask (size-1) of ring. */
	uint32_t capacity;       /**< Usable size of ring */

	char pad0 __rte_cache_aligned; /**< empty cache line */

	/** Ring producer status. */
	struct rte_ring_headtail prod __rte_cache_aligned;
	char pad1 __rte_cache_aligned; /**< empty cache line */

	/** Ring consumer status. */
	struct rte_ring_headtail cons __rte_cache_aligned;
	char pad2 __rte_cache_aligned; /**< empty cache line */
};
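For context, a hedged sketch of how such a ring is typically created and used; the single-producer/single-consumer flags and the ring size are chosen purely for illustration:

#include <rte_lcore.h>
#include <rte_ring.h>

static struct rte_ring *r;

static int
setup_ring(void)
{
	/* 1024 slots (must be a power of two), single producer / single consumer. */
	r = rte_ring_create("example_ring", 1024, rte_socket_id(),
			    RING_F_SP_ENQ | RING_F_SC_DEQ);
	return (r == NULL) ? -1 : 0;
}

static void
pass_object(void *obj)
{
	/* Producer side: because prod and cons sit on separate cache lines in
	 * struct rte_ring, the producer's writes do not keep invalidating the
	 * line the consumer is reading. */
	if (rte_ring_enqueue(r, obj) != 0) {
		/* ring full: drop or retry, depending on the application */
	}
}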

// librte_eal/common/include/rte_common.h
/** Force alignment to cache line. */
#define __rte_cache_aligned __rte_aligned(RTE_CACHE_LINE_SIZE)
#define __rte_aligned(a) __attribute__((__aligned__(a)))  

The macro __rte_cache_aligned shows up throughout the basic data structures; it is simply the compiler's aligned attribute applied with the Cache Line size, so that each annotated member or structure starts on a Cache Line boundary.
Look again at the configuration data structure definition for each core:

struct lcore_conf {
	uint16_t nb_rx_queue;
	struct lcore_rx_queue rx_queue_list[MAX_RX_QUEUE_PER_LCORE];
	uint16_t tx_queue_id[RTE_MAX_ETHPORTS];
	struct buffer tx_mbufs[RTE_MAX_ETHPORTS];
	struct ipsec_ctx inbound;
	struct ipsec_ctx outbound;
	struct rt_ctx *rt4_ctx;
	struct rt_ctx *rt6_ctx;
	struct {
		struct rte_ip_frag_tbl *tbl;
		struct rte_mempool *pool_dir;
		struct rte_mempool *pool_indir;
		struct rte_ip_frag_death_row dr;
	} frag;
} __rte_cache_aligned; // always cache-line aligned, to avoid spanning Cache Lines
static struct lcore_conf lcore_conf[RTE_MAX_LCORE];

RTE_MAX_LCORE is the maximum number of supported cores. Each core accesses the array by its own index, which avoids multiple cores working on the same element. Similarly, modern NICs generally support multiple hardware queues, and DPDK assigns separate receive and transmit queues per core for the same reason; readers who followed the earlier installation chapters may remember configuring this. A minimal sketch of the per-core indexing pattern follows.
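The following is only a sketch of that pattern, assuming a hypothetical handle_rx() worker; in the real example the loop is launched on every worker core with rte_eal_mp_remote_launch() and does far more work:

#include <stdint.h>
#include <rte_common.h>
#include <rte_lcore.h>

/* Hypothetical per-queue worker, for illustration only. */
static void
handle_rx(struct lcore_rx_queue *q)
{
	/* placeholder: a real worker would poll the port/queue here */
	(void)q;
}

static int
main_loop(__rte_unused void *arg)
{
	/* Each worker indexes the array with its own lcore id, so no two cores
	 * ever touch the same (cache-line aligned) lcore_conf element. */
	struct lcore_conf *qconf = &lcore_conf[rte_lcore_id()];

	for (;;) {
		for (uint16_t i = 0; i < qconf->nb_rx_queue; i++)
			handle_rx(&qconf->rx_queue_list[i]);
	}
	return 0;
}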
Now take a look at prefetching on the receive path, using the ixgbe driver's receive function:

uint16_t
ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
		uint16_t nb_pkts)
{
	struct ixgbe_rx_queue *rxq;
	volatile union ixgbe_adv_rx_desc *rx_ring;
	volatile union ixgbe_adv_rx_desc *rxdp;
	struct ixgbe_rx_entry *sw_ring;
	struct ixgbe_rx_entry *rxe;
	struct rte_mbuf *rxm;
	struct rte_mbuf *nmb;
	union ixgbe_adv_rx_desc rxd;
	uint64_t dma_addr;
	uint32_t staterr;
	uint32_t pkt_info;
	uint16_t pkt_len;
	uint16_t rx_id;
	uint16_t nb_rx;
	uint16_t nb_hold;
	uint64_t pkt_flags;
	uint64_t vlan_flags;

	nb_rx = 0;
	nb_hold = 0;
	rxq = rx_queue;
	rx_id = rxq->rx_tail;
	rx_ring = rxq->rx_ring;
	sw_ring = rxq->sw_ring;
	vlan_flags = rxq->vlan_flags;
	while (nb_rx < nb_pkts) {
		/*
		 * The order of operations here is important as the DD status
		 * bit must not be read after any other descriptor fields.
		 * rx_ring and rxdp are pointing to volatile data so the order
		 * of accesses cannot be reordered by the compiler. If they were
		 * not volatile, they could be reordered which could lead to
		 * using invalid descriptor fields when read from rxd.
		 */
		rxdp = &rx_ring[rx_id];
		staterr = rxdp->wb.upper.status_error;
		if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
			break;
		rxd = *rxdp;

		/*
		 * End of packet.
		 *
		 * If the IXGBE_RXDADV_STAT_EOP flag is not set, the RX packet
		 * is likely to be invalid and to be dropped by the various
		 * validation checks performed by the network stack.
		 *
		 * Allocate a new mbuf to replenish the RX ring descriptor.
		 * If the allocation fails:
		 *    - arrange for that RX descriptor to be the first one
		 *      being parsed the next time the receive function is
		 *      invoked [on the same queue].
		 *
		 *    - Stop parsing the RX ring and return immediately.
		 *
		 * This policy do not drop the packet received in the RX
		 * descriptor for which the allocation of a new mbuf failed.
		 * Thus, it allows that packet to be later retrieved if
		 * mbuf have been freed in the mean time.
		 * As a side effect, holding RX descriptors instead of
		 * systematically giving them back to the NIC may lead to
		 * RX ring exhaustion situations.
		 * However, the NIC can gracefully prevent such situations
		 * to happen by sending specific "back-pressure" flow control
		 * frames to its peer(s).
		 */
		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
			   "ext_err_stat=0x%08x pkt_len=%u",
			   (unsigned) rxq->port_id, (unsigned) rxq->queue_id,
			   (unsigned) rx_id, (unsigned) staterr,
			   (unsigned) rte_le_to_cpu_16(rxd.wb.upper.length));

		nmb = rte_mbuf_raw_alloc(rxq->mb_pool);
		if (nmb == NULL) {
			PMD_RX_LOG(DEBUG, "RX mbuf alloc failed port_id=%u "
				   "queue_id=%u", (unsigned) rxq->port_id,
				   (unsigned) rxq->queue_id);
			rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed++;
			break;
		}

		nb_hold++;
		rxe = &sw_ring[rx_id];
		rx_id++;
		if (rx_id == rxq->nb_rx_desc)
			rx_id = 0;

		/* Prefetch next mbuf while processing current one. */
		rte_ixgbe_prefetch(sw_ring[rx_id].mbuf);

		/*
		 * When next RX descriptor is on a cache-line boundary,
		 * prefetch the next 4 RX descriptors and the next 8 pointers
		 * to mbufs.
		 */
		if ((rx_id & 0x3) == 0) {
			rte_ixgbe_prefetch(&rx_ring[rx_id]);
			rte_ixgbe_prefetch(&sw_ring[rx_id]);
		}

		rxm = rxe->mbuf;
		rxe->mbuf = nmb;
		dma_addr =
			rte_cpu_to_le_64(rte_mbuf_data_iova_default(nmb));
		rxdp->read.hdr_addr = 0;
		rxdp->read.pkt_addr = dma_addr;

		/*
		 * Initialize the returned mbuf.
		 * 1) setup generic mbuf fields:
		 *    - number of segments,
		 *    - next segment,
		 *    - packet length,
		 *    - RX port identifier.
		 * 2) integrate hardware offload data, if any:
		 *    - RSS flag & hash,
		 *    - IP checksum flag,
		 *    - VLAN TCI, if any,
		 *    - error flags.
		 */
		pkt_len = (uint16_t) (rte_le_to_cpu_16(rxd.wb.upper.length) -
				      rxq->crc_len);
		rxm->data_off = RTE_PKTMBUF_HEADROOM;
		rte_packet_prefetch((char *)rxm->buf_addr + rxm->data_off);
		rxm->nb_segs = 1;
		rxm->next = NULL;
		rxm->pkt_len = pkt_len;
		rxm->data_len = pkt_len;
		rxm->port = rxq->port_id;

		pkt_info = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
		/* Only valid if PKT_RX_VLAN set in pkt_flags */
		rxm->vlan_tci = rte_le_to_cpu_16(rxd.wb.upper.vlan);

		pkt_flags = rx_desc_status_to_pkt_flags(staterr, vlan_flags);
		pkt_flags = pkt_flags |
			rx_desc_error_to_pkt_flags(staterr, (uint16_t)pkt_info,
						   rxq->rx_udp_csum_zero_err);
		pkt_flags = pkt_flags |
			ixgbe_rxd_pkt_info_to_pkt_flags((uint16_t)pkt_info);
		rxm->ol_flags = pkt_flags;
		rxm->packet_type =
			ixgbe_rxd_pkt_info_to_pkt_type(pkt_info,
						       rxq->pkt_type_mask);

		if (likely(pkt_flags & PKT_RX_RSS_HASH))
			rxm->hash.rss = rte_le_to_cpu_32(
						rxd.wb.lower.hi_dword.rss);
		else if (pkt_flags & PKT_RX_FDIR) {
			rxm->hash.fdir.hash = rte_le_to_cpu_16(
					rxd.wb.lower.hi_dword.csum_ip.csum) &
					IXGBE_ATR_HASH_MASK;
			rxm->hash.fdir.id = rte_le_to_cpu_16(
					rxd.wb.lower.hi_dword.csum_ip.ip_id);
		}
		/*
		 * Store the mbuf address into the next entry of the array
		 * of returned packets.
		 */
		rx_pkts[nb_rx++] = rxm;
	}
	rxq->rx_tail = rx_id;

	/*
	 * If the number of free RX descriptors is greater than the RX free
	 * threshold of the queue, advance the Receive Descriptor Tail (RDT)
	 * register.
	 * Update the RDT with the value of the last processed RX descriptor
	 * minus 1, to guarantee that the RDT register is never equal to the
	 * RDH register, which creates a "full" ring situtation from the
	 * hardware point of view...
	 */
	nb_hold = (uint16_t) (nb_hold + rxq->nb_rx_hold);
	if (nb_hold > rxq->rx_free_thresh) {
		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_tail=%u "
			   "nb_hold=%u nb_rx=%u",
			   (unsigned) rxq->port_id, (unsigned) rxq->queue_id,
			   (unsigned) rx_id, (unsigned) nb_hold,
			   (unsigned) nb_rx);
		rx_id = (uint16_t) ((rx_id == 0) ?
				     (rxq->nb_rx_desc - 1) : (rx_id - 1));
		IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, rx_id);
		nb_hold = 0;
	}
	rxq->nb_rx_hold = nb_hold;
	return nb_rx;
}

The prefetch calls above are clear enough. Now look at how the prefetch itself is defined; the source provides an implementation per architecture, but here we only look at x86:

static inline void rte_prefetch0(const volatile void *p)
{
	/* Prefetch a cache line into all cache levels (hint T0). */
	asm volatile ("prefetcht0 %[p]" : : [p] "m" (*(const volatile char *)p));
}

static inline void rte_prefetch1(const volatile void *p)
{
	/* Prefetch a cache line into all cache levels except L1 (hint T1). */
	asm volatile ("prefetcht1 %[p]" : : [p] "m" (*(const volatile char *)p));
}

static inline void rte_prefetch2(const volatile void *p)
{
	/* Prefetch a cache line into all cache levels except L1 and L2 (hint T2). */
	asm volatile ("prefetcht2 %[p]" : : [p] "m" (*(const volatile char *)p));
}

static inline void rte_prefetch_non_temporal(const volatile void *p)
{
	/* Non-temporal prefetch: bring the data close to the CPU while
	 * minimizing cache pollution. */
	asm volatile ("prefetchnta %[p]" : : [p] "m" (*(const volatile char *)p));
}

For the other prefetch wrappers, you can search the drivers for rte_packet_prefetch; they are thin macros over the rte_prefetch* calls above, so I won't go into detail here.
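As a rough guide to choosing between the variants (my own illustration, not from the DPDK documentation): data you will touch repeatedly is worth pulling into all cache levels, while data you stream over exactly once is better fetched non-temporally so it does not evict hot lines. A sketch with purely illustrative names and sizes:

#include <stddef.h>
#include <stdint.h>
#include <rte_prefetch.h>

/* Hypothetical example: stream over a large array once while repeatedly
 * consulting a small lookup table. */
static uint64_t
sum_with_table(const uint32_t *big, size_t n, const uint8_t *table)
{
	uint64_t acc = 0;

	/* The small table is hot: keep it in all cache levels. */
	rte_prefetch0(table);

	for (size_t i = 0; i < n; i++) {
		/* The big array is read once: prefetch it non-temporally so it
		 * does not push the hot table out of the cache. */
		if (i + 8 < n)
			rte_prefetch_non_temporal(&big[i + 8]);
		acc += table[big[i] & 0xff];
	}
	return acc;
}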

4. Summary

From the analysis above, whatever technique is used, the purpose is the same: keep the processing pipeline flowing without stalls. That is how data gets produced and consumed efficiently, and it is exactly what a data-plane framework like this cares about most. As long as the data keeps flowing the way the design intends, the goal is met.
After all, once external intervention is involved, the intervention time is an eternity from the CPU's point of view, and efficiency stops being the bottleneck. In the absence of intervention, the framework must guarantee maximum data throughput, as in downloading network data, streaming video, or online video conferencing.
This also shows that software design is driven by the application scenario. Frameworks such as Redis position themselves clearly around their scenarios, which is a large part of why they are so popular. What can we learn from that? It should be self-evident.

Origin: blog.csdn.net/fpcc/article/details/132004583