Modern datacenters spend a significant share of CPU time on simple but frequent data movement operations such as memory copying, inter-process communication, and I/O. These operations, often referred to as “datacenter taxes,” degrade performance and increase CPU cache pressure. While existing DMA hardware can offload some of these tasks, conventional solutions face three limitations: they are not accessible from user space, cannot operate on virtual addresses, and do not support general device-to-device transfers. We propose a hardware-assisted data movement scheme that offloads memory operations without involving the OS kernel. Our hardware engine supports user-space DMA, virtual address translation, and peer-to-peer DMA, and is tightly integrated with the OS or hypervisor to manage multiple address spaces and device memory maps. We implemented a prototype on an Alveo U50 FPGA and confirmed significant performance gains for kernel-level memcpy/memset. We are extending the framework with secure key exchange and eBPF-style programmability to enable flexible, in-hardware data processing from user space.