Deep learning models are generally trained using a single-precision floating-point number system. For inference, however, simpler number systems such as integers and fixed-point numbers are used because of their small design area and low power consumption, despite the accuracy loss and the overhead of quantization parameters that quantization introduces. In general, the cost of a floating-point MAC unit makes it unsuitable for inference engines, especially for area-, power-, and heat-sensitive devices such as Processor-In-Memory (PIM). In this paper, we propose an efficient MAC design based on bfloat16 that is suitable for neural network operations, taking into account the characteristics of the data used in deep learning. Our techniques simplify the design by removing the circuits for handling underflow, overflow, and normalization from the critical path and treating these cases as exceptions. We also improve computational accuracy by extending the mantissa bit-width inside the MAC unit and by eliminating the unnecessary normalization otherwise performed at every computation. Compared with a MAC unit without our optimizations, synthesized using the Samsung 65 nm library, our non-pipelined MAC unit reduces delay by 47.3%, area by 9.1%, and power consumption by 24.2%. Furthermore, we show that the proposed bfloat16 MAC outperforms a 16-bit integer MAC in terms of area and power consumption. We also present the design of a 1 GHz three-stage pipelined MAC unit along with its performance analysis.
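To make the format concrete: bfloat16 keeps float32's sign bit and 8-bit exponent but truncates the mantissa from 23 bits to 7. The following is a minimal software sketch of that conversion (an illustration of the number format only, not the paper's hardware design), assuming round-to-nearest-even on the discarded bits:

```python
import struct

def f32_bits(x: float) -> int:
    """Raw IEEE-754 single-precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def to_bfloat16(x: float) -> int:
    """Convert float32 to a 16-bit bfloat16 pattern
    (1 sign bit, 8 exponent bits, 7 mantissa bits)."""
    bits = f32_bits(x)
    # Round-to-nearest-even: bias depends on the LSB that survives truncation.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def from_bfloat16(b: int) -> float:
    """Widen a bfloat16 pattern back to float32 by zero-filling the low mantissa bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]
```

Because the exponent field is unchanged, bfloat16 covers the same dynamic range as float32; only precision is lost (at most about 0.4% relative error per value), which is the property that makes it attractive for deep-learning workloads.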