VLSI-Based Parallel CNN Accelerator with Quantization for High-Performance Edge Intelligence
Abstract
Deep learning models, especially Convolutional Neural Networks (CNNs), have high computational and energy requirements, which creates demand for efficient hardware. Conventional CPU- and GPU-based systems often suffer from high latency and power consumption, particularly on edge devices. This study designs and implements an energy-efficient CNN hardware accelerator, based on a VLSI architecture, for real-time image classification. The proposed approach pairs a lightweight CNN model with a VLSI hardware design that combines parallel processing elements, an optimized dataflow, and fixed-point quantization. The system is evaluated on the CIFAR-10 dataset, which consists of 60,000 images across 10 classes. The images are preprocessed with normalization and data augmentation, and the trained model is mapped onto the hardware through an efficient pipeline. Experimental results show that the proposed system achieves an accuracy of 94.8%, a precision of 94.1%, a recall of 93.6%, and an F1-score of 93.8%. Compared with conventional approaches, the design shows lower latency and lower power consumption while maintaining high throughput. Quantization significantly improves energy efficiency with minimal impact on accuracy. The proposed VLSI-based CNN accelerator therefore offers a practical solution for real-time edge AI applications, balancing performance against energy efficiency, and contributes to the development of scalable, hardware-efficient deep learning systems.
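As an illustration of the fixed-point quantization mentioned in the abstract, the following minimal Python sketch quantizes weights and activations to signed fixed-point codes and performs an integer multiply-accumulate, as a single processing element might. The abstract does not state the bit width or number format used, so the 8-bit word with 6 fractional bits, and all function names here, are assumptions chosen for illustration only.

```python
def quantize_fixed_point(values, total_bits=8, frac_bits=6):
    """Map floats to signed fixed-point integer codes with saturation.

    total_bits and frac_bits are illustrative assumptions, not values
    taken from the paper.
    """
    scale = 1 << frac_bits                   # 2**frac_bits
    q_min = -(1 << (total_bits - 1))         # most negative code, e.g. -128
    q_max = (1 << (total_bits - 1)) - 1      # most positive code, e.g. 127
    # Scale, round to the nearest integer, then clamp to the word range.
    return [max(q_min, min(q_max, round(v * scale))) for v in values]


def dequantize_fixed_point(codes, frac_bits=6):
    """Convert fixed-point integer codes back to floats."""
    scale = 1 << frac_bits
    return [c / scale for c in codes]


def fixed_point_dot(w_codes, x_codes, frac_bits=6):
    """Integer multiply-accumulate over two code vectors.

    Each product carries 2 * frac_bits fractional bits, so the wide
    accumulator is rescaled once at the end.
    """
    acc = sum(w * x for w, x in zip(w_codes, x_codes))
    return acc / float(1 << (2 * frac_bits))


if __name__ == "__main__":
    weights = [0.5, -0.25, 0.75]
    inputs = [1.0, 0.5, 0.25]
    w_q = quantize_fixed_point(weights)
    x_q = quantize_fixed_point(inputs)
    print(w_q, x_q)
    print(fixed_point_dot(w_q, x_q))
```

In this sketch the multiplications and additions operate only on small integers, which is the property that lets a hardware datapath replace floating-point units with narrower integer multipliers and adders. Values outside the representable range saturate rather than wrap around, which is one common design choice among several.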
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
IJCERT Policy:
The published work presented in this paper is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. This means that the content of this paper can be shared, copied, and redistributed in any medium or format, as long as the original author is properly attributed. Additionally, any derivative works based on this paper must also be licensed under the same terms. This licensing agreement allows for broad dissemination and use of the work while maintaining the author's rights and recognition.
By submitting this paper to IJCERT, the author(s) agree to these licensing terms and confirm that the work is original and does not infringe on any third-party copyright or intellectual property rights.