TY - GEN
T1 - A Low-power network-on-chip architecture for tile-based chip multi-processors
AU - Psarras, Anastasios
AU - Lee, Junghee
AU - Mattheakis, Pavlos
AU - Nicopoulos, Chrysostomos
AU - Dimitrakopoulos, Giorgos
PY - 2016/5/18
Y1 - 2016/5/18
N2 - Technology scaling of tile-based CMPs reduces the physical size of each tile and increases the number of tiles per die. This trend directly impacts the on-chip interconnect; even though the tile population increases, the inter-tile link distances scale down proportionally to the tile dimensions. The decreasing inter-tile wire lengths can be exploited to enable swift link traversal between neighboring tiles, after appropriate wire engineering. Building on this premise, we propose a technique to rapidly transfer flits between adjacent routers in half a clock cycle, by utilizing both edges of the clock during the sending and receiving operations. Half-cycle link traversal enables, for the first time, substantial reductions in (a) link power, irrespective of the data switching profile, and (b) buffer power (through buffer-size reduction), without incurring any latency/throughput loss. In fact, the proposed architecture also yields some latency improvements over a baseline NoC. Detailed hardware analysis using placed-and-routed designs and cycle-accurate full-system simulations corroborate the significant power and latency improvements.
AB - Technology scaling of tile-based CMPs reduces the physical size of each tile and increases the number of tiles per die. This trend directly impacts the on-chip interconnect; even though the tile population increases, the inter-tile link distances scale down proportionally to the tile dimensions. The decreasing inter-tile wire lengths can be exploited to enable swift link traversal between neighboring tiles, after appropriate wire engineering. Building on this premise, we propose a technique to rapidly transfer flits between adjacent routers in half a clock cycle, by utilizing both edges of the clock during the sending and receiving operations. Half-cycle link traversal enables, for the first time, substantial reductions in (a) link power, irrespective of the data switching profile, and (b) buffer power (through buffer-size reduction), without incurring any latency/throughput loss. In fact, the proposed architecture also yields some latency improvements over a baseline NoC. Detailed hardware analysis using placed-and-routed designs and cycle-accurate full-system simulations corroborate the significant power and latency improvements.
UR - http://www.scopus.com/inward/record.url?scp=84974733264&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84974733264&partnerID=8YFLogxK
U2 - 10.1145/2902961.2903010
DO - 10.1145/2902961.2903010
M3 - Conference contribution
AN - SCOPUS:84974733264
T3 - Proceedings of the ACM Great Lakes Symposium on VLSI, GLSVLSI
SP - 335
EP - 340
BT - GLSVLSI 2016 - Proceedings of the 2016 ACM Great Lakes Symposium on VLSI
PB - Association for Computing Machinery
T2 - 26th ACM Great Lakes Symposium on VLSI, GLSVLSI 2016
Y2 - 18 May 2016 through 20 May 2016
ER -