For gpt2-small (d=768, 12 layers), MAC utilization holds around 33% all the way up to N=256, reaching 63,788 tokens/s at N=256 and 400 MHz. For gpt2-nano (d=128, 4 layers), utilization starts collapsing ...
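The figures quoted for gpt2-small can be related to one another with a little arithmetic. The sketch below is illustrative only and rests on assumptions that the excerpt does not state: that N is the side length of a square N×N MAC array, and that each processing element completes one MAC per cycle. The function names are hypothetical. Under those assumptions, it estimates how many MACs per token the quoted throughput would imply.

```python
def peak_macs_per_second(n: int, clock_hz: float) -> float:
    """Peak MAC rate, ASSUMING a square n x n array with one MAC per PE per cycle."""
    return n * n * clock_hz


def implied_macs_per_token(
    n: int, clock_hz: float, utilization: float, tokens_per_second: float
) -> float:
    """MACs per token implied by a sustained utilization and a measured throughput."""
    sustained = utilization * peak_macs_per_second(n, clock_hz)
    return sustained / tokens_per_second


if __name__ == "__main__":
    # Values quoted in the text: N=256, 400 MHz, ~33% utilization, 63,788 tokens/s.
    macs = implied_macs_per_token(256, 400e6, 0.33, 63_788)
    print(f"{macs / 1e6:.1f} M MACs per token (under the stated assumptions)")
```

If N means something other than the array side length, the peak-rate function would need to change accordingly; the rest of the calculation stays the same.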
A technical paper titled “Analyzing and Improving Hardware Modeling of Accel-Sim” was published by researchers at Universitat Politècnica de Catalunya. “GPU architectures have become popular for ...