AI Architecture 13. Roofline Model Analysis
In our previous posts, we discussed the two main culprits degrading deep learning model performance:
In MLP and Memory Wall, we confirmed how inefficient the MLP (fully connected layer) is from a hardware perspective. Because each weight is fetched from memory once, used for a single multiply-accumulate, and then discarded, memory bandwidth rather than raw compute limits system performance: the "memory wall" phenomenon.
In CNN and Locality, we then learned that hardware loves CNNs (convolutional neural networks) precisely because of locality and data reuse. Theoretically, the CNN looks like the perfect hardware-friendly algorithm.
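To make that contrast concrete, here is a minimal sketch that estimates arithmetic intensity, FLOPs per byte of memory traffic, for a fully connected layer versus a convolution layer. The layer shapes, FP32 assumption, stride-1/no-padding convolution, and the habit of counting only weight and activation traffic (ignoring caches) are illustrative choices of mine, not numbers from the earlier posts. Arithmetic intensity is exactly the quantity the Roofline model puts on its x-axis.

```python
# Arithmetic intensity = FLOPs / bytes moved (weights + activations).
# All shapes below are illustrative assumptions, not from the earlier posts.

BYTES = 4  # FP32

def fc_intensity(in_features, out_features):
    """Fully connected layer on a single input vector (GEMV)."""
    flops = 2 * in_features * out_features             # one multiply + one add per weight
    weight_bytes = in_features * out_features * BYTES  # each weight fetched once, used once
    act_bytes = (in_features + out_features) * BYTES
    return flops / (weight_bytes + act_bytes)

def conv_intensity(c_in, c_out, k, h_out, w_out):
    """Standard convolution: every weight is reused at h_out * w_out positions."""
    flops = 2 * c_in * c_out * k * k * h_out * w_out
    weight_bytes = c_in * c_out * k * k * BYTES
    act_bytes = (c_in * (h_out + k - 1) * (w_out + k - 1)   # input (stride 1, no padding)
                 + c_out * h_out * w_out) * BYTES           # output
    return flops / (weight_bytes + act_bytes)

print(f"FC   4096 -> 4096              : {fc_intensity(4096, 4096):7.2f} FLOPs/byte")
print(f"Conv 3x3, 256 -> 256, 56x56 out: {conv_intensity(256, 256, 3, 56, 56):7.2f} FLOPs/byte")
```

Under these assumptions the FC layer lands around 0.5 FLOPs/byte, firmly memory-bound, while the convolution reaches hundreds of FLOPs/byte: the two layers will sit on opposite sides of the roofline.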
Finally, in 3 Mappings of Conv Operations, we looked at the Im2Col method, a strategy that deliberately sacrifices memory to gain computational speed by mapping a standard convolution onto a single large GEMM.
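As a reminder of what that trade-off costs, below is a small NumPy sketch of the Im2Col idea, a generic textbook formulation rather than the exact code from 3 Mappings of Conv Operations. Every k x k input patch becomes one column of the unfolded matrix, so the unfolded input is roughly k^2 times larger than the original.

```python
import numpy as np

def im2col(x, k):
    """Unfold a (C, H, W) input into a (C*k*k, H_out*W_out) patch matrix (stride 1, no padding)."""
    c, h, w = x.shape
    h_out, w_out = h - k + 1, w - k + 1
    cols = np.empty((c * k * k, h_out * w_out), dtype=x.dtype)
    idx = 0
    for i in range(h_out):
        for j in range(w_out):
            cols[:, idx] = x[:, i:i + k, j:j + k].ravel()  # one patch per column
            idx += 1
    return cols

c, h, w, k, c_out = 3, 32, 32, 3, 8          # illustrative shapes
x = np.random.rand(c, h, w).astype(np.float32)
weights = np.random.rand(c_out, c * k * k).astype(np.float32)

cols = im2col(x, k)                                          # memory blows up by ~k*k
out = (weights @ cols).reshape(c_out, h - k + 1, w - k + 1)  # the convolution, as one GEMM

print(f"input  : {x.nbytes / 1024:.1f} KiB")
print(f"im2col : {cols.nbytes / 1024:.1f} KiB (~{cols.nbytes / x.nbytes:.1f}x)")
```

The roughly k^2-fold expansion (a bit less at the borders) is the memory Im2Col sacrifices; the payoff is that the convolution runs as one dense GEMM, the operation hardware executes best.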