--- title: 2015q3 Homework #1 Ext toc: no ... 目標 ------ 利用以下公式 $\pi = 4 \int_{0}^{1} \frac{1}{1 + x^2} dx$ 採用離散積分的方法求圓周率,並著手透過 SIMD 指令作效能最佳化 Baseline ------------ ```c #include double compute_pi(size dt) { double pi = 0.0; double delta = 1.0 / dt; for (size_t i = 0; i < dt; i++) { double x = (double) i / dt; pi += delta / (1.0 + x * x); } return pi * 4.0; } ``` dt 為 [0, 1] 區間的大小,理論上 dt 切割地越細,計算結果越精確,但運算量自然越大。 請用 dt = 128M 作為輸入 AVX SIMD -------------- ```c #include #include double compute_pi(size_t dt) { double pi = 0.0; double delta = 1.0 / dt; register __m256d ymm0, ymm1, ymm2, ymm3, ymm4; ymm0 = _mm256_set1_pd(1.0); ymm1 = _mm256_set1_pd(delta); ymm2 = _mm256_set_pd(delta * 3, delta * 2, delta * 1, 0.0); ymm4 = _mm256_setzero_pd(); for (int i = 0; i <= dt - 4; i += 4) { ymm3 = _mm256_set1_pd(i * delta); ymm3 = _mm256_add_pd(ymm3, ymm2); ymm3 = _mm256_mul_pd(ymm3, ymm3); ymm3 = _mm256_add_pd(ymm0, ymm3); ymm3 = _mm256_div_pd(ymm1, ymm3); ymm4 = _mm256_add_pd(ymm4, ymm3); } double tmp[4] __attribute__((aligned(32))); _mm256_store_pd(tmp, ymm4); pi += tmp[0] + tmp[1] + tmp[2] + tmp[3]; return pi * 4.0; } ``` 參考 [flops](https://github.com/nanoant/flops) 去修改上述程式碼,使其得以在 GNU/Linux 編譯和運作,並且測試其效能改進 參考資訊 ------------- * [Introduction to Intel® Advanced Vector Extensions](https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions) * [Intel® 64 and IA-32 Architectures Optimization Reference Manual](http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf) * [Online LaTeX editor](https://www.codecogs.com/latex/eqneditor.php)