版本 e0a140f6efe9b26a401a132e1fd0376f68971410
Changes from e0a140f6efe9b26a401a132e1fd0376f68971410 to 41dfd5fb7f7bab7fb3d79048e6f4ab4468ff1b32
---
title: 2015q3 Homework #1 Ext
toc: no
...
目標
------
利用以下公式
$\pi = 4 \int_{0}^{1} \frac{1}{1 + x^2} dx$
採用離散積分的方法求圓周率,並著手透過 SIMD 指令作效能最佳化
Baseline
------------
```c
#include <stdint.h>
double compute_pi(size dt)
{
double pi = 0.0;
double delta = 1.0 / dt;
for (size_t i = 0; i < dt; i++) {
double x = (double) i / dt;
pi += delta / (1.0 + x * x);
}
return pi * 4.0;
}
```
dt 為 [0, 1] 區間的大小,理論上 dt 切割地越細,計算結果越精確,但運算量自然越大。
請用 dt = 128M 作為輸入
AVX SIMD
--------------
```c
#include <immintrin.h>
#include <stdint.h>
double compute_pi(size_t dt)
{
double pi = 0.0;
double delta = 1.0 / dt;
register __m256d ymm0, ymm1, ymm2, ymm3, ymm4;
ymm0 = _mm256_set1_pd(1.0);
ymm1 = _mm256_set1_pd(delta);
ymm2 = _mm256_set_pd(delta * 3, delta * 2, delta * 1, 0.0);
ymm4 = _mm256_setzero_pd();
for (int i = 0; i <= dt - 4; i += 4) {
ymm3 = _mm256_set1_pd(i * delta);
ymm3 = _mm256_add_pd(ymm3, ymm2);
ymm3 = _mm256_mul_pd(ymm3, ymm3);
ymm3 = _mm256_add_pd(ymm0, ymm3);
ymm3 = _mm256_div_pd(ymm1, ymm3);
ymm4 = _mm256_add_pd(ymm4, ymm3);
}
double tmp[4] __attribute__((aligned(32)));
_mm256_store_pd(tmp, ymm4);
pi += tmp[0] + tmp[1] + tmp[2] + tmp[3];
return pi * 4.0;
}
```
參考 [flops](https://github.com/nanoant/flops) 去修改上述程式碼,使其得以在 GNU/Linux 編譯和運作,並且測試其效能改進
參考資訊
-------------
* [Introduction to Intel® Advanced Vector Extensions](https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions)
* [Intel® 64 and IA-32 Architectures Optimization Reference Manual](http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf)
* [Online LaTeX editor](https://www.codecogs.com/latex/eqneditor.php)