This is the sample program accompanying my Polish article Rozmycie obrazów (Blurring images); try Google Translate if you wish, it gives quite a good translation.
Blurring in this case means applying the following 3×3 kernel to every pixel:

+---+---+---+
| 1 | 1 | 1 |
+---+---+---+
| 1 | 1 | 1 |
+---+---+---+
| 1 | 1 | 1 |
+---+---+---+

where the bias is 0 and the divisor is 9, so each output pixel is the average of its 3×3 neighborhood.
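
The kernel can be illustrated with a minimal scalar sketch in C. The function name `blur3x3_scalar`, the 8-bit grayscale layout, and the choice to copy border pixels unchanged are assumptions made for this example; they are not taken from the sample program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: apply the 3x3 kernel of ones (bias 0, divisor 9)
   to an 8-bit grayscale image stored row by row.
   Assumption: border pixels are copied unchanged. */
void blur3x3_scalar(const uint8_t *src, uint8_t *dst,
                    size_t width, size_t height)
{
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x++) {
            if (x == 0 || y == 0 || x + 1 == width || y + 1 == height) {
                dst[y * width + x] = src[y * width + x];
                continue;
            }
            /* Sum the 3x3 neighborhood; every kernel weight is 1. */
            unsigned sum = 0;
            for (size_t dy = 0; dy < 3; dy++)
                for (size_t dx = 0; dx < 3; dx++)
                    sum += src[(y + dy - 1) * width + (x + dx - 1)];
            /* bias = 0, divisor = 9; integer division truncates. */
            dst[y * width + x] = (uint8_t)((sum + 0) / 9);
        }
    }
}
```

Written this way, every interior output pixel costs nine loads and eight additions.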
The following observations made a significant speedup possible:
- each pixel can be fetched exactly once;
- some intermediate results can be shared between the calculations for adjacent blocks.
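
One possible reading of these observations, sketched in C: while producing a single output row, each source pixel of the three contributing rows is loaded once into a vertical column sum, and every output pixel reuses two column sums from its left neighbor. The function name `blur3x3_row_shared` and the 8-bit grayscale layout are assumptions; the sample program may organize the sharing differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: blur the interior pixels of row y.
   Border pixels of the row are left untouched. */
void blur3x3_row_shared(const uint8_t *src, uint8_t *dst,
                        size_t width, size_t height, size_t y)
{
    assert(width >= 3 && y >= 1 && y + 1 < height);
    const uint8_t *up   = src + (y - 1) * width;
    const uint8_t *mid  = src + y * width;
    const uint8_t *down = src + (y + 1) * width;

    /* Vertical sums of the first two columns. */
    unsigned left   = up[0] + mid[0] + down[0];
    unsigned centre = up[1] + mid[1] + down[1];

    for (size_t x = 1; x + 1 < width; x++) {
        /* Only the new right-hand column is read from memory. */
        unsigned right = up[x + 1] + mid[x + 1] + down[x + 1];
        dst[y * width + x] = (uint8_t)((left + centre + right) / 9);
        /* Slide the window: two of three column sums are reused. */
        left = centre;
        centre = right;
    }
}
```

In this sketch each output pixel costs three loads and four additions once the window is primed, instead of nine loads and eight additions.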
The sample program contains several implementations:
- x86: a scalar reference implementation;
- MMX: an MMX variant of the scalar algorithm;
- MMX2: an MMX variant with minimized memory fetches;
- SSE2: an SSE2 variant of the MMX implementation.
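
To give a flavor of what a SIMD variant can look like, here is a small sketch using SSE2 intrinsics in C. It only adds three rows of eight pixels in parallel; it is not the code of the SSE2 implementation listed above, and the helper name `column_sums8_sse2` is hypothetical.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: compute eight vertical sums up[i] + mid[i] + down[i]
   at once. Inputs are widened from 8 to 16 bits first, so a sum of at most
   3 * 255 cannot overflow. */
void column_sums8_sse2(const uint8_t *up, const uint8_t *mid,
                       const uint8_t *down, uint16_t sums[8])
{
    assert(up && mid && down && sums);
    const __m128i zero = _mm_setzero_si128();
    /* Load 8 bytes per row, then interleave with zero bytes:
       8 x uint8 becomes 8 x uint16. */
    __m128i a = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)up), zero);
    __m128i b = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)mid), zero);
    __m128i c = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)down), zero);
    /* Two packed additions produce all eight column sums. */
    __m128i s = _mm_add_epi16(_mm_add_epi16(a, b), c);
    _mm_storeu_si128((__m128i *)sums, s);
}
```

The same idea works with MMX registers, which hold four 16-bit lanes instead of eight.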
CPU: Core i5 M 540 @ 2.53GHz
Note: this is 32-bit code.
procedure | time [min:s] | speedup | |
---|---|---|---|
x86 | 0:08.27 | 1.00 | █████ |
mmx | 0:01.82 | 4.52 | ██████████████████████ |
mmx2 | 0:02.35 | 3.52 | █████████████████ |
sse2 | 0:01.04 | 7.95 | ████████████████████████████████████████ |