| Commit message (Collapse) | Author | Age | Files | Lines |
| |
This update pulls in commit c495077c [1] to fix a build error.
commit c495077cf8a8c37afd90875ec5a5b16b294be15e
Author: Siarhei Siamashka <siarhei.siamashka@nokia.com>
Date: Tue Mar 29 01:57:39 2011 +0300
sbc: better compatibility with ARM thumb/thumb2
ARM assembly optimizations fail to compile in thumb mode, but are fine
for thumb2. Update ifdefs in the code to make use of ARM assembly only
when it is safe and also make sure that no optimizations are missed
when compiling for thumb2.
The problem was reported by Paul Menzel:
https://tango.0pointer.de/pipermail/pulseaudio-discuss/2011-February/009022.html
This patch is tested with OpenEmbedded using `minimal-uclibc` for `MACHINE = "at91sam9260ek"`.
Note that changes to ipc.h from 8f3ef04b had to be manually reapplied.
[1] http://git.kernel.org/?p=bluetooth/bluez.git;a=commit;h=c495077cf8a8c37afd90875ec5a5b16b294be15e
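The kind of guard described in the pulled-in commit can be sketched roughly as follows. This is a minimal sketch and not the actual BlueZ source: `EXAMPLE_USE_ARM_ASM` and `example_arm_asm_enabled()` are hypothetical names, and only the compiler-predefined macros `__GNUC__`, `__arm__`, `__thumb__` and `__thumb2__` are real.

```c
/*
 * Hypothetical guard: allow ARM inline assembly in ARM mode and in
 * thumb2 mode, but not in original thumb mode. GCC defines __thumb__
 * for both thumb variants and __thumb2__ only for thumb2.
 */
#if defined(__GNUC__) && defined(__arm__) && \
	(!defined(__thumb__) || defined(__thumb2__))
#define EXAMPLE_USE_ARM_ASM 1
#else
#define EXAMPLE_USE_ARM_ASM 0
#endif

/* Returns 1 when the assembly path would be compiled in, 0 otherwise. */
static int example_arm_asm_enabled(void)
{
	return EXAMPLE_USE_ARM_ASM;
}
```

On a non-ARM host the helper simply returns 0, so the C fallback path would be selected.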
| |
Note that changes to ipc.h from 8f3ef04b had to be manually reapplied.
| |
Benchmarked on ARM PXA platform:
=== Before (4 bands) ====
$ time ./sbcenc_orig -s 4 long.au > /dev/null
real 0m 2.44s
user 0m 2.39s
sys 0m 0.05s
=== After (4 bands) ====
$ time ./sbcenc -s 4 long.au > /dev/null
real 0m 1.59s
user 0m 1.49s
sys 0m 0.10s
=== Before (8 bands) ====
$ time ./sbcenc_orig -s 8 long.au > /dev/null
real 0m 4.05s
user 0m 3.98s
sys 0m 0.07s
=== After (8 bands) ====
$ time ./sbcenc -s 8 long.au > /dev/null
real 0m 1.48s
user 0m 1.41s
sys 0m 0.06s
=== Before (a2dp usage) ====
$ time ./sbcenc_orig -b53 -s8 -j long.au > /dev/null
real 0m 4.51s
user 0m 4.41s
sys 0m 0.10s
=== After (a2dp usage) ====
$ time ./sbcenc -b53 -s8 -j long.au > /dev/null
real 0m 2.05s
user 0m 1.99s
sys 0m 0.06s
| |
The optimized filter is enabled when the code is compiled with
-mcpu=/-march options that target processors supporting ARMv6
instructions. This code is disabled when NEON is used (NEON is a
much better alternative). For additional safety, ARM EABI is
required and thumb mode must not be used.
Benchmarks from ARM11:
== 8 subbands ==
$ time ./sbcenc -b53 -s8 -j test.au > /dev/null
real 0m 35.65s
user 0m 34.17s
sys 0m 1.28s
$ time ./sbcenc.armv6 -b53 -s8 -j test.au > /dev/null
real 0m 17.29s
user 0m 15.47s
sys 0m 0.67s
== 4 subbands ==
$ time ./sbcenc -b53 -s4 -j test.au > /dev/null
real 0m 25.28s
user 0m 23.76s
sys 0m 1.32s
$ time ./sbcenc.armv6 -b53 -s4 -j test.au > /dev/null
real 0m 18.64s
user 0m 15.78s
sys 0m 2.22s
| |
In the scale factor calculation optimizations, the inline assembly
code has instructions which update the flags register, but "cc" was
not mentioned in the clobber list. When optimizing code, gcc is
theoretically allowed to do a comparison before the inline assembly
block and a conditional branch after it, which would lead to a
problem if the flags register gets clobbered. While this apparently
does not happen in practice with current versions of gcc, the
clobber list needs to be corrected.
Regarding the other inline assembly blocks: based on a quick review
it is most likely unnecessary there, but "cc" is also added to
their clobber lists because it should have no impact on performance
in practice. It's kind of cargo cult, but it relieves us from the
need to track potential updates of the flags register in all these
places.
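For illustration, a "cc" entry in a gcc clobber list looks roughly like this. It is a minimal sketch, not code from this commit: `example_add()` is a hypothetical helper, and x86 is used only so that it builds on a common host (gcc is understood to assume flag clobbers on x86 anyway, whereas on ARM the explicit entry is what matters). Other targets fall back to plain C.

```c
#include <stdint.h>

/*
 * Hypothetical helper: adds b to a. The x86 "addl" instruction
 * updates the flags register, so "cc" is listed as clobbered and
 * the compiler will not keep a comparison result live across the
 * asm block.
 */
static uint32_t example_add(uint32_t a, uint32_t b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	__asm__ (
		"addl %1, %0"
		: "+r" (a)	/* read-write operand, also the result */
		: "r" (b)	/* input operand */
		: "cc");	/* flags register is modified */
	return a;
#else
	/* Portable fallback for targets without the x86 asm path. */
	return a + b;
#endif
}
```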
| |
By using the SBC_ALWAYS_INLINE trick, the implementation of the
'sbc_calculate_bits' function is split into two branches, each having
the value of the 'subband' variable known at compile time. This helps
the compiler generate better code by saving at least one register, and
also provides more obvious opportunities for loop unrolling.
Benchmarked on ARM Cortex-A8:
== Before: ==
$ time ./sbcenc -b53 -s8 -j test.au > /dev/null
real 0m3.989s
user 0m3.602s
sys 0m0.391s
samples % image name symbol name
26057 32.6128 sbcenc sbc_pack_frame
20003 25.0357 sbcenc sbc_analyze_4b_8s_neon
14220 17.7977 sbcenc sbc_calculate_bits
8498 10.6361 no-vmlinux /no-vmlinux
5300 6.6335 sbcenc sbc_calc_scalefactors_j_neon
3235 4.0489 sbcenc sbc_enc_process_input_8s_be_neon
2172 2.7185 sbcenc sbc_encode
== After: ==
$ time ./sbcenc -b53 -s8 -j test.au > /dev/null
real 0m3.652s
user 0m3.195s
sys 0m0.445s
samples % image name symbol name
26207 36.0095 sbcenc sbc_pack_frame
19820 27.2335 sbcenc sbc_analyze_4b_8s_neon
8629 11.8566 no-vmlinux /no-vmlinux
6988 9.6018 sbcenc sbc_calculate_bits
5094 6.9994 sbcenc sbc_calc_scalefactors_j_neon
3351 4.6044 sbcenc sbc_enc_process_input_8s_be_neon
2182 2.9982 sbcenc sbc_encode
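The always-inline trick mentioned in this commit can be sketched as follows. This is a simplified illustration under assumptions, not the real 'sbc_calculate_bits' code: `EXAMPLE_ALWAYS_INLINE`, `example_sum_internal()` and `example_sum()` are hypothetical names, and the loop body is a stand-in for the real work.

```c
#include <stdint.h>

/* Hypothetical macro modelled on the idea behind SBC_ALWAYS_INLINE. */
#define EXAMPLE_ALWAYS_INLINE inline __attribute__((always_inline))

/* Generic body; 'subbands' is an ordinary parameter here. */
static EXAMPLE_ALWAYS_INLINE int example_sum_internal(const int32_t *v,
							int subbands)
{
	int sb;
	int sum = 0;

	for (sb = 0; sb < subbands; sb++)
		sum += v[sb];
	return sum;
}

/*
 * Dispatcher: each branch inlines the body with a constant subband
 * count, so the compiler sees a fixed trip count in both copies and
 * can unroll the loop without keeping 'subbands' in a register.
 */
static int example_sum(const int32_t *v, int subbands)
{
	if (subbands == 4)
		return example_sum_internal(v, 4);
	else
		return example_sum_internal(v, 8);
}
```

The cost is that the body is emitted twice, so the approach trades code size for speed.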
| |
The previous variant was basically derived from the C and MMX
implementations. The new variant makes use of the 'vmax' instruction,
which is available in NEON and can do this job faster. The same method
for calculating scale factors is also used in
'sbc_calc_scalefactors_j_neon'.
Benchmarked without joint stereo on ARM Cortex-A8:
== Before: ==
$ time ./sbcenc -b53 -s8 test.au > /dev/null
real 0m3.851s
user 0m3.375s
sys 0m0.469s
samples % image name symbol name
26260 34.2672 sbcenc sbc_pack_frame
20013 26.1154 sbcenc sbc_analyze_4b_8s_neon
13796 18.0027 sbcenc sbc_calculate_bits
8388 10.9457 no-vmlinux /no-vmlinux
3229 4.2136 sbcenc sbc_enc_process_input_8s_be_neon
2408 3.1422 sbcenc sbc_calc_scalefactors_neon
2093 2.7312 sbcenc sbc_encode
== After: ==
$ time ./sbcenc -b53 -s8 test.au > /dev/null
real 0m3.796s
user 0m3.344s
sys 0m0.438s
samples % image name symbol name
26582 34.8726 sbcenc sbc_pack_frame
20032 26.2797 sbcenc sbc_analyze_4b_8s_neon
13808 18.1146 sbcenc sbc_calculate_bits
8374 10.9858 no-vmlinux /no-vmlinux
3187 4.1810 sbcenc sbc_enc_process_input_8s_be_neon
2027 2.6592 sbcenc sbc_encode
1766 2.3168 sbcenc sbc_calc_scalefactors_neon
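Why a maximum can stand in for the older accumulation can be shown in scalar C. This is only a sketch under assumptions, not NEON code and not the BlueZ source: it assumes the earlier variant accumulates magnitudes with bitwise OR, and `EXAMPLE_SCALE_OUT_BITS` and both function names are hypothetical. Since only the position of the highest set bit matters, OR and max give the same result.

```c
#include <stdint.h>

#define EXAMPLE_SCALE_OUT_BITS 15	/* assumed value, for illustration */

/* Magnitude of a sample without signed overflow on INT32_MIN. */
static uint32_t example_mag(int32_t s)
{
	return s < 0 ? 0u - (uint32_t)s : (uint32_t)s;
}

/* Assumed older style: accumulate (magnitude - 1) with bitwise OR. */
static uint32_t example_sf_or(const int32_t *s, int blocks)
{
	uint32_t x = 1u << EXAMPLE_SCALE_OUT_BITS;
	int blk;

	for (blk = 0; blk < blocks; blk++) {
		uint32_t mag = example_mag(s[blk]);
		if (mag != 0)
			x |= mag - 1;
	}
	return (31 - EXAMPLE_SCALE_OUT_BITS) - __builtin_clz(x);
}

/* Max-based style: the highest set bit of the maximum is the same. */
static uint32_t example_sf_max(const int32_t *s, int blocks)
{
	uint32_t x = 1u << EXAMPLE_SCALE_OUT_BITS;
	int blk;

	for (blk = 0; blk < blocks; blk++) {
		uint32_t mag = example_mag(s[blk]);
		if (mag != 0 && mag - 1 > x)
			x = mag - 1;
	}
	return (31 - EXAMPLE_SCALE_OUT_BITS) - __builtin_clz(x);
}
```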
| |
Using SIMD optimizations for 'sbc_enc_process_input_*' functions provides
a modest, but consistent speedup in all SBC encoding cases.
Benchmarked on ARM Cortex-A8:
== Before: ==
$ time ./sbcenc -b53 -s8 -j test.au > /dev/null
real 0m4.389s
user 0m3.969s
sys 0m0.422s
samples % image name symbol name
26234 29.9625 sbcenc sbc_pack_frame
20057 22.9076 sbcenc sbc_analyze_4b_8s_neon
14306 16.3393 sbcenc sbc_calculate_bits
9866 11.2682 sbcenc sbc_enc_process_input_8s_be
8506 9.7149 no-vmlinux /no-vmlinux
5219 5.9608 sbcenc sbc_calc_scalefactors_j_neon
2280 2.6040 sbcenc sbc_encode
661 0.7549 libc-2.10.1.so memcpy
== After: ==
$ time ./sbcenc -b53 -s8 -j test.au > /dev/null
real 0m3.989s
user 0m3.602s
sys 0m0.391s
samples % image name symbol name
26057 32.6128 sbcenc sbc_pack_frame
20003 25.0357 sbcenc sbc_analyze_4b_8s_neon
14220 17.7977 sbcenc sbc_calculate_bits
8498 10.6361 no-vmlinux /no-vmlinux
5300 6.6335 sbcenc sbc_calc_scalefactors_j_neon
3235 4.0489 sbcenc sbc_enc_process_input_8s_be_neon
2172 2.7185 sbcenc sbc_encode
| |
Improves SBC encoding performance when joint stereo is used, which
is a typical A2DP configuration.
Benchmarked on ARM Cortex-A8:
== Before: ==
$ time ./sbcenc -b53 -s8 -j test.au > /dev/null
real 0m5.239s
user 0m4.805s
sys 0m0.430s
samples % image name symbol name
26083 25.0856 sbcenc sbc_pack_frame
21548 20.7240 sbcenc sbc_calc_scalefactors_j
19910 19.1486 sbcenc sbc_analyze_4b_8s_neon
14377 13.8272 sbcenc sbc_calculate_bits
9990 9.6080 sbcenc sbc_enc_process_input_8s_be
8667 8.3356 no-vmlinux /no-vmlinux
2263 2.1765 sbcenc sbc_encode
696 0.6694 libc-2.10.1.so memcpy
== After: ==
$ time ./sbcenc -b53 -s8 -j test.au > /dev/null
real 0m4.389s
user 0m3.969s
sys 0m0.422s
samples % image name symbol name
26234 29.9625 sbcenc sbc_pack_frame
20057 22.9076 sbcenc sbc_analyze_4b_8s_neon
14306 16.3393 sbcenc sbc_calculate_bits
9866 11.2682 sbcenc sbc_enc_process_input_8s_be
8506 9.7149 no-vmlinux /no-vmlinux
5219 5.9608 sbcenc sbc_calc_scalefactors_j_neon
2280 2.6040 sbcenc sbc_encode
661 0.7549 libc-2.10.1.so memcpy
| |
The written parameter of sbc_encode can be negative so it should be
ssize_t instead of size_t.
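The difference between the two types can be sketched as below. This is a hypothetical stand-in, not the sbc_encode implementation: `example_report()` and `example_as_unsigned()` are invented for illustration, and the assumption is that a negative value signals an error.

```c
#include <stddef.h>
#include <sys/types.h>
#include <errno.h>

/* Hypothetical encoder stub: stores a byte count or a negative errno. */
static void example_report(int fail, ssize_t *written)
{
	*written = fail ? -EIO : 42;
}

/*
 * Shows what would happen with size_t: the negative error value
 * turns into a very large positive number and no longer looks
 * like an error to the caller.
 */
static size_t example_as_unsigned(ssize_t written)
{
	return (size_t)written;
}
```

With ssize_t the caller can test `written < 0`; with size_t that test can never be true.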
| |
Improves SBC encoding performance when joint stereo is not used.
Benchmarked on ARM Cortex-A8:
== Before: ==
$ time ./sbcenc -b53 -s8 test.au > /dev/null
real 0m4.756s
user 0m4.313s
sys 0m0.438s
samples % image name symbol name
2569 27.6296 sbcenc sbc_pack_frame
1934 20.8002 sbcenc sbc_analyze_4b_8s_neon
1386 14.9064 sbcenc sbc_calculate_bits
1221 13.1319 sbcenc sbc_calc_scalefactors
996 10.7120 sbcenc sbc_enc_process_input_8s_be
878 9.4429 no-vmlinux /no-vmlinux
204 2.1940 sbcenc sbc_encode
56 0.6023 libc-2.10.1.so memcpy
== After: ==
$ time ./sbcenc -b53 -s8 test.au > /dev/null
real 0m4.220s
user 0m3.797s
sys 0m0.422s
samples % image name symbol name
2563 31.3249 sbcenc sbc_pack_frame
1892 23.1239 sbcenc sbc_analyze_4b_8s_neon
1368 16.7196 sbcenc sbc_calculate_bits
961 11.7453 sbcenc sbc_enc_process_input_8s_be
836 10.2176 no-vmlinux /no-vmlinux
262 3.2022 sbcenc sbc_calc_scalefactors_neon
199 2.4322 sbcenc sbc_encode
49 0.5989 libc-2.10.1.so memcpy
| |
Improves SBC encoding performance when joint stereo is not used.
Benchmarked on Pentium-M:
== Before: ==
$ time ./sbcenc -b53 -s8 test.au > /dev/null
real 0m1.439s
user 0m1.336s
sys 0m0.104s
samples % image name symbol name
8642 33.7473 sbcenc sbc_pack_frame
5873 22.9342 sbcenc sbc_analyze_4b_8s_mmx
4435 17.3188 sbcenc sbc_calc_scalefactors
4285 16.7331 sbcenc sbc_calculate_bits
1942 7.5836 sbcenc sbc_enc_process_input_8s_be
322 1.2574 sbcenc sbc_encode
== After: ==
$ time ./sbcenc -b53 -s8 test.au > /dev/null
real 0m1.319s
user 0m1.220s
sys 0m0.084s
samples % image name symbol name
8706 37.9959 sbcenc sbc_pack_frame
5740 25.0513 sbcenc sbc_analyze_4b_8s_mmx
4307 18.7972 sbcenc sbc_calculate_bits
1937 8.4537 sbcenc sbc_enc_process_input_8s_be
1801 7.8602 sbcenc sbc_calc_scalefactors_mmx
307 1.3399 sbcenc sbc_encode
| |
The code for scale factor calculation with joint stereo support has
been moved to a separate function, which can get platform-specific
SIMD optimizations later for the best possible performance.
Even this change in the C code improves performance, because it uses
__builtin_clz() instead of loops, similar to what was done to
sbc_calc_scalefactors earlier. It also effectively unrolls the loop
by processing two channels at once, which might be either good or
bad for performance (bad if register pressure increases and more
data is spilled to memory). A benchmark on a 32-bit x86 system
(Pentium-M) shows that it got clearly faster:
$ time ./sbcenc.old -b53 -s8 -j test.au > /dev/null
real 0m1.868s
user 0m1.808s
sys 0m0.048s
$ time ./sbcenc.new -b53 -s8 -j test.au > /dev/null
real 0m1.742s
user 0m1.668s
sys 0m0.064s
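The effect of replacing a loop with __builtin_clz() can be illustrated in isolation. This is a generic sketch and not the function from this commit: both helper names are hypothetical, and they only count the significant bits of a value, which is the kind of work a scale factor loop does.

```c
#include <stdint.h>

/* Loop variant: shift until the value is exhausted. */
static int example_bits_loop(uint32_t x)
{
	int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/*
 * Builtin variant: one count-leading-zeros operation. The result of
 * __builtin_clz(0) is undefined, so zero is handled separately.
 */
static int example_bits_clz(uint32_t x)
{
	return x ? 32 - __builtin_clz(x) : 0;
}
```

On many CPUs the builtin maps to a single instruction, which removes a data-dependent loop from the hot path.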
| |
Issues found by the smatch static checker: http://smatch.sourceforge.net/
| |
This prevents overflows and audible artefacts for audio files which
originally had their loudness maximized. Music from audio CDs is an
example of such files, see http://en.wikipedia.org/wiki/Loudness_war
| |
The buffer position in the X array was not always 16-byte aligned.
16-byte alignment is strictly required for PowerPC AltiVec SIMD
optimizations, because AltiVec does not support unaligned vector
loads at all.
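The alignment requirement can be sketched in portable C. This is an illustration under assumptions, not the change made by this commit: the structure, its field sizes and the helper names are hypothetical, and only the 16-byte figure comes from the message above.

```c
#include <stdint.h>

#define EXAMPLE_ALIGN 16

/* Hypothetical encoder state with a 16-byte aligned sample buffer. */
struct example_state {
	int position;
	int16_t X[64] __attribute__((aligned(EXAMPLE_ALIGN)));
};

/* Returns 1 if the pointer is a multiple of 16 bytes. */
static int example_is_aligned(const void *p)
{
	return ((uintptr_t)p % EXAMPLE_ALIGN) == 0;
}

/*
 * Rounds an element index down to a multiple of 8, because eight
 * int16_t samples occupy 16 bytes. Indexing an aligned array with
 * the result keeps the address aligned.
 */
static int example_align_index(int index)
{
	return index & ~7;
}
```

Aligning the array alone is not enough: the position used to index it also has to stay on a 16-byte boundary.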
|
|
This should make it easier to apply patches from BlueZ, which also uses
the sbc subdir for these files.
|