3 — Section 3.3.9
3.3.9.1 to 3.3.9.8
Advanced
The LTPF (Long Term Postfilter) is a pitch-based postfilter that runs at the decoder. The encoder’s job is to estimate the fundamental pitch of the audio signal each frame and transmit that pitch information in the bitstream — along with a flag indicating whether the LTPF should be active this frame. The decoder then uses this transmitted pitch to configure an IIR filter that attenuates quantization noise in the spectral valleys between harmonics.
3.3.9.1 Overview — Why Pitch Detection?
Voiced speech and many musical instruments have a harmonic structure — strong frequency components at the fundamental frequency (pitch) and its integer multiples, separated by spectral valleys. After quantization, these valleys are filled with quantization noise that is clearly audible.
The LTPF IIR filter at the decoder uses the pitch period to notch the noise in these valleys, significantly improving the perceived quality of voiced speech at lower bitrates.
Important: even at high bitrates where LTPF has no audible effect (gain_ltpf = 0), the encoder still estimates and transmits the pitch — because a Packet Loss Concealment algorithm can use this pitch information for better speech reconstruction during lost frames.
3.3.9.2 and 3.3.9.3 Time Domain Signals and Resampling to 12.8 kHz
Pitch detection always works at a fixed internal rate of 12.8 kHz (11.76 kHz for 44.1 kHz input), regardless of the session sampling rate. This normalization simplifies pitch lag range management: the lag range [17, 114] samples at 6.4 kHz (half the internal rate, see 3.3.9.5) always covers the same pitch range, approximately 56 Hz to 376 Hz. For 44.1 kHz input the internal rate scales by 44.1/48, so the range shifts slightly down to about 52 Hz to 346 Hz.
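The relationship between lag and pitch frequency is easy to check numerically. The sketch below is only an illustration; the helper name `lag_to_hz` is hypothetical and does not come from the specification.

```python
def lag_to_hz(lag_samples, rate_hz=6400.0):
    """Pitch frequency (Hz) of a lag given in samples at rate_hz."""
    return rate_hz / lag_samples

# Shortest and longest lag of the 6.4 kHz search range
for lag in (17, 114):
    print(lag, "samples ->", round(lag_to_hz(lag), 1), "Hz")

# Same lags when the internal rate is scaled by 44.1/48 (5.88 kHz)
for lag in (17, 114):
    print(lag, "samples ->", round(lag_to_hz(lag, 5880.0), 1), "Hz")
```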
Resampling is done using a polyphase FIR filter with coefficients from Section 3.7.6 (tab_resamp_filter, length 240):
len12.8 = Nms × 128 / 10 (= 128 for 10ms, = 96 for 7.5ms)
resfac = 0.5 if fs = 8 kHz, else 1
x12.8(n) = resfac × P × SUM[k] xs(floor(15n/P) + k − 15) × h12.8(P×k − 15×n mod P)
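Here P = 192 kHz / fs is the upsampling factor (with 44.1 kHz treated as 48 kHz): the input is conceptually upsampled to 192 kHz and then decimated by 15, since 192 / 15 = 12.8. For 8 kHz input the resampler therefore effectively upsamples, because 8 kHz is below 12.8 kHz, and resfac = 0.5 applies the gain correction prescribed for that case.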
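The per-configuration constants of the resampler can be collected in a small helper. This is a sketch under stated assumptions: the function name `resampler_params` is hypothetical, P = 192 kHz / fs is inferred from the 15n/P index in the formula above, and 44.1 kHz is treated like 48 kHz. The filter itself is not reproduced here because it needs the tab_resamp_filter coefficients from Section 3.7.6.

```python
def resampler_params(fs_hz, frame_ms):
    """Constants for the 12.8 kHz resampler (illustrative helper)."""
    # Assumption: 44.1 kHz sessions reuse the 48 kHz configuration.
    fs_eff = 48000 if fs_hz == 44100 else fs_hz
    return {
        "len_12_8": int(frame_ms * 128 / 10),      # samples per frame at 12.8 kHz
        "P": 192000 // fs_eff,                     # upsampling factor to 192 kHz
        "resfac": 0.5 if fs_hz == 8000 else 1.0,   # extra scaling for 8 kHz input
    }

print(resampler_params(16000, 10))
print(resampler_params(8000, 7.5))
```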
3.3.9.4 High-Pass Filtering and Delay
The 12.8 kHz signal is high-pass filtered with a 2nd-order IIR at 50 Hz cutoff to remove DC and very low frequency content that would corrupt pitch detection.
After filtering, the signal is delayed by DLTPF samples (24 for 10ms, 44 for 7.5ms) to produce x̃12.8_D(n). This delay aligns the pitch detection window with the actual audio content being encoded.
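The filter-then-delay stage can be sketched as follows. This is not the specification's filter: the text only says "2nd-order IIR at 50 Hz", so the sketch assumes a Butterworth design via the bilinear transform, and the function names (`butterworth_highpass`, `biquad`, `delay`) are hypothetical. A real encoder would take the coefficients from the specification.

```python
import math

def butterworth_highpass(fc_hz, fs_hz):
    """2nd-order Butterworth high-pass coefficients (assumed design)."""
    k = math.tan(math.pi * fc_hz / fs_hz)
    norm = 1.0 / (1.0 + math.sqrt(2.0) * k + k * k)
    b = (norm, -2.0 * norm, norm)
    a = (1.0, 2.0 * (k * k - 1.0) * norm,
         (1.0 - math.sqrt(2.0) * k + k * k) * norm)
    return b, a

def biquad(x, b, a, state=(0.0, 0.0)):
    """Filter one frame (transposed direct form II); returns (y, new_state)."""
    s1, s2 = state
    y = []
    for v in x:
        out = b[0] * v + s1
        s1 = b[1] * v - a[1] * out + s2
        s2 = b[2] * v - a[2] * out
        y.append(out)
    return y, (s1, s2)

def delay(x, history, d):
    """Delay a frame by d samples; history holds the last d input samples."""
    assert len(history) == d
    joined = list(history) + list(x)
    return joined[:len(x)], joined[len(x):]

b, a = butterworth_highpass(50.0, 12800.0)
frame = [1.0] * 128                       # a pure DC frame
filtered, hp_state = biquad(frame, b, a)
delayed, hist = delay(filtered, [0.0] * 24, 24)   # D_LTPF = 24 for 10 ms
```

Carrying `hp_state` and `hist` from one frame to the next keeps both the filter and the delay line continuous across frame boundaries.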
3.3.9.5 Pitch Detection Algorithm
Pitch detection runs at 6.4 kHz (the delayed signal is further downsampled by 2) and compares two candidate lags: a global candidate T1 and a continuity candidate T2 close to the previous frame's lag.
Step 1: Autocorrelation at 6.4 kHz
R6.4(k) is the autocorrelation of x6.4 for lags k = 17..114. Lag 17 at 6.4 kHz corresponds to a pitch of ~376 Hz (voice maximum). Lag 114 corresponds to ~56 Hz (voice minimum).
Step 2: Weighted search — T1
The weighted autocorrelation R6.4_w(k) = R6.4(k) × w(k) applies a decreasing weight w(k) = 1 − 0.5 × (k − kmin)/(kmax − kmin) that biases the search toward shorter pitch lags (higher pitched voices). T1 = argmax of R6.4_w.
Step 3: Unweighted local search — T2
T2 = argmax of R6.4 within ±4 samples of the previous frame’s pitch lag Tprev. This provides temporal continuity — sudden pitch jumps are avoided.
Step 4: Final pitch Tcurr
If normcorr(x6.4, corrlen, T2) ≤ 0.85 × normcorr(x6.4, corrlen, T1):
Tcurr = T1 (global max is significantly better)
Else: Tcurr = T2 (local continuity wins)
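The four steps above can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the function names are hypothetical, the buffer layout (past samples stored before index `start`) is an assumption, and `normcorr` is assumed to be the usual normalized cross-correlation clamped at 0 from below, which the text does not spell out.

```python
import math

K_MIN, K_MAX = 17, 114  # lag range at 6.4 kHz, from the text

def autocorr(x, start, length, lag):
    """R(lag) over the window x[start : start + length]."""
    return sum(x[start + n] * x[start + n - lag] for n in range(length))

def normcorr(x, start, length, lag):
    """Normalized correlation, clamped at 0 (assumed definition)."""
    num = autocorr(x, start, length, lag)
    e_cur = sum(x[start + n] ** 2 for n in range(length))
    e_lag = sum(x[start + n - lag] ** 2 for n in range(length))
    den = math.sqrt(e_cur * e_lag)
    return max(0.0, num / den) if den > 0.0 else 0.0

def detect_pitch(x, start, length, t_prev):
    """Return Tcurr; x must hold at least K_MAX past samples before start."""
    # Step 1: autocorrelation for every lag in the search range
    r = {k: autocorr(x, start, length, k) for k in range(K_MIN, K_MAX + 1)}
    # Step 2: weighted global search, favouring short lags
    def weight(k):
        return 1.0 - 0.5 * (k - K_MIN) / (K_MAX - K_MIN)
    t1 = max(r, key=lambda k: r[k] * weight(k))
    # Step 3: unweighted search within +/-4 of the previous lag
    lo, hi = max(K_MIN, t_prev - 4), min(K_MAX, t_prev + 4)
    t2 = max(range(lo, hi + 1), key=lambda k: r[k])
    # Step 4: keep T2 unless T1 correlates clearly better
    if normcorr(x, start, length, t2) <= 0.85 * normcorr(x, start, length, t1):
        return t1
    return t2

# 160 Hz sine at 6.4 kHz: period of 40 samples
signal = [math.sin(2.0 * math.pi * i / 40.0) for i in range(K_MAX + 80)]
print(detect_pitch(signal, K_MAX, 80, t_prev=40))
```

With a purely periodic test signal, a previous lag of 80 (two periods) is also retained by step 4, because its normalized correlation is as good as that of T1. That is the continuity behaviour the rule is designed for.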
3.3.9.6 LTPF Bitstream
The LTPF bitstream is either 1 bit or 11 bits depending on whether pitch is present:
| Condition | pitch_present | nbits_LTPF | Bits Transmitted |
|---|---|---|---|
| normcorr(x6.4, corrlen, Tcurr) ≤ 0.6 | 0 | 1 | Only pitch_present = 0 |
| normcorr > 0.6 | 1 | 11 | pitch_present=1 + 9-bit pitch_index + 1-bit ltpf_active |
When pitch is not present (no strong periodic signal), there is no point transmitting pitch information. This saves 10 bits per frame.
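The table translates into a small decision function. The sketch below only models which fields are present and how many bits they cost; it makes no claim about the order in which the fields are written to the bitstream, and the name `ltpf_side_info` is hypothetical.

```python
def ltpf_side_info(nc, pitch_index=0, ltpf_active=0):
    """Return (fields, nbits_LTPF) for a normalized correlation nc."""
    if nc <= 0.6:
        # No strong periodicity: only the presence flag is sent
        return {"pitch_present": 0}, 1
    if not 0 <= pitch_index <= 511:
        raise ValueError("pitch_index must fit in 9 bits")
    fields = {
        "pitch_present": 1,
        "pitch_index": pitch_index,        # 9 bits
        "ltpf_active": int(bool(ltpf_active)),  # 1 bit
    }
    return fields, 1 + 9 + 1

print(ltpf_side_info(0.45))
print(ltpf_side_info(0.82, pitch_index=300, ltpf_active=1))
```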
3.3.9.7 LTPF Pitch-Lag Parameter
The pitch lag is refined from the 6.4 kHz estimate Tcurr to a higher-resolution estimate at 12.8 kHz to get a more accurate fractional pitch lag. The integer part pitch_int is the argmax of R12.8(k) — the autocorrelation of the full 12.8 kHz delayed signal — in a search window around 2×Tcurr.
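The integer refinement can be sketched as a short local search. The text does not give the exact window, so the values here are assumptions: a half-width of 4 samples around 2×Tcurr and a lag range of 32 to 228 at 12.8 kHz (32 is the smallest pitch_int mentioned below, 228 is twice the maximum 6.4 kHz lag). The function name is hypothetical.

```python
import math

def refine_integer_lag(x, start, length, t_curr,
                       half_width=4, lag_min=32, lag_max=228):
    """Return pitch_int: argmax of R12.8(k) near 2 * t_curr.

    x must hold at least lag_max past samples before index start.
    """
    lo = max(lag_min, 2 * t_curr - half_width)
    hi = min(lag_max, 2 * t_curr + half_width)

    def r(k):
        return sum(x[start + n] * x[start + n - k] for n in range(length))

    return max(range(lo, hi + 1), key=r)

# 160 Hz sine at 12.8 kHz: period of 80 samples, so Tcurr = 40 at 6.4 kHz
signal = [math.sin(2.0 * math.pi * i / 80.0) for i in range(228 + 160)]
print(refine_integer_lag(signal, 228, 160, t_curr=40))
```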
The fractional part pitch_fr refines the pitch lag to at most 1/4 sample precision using interpolated autocorrelation values (from tab_ltpf_interp_R in Section 3.7.6). The resolution gets coarser as the lag grows:
If pitch_int ≥ 157: pitch_fr = 0 (no fractional search)
If 127 ≤ pitch_int < 157: search d in {−2, 0, 2}
If 33 ≤ pitch_int < 127: search d in {−3, −2, −1, 0, 1, 2, 3}
If pitch_int = 32: search d in {0, 1, 2, 3}
pitch_fr = argmax interp(d) over the allowed range
If pitch_fr is negative, pitch_int is decremented and pitch_fr += 4 to keep the fractional part in [0, 3].
The final bitstream index pitch_index (9 bits, value 0–511) encodes both pitch_int and pitch_fr:
If pitch_int ≥ 157: pitch_index = pitch_int + 283
If 127 ≤ pitch_int < 157: pitch_index = 2×pitch_int + floor(pitch_fr/2) + 126
If pitch_int < 127: pitch_index = 4×pitch_int + pitch_fr − 128
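The index mapping is easiest to trust after a round-trip check. The sketch below is illustrative and uses hypothetical function names. Two parts are assumptions rather than quotes from the text: the branch for pitch_int ≥ 157 (integer lag only, pitch_index = pitch_int + 283, which is what makes the three ranges fill 0 to 511 without gaps), and the decoder, which is simply the algebraic inverse of the encoder formulas.

```python
def allowed_fractions(pitch_int):
    """Candidate fractional offsets d (in 1/4 samples) for a given pitch_int."""
    if pitch_int >= 157:
        return [0]                      # assumption: integer lags only
    if pitch_int >= 127:
        return [-2, 0, 2]
    if pitch_int > 32:
        return [-3, -2, -1, 0, 1, 2, 3]
    return [0, 1, 2, 3]

def normalize(pitch_int, pitch_fr):
    """Fold a negative fraction into the integer part."""
    if pitch_fr < 0:
        return pitch_int - 1, pitch_fr + 4
    return pitch_int, pitch_fr

def encode_pitch_index(pitch_int, pitch_fr):
    if pitch_int >= 157:
        return pitch_int + 283          # assumption, see lead-in
    if pitch_int >= 127:
        return 2 * pitch_int + pitch_fr // 2 + 126
    return 4 * pitch_int + pitch_fr - 128

def decode_pitch_index(index):
    """Inverse of encode_pitch_index, derived from the formulas above."""
    if index >= 440:
        return index - 283, 0
    if index >= 380:
        pitch_int = index // 2 - 63
        return pitch_int, 2 * index - 4 * pitch_int - 252
    pitch_int = index // 4 + 32
    return pitch_int, index - 4 * pitch_int + 128

# A lag of 126.75 samples found as pitch_int = 127 with d = -1 is not allowed
# (127 only permits -2, 0, 2), but 127 with d = -2 folds to (126, 2):
print(encode_pitch_index(*normalize(127, -2)))
```

Running the encoder over every allowed (pitch_int, d) pair for pitch_int from 32 to 228 should hit each 9-bit value, which is a quick sanity check on the range boundaries.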
3.3.9.8 LTPF Activation Bit
The activation bit ltpf_active signals whether the LTPF should actually filter the audio at the decoder. The filter is only active when the pitch-periodic structure is strong enough to benefit from it, and the gain table yields a non-zero gain:
ltpf_active = 1 if gain_ltpf ≠ 0 AND any of these conditions hold:
• Was inactive last frame AND (Nms=10 OR penultimate nc > 0.94) AND last nc > 0.94 AND nc > 0.94
• Was active last frame AND nc > 0.9
• Was active last frame AND |pitch_curr − pitch_prev| < 2 AND (nc − nc_prev) > −0.1 AND nc > 0.84
The logic is: activate LTPF only for strongly periodic signals, and keep it active (with a relaxed threshold) once activated to avoid flickering on and off. When near_nyquist_flag is set, ltpf_active is forced to 0 regardless of these conditions.
The nonzero gain_ltpf values (0.4, 0.35, 0.3, or 0.25) are looked up from the bitrate in Section 3.4.9.4. Higher bitrates use a lower LTPF gain since the spectrum is already well quantized, and at the highest bitrates the gain drops to 0, which disables the filter.
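The activation rule is a small state machine with hysteresis, which a sketch makes easier to follow. The function and parameter names below are hypothetical: `nc`, `nc_prev` and `nc_prev2` stand for the normalized correlation of the current, last and penultimate frame, and `pitch` is assumed to be the lag in samples including its fractional part.

```python
def decide_ltpf_active(gain_ltpf, near_nyquist_flag, frame_ms,
                       nc, nc_prev, nc_prev2,
                       pitch, pitch_prev, was_active):
    """Return 1 if the decoder-side LTPF should run for this frame."""
    if gain_ltpf == 0 or near_nyquist_flag:
        return 0
    was_active = bool(was_active)
    # Switching on: needs sustained, very strong periodicity
    switch_on = (not was_active
                 and (frame_ms == 10 or nc_prev2 > 0.94)
                 and nc_prev > 0.94 and nc > 0.94)
    # Staying on: a relaxed threshold is enough
    stay_on = was_active and nc > 0.9
    # Staying on with a stable pitch: threshold relaxed further
    stay_on_stable = (was_active
                      and abs(pitch - pitch_prev) < 2
                      and (nc - nc_prev) > -0.1
                      and nc > 0.84)
    return int(switch_on or stay_on or stay_on_stable)

# Inactive filter, 10 ms frames, two strongly periodic frames in a row
print(decide_ltpf_active(0.4, False, 10, 0.96, 0.95, 0.50, 80.0, 80.0, False))
```

The gap between the switch-on threshold (0.94) and the stay-on thresholds (0.9 and 0.84) is what prevents the filter from toggling on borderline frames.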
