Pitch Tracker

Tier: Analysis | ComponentType: 36 | Params: 3

Monophonic YIN pitch detection with cumulative mean normalized difference, parabolic interpolation, and median filtering. Audio passes through unchanged.

Overview

PitchTracker implements the YIN algorithm for fundamental frequency estimation. It accumulates input samples into a 2048-sample frame buffer and runs analysis every 512 samples (hop size). The algorithm computes the difference function across half the frame, normalizes it with cumulative mean normalization (CMNDF), then searches for the first dip below a confidence threshold. Parabolic interpolation refines the lag estimate to sub-sample accuracy, and a 5-element median filter stabilizes the output against octave jumps.

This is an analysis-only component — audio passes through the process() method unchanged. All pitch information is emitted via the snapshot pipeline. The detected frequency is converted to a MIDI note number for display, and a 32-element note history ring buffer provides a scrolling pitch trace.

The Threshold parameter controls how strict detection is. A lag is accepted only when its CMNDF value falls below the threshold, so lower values require stronger periodicity (fewer detections but higher confidence), while higher values accept weaker pitch candidates (more detections but more false positives). The frequency range is bounded by Min Freq and Max Freq, which convert to lag search bounds internally.

File Locations

File	Path
Header	`Sources/FolioDSP/include/FolioDSP/Analysis/PitchTracker.h`
Implementation	`Sources/FolioDSP/src/Analysis/PitchTracker.cpp`
Tests	`Tests/FolioDSPTests/PitchTrackerTests.swift`
Bridge	`Sources/FolioDSPBridge/src/FolioDSPBridge.mm` (PitchTrackerBridge)

Parameters

Index	Name	Description	Min	Max	Default Min	Default Max	Default	Unit
0	Threshold	YIN detection threshold (lower = stricter, fewer detections)	0.01	0.8	0.05	0.5	0.15
1	Min Freq	Minimum detectable frequency	20.0	500.0	30.0	200.0	50.0	Hz
2	Max Freq	Maximum detectable frequency	200.0	8000.0	500.0	5000.0	2000.0	Hz

Processing Algorithm

The process() function accumulates samples and triggers analysis every hop. The analysis pipeline executes these steps:

1. Frame Accumulation

Samples are written into a 2048-sample circular buffer. Every 512 samples (hop size), analyzeFrame() runs:

\[\text{hop} = 512, \quad N = 2048\]
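The accumulation step can be sketched as follows. This is a minimal illustration, not the code in PitchTracker.cpp: the struct name, its members, and the `framesAnalyzed` counter (a stand-in for calling analyzeFrame()) are all hypothetical.

```cpp
#include <array>
#include <cstddef>

// Hypothetical accumulator; names and layout are illustrative only.
struct FrameAccumulator {
    static constexpr std::size_t kFrameSize = 2048; // N
    static constexpr std::size_t kHopSize = 512;    // hop

    std::array<float, kFrameSize> ring{};  // circular sample storage
    std::size_t writeIndex = 0;            // next write position in ring
    std::size_t samplesSinceHop = 0;       // counts up to kHopSize
    std::size_t framesAnalyzed = 0;        // stand-in for analyzeFrame() calls

    // Stores one sample and returns it unchanged (passthrough).
    float process(float x) {
        ring[writeIndex] = x;
        writeIndex = (writeIndex + 1) % kFrameSize;
        if (++samplesSinceHop == kHopSize) {
            samplesSinceHop = 0;
            ++framesAnalyzed; // a real implementation would run analysis here
        }
        return x;
    }

    // Copies the ring into `out` in chronological order (oldest sample first).
    void linearize(std::array<float, kFrameSize>& out) const {
        for (std::size_t i = 0; i < kFrameSize; ++i)
            out[i] = ring[(writeIndex + i) % kFrameSize];
    }
};
```

Linearizing before analysis keeps the later steps free of wrap-around indexing, at the cost of one copy per hop.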

2. Difference Function

The squared difference function measures how similar a signal is to a time-shifted version of itself:

\[d(\tau) = \sum_{i=0}^{N/2 - 1} \left[ x[i] - x[i + \tau] \right]^2\]

At the true period \(\tau_0\), \(d(\tau_0)\) reaches a local minimum (zero for a perfectly periodic signal) because the signal aligns with its shifted copy.
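A direct, unoptimized sketch of the difference function is shown below. The helper name and signature are assumptions, and it evaluates lags 0 through W - 1, where W is half the frame length; the real implementation may use a different range or a faster method.

```cpp
#include <cstddef>
#include <vector>

// Computes d(tau) for tau in [0, W), where W = frame.size() / 2.
// Illustrative helper; the name and signature are assumptions.
std::vector<float> differenceFunction(const std::vector<float>& frame) {
    const std::size_t W = frame.size() / 2;
    std::vector<float> d(W, 0.0f); // d[0] stays 0: a signal matches itself
    for (std::size_t tau = 1; tau < W; ++tau) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < W; ++i) {
            const float delta = frame[i] - frame[i + tau];
            sum += delta * delta;
        }
        d[tau] = sum;
    }
    return d;
}
```

For a frame that repeats every 4 samples, `d[4]` comes out as zero while `d[2]` does not, which is the dip the later steps look for.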

3. Cumulative Mean Normalized Difference (CMNDF)

The raw difference function is normalized to remove the dependence on signal amplitude:

\[d'(\tau) = \begin{cases} 1 & \text{if } \tau = 0 \\ d(\tau) \cdot \frac{\tau}{\sum_{j=1}^{\tau} d(j)} & \text{otherwise} \end{cases}\]

This normalization fixes \(d'(0) = 1\) by definition, so the trivial minimum \(d(0) = 0\) cannot be selected. Values of \(d'(\tau) < 1\) mean the difference at lag \(\tau\) is smaller than the average difference over lags 1 through \(\tau\), which indicates periodicity.
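The normalization needs only a running sum, as in this sketch. The function name is hypothetical, and mapping a zero running sum (for example, silent input) to 1 is an assumption made here to avoid dividing by zero.

```cpp
#include <cstddef>
#include <vector>

// Converts a raw difference function d (indexed by lag) into the CMNDF d'.
// Assumption: a zero running sum maps to 1, i.e. "no periodicity found".
std::vector<float> cumulativeMeanNormalize(const std::vector<float>& d) {
    std::vector<float> cmndf(d.size(), 1.0f); // d'(0) = 1 by definition
    float runningSum = 0.0f;
    for (std::size_t tau = 1; tau < d.size(); ++tau) {
        runningSum += d[tau];
        if (runningSum > 0.0f)
            cmndf[tau] = d[tau] * static_cast<float>(tau) / runningSum;
    }
    return cmndf;
}
```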

4. Absolute Threshold Search

The algorithm finds the first lag \(\tau\) in the valid range where the CMNDF falls below the threshold, then continues to the local minimum:

\[\tau_{\min} = \left\lfloor \frac{f_s}{f_{\max}} \right\rfloor, \quad \tau_{\max} = \left\lfloor \frac{f_s}{f_{\min}} \right\rfloor\]

\[\text{Find first } \tau \in [\tau_{\min}, \tau_{\max}] \text{ where } d'(\tau) < \text{threshold}\]

\[\text{Then advance to local minimum: while } d'(\tau+1) < d'(\tau), \; \tau \leftarrow \tau + 1\]
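The search can be sketched as below. The function name, the -1 "not found" return value, and the clamping of the lag bounds to the available CMNDF indices are assumptions for illustration; the actual code may handle these cases differently.

```cpp
#include <cstddef>
#include <vector>

// Returns the chosen lag, or -1 when no lag in [tauMin, tauMax] dips below
// the threshold. Assumption: bounds are clamped to valid CMNDF indices.
int findLag(const std::vector<float>& cmndf, float sampleRate,
            float minFreq, float maxFreq, float threshold) {
    if (cmndf.size() < 2 || minFreq <= 0.0f || maxFreq <= 0.0f) return -1;
    const int last = static_cast<int>(cmndf.size()) - 1;
    int tauMin = static_cast<int>(sampleRate / maxFreq); // floor for positives
    int tauMax = static_cast<int>(sampleRate / minFreq);
    if (tauMin < 1) tauMin = 1;
    if (tauMax > last) tauMax = last;

    for (int tau = tauMin; tau <= tauMax; ++tau) {
        if (cmndf[tau] < threshold) {
            // Walk downhill to the local minimum.
            while (tau + 1 <= tauMax && cmndf[tau + 1] < cmndf[tau]) ++tau;
            return tau;
        }
    }
    return -1;
}
```

Taking the first dip rather than the global minimum favors the shortest plausible period, so a deeper dip at a longer lag is ignored.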

5. Parabolic Interpolation

Sub-sample accuracy is achieved by fitting a parabola through three points around the best lag:

\[s_0 = d'(\tau - 1), \quad s_1 = d'(\tau), \quad s_2 = d'(\tau + 1)\]

\[\tau_{\text{refined}} = \tau + \frac{s_2 - s_0}{2(2s_1 - s_2 - s_0)}\]
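A sketch of the refinement, using the same symbols as the formula above. The function name is hypothetical, and returning the integer lag unchanged at the array edges or when the denominator is zero is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Refines an integer lag using the parabola through its two neighbours.
// Assumption: the integer lag is returned as-is at the array edges or when
// the three points are collinear (zero denominator).
float refineLag(const std::vector<float>& cmndf, std::size_t tau) {
    if (tau == 0 || tau + 1 >= cmndf.size()) return static_cast<float>(tau);
    const float s0 = cmndf[tau - 1];
    const float s1 = cmndf[tau];
    const float s2 = cmndf[tau + 1];
    const float denom = 2.0f * (2.0f * s1 - s2 - s0);
    if (denom == 0.0f) return static_cast<float>(tau);
    return static_cast<float>(tau) + (s2 - s0) / denom;
}
```

When the three samples lie exactly on a parabola, the vertex is recovered exactly; for example, samples of (x - 3.25)^2 at x = 2, 3, 4 refine lag 3 to 3.25.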

6. Frequency Estimation

The fundamental frequency is the sample rate divided by the refined lag:

\[f = \frac{f_s}{\tau_{\text{refined}}}\]
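As a worked example of why the fractional lag matters, assume \(f_s = 44100\) Hz and a 4 kHz tone, whose true period is 11.025 samples. Adjacent integer lags are far apart in frequency:

\[\frac{44100}{11} \approx 4009.1\ \text{Hz}, \quad \frac{44100}{12} = 3675\ \text{Hz}, \quad \frac{44100}{11.025} = 4000\ \text{Hz}\]

Rounding to the nearest integer lag would report about 4009 Hz, an error of roughly 9 Hz, and the next integer lag is more than 300 Hz away.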

7. Median Filter

A 5-element median filter smooths the frequency output to reject octave jumps and spurious detections:

\[f_{\text{filtered}} = \text{median}(f_{n}, f_{n-1}, f_{n-2}, f_{n-3}, f_{n-4})\]
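A 5-element median with insertion sort might look like the sketch below. The class name is hypothetical, and starting the history at zero is an assumption; the real filter may seed or gate its history differently.

```cpp
#include <array>
#include <cstddef>

// Hypothetical 5-tap median filter; history starts at zero (an assumption).
class MedianFilter5 {
public:
    // Adds one frequency estimate and returns the median of the last five.
    float push(float value) {
        history_[next_] = value;
        next_ = (next_ + 1) % history_.size();

        std::array<float, 5> sorted = history_;
        // Insertion sort: cheap and allocation-free for five elements.
        for (std::size_t i = 1; i < sorted.size(); ++i) {
            const float key = sorted[i];
            std::size_t j = i;
            while (j > 0 && sorted[j - 1] > key) {
                sorted[j] = sorted[j - 1];
                --j;
            }
            sorted[j] = key;
        }
        return sorted[2];
    }

private:
    std::array<float, 5> history_{};
    std::size_t next_ = 0;
};
```

With a steady 220 Hz input, a single 440 Hz outlier never reaches the output because it occupies only one of the five slots.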

8. MIDI Note Conversion

The detected frequency is converted to a MIDI note number:

\[m = 69 + 12 \cdot \log_2\!\left(\frac{f}{440}\right)\]
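The conversion is a one-liner in code. The function name is hypothetical, and returning 0 for non-positive input is an assumption made here to avoid taking the logarithm of a non-positive value.

```cpp
#include <cmath>

// Fractional MIDI note number for a frequency in Hz (A4 = 440 Hz = note 69).
// Assumption: non-positive input returns 0.
float frequencyToMidi(float hz) {
    if (hz <= 0.0f) return 0.0f;
    return 69.0f + 12.0f * std::log2(hz / 440.0f);
}
```

The result is fractional, so a slightly sharp A4 yields a value just above 69 rather than snapping to the nearest note.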

9. Confidence

Confidence is derived from the CMNDF value at the best lag. Lower CMNDF values indicate stronger periodicity:

\[\text{confidence} = \text{clamp}(1 - d'(\tau_{\text{best}}), \; 0, \; 1)\]

When no valid lag is found, confidence decays exponentially:

\[\text{confidence} \leftarrow \text{confidence} \times 0.95\]
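Both branches of the confidence update fit in one small function. This is a sketch: the function name and its arguments are illustrative and are not taken from the implementation.

```cpp
#include <algorithm>

// Updates the running confidence for one hop.
// `bestCmndf` is d'(tau_best); `found` is false when no lag passed the
// threshold. The name and arguments are illustrative assumptions.
float updateConfidence(float previous, bool found, float bestCmndf) {
    if (!found) return previous * 0.95f; // geometric decay per hop
    // clamp(1 - d'(tau_best), 0, 1)
    return std::max(0.0f, std::min(1.0f, 1.0f - bestCmndf));
}
```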

10. Audio Passthrough

The input sample is returned unchanged. PitchTracker is a pure analysis component with no effect on the audio signal.

Core Equations

\[d(\tau) = \sum_{i=0}^{N/2-1} [x[i] - x[i+\tau]]^2\]

\[d'(\tau) = d(\tau) \cdot \frac{\tau}{\sum_{j=1}^{\tau} d(j)}\]

\[f = \frac{f_s}{\tau + \frac{s_2 - s_0}{2(2s_1 - s_2 - s_0)}}\]

\[m = 69 + 12 \cdot \log_2(f / 440)\]

Snapshot Fields

Field	Type	Range	Unit	Description
Frequency	Float	20–8000	Hz	Detected fundamental frequency
Confidence	Float	0–1		Detection confidence (1 - CMNDF at best lag)
MIDI Note	Float	0–127		Frequency converted to MIDI note number
Tracking	Bool	0–1		Whether a pitch is currently being tracked
Note History	Float[32]	0–127		Ring buffer of recent MIDI note detections
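The Note History field can be pictured as a fixed-size ring that overwrites its oldest entry. The sketch below is hypothetical: the struct, its members, and the `recent` accessor are not part of the documented snapshot API.

```cpp
#include <array>
#include <cstddef>

// Hypothetical 32-slot note history; names are illustrative only.
struct NoteHistory {
    std::array<float, 32> notes{};
    std::size_t head = 0; // index of the next slot to overwrite

    void push(float midiNote) {
        notes[head] = midiNote;
        head = (head + 1) % notes.size();
    }

    // Returns the note written `age` pushes ago (0 = most recent).
    // Valid for age < 32.
    float recent(std::size_t age) const {
        return notes[(head + notes.size() - 1 - age) % notes.size()];
    }
};
```

A display would read the entries from oldest to newest to draw the scrolling trace.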

Implementation Notes

Frame size is 2048 samples with a hop size of 512, giving an analysis rate of ~86 Hz at 44.1 kHz. Because the YIN search spans half the frame, the longest usable lag is about 1024 samples, which limits the lowest detectable frequency to approximately sampleRate / 1024 (~43 Hz at 44.1 kHz).
CMNDF is the key innovation of YIN over raw autocorrelation — it eliminates the need for an explicit threshold on the difference function magnitude, instead normalizing so that values below 1.0 indicate periodicity.
Parabolic interpolation provides fractional-lag precision essential for accurate pitch tracking at high frequencies, where a single-sample error represents a large frequency difference.
Median filter uses a fixed 5-element window with insertion sort. This rejects isolated octave errors (e.g., detecting the second harmonic instead of the fundamental) without adding significant latency.
Confidence decay at 0.95 per hop means tracking is declared lost after approximately 20 hops without a valid detection (~232 ms at 44.1 kHz, since each 512-sample hop lasts ~11.6 ms).
All parameters use std::atomic<float> for lock-free thread safety.
Snapshot emission is decimated to ~60 fps (every 735 samples at 44.1 kHz).

Equation Summary

f = YIN(CMNDF); passthrough