I believe the filter is working as intended. Because it is a 1-pass filter, it can only react to changes in loudness gradually and after a delay, so the final result will never perfectly match the target loudness. I expect you will find that it works better for some clips and worse for others, depending on the dynamics of each clip.
If you need the final result to be exactly -23 LUFS, I recommend the 2-pass filter.
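As a rough sketch of what the 2-pass approach looks like, assuming the filter in question is FFmpeg's loudnorm: the first pass is run with print_format=json to measure the clip, and those measurements are fed back in on the second pass. The helper name second_pass_filter and the default TP and LRA targets below are my own choices for illustration, not something from your setup.

```python
import json

def second_pass_filter(first_pass_json, target_i=-23.0, target_tp=-2.0, target_lra=7.0):
    """Build a second-pass loudnorm filter string from first-pass measurements.

    Assumes first_pass_json is the JSON object that loudnorm prints when the
    first pass is run with print_format=json (hypothetical helper).
    """
    m = json.loads(first_pass_json)
    return (
        f"loudnorm=I={target_i}:TP={target_tp}:LRA={target_lra}"
        f":measured_I={m['input_i']}:measured_TP={m['input_tp']}"
        f":measured_LRA={m['input_lra']}:measured_thresh={m['input_thresh']}"
        f":offset={m['target_offset']}:linear=true"
    )
```

You would pass the returned string to the audio filter option of the second ffmpeg run. Whether the result lands exactly on target still depends on the clip, so it is worth measuring the output afterwards.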
If you want to test further, it might be interesting to try your test with a fixed tone (where the loudness is constant for the duration of the clip).
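If you don't have a suitable test file handy, here is a minimal sketch that writes a constant-level sine tone to a WAV file using only the Python standard library. The function name write_tone and the defaults (1 kHz, -20 dBFS peak, 10 seconds, 48 kHz mono) are arbitrary assumptions on my part; pick whatever suits your test.

```python
import math
import struct
import wave

def write_tone(path, freq_hz=1000.0, level_dbfs=-20.0, seconds=10.0, rate=48000):
    """Write a mono 16-bit sine tone with constant amplitude; return the frame count."""
    # Convert the peak level from dBFS to a 16-bit sample amplitude.
    amplitude = 10 ** (level_dbfs / 20.0) * 32767
    n = int(seconds * rate)
    frames = bytearray()
    for i in range(n):
        sample = int(round(amplitude * math.sin(2 * math.pi * freq_hz * i / rate)))
        frames += struct.pack("<h", sample)  # little-endian signed 16-bit
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(bytes(frames))
    return n
```

Calling write_tone("tone.wav") gives you a clip whose loudness does not change over time, so any remaining gap between the output and the target should not be down to the clip's dynamics.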