Allocate filter_buf for two scan lines that we do all the filter
processing in, then do all other conversions (bit depth,
endianness, inserting alpha=255 values) on the way out.
Separating the two concerns makes everything much clearer.
This formulation is equivalent to the original (reference)
implementation but runs _significantly_ faster - this speeds up
the filtering portion of a Paeth-heavy 8192x8192 16-bit/channel
image by a factor of more than 2 on a Zen2 CPU.
I'm investigating doing a more thorough restructuring of this pass,
but this seems like a good first step.
I want to make some changes to the PNG loader, this is to get some
test coverage at least to make it easier not to break anything.
The two ref_results files that are "corrupt" files that stb_image
nevertheless loads without error are checksum failures; this is
by design, since stb_image does not verify checksums.
We speculatively try to fill our bit buffer to always contain
at least 16 bits for stbi__zhuffman_decode. It's not a sign of
a malformed stream for us to be reading past the end there,
because the contents of that bit buffer are speculative; it's
only a malformed stream if we actually _consume_ the extra bits.
This fix adds some extra logic where we the first time we hit
zeof, we add an explicit 16 extra zero bits at the top of the
bit buffer just so that for the purposes of the decoder, we have
16 bits in the buffer.
However, if at the end of stream, we have the "hit zeof once"
flag set and less than 16 bits remaining in the bit buffer, we
know some of those implicit zero bits got read, which indicates
we actually had a past-end-of-stream read. In that case, flag
it as an error.
While I'm in here, also rephrase the length-too-large check to
not do any potentially-overflowing pointer arithmetic.
Fixes issue #1456.