Consider three cases, just to suggest the spectrum of possibilities:

   a) linear upsample: each output pixel is a weighted sum of 4 input pixels
   b) cubic upsample: each output pixel is a weighted sum of 16 input pixels
   c) downsample by N with box filter: each output pixel is a weighted sum
      of NxN input pixels, and N can be very large

Now, suppose you want to handle 8-bit input, 16-bit input, and float input, and you want to do sRGB correction or not.

Suppose you create a temporary buffer of float pixels, say one scanline tall. Actually two temp buffers: one for the input and one for the output. You decode a scanline of the input into the input temp buffer, which is always linear floats. This isolates the handling of 8/16/float and sRGB to one place (and still allows you to make optimized 8-bit-sRGB-to-float lookup tables). This is also the place to put wrap logic, explicitly wrapping, reflecting, or replicating-from-edge the pixels that would come from off-edge.

You then accumulate the appropriate weighted sums into the output buffer, and you move on to the next scanline of the input.

The algorithm just described works directly for case (c). Suppose you're downsampling by 2.5; then output scanline 0 sums from input scanlines 0, 1, and 2; output scanline 1 sums from 2, 3, 4; output 2 from 5, 6, 7; output 3 from 7, 8, 9. Note how 2 and 7 get reused, but we don't have to recompute them, because we can make a single linear pass through the input and output at the same time.

Now, consider case (a). When upsampling, the same two input scanlines will get sampled from for multiple output scanlines. So, to avoid recomputing the input scanlines, we need either multiple input or multiple output temp buffer lines. Since the number of output lines a given pair of input scanlines might touch scales with the upsample amount, it makes more sense to use two input scanline buffers.
For cubic upsampling, you'll need four input scanline buffers, and in general the number of buffers will be limited by the max filter width, which is presumably hardcoded.

You want to avoid memory allocations (since you're passing in the target buffer already), so instead of using a scanline-width temp buffer, use some fixed-width temp buffer that's W pixels wide, and scale the image in vertical stripes that are that wide. Suppose you make the temp buffers 256 wide; then an upsample by 8 computes 256-pixel-wide output strips (from ~32-pixel-wide input strips), but a downsample by 8 computes ~32-pixel-wide output strips (from a 256-pixel-wide input strip). Note this limits the max down/upsampling to be ballpark 256x along the horizontal axis.

Function prototypes:

the highest-level one could be:

    stb_resample_8bit(uint8_t *dest, int dest_width, int dest_height,
                      uint8_t const *src, int src_width, int src_height,
                      int channels,
                      stbr_filter filter);

the lowest-level one could be:

    stb_resample_arbitrary(void *dest, stbr_type dest_type, int dest_width, int dest_height, int dest_stride_in_bytes,
                           void const *src, stbr_type src_type, int src_width, int src_height, int src_stride_in_bytes,
                           int channels,
                           int nonpremul_alpha_channel_index,
                           stbr_wrapmode wrap,  // clamp, wrap, mirror
                           stbr_filter filter,
                           float s0, float t0, float s1, float t1,  // range of source to use, 0..1 in GPU texture-coordinate style
                           void *tempmem, size_t tempmem_size_in_bytes);

And there would be a bunch of convenience functions at in-between levels.

Some notes:

Intermediate-level functions should be provided for each source type with the same dest type so that the code is typesafe; only when people fall back to stb_resample_arbitrary should they be at risk of type unsafety. (One way to deal with the explosion of functions for every possible type would be to define one function for each input type, and accept three separate output pointers, one for each type, only one of which can be non-NULL.)
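A minimal sketch of the three-output-pointers idea, assuming the three output types are uint8_t, uint16_t, and float. The names out_kind, pick_output, and store_sample are hypothetical and not part of any proposed API, and the integer conversion is a plain linear scale with no sRGB encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag for which of the three output pointers was supplied. */
typedef enum { OUT_INVALID = -1, OUT_U8, OUT_U16, OUT_FLOAT } out_kind;

/* Exactly one of the three pointers may be non-NULL; anything else is an error. */
static out_kind pick_output(const uint8_t *d8, const uint16_t *d16, const float *df)
{
    int count = (d8 != NULL) + (d16 != NULL) + (df != NULL);
    if (count != 1) return OUT_INVALID;
    return d8 ? OUT_U8 : d16 ? OUT_U16 : OUT_FLOAT;
}

/* Store one linear float sample at index i of whichever output was supplied.
   Integer outputs are clamped to [0,1] and rounded; float output is stored as-is.
   Returns 1 on success, 0 if the pointer combination is invalid. */
static int store_sample(float v, size_t i, uint8_t *d8, uint16_t *d16, float *df)
{
    out_kind kind = pick_output(d8, d16, df);
    float c = v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
    switch (kind) {
        case OUT_INVALID: return 0;
        case OUT_U8:      d8[i]  = (uint8_t)(c * 255.0f + 0.5f);    break;
        case OUT_U16:     d16[i] = (uint16_t)(c * 65535.0f + 0.5f); break;
        default:          assert(kind == OUT_FLOAT); df[i] = v;     break;
    }
    return 1;
}
```

The caller keeps compile-time type checking on every pointer it passes, and the runtime check only has to reject the case where zero or several outputs were given.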
nonpremul_alpha_channel_index:
   If this is negative, no channels are processed specially.
   If this is non-negative, then it's the index of the alpha channel, and the image should be treated as non-premultiplied alpha that needs to be resampled accounting for this (weight the sampling by the alpha channel, i.e. premultiply, filter, unpremultiply). This mechanism only allows one alpha channel, and ALL channels are scaled by it; an alternative would be to find some way to pass in which channels serve as alpha channels for which other channels, but eh.

s0,t0,s1,t1:
   This allows fine subpixel-positioning and subpixel-resizing in an explicit way, without things having to be exact pixel multiples. It allows people to pseudo-stream images by computing "tiles" of images a bit at a time without forcing those tiles to quantize their source data.

tempmem, tempmem_size_in_bytes:
   All functions will need tempmem, but they can allocate a fixed tempmem buffer on the stack. Providing an API that allows overriding the amount of tempmem available allows people to process arbitrarily large images. The return value of the function could be 0 on success, or non-0, being the size of tempmem needed.

Reference:

Cubic sampling function for separable cubic:

   f(x) = (a+2)*x^3 - (a+3)*x^2 + 1          for 0 <= x <= 1
   f(x) = a*x^3 - 5*a*x^2 + 8*a*x - 4*a      for 1 <  x <= 2
   f(x) = 0                                  otherwise

"a" is configurable, try -1/2 (from http://pixinsight.com/forum/index.php?topic=556.0 )
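The reference kernel above translates fairly directly into C. The following is a sketch only: the names cubic_weight and cubic_weights4 are hypothetical, and evaluating the kernel on the absolute value of x (so that taps on either side of the sample use the same curve) is an assumption the formulas above do not state explicitly.

```c
#include <assert.h>
#include <math.h>

/* Separable cubic kernel from the reference formulas, written in Horner form.
   Assumption: the kernel is symmetric, so it is evaluated on |x|. */
static float cubic_weight(float x, float a)
{
    x = fabsf(x);
    if (x <= 1.0f)
        return ((a + 2.0f) * x - (a + 3.0f)) * x * x + 1.0f;
    if (x <= 2.0f)
        return ((a * x - 5.0f * a) * x + 8.0f * a) * x - 4.0f * a;
    return 0.0f;
}

/* Hypothetical helper: weights for the four taps around a sample that lies a
   fraction t in [0,1) of the way from tap 1 to tap 2 along one axis. */
static void cubic_weights4(float t, float a, float w[4])
{
    assert(t >= 0.0f && t < 1.0f);
    w[0] = cubic_weight(1.0f + t, a);  /* tap one pixel before the left neighbor */
    w[1] = cubic_weight(t, a);         /* left neighbor  */
    w[2] = cubic_weight(1.0f - t, a);  /* right neighbor */
    w[3] = cubic_weight(2.0f - t, a);  /* tap one pixel after the right neighbor */
}
```

Because the filter is separable, the 16-pixel weighted sum for a cubic upsample can apply these four weights along one axis and then four more along the other. The four weights sum to 1 for any t, so no renormalization should be needed away from the image edges.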