This file lists itemized GMP development tasks. Not all the tasks listed here are suitable for volunteers, but many of them are. Please see the projects file for more sizeable projects.
_mpz_realloc with a small (1 limb) size.
mpz_XXX(a,a,a).
When the exponent is very large (near the limit of an mp_exp_t), the precision of __mp_bases[base].chars_per_bit_exactly is insufficient and mpf_get_str aborts. Detect and compensate.
Fix mpz_get_si to work properly for the MIPS N32 ABI (and other machines that use long long for storing limbs).
mpz_*_ui division routines: perhaps make them return the real remainder instead? That changes the return type to signed long int.
mpf_eq is not always correct when one operand is 1000000000... and the other operand is 0111111111..., i.e., extremely close. There is a special case in mpf_sub for this situation; put similar code in mpf_eq.
__builtin_constant_p is unavailable? Same problem with MacOS X.
dump_abort in mp?/tests/*.c.
mpz_get_si returns 0x80000000 for -0x100000000.
Test the most significant bit first and avoid count_leading_zeros if the bit is set. This is an optimization on all machines, and significant on machines with slow count_leading_zeros.
count_trailing_zeros is used on more or less uniformly distributed numbers. For some CPUs count_trailing_zeros is slow and it's probably worth handling the frequently occurring cases of 0 to 2 trailing zeros specially.
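
For illustration, a minimal C sketch of the special-casing idea (the helper name is made up, and __builtin_ctzl merely stands in for a slow count_trailing_zeros):

  /* Hypothetical helper, not GMP code: count trailing zeros of a nonzero
     word, testing the frequent 0, 1 and 2 zero cases before the slow path.  */
  static int
  ctz_fast_cases (unsigned long x)
  {
    if (x & 1)
      return 0;                 /* no trailing zeros: the most common case */
    if (x & 2)
      return 1;
    if (x & 4)
      return 2;
    return __builtin_ctzl (x);  /* slow path: 3 or more trailing zeros */
  }
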
Change the code that uses udiv_qrnnd for inverting limbs to instead use invert_limb.
Rewrite umul_ppmm to use floating-point for generating the most significant limb (if BITS_PER_MP_LIMB <= 52 bits). (Peter Montgomery has some ideas on this subject.)
Improve the default umul_ppmm code in longlong.h: add partial products with fewer operations.
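
For reference, the generic scheme computes four half-word partial products and recombines them; the task is to do that recombination with fewer operations. A self-contained rendering of the baseline (not the longlong.h code itself):

  #include <stdint.h>

  /* Portable 64x64->128 multiply from four 32-bit partial products.  */
  static void
  umul_ppmm_generic (uint64_t *hi, uint64_t *lo, uint64_t u, uint64_t v)
  {
    uint64_t ul = u & 0xffffffff, uh = u >> 32;
    uint64_t vl = v & 0xffffffff, vh = v >> 32;

    uint64_t x0 = ul * vl;          /* the four partial products */
    uint64_t x1 = ul * vh;
    uint64_t x2 = uh * vl;
    uint64_t x3 = uh * vh;

    x1 += x0 >> 32;                 /* cannot overflow */
    x1 += x2;                       /* may overflow: carry into x3 */
    if (x1 < x2)
      x3 += (uint64_t) 1 << 32;

    *hi = x3 + (x1 >> 32);
    *lo = (x1 << 32) | (x0 & 0xffffffff);
  }
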
Write new mpn_get_str and mpn_set_str running in the sub-O(n^2) range, using some divide-and-conquer approach, preferably without using division.
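
To illustrate the divide-and-conquer shape (only the shape: this sketch still divides, which the task would prefer to avoid, and it recomputes powers of the base instead of caching them), here is a decimal printer for nonnegative mpz_t values built on public calls:

  #include <stdio.h>
  #include <string.h>
  #include <gmp.h>

  /* Print n in decimal by splitting around 10^k and recursing.
     pad > 0 means "this is an interior piece: zero-pad to pad digits".  */
  static void
  dc_print (const mpz_t n, size_t pad)
  {
    size_t digits = mpz_sizeinbase (n, 10);   /* may overestimate by 1 */

    if (digits <= 16)                         /* small: quadratic method is fine */
      {
        char buf[20];
        size_t len;
        mpz_get_str (buf, 10, n);
        len = strlen (buf);
        for (; pad > len; pad--)
          putchar ('0');
        fputs (buf, stdout);
        return;
      }

    {
      size_t k = digits / 2;
      mpz_t pow, q, r;
      mpz_init (pow);
      mpz_init (q);
      mpz_init (r);
      mpz_ui_pow_ui (pow, 10, k);
      mpz_tdiv_qr (q, r, n, pow);             /* n = q*10^k + r */
      dc_print (q, pad > k ? pad - k : 0);    /* high piece */
      dc_print (r, k);                        /* low piece, zero-padded */
      mpz_clear (pow);
      mpz_clear (q);
      mpz_clear (r);
    }
  }

The top-level call is dc_print (n, 0) for n >= 0; real sub-quadratic code would choose the split points once and reuse the precomputed powers.
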
Copy the improved mpn_get_str to mpf/get_str. (Talk to Torbjörn about this.)
mpz_size, mpz_set_ui, mpz_set_q, mpz_clear, mpz_init, mpz_get_ui, mpz_scan0, mpz_scan1, mpz_getlimbn, mpz_init_set_ui, mpz_perfect_square_p, mpz_popcount, mpf_size, mpf_get_prec, mpf_set_prec_raw, mpf_set_ui, mpf_init, mpf_init2, mpf_clear, mpf_set_si.
mpz_powm and mpz_powm_ui aren't very fast on one or two limb moduli, due to a lot of function call overheads. These could perhaps be handled as special cases.
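
For instance, the one-limb case could drop to plain word arithmetic; a hedged sketch (assumes a compiler with unsigned __int128, and m >= 1):

  #include <stdint.h>

  /* Single-limb modular exponentiation by plain binary exponentiation,
     with a 128-bit intermediate for the products.  */
  static uint64_t
  powmod1 (uint64_t b, uint64_t e, uint64_t m)
  {
    uint64_t r = 1 % m;
    b %= m;
    while (e != 0)
      {
        if (e & 1)
          r = (uint64_t) (((unsigned __int128) r * b) % m);
        b = (uint64_t) (((unsigned __int128) b * b) % m);
        e >>= 1;
      }
    return r;
  }
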
mpz_powm and mpz_powm_ui want better algorithm selection, and the latter should use REDC. Both could change to use an mpn_powm and mpn_redc.
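
A sketch of what the one-limb reduction inside such an mpn_redc might look like (illustrative only; assumes unsigned __int128 and an odd modulus, and the helper names are made up):

  #include <stdint.h>

  /* -m^-1 mod 2^64 for odd m, by Newton iteration: each step doubles
     the number of correct low bits (3, 6, 12, 24, 48, 96).  */
  static uint64_t
  neg_inverse (uint64_t m)
  {
    uint64_t inv = m;               /* correct to 3 bits for odd m */
    int i;
    for (i = 0; i < 5; i++)
      inv *= 2 - m * inv;
    return -inv;
  }

  /* Montgomery reduction: return t * 2^-64 mod m, for t < m * 2^64.  */
  static uint64_t
  redc1 (unsigned __int128 t, uint64_t m, uint64_t mneg)
  {
    uint64_t q = (uint64_t) t * mneg;   /* makes t + q*m divisible by 2^64 */
    unsigned __int128 qm = (unsigned __int128) q * m;
    /* Sum the high halves; the low halves cancel mod 2^64 and carry
       exactly when the low half of t is nonzero.  */
    unsigned __int128 r = (t >> 64) + (qm >> 64) + ((uint64_t) t != 0);
    return r >= m ? (uint64_t) (r - m) : (uint64_t) r;
  }

With residues kept in Montgomery form, one modular multiplication is then redc1 ((unsigned __int128) a * b, m, mneg).
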
mpn_gcd might be able to be sped up on small to moderate sizes by improving find_a, possibly just by providing an alternate implementation for CPUs with slowish count_leading_zeros.
USE_MORE_MPN code. The necessary right-to-left mpn_divexact_by3c exists.
mpn_mul_basecase on NxM with big N but small M could try for better cache locality by taking N piece by piece. The current code could be left available for CPUs without caching. Depending on how karatsuba etc is applied to unequal size operands it might be possible to assume M is always smallish.
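
A plain-C sketch of the blocking idea (limb type, chunk size and function name are illustrative, not GMP code; assumes unsigned __int128 for the two-limb products):

  #include <stdint.h>
  #include <string.h>

  #define CHUNK 256   /* limbs of the large operand handled per pass */

  /* rp[0..n+m-1] = up[0..n-1] * vp[0..m-1], walking up in chunks so each
     pass touches only about CHUNK limbs of up and CHUNK+m limbs of rp.  */
  static void
  mul_blocked (uint64_t *rp, const uint64_t *up, long n, const uint64_t *vp, long m)
  {
    long i0, i, j, k;
    memset (rp, 0, (n + m) * sizeof (uint64_t));
    for (i0 = 0; i0 < n; i0 += CHUNK)
      {
        long nb = (n - i0 < CHUNK) ? n - i0 : CHUNK;
        for (j = 0; j < m; j++)         /* addmul of one chunk by vp[j] */
          {
            uint64_t carry = 0, v = vp[j];
            for (i = 0; i < nb; i++)
              {
                unsigned __int128 t = (unsigned __int128) up[i0 + i] * v
                                      + rp[i0 + i + j] + carry;
                rp[i0 + i + j] = (uint64_t) t;
                carry = (uint64_t) (t >> 64);
              }
            for (k = i0 + nb + j; carry != 0; k++)   /* ripple the last carry */
              {
                uint64_t s = rp[k] + carry;
                carry = s < carry;
                rp[k] = s;
              }
          }
      }
  }
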
Improve mpn_addmul_1, mpn_submul_1, and mpn_mul_1 for the 21264. On 21264, they should run at 4, 3, and 3 cycles/limb respectively, if the code is unrolled properly. (Ask Torbjörn for his xm.s and xam.s skeleton files.)
Write mpn_addmul_1, mpn_submul_1, and mpn_mul_1 for the 21164. This should use both integer multiplies and floating-point multiplies. For the floating-point operations, the single-limb multiplier should be split into three 21-bit chunks.
Write mpn_addmul_1, mpn_submul_1, and mpn_mul_1. These should use floating-point operations, and split the invariant single-limb multiplier into 21-bit chunks. Should give about 18 cycles/limb, but the pipeline will become very deep. (Torbjörn has C code that is useful as a starting point.)
Write mpn_lshift and mpn_rshift. Should give 2 cycles/limb. (Torbjörn has code that just needs to be finished.)
Work out why the speed of mpn_addmul_1 and the other multiplies varies so much on successive sizes.
Improve mpn_addmul_1, mpn_submul_1, and mpn_mul_1. The current development code runs at 11 cycles/limb, which is already very good. But it should be possible to saturate the cache, which will happen at 7.5 cycles/limb.
Improve umul_ppmm. Important in particular for mpn_sqr_basecase. Using four "mulx"s either with an asm block or via the generic C code is about 90 cycles.
Write mpn_mul_basecase and mpn_sqr_basecase for important machines. Helping the generic sqr_basecase.c with an mpn_sqr_diagonal might be enough for some of the RISCs.
Improve mpn_lshift/mpn_rshift. Will bring time from 1.75 to 1.25 cycles/limb.
Optimize mpn_lshift for shifts by 1. (See Pentium code.)
count_leading_zeros.
udiv_qrnnd. (Ask Torbjörn for the file test-udiv-preinv.c as a starting point.)
Improve mpn_add_n and mpn_sub_n. They should require just 3 cycles/limb, but the current code propagates carry poorly. The trick is to add carry-in later than we do now, decreasing the number of operations used to generate carry-out from 4 to 3.
Improve mpn_lshift. The pipeline is now extremely deep, perhaps unnecessarily deep. Also, r5 is unused. (Ask Torbjörn for a copy of the current code.)
Write mpn_rshift based on the new mpn_lshift.
Write mpn_add_n and mpn_sub_n. Should run at just 3.25 cycles/limb. (Ask for xxx-add_n.s as a starting point.)
Write mpn_mul_basecase and mpn_sqr_basecase. This should use a "vertical multiplication method" to avoid carry propagation, splitting one of the operands into 11-bit chunks.
Write mpn_mul_basecase and mpn_sqr_basecase. The same comment applies here as to the corresponding functions for the Fujitsu VPP.
Optimize count_leading_zeros for 64-bit machines:

  if ((x >> W_TYPE_SIZE-W_TYPE_SIZE/2) == 0) { x <<= W_TYPE_SIZE/2; cnt += W_TYPE_SIZE/2; }
  if ((x >> W_TYPE_SIZE-W_TYPE_SIZE/4) == 0) { x <<= W_TYPE_SIZE/4; cnt += W_TYPE_SIZE/4; }
  ...
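
Completing that into a full routine for 64-bit words, for reference (plain C; GMP's macro interface differs):

  #include <stdint.h>

  /* Binary-search count_leading_zeros for a nonzero 64-bit word.  */
  static int
  clz64 (uint64_t x)
  {
    int cnt = 0;
    if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
    if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
    if ((x >> 56) == 0) { x <<= 8;  cnt += 8; }
    if ((x >> 60) == 0) { x <<= 4;  cnt += 4; }
    if ((x >> 62) == 0) { x <<= 2;  cnt += 2; }
    if ((x >> 63) == 0) { cnt += 1; }
    return cnt;
  }
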
Add mpz_get_nth_ui, returning the nth word (not necessarily the nth limb).
Add mpz_crr (Chinese Remainder Reconstruction).
Add mpq_set_f for assignment from an mpf_t (cf. mpq_set_d).
Make mpz_init (and mpq_init) do lazy allocation. Set ALLOC(var) to 0, and have mpz_realloc special-handle that case. Update functions that rely on a single limb (like mpz_set_ui, mpz_[tfc]div_r_ui, and others).
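
A hedged sketch of the lazy scheme using the public field names (function names are made up, and real code would go through GMP's allocation hooks rather than malloc/realloc directly):

  #include <stdlib.h>
  #include <gmp.h>

  /* Hypothetical lazy init: no limbs allocated until something is stored.  */
  static void
  lazy_mpz_init (mpz_t x)
  {
    x->_mp_alloc = 0;       /* the "nothing allocated yet" marker */
    x->_mp_size = 0;        /* value is zero */
    x->_mp_d = NULL;
  }

  /* Helper every storing function would call instead of assuming >= 1 limb.  */
  static mp_limb_t *
  lazy_mpz_make_room (mpz_t x, mp_size_t needed)
  {
    if (x->_mp_alloc < needed)
      {
        size_t bytes = needed * sizeof (mp_limb_t);
        x->_mp_d = (mp_limb_t *) (x->_mp_alloc == 0
                                  ? malloc (bytes)
                                  : realloc (x->_mp_d, bytes));
        x->_mp_alloc = (int) needed;
      }
    return x->_mp_d;
  }
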
Add mpf_out_raw and mpf_inp_raw. Make sure the format is portable between 32-bit and 64-bit machines, and between little-endian and big-endian machines.
Add gmp_errno.
Add gmp_fprintf, gmp_sprintf, and gmp_snprintf. Think about some sort of wrapper around printf so it and its several variants don't have to be completely reimplemented.
Add mpq input and output functions.
Add mpz_kronecker; leave mpz_jacobi for compatibility.
Add versions of mpz_set_str etc taking string lengths rather than null-terminators.
Use "<=" rather than "<" in the threshold comparisons, so a threshold can be set to MP_SIZE_T_MAX to get only the simpler code (the compiler will know size <= MP_SIZE_T_MAX is always true).
mpz_cdiv_q_2exp and mpz_cdiv_r_2exp could be implemented to match the corresponding tdiv and fdiv. Maybe some code sharing is possible.
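
For the quotient, the ceiling case is the truncating shift plus a conditional increment, which hints at the possible sharing; a sketch in terms of public calls (not necessarily how the internal version would be written):

  #include <gmp.h>

  /* q = ceil (n / 2^b).  For n < 0 truncation already rounds toward
     +infinity, so only positive n with nonzero low bits needs the bump.  */
  static void
  cdiv_q_2exp_sketch (mpz_t q, const mpz_t n, unsigned long b)
  {
    int round_up = mpz_sgn (n) > 0 && mpz_scan1 (n, 0) < b;
    mpz_tdiv_q_2exp (q, n, b);      /* truncate toward zero */
    if (round_up)
      mpz_add_ui (q, q, 1);
  }
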
The implver and amask assembly instructions are a better way to identify the Alpha CPU, but they don't differentiate between ev5 and ev56.
In general, getting exactly the right configuration and passing exactly the right options to the compiler can more than double GMP's performance.
When testing, make sure to test at least the following for all our target machines: (1) Both gcc and cc (and c89). (2) Both 32-bit mode and 64-bit mode (such as -n32 vs -64 under Irix). (3) Both the system `make' and GNU `make'. (4) With and without GNU binutils.
Make mpz_div and mpz_divmod use rounding analogous to mpz_mod. Document, and list as an incompatibility.
Add an mcount call at the start of each assembler subroutine (for important targets at least).
Optionally malloc separately for each TMP_ALLOC block, so a redzoning malloc debugger could be used during development.
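
One way to do it during development, sketched with made-up names (GMP's real TMP_ALLOC interface is alloca-based and differs from this):

  #include <stdlib.h>

  /* Debug-only idea: every temporary block is its own malloc, chained so
     one call frees them all; a redzoning malloc debugger then checks
     each block's bounds individually.  */
  struct tmp_debug_block
  {
    struct tmp_debug_block *next;
    void *ptr;
  };

  static void *
  tmp_debug_alloc (struct tmp_debug_block **list, size_t n)
  {
    struct tmp_debug_block *b = (struct tmp_debug_block *) malloc (sizeof *b);
    b->ptr = malloc (n);
    b->next = *list;
    *list = b;
    return b->ptr;
  }

  static void
  tmp_debug_free (struct tmp_debug_block **list)
  {
    while (*list != NULL)
      {
        struct tmp_debug_block *b = *list;
        *list = b->next;
        free (b->ptr);
        free (b);
      }
  }
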
Add ASSERTs at the start of each user-visible mpz/mpq/mpf function to check the validity of each mp?_t parameter, in particular to check that they've been mp?_init'ed. This might catch elementary mistakes in user programs. Care would need to be taken over MPZ_TMP_INIT'ed variables used internally.
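
A heuristic check is about the best that can be done (one cannot truly prove mpz_init was called); a sketch using the public field names, with a made-up macro name:

  #include <stdlib.h>
  #include <gmp.h>

  /* Sanity-check an mpz_t argument: the limb count must fit the
     allocation, and a nonzero value needs a limb pointer.  Catches
     garbage from uninitialized variables, not every misuse.  */
  #define ASSERT_MPZ_VALID(z)                                             \
    do {                                                                  \
      long size_ = (z)->_mp_size >= 0 ? (z)->_mp_size : -(z)->_mp_size;   \
      if ((z)->_mp_alloc < 0                                              \
          || size_ > (z)->_mp_alloc                                       \
          || ((z)->_mp_size != 0 && (z)->_mp_d == NULL))                  \
        abort ();                                                         \
    } while (0)
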
Ensure that unsigned long int is used for bit counts/ranges, and that mp_size_t is used for limb counts.
The documentation for mpz_inp_str (etc) doesn't say when it stops reading digits.
Please send comments about this page to tege@swox.com. Copyright (C) 1999, 2000 Torbjörn Granlund.