This file lists itemized GMP development tasks. Not all the tasks listed here are suitable for volunteers, but many of them are. Please see the projects file for more sizeable projects.
_mpz_realloc with a small (1 limb) size.
mpz_XXX(a,a,a).
When the exponent is very large (near the limit of an mp_exp_t), the precision of __mp_bases[base].chars_per_bit_exactly is insufficient and mpf_get_str aborts. Detect and compensate.
Fix mpz_get_si to work properly for the MIPS N32 ABI (and other machines that use long long for storing limbs).
mpz_*_ui division routines: perhaps make them return the real remainder instead? That changes the return type to signed long int.
mpf_eq is not always correct when one operand is 1000000000... and the other operand is 0111111111..., i.e., extremely close. There is a special case in mpf_sub for this situation; put similar code in mpf_eq.
__builtin_constant_p is unavailable? Same problem with MacOS X.
dump_abort in mp?/tests/*.c.
mpz_get_si returns 0x80000000 for -0x100000000.
Test the most significant bit first and avoid count_leading_zeros if the bit is set. This is an optimization on all machines, and significant on machines with slow count_leading_zeros.
count_trailing_zeros is used on more or less uniformly distributed numbers. For some CPUs count_trailing_zeros is slow and it's probably worth handling the frequently occurring cases of 0 to 2 trailing zeros specially.
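
For illustration, a minimal C sketch of the special-casing idea (the helper name is made up, and __builtin_ctzl merely stands in for a slow count_trailing_zeros):

  /* Hypothetical helper, not GMP code: count trailing zeros of a nonzero
     word, testing the frequent 0, 1 and 2 zero cases before the slow path.  */
  static int
  ctz_fast_cases (unsigned long x)
  {
    if (x & 1)
      return 0;                 /* no trailing zeros: the most common case */
    if (x & 2)
      return 1;
    if (x & 4)
      return 2;
    return __builtin_ctzl (x);  /* slow path: 3 or more trailing zeros */
  }
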
Change the code that uses udiv_qrnnd for inverting limbs to instead use invert_limb.
Rewrite umul_ppmm to use floating-point for generating the most significant limb (if BITS_PER_MP_LIMB <= 52 bits). (Peter Montgomery has some ideas on this subject.)
Improve the default umul_ppmm code in longlong.h: add partial products with fewer operations.
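
For reference, the generic scheme computes four half-word partial products and recombines them; the task is to do that recombination with fewer operations. A self-contained rendering of the baseline (not the longlong.h code itself):

  #include <stdint.h>

  /* Portable 64x64->128 multiply from four 32-bit partial products.  */
  static void
  umul_ppmm_generic (uint64_t *hi, uint64_t *lo, uint64_t u, uint64_t v)
  {
    uint64_t ul = u & 0xffffffff, uh = u >> 32;
    uint64_t vl = v & 0xffffffff, vh = v >> 32;

    uint64_t x0 = ul * vl;          /* the four partial products */
    uint64_t x1 = ul * vh;
    uint64_t x2 = uh * vl;
    uint64_t x3 = uh * vh;

    x1 += x0 >> 32;                 /* cannot overflow */
    x1 += x2;                       /* may overflow: carry into x3 */
    if (x1 < x2)
      x3 += (uint64_t) 1 << 32;

    *hi = x3 + (x1 >> 32);
    *lo = (x1 << 32) | (x0 & 0xffffffff);
  }
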
Write new mpn_get_str and mpn_set_str running in the sub-O(n^2) range, using some divide-and-conquer approach, preferably without using division.
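
To illustrate the divide-and-conquer shape (only the shape: this sketch still divides, which the task would prefer to avoid, and it recomputes powers of the base instead of caching them), here is a decimal printer for nonnegative mpz_t values built on public calls:

  #include <stdio.h>
  #include <string.h>
  #include <gmp.h>

  /* Print n in decimal by splitting around 10^k and recursing.
     pad > 0 means "this is an interior piece: zero-pad to pad digits".  */
  static void
  dc_print (const mpz_t n, size_t pad)
  {
    size_t digits = mpz_sizeinbase (n, 10);   /* may overestimate by 1 */

    if (digits <= 16)                         /* small: quadratic method is fine */
      {
        char buf[20];
        size_t len;
        mpz_get_str (buf, 10, n);
        len = strlen (buf);
        for (; pad > len; pad--)
          putchar ('0');
        fputs (buf, stdout);
        return;
      }

    {
      size_t k = digits / 2;
      mpz_t pow, q, r;
      mpz_init (pow);
      mpz_init (q);
      mpz_init (r);
      mpz_ui_pow_ui (pow, 10, k);
      mpz_tdiv_qr (q, r, n, pow);             /* n = q*10^k + r */
      dc_print (q, pad > k ? pad - k : 0);    /* high piece */
      dc_print (r, k);                        /* low piece, zero-padded */
      mpz_clear (pow);
      mpz_clear (q);
      mpz_clear (r);
    }
  }

The top-level call is dc_print (n, 0) for n >= 0; real sub-quadratic code would choose the split points once and reuse the precomputed powers.
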
Copy the improved mpn_get_str to mpf/get_str. (Talk to Torbjörn about this.)
mpz_size, mpz_set_ui, mpz_set_q, mpz_clear, mpz_init, mpz_get_ui, mpz_scan0, mpz_scan1, mpz_getlimbn, mpz_init_set_ui, mpz_perfect_square_p, mpz_popcount, mpf_size, mpf_get_prec, mpf_set_prec_raw, mpf_set_ui, mpf_init, mpf_init2, mpf_clear, mpf_set_si.
mpz_powm and mpz_powm_ui aren't very fast on one or two limb moduli, due to a lot of function call overheads. These could perhaps be handled as special cases.
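
For instance, the one-limb case could drop to plain word arithmetic; a hedged sketch (assumes a compiler with unsigned __int128, and m >= 1):

  #include <stdint.h>

  /* Single-limb modular exponentiation by plain binary exponentiation,
     with a 128-bit intermediate for the products.  */
  static uint64_t
  powmod1 (uint64_t b, uint64_t e, uint64_t m)
  {
    uint64_t r = 1 % m;
    b %= m;
    while (e != 0)
      {
        if (e & 1)
          r = (uint64_t) (((unsigned __int128) r * b) % m);
        b = (uint64_t) (((unsigned __int128) b * b) % m);
        e >>= 1;
      }
    return r;
  }
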
mpz_powm and mpz_powm_ui want better algorithm selection, and the latter should use REDC. Both could change to use an mpn_powm and mpn_redc.
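
A sketch of what the one-limb reduction inside such an mpn_redc might look like (illustrative only; assumes unsigned __int128 and an odd modulus, and the helper names are made up):

  #include <stdint.h>

  /* -m^-1 mod 2^64 for odd m, by Newton iteration: each step doubles
     the number of correct low bits (3, 6, 12, 24, 48, 96).  */
  static uint64_t
  neg_inverse (uint64_t m)
  {
    uint64_t inv = m;               /* correct to 3 bits for odd m */
    int i;
    for (i = 0; i < 5; i++)
      inv *= 2 - m * inv;
    return -inv;
  }

  /* Montgomery reduction: return t * 2^-64 mod m, for t < m * 2^64.  */
  static uint64_t
  redc1 (unsigned __int128 t, uint64_t m, uint64_t mneg)
  {
    uint64_t q = (uint64_t) t * mneg;   /* makes t + q*m divisible by 2^64 */
    unsigned __int128 qm = (unsigned __int128) q * m;
    /* Sum the high halves; the low halves cancel mod 2^64 and carry
       exactly when the low half of t is nonzero.  */
    unsigned __int128 r = (t >> 64) + (qm >> 64) + ((uint64_t) t != 0);
    return r >= m ? (uint64_t) (r - m) : (uint64_t) r;
  }

With residues kept in Montgomery form, one modular multiplication is then redc1 ((unsigned __int128) a * b, m, mneg).
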
mpn_gcd might be able to be sped up on small to moderate sizes by improving find_a, possibly just by providing an alternate implementation for CPUs with slowish count_leading_zeros.
USE_MORE_MPN code. The necessary right-to-left mpn_divexact_by3c exists.
mpn_mul_basecase on NxM with big N but small M could try for better cache locality by taking N piece by piece. The current code could be left available for CPUs without caching. Depending on how karatsuba etc is applied to unequal size operands it might be possible to assume M is always smallish.
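
A plain-C sketch of the blocking idea (limb type, chunk size and function name are illustrative, not GMP code; assumes unsigned __int128 for the two-limb products):

  #include <stdint.h>
  #include <string.h>

  #define CHUNK 256   /* limbs of the large operand handled per pass */

  /* rp[0..n+m-1] = up[0..n-1] * vp[0..m-1], walking up in chunks so each
     pass touches only about CHUNK limbs of up and CHUNK+m limbs of rp.  */
  static void
  mul_blocked (uint64_t *rp, const uint64_t *up, long n, const uint64_t *vp, long m)
  {
    long i0, i, j, k;
    memset (rp, 0, (n + m) * sizeof (uint64_t));
    for (i0 = 0; i0 < n; i0 += CHUNK)
      {
        long nb = (n - i0 < CHUNK) ? n - i0 : CHUNK;
        for (j = 0; j < m; j++)         /* addmul of one chunk by vp[j] */
          {
            uint64_t carry = 0, v = vp[j];
            for (i = 0; i < nb; i++)
              {
                unsigned __int128 t = (unsigned __int128) up[i0 + i] * v
                                      + rp[i0 + i + j] + carry;
                rp[i0 + i + j] = (uint64_t) t;
                carry = (uint64_t) (t >> 64);
              }
            for (k = i0 + nb + j; carry != 0; k++)   /* ripple the last carry */
              {
                uint64_t s = rp[k] + carry;
                carry = s < carry;
                rp[k] = s;
              }
          }
      }
  }
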
Improve mpn_addmul_1, mpn_submul_1, and mpn_mul_1 for the 21264. On 21264, they should run at 4, 3, and 3 cycles/limb respectively, if the code is unrolled properly. (Ask Torbjörn for his xm.s and xam.s skeleton files.)
Write mpn_addmul_1, mpn_submul_1, and mpn_mul_1 for the 21164. This should use both integer multiplies and floating-point multiplies. For the floating-point operations, the single-limb multiplier should be split into three 21-bit chunks.
Write mpn_addmul_1, mpn_submul_1, and mpn_mul_1. These should use floating-point operations, and split the invariant single-limb multiplier into 21-bit chunks. Should give about 18 cycles/limb, but the pipeline will become very deep. (Torbjörn has C code that is useful as a starting point.)
Write mpn_lshift and mpn_rshift. Should give 2 cycles/limb. (Torbjörn has code that just needs to be finished.)
Work out why the speed of mpn_addmul_1 and the other multiplies varies so much on successive sizes.
Improve mpn_addmul_1, mpn_submul_1, and mpn_mul_1. The current development code runs at 11 cycles/limb, which is already very good. But it should be possible to saturate the cache, which will happen at 7.5 cycles/limb.
Improve umul_ppmm. Important in particular for mpn_sqr_basecase. Using four "mulx"s either with an asm block or via the generic C code is about 90 cycles.
Write mpn_mul_basecase and mpn_sqr_basecase for important machines. Helping the generic sqr_basecase.c with an mpn_sqr_diagonal might be enough for some of the RISCs.
Improve mpn_lshift/mpn_rshift. Will bring time from 1.75 to 1.25 cycles/limb.
Optimize mpn_lshift for shifts by 1. (See Pentium code.)
count_leading_zeros.
udiv_qrnnd. (Ask Torbjörn for the file test-udiv-preinv.c as a starting point.)
Improve mpn_add_n and mpn_sub_n. They should require just 3 cycles/limb, but the current code propagates carry poorly. The trick is to add carry-in later than we do now, decreasing the number of operations used to generate carry-out from 4 to 3.
Improve mpn_lshift. The pipeline is now extremely deep, perhaps unnecessarily deep. Also, r5 is unused. (Ask Torbjörn for a copy of the current code.)
Write mpn_rshift based on the new mpn_lshift.
Write mpn_add_n and mpn_sub_n. Should run at just 3.25 cycles/limb. (Ask for xxx-add_n.s as a starting point.)
Write mpn_mul_basecase and mpn_sqr_basecase. This should use a "vertical multiplication method" to avoid carry propagation, splitting one of the operands into 11-bit chunks.
Write mpn_mul_basecase and mpn_sqr_basecase. The same comment applies here as to the corresponding functions for the Fujitsu VPP.
Optimize count_leading_zeros for 64-bit machines:

  if ((x >> W_TYPE_SIZE-W_TYPE_SIZE/2) == 0) { x <<= W_TYPE_SIZE/2; cnt += W_TYPE_SIZE/2; }
  if ((x >> W_TYPE_SIZE-W_TYPE_SIZE/4) == 0) { x <<= W_TYPE_SIZE/4; cnt += W_TYPE_SIZE/4; }
  ...
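
Completing that into a full routine for 64-bit words, for reference (plain C; GMP's macro interface differs):

  #include <stdint.h>

  /* Binary-search count_leading_zeros for a nonzero 64-bit word.  */
  static int
  clz64 (uint64_t x)
  {
    int cnt = 0;
    if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
    if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
    if ((x >> 56) == 0) { x <<= 8;  cnt += 8; }
    if ((x >> 60) == 0) { x <<= 4;  cnt += 4; }
    if ((x >> 62) == 0) { x <<= 2;  cnt += 2; }
    if ((x >> 63) == 0) { cnt += 1; }
    return cnt;
  }
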
Add mpz_get_nth_ui, returning the nth word (not necessarily the nth limb).
Add mpz_crr (Chinese Remainder Reconstruction).
Add mpq_set_f for assignment from an mpf_t (cf. mpq_set_d).
Make mpz_init (and mpq_init) do lazy allocation. Set ALLOC(var) to 0, and have mpz_realloc special-handle that case. Update functions that rely on a single limb (like mpz_set_ui, mpz_[tfc]div_r_ui, and others).
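
A hedged sketch of the lazy scheme using the public field names (function names are made up, and real code would go through GMP's allocation hooks rather than malloc/realloc directly):

  #include <stdlib.h>
  #include <gmp.h>

  /* Hypothetical lazy init: no limbs allocated until something is stored.  */
  static void
  lazy_mpz_init (mpz_t x)
  {
    x->_mp_alloc = 0;       /* the "nothing allocated yet" marker */
    x->_mp_size = 0;        /* value is zero */
    x->_mp_d = NULL;
  }

  /* Helper every storing function would call instead of assuming >= 1 limb.  */
  static mp_limb_t *
  lazy_mpz_make_room (mpz_t x, mp_size_t needed)
  {
    if (x->_mp_alloc < needed)
      {
        size_t bytes = needed * sizeof (mp_limb_t);
        x->_mp_d = (mp_limb_t *) (x->_mp_alloc == 0
                                  ? malloc (bytes)
                                  : realloc (x->_mp_d, bytes));
        x->_mp_alloc = (int) needed;
      }
    return x->_mp_d;
  }
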
Add mpf_out_raw and mpf_inp_raw. Make sure the format is portable between 32-bit and 64-bit machines, and between little-endian and big-endian machines.
Add gmp_errno.
Add gmp_fprintf, gmp_sprintf, and gmp_snprintf. Think about some sort of wrapper around printf so it and its several variants don't have to be completely reimplemented.
Add mpq input and output functions.
Add mpz_kronecker; leave mpz_jacobi for compatibility.
Add versions of mpz_set_str etc taking string lengths rather than null-terminators.
Use "<=" rather than "<" in the threshold comparisons, so a threshold can be set to MP_SIZE_T_MAX to get only the simpler code (the compiler will know size <= MP_SIZE_T_MAX is always true).
mpz_cdiv_q_2exp and mpz_cdiv_r_2exp could be implemented to match the corresponding tdiv and fdiv. Maybe some code sharing is possible.
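
For the quotient, the ceiling case is the truncating shift plus a conditional increment, which hints at the possible sharing; a sketch in terms of public calls (not necessarily how the internal version would be written):

  #include <gmp.h>

  /* q = ceil (n / 2^b).  For n < 0 truncation already rounds toward
     +infinity, so only positive n with nonzero low bits needs the bump.  */
  static void
  cdiv_q_2exp_sketch (mpz_t q, const mpz_t n, unsigned long b)
  {
    int round_up = mpz_sgn (n) > 0 && mpz_scan1 (n, 0) < b;
    mpz_tdiv_q_2exp (q, n, b);      /* truncate toward zero */
    if (round_up)
      mpz_add_ui (q, q, 1);
  }
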
The implver and amask assembly instructions are a better way to identify the Alpha CPU, but they don't differentiate between ev5 and ev56.
In general, getting exactly the right configuration and passing exactly the right options to the compiler can more than double GMP's performance.
When testing, make sure to test at least the following for all our target machines: (1) Both gcc and cc (and c89). (2) Both 32-bit mode and 64-bit mode (such as -n32 vs -64 under Irix). (3) Both the system `make' and GNU `make'. (4) With and without GNU binutils.
Make mpz_div and mpz_divmod use rounding analogous to mpz_mod. Document, and list as an incompatibility.
Add an mcount call at the start of each assembler subroutine (for important targets at least).
Optionally malloc separately for each TMP_ALLOC block, so a redzoning malloc debugger could be used during development.
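
One way to do it during development, sketched with made-up names (GMP's real TMP_ALLOC interface is alloca-based and differs from this):

  #include <stdlib.h>

  /* Debug-only idea: every temporary block is its own malloc, chained so
     one call frees them all; a redzoning malloc debugger then checks
     each block's bounds individually.  */
  struct tmp_debug_block
  {
    struct tmp_debug_block *next;
    void *ptr;
  };

  static void *
  tmp_debug_alloc (struct tmp_debug_block **list, size_t n)
  {
    struct tmp_debug_block *b = (struct tmp_debug_block *) malloc (sizeof *b);
    b->ptr = malloc (n);
    b->next = *list;
    *list = b;
    return b->ptr;
  }

  static void
  tmp_debug_free (struct tmp_debug_block **list)
  {
    while (*list != NULL)
      {
        struct tmp_debug_block *b = *list;
        *list = b->next;
        free (b->ptr);
        free (b);
      }
  }
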
Add ASSERTs at the start of each user-visible mpz/mpq/mpf function to check the validity of each mp?_t parameter, in particular to check that they've been mp?_init'ed. This might catch elementary mistakes in user programs. Care would need to be taken over MPZ_TMP_INIT'ed variables used internally.
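
A heuristic check is about the best that can be done (one cannot truly prove mpz_init was called); a sketch using the public field names, with a made-up macro name:

  #include <stdlib.h>
  #include <gmp.h>

  /* Sanity-check an mpz_t argument: the limb count must fit the
     allocation, and a nonzero value needs a limb pointer.  Catches
     garbage from uninitialized variables, not every misuse.  */
  #define ASSERT_MPZ_VALID(z)                                             \
    do {                                                                  \
      long size_ = (z)->_mp_size >= 0 ? (z)->_mp_size : -(z)->_mp_size;   \
      if ((z)->_mp_alloc < 0                                              \
          || size_ > (z)->_mp_alloc                                       \
          || ((z)->_mp_size != 0 && (z)->_mp_d == NULL))                  \
        abort ();                                                         \
    } while (0)
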
Ensure that unsigned long int is used for bit counts/ranges, and that mp_size_t is used for limb counts.
The documentation for mpz_inp_str (etc) doesn't say when it stops reading digits.
Please send comments about this page to tege@swox.com. Copyright (C) 1999, 2000 Torbjörn Granlund.