<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>
GMP Itemized Development Tasks
</title>
</head>
<body bgcolor=lightgreen>
<center>
<h1>
GMP Itemized Development Tasks
</h1>
</center>
<comment>
An up-to-date html version of this file is available at
<a href="http://www.swox.com/gmp/tasks.html">http://www.swox.com/gmp/tasks.html</a>.
</comment>
<p> This file lists itemized GMP development tasks. Not all the tasks
listed here are suitable for volunteers, but many of them are.
Please see the <a href="projects.html">projects file</a> for more
sizeable projects.
<h4>Correctness and Completeness</h4>
<ul>
<li> HPUX 10.20 assembler requires a `.LEVEL 1.1' directive for accepting the
new instructions. Unfortunately, the HPUX 9 assembler as well as earlier
assemblers reject that directive. How very clever of HP! We will have to
pass assembler options, and make sure it works with new and old systems
and GNU assembler.
<li> The various reuse.c tests need to force reallocation by calling
<code>_mpz_realloc</code> with a small (1 limb) size.
<li> One reuse case is missing from mpX/tests/reuse.c: <code>mpz_XXX(a,a,a)</code>.
<li> When printing mpf_t numbers with exponents &gt; 2^53 on machines with 64-bit
<code>mp_exp_t</code>, the precision of
<code>__mp_bases[base].chars_per_bit_exactly</code> is insufficient and
<code>mpf_get_str</code> aborts. Detect and compensate.
<li> Fix <code>mpz_get_si</code> to work properly for MIPS N32 ABI (and other
machines that use <code>long long</code> for storing limbs.)
<li> Make the string reading functions allow the `0x' prefix when the base is
explicitly 16. They currently only allow that prefix when the base is
unspecified.
<li> In the development sources, we return abs(a%b) in the
<code>mpz_*_ui</code> division routines. Perhaps make them return the
real remainder instead? Changes return type to <code>signed long int</code>.
<li> <code>mpf_eq</code> is not always correct, when one operand is
1000000000... and the other operand is 0111111111..., i.e., extremely
close. There is a special case in <code>mpf_sub</code> for this
situation; put similar code in <code>mpf_eq</code>.
<li> <code>mpf_eq</code> doesn't implement what gmp.texi specifies. It should
compare partial limbs as well, not just whole limbs.
<li> Install Alpha assembly changes (prec/gmp-alpha-patches).
<li> NeXT has problems with newlines in asm strings in longlong.h. Also,
<code>__builtin_constant_p</code> is unavailable? Same problem with MacOS
X.
<li> Shut up SGI's compiler by declaring <code>dump_abort</code> in
mp?/tests/*.c.
<li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000.
</ul>
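<p> The <code>mpz_*_ui</code> remainder item above can be illustrated on plain
C integers. This is only a sketch for the truncating case, not GMP code: the
names <code>rem_abs</code> and <code>rem_signed</code> are made up for the
example, and it assumes C99 or later, where <code>%</code> truncates toward
zero so the remainder takes the sign of the dividend.

```c
#include <assert.h>

/* What the task says the development sources return: abs(a % b).
   Hypothetical helper; b is assumed positive.  */
static unsigned long
rem_abs (long a, long b)
{
  long r;
  assert (b > 0);
  r = a % b;                    /* |r| < b, so negating r cannot overflow */
  return r < 0 ? (unsigned long) -r : (unsigned long) r;
}

/* The "real" remainder for truncating division.  It can be negative,
   which is why the return type would have to become signed.  */
static long
rem_signed (long a, long b)
{
  assert (b > 0);
  return a % b;
}
```

With these definitions <code>rem_abs (-7, 3)</code> loses the sign that
<code>rem_signed (-7, 3)</code> keeps, which is the information a caller of the
current routines cannot recover from the return value alone.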
<h4>Machine Independent Optimization</h4>
<ul>
<li> In hundreds of places in the code, we invoke count_leading_zeros and then
check if the returned count is zero. Instead check the most significant
bit of the operand, and avoid invoking <code>count_leading_zeros</code> if
the bit is set. This is an optimization on all machines, and significant
on machines with slow <code>count_leading_zeros</code>.
<li> In a couple of places <code>count_trailing_zeros</code> is used
on more or less uniformly distributed numbers. For some CPUs
<code>count_trailing_zeros</code> is slow and it's probably worth
handling the frequently occurring 0 to 2 trailing zeros cases specially.
<li> Change all places that use <code>udiv_qrnnd</code> for inverting limbs to
instead use <code>invert_limb</code>.
<li> Reorganize longlong.h so that we can inline the operations even for the
system compiler. When there is no such compiler feature, make calls to
stub functions. Write such stub functions for as many machines as
possible.
<li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
most significant limb (if <code>BITS_PER_MP_LIMB</code> &lt;= 52 bits).
(Peter Montgomery has some ideas on this subject.)
<li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
products with fewer operations.
<li> Write new <code>mpn_get_str</code> and <code>mpn_set_str</code> running in
the sub O(n^2) range, using some divide-and-conquer approach, preferably
without using division.
<li> Copy tricky code for converting a limb from the development version of
<code>mpn_get_str</code> to mpf/get_str. (Talk to Torbj&ouml;rn about this.)
<li> Consider inlining these functions: <code>mpz_size</code>,
<code>mpz_set_ui</code>, <code>mpz_set_q</code>, <code>mpz_clear</code>,
<code>mpz_init</code>, <code>mpz_get_ui</code>, <code>mpz_scan0</code>,
<code>mpz_scan1</code>, <code>mpz_getlimbn</code>,
<code>mpz_init_set_ui</code>, <code>mpz_perfect_square_p</code>,
<code>mpz_popcount</code>, <code>mpf_size</code>,
<code>mpf_get_prec</code>, <code>mpf_set_prec_raw</code>,
<code>mpf_set_ui</code>, <code>mpf_init</code>, <code>mpf_init2</code>,
<code>mpf_clear</code>, <code>mpf_set_si</code>.
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very
fast on one or two limb moduli, due to a lot of function call
overheads. These could perhaps be handled as special cases.
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> want better
algorithm selection, and the latter should use REDC. Both could
change to use an <code>mpn_powm</code> and <code>mpn_redc</code>.
<li> <code>mpn_gcd</code> might be able to be sped up on small to
moderate sizes by improving <code>find_a</code>, possibly just by
providing an alternate implementation for CPUs with slowish
<code>count_leading_zeros</code>.
<li> Implement a cache localized evaluate and interpolate for the
toom3 <code>USE_MORE_MPN</code> code. The necessary
right-to-left <code>mpn_divexact_by3c</code> exists.
<li> <code>mpn_mul_basecase</code> on NxM with big N but small M could try for
better cache locality by taking N piece by piece. The current code could
be left available for CPUs without caching. Depending on how Karatsuba etc.
is applied to unequal size operands it might be possible to assume M is
always smallish.
</ul>
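<p> The first two items in the list above (testing the most significant bit
before calling <code>count_leading_zeros</code>, and special-casing 0 to 2
trailing zeros) might look roughly like the following. This is a sketch only:
<code>limb_t</code>, <code>slow_clz</code>, <code>normalization_shift</code>
and <code>ctz_small_first</code> are hypothetical names, and the loops merely
stand in for whatever slow primitive a given CPU has.

```c
#include <assert.h>
#include <limits.h>

typedef unsigned long limb_t;   /* hypothetical stand-in for mp_limb_t */
#define LIMB_BITS    ((int) (sizeof (limb_t) * CHAR_BIT))
#define LIMB_HIGHBIT ((limb_t) 1 << (LIMB_BITS - 1))

/* Deliberately naive loop, standing in for a slow count_leading_zeros.  */
static int
slow_clz (limb_t x)
{
  int cnt = 0;
  assert (x != 0);
  while ((x & LIMB_HIGHBIT) == 0)
    {
      x <<= 1;
      cnt++;
    }
  return cnt;
}

/* Shift needed to normalize x.  The top bit is tested first, so the
   count is skipped entirely when x is already normalized.  */
static int
normalization_shift (limb_t x)
{
  if (x & LIMB_HIGHBIT)
    return 0;
  return slow_clz (x);
}

/* Trailing zero count with the common results 0, 1 and 2 handled by
   direct bit tests; only rarer inputs reach the loop.  */
static int
ctz_small_first (limb_t x)
{
  int cnt;
  assert (x != 0);
  if (x & 1) return 0;
  if (x & 2) return 1;
  if (x & 4) return 2;
  x >>= 3;                      /* low three bits are known to be zero */
  for (cnt = 3; (x & 1) == 0; cnt++)
    x >>= 1;
  return cnt;
}
```

For uniformly distributed odd-or-even inputs the three bit tests cover 7 out
of 8 cases, which is the motivation given in the task.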
<h4>Machine Dependent Optimization</h4>
<ul>
<li> Run the `tune' utility for more compiler/CPU combinations. We would like
to have gmp-mparam.h files in practically every implementation specific
mpn subdirectory, and repeat each *_THRESHOLD for gcc and the system
compiler. See the `tune' top-level directory for more information.
<li> Alpha: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
<code>mpn_mul_1</code> for the 21264. On 21264, they should run at 4, 3,
and 3 cycles/limb respectively, if the code is unrolled properly. (Ask
Torbj&ouml;rn for his xm.s and xam.s skeleton files.)
<li> Alpha: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
<code>mpn_mul_1</code> for the 21164. This should use both integer
multiplies and floating-point multiplies. For the floating-point
operations, the single-limb multiplier should be split into three 21-bit
chunks.
<li> UltraSPARC: Rewrite 64-bit <code>mpn_addmul_1</code>,
<code>mpn_submul_1</code>, and <code>mpn_mul_1</code>. Should use
floating-point operations, and split the invariant single-limb multiplier
into 21-bit chunks. Should give about 18 cycles/limb, but the pipeline
will become very deep. (Torbj&ouml;rn has C code that is useful as a starting
point.)
<li> UltraSPARC: Rewrite <code>mpn_lshift</code> and <code>mpn_rshift</code>.
Should give 2 cycles/limb. (Torbj&ouml;rn has code that just needs to be
finished.)
<li> SPARC32/V9: Find out why the speed of <code>mpn_addmul_1</code>
and the other multiplies varies so much on successive sizes.
<li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
<code>mpn_mul_1</code>. The current development code runs at 11
cycles/limb, which is already very good. But it should be possible to
saturate the cache, which will happen at 7.5 cycles/limb.
<li> Sparc &amp; SparcV8: Enable umul.asm for native cc. The generic
longlong.h umul_ppmm is suspected to be causing sqr_basecase to
be slower than mul_basecase.
<li> UltraSPARC: Write <code>umul_ppmm</code>. Important in particular for
<code>mpn_sqr_basecase</code>. Using four "<code>mulx</code>"s either
with an asm block or via the generic C code is about 90 cycles.
<li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
for important machines. Helping the generic sqr_basecase.c with an
<code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
<li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
Will bring time from 1.75 to 1.25 cycles/limb.
<li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1. (See Pentium code.)
<li> Alpha: Optimize <code>count_leading_zeros</code>.
<li> Alpha: Optimize <code>udiv_qrnnd</code>. (Ask Torbj&ouml;rn for the file
test-udiv-preinv.c as a starting point.)
<li> R10000/R12000: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>.
It should just require 3 cycles/limb, but the current code propagates
carry poorly. The trick is to add carry-in later than we do now,
decreasing the number of operations used to generate carry-out from 4 to
3.
<li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
The pipeline is now extremely deep, perhaps unnecessarily deep. Also, r5
is unused. (Ask Torbj&ouml;rn for a copy of the current code.)
<li> PPC32: Write <code>mpn_rshift</code> based on new <code>mpn_lshift</code>.
<li> PPC32: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>. Should
run at just 3.25 cycles/limb. (Ask for xxx-add_n.s as a starting point.)
<li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
<li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
<code>mpn_sqr_basecase</code>. This should use a "vertical multiplication
method" to avoid carry propagation, splitting one of the operands into
11-bit chunks.
<li> Cray: Vectorize main functions, perhaps in assembly language.
<li> Cray: Write <code>mpn_mul_basecase</code> and
<code>mpn_sqr_basecase</code>. Same comment applies to this as to the
same functions for Fujitsu VPP.
<li> Improve <code>count_leading_zeros</code> for 64-bit machines:
<pre>
if ((x &gt;&gt; W_TYPE_SIZE-W_TYPE_SIZE/2) == 0) { x &lt;&lt;= W_TYPE_SIZE/2; cnt += W_TYPE_SIZE/2; }
if ((x &gt;&gt; W_TYPE_SIZE-W_TYPE_SIZE/4) == 0) { x &lt;&lt;= W_TYPE_SIZE/4; cnt += W_TYPE_SIZE/4; }
... </pre>
</ul>
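<p> The <code>count_leading_zeros</code> fragment at the end of the list above
stops after two steps. Continuing the same pattern down to a one-bit step could
be written as a loop, as in this sketch. <code>word_t</code> and
<code>clz_by_halving</code> are hypothetical names, <code>W_TYPE_SIZE</code> is
defined locally rather than taken from longlong.h, and the word size is assumed
to be a power of two.

```c
#include <assert.h>
#include <limits.h>

typedef unsigned long long word_t;      /* hypothetical; stands in for UWtype */
#define W_TYPE_SIZE ((int) (sizeof (word_t) * CHAR_BIT))

/* Count the leading zeros of a nonzero word.  Each step asks whether
   the top `step' bits are all zero; if so they are shifted out and
   counted.  The step then halves: W/2, W/4, ..., 1.  */
static int
clz_by_halving (word_t x)
{
  int cnt = 0;
  int step;
  assert (x != 0);
  for (step = W_TYPE_SIZE / 2; step > 0; step /= 2)
    {
      if ((x >> (W_TYPE_SIZE - step)) == 0)
        {
          x <<= step;
          cnt += step;
        }
    }
  return cnt;
}
```

A real longlong.h version would presumably unroll the loop as the fragment
does, so that every shift count is a compile-time constant.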
<h4>New Functionality</h4>
<ul>
<li> <code>mpz_get_nth_ui</code>. Return the nth word (not necessarily the nth limb).
<li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
<li> Let `0b' and `0B' mean binary input everywhere.
<li> Add <code>mpq_set_f</code> for assignment from <code>mpf_t</code>
(cf. <code>mpq_set_d</code>).
<li> Maybe make <code>mpz_init</code> (and <code>mpq_init</code>) do lazy
allocation. Set <code>ALLOC(var)</code> to 0, and have
<code>mpz_realloc</code> special-handle that case. Update functions that
rely on a single limb (like <code>mpz_set_ui</code>,
<code>mpz_[tfc]div_r_ui</code>, and others).
<li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>. Make sure
format is portable between 32-bit and 64-bit machines, and between
little-endian and big-endian machines.
<li> Handle numeric exceptions: Call an error handler, and/or set
<code>gmp_errno</code>.
<li> Implement <code>gmp_fprintf</code>, <code>gmp_sprintf</code>, and
<code>gmp_snprintf</code>. Think about some sort of wrapper
around <code>printf</code> so it and its several variants don't
have to be completely reimplemented.
<li> Implement some <code>mpq</code> input and output functions.
<li> Implement a full precision <code>mpz_kronecker</code>, leave
<code>mpz_jacobi</code> for compatibility.
<li> Make the mpn logops and copys available in gmp.h. Since they can
be either library functions or inlines, gmp.h would need to be
generated from a gmp.in based on what's in the library. gmp.h
would still be compiler-independent though.
<li> Make versions of <code>mpz_set_str</code> etc taking string
lengths rather than null-terminators.
<li> Consider changing the thresholds to apply the simpler algorithm when
"<code>&lt;=</code>" rather than "<code>&lt;</code>", so a threshold can
be set to <code>MP_SIZE_T_MAX</code> to get only the simpler code (the
compiler will know <code>size &lt;= MP_SIZE_T_MAX</code> is always true).
<li> <code>mpz_cdiv_q_2exp</code> and <code>mpz_cdiv_r_2exp</code>
could be implemented to match the corresponding tdiv and fdiv.
Maybe some code sharing is possible.
</ul>
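<p> The threshold item above is easiest to see with the two comparisons side by
side. In this sketch <code>size_type</code>, <code>SIZE_TYPE_MAX</code>,
<code>choose_lt</code> and <code>choose_le</code> are hypothetical stand-ins
(for <code>mp_size_t</code>, <code>MP_SIZE_T_MAX</code> and the existing
threshold tests); none of them are GMP names.

```c
#include <assert.h>
#include <limits.h>

typedef long size_type;                 /* hypothetical stand-in for mp_size_t */
#define SIZE_TYPE_MAX LONG_MAX          /* stand-in for MP_SIZE_T_MAX */

enum algorithm { ALG_SIMPLE, ALG_FAST };

/* Current convention: the simpler algorithm is used when size < threshold,
   so no threshold value can select it for every possible size.  */
static enum algorithm
choose_lt (size_type size, size_type threshold)
{
  return size < threshold ? ALG_SIMPLE : ALG_FAST;
}

/* Proposed convention: with threshold == SIZE_TYPE_MAX the test
   size <= threshold is always true, so a compiler that sees the
   constant can drop the other branch.  */
static enum algorithm
choose_le (size_type size, size_type threshold)
{
  return size <= threshold ? ALG_SIMPLE : ALG_FAST;
}
```

Under the current convention the largest size still selects the faster
algorithm even with the threshold at its maximum; under the proposed one it
does not.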
<h4>Configuration</h4>
<ul>
<li> Improve config.guess. We want to recognize the processor very
accurately, more accurately than other GNU packages.
config.guess does not currently make the distinctions we would
like it to do and a --target often needs to be set explicitly.
For example, "sparc" is not very useful as a machine architecture
denotation. We want to distinguish old 32-bit SPARC without
multiply support from newer 32-bit SPARC with such support. We
want to recognize a SuperSPARC, since its implementation of the
UDIV instruction is not complete, and will trap to the OS kernel
for certain operands. And we want to recognize 64-bit capable
SPARC processors as such. While the assembly routines can use
64-bit operations on all 64-bit SPARC processors, one can not use
64-bit limbs under all operating systems. E.g., Solaris 2.5 and
2.6 don't preserve the upper 32 bits of most processor
registers. For SPARC we therefore sometimes need to choose GMP
configuration depending both on processor and operating system.
<li> Remember to make sure config.sub accepts any output from config.guess.
<li> Find out whether there's an alloca available and how to use it.
AC_FUNC_ALLOCA has various system dependencies covered, but we
don't want its alloca.c replacement. (One thing current cpp
tests don't cover: the HPUX 10 C compiler supports alloca, but we
cannot find any symbol to test in order to know if we're on
HPUX 10. Damn.)
<li> Identify Mips processor under Irix: `hinv -c processor'.
config.guess should say mips2, mips3, and mips4.
<li> Identify Alpha processor under OSF: "/usr/sbin/sizer -c".
Unfortunately, sizer is not available before some revision of
Dec Unix 4.0, and it also returns some rather cryptic names for
processors. Perhaps the <code>implver</code> and
<code>amask</code> assembly instructions are better, but that
doesn't differentiate between ev5 and ev56.
<li> Identify Sparc processors. config.guess should say supersparc,
microsparc, ultrasparc1, ultrasparc2, etc.
<li> Identify HPPA processors similarly.
<li> Get lots of information about a Solaris system: prtconf -vp
<li> For some target machines and some compilers, specific options
are needed (sparcv8/gcc needs -mv8, sparcv8/cc needs -cg92,
Irix64/cc needs -64, Irix32/cc might need -n32, etc). Some are
set already, add more, see configure.in.
<li> Options to be passed to the assembler (via the compiler, using
whatever syntax the compiler uses for passing options to the
assembler).
<li> On Solaris 7, check if gcc supports native v9 64-bit
arithmetic. If not compile using "cc -fast -xarch=v9".
(Problem: -fast requires that we link with -fast too, which
might not be very good. Pass "-xO4 -xtarget=native" instead?)
<li> Extend the "optional" compiler arguments to choose the first
that works from a set, so when gcc gets athlon support it
can try -mcpu=athlon, -mcpu=pentiumpro, or -mcpu=i486,
whichever works.
<li> Detect gcc &gt;= 2.96 and enable -march=pentiumpro for relevant
x86s. (A bug in gcc 2.95.2 prevents it being used
unconditionally.)
<li> Build multiple variants of the library under certain systems.
An example is -n32, -o32, and -64 on Irix.
<li> There are a few filenames that don't fit in 14 chars, if this
matters.
<li> Enable support for FORTRAN versions of mpn files (eg. for
mpn/cray/mulww.f). Add "f" to the mpn path searching, run AC_PROG_F77 if
such a file is found. Automake will generate some of what's needed in the
makefiles, but libtool doesn't know fortran and so rules like the current
".asm.lo" will be needed.
<li> Only run GMP_PROG_M4 if it's needed, ie. if there's .asm files
selected from the mpn path. This might help say a generic C
build on weird systems.
</ul>
<p> In general, getting the exact right configuration, passing the
exact right options to the compiler, etc, might mean that the GMP
performance more than doubles.
<p> When testing, make sure to test at least the following for all our
target machines: (1) Both gcc and cc (and c89). (2) Both 32-bit mode
and 64-bit mode (such as -n32 vs -64 under Irix). (3) Both the system
`make' and GNU `make'. (4) With and without GNU binutils.
<h4>Miscellaneous</h4>
<ul>
<li> Work on the way we build the library. We now do it building
convenience libraries but then listing all the object files a
second time in the top level Makefile.am.
<li> Get rid of mp[zq]/sub.c, and instead define a compile parameter to
mp[zq]/add.c to decide whether it will add or subtract. Will decrease
redundancy. Similarly in other places.
<li> Make <code>mpz_div</code> and <code>mpz_divmod</code> use rounding
analogous to <code>mpz_mod</code>. Document, and list as an
incompatibility.
<li> Maybe make mpz_pow_ui.c more like mpz/ui_pow_ui.c, or write new
mpn/generic/pow_ui.
<li> Make mpz_invert call mpn_gcdext directly.
<li> Make a build option to enable execution profiling with gprof. In
particular look at getting the right <code>mcount</code> call at
the start of each assembler subroutine (for important targets at
least).
</ul>
<h4>Aids to Debugging</h4>
<ul>
<li> Make an option for stack-alloc.c to call <code>malloc</code>
separately for each <code>TMP_ALLOC</code> block, so a redzoning
malloc debugger could be used during development.
<li> Add <code>ASSERT</code>s at the start of each user-visible
mpz/mpq/mpf function to check the validity of each
<code>mp?_t</code> parameter, in particular to check they've been
<code>mp?_init</code>ed. This might catch elementary mistakes in
user programs. Care would need to be taken over
<code>MPZ_TMP_INIT</code>ed variables used internally.
</ul>
<h4>Documentation</h4>
<ul>
<li> Document conventions, like that <code>unsigned long int</code> is used for
bit counts/ranges, and that <code>mp_size_t</code> is used for limb counts.
<li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
</ul>
<hr>
<table width="100%">
<tr>
<td>
<font size=2>
Please send comments about this page to
<a href="mailto:tege@swox.com">tege@swox.com</a>.<br>
Copyright (C) 1999, 2000 Torbj&ouml;rn Granlund.
</font>
</td>
<td align=right>
</td>
</tr>
</table>
</body>
</html>