<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>
GMP Itemized Development Tasks
</title>
</head>
<body bgcolor=lightgreen>

<center>
<h1>
GMP Itemized Development Tasks
</h1>
</center>

<comment>
An up-to-date html version of this file is available at
<a href="http://www.swox.com/gmp/tasks.html">http://www.swox.com/gmp/tasks.html</a>.
</comment>

<p> This file lists itemized GMP development tasks. Not all the tasks
listed here are suitable for volunteers, but many of them are.
Please see the <a href="projects.html">projects file</a> for more
sizeable projects.

<h4>Correctness and Completeness</h4>
<ul>
<li> The HPUX 10.20 assembler requires a `.LEVEL 1.1' directive for accepting the
new instructions. Unfortunately, the HPUX 9 assembler as well as earlier
assemblers reject that directive. How very clever of HP! We will have to
pass assembler options, and make sure it works with new and old systems
and with the GNU assembler.
<li> The various reuse.c tests need to force reallocation by calling
<code>_mpz_realloc</code> with a small (1 limb) size.
<li> One reuse case is missing from mpX/tests/reuse.c: <code>mpz_XXX(a,a,a)</code>.
<li> When printing <code>mpf_t</code> numbers with exponents &gt; 2^53 on machines with 64-bit
<code>mp_exp_t</code>, the precision of
<code>__mp_bases[base].chars_per_bit_exactly</code> is insufficient and
<code>mpf_get_str</code> aborts. Detect and compensate.
<li> Fix <code>mpz_get_si</code> to work properly for MIPS N32 ABI (and other
machines that use <code>long long</code> for storing limbs).
<li> Make the string reading functions allow the `0x' prefix when the base is
explicitly 16. They currently only allow that prefix when the base is
unspecified.
<li> In the development sources, we return abs(a%b) in the
<code>mpz_*_ui</code> division routines. Perhaps make them return the
real remainder instead? That would change the return type to
<code>signed long int</code>.
<li> <code>mpf_eq</code> is not always correct when one operand is
1000000000... and the other operand is 0111111111..., i.e., extremely
close. There is a special case in <code>mpf_sub</code> for this
situation; put similar code in <code>mpf_eq</code>.
<li> <code>mpf_eq</code> doesn't implement what gmp.texi specifies. It should
compare partial limbs, not just whole limbs.
<li> Install Alpha assembly changes (prec/gmp-alpha-patches).
<li> NeXT has problems with newlines in asm strings in longlong.h. Also,
<code>__builtin_constant_p</code> is unavailable? Same problem with MacOS
X.
<li> Shut up SGI's compiler by declaring <code>dump_abort</code> in
mp?/tests/*.c.
<li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000.
</ul>
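<p> The `0x' prefix item above could be handled by a small helper along
these lines. This is only a sketch: <code>skip_hex_prefix</code> is a
hypothetical name, not an existing GMP function, and a real version would
also have to switch the base to 16 when the base is unspecified.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of GMP.  Return a pointer past an
   optional "0x"/"0X" prefix when the base is explicitly 16 or is
   unspecified (0); otherwise return str unchanged.  */
static const char *
skip_hex_prefix (const char *str, int base)
{
  assert (str != NULL);
  if ((base == 0 || base == 16)
      && str[0] == '0' && (str[1] == 'x' || str[1] == 'X'))
    return str + 2;
  return str;
}
```

<p> With this, <code>skip_hex_prefix ("0x1f", 16)</code> points at
"1f", while the same string with base 10 is left alone so the digit
parser can reject the `x'.</p>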
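<p> To illustrate the remainder question for the <code>mpz_*_ui</code>
division routines, the sketch below shows on plain <code>long</code>
values how the sign of the real remainder depends on the rounding mode,
which is why returning it would need a signed type. The function names
are made up for illustration, and the code assumes C99, where
<code>%</code> truncates toward zero.</p>

```c
#include <assert.h>

/* Illustration on machine words, not GMP code.  The divisor d is
   assumed positive, as in the _ui routines.  */

/* Truncating division: remainder has the sign of the dividend.  */
static long
demo_trunc_rem (long a, long d)
{
  assert (d > 0);
  return a % d;
}

/* Floor division: remainder is always >= 0 for positive d.  */
static long
demo_floor_rem (long a, long d)
{
  long r = demo_trunc_rem (a, d);
  return (r < 0) ? r + d : r;
}

/* Ceiling division: remainder is always <= 0 for positive d.  */
static long
demo_ceil_rem (long a, long d)
{
  long r = demo_trunc_rem (a, d);
  return (r > 0) ? r - d : r;
}

/* What an abs(a%b) style interface would hand back instead.  */
static unsigned long
demo_abs_rem (long r)
{
  return (r < 0) ? -(unsigned long) r : (unsigned long) r;
}
```

<p> For 7 and 3 the ceiling remainder is -2, but an unsigned return value
can only report 2; the caller has to know the rounding mode to recover
the sign.</p>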
<h4>Machine Independent Optimization</h4>
<ul>
<li> In hundreds of places in the code, we invoke <code>count_leading_zeros</code> and then
check if the returned count is zero. Instead check the most significant
bit of the operand, and avoid invoking <code>count_leading_zeros</code> if
the bit is set. This is an optimization on all machines, and significant
on machines with slow <code>count_leading_zeros</code>.
<li> In a couple of places <code>count_trailing_zeros</code> is used
on more or less uniformly distributed numbers. For some CPUs
<code>count_trailing_zeros</code> is slow and it's probably worth
handling the frequently occurring 0 to 2 trailing zeros cases specially.
<li> Change all places that use <code>udiv_qrnnd</code> for inverting limbs to
instead use <code>invert_limb</code>.
<li> Reorganize longlong.h so that we can inline the operations even for the
system compiler. When there is no such compiler feature, make calls to
stub functions. Write such stub functions for as many machines as
possible.
<li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
most significant limb (if <code>BITS_PER_MP_LIMB</code> &lt;= 52 bits).
(Peter Montgomery has some ideas on this subject.)
<li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
products with fewer operations.
<li> Write new <code>mpn_get_str</code> and <code>mpn_set_str</code> running in
the sub O(n^2) range, using some divide-and-conquer approach, preferably
without using division.
<li> Copy tricky code for converting a limb from the development version of
<code>mpn_get_str</code> to mpf/get_str. (Talk to Torbj&ouml;rn about this.)
<li> Consider inlining these functions: <code>mpz_size</code>,
<code>mpz_set_ui</code>, <code>mpz_set_q</code>, <code>mpz_clear</code>,
<code>mpz_init</code>, <code>mpz_get_ui</code>, <code>mpz_scan0</code>,
<code>mpz_scan1</code>, <code>mpz_getlimbn</code>,
<code>mpz_init_set_ui</code>, <code>mpz_perfect_square_p</code>,
<code>mpz_popcount</code>, <code>mpf_size</code>,
<code>mpf_get_prec</code>, <code>mpf_set_prec_raw</code>,
<code>mpf_set_ui</code>, <code>mpf_init</code>, <code>mpf_init2</code>,
<code>mpf_clear</code>, <code>mpf_set_si</code>.
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very
fast on one or two limb moduli, due to a lot of function call
overheads. These could perhaps be handled as special cases.
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> want better
algorithm selection, and the latter should use REDC. Both could
change to use an <code>mpn_powm</code> and <code>mpn_redc</code>.
<li> <code>mpn_gcd</code> could perhaps be sped up on small to
moderate sizes by improving <code>find_a</code>, possibly just by
providing an alternate implementation for CPUs with slowish
<code>count_leading_zeros</code>.
<li> Implement a cache localized evaluate and interpolate for the
toom3 <code>USE_MORE_MPN</code> code. The necessary
right-to-left <code>mpn_divexact_by3c</code> exists.
<li> <code>mpn_mul_basecase</code> on NxM with big N but small M could try for
better cache locality by taking N piece by piece. The current code could
be left available for CPUs without caching. Depending on how Karatsuba etc.
is applied to unequal size operands, it might be possible to assume M is
always smallish.
</ul>
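<p> The <code>count_leading_zeros</code> item above amounts to testing the
high bit before paying for the count. A minimal sketch, assuming an
<code>unsigned long</code> stand-in for <code>mp_limb_t</code> and a
deliberately slow portable counter; all names here are hypothetical, not
the longlong.h macros.</p>

```c
#include <assert.h>
#include <limits.h>

typedef unsigned long demo_limb_t;      /* stand-in for mp_limb_t */
#define DEMO_LIMB_BITS    (sizeof (demo_limb_t) * CHAR_BIT)
#define DEMO_LIMB_HIGHBIT ((demo_limb_t) 1 << (DEMO_LIMB_BITS - 1))

/* Portable, slow stand-in for count_leading_zeros.  x must be nonzero.  */
static unsigned
demo_slow_clz (demo_limb_t x)
{
  unsigned cnt = 0;
  assert (x != 0);
  while ((x & DEMO_LIMB_HIGHBIT) == 0)
    {
      x <<= 1;
      cnt++;
    }
  return cnt;
}

/* Test the most significant bit first and only fall back to the
   slow count when the operand is not already normalized.  */
static unsigned
demo_leading_zeros (demo_limb_t x)
{
  if (x & DEMO_LIMB_HIGHBIT)
    return 0;                   /* fast path, no count needed */
  return demo_slow_clz (x);
}
```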
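<p> For the <code>count_trailing_zeros</code> item, handling the 0 to 2
trailing zeros cases specially might look like the sketch below. On
uniformly distributed operands those three cases cover 7 out of 8 inputs
(1/2 + 1/4 + 1/8). The names are hypothetical and the slow loop merely
stands in for whatever the CPU provides.</p>

```c
#include <assert.h>

typedef unsigned long demo_limb_t;      /* stand-in for mp_limb_t */

/* Portable, slow stand-in for count_trailing_zeros.  x must be nonzero.  */
static unsigned
demo_slow_ctz (demo_limb_t x)
{
  unsigned cnt = 0;
  assert (x != 0);
  while ((x & 1) == 0)
    {
      x >>= 1;
      cnt++;
    }
  return cnt;
}

/* Peel off the common small counts with plain bit tests, and only
   use the slow count for operands with three or more low zero bits.  */
static unsigned
demo_trailing_zeros (demo_limb_t x)
{
  assert (x != 0);
  if (x & 1)
    return 0;
  if (x & 2)
    return 1;
  if (x & 4)
    return 2;
  return 3 + demo_slow_ctz (x >> 3);    /* x >> 3 is still nonzero */
}
```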
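<p> As a baseline for the <code>umul_ppmm</code> item, here is the usual
schoolbook scheme with four half-limb partial products, written for
32-bit limbs so the result can be checked against a 64-bit multiply.
It is a sketch of the general technique, not a copy of the longlong.h
macro, and it does not itself achieve the "fewer operations" goal; the
function names are hypothetical.</p>

```c
#include <stdint.h>

/* Double-limb product of u and v from four 16x16 partial products.  */
static void
demo_umul_ppmm (uint32_t *hi, uint32_t *lo, uint32_t u, uint32_t v)
{
  uint32_t ul = u & 0xFFFF, uh = u >> 16;
  uint32_t vl = v & 0xFFFF, vh = v >> 16;

  uint32_t x0 = ul * vl;
  uint32_t x1 = ul * vh;
  uint32_t x2 = uh * vl;
  uint32_t x3 = uh * vh;

  x1 += x0 >> 16;               /* this cannot overflow */
  x1 += x2;                     /* but this can */
  if (x1 < x2)
    x3 += (uint32_t) 1 << 16;   /* propagate the carry into the high part */

  *hi = x3 + (x1 >> 16);
  *lo = (x1 << 16) + (x0 & 0xFFFF);
}

/* Convenience wrapper returning the product as one 64-bit value.  */
static uint64_t
demo_umul_wide (uint32_t u, uint32_t v)
{
  uint32_t hi, lo;
  demo_umul_ppmm (&hi, &lo, u, v);
  return ((uint64_t) hi << 32) | lo;
}
```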
<h4>Machine Dependent Optimization</h4>
<ul>
<li> Run the `tune' utility for more compiler/CPU combinations. We would like
to have gmp-mparam.h files in practically every implementation specific
mpn subdirectory, and repeat each *_THRESHOLD for gcc and the system
compiler. See the `tune' top-level directory for more information.
<li> Alpha: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
<code>mpn_mul_1</code> for the 21264. On 21264, they should run at 4, 3,
and 3 cycles/limb respectively, if the code is unrolled properly. (Ask
Torbj&ouml;rn for his xm.s and xam.s skeleton files.)
<li> Alpha: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
<code>mpn_mul_1</code> for the 21164. This should use both integer
multiplies and floating-point multiplies. For the floating-point
operations, the single-limb multiplier should be split into three 21-bit
chunks.
<li> UltraSPARC: Rewrite 64-bit <code>mpn_addmul_1</code>,
<code>mpn_submul_1</code>, and <code>mpn_mul_1</code>. Should use
floating-point operations, and split the invariant single-limb multiplier
into 21-bit chunks. Should give about 18 cycles/limb, but the pipeline
will become very deep. (Torbj&ouml;rn has C code that is useful as a starting
point.)
<li> UltraSPARC: Rewrite <code>mpn_lshift</code> and <code>mpn_rshift</code>.
Should give 2 cycles/limb. (Torbj&ouml;rn has code that just needs to be
finished.)
<li> SPARC32/V9: Find out why the speed of <code>mpn_addmul_1</code>
and the other multiplies varies so much on successive sizes.
<li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
<code>mpn_mul_1</code>. The current development code runs at 11
cycles/limb, which is already very good. But it should be possible to
saturate the cache, which will happen at 7.5 cycles/limb.
<li> Sparc &amp; SparcV8: Enable umul.asm for native cc. The generic
longlong.h <code>umul_ppmm</code> is suspected to be causing
<code>sqr_basecase</code> to be slower than <code>mul_basecase</code>.
<li> UltraSPARC: Write <code>umul_ppmm</code>. Important in particular for
<code>mpn_sqr_basecase</code>. Using four "<code>mulx</code>"s either
with an asm block or via the generic C code is about 90 cycles.
<li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
for important machines. Helping the generic sqr_basecase.c with an
<code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
<li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
Will bring time from 1.75 to 1.25 cycles/limb.
<li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1. (See Pentium code.)
<li> Alpha: Optimize <code>count_leading_zeros</code>.
<li> Alpha: Optimize <code>udiv_qrnnd</code>. (Ask Torbj&ouml;rn for the file
test-udiv-preinv.c as a starting point.)
<li> R10000/R12000: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>.
It should just require 3 cycles/limb, but the current code propagates
carry poorly. The trick is to add carry-in later than we do now,
decreasing the number of operations used to generate carry-out from 4
to 3.
<li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
The pipeline is now extremely deep, perhaps unnecessarily deep. Also, r5
is unused. (Ask Torbj&ouml;rn for a copy of the current code.)
<li> PPC32: Write <code>mpn_rshift</code> based on new <code>mpn_lshift</code>.
<li> PPC32: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>. Should
run at just 3.25 cycles/limb. (Ask for xxx-add_n.s as a starting point.)
<li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
<li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
<code>mpn_sqr_basecase</code>. This should use a "vertical multiplication
method" to avoid carry propagation, splitting one of the operands into
11-bit chunks.
<li> Cray: Vectorize main functions, perhaps in assembly language.
<li> Cray: Write <code>mpn_mul_basecase</code> and
<code>mpn_sqr_basecase</code>. Same comment applies to this as to the
same functions for Fujitsu VPP.
<li> Improve <code>count_leading_zeros</code> for 64-bit machines:

<pre>
	if ((x &gt;&gt; W_TYPE_SIZE-W_TYPE_SIZE/2) == 0) { x &lt;&lt;= W_TYPE_SIZE/2; cnt += W_TYPE_SIZE/2; }
	if ((x &gt;&gt; W_TYPE_SIZE-W_TYPE_SIZE/4) == 0) { x &lt;&lt;= W_TYPE_SIZE/4; cnt += W_TYPE_SIZE/4; }
	... </pre>

</ul>
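<p> The two-line <code>count_leading_zeros</code> fragment in the last
item above continues down to a step of one bit. Written out as a loop
for a 64-bit word it could look like this; the function name and the
use of <code>uint64_t</code> are assumptions made for the sketch, and
the real thing would be a longlong.h macro.</p>

```c
#include <assert.h>
#include <stdint.h>

#define W_TYPE_SIZE 64

/* Binary search for the highest set bit: at each step, if the top
   `step' bits are all zero, shift them out and add step to the count.
   x must be nonzero.  */
static unsigned
demo_clz64 (uint64_t x)
{
  unsigned cnt = 0;
  unsigned step;

  assert (x != 0);
  for (step = W_TYPE_SIZE / 2; step > 0; step /= 2)
    {
      if ((x >> (W_TYPE_SIZE - step)) == 0)
        {
          x <<= step;
          cnt += step;
        }
    }
  return cnt;
}
```

<p> This takes six compare-and-shift steps for any input, rather than a
number of iterations proportional to the count.</p>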
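<p> For reference when reading the <code>mpn_add_n</code> items above,
this is how carry-in and carry-out show up in portable C when the two
operand limbs are added first and the carry-in is added afterwards.
It is not the R10000 code, and whether this ordering matches the
intended assembly schedule is an assumption; the names are
hypothetical.</p>

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t demo_limb_t;           /* stand-in for mp_limb_t */

/* rp[0..n-1] = up[0..n-1] + vp[0..n-1]; return the carry-out (0 or 1).  */
static demo_limb_t
demo_add_n (demo_limb_t *rp, const demo_limb_t *up,
            const demo_limb_t *vp, size_t n)
{
  demo_limb_t cy = 0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      demo_limb_t s = up[i] + vp[i];    /* add the operands first */
      demo_limb_t c1 = s < up[i];       /* carry out of that addition */
      demo_limb_t r = s + cy;           /* carry-in is added late */
      demo_limb_t c2 = r < s;           /* carry out of adding carry-in */
      rp[i] = r;
      cy = c1 | c2;                     /* c1 and c2 are never both set */
    }
  return cy;
}
```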
<h4>New Functionality</h4>
<ul>
<li> <code>mpz_get_nth_ui</code>. Return the nth word (not necessarily the nth limb).
<li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
<li> Let `0b' and `0B' mean binary input everywhere.
<li> Add <code>mpq_set_f</code> for assignment from <code>mpf_t</code>
(cf. <code>mpq_set_d</code>).
<li> Maybe make <code>mpz_init</code> (and <code>mpq_init</code>) do lazy
allocation. Set <code>ALLOC(var)</code> to 0, and have
<code>mpz_realloc</code> special-handle that case. Update functions that
rely on a single limb (like <code>mpz_set_ui</code>,
<code>mpz_[tfc]div_r_ui</code>, and others).
<li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>. Make sure
the format is portable between 32-bit and 64-bit machines, and between
little-endian and big-endian machines.
<li> Handle numeric exceptions: Call an error handler, and/or set
<code>gmp_errno</code>.
<li> Implement <code>gmp_fprintf</code>, <code>gmp_sprintf</code>, and
<code>gmp_snprintf</code>. Think about some sort of wrapper
around <code>printf</code> so it and its several variants don't
have to be completely reimplemented.
<li> Implement some <code>mpq</code> input and output functions.
<li> Implement a full precision <code>mpz_kronecker</code>; leave
<code>mpz_jacobi</code> for compatibility.
<li> Make the mpn logops and copies available in gmp.h. Since they can
be either library functions or inlines, gmp.h would need to be
generated from a gmp.in based on what's in the library. gmp.h
would still be compiler-independent though.
<li> Make versions of <code>mpz_set_str</code> etc. taking string
lengths rather than null-terminators.
<li> Consider changing the thresholds to apply the simpler algorithm when
"<code>&lt;=</code>" rather than "<code>&lt;</code>", so a threshold can
be set to <code>MP_SIZE_T_MAX</code> to get only the simpler code (the
compiler will know <code>size &lt;= MP_SIZE_T_MAX</code> is always true).
<li> <code>mpz_cdiv_q_2exp</code> and <code>mpz_cdiv_r_2exp</code>
could be implemented to match the corresponding tdiv and fdiv.
Maybe some code sharing is possible.
</ul>
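<p> The thresholds item above can be seen in miniature below. With
"<code>&lt;</code>" a size equal to the maximum still selects the fancier
algorithm, while with "<code>&lt;=</code>" the test is true for every
representable size. All names are hypothetical stand-ins
(<code>demo_size_t</code> for <code>mp_size_t</code>,
<code>DEMO_SIZE_MAX</code> for <code>MP_SIZE_T_MAX</code>).</p>

```c
#include <limits.h>

typedef long demo_size_t;               /* stand-in for mp_size_t */
#define DEMO_SIZE_MAX LONG_MAX          /* stand-in for MP_SIZE_T_MAX */

/* A threshold set to the maximum, meaning "never use the fancy code".  */
#define DEMO_THRESHOLD DEMO_SIZE_MAX

enum demo_algorithm { DEMO_SIMPLE, DEMO_FANCY };

/* Current style: the simpler algorithm applies below the threshold.  */
static enum demo_algorithm
demo_choose_lt (demo_size_t n)
{
  return (n < DEMO_THRESHOLD) ? DEMO_SIMPLE : DEMO_FANCY;
}

/* Proposed style: always true when the threshold is the maximum, so a
   compiler is free to drop the fancy branch altogether.  */
static enum demo_algorithm
demo_choose_le (demo_size_t n)
{
  return (n <= DEMO_THRESHOLD) ? DEMO_SIMPLE : DEMO_FANCY;
}
```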
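<p> What a Chinese Remainder Reconstruction function such as the
suggested <code>mpz_crr</code> would compute can be shown on machine
words for two coprime moduli. This is a sketch with hypothetical
function names; it assumes small positive moduli so that the
intermediate products fit in 64 bits, and says nothing about how an mpz
version should be organized.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Inverse of a modulo m by the extended Euclidean algorithm.
   Requires m >= 1 and gcd (a, m) == 1.  */
static int64_t
demo_mod_inverse (int64_t a, int64_t m)
{
  int64_t r0 = m, r1 = a % m, t0 = 0, t1 = 1;

  if (r1 < 0)
    r1 += m;
  while (r1 != 0)
    {
      int64_t q = r0 / r1, tmp;
      tmp = r0 - q * r1;  r0 = r1;  r1 = tmp;
      tmp = t0 - q * t1;  t0 = t1;  t1 = tmp;
    }
  assert (r0 == 1);             /* a and m must be coprime */
  return (t0 < 0) ? t0 + m : t0;
}

/* Return x in [0, m1*m2) with x = r1 (mod m1) and x = r2 (mod m2),
   given 0 <= r1 < m1, 0 <= r2 < m2 and gcd (m1, m2) == 1.  */
static int64_t
demo_crt2 (int64_t r1, int64_t m1, int64_t r2, int64_t m2)
{
  int64_t inv = demo_mod_inverse (m1 % m2, m2);
  int64_t k = ((r2 - r1) % m2) * inv % m2;

  if (k < 0)
    k += m2;
  return r1 + m1 * k;
}
```

<p> For example, the residues 2 mod 3 and 3 mod 5 reconstruct to 8.</p>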
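<p> The rounding that <code>mpz_cdiv_q_2exp</code> and
<code>mpz_cdiv_r_2exp</code> would need, next to the tdiv and fdiv
variants, is sketched here on 64-bit integers. The <code>demo_</code>
names are hypothetical, and the code divides by 2^k instead of shifting,
to avoid relying on right shifts of negative values.</p>

```c
#include <assert.h>
#include <stdint.h>

/* 2^k as a signed 64-bit value; k is kept small to avoid overflow.  */
static int64_t
demo_pow2 (unsigned k)
{
  assert (k < 62);
  return (int64_t) 1 << k;
}

/* Quotient rounded toward zero.  */
static int64_t
demo_tdiv_q_2exp (int64_t n, unsigned k)
{
  return n / demo_pow2 (k);
}

/* Quotient rounded toward minus infinity.  */
static int64_t
demo_fdiv_q_2exp (int64_t n, unsigned k)
{
  int64_t d = demo_pow2 (k), q = n / d;
  return (n % d < 0) ? q - 1 : q;
}

/* Quotient rounded toward plus infinity.  */
static int64_t
demo_cdiv_q_2exp (int64_t n, unsigned k)
{
  int64_t d = demo_pow2 (k), q = n / d;
  return (n % d > 0) ? q + 1 : q;
}

/* Matching remainder, n - q * 2^k, which is zero or negative.  */
static int64_t
demo_cdiv_r_2exp (int64_t n, unsigned k)
{
  return n - demo_cdiv_q_2exp (n, k) * demo_pow2 (k);
}
```

<p> The three quotient routines differ only in the final adjustment,
which is where code sharing looks most plausible.</p>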
<h4>Configuration</h4>

<ul>
<li> Improve config.guess. We want to recognize the processor very
accurately, more accurately than other GNU packages.
config.guess does not currently make the distinctions we would
like it to, and a --target often needs to be set explicitly.

For example, "sparc" is not very useful as a machine architecture
denotation. We want to distinguish old 32-bit SPARC without
multiply support from newer 32-bit SPARC with such support. We
want to recognize a SuperSPARC, since its implementation of the
UDIV instruction is not complete, and will trap to the OS kernel
for certain operands. And we want to recognize 64-bit capable
SPARC processors as such. While the assembly routines can use
64-bit operations on all 64-bit SPARC processors, one cannot use
64-bit limbs under all operating systems. E.g., Solaris 2.5 and
2.6 don't preserve the upper 32 bits of most processor
registers. For SPARC we therefore sometimes need to choose GMP
configuration depending both on processor and operating system.

<li> Remember to make sure config.sub accepts any output from config.guess.

<li> Find out whether there's an alloca available and how to use it.
AC_FUNC_ALLOCA has various system dependencies covered, but we
don't want its alloca.c replacement. (One thing current cpp
tests don't cover: HPUX 10 C compiler supports alloca, but we
cannot find any symbol to test in order to know if we're on
HPUX 10. Damn.)
<li> Identify Mips processor under Irix: `hinv -c processor'.
config.guess should say mips2, mips3, and mips4.
<li> Identify Alpha processor under OSF: "/usr/sbin/sizer -c".
Unfortunately, sizer is not available before some revision of
Dec Unix 4.0, and it also returns some rather cryptic names for
processors. Perhaps the <code>implver</code> and
<code>amask</code> assembly instructions are better, but they
don't differentiate between ev5 and ev56.
<li> Identify Sparc processors. config.guess should say supersparc,
microsparc, ultrasparc1, ultrasparc2, etc.
<li> Identify HPPA processors similarly.
<li> Get lots of information about a Solaris system: prtconf -vp
<li> For some target machines and some compilers, specific options
are needed (sparcv8/gcc needs -mv8, sparcv8/cc needs -cg92,
Irix64/cc needs -64, Irix32/cc might need -n32, etc). Some are
set already; add more, see configure.in.
<li> Options to be passed to the assembler (via the compiler, using
whatever syntax the compiler uses for passing options to the
assembler).
<li> On Solaris 7, check if gcc supports native v9 64-bit
arithmetic. If not, compile using "cc -fast -xarch=v9".
(Problem: -fast requires that we link with -fast too, which
might not be very good. Pass "-xO4 -xtarget=native" instead?)
<li> Extend the "optional" compiler arguments to choose the first
that works from a set, so when gcc gets athlon support it
can try -mcpu=athlon, -mcpu=pentiumpro, or -mcpu=i486,
whichever works.
<li> Detect gcc &gt;=2.96 and enable -march=pentiumpro for relevant
x86s. (A bug in gcc 2.95.2 prevents it being used
unconditionally.)
<li> Build multiple variants of the library under certain systems.
An example is -n32, -o32, and -64 on Irix.
<li> There are a few filenames that don't fit in 14 chars, if this
matters.
<li> Enable support for FORTRAN versions of mpn files (e.g. for
mpn/cray/mulww.f). Add "f" to the mpn path searching, run AC_PROG_F77 if
such a file is found. Automake will generate some of what's needed in the
makefiles, but libtool doesn't know fortran and so rules like the current
".asm.lo" will be needed.
<li> Only run GMP_PROG_M4 if it's needed, i.e. if there are .asm files
selected from the mpn path. This might help, say, a generic C
build on weird systems.
</ul>

<p> In general, getting the exact right configuration, passing the
exact right options to the compiler, etc, might mean that the GMP
performance more than doubles.

<p> When testing, make sure to test at least the following for all our
target machines: (1) Both gcc and cc (and c89). (2) Both 32-bit mode
and 64-bit mode (such as -n32 vs -64 under Irix). (3) Both the system
`make' and GNU `make'. (4) With and without GNU binutils.

<h4>Miscellaneous</h4>
<ul>

<li> Work on the way we build the library. We currently build
convenience libraries but then list all the object files a
second time in the top level Makefile.am.
<li> Get rid of mp[zq]/sub.c, and instead define a compile parameter to
mp[zq]/add.c to decide whether it will add or subtract. Will decrease
redundancy. Similarly in other places.
<li> Make <code>mpz_div</code> and <code>mpz_divmod</code> use rounding
analogous to <code>mpz_mod</code>. Document, and list as an
incompatibility.
<li> Maybe make mpz_pow_ui.c more like mpz/ui_pow_ui.c, or write new
mpn/generic/pow_ui.
<li> Make <code>mpz_invert</code> call <code>mpn_gcdext</code> directly.
<li> Make a build option to enable execution profiling with gprof. In
particular look at getting the right <code>mcount</code> call at
the start of each assembler subroutine (for important targets at
least).
</ul>

<h4>Aids to Debugging</h4>
<ul>
<li> Make an option for stack-alloc.c to call <code>malloc</code>
separately for each <code>TMP_ALLOC</code> block, so a redzoning
malloc debugger could be used during development.
<li> Add <code>ASSERT</code>s at the start of each user-visible
mpz/mpq/mpf function to check the validity of each
<code>mp?_t</code> parameter, in particular to check they've been
<code>mp?_init</code>ed. This might catch elementary mistakes in
user programs. Care would need to be taken over
<code>MPZ_TMP_INIT</code>ed variables used internally.
</ul>
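<p> A validity check of the kind the <code>ASSERT</code> item above has
in mind might test a few structural invariants. The struct below is a
hypothetical stand-in with the usual alloc/size/limb-pointer layout, not
the definition from gmp.h, and the chosen invariants are assumptions. A
check like this can only catch some uninitialized variables, since
garbage may happen to look valid.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an mpz_t style object.  */
typedef struct
{
  int alloc;                    /* number of limbs allocated */
  int size;                     /* limbs in use, negative if value < 0 */
  unsigned long *d;             /* pointer to the limbs */
} demo_mpz;

/* Return 1 if z looks like a properly initialized object, else 0.  */
static int
demo_mpz_valid_p (const demo_mpz *z)
{
  int abs_size;

  if (z == NULL || z->d == NULL || z->alloc < 1)
    return 0;
  abs_size = (z->size < 0) ? -z->size : z->size;
  if (abs_size > z->alloc)
    return 0;
  /* The most significant limb in use must be nonzero.  */
  if (abs_size > 0 && z->d[abs_size - 1] == 0)
    return 0;
  return 1;
}

/* To be placed at the start of each user-visible function.  */
#define DEMO_ASSERT_MPZ(z)  assert (demo_mpz_valid_p (z))
```

<p> If lazy allocation were adopted, the <code>alloc &lt; 1</code> test
would have to be relaxed accordingly.</p>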
<h4>Documentation</h4>
<ul>
<li> Document conventions, like that <code>unsigned long int</code> is used for
bit counts/ranges, and that <code>mp_size_t</code> is used for limb counts.
<li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
</ul>

<hr>

<table width="100%">
<tr>
<td>
<font size=2>
Please send comments about this page to
<a href="mailto:tege@swox.com">tege@swox.com</a>.<br>
Copyright (C) 1999, 2000 Torbj&ouml;rn Granlund.
</font>
</td>
<td align=right>
</td>
</tr>
</table>

</body>
</html>