macssh/gmp/doc/tasks.html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
  <title>
    GMP Itemized Development Tasks
  </title>
</head>
<body bgcolor=lightgreen>

<center>
  <h1>
    GMP Itemized Development Tasks
  </h1>
</center>

<comment>
  An up-to-date html version of this file is available at
  <a href="http://www.swox.com/gmp/tasks.html">http://www.swox.com/gmp/tasks.html</a>.
</comment>

<p> This file lists itemized GMP development tasks.  Not all the tasks
    listed here are suitable for volunteers, but many of them are.
    Please see the <a href="projects.html">projects file</a> for more
    sizeable projects.

<h4>Correctness and Completeness</h4>
<ul>
<li> HPUX 10.20 assembler requires a `.LEVEL 1.1' directive for accepting the
     new instructions.  Unfortunately, the HPUX 9 assembler as well as earlier
     assemblers reject that directive.  How very clever of HP!  We will have to
     pass assembler options, and make sure it works with new and old systems
     and GNU assembler.
<li> The various reuse.c tests need to force reallocation by calling
     <code>_mpz_realloc</code> with a small (1 limb) size.
<li> One reuse case is missing from mpX/tests/reuse.c: <code>mpz_XXX(a,a,a)</code>.
<li> When printing mpf_t numbers with exponents > 2^53 on machines with 64-bit
     <code>mp_exp_t</code>, the precision of
     <code>__mp_bases[base].chars_per_bit_exactly</code> is insufficient and
     <code>mpf_get_str</code> aborts.  Detect and compensate.
<li> Fix <code>mpz_get_si</code> to work properly for MIPS N32 ABI (and other
     machines that use <code>long long</code> for storing limbs.)
<li> Make the string reading functions allow the `0x' prefix when the base is
     explicitly 16.  They currently only allow that prefix when the base is
     unspecified.
<li> In the development sources, we return abs(a%b) in the
     <code>mpz_*_ui</code> division routines.  Perhaps make them return the
     real remainder instead?  Changes return type to <code>signed long int</code>.
<li> <code>mpf_eq</code> is not always correct, when one operand is
     1000000000... and the other operand is 0111111111..., i.e., extremely
     close.  There is a special case in <code>mpf_sub</code> for this
     situation; put similar code in <code>mpf_eq</code>.
<li> mpf_eq doesn't implement what gmp.texi specifies.  It should not use just
     whole limbs, but partial limbs.
<li> Install Alpha assembly changes (prec/gmp-alpha-patches).
<li> NeXT has problems with newlines in asm strings in longlong.h.  Also,
     <code>__builtin_constant_p</code> is unavailable?  Same problem with MacOS
     X.
<li> Shut up SGI's compiler by declaring <code>dump_abort</code> in
     mp?/tests/*.c.
<li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000.
</ul>


<h4>Machine Independent Optimization</h4>
<ul>
<li> In hundreds of places in the code, we invoke count_leading_zeros and then
     check if the returned count is zero.  Instead check the most significant
     bit of the operand, and avoid invoking <code>count_leading_zeros</code> if
     the bit is set.  This is an optimization on all machines, and significant
     on machines with slow <code>count_leading_zeros</code>.
<li> In a couple of places <code>count_trailing_zeros</code> is used
     on more or less uniformly distributed numbers.  For some CPUs
     <code>count_trailing_zeros</code> is slow and it's probably worth
     handling the frequently occurring 0 to 2 trailing zeros cases specially.
<li> Change all places that use <code>udiv_qrnnd</code> for inverting limbs to
     instead use <code>invert_limb</code>.
<li> Reorganize longlong.h so that we can inline the operations even for the
     system compiler.  When there is no such compiler feature, make calls to
     stub functions.  Write such stub functions for as many machines as
     possible.
<li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
     most significant limb (if <code>BITS_PER_MP_LIMB</code> &lt= 52 bits).
     (Peter Montgomery has some ideas on this subject.)
<li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
     products with fewer operations.
<li> Write new <code>mpn_get_str</code> and <code>mpn_set_str</code> running in
     the sub O(n^2) range, using some divide-and-conquer approach, preferably
     without using division.
<li> Copy tricky code for converting a limb from development version of
     <code>mpn_get_str</code> to mpf/get_str.  (Talk to Torbj<62>rn about this.)
<li> Consider inlining these functions: <code>mpz_size</code>,
     <code>mpz_set_ui</code>, <code>mpz_set_q</code>, <code>mpz_clear</code>,
     <code>mpz_init</code>, <code>mpz_get_ui</code>, <code>mpz_scan0</code>,
     <code>mpz_scan1</code>, <code>mpz_getlimbn</code>,
     <code>mpz_init_set_ui</code>, <code>mpz_perfect_square_p</code>,
     <code>mpz_popcount</code>, <code>mpf_size</code>,
     <code>mpf_get_prec</code>, <code>mpf_set_prec_raw</code>,
     <code>mpf_set_ui</code>, <code>mpf_init</code>, <code>mpf_init2</code>,
     <code>mpf_clear</code>, <code>mpf_set_si</code>.
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very
     fast on one or two limb moduli, due to a lot of function call
     overheads.  These could perhaps be handled as special cases.
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> want better
     algorithm selection, and the latter should use REDC.  Both could
     change to use an <code>mpn_powm</code> and <code>mpn_redc</code>.
<li> <code>mpn_gcd</code> might be able to be sped up on small to
     moderate sizes by improving <code>find_a</code>, possibly just by
     providing an alternate implementation for CPUs with slowish
     <code>count_leading_zeros</code>.
<li> Implement a cache localized evaluate and interpolate for the
     toom3 <code>USE_MORE_MPN</code> code.  The necessary
     right-to-left <code>mpn_divexact_by3c</code> exists.
<li> <code>mpn_mul_basecase</code> on NxM with big N but small M could try for
     better cache locality by taking N piece by piece.  The current code could
     be left available for CPUs without caching.  Depending how karatsuba etc
     is applied to unequal size operands it might be possible to assume M is
     always smallish.
</ul>


<h4>Machine Dependent Optimization</h4>
<ul>
<li> Run the `tune' utility for more compiler/CPU combinations.  We would like
     to have gmp-mparam.h files in practically every implementation specific
     mpn subdirectory, and repeat each *_THRESHOLD for gcc and the system
     compiler.  See the `tune' top-level directory for more information.
<li> Alpha: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code> for the 21264.  On 21264, they should run at 4, 3,
     and 3 cycles/limb respectively, if the code is unrolled properly.  (Ask
     Torbj<62>rn for his xm.s and xam.s skeleton files.)
<li> Alpha: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code> for the 21164.  This should use both integer
     multiplies and floating-point multiplies.  For the floating-point
     operations, the single-limb multiplier should be split into three 21-bit
     chunks.
<li> UltraSPARC: Rewrite 64-bit <code>mpn_addmul_1</code>,
     <code>mpn_submul_1</code>, and <code>mpn_mul_1</code>.  Should use
     floating-point operations, and split the invariant single-limb multiplier
     into 21-bit chunks.  Should give about 18 cycles/limb, but the pipeline
     will become very deep.  (Torbj<62>rn has C code that is useful as a starting
     point.)
<li> UltraSPARC: Rewrite <code>mpn_lshift</code> and <code>mpn_rshift</code>.
     Should give 2 cycles/limb.  (Torbj<62>rn has code that just needs to be
     finished.)
<li> SPARC32/V9: Find out why the speed of <code>mpn_addmul_1</code>
     and the other multiplies varies so much on successive sizes.
<li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.  The current development code runs at 11
     cycles/limb, which is already very good.  But it should be possible to
     saturate the cache, which will happen at 7.5 cycles/limb.
<li> Sparc & SparcV8: Enable umul.asm for native cc.  The generic
     longlong.h umul_ppmm is suspected to be causing sqr_basecase to
     be slower than mul_basecase.
<li> UltraSPARC: Write <code>umul_ppmm</code>.  Important in particular for
     <code>mpn_sqr_basecase</code>.  Using four "<code>mulx</code>"s either
     with an asm block or via the generic C code is about 90 cycles.
<li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
     for important machines.  Helping the generic sqr_basecase.c with an
     <code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
<li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
     Will bring time from 1.75 to 1.25 cycles/limb.
<li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1.  (See Pentium code.)
<li> Alpha: Optimize <code>count_leading_zeros</code>.
<li> Alpha: Optimize <code>udiv_qrnnd</code>.  (Ask Torbj<62>rn for the file
     test-udiv-preinv.c as a starting point.)
<li> R10000/R12000: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>.
     It should just require 3 cycles/limb, but the current code propagates
     carry poorly.  The trick is to add carry-in later than we do now,
     decreasing the number of operations used to generate carry-out from 4 to
     to 3.
<li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
     The pipeline is now extremely deep, perhaps unnecessarily deep.  Also, r5
     is unused.  (Ask Torbj<62>rn for a copy of the current code.)
<li> PPC32: Write <code>mpn_rshift</code> based on new <code>mpn_lshift</code>.
<li> PPC32: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>.  Should
     run at just 3.25 cycles/limb.  (Ask for xxx-add_n.s as a starting point.)
<li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
<li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
     <code>mpn_sqr_basecase</code>.  This should use a "vertical multiplication
     method", to avoid carry propagation.  splitting one of the operands in
     11-bit chunks.
<li> Cray: Vectorize main functions, perhaps in assembly language.
<li> Cray: Write <code>mpn_mul_basecase</code> and
     <code>mpn_sqr_basecase</code>.  Same comment applies to this as to the
     same functions for Fujitsu VPP.
<li> Improve <code>count_leading_zeros</code> for 64-bit machines:

  <pre>
  if ((x &gt&gt W_TYPE_SIZE-W_TYPE_SIZE/2) == 0) { x &lt&lt= W_TYPE_SIZE/2; cnt += W_TYPE_SIZE/2}
  if ((x &gt&gt W_TYPE_SIZE-W_TYPE_SIZE/4) == 0) { x &lt&lt= W_TYPE_SIZE/4; cnt += W_TYPE_SIZE/4}
  ... </pre>

</ul>

<h4>New Functionality</h4>
<ul>
<li> <code>mpz_get_nth_ui</code>.  Return the nth word (not necessarily the nth limb).
<li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
<li> Let `0b' and `0B' mean binary input everywhere.
<li> Add <code>mpq_set_f</code> for assignment from <code>mpf_t</code>
     (cf. <code>mpq_set_d</code>).
<li> Maybe make <code>mpz_init</code> (and <code>mpq_init</code>) do lazy
     allocation.  Set <code>ALLOC(var)</code> to 0, and have
     <code>mpz_realloc</code> special-handle that case.  Update functions that
     rely on a single limb (like <code>mpz_set_ui</code>,
     <code>mpz_[tfc]div_r_ui</code>, and others).
<li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>.  Make sure
     format is portable between 32-bit and 64-bit machines, and between
     little-endian and big-endian machines.
<li> Handle numeric exceptions: Call an error handler, and/or set
     <code>gmp_errno</code>.
<li> Implement <code>gmp_fprintf</code>, <code>gmp_sprintf</code>, and
     <code>gmp_snprintf</code>.  Think about some sort of wrapper
     around <code>printf</code> so it and its several variants don't
     have to be completely reimplemented.
<li> Implement some <code>mpq</code> input and output functions.
<li> Implement a full precision <code>mpz_kronecker</code>, leave
     <code>mpz_jacobi</code> for compatibility.
<li> Make the mpn logops and copys available in gmp.h.  Since they can
     be either library functions or inlines, gmp.h would need to be
     generated from a gmp.in based on what's in the library.  gmp.h
     would still be compiler-independent though.
<li> Make versions of <code>mpz_set_str</code> etc taking string
     lengths rather than null-terminators.
<li> Consider changing the thresholds to apply the simpler algorithm when
     "<code>&lt;=</code>" rather than "<code>&lt;</code>", so a threshold can
     be set to <code>MP_SIZE_T_MAX</code> to get only the simpler code (the
     compiler will know <code>size &lt;= MP_SIZE_T_MAX</code> is always true).
<li> <code>mpz_cdiv_q_2exp</code> and <code>mpz_cdiv_r_2exp</code>
     could be implemented to match the corresponding tdiv and fdiv.
     Maybe some code sharing is possible.
</ul>


<h4>Configuration</h4>

<ul>
<li> Improve config.guess.  We want to recognize the processor very
     accurately, more accurately than other GNU packages.
     config.guess does not currently make the distinctions we would
     like it to do and a --target often needs to be set explicitly.

     For example, "sparc" is not very useful as a machine architecture
     denotation.  We want to distinguish old 32-bit SPARC without
     multiply support from newer 32-bit SPARC with such support.  We
     want to recognize a SuperSPARC, since its implementation of the
     UDIV instruction is not complete, and will trap to the OS kernel
     for certain operands.  And we want to recognize 64-bit capable
     SPARC processors as such.  While the assembly routines can use
     64-bit operations on all 64-bit SPARC processors, one can not use
     64-bit limbs under all operating system.  E.g., Solaris 2.5 and
     2.6 doesn't preserve the upper 32 bits of most processor
     registers.  For SPARC we therefore sometimes need to choose GMP
     configuration depending both on processor and operating system.

<li> Remember to make sure config.sub accepts any output from config.guess.

<li> Find out whether there's an alloca available and how to use it.
     AC_FUNC_ALLOCA has various system dependencies covered, but we
     don't want its alloca.c replacement.  (One thing current cpp
     tests don't cover: HPUX 10 C compiler supports alloca, but
     cannot find any symbol to test in order to know if we're on
     HPUX 10.  Damn.)
<li> Identify Mips processor under Irix: `hinv -c processor'.
     config.guess should say mips2, mips3, and mips4.
<li> Identify Alpha processor under OSF: "/usr/sbin/sizer -c".
     Unfortunately, sizer is not available before some revision of
     Dec Unix 4.0, and it also returns some rather cryptic names for
     processors.  Perhaps the <code>implver</code> and
     <code>amask</code> assembly instructions are better, but that
     doesn't differentiate between ev5 and ev56.
<li> Identify Sparc processors.  config.guess should say supersparc,
     microsparc, ultrasparc1, ultrasparc2, etc.
<li> Identify HPPA processors similarly.
<li> Get lots of information about a Solaris system: prtconf -vp
<li> For some target machines and some compilers, specific options
     are needed (sparcv8/gcc needs -mv8, sparcv8/cc needs -cg92,
     Irix64/cc needs -64, Irix32/cc might need -n32, etc).  Some are
     set already, add more, see configure.in.
<li> Options to be passed to the assembler (via the compiler, using
     whatever syntax the compiler uses for passing options to the
     assembler).
<li> On Solaris 7, check if gcc supports native v9 64-bit
     arithmetic.  If not compile using "cc -fast -xarch=v9".
     (Problem: -fast requires that we link with -fast too, which
     might not be very good.  Pass "-xO4 -xtarget=native" instead?)
<li> Extend the "optional" compiler arguments to choose the first
     that works from from a set, so when gcc gets athlon support it
     can try -mcpu=athlon, -mcpu=pentiumpro, or -mcpu=i486,
     whichever works.
<li> Detect gcc >=2.96 and enable -march=pentiumpro for relevant
     x86s.  (A bug in gcc 2.95.2 prevents it being used
     unconditionally.)
<li> Build multiple variants of the library under certain systems.
     An example is -n32, -o32, and -64 on Irix.
<li> There's a few filenames that don't fit in 14 chars, if this
     matters.
<li> Enable support for FORTRAN versions of mpn files (eg. for
     mpn/cray/mulww.f).  Add "f" to the mpn path searching, run AC_PROG_F77 if
     such a file is found.  Automake will generate some of what's needed in the
     makefiles, but libtool doesn't know fortran and so rules like the current
     ".asm.lo" will be needed.
<li> Only run GMP_PROG_M4 if it's needed, ie. if there's .asm files
     selected from the mpn path.  This might help say a generic C
     build on weird systems.
</ul>

<p> In general, getting the exact right configuration, passing the
exact right options to the compiler, etc, might mean that the GMP
performance more than doubles.

<p> When testing, make sure to test at least the following for all out
target machines: (1) Both gcc and cc (and c89).  (2) Both 32-bit mode
and 64-bit mode (such as -n32 vs -64 under Irix). (3) Both the system
`make' and GNU `make'. (4) With and without GNU binutils.


<h4>Miscellaneous</h4>
<ul>

<li> Work on the way we build the library.  We now do it building
     convenience libraries but then listing all the object files a
     second time in the top level Makefile.am.
<li> Get rid of mp[zq]/sub.c, and instead define a compile parameter to
     mp[zq]/add.c to decide whether it will add or subtract.  Will decrease
     redundancy.  Similarly in other places.
<li> Make <code>mpz_div</code> and <code>mpz_divmod</code> use rounding
     analogous to <code>mpz_mod</code>.  Document, and list as an
     incompatibility.
<li> Maybe make mpz_pow_ui.c more like mpz/ui_pow_ui.c, or write new
     mpn/generic/pow_ui.
<li> Make mpz_invert call mpn_gcdext directly.
<li> Make a build option to enable execution profiling with gprof.  In
     particular look at getting the right <code>mcount</code> call at
     the start of each assembler subroutine (for important targets at
     least).
</ul>


<h4>Aids to Debugging</h4>
<ul>
<li> Make an option for stack-alloc.c to call <code>malloc</code>
     separately for each <code>TMP_ALLOC</code> block, so a redzoning
     malloc debugger could be used during development.
<li> Add <code>ASSERT</code>s at the start of each user-visible
     mpz/mpq/mpf function to check the validity of each
     <code>mp?_t</code> parameter, in particular to check they've been
     <code>mp?_init</code>ed.  This might catch elementary mistakes in
     user programs.  Care would need to be taken over
     <code>MPZ_TMP_INIT</code>ed variables used internally.
</ul>


<h4>Documentation</h4>
<ul>
<li> Document conventions, like that <code> unsigned long int</code> is used for
     bit counts/ranges, and that <code>mp_size_t</code> is used for limb counts.
<li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
</ul>

<hr>

<table width="100%">
  <tr>
    <td>
      <font size=2>
      Please send comments about this page to
      <a href="mailto:tege@swox.com">tege@swox.com</a>.<br>
      Copyright (C) 1999, 2000 Torbj<62>rn Granlund.
      </font>
    </td>
    <td align=right>
    </td>
  </tr>
</table>

</body>
</html>