The REAL signature
The REAL signature specifies structures that implement floating-point numbers. The semantics of floating-point numbers should follow the IEEE standard 754-1985 and the ANSI/IEEE standard 854-1987. In addition, implementations of the REAL signature are required to use non-trapping semantics. Additional aspects of the design of the REAL and MATH signatures were guided by the Floating-Point C Extensions developed by the X3J11 ANSI committee and the lecture notes by W. Kahan on the IEEE standard 754.
The relation between the comparison predicates defined here and those defined by IEEE, ANSI C and FORTRAN is specified in the following table.
SML | IEEE | C | FORTRAN |
---|---|---|---|
== | = | == | .EQ. |
!= | ?<> | != | .NE. |
< | < | < | .LT. |
<= | <= | <= | .LE. |
> | > | > | .GT. |
>= | >= | >= | .GE. |
?= | ?= | !islessgreater | .UE. |
not o ?= | <> | islessgreater | .LG. |
unordered | ? | isunordered | unordered |
not o unordered | <=> | !isunordered | .LEG. |
not o op < | ?>= | ! < | .UGE. |
not o op <= | ?> | ! <= | .UG. |
not o op > | ?<= | ! > | .ULE. |
not o op >= | ?< | ! >= | .UL. |
In the functions below, unless specified otherwise, if any argument is a NaN, the return value is a NaN. In a list of rules specifying the behavior of a function in special cases, the first matching rule defines the semantics.
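As a minimal sketch of how the predicates in the table behave on a NaN operand, the following can be evaluated at the top level. The binding names are illustrative only, and the sketch assumes the default Real structure matches the REAL signature described here.

```sml
(* Illustrative sketch: comparison predicates on a NaN operand.
   0.0 / 0.0 yields a NaN under non-trapping semantics. *)
val nan = 0.0 / 0.0
val one = 1.0

val eqSelf  = Real.== (nan, nan)          (* false: IEEE "=" fails on NaN *)
val neSelf  = Real.!= (nan, nan)          (* true:  IEEE "?<>" *)
val less    = Real.< (nan, one)           (* false: unordered arguments *)
val notGeq  = not (Real.>= (nan, one))    (* true:  IEEE "?<" *)
val unordEq = Real.?= (nan, one)          (* true:  unordered or equal *)
val isUnord = Real.unordered (nan, one)   (* true *)
```

Note that less and notGeq disagree, which is exactly the distinction the last four rows of the table capture.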
Rationale:
The specification of the default signature and structure for non-integer arithmetic, particularly concerning exceptional conditions, was the source of much debate, given the desire to allow implementations to provide efficient floating-point modules. Permitting implementations to differ on whether or not, for example, to raise Div on division by zero meant that the user really did not have a standard to program against. Portable code would require adopting the more conservative position of explicitly handling exceptions. A second alternative was to specify that functions in the Real structure must raise exceptions, but that implementations so desiring could provide additional structures matching REAL with explicit floating-point semantics. This was rejected because it meant that the default real type would not be the same as a defined floating-point real type. This conferred second-class status on the latter, while providing a default real of lesser performance and involving additional implementation complexity for little benefit.
Deciding if real should be an eqtype, and if so, what equality should mean, was also problematic. IEEE specifies that the sign of zeros be ignored in comparisons, and that equality evaluate to false if either argument is a NaN. These constraints are disturbing to the SML programmer. The former implies that 0 = ~0 is true while r/0 = r/~0 is false. The latter implies such anomalies as r = r is false, or that, for a ref cell rr, we could have rr = rr but not have !rr = !rr. We accepted the unsigned comparison of zeros, but felt that the reflexive property of equality, structural equality, and the desire that <> be equivalent to not o = ought to be preserved. Additional complications led to the decision to not have real be an eqtype.
The type, signature and structure identifiers real, REAL and Real, although misnomers in light of the floating-point-specific nature of the modules, were retained for historical reasons.
signature REAL
structure Real : REAL
structure LargeReal : REAL
structure Real{N} : REAL
type real
structure Math : MATH
val radix : int
val precision : int
val maxFinite : real
val minPos : real
val minNormalPos : real
val posInf : real
val negInf : real
val + : (real * real) -> real
val - : (real * real) -> real
val * : (real * real) -> real
val / : (real * real) -> real
val *+ : real * real * real -> real
val *- : real * real * real -> real
val ~ : real -> real
val abs : real -> real
val min : (real * real) -> real
val max : (real * real) -> real
val sign : real -> int
val signBit : real -> bool
val sameSign : (real * real) -> bool
val copySign : (real * real) -> real
val compare : (real * real) -> order
val compareReal : (real * real) -> IEEEReal.real_order
val < : (real * real) -> bool
val <= : (real * real) -> bool
val > : (real * real) -> bool
val >= : (real * real) -> bool
val == : (real * real) -> bool
val != : (real * real) -> bool
val ?= : (real * real) -> bool
val unordered : (real * real) -> bool
val isFinite : real -> bool
val isNan : real -> bool
val isNormal : real -> bool
val class : real -> IEEEReal.float_class
val fmt : StringCvt.realfmt -> real -> string
val toString : real -> string
val fromString : string -> real option
val scan : (char, 'a) StringCvt.reader -> (real, 'a) StringCvt.reader
val toManExp : real -> {man : real, exp : int}
val fromManExp : {man : real, exp : int} -> real
val split : real -> {whole : real, frac : real}
val realMod : real -> real
val rem : (real * real) -> real
val nextAfter : (real * real) -> real
val checkFloat : real -> real
val realFloor : real -> real
val realCeil : real -> real
val realTrunc : real -> real
val floor : real -> Int.int
val ceil : real -> Int.int
val trunc : real -> Int.int
val round : real -> Int.int
val toInt : IEEEReal.rounding_mode -> real -> int
val toLargeInt : IEEEReal.rounding_mode -> real -> LargeInt.int
val fromInt : int -> real
val fromLargeInt : LargeInt.int -> real
val toLarge : real -> LargeReal.real
val fromLarge : IEEEReal.rounding_mode -> LargeReal.real -> real
val toDecimal : real -> IEEEReal.decimal_approx
val fromDecimal : IEEEReal.decimal_approx -> real
type real
Note that, as discussed above, real is not an eqtype.
structure Math
radix
precision
The number of digits, each between 0 and radix-1, in the mantissa.
maxFinite
minPos
minNormalPos
posInf
negInf
r1 + r2
r1 - r2
r1 * r2
r1 / r2
For division, 0 / 0 = NaN and +-infinity / +-infinity = NaN. Dividing a finite, non-zero number by a zero, or an infinity by a finite number, produces an infinity with the correct sign. (Note that zeros are signed.) A finite number divided by an infinity is 0 with the correct sign.
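The special cases for division can be checked directly. This is a small sketch with illustrative binding names, assuming an implementation with the non-trapping semantics required above:

```sml
(* Illustrative sketch: non-trapping division never raises Div. *)
val posQuot  = 1.0 / 0.0            (* +infinity *)
val negQuot  = ~1.0 / 0.0           (* -infinity *)
val nanQuot  = 0.0 / 0.0            (* NaN *)
val zeroQuot = 1.0 / Real.posInf    (* +0 *)
val negZero  = ~1.0 / Real.posInf   (* -0: zeros are signed *)

(* Each entry is expected to be true. *)
val checks =
  [ Real.== (posQuot, Real.posInf),
    Real.== (negQuot, Real.negInf),
    Real.isNan nanQuot,
    (Real.== (zeroQuot, 0.0) andalso not (Real.signBit zeroQuot)),
    (Real.== (negZero, 0.0) andalso Real.signBit negZero) ]
```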
*+ (a, b, c)
*- (a, b, c)
These return a*b + c and a*b - c, respectively. Their behaviors on infinities follow from the behaviors derived from addition, subtraction and multiplication.
The precise semantics of these operations depend on the language implementation and the underlying hardware. Specifically, certain architectures provide these operations as a single instruction, possibly using a single rounding operation. Thus, the use of these operations may be faster than performing the individual arithmetic operations sequentially, but may also cause different rounding behavior.
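The possible difference in rounding can be sketched as follows. The example assumes a binary format with 53 bits of precision (IEEE double); the binding names are illustrative, and whether the two results actually differ depends on the implementation and hardware.

```sml
(* Illustrative sketch: fused vs. separate multiply-subtract.
   With a = 1 + 2^~30, the exact square is 1 + 2^~29 + 2^~60. *)
val a = 1.0 + Real.fromManExp {man = 1.0, exp = ~30}

(* The product is rounded to 53 bits first, dropping the 2^~60 term. *)
val separate = a * a - 1.0

(* A truly fused operation rounds once and may keep the 2^~60 term. *)
val fused = Real.*- (a, a, 1.0)

(* Either 0.0 or 2^~60, depending on the platform. *)
val gap = Real.abs (fused - separate)
```

Code that must produce identical results across platforms would therefore avoid these two operations.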
~ r
Returns the negation (- r) of r. ~ (+-infinity) = -+infinity.
abs r
abs (+-infinity) = infinity.
min (a, b)
max (a, b)
sign r
signBit r
Returns true if and only if the sign of r (infinities, zeros and NaNs included) is negative.
sameSign (r1, r2)
Returns true if and only if signBit r1 equals signBit r2.
copySign (x, y)
compare (r1, r2)
compareReal (r1, r2)
The function compareReal behaves similarly except it returns values of type IEEEReal.real_order and returns IEEEReal.UNORDERED on unordered arguments.
Implementation note:
Implementations should try to optimize use of Real.compare, since it is necessary for catching NaNs.
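A minimal sketch of NaN-aware ordering using compareReal follows. The helper name describeOrder is hypothetical and not part of the signature.

```sml
(* Hypothetical helper: classify a pair of reals, including the
   unordered case that arises when either argument is a NaN. *)
fun describeOrder (x : real, y : real) : string =
  case Real.compareReal (x, y) of
      IEEEReal.LESS      => "less"
    | IEEEReal.EQUAL     => "equal"
    | IEEEReal.GREATER   => "greater"
    | IEEEReal.UNORDERED => "unordered"

val ex1 = describeOrder (1.0, 2.0)         (* "less" *)
val ex2 = describeOrder (0.0, ~0.0)        (* "equal": zero signs are ignored *)
val ex3 = describeOrder (0.0 / 0.0, 1.0)   (* "unordered" *)
```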
r1 < r2
r1 <= r2
r1 > r2
r1 >= r2
Note that these operators return false on unordered arguments, i.e., if either argument is NaN, so that the usual reversal of comparison under negation does not hold, e.g., a < b is not the same as not (a >= b).
== (x, y)
!= (x, y)
The first function == is equivalent to the IEEE = operator.
The second function != is equivalent to not o op == and the IEEE ?<> operator.
?= (x, y)
This is equivalent to the IEEE ?= operator.
unordered (x, y)
isFinite x
isNan x
isNormal x
class x
fmt spec r
toString r
SCI arg
Scientific notation: [~]d.dddE[~]dd, where there is always one digit before the decimal point, nonzero if the number is nonzero. arg specifies the number of digits to appear after the decimal point, with 6 the default if arg is NONE.
FIX arg
Fixed-point notation: [~]ddd.ddd. arg specifies the number of digits to appear after the decimal point, with 6 the default if arg is NONE.
GEN arg
NONE
.
EXACT
In all cases, positive and negative infinities are converted to "inf" and "~inf", respectively. If spec is not EXACT, NaN values are returned as "nan"; otherwise, NaN values are converted to the form "nan(d(1)d(2)...d(n))".
fmt raises Size if spec is an invalid precision, i.e., if spec is
The value returned by toString is equivalent to:
(fmt (StringCvt.GEN NONE) r)
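A few conversions, as a sketch with illustrative binding names. No output is shown for the SCI case because the exact rendering of the exponent may vary between implementations.

```sml
(* Illustrative sketch of real-to-string conversion. *)
val fixed = Real.fmt (StringCvt.FIX (SOME 2)) 3.14159   (* "3.14" *)
val sci   = Real.fmt (StringCvt.SCI (SOME 3)) 31415.9   (* form [~]d.dddE[~]dd *)

val posInfStr = Real.toString Real.posInf   (* "inf" *)
val negInfStr = Real.toString Real.negInf   (* "~inf" *)

(* toString agrees with fmt (StringCvt.GEN NONE). *)
val sameAsGen = Real.toString 2.5 = Real.fmt (StringCvt.GEN NONE) 2.5
```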
fromString s
Returns SOME(r) if a real value can be scanned from a prefix of s, ignoring any initial whitespace; otherwise, returns NONE. Equivalent to StringCvt.scanString scan.
scan getc a
Returns SOME(r,rest), where r is the scanned real value and rest is the unused portion of the character source a. Raises Overflow if the value cannot be represented as a real value.
The format for valid string representations of reals is given by the regular expression
[+~-]?(([0-9]+(\.[0-9]+)?)|(\.[0-9]+))([eE][+~-]?[0-9]+)?
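The following sketch exercises the grammar above. The input strings and binding names are made up for illustration.

```sml
(* Illustrative sketch: scanning reals from strings. *)
val withJunk = Real.fromString "  3.5e2xyz"   (* SOME 350.0: whitespace skipped, "xyz" unused *)
val noWhole  = Real.fromString "~.25"         (* SOME ~0.25 *)
val notReal  = Real.fromString "abc"          (* NONE *)

(* fromString is equivalent to StringCvt.scanString scan. *)
val viaScan  = StringCvt.scanString Real.scan "1E3"   (* SOME 1000.0 *)
```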
toManExp r
Returns {man, exp}, where man and exp are the mantissa and exponent of r, respectively. Specifically, we have the relation r = man * radix^exp, where 1.0 <= man * radix < radix. This function is comparable to frexp in the C library.
If r is +-0, man is +-0 and exp is +0. If r is +-infinity, man is +-infinity and exp is unspecified. If r is NaN, man is NaN and exp is unspecified.
fromManExp {man, exp}
Returns man * radix^exp. This function is comparable to ldexp in the C library. Note that non-exceptional arguments can produce zero or infinities, essentially because of underflows and overflows.
If man is +-0, the result is +-0. If man is +-infinity, the result is +-infinity. If man is NaN, the result is NaN.
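A sketch of the round trip between a real and its mantissa/exponent pair follows. The binding names are illustrative; the overflow and underflow cases assume a format such as IEEE double, where an exponent of 2000 is out of range.

```sml
(* Illustrative sketch: decompose a real and rebuild it. *)
val r = 8.0
val {man, exp} = Real.toManExp r

(* The stated normalization: 1.0 <= man * radix < radix. *)
val radixR  = Real.fromInt Real.radix
val inRange = 1.0 <= man * radixR andalso man * radixR < radixR

val rebuilt    = Real.fromManExp {man = man, exp = exp}
val roundTrips = Real.== (rebuilt, r)

(* Non-exceptional arguments can overflow or underflow. *)
val overflowed  = Real.fromManExp {man = 1.0, exp = 2000}    (* infinity *)
val underflowed = Real.fromManExp {man = 1.0, exp = ~2000}   (* zero *)
```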
split r
realMod r
split returns {whole, frac}, where frac and whole are the fractional and integral parts of r, respectively. Specifically, whole is integral, |frac| < 1.0, whole and frac have the same sign as r, and r = whole + frac. This function is comparable to modf in the C library.
If r is +-infinity, whole is +-infinity and frac is +-0. If r is NaN, both whole and frac are NaN.
realMod is equivalent to #frac o split.
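For instance, as a sketch with illustrative binding names:

```sml
(* Illustrative sketch: integral and fractional parts. *)
val {whole, frac} = Real.split ~3.75     (* whole = ~3.0, frac = ~0.75 *)
val sumsBack = Real.== (whole + frac, ~3.75)

val fracOnly = Real.realMod 3.75         (* 0.75 *)

(* An infinite argument gives an infinite whole part and a zero fraction. *)
val {whole = infWhole, frac = infFrac} = Real.split Real.posInf
```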
rem (x, y)
Returns the remainder x - n*y, where n = trunc (x / y). The result has the same sign as x and has absolute value less than the absolute value of y.
If x is an infinity or y is 0, rem returns NaN. If y is an infinity, rem returns x.
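Two ordinary cases, as a sketch; the operands are chosen so that every intermediate value is exactly representable, and the binding names are illustrative.

```sml
(* Illustrative sketch: the remainder takes the sign of the first argument. *)
val posRem = Real.rem (7.5, 2.0)    (* 1.5:  7.5 - 3 * 2.0 *)
val negRem = Real.rem (~7.5, 2.0)   (* ~1.5: ~7.5 - (~3) * 2.0 *)

(* The same value computed by hand from the truncated quotient. *)
val byHand = 7.5 - Real.realTrunc (7.5 / 2.0) * 2.0   (* 1.5 *)
```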
nextAfter (r, t)
Returns the next representable real after r in the direction of t. If r = t, then it returns r. If r is +-infinity, it returns +-infinity. If either argument is a NaN, this returns NaN.
checkFloat x
This can be used to synthesize trapping arithmetic from the non-trapping operations given here. Note, however, that infinities can be converted to NaNs by some operations, so that if accurate exceptions are required, checks must be done after each operation.
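A sketch of such a trapping wrapper follows. It assumes that checkFloat raises Overflow when given an infinity and Div when given a NaN, and otherwise returns its argument; the name checkedDiv is hypothetical.

```sml
(* Hypothetical wrapper: division that raises instead of returning
   an infinity or a NaN. Assumes checkFloat raises Overflow on
   infinities and Div on NaNs. *)
fun checkedDiv (x : real, y : real) : real = Real.checkFloat (x / y)

(* Turn the exceptions into an option for demonstration purposes. *)
fun attempt (x, y) =
  SOME (checkedDiv (x, y)) handle Overflow => NONE | Div => NONE

val okResult  = attempt (1.0, 4.0)   (* SOME 0.25 *)
val infResult = attempt (1.0, 0.0)   (* NONE: the quotient is an infinity *)
val nanResult = attempt (0.0, 0.0)   (* NONE: the quotient is a NaN *)
```

The check is applied immediately after the single operation, in line with the caution above about infinities turning into NaNs later in a computation.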
realFloor r
realCeil r
realTrunc r
floor r
ceil r
trunc r
round r
These are respectively equivalent to:
toInt IEEEReal.TO_NEGINF r
toInt IEEEReal.TO_POSINF r
toInt IEEEReal.TO_ZERO r
toInt IEEEReal.TO_NEAREST r
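The four equivalences can be checked on a value that sits exactly halfway between two integers. This is a sketch with illustrative binding names.

```sml
(* Illustrative sketch: the rounding functions and their toInt forms. *)
val x = ~2.5

val agree =
  [ Real.floor x = Real.toInt IEEEReal.TO_NEGINF x,     (* both ~3 *)
    Real.ceil  x = Real.toInt IEEEReal.TO_POSINF x,     (* both ~2 *)
    Real.trunc x = Real.toInt IEEEReal.TO_ZERO x,       (* both ~2 *)
    Real.round x = Real.toInt IEEEReal.TO_NEAREST x ]   (* both ~2: ties go to even *)
```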
toInt mode x
toLargeInt mode x
fromInt i
fromLargeInt i
toLarge x
fromLarge mode x
toDecimal r
fromDecimal d
toDecimal should produce only as many digits as are necessary for fromDecimal to convert back to the same number, i.e., for any Normal or SubNormal real value r, we have:
fromDecimal (toDecimal r) = r.
For toDecimal, when the kind field is not NORMAL or SUBNORMAL, then exp = 0 and digits = [], except if kind is NAN, in which case the digits field provides a decimal representation of the fraction field of r.
For fromDecimal, if kind is ZERO or INF, the resulting real is the appropriate signed zero or infinity, with the digits and exp fields ignored. If kind is NAN, a signed NaN is generated, where the exp field is ignored and the digits field is used as the decimal representation of the fractional field. If the resulting fractional field has all zero bits, which corresponds to an infinity, fromDecimal raises the Domain exception. If digits is empty, an implementation-dependent NaN is produced. If kind is NORMAL or SUBNORMAL, the sign, digits and exp fields are used to produce a real value. Note that the conversion itself should ignore the kind field, so that the resulting value might have class NORMAL, SUBNORMAL or ZERO. In particular, if digits is empty or a list of all 0's, the result should be a signed zero.
Implementation note:
Algorithms for accurately and efficiently converting between binary and decimal real representations are readily available, e.g., see the technical report by [CITE]Gay/.
The sign of a zero is ignored in all comparisons.
Note that, if x is real, ~x is equivalent to ~(x), that is, it is identical to x but with its sign bit flipped. In particular, the literal ~0.0 is just 0.0 with its sign bit set. On the other hand, this might not be the same as 0.0-0.0, in which rounding modes come into play.
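These points about signed zeros can be illustrated with a short sketch. The binding names are illustrative, and the last line assumes the default round-to-nearest mode.

```sml
(* Illustrative sketch: signed zeros. *)
val negZero = ~0.0

val equalZeros = Real.== (0.0, negZero)       (* true: the sign of zero is ignored *)
val signSet    = Real.signBit negZero         (* true: the literal has its sign bit set *)
val flipped    = Real.signBit (~ negZero)     (* false: negation flips the sign bit *)
val copied     = Real.signBit (Real.copySign (3.0, negZero))   (* true *)

(* Under round-to-nearest, 0.0 - 0.0 is a positive zero. *)
val diffSign   = Real.signBit (0.0 - 0.0)     (* false *)
```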
Except for the *+ and *- functions, arithmetic should be done in the exact precision specified by the precision value. In particular, arithmetic must not be done in some extended precision and then rounded.
Implementation note:
Implementations may choose to provide a debugging mode, in which NaNs and Infs are detected when they are generated.
See also: MATH, IEEEReal, StringCvt
Last Modified October 5, 1997
Comments to John Reppy.
Copyright © 1997 Bell Labs, Lucent Technologies