November 26, 2007 7:05 PM

I am currently working on a by-the-book Ada to C++ conversion of legacy code and am having a bit of trouble coming up with a C++ solution for a particular Ada idiom.

The Ada construct specifies floats with various levels of precision and then uses these types in bit packed fields:

    Delta15 : constant := (1.0 / 2**15);

    type Data16 is delta Delta15 range -1.0 .. (1.0 - Delta15);
    for Data16'Size use 16;

    ...

    type X is
       record
          Y  : Data16;
          Y2 : Data16;
       end record;

My question is this: how do I represent this in C++? Doubles and floats are 64 bits by default. Essentially I need to create a 16-bit fixed-point double. Any ideas are welcome. Thanks for your help.
posted by caflores22 to Computers & Internet

I imagine that since the original project was written in Ada, the precision of the numbers is very important.

Floating point in C(++) just compiles down to the floating point implementation of whatever processor you are compiling for. If you want the exact Ada behavior you will need to create your own class and overload the mathematical operators to work the same way as the Ada runtime.

This seems easy, but I imagine that the devil is in the details of reimplementing the exact semantics of the Ada type system.

posted by AndrewStephens at 7:22 PM on November 26, 2007

This kind of number is called "fixed-point", as opposed to "floating-point". In C and C++, fixed-point arithmetic must be explicitly coded as functions or classes; it's not built into the language.

There are many web pages and presumably books on fixed-point math. One I found first was Dr. Dobb's.

The basics are simple: addition and subtraction are the same as for integers (unless a special provision for overflow is required, such as saturation). Multiplication is integer multiplication into a wider integer type followed by a right shift by the number of fractional bits. Division is a left shift of the dividend by the number of fractional bits (again in a wider type) followed by an integer division. Like addition and subtraction, multiplication and division can also overflow the representable values (and there's division by zero too).
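A minimal sketch of the multiply and divide rules, assuming the 15-fractional-bit format from the question stored in a std::int16_t. The names kFracBits, q15_mul and q15_div are made up for illustration, and neither function checks for overflow or division by zero:

```cpp
#include <cstdint>

// Q15 format: the raw integer r stands for the value r / 32768.
constexpr int kFracBits = 15;  // hypothetical name, not from the thread

// Multiply: widen to 32 bits, multiply (the product has 30 fractional bits),
// then shift right by 15 to get back to 15 fractional bits.
// Right-shifting a negative value is implementation-defined before C++20;
// mainstream compilers use an arithmetic shift.
constexpr std::int16_t q15_mul(std::int16_t a, std::int16_t b) {
    return static_cast<std::int16_t>(
        (static_cast<std::int32_t>(a) * b) >> kFracBits);
}

// Divide: pre-scale the dividend by 2^15 in 32 bits, then integer-divide.
// Multiplying by (1 << 15) avoids left-shifting a negative number.
constexpr std::int16_t q15_div(std::int16_t a, std::int16_t b) {
    return static_cast<std::int16_t>(
        (static_cast<std::int32_t>(a) * (1 << kFracBits)) / b);
}
```

With raw 16384 standing for 0.5 and raw 8192 for 0.25, q15_mul(16384, 16384) gives 8192 and q15_div(8192, 16384) gives 16384.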

In C++ it would seem natural to make both the number of fractional places and the underlying integral type template parameters. I don't know if C++ has a facility for specifying "the integral type with at least M more bits than this type" or "the integral type with at least twice as many bits as this type", both of which are things you might want when working with fixed-point maths.
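As far as I know the standard library has no "type with twice as many bits" facility, but a small hand-written trait can stand in for one. This is only a sketch: Wider and Fixed are hypothetical names, only three integer widths are covered, and there is no overflow handling:

```cpp
#include <cstdint>

// Hypothetical trait: maps a signed integer type to one with twice the bits.
template <typename T> struct Wider;
template <> struct Wider<std::int8_t>  { using type = std::int16_t; };
template <> struct Wider<std::int16_t> { using type = std::int32_t; };
template <> struct Wider<std::int32_t> { using type = std::int64_t; };

// Fixed-point value with FracBits fractional bits stored in a Raw integer.
template <typename Raw, int FracBits>
struct Fixed {
    Raw raw;  // represented value is raw / 2^FracBits

    friend constexpr Fixed operator+(Fixed a, Fixed b) {
        return Fixed{static_cast<Raw>(a.raw + b.raw)};
    }
    friend constexpr Fixed operator*(Fixed a, Fixed b) {
        using W = typename Wider<Raw>::type;  // room for the full product
        return Fixed{static_cast<Raw>(
            (static_cast<W>(a.raw) * b.raw) >> FracBits)};
    }
    friend constexpr bool operator==(Fixed a, Fixed b) {
        return a.raw == b.raw;
    }
};

// The type from the question: 16 bits total, 15 of them fractional.
using Data16 = Fixed<std::int16_t, 15>;
```

Data16{16384} * Data16{16384} then compares equal to Data16{8192}, i.e. 0.5 * 0.5 == 0.25, and sizeof(Data16) stays at 2 bytes because the struct holds nothing but the raw integer.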

posted by jepler at 8:19 PM on November 26, 2007

As jepler noted, that is a fixed-point number, not a float, i.e. it contains only mantissa and no exponent. The reason they did this is almost certainly performance; when the program was written, floating point units were likely horridly slow whereas now they are not. If performance was the only reason and the application is to run on modern PC-class hardware, feel free to use a float since that will have 23 bits of mantissa whereas the original implementation had 16.

If they did fixed-point like that for repeatability reasons, you want to continue using fixed point.

In most C++ compilers, a "short" will be 16 bits, so you want to use that. A signed short gives you [-32768, 32767], which you will note is exactly 32768 times larger than [-1.0, 1.0-2^{-15}], therefore your scale factor between the fixed-point format and float is 32768; you'll need that for initialisation and I/O purposes.
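For the initialisation and I/O side, here is a rough sketch of the conversions using that scale factor of 32768. The names kScale, to_q15 and from_q15 are made up, std::int16_t is used instead of short so the width is guaranteed, and the choice to round to nearest and clamp out-of-range input is my assumption, not something the Ada code specifies:

```cpp
#include <cstdint>

constexpr double kScale = 32768.0;  // 2^15, hypothetical name

// Real value -> raw 16-bit representation.
// Rounds to nearest and clamps to [-1.0, 1.0 - 2^-15]; NaN is not handled.
constexpr std::int16_t to_q15(double x) {
    return static_cast<std::int16_t>(
        x >= 32767.0 / kScale ? 32767.0
      : x <= -1.0             ? -32768.0
      : x * kScale + (x >= 0 ? 0.5 : -0.5));
}

// Raw 16-bit representation -> real value (always exact in a double).
constexpr double from_q15(std::int16_t raw) {
    return raw / kScale;
}
```

So to_q15(0.5) is 16384, to_q15(1.0) clamps to 32767, and from_q15(-32768) is exactly -1.0.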

Addition and subtraction are just integer arithmetic; you don't have to do anything special except maybe check for overflows. Multiplication is integer multiplication into a 32-bit value (16 bits * 16 bits gives 32 bits), then a right shift by 15 bits (the number of fractional bits) to arrive at the correct scaling.
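If you do want the overflow check, one option is to saturate. The sketch below assumes 15 fractional bits in a std::int16_t: the product of two such values carries 30 fractional bits, so it is shifted right by 15. The names clamp_q15, q15_add_sat and q15_mul_sat are hypothetical, and saturation is a guess at a reasonable policy; whether it matches what the Ada program does on overflow needs checking against the original:

```cpp
#include <cstdint>

// Clamp a 32-bit intermediate result into the 16-bit range.
constexpr std::int16_t clamp_q15(std::int32_t v) {
    return static_cast<std::int16_t>(
        v > 32767 ? 32767 : v < -32768 ? -32768 : v);
}

// Saturating add: sum in 32 bits so the intermediate cannot overflow.
constexpr std::int16_t q15_add_sat(std::int16_t a, std::int16_t b) {
    return clamp_q15(static_cast<std::int32_t>(a) + b);
}

// Saturating multiply: the only case that overflows is -1.0 * -1.0,
// whose true result (+1.0) is one step above the largest value.
constexpr std::int16_t q15_mul_sat(std::int16_t a, std::int16_t b) {
    return clamp_q15((static_cast<std::int32_t>(a) * b) >> 15);
}
```

Here q15_add_sat(30000, 10000) pins at 32767 instead of wrapping negative, and q15_mul_sat(-32768, -32768) also yields 32767.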

You might want to wrap it in a class and provide all the operators including conversions and arithmetic (inline!), at least that's what I'd do. Google has more to say.

posted by polyglot at 2:41 AM on November 27, 2007

There are two reasonable approaches to this problem: a semantic approach and a pragmatic approach.

The semantic approach would be to assume that 16-bit fixed-point arithmetic matters because of both performance and size constraints, and so, to be true to the Ada, you would define a Fixed16 class which has all the appropriate arithmetic operators. This is reasonable and an interesting exercise if you like figuring out fixed point math.

The pragmatic approach is to define a numeric class and use that instead. You can make the actual representation be 16 bit fixed if you like, but do most intermediate arithmetic in float or double units converting back and forth. This will add an expense, but has the added benefit of later optimization by switching to the approach above. The reason to go this way is that it gets you going faster with fewer bugs at the cost of performance.
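A rough sketch of what the pragmatic version might look like: 16-bit storage, with every operation done in double and converted back. Number16 is a made-up name, and the rounding and clamping in encode are my assumptions rather than the Ada semantics. Division by zero is not handled:

```cpp
#include <cstdint>

class Number16 {
    // Real value -> raw 16 bits: round to nearest, clamp to the legal range.
    static constexpr std::int16_t encode(double v) {
        return static_cast<std::int16_t>(
            v >= 32767.0 / 32768.0 ? 32767.0
          : v <= -1.0              ? -32768.0
          : v * 32768.0 + (v >= 0 ? 0.5 : -0.5));
    }

public:
    constexpr explicit Number16(double v) : raw_(encode(v)) {}

    constexpr double value() const { return raw_ / 32768.0; }
    constexpr std::int16_t raw() const { return raw_; }

    // All arithmetic goes through double, then back to 16 bits.
    friend constexpr Number16 operator+(Number16 a, Number16 b) {
        return Number16(a.value() + b.value());
    }
    friend constexpr Number16 operator-(Number16 a, Number16 b) {
        return Number16(a.value() - b.value());
    }
    friend constexpr Number16 operator*(Number16 a, Number16 b) {
        return Number16(a.value() * b.value());
    }
    friend constexpr Number16 operator/(Number16 a, Number16 b) {
        return Number16(a.value() / b.value());
    }

private:
    std::int16_t raw_;  // the only data member, so the object stays 16 bits
};
```

Since callers only see the operators, the bodies could later be swapped for pure integer arithmetic without touching the calling code, which is the "later optimization" mentioned above. Results may differ in the last bit from a pure integer implementation because of the rounding step.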

posted by plinth at 5:57 AM on November 27, 2007

Side-note: beware of assuming that "short", "int", "long" etc. have particular lengths! Instead use stdint.h and its types int16_t, uint16_t, int32_t...

Especially when working with older hardware and in situations (such as this one) where storage size and bit position actually matter, this is a must.

(To solve your problem, you're best off reserving 16 bits of space and using your own functions to solve the problem)
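For the record from the question, a sketch of what "reserving 16 bits" could look like with the fixed-width types (in C++ the header is <cstdint>). The field names mirror the Ada record; the compile-time checks are there because the standard alone does not promise this layout, and byte order is a separate issue this does not address:

```cpp
#include <cstddef>
#include <cstdint>

// Counterpart of the Ada record X: two 16-bit fixed-point fields,
// each holding a raw value r that stands for r / 32768.
struct X {
    std::int16_t y;
    std::int16_t y2;
};

// Fail the build if the compiler inserts padding or reorders anything.
static_assert(sizeof(X) == 4, "X should occupy exactly 32 bits");
static_assert(offsetof(X, y) == 0, "y should come first");
static_assert(offsetof(X, y2) == 2, "y2 should follow y directly");
```

If the legacy data is exchanged with other systems or read from files, the byte order of each 16-bit field still has to match what the Ada program wrote.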

posted by goingonit at 7:36 AM on November 28, 2007

This thread is closed to new comments.

But you can create a fixed_point16 class (or FixedPoint16 if you prefer) and make it function just like a built-in.

btw, "float" is 32 bit, "double" is 64.

posted by aubilenon at 7:09 PM on November 26, 2007