Dr. Rx


Base64 MIME Encoding and Modularity

Hello, gentle readers,

First, a bit of protocol.  A doctor-patient relationship is privileged, so I won't disclose the name of anyone who wrote me this month.  However, I think in the future I will disclose names of those of you who write, unless you ask me not to.  So consider yourselves warned.

I got a letter this month from a guy who wanted to decipher some email which had arrived Base64-encoded in his MVS-based email system.  He sent along a Rexx (actually, a link to a Rexx) which claimed to decode a Base64-encoded file.  Unfortunately, he said, it was not working for him. He asked me to have a look.  So I downloaded the Rexx.

What a piece of spaghetti!  It was a Rexx originally written for OS/2, since enhanced to handle MVS and VM I/O, and it was full of branching based on the host opsys.

My first impression was that it would be easier to write another than to decipher what we had.  I wished that the I/O -- the troublesome part -- had been isolated from the code that did the actual decoding; then it would have been easy to examine the decoding mechanism, which I figured would be platform-independent.

Which brings me to the second part of my topic.  In my dreams, the main routine of the decoding Rexx would have been:

   arg Fid
   call ReadFile Fid
   call Decode
   call WriteFile Fid
   exit

- which brings me to the topic of a future column:  Rexx style.  Should varnames be capitalized?  May they have embedded caps?  What indenting rules should obtain?  But let's leave that for later (if you feel strongly about any of this, email me).  Back to Base64.

And Modularity.  When I first started programming (FORTRAN, 1963), I started at the beginning of the program and wrote till I came to the end.  Page after page.  No one told me different.  The result was a pile of spaghetti worse than this MIME decoder, which would have been nearly impossible to debug.  FORTRAN was made for creating such programs.

Since then, the design of languages has improved, and we are becoming more aware that small bits are easier to puzzle out than big bits.  I used to resent the "wasted" time the computer spent making function calls and returning when the code could just as well have been inline.  Now I readily pay that price for the benefit in decipherability.  And this decoder routine would benefit enormously from some modularization.

I thought I'd just whip off a Decode function of my own, demonstrate to myself that it worked, and then package it with a generic file reader and writer.
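
What would such a reader look like?  Something on this order, I imagine, with all the opsys-dependence quarantined in one routine that fills a stem (Line., here).  A sketch only: the EXECIO and ALLOC incantations, and the names parse source reports, are from memory, so check them against your own system before trusting them.

   ReadFile: procedure expose Line.
      parse arg Fid
      parse source OpSys .                  /* 'TSO', 'CMS', 'OS/2', ... */
      select
         when OpSys = 'TSO' then do         /* MVS:  read via a DD name  */
            "ALLOC FI(INDD) DA('"Fid"') SHR REUSE"
            "EXECIO * DISKR INDD (STEM LINE. FINIS"
            "FREE FI(INDD)"
         end
         when OpSys = 'CMS' then            /* VM:  Fid is 'fn ft fm'    */
            'EXECIO * DISKR' Fid '(STEM LINE. FINIS'
         otherwise do                       /* stream I/O anywhere else  */
            do i = 1 by 1 while lines(Fid) > 0
               Line.i = linein(Fid)
            end
            Line.0 = i - 1
         end
      end
      return

WriteFile would be its mirror image, and Decode need never touch a file at all.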

The algorithm for decoding a Base64-encoded message seemed simple enough.  Each character in the encoded message corresponded to a number from 0 to 63: its position, counting from zero, in the string "A...Za...z0...9+/".  The 6-bit binary values of four consecutive characters were to be concatenated into a 24-bit string, which was then to be broken apart into three 8-bit binary strings.  Those three octets represented the next three characters in the decoded message.
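
To make that concrete, here is a bare sketch of the scheme in Rexx.  (This is my own illustration of the algorithm, written for the column -- not the routine whose troubles are chronicled below -- and Decode64 is just a name I made up.  It skips the "=" padding and anything else not in the table, which disposes of CRLFs as well.)

   Decode64: procedure
      parse arg Coded
      Tab = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' ||,
            '0123456789+/'
      Bits = ''
      Plain = ''
      do i = 1 to length(Coded)            /* each character -> 6 bits   */
         p = pos(substr(Coded, i, 1), Tab)
         if p = 0 then iterate             /* skip '=' pad, blanks, CRLF */
         Bits = Bits || right(x2b(d2x(p - 1)), 6, '0')
      end
      do while length(Bits) >= 8           /* every 8 bits -> one octet  */
         parse var Bits Byte +8 Bits
         Plain = Plain || x2c(b2x(Byte))
      end
      return Plain

What comes back is octets.  Whether those octets read as sensible text is another matter, as we shall see.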

No problem.  Well, maybe a bit of a problem.  Whew, not all that easy after all.  My dreams of a simple 3 or 4-line function vanished in a maze of twisty little passages, all different.

Well, maybe not quite that bad.  A couple dozen lines.  I thought of printing mine here in full, but I think I'll wait a bit, because...

It didn't work.  I didn't get the same gibberish my correspondent had reported, but I got gibberish.  That started me thinking.

What character set had the message been encoded from?  I was decoding it into ASCII.  Could that be causing the problem?  At first glance, it doesn't seem so.

RFC 1341 talks about how portable the algorithm is because the characters into which a message is encoded are represented "identically in all versions of ISO 646, including US ASCII, and ... also identically in all versions of EBCDIC."  But the warm glow that information bestows turns chilly as I befuddle myself with the ASCII-EBCDIC nightmare.

The portability of the encoding set guarantees that a particular binary octet, once encoded, will decode back to the same octet, regardless of platform.  But depending on the platform, that octet might represent a different character than it did where it was encoded.  An "A" encoded on an ASCII system comes back as the octet X'41', which on an EBCDIC machine is not an "A" at all (there, "A" is X'C1').
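
You can watch the two worlds disagree from Rexx itself, with nothing but the standard c2x built-in:

   say c2x('A')     /* '41' on an ASCII system; 'C1' under EBCDIC */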

The stuff looked no better when I pretended it was EBCDIC.

I found and downloaded a shareware gadget which purports to decode Base64-encoded files--so I could see what the source was supposed to look like.  This program insists that the file is not encoded.

Maybe there was something wrong with the encoding routine.

The bottom line:  Deadline calls.  I have no answer for my patient this month.  As it turns out, he has spontaneously recovered (his pen pal stopped encoding the messages), so the urgency is off.  But the problems are real, nonetheless, and my curiosity is piqued.  I'll try to have a brief answer next month--as well as responses to some of your correspondence (hint, hint).

Until next time,
Dr. Rx  <Dr_Rx@Hotmail.com>