CueCat Output Decode and Decipherment

Stephen Satchell, Thursday, September 21, 2000.

Copyright 2000 Stephen Satchell -- all rights reserved.  Anyone is free to link directly to this essay from any Web site.  Reproduction, but not for profit, is hereby granted as long as this Copyright notice remains attached.  Any commercial use of the information of this paper requires that appropriate publishing rights be secured from the author.


Table of Contents
  1. Goal of this paper
  2. Genesis: What CueCat Returns in the Raw
  3. Analyzing the Coding
  4. Cryptoanalysis, using known-plaintext attack
  5. Checking Our Crypto-Work
  6. The Final Two Code Positions
  7. Putting It All Together
  8. Conclusions
  9. So What's Next? (humor)

Goal

There have been claims that some people have gone to extreme lengths to determine the output scheme for the Digital Convergence CueCat.  This essay intends to show that the determination of the output of the CueCat is simple and straightforward, and requires a minimum of effort and cryptoanalysis.

The other purpose of this essay is to show that the intent of the encoding scheme is not to "protect" the output from being misused, but instead the output was designed to be placed into a URL with a minimum of post-processing on the part of the program that sends barcode output to a central site.


Genesis: What CueCat Returns in the Raw

The method used by the Cuecat barcode wand to return information from scanned barcodes is fairly simple.  First, the keyboard version of the wand emits standard IBM PC AT keyboard codes, which is interpreted by the keyboard controller in the IBM-compatible personal computer and the software driver in the BIOS or operating system (MS-DOS, Windows, OpenBSD, BeOS, whatever).  The keyboard codes generate keystrokes that correspond to keyboard characters.

Here are some examples of the output of the barcode wand:

alt+F10 .C3nZC3nZC3nZCNzYDNPYC3nX.CNf7.edCMmYSMlwmqiJCGkYyVlW. <enter>
alt+F10 .C3nZC3nZC3nZCNzYDNPYC3nX.aabI.F39-FN5+Fx19. <enter>
alt+F10 .C3nZC3nZC3nZCNzYDNPYC3nX.cGf2.ENr7CNz0DxjZD3rZDNzWENP6. <enter>
alt+F10 .C3nZC3nZC3nZCNzYDNPYC3nX.cGf2.ENr7CNz0DxjZD3v6ENzWENP6. <enter>
alt+F10 .C3nZC3nZC3nZCNzYDNPYC3nX.cGf2.ENr7CNz0DxjZD3v7CxzWENP6. <enter>

As is easy to inspect, the code consists of the "header", a period, a series of graphic characters, a second period, a much shorter series of graphic characters, a third period, a series of graphic characters, a fourth period, and an ENTER code.

First conclusion: the period characters delimit fields of data. So let's break the data into the obvious fields and see what we get:

C3nZC3nZC3nZCNzYDNPYC3nX CNf7 edCMmYSMlwmqiJCGkYyVlW
C3nZC3nZC3nZCNzYDNPYC3nX aabI F39-FN5+Fx192
C3nZC3nZC3nZCNzYDNPYC3nX cGf2 ENr7CNz0DxjZD3rZDNzWENP6
C3nZC3nZC3nZCNzYDNPYC3nX cGf2 ENr7CNz0DxjZD3v6ENzWENP6
C3nZC3nZC3nZCNzYDNPYC3nX cGf2 ENr7CNz0DxjZD3v7CxzWENP6

The first column of data is constant.  When different barcode wands are tried, though, this number changes, so it's obvious that this represents some sort of identifier.  The second and third fields change; the second field is a fixed-length field, while the third field varies. We don't show the barcodes that were scanned, but the first one is a base-128 representation of the author's name, the second is a "test" barcode in Q-fur format, and the third is the ISBN "BookLan" barcode of the author's book Linux IP Stacks Commentary.  The fourth barcode is particularly interesting in that it is the BookLan scan of a sister book in the Coriolis series, Linux Core Kernel Commendary -- interesting because the ISBN codes are exactly one digit apart.  Finally, the last barcode is of another companion book, Apache Server Commenary , which is one down from the IP Stacks number.  In other words, we have an interesting bracketing of numbers.

Analyzing the Coding

The following table shows the correspondence between the ISBN numbers and the scan codes in the third column:

Title ISBN Wand coding
Apache Server Commentary 978157610468253999 ENr7CNz0DxjZD3v7CxzWENP6
Linux Core Kernel Commentary 978157610469953999 ENr7CNz0DxjZD3v6ENzWENP6
Linux IP Stacks Commentary 978157610470553999 ENr7CNz0DxjZD3rZDNzWENP6

Now, let's eliminate all the "common" information and see what changes from code to code:

Title ISBN delta Wand coding delta
Apache Server 682 v7Cx
Linux Core Kernel 699 v6EN
Linux IP Stacks 705 rZDN

Let's look at the bit patterns.  We'll take the bits of the expected data and group them into six bits, the standard for base-64 conversion.  Here is what we get:

682
001101 100011 100000 110010
13 35 32 50
v 7 C x

699
001101 100011 100100 111001
13 35 36 57
v 6 E N

705
001101 110011 000000 110101
13 51 0 53
r Z D N

Cryptoanalysis, using known-plaintext attack

OK, so we have come this far, and can't establish a one-to-one correspondence with code positions and characters in the wand report.  Now it's time to do some more experimentation.  To do this, we need a way of generating our own barcodes.  There are plenty of barcode generators out there; the GNU Barcode system for Linux works pretty well, and there are demo systems that run under windows.  Using any one of those, we now generate a Code 39 (3-of-9) barcode in which we code a string of zeros, and scan it to obtain this result:

.C3nZC3nZC3nZCNzYDNPYC3nX.ahb6.C3nZC3nZC3nZC3m.

One thing pops out very quickly:  the pattern "C3nZ" repeated three times, corresponding to the three sets of three zeros in our test barcode.  We also see this in the first field, which suggests that the field also contains zeros in it -- for this particular wand, there appears to be at least nine leading zeros.  So let's start looking at bit patterns again:

000
00110000 00110000 00110000
001100 000011 000000 110000
12 3 0 48
C 3 n Z

Whatever scrambling is being done, it has to be done after the conversion from base-64 coding to 8-bit data.  Whatever code is being used doesn't use any form of cypher streaming, because identical input appears to generate identical output.

The generation of base-64 code sets usually follows a preset pattern.  For example, the classic uuencode(1) program used the decode table

`!"#$%&'()*+,-./
0123456789:;<=>?
@ABCDEFGHIJKLMNO
PQRSTUVWXYZ[\]^_

Which is the ASCII collating sequence (except that the space is replaced by the backquote).  Most variants of the base-64 scheme tend to use regular sequences for their translate tables.  The most common alternative for base-64 make use of the upper-case letters, lower-case letters, and two special characters to make up the alphabet.  So let's assume that the coding follows this pattern, so we have an initial table that looks like this:

ABCDEFGHIJKLMNOP
QRSTUVWXYZabcdef
ghijklmnopqrstuv
wxyz0123456789??

Now let's see what we get when we take our "zero sequence" and apply it to our data.

C 3 n Z
2 55 39 25
000010 110111 100111 010101
00001011 01111001 11010101

Nope, that didn't work; the three resulting characters are not the same.  Let's reverse the capital and lower-case letters and try again.

abcdefghijklmnop
qrstuvwxyzABCDEF
GHIJKLMNOPQRSTUV
WXYZ0123456789??

C 3 n Z
28 55 13 51
011100 110111 001101 110011
01110011 01110011 01110011

Eureka! We have a constant pattern.  When we subtract out the desired value of the character '0', or 0011000, we get a delta pattern of 01000011 or the ASCII character 'C'.  Now the question is, how is this delta applied? Best guess is that we use an exclusive-OR operation on each character, because that way there is no end-around wrap problem.

Checking Our Crypto-Work

Now that it's established that we have a base-64 code, we need to recalculate the deltas so that three source digits correspond to four code digits, and that we maintain block order.  So the revised delta table looks like this.

Title ISBN delta Wand coding delta
Apache Server 468253 D3v7CxzW
Linux Core Kernel 469953 D3v6ENzW
Linux IP Stacks 470553 D3rZDNzW

Given our table of a..zA..Z0..9??, let's see if we can see the data come out correctly:

468253
00110100 00110110 00111000 00110010 00110101 00110011
01000011 01000011 01000011 01000011 01000011 01000011 pattern
01110111 01110101 01111011 01110001 01110110 01110000 XOR
011101 110111 010101 111011 011100 010111 011001 110000
29 55 21 59 28 23 25 48
D 3 v 7 C x z W

And we have a match.  I leave the other two ISBN codes to you, the student, as an exercise.

The Final Two Code Positions

There seems to be some confusion about positions 62 and 63 of the base-64 conversion table.  Comparisons of output from multiple units shows that there are differences from unit to unit.  So far, it appears that position 62 is represented by a plus sign (+) or a comma (,) depending on the version of wand you have.  Position 63 is most frequently represented by a minus sign (-), but there has been a report that position 63 may be represented by a forward-slash (/).

The best way to find out what's going on is to open this test document (PDF), print it on a 600-dpi laser printer, and (using the driver of your choice) swipe the barcode.  See for yourself what characters come back -- within the code report you will see the representation for both positions 62 and 63.  If you see something other that what's described on the page, let me know and I'll update this information.

Putting It All Together

The proof of the concept is to code up a decoder and try it.  So here is the source of a C-language filter that implements all the stuff above.  It has been tested under Linux and in the Windows environment using the Borland C++ compiler. Here are the results of the five scans we documented at the beginning of this paper:

.C3nZC3nZC3nZCNzYDNPYC3nX.CNf7.edCMmYSMlwmqiJCGkYyVlW.
000000000151591002  128  Stephen\x20Satchell\x00

This barcode was created using the B-Coder program.  It is a Code-128 encoding of the author's name.  The trailing null is an artifact of the program that signifies that the "extra" data at the end of the barcode consisted of nulls, and therefore is of no consequence.

.C3nZC3nZC3nZCNzYDNPYC3nX.aabI.F39-FN5+Fx19.
000000000151591002  CC!  <<<===>>>

This particular barcode was created using the Azalea Software package, which is a "Cue" with the special encoding seen here.  The purpose of the code is to investigate the 62/63 problem described above.  This encoding is guaranteed to exercise that portion of the base-64 table.

.C3nZC3nZC3nZCNzYDNPYC3nX.cGf2.ENr7CNz0DxjZD3rZDNzWENP6.
000000000151591002  IB5  978157610470553999

.C3nZC3nZC3nZCNzYDNPYC3nX.cGf2.ENr7CNz0DxjZD3v6ENzWENP6.
000000000151591002  IB5  978157610469953999

.C3nZC3nZC3nZCNzYDNPYC3nX.cGf2.ENr7CNz0DxjZD3v7CxzWENP6.
000000000151591002  IB5  978157610468253999

These three barcode scans are the ISBN numbers from the three Coriolis books.  Comparing the output of the barcode wand with the source material, we see that the code itself is correct.

In the process, we learned that the particular barcode wand used to decode output has a serial number of 151,591,002.  Because I doubt that 151 million of these things have been shipped, I supposed that the last three digits may indicate a code revision level within the barcode wand itself.  As we see additional samples of barcode wands, we may or may not be able to justify this speculation.

The second field reveals itself to identify the particular symbology detected by the barcode wand when scanning the field. The capabilities page shows all the symbology fields that have been confirmed by the author; others have reported additional report values as well that have not been confirmed and added to the list.

Scanning several dozen different barcodes with human-readable values, including proprietary cues, shows that the code developed to the guidelines described here works fine.  This completes the analysis of the output of the CueCat.

Conclusions

The choice of character code set used with the CueCat is obvious on its face when you compare the codeset with the character set allowed in HTML forms reply strings.  First, let's review how HTML forms work.

When a forms-capable browser is asked to submit form data to the HTTP server, it does it in one of two ways.  Using the GET method, the form data becomes part of the URL itself.  Using the POST method, the data is placed in a datastream.  In either case, the data is in a very specific format.  In general, the data itself is defined in BNF as follows:

<stream>  ::= <element>
            | <stream> & <element>
<element> ::= <name> = <value>

Within each and , spaces are converted to plus signs, and any occurrence of an ampersand, equal sign, percent sign, or plus sign must be escapes using the form "%XX" where the "X"s are hexidecimal values.

When building a GET method URL, there are a couple of additional considerations.  For example, it's a good idea to avoid the slash, blackslash, and colon characters because the presence of these characters in astonishing (to software) positions can screw up HTTP proxy servers.  Avoiding the ampersand and equal characters eliminate the need to post-process the CatCue output to generate escape sequences.  The plus sign is problematical, because an unadorned plus sign is supposed to be interpreted as a space code...but the CGI script or program can handle that one fairly easily.  The comma character can be problematic for the server programmer when CGI libraries are used that impose a meaning on comma.

The changes in the codes in positions 62 and 63 reflect obvious corrections to fix problems with HTML transmission. Originally the codes used in this position were comma and slash.  In other CueCats, these positions are now represented by plus and minus

In other words, the use of a base-64 strategy was to ease the programming of the client support software.

The reason for the XOR dodge is considerably more opaque. Perhaps it was a misguided attempt at trying to bring the output of the CueCat under the Digital Millenium Copyright Act, but such an attempt fails on its face because the information being "protected" does not fall under the classification of information that can be copyrighted.  The idea that a denial-of-service attack can be averted is doomed to fail because there is nothing that prevents, for example, a replay attack.

From a theft-of-service perspective, the encryption doesn't make sense.  The barcodes are unique to the CueCat, being a non-standard configuration of Code 128 symbology, so in theory no other scanner on the planet can read the damn things, let alone make sense of them.  Yes, the encoding scheme using XOR would prevent a casual third party from submitting codes for steering.

In short, I'm puzzled by what Digital Convergence was trying to do.


(Humor)

So What's Next?

Now I'm beginning to suspect that we will see future 'Cats with different 62/63 reports. What are possible additional 62/63 pairs? (Aren't you sorry you asked?)

( ) For LISPers
< > Why HTML people didn't think of these I don't know
[ ] I so love array indexes
{ } For the blockheads in the group
\ ` Hey, these are slanted just like 'CatCues!
@ ! Wasn't this an organization at SF conventions?
^ ~ Not Not joke for C/Java programmers
& | For bit-ty C'ers
? : For Java programmers wanting to make up their mind
% = The ultimate in breaking HTML form strings

Back to Main CueCat Principles of Operations page.