|
|
|
Here is a proposal I have for doing it with a simplified kind of schema. Basically, for a particular data stream with a particular set of serialized classes, you give each tag a 1-byte (or 2-byte, for large schema) representation. You also have to know which tags contain primitive types, i.e. integers and strings, because (binary) end-tags are eliminated in those cases. With just that information, you can parse anything.
It would also be possible to extend this idea so that even unexpected types, not part of the schema, could be properly serialized and later parsed (without the benefit of schema compression). The biggest problem I see here is that XStream's converters and stuff seem to be oriented toward writing Strings (Unicode), rather than byte arrays (which are appropriate for binary formats). I'm not sure how big of a problem that is, how deep in the XStream code one would have to hack to make it work well with binary data streams. In any case, I share here the binary proposal for review. If nothing happens on this issue in the next couple of months, I am likely to implement this proposal (or something like it) for my job. Sincerely, Proposed binary format for tree-structured data. This will be a binary analog to the XML currently produced by XStream. The current format is an XML tree with simplified syntax: Here is an example: <Xtest-Person> This encoding takes 252 bytes. Whitespace uses 53 of those bytes, so it would be trivial to encode using 199 bytes. Assuming 2-byte integers (typical for our application), the "actual data" takes up 31 bytes. Thus, we say our "control overhead" for this format is 199-31 = 168 bytes. Now how to compress it by recoding: Single-Byte Tags First, let's get rid of the overhead in the tags. A given XML schema uses only a finite number of node types. Let's assume it's less than 256, which is likely the case for our needs with SMART. Then, every node type can be represented by a 1-byte binary tag: #define XTEST_PERSON 1 We will have to define this TAG SET separately for every type of data stream we wish to use in our protocol. That's OK. We can also have a special marker for tags not found in the tag set: #define UNDEFINED 0 We also have an "end tag" marker, which is used to represent the </xxx> XML tags: #define ENDTAG 255 Using this scheme, we can immediately produce a semi-binary representation of the above XML (bytes are separated by whitespace; items in quotes are translated in the ACTUAL binary representation by their ASCII values). Basically, we have replaced each XML tag by a single-byte sequence. In addition, we have added an end-of-string marker to the strings contained in the leaf nodes: 1 Bytes used for this format are: Binary Representations Next, we can use binary representations for our non-String data (which is a much of what we use in SMART). This has many advantages: Using this idea, our representation is now as follows (using little-endian integer formats): 1 We have saved (in this case) two bytes per integer. Our representation now uses: Redundant End-Of-Tag Markers We can also eliminate the end-of-tag markers on our string nodes, since they are redundant. Using this, we save another 4 bytes, with control overhead down to 19 bytes. 1 Counted Arrays Finally, the <int> (3) tags for the array of integers are redundant. It would be better to explicitely represent (1-D) arrays of primitive types as an array-type tag, with a count. Going back to the original XML, we would replace: with something like: This translates in our binary format to something like: 2 // Parser knows an INTARRAY is coming now In this case, we have saved an addition 4 bytes, resulting in 15 byte control overhead. Summary This binary format provides for a general level of flexibility and ease-of-use that comes with XML. In fact, it is homomorphic to XML and can be converted to such relatively easily. It is still easily parsed, like XML. Thus, this format is suitable for modern serialization systems such as XStream. Although XStream generates XML by default, it has hooks that would allow it to create (and read) this format in a clean manner. This format is compact. Maybe not quite as compact as an (implied) fixed-record format. However, it is much easier to code and work with, and it produces self-documenting serialized object streams. It is also language-independent. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
However to handle a super small & fast binary schema, which survives version changes in classes, it might be nice to introduce a binary schema format.
Then the binary serialization can just include the data and not schema information. We can write the schema version first.
--------
http://localhost/schema/com/acme/CheeseSchema/1.0/3
binaryCrapGoesHere
--------
The overhead of a single schema URL is pretty small; then after that everyone knows the exact schema & can parse it nicely. We can then support
For 'generic' serialization where you can shove anything you like into a List, the schema may be pretty big (e.g. the whole of your classpath
. However its only loaded once, so it doesn't matter if its huge.