Parsing Unidata Multi-Value Strings
Sun, Dec 30, 2018
The college I work for uses Colleague as its main database of record. Colleague returns data to our applications in Unidata multi-value strings. Essentially it’s a serialization that uses four extended-ASCII characters (codes 251 to 254) to mark off various organizational levels. Parsing the Colleague data into Python data structures turned out to be easier than expected when I used comprehensions for the job. Let’s start by taking a close look at the multi-value strings.
Text Mark: û (TM)
At the highest level a Colleague response can be divided into ‘texts’. Each ‘text’ is the response to a distinct query. With Unidata systems you can submit multiple queries at once. When you do that, you get back one string with a text mark between the individual query responses, something like:
<text>TM<text>TM<text>
This would be three texts divided by two text marks. I’m using TM as a stand-in for ASCII #251, a u with a circumflex (û).
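In Python that split is a single `str.split` call. A minimal sketch, assuming the response has already been decoded to a `str`; the sample string is made up for illustration:

```python
# Character code 251 is the text mark described above.
TM = chr(251)  # 'û'

# Hypothetical response containing three texts.
response = "first" + TM + "second" + TM + "third"

# Splitting on the text mark yields one string per query response.
texts = response.split(TM)
print(texts)  # ['first', 'second', 'third']
```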
Field Mark: þ (FM)
At the next level of organization, each Colleague text can be divided into ‘fields’. The field mark is ASCII #254, a thorn (þ). I’ll use FM as a stand-in. Fields correspond to columns in a data table. If you’re looking up information about a person, the fields are likely to be things like NAME, USERNAME, EMAIL, etc. Schematically:
<field>FM<field>FM<field>TM<field>FM<field>
The above would parse out to two texts, the first with three fields and the second with two fields.
Value Mark: ý (VM)
Each field can then be divided into values as indicated by an accented ý (ASCII #253), which I’ll represent with VM. There’s a catch, though, because the first ‘value’ is actually the name of the field. This is different from how the other three marks work. For instance, schematically:
<name>VM<value>FM<name>VM<value>
Here we have two fields, each with a name and a value. Or to flesh it out with some data:
PERSONALNAMEýKevinþFAMILYNAMEýWiliarty
There’s another wrinkle, though: we might get multiple values for each field. When we query Colleague we can ask for multiple records at a time. This is not to be confused with running multiple queries: a single query can return values for several records. We might, for instance, want the personal information about more than one account. In that case our data will have the general form:
<name>VM<value>VM<value>FM<name>VM<value>VM<value>
Now we’ve got two fields, each with two values corresponding to two separate records. Perhaps something like the following for Angus Podgorny and myself:
PERSONALNAMEýKevinýAngusþFAMILYNAMEýWiliartyýPodgorny
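To make the name-first convention concrete, here is a small sketch that separates each field’s name from its values. The constants and the `fields` dictionary are names I’m introducing for illustration; the sample text is built from the constants so the marks are guaranteed to match.

```python
FM = chr(254)  # field mark 'þ'
VM = chr(253)  # value mark 'ý'

# Same content as the sample string above: two fields, two records each.
text = f"PERSONALNAME{VM}Kevin{VM}Angus{FM}FAMILYNAME{VM}Wiliarty{VM}Podgorny"

fields = {}
for field in text.split(FM):
    # The first item is the field name; everything after it is a value.
    name, *values = field.split(VM)
    fields[name] = values

print(fields)
# {'PERSONALNAME': ['Kevin', 'Angus'], 'FAMILYNAME': ['Wiliarty', 'Podgorny']}
```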
Sub-value Mark: ü (SM)
Finally, each value can consist of an arbitrary number of sub-values. Sub-values are marked off by ASCII #252, an umlauted ü, which we can indicate with SM. Let’s imagine that our first field from the example above is PERSONALNAMES rather than PERSONALNAME. We might get data with a structure like the following:
<name>VM<sub-value>SM<sub-value>VM<value>FM<name>VM<value>VM<value>
Or to fill in some blanks:
PERSONALNAMESýKevinüPatrickýAngusþFAMILYNAMEýWiliartyýPodgorny
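Splitting a single field down to its sub-values might look like the sketch below. The variable names are mine, and the field is rebuilt from the constants rather than pasted in.

```python
VM = chr(253)  # value mark 'ý'
SM = chr(252)  # sub-value mark 'ü'

# The PERSONALNAMES field: the first value has two sub-values, the second has one.
field = f"PERSONALNAMES{VM}Kevin{SM}Patrick{VM}Angus"

name, *values = field.split(VM)
# Every value becomes a list of sub-values, even when there is only one.
sub_values = [value.split(SM) for value in values]

print(name, sub_values)  # PERSONALNAMES [['Kevin', 'Patrick'], ['Angus']]
```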
Putting it all together
So let’s imagine now that we send two queries to Colleague:
- We are looking for information about a course
- We also want the personal information about students in the course
We might get back some data like this:
PERSONALNAMESýKevinüPatrickýAngusþFAMILYNAMEýWiliartyýPodgornyûCOURSENAMEýOld High GermanþCOURSENUMBERýOHG-101
How can we convert that into a Python data structure?
A list of dictionaries whose values are lists of lists
The first thing to figure out is what kind of data structure we can reliably map to. We don’t want to lose any data at this point, so even where only one sub-value or value is given, we need to represent these as lists — lists of one, but still lists. Since fields associate names with values, we’ll need a dictionary for that, and then we’ll ultimately need a list of such dictionaries, each of which represents a text. For our example above it would look like this:
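Written out as a Python literal, that is a list with one dictionary per text, where every value is a list of lists. The variable name `expected` is only for illustration.

```python
# Target structure for the sample response: one dict per text,
# each field name mapped to a list of values, each value a list of sub-values.
expected = [
    {
        "PERSONALNAMES": [["Kevin", "Patrick"], ["Angus"]],
        "FAMILYNAME": [["Wiliarty"], ["Podgorny"]],
    },
    {
        "COURSENAME": [["Old High German"]],
        "COURSENUMBER": [["OHG-101"]],
    },
]
```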
Comprehensions
Since we are coding in Python, I decided to use nested comprehensions to do the parsing. The result is somewhat more complicated than one might generally wish comprehensions to be. On the other hand, it is very much less complicated than a bunch of nested for loops. Given the complexity of the task, I find the comprehensions relatively transparent. They also have the great virtue of looking visually like a schematic of the structure we are aiming to create. Assuming that I’ve created constants TM, FM, etc. for the opaque ASCII characters, let’s jump straight to the punchline:
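A sketch of such a comprehension, assuming the constants are defined with `chr()` and the sample response from above is held in a variable I’m calling `data`:

```python
TM = chr(251)  # text mark 'û'
FM = chr(254)  # field mark 'þ'
VM = chr(253)  # value mark 'ý'
SM = chr(252)  # sub-value mark 'ü'

# The sample response from above, assembled from the constants.
data = (
    f"PERSONALNAMES{VM}Kevin{SM}Patrick{VM}Angus{FM}"
    f"FAMILYNAME{VM}Wiliarty{VM}Podgorny{TM}"
    f"COURSENAME{VM}Old High German{FM}COURSENUMBER{VM}OHG-101"
)

parsed = [
    {
        field.split(VM)[0]: [value.split(SM) for value in field.split(VM)[1:]]
        for field in text.split(FM)
    }
    for text in data.split(TM)
]
```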
That’s it. It’s a one-liner, though it’s more readable with some vertical relief. Let’s work from the outside in. To begin with, we know that we want to return a list of dictionaries, one for each text set off by a text mark:
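As a runnable sketch of that outermost layer, with an empty dictionary standing in for the contents we haven’t built yet (the sample string is made up):

```python
TM = chr(251)  # text mark 'û'

# Hypothetical response with two texts.
data = f"first text{TM}second text"

# One (still empty) dictionary per text.
parsed = [{} for text in data.split(TM)]
print(parsed)  # [{}, {}]
```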
We’ll get the pairs for the dictionary by first splitting the texts on field marks.
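At this intermediate stage each dictionary entry corresponds to one raw field. In the sketch below the whole unsplit field serves as the key and `None` is a placeholder value; both are stand-ins of mine so that the step runs on its own.

```python
TM = chr(251)  # text mark 'û'
FM = chr(254)  # field mark 'þ'
VM = chr(253)  # value mark 'ý'

data = (
    f"FAMILYNAME{VM}Wiliarty{VM}Podgorny{TM}"
    f"COURSENAME{VM}Old High German{FM}COURSENUMBER{VM}OHG-101"
)

# One dictionary per text, one entry per field (fields not yet split further).
parsed = [
    {field: None for field in text.split(FM)}
    for text in data.split(TM)
]
```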
The key for each field in the dictionary is then the first value we get from splitting the field on value marks, and the values are the remainder of that split:
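A sketch of this step, with the key and values taken from the same split. Sub-values are deliberately left unsplit here, so the sub-value mark is still embedded in the first value.

```python
TM = chr(251)  # text mark 'û'
FM = chr(254)  # field mark 'þ'
VM = chr(253)  # value mark 'ý'
SM = chr(252)  # sub-value mark 'ü' (only used to build the sample)

data = (
    f"PERSONALNAMES{VM}Kevin{SM}Patrick{VM}Angus{FM}"
    f"FAMILYNAME{VM}Wiliarty{VM}Podgorny{TM}"
    f"COURSENAME{VM}Old High German{FM}COURSENUMBER{VM}OHG-101"
)

parsed = [
    {
        # First item of the split is the field name, the rest are its values.
        field.split(VM)[0]: field.split(VM)[1:]
        for field in text.split(FM)
    }
    for text in data.split(TM)
]
```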
Now we just need to split each value into sub-values, which brings us to the final version:
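Wrapped in a function for reuse, the complete parser might look like this. The function name `parse_multivalue` is my own choice, and the input is assumed to be an already-decoded `str`.

```python
TM = chr(251)  # text mark 'û'
FM = chr(254)  # field mark 'þ'
VM = chr(253)  # value mark 'ý'
SM = chr(252)  # sub-value mark 'ü'


def parse_multivalue(data):
    """Parse a multi-value string into a list of dicts of lists of lists."""
    return [
        {
            field.split(VM)[0]: [
                value.split(SM) for value in field.split(VM)[1:]
            ]
            for field in text.split(FM)
        }
        for text in data.split(TM)
    ]


course = parse_multivalue(
    f"COURSENAME{VM}Old High German{FM}COURSENUMBER{VM}OHG-101"
)
print(course)
# [{'COURSENAME': [['Old High German']], 'COURSENUMBER': [['OHG-101']]}]
```

Two behaviours of this sketch are worth knowing about: an empty string parses to `[{'': []}]` rather than an empty list, and if a field name repeats within one text, the later field overwrites the earlier one.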
Having previously created a looping equivalent in PHP, I can say without qualification that I find the comprehensions much more satisfying.