Parsing Unidata Multi-Value Strings

The college I work for uses Colleague as its main database of record. Colleague returns data to our applications in Unidata multi-value strings. Essentially it’s a serialization that uses four special characters (character codes 251 through 254) to delimit nested levels of organization. Parsing the Colleague data into Python data structures turned out to be easier than expected when I used comprehensions for the job. Let’s start by taking a close look at the multi-value strings.

Text Mark: û (TM)

At the highest level a Colleague response can be divided into ‘texts’. Each ‘text’ is the response to a distinct query. With Unidata systems you can submit multiple queries at once. When you do that, you get back one string with a text mark between the individual query responses, something like:

    <text>TM<text>TM<text>

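This would be three texts divided by two text marks. I’m using TM as a stand-in for character code 251, a u with a circumflex (û). It is often loosely called an ASCII character, but strictly it lies outside the 7-bit ASCII range; 251 is its code point in Latin-1 and Unicode.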
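
To make the text mark concrete, here is a minimal Python sketch. The constant name TM simply mirrors the stand-in used above, and the three placeholder texts are made up for illustration.

```python
# Text mark: character code 251 (û). The name TM mirrors the stand-in above.
TM = chr(251)

# A made-up response: three placeholder texts separated by two text marks.
response = "first" + TM + "second" + TM + "third"

# Splitting on the text mark recovers the individual texts.
texts = response.split(TM)
print(texts)  # ['first', 'second', 'third']
```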

Field Mark: þ (FM)

At the next level of organization, each Colleague text can be divided into ‘fields’. The field mark is character code 254, a thorn (þ). I’ll use FM as a stand-in. Fields correspond to columns in a data table. If you’re looking up information about a person, the fields are likely to be things like NAME, USERNAME, EMAIL, etc. Schematically:

    <field>FM<field>FM<field>TM<field>FM<field>

The above would parse out to two texts, the first with three fields and the second with two fields.
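
The same two-level split can be sketched in Python. The constant names follow the stand-ins used in this post, and the single-letter field contents are placeholders, not real Colleague data.

```python
TM = chr(251)  # text mark, û
FM = chr(254)  # field mark, þ

# Two texts: the first with three fields, the second with two (placeholders).
response = FM.join(["a", "b", "c"]) + TM + FM.join(["d", "e"])

# Split into texts first, then split each text into its fields.
fields_per_text = [text.split(FM) for text in response.split(TM)]
print(fields_per_text)  # [['a', 'b', 'c'], ['d', 'e']]
```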

Value Mark: ý (VM)

Each field can then be divided into values as indicated by a y with an acute accent (ý, character code 253), which I’ll represent with VM. There’s a catch, though, because the first ‘value’ is actually the name of the field. This is different from how the other three marks work. For instance, schematically:

    <name>VM<value>FM<name>VM<value>

Here we have two fields, each with a name and a value. Or to flesh it out with some data:

    PERSONALNAMEýKevinþFAMILYNAMEýWiliarty

There’s another wrinkle, though: we might get multiple values for each field. When we query Colleague we can ask for multiple records at a time. This is not to be confused with running multiple queries; a single query can return values for several records. We might, for instance, want the personal information about more than one account. In that case our data will have the general form:

    <name>VM<value>VM<value>FM<name>VM<value>VM<value>

Now we’ve got two fields, each with two values corresponding to two separate records. Perhaps something like the following for Angus Podgorny and myself:

    PERSONALNAMEýKevinýAngusþFAMILYNAMEýWiliartyýPodgorny
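
As a rough sketch of how the name can be peeled off from the values in Python, the snippet below rebuilds the example string from constants and uses star-unpacking. The variable name `pairs` is my own choice for illustration.

```python
VM = chr(253)  # value mark, ý
FM = chr(254)  # field mark, þ

# The two-record example from above, assembled from the constants.
data = f"PERSONALNAME{VM}Kevin{VM}Angus{FM}FAMILYNAME{VM}Wiliarty{VM}Podgorny"

pairs = []
for field in data.split(FM):
    # The first piece is the field name; everything after it is a value.
    name, *values = field.split(VM)
    pairs.append((name, values))

print(pairs)
# [('PERSONALNAME', ['Kevin', 'Angus']), ('FAMILYNAME', ['Wiliarty', 'Podgorny'])]
```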

Sub-value Mark: ü (SM)

Finally, each value can consist of an arbitrary number of sub-values. Sub-values are marked off by character code 252, a u with an umlaut (ü), which we can indicate with SM. Let’s imagine that our first field from the example above is PERSONALNAMES rather than PERSONALNAME. We might get data with a structure like the following:

    <name>VM<sub-value>SM<sub-value>VM<value>FM<name>VM<value>VM<value>

Or to fill in some blanks:

    PERSONALNAMESýKevinüPatrickýAngusþFAMILYNAMEýWiliartyýPodgorny
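
Here is a small sketch of the sub-value split for the PERSONALNAMES field alone. It assumes the same constant names as the stand-ins above; `sub_values` is a name I picked for the example.

```python
SM = chr(252)  # sub-value mark, ü
VM = chr(253)  # value mark, ý

# Just the PERSONALNAMES field: one value with two sub-values, one with one.
field = f"PERSONALNAMES{VM}Kevin{SM}Patrick{VM}Angus"

# Separate the field name from its values, then split every value.
name, *values = field.split(VM)
sub_values = [value.split(SM) for value in values]

print(name, sub_values)  # PERSONALNAMES [['Kevin', 'Patrick'], ['Angus']]
```

Note that a value without any sub-value mark still comes back as a list of one, because `str.split` always returns a list.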

Putting it all together

So let’s imagine now that we send two queries to Colleague:

  1. We want the personal information about the students in a course
  2. We are also looking for information about the course itself

We might get back some data like this:

    PERSONALNAMESýKevinüPatrickýAngusþFAMILYNAMEýWiliartyýPodgornyûCOURSENAMEýOld High GermanþCOURSENUMBERýOHG-101

How can we convert that into a Python data structure?

A list of dictionaries whose values are lists of lists

The first thing to figure out is what kind of data structure we can reliably map to. We don’t want to lose any data at this point, so even where only one sub-value or value is given, we need to represent these as lists: lists of one, but still lists. Since fields associate names with values, we’ll need a dictionary for that, and then we’ll ultimately need a list of such dictionaries, each of which represents a text. For our example above it would look like this:

[
    {
        'PERSONALNAMES': [
            ['Kevin', 'Patrick'],
            ['Angus'],
        ],
        'FAMILYNAME': [
            ['Wiliarty'],
            ['Podgorny'],
        ],
    },
    {
        'COURSENAME': [['Old High German']],
        'COURSENUMBER': [['OHG-101']],
    },
]
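
To get a feel for working with this shape, here is a short sketch of reading values back out of it. The variable names `parsed` and `people` are mine; the data is just the structure shown above.

```python
# The target structure from above, written out as a Python literal.
parsed = [
    {
        "PERSONALNAMES": [["Kevin", "Patrick"], ["Angus"]],
        "FAMILYNAME": [["Wiliarty"], ["Podgorny"]],
    },
    {
        "COURSENAME": [["Old High German"]],
        "COURSENUMBER": [["OHG-101"]],
    },
]

# Index order: text, field name, record (value), sub-value.
print(parsed[0]["PERSONALNAMES"][0][1])  # Patrick
print(parsed[1]["COURSENAME"][0][0])     # Old High German

# Values in different fields line up by position, so zip pairs them per record.
people = list(zip(parsed[0]["PERSONALNAMES"], parsed[0]["FAMILYNAME"]))
print(people)
# [(['Kevin', 'Patrick'], ['Wiliarty']), (['Angus'], ['Podgorny'])]
```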

Comprehensions

Since we are coding in Python, I decided to use nested comprehensions to do the parsing. The result is somewhat more complicated than one might generally wish comprehensions to be. On the other hand, it is very much less complicated than a bunch of nested for loops. Given the complexity of the task, I find the comprehensions relatively transparent. They also have the great virtue of looking visually like a schematic of the structure we are aiming to create. Assuming that I’ve created constants TM, FM, etc. for the opaque delimiter characters, let’s jump straight to the punchline:

def parse(mv_string):
    '''Convert an MV string into a list of dictionaries.'''

    return [
        {
            field.split(VM)[0]: [
                value.split(SM) for value in field.split(VM)[1:]
            ] for field in text.split(FM)
        } for text in mv_string.split(TM)
    ]

That’s it. It’s a one-liner, though it’s more readable with some vertical relief. Let’s work from the outside in. To begin with, we know that we want to return a list of dictionaries, one for each text set off by a text mark:

return [
    {} for text in mv_string.split(TM)
]

We’ll get the pairs for the dictionary by first splitting the texts on field marks.

return [
    {
        <key>: <value> for field in text.split(FM)
    } for text in mv_string.split(TM)
]

The key for each field in the dictionary is then the first value we get from splitting the field on value marks, and the values are the remainder of that split:

return [
    {
        field.split(VM)[0]: field.split(VM)[1:] for field in text.split(FM)
    } for text in mv_string.split(TM)
]
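
To see what this intermediate step produces, here is a runnable sketch. The function name `parse_without_sub_values` is hypothetical and exists only for this illustration, and the constants are defined the way I assume them throughout.

```python
# Delimiter constants: text, sub-value, value and field marks.
TM, SM, VM, FM = (chr(code) for code in (251, 252, 253, 254))


def parse_without_sub_values(mv_string):
    """Intermediate step: split texts, fields and values, but not sub-values."""
    return [
        {
            field.split(VM)[0]: field.split(VM)[1:] for field in text.split(FM)
        } for text in mv_string.split(TM)
    ]


# A single text with two fields; the first value holds two sub-values.
sample = (
    f"PERSONALNAMES{VM}Kevin{SM}Patrick{VM}Angus"
    f"{FM}FAMILYNAME{VM}Wiliarty{VM}Podgorny"
)
intermediate = parse_without_sub_values(sample)
print(intermediate)
```

At this stage the first value of PERSONALNAMES is still the single string 'KevinüPatrick', with the sub-value mark embedded in it.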

Now we just need to split each value into sub-values, which brings us to the final version:

return [
    {
        field.split(VM)[0]: [
            value.split(SM) for value in field.split(VM)[1:]
        ] for field in text.split(FM)
    } for text in mv_string.split(TM)
]
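
For anyone who wants to try it end to end, here is a self-contained sketch that defines the constants (using `chr` is my assumption about how one might create them) and runs the parser on the two-text example from earlier.

```python
# Delimiter constants; the names follow the stand-ins used in this post.
TM = chr(251)  # text mark, û
SM = chr(252)  # sub-value mark, ü
VM = chr(253)  # value mark, ý
FM = chr(254)  # field mark, þ


def parse(mv_string):
    """Convert an MV string into a list of dictionaries."""
    return [
        {
            field.split(VM)[0]: [
                value.split(SM) for value in field.split(VM)[1:]
            ] for field in text.split(FM)
        } for text in mv_string.split(TM)
    ]


# The two-text example from earlier, assembled from the constants.
sample = (
    f"PERSONALNAMES{VM}Kevin{SM}Patrick{VM}Angus"
    f"{FM}FAMILYNAME{VM}Wiliarty{VM}Podgorny"
    f"{TM}COURSENAME{VM}Old High German"
    f"{FM}COURSENUMBER{VM}OHG-101"
)

result = parse(sample)
print(result[0]["PERSONALNAMES"])  # [['Kevin', 'Patrick'], ['Angus']]
print(result[1]["COURSENUMBER"])   # [['OHG-101']]
```

Two things this sketch does not guard against: an empty input string yields `[{'': []}]` rather than an empty list, and a field name repeated within one text would silently overwrite the earlier entry.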

Having previously created a looping equivalent in PHP, I can say without qualification that I find the comprehensions much more satisfying.
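
For comparison, here is roughly what a nested-loop version might look like in Python. This is not the PHP code mentioned above, just a sketch of the same logic; the function name `parse_with_loops` is hypothetical.

```python
# Delimiter constants, named after the stand-ins used in this post.
TM = chr(251)  # text mark, û
SM = chr(252)  # sub-value mark, ü
VM = chr(253)  # value mark, ý
FM = chr(254)  # field mark, þ


def parse_with_loops(mv_string):
    """Loop-based equivalent of the nested comprehension."""
    texts = []
    for text in mv_string.split(TM):
        fields = {}
        for field in text.split(FM):
            # The first piece is the field name; the rest are values.
            name, *values = field.split(VM)
            sub_value_lists = []
            for value in values:
                sub_value_lists.append(value.split(SM))
            fields[name] = sub_value_lists
        texts.append(fields)
    return texts


sample = (
    f"PERSONALNAMES{VM}Kevin{SM}Patrick{VM}Angus"
    f"{FM}FAMILYNAME{VM}Wiliarty{VM}Podgorny"
    f"{TM}COURSENAME{VM}Old High German"
    f"{FM}COURSENUMBER{VM}OHG-101"
)
looped = parse_with_loops(sample)
print(looped[0]["FAMILYNAME"])  # [['Wiliarty'], ['Podgorny']]
```

The loop version needs three accumulator variables and three append or assign steps that the comprehension expresses implicitly.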