Building a Python 2/3 compatible Unicode Sandwich

Posted on March 10, 2017 in programming

So you've decided that your code needs to be compatible with both Python 2 and Python 3. Most likely, you're upgrading your Python 2 code to work in Python 3, and know that you need to do things like:

  • Replace print statements with calls to the print() function
  • Use absolute rather than relative imports
  • Call dict.items() instead of dict.iteritems()
  • etc.

But for me, and perhaps for you as well, by far the biggest and most complicated issue in getting code to be jointly Python 2/3 compatible is maintaining the Unicode Sandwich. If you don't know what a Unicode Sandwich is, please see these slides by Ned Batchelder: Pragmatic Unicode, or How do I stop the Pain? The idea is that in the 21st century, all text inside your application should be Unicode, with conversions to and from various specific character encodings at the margins.

Implementing a Unicode Sandwich in Python 3 isn't too bad because Python 3 explicitly distinguishes between decoded Unicode text (of type str) and encoded characters (of type bytes). This means that if you do a little bit of extra work to always explicitly convert bytes (that you might get from a web service, text file, or some other source) into Unicode, and do the reverse conversion when writing outputs, then voila, problem solved.
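For example, a minimal sketch of the sandwich in Python 3 (the file names here are just for illustration):

# Bottom of the sandwich: decode bytes at the boundary
with open('input.txt', 'rb') as f:
    text = f.read().decode('utf-8')

# Middle of the sandwich: work exclusively with Unicode str
text = text.upper()

# Top of the sandwich: encode back to bytes on the way out
with open('output.txt', 'wb') as f:
    f.write(text.encode('utf-8'))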

However, in Python 2 the str type contains bytes, which are implicitly converted (decoded) to Unicode when mixing bytes and Unicode. When you have an application that deals with lots of external files, services, and resources, some of which return Unicode content and others encoded bytes, along with the string literals in your own code (which are bytes in Python 2 and Unicode in Python 3), you have a recipe for confusion.
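The implicit conversion uses the ASCII codec, so it works just long enough to hide bugs until non-ASCII data shows up:

# Python 2
>>> b'caf\xc3\xa9' + u'!'  # mixing bytes and Unicode forces an implicit decode
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)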

Making a Unicode sandwich work in both Python 2 and 3 requires systematically going through your code and enforcing the bytes/Unicode conversions at all the appropriate places, using code that works for both versions. The trick is that every data source and library is a little bit different--some accept only bytes to their key functions, others only Unicode strings, so the way of doing the appropriate conversions takes a little figuring. What follows are some notes I compiled when rewriting our INDRA software (which deals with natural language text and several different types of databases) to support Unicode in a Python 2/3 compatible way. Hopefully these notes will point you in the right direction if you are trying to do something similar.

Boilerplate imports

If you're going to maintain a Unicode sandwich, you'll need any strings that you define in your code to be Unicode strings:

from __future__ import unicode_literals

You'll also want (at least) the following builtins in your imports in every file:

from builtins import dict, str

Redefining dict and str in this way causes them to behave in Python 2 like the corresponding Python 3 types: for example, dict.items() returns an iterator rather than a list, and str behaves as a Unicode string rather than a bytestring. Because of the redefinition of str, isinstance(u'foo', str) is True in Python 2, so you can use the same code for both Python 2 and 3. The redefinition also changes the str() constructor, so that converting an object to a str in Python 2 (e.g., str(5)) yields a Unicode-compatible string, not bytes.
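To illustrate in a Python 2 session:

# Python 2, after the builtins imports above
>>> isinstance(u'foo', str)    # Unicode strings count as str
True
>>> isinstance(b'foo', str)    # native bytestrings do not
False
>>> str(5) == u'5'             # str() produces Unicode-compatible text
True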

String type checking

In Python 2, it's common to test whether an argument is a string with isinstance(foo, basestring), because basestring is the supertype of both Python 2's str and unicode types. This is a handy way to, for example, tell whether the argument to a function is a string or a list. However, because basestring doesn't exist in Python 3, this has to be changed.

The best solution is to use Unicode everywhere in Python 2, importing str from builtins (as recommended above) and then using isinstance(foo, str) where you would previously have used basestring. This is then compatible with both Python 2 and 3.

However, if your Unicode sandwich isn't completely airtight and there's a possibility that foo might be a Python 2 bytestring, then isinstance(foo, str) will be False when using the above approach, possibly leading to silent failures. In this case you might want to stick with the following workaround that retains basestring:

# Python 2: basestring already exists, so this is a no-op
try:
    basestring
# Python 3: define basestring so isinstance(foo, basestring) still works
except NameError:
    basestring = str
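With this in place, the same check covers Unicode strings in both versions, as well as any stray Python 2 bytestrings. For instance (as_list is a hypothetical helper):

def as_list(arg):
    # Wrap a bare string in a list; pass lists through unchanged
    if isinstance(arg, basestring):
        arg = [arg]
    return arg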

Custom __str__ methods

In Python 2, custom __str__ methods are expected to return a bytes (Python 2 str), not a Unicode string (unicode). This can be a problem once your objects contain only hard-won Unicode strings. Fortunately the future package contains a decorator, @python_2_unicode_compatible, to make your __str__ method work in both Python 2 and 3. However, make sure that you apply the decorator only once in your object hierarchy, or you will get an error (i.e., don't apply the decorator both in a superclass and a subclass).
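Here is a minimal sketch of the decorator in use (the Greeting class is purely illustrative):

from __future__ import unicode_literals
from future.utils import python_2_unicode_compatible

@python_2_unicode_compatible
class Greeting(object):
    def __init__(self, name):
        self.name = name

    def __str__(self):
        # Return Unicode here; in Python 2 the decorator moves this method
        # to __unicode__ and adds a __str__ that encodes to UTF-8
        return 'Hello, {}!'.format(self.name)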

Web services: urllib and requests

The structure of the urllib library is very different between Python 2 and 3, and urllib calls require much more care with bytes/unicode conversions. Do yourself a favor and rewrite any web service calls using requests instead of any of the urllib methods. requests takes a dict of query parameters directly, eliminating the need to urlencode and/or UTF-8 encode request content. The response object returned by the requests library gives you access to both the underlying bytes (in response.content) and a decoded Unicode version (in response.text). You can also get a parsed JSON object directly by calling response.json().
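For example (the URL and query parameters here are made up):

import requests

resp = requests.get('https://example.com/api', params={'q': u'caf\xe9'})
raw = resp.content    # undecoded bytes
text = resp.text      # decoded Unicode
data = resp.json()    # parsed JSON object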

If you're stuck using urllib for some reason, here are some things to note. First, wrap your imports in a try/except, e.g.:

# Python 3 version
try:
    from urllib.request import urlopen
    from urllib.error import HTTPError
    from urllib.parse import urlencode
# Python 2 version
except ImportError:
    from urllib import urlencode
    from urllib2 import urlopen, HTTPError

When calling urlopen, the second (data) argument must be bytes, so you'll need to do:

result = urlopen(url, data.encode('utf-8'))

And when reading from the object, you'll need to decode back to Unicode:

response_text = result.read().decode('utf-8')

In some cases you'll want to keep the response content in bytes form if you're passing it to another library that expects only bytes. For example, rdflib and json perform the bytes/unicode conversion internally (see below).

Reading/writing CSV files

There are a few differences in the procedure for reading and writing CSV files between Python 2 and 3. In Python 3, the encoding should be specified when opening the file in text mode, so that the csv.reader/writer objects deal directly in Unicode strings. A newline='' argument is also required when opening the file.

For Python 2, the encoding and newline arguments are not permitted, so the file opening step has to occur in an alternative block. Also, the delimiter and quotechar arguments can be Unicode in Python 3, but in Python 2 they must be bytes (the lineterminator argument does not need to be encoded to bytes in Python 2, however).

Finally, in Python 2 the csv reader returns byte strings, so each field must be explicitly decoded into Unicode. Here is an example that handles the complete process and returns a generator. Note that the Python 2 version assumes that the delimiter and quotechar arguments are Unicode strings (which is the case when using unicode_literals as I recommend), which is why they have to be encoded:

import csv
import sys

def read_unicode_csv(filename, delimiter=',', quotechar='"',
                     quoting=csv.QUOTE_MINIMAL, lineterminator='\n',
                     encoding='utf-8'):
    # Python 3 version
    if sys.version_info[0] >= 3:
        # Open the file in text mode with given encoding
        # Set newline arg to ''
        # (see https://docs.python.org/3/library/csv.html)
        with open(filename, 'r', newline='', encoding=encoding) as f:
            # Next, get the csv reader, with unicode delimiter and quotechar
            csv_reader = csv.reader(f, delimiter=delimiter,
                                    quotechar=quotechar,
                                    quoting=quoting,
                                    lineterminator=lineterminator)
            # Now, iterate over the (already decoded) csv_reader generator
            for row in csv_reader:
                yield row
    # Python 2 version
    else:
        # Open the file in bytes mode
        with open(filename, 'rb') as f:
            # Next, get the csv reader, passing delimiter and quotechar as
            # bytestrings rather than unicode
            csv_reader = csv.reader(f, delimiter=delimiter.encode(encoding),
                                    quotechar=quotechar.encode(encoding),
                                    quoting=quoting,
                                    lineterminator=lineterminator)
            # Iterate over the file and decode each string into unicode
            for row in csv_reader:
                yield [cell.decode(encoding) for cell in row]

Follow the corresponding procedure for writing CSV files.
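As a sketch, a write_unicode_csv counterpart (reusing the csv and sys imports above, and assuming every cell is already a Unicode string) might look like this:

def write_unicode_csv(filename, rows, delimiter=',', quotechar='"',
                      quoting=csv.QUOTE_MINIMAL, lineterminator='\n',
                      encoding='utf-8'):
    # Python 3 version: open in text mode with the given encoding
    if sys.version_info[0] >= 3:
        with open(filename, 'w', newline='', encoding=encoding) as f:
            csv_writer = csv.writer(f, delimiter=delimiter,
                                    quotechar=quotechar, quoting=quoting,
                                    lineterminator=lineterminator)
            csv_writer.writerows(rows)
    # Python 2 version: open in bytes mode, encode delimiter/quotechar
    # and every cell before writing
    else:
        with open(filename, 'wb') as f:
            csv_writer = csv.writer(f, delimiter=delimiter.encode(encoding),
                                    quotechar=quotechar.encode(encoding),
                                    quoting=quoting,
                                    lineterminator=lineterminator)
            for row in rows:
                csv_writer.writerow([cell.encode(encoding) for cell in row])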

Pickling

Pickling and unpickling must always be done with files opened explicitly in binary mode. To maintain compatibility of pickled files with both Python 2 and 3, pickle files should be generated with protocol level 2, i.e., by pickle.dump(foo, fp, protocol=2). These files can be opened by Python 2 as well as 3.
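For example:

import pickle

# Binary mode always; protocol 2 is readable by both Python 2 and 3
with open('data.pkl', 'wb') as fp:
    pickle.dump(foo, fp, protocol=2)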

If there are pre-existing pickle files generated by Python 2 that need to be openable by Python 3, there is an optional encoding argument to pickle.load that tells Python 3 how it should interpret non-ASCII byte strings that were encoded into pickle files by Python 2. For some reason, Python 2 pickles can sometimes fail to load in Python 3 unless the encoding argument to pickle.load is set to latin-1 (even if they were encoded in Python 2 using UTF-8). This has been reported in quite a few places.

Annoyingly, Python 2's pickle.load does not accept the encoding argument, so you'll need two code paths for loading pickle files. If possible, it is far better to recreate and/or repickle the data in Python 3.
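A sketch of the two code paths, using the latin-1 workaround described above:

import sys
import pickle

with open('data.pkl', 'rb') as fp:
    if sys.version_info[0] >= 3:
        # latin-1 maps every possible byte, so Python 2 pickles load cleanly
        foo = pickle.load(fp, encoding='latin-1')
    else:
        foo = pickle.load(fp)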

Parsing XML with xml.etree.ElementTree

This XML parser expects bytes, not Unicode, converting the bytes into Unicode internally. However, it is important to note that in Python 2, elements in the parsed XML will have unicode text in their .text attribute only when the text actually contains a non-ASCII character. This means that if the XML consists mostly of ASCII-compatible strings, they will come back as Python 2 str, leaking bytes into your otherwise pure Unicode sandwich. This would generally be OK, except that unit tests asserting that objects contain Unicode strings will then fail. Moreover, explicitly converting the ASCII-compatible strings with unicode(foo) is problematic in cases where the string can be None, as it will introduce the string 'None' into the data!

Here's a tricky solution (hack?) that I adapted from this thread that ensures that etree returns only Unicode strings and uses the same syntax in the caller between Python 2 and 3. It involves subclassing xml.etree.ElementTree.XMLTreeBuilder in Python 2 and overriding a single method. The trick is that in Python 3, a corresponding function is defined that simply returns None:

import sys
import xml.etree.ElementTree as ET

if sys.version_info[0] >= 3:
    def UnicodeXMLTreeBuilder():
        return None
else:
    class UnicodeXMLTreeBuilder(ET.XMLTreeBuilder):
        # See this thread:
        # http://www.gossamer-threads.com/lists/python/python/728903
        def _fixtext(self, text):
            return text

# Get XML content as bytes, e.g., via urlopen
response = urlopen(...)
tree = ET.parse(response, parser=UnicodeXMLTreeBuilder())

# Or, parse directly from a bytestring
xml_str = b'<foo><bar>baz</bar></foo>'
tree = ET.XML(xml_str, parser=UnicodeXMLTreeBuilder())

In Python 2, the call to UnicodeXMLTreeBuilder() returns an instance of the appropriate parser, whereas in Python 3, it returns None and allows the ElementTree.XML and ElementTree.parse functions to operate normally. The upshot is that the parser argument should always be passed when using either function.

JSON

When writing a Unicode-containing Python object to a JSON file or string using json.dump or json.dumps, note that the object produced is, counterintuitively, a str in Python 2 (a bytestring), but with all non-ASCII characters escaped as \uXXXX sequences, and hence suitable for writing to a file opened in text mode.

# Python 2
>>> import json
>>> foo = u'U\0001F4A9'
>>> type(foo)
<type 'unicode'>
>>> bar = json.dumps(foo)
>>> bar
'"U\\u00001F4A9"'
>>> type(bar)
<type 'str'>
>>> baz = json.loads(bar)
>>> baz
u'U\x001F4A9'
>>> type(baz)
<type 'unicode'>

In Python 3, json.dumps returns a str, likewise suitable for writing to files opened in text mode.

Similarly, to load a JSON object with load or loads, in both Python 2 and 3 the json module expects a str (not a Python 3 bytes). This means that all JSON files should be opened in text (not binary) mode, and should be created by json.dump rather than by some other process that would leave encoded byte strings in the file.
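The round trip then looks the same in both versions (data.json and foo are placeholder names):

import json

with open('data.json', 'w') as fp:    # text mode
    json.dump(foo, fp)

with open('data.json', 'r') as fp:    # text mode
    foo = json.load(fp)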

Three other things are worth noting:

  • When dumping with json.dumps(foo) in Python 2, foo itself can contain a mix of str and unicode strings as long as the str objects are ASCII only.
  • When loading with foo = json.loads(...) in Python 2, the object returned will contain only unicode strings, even if those strings were str when they were dumped. For example:
# Python 2
>>> import json
# All strings are str, not unicode
>>> foo = ['foo', {'bar': ('baz', None, 1.0, 2)}]
# Will come back with all strings unicode
>>> json.loads(json.dumps(foo))
[u'foo', {u'bar': [u'baz', None, 1.0, 2]}]
  • In Python 3, calling json.dumps on an object containing any bytestrings will lead to a TypeError.
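To illustrate the last point:

# Python 3: any bytes value anywhere in the object raises TypeError
import json

try:
    json.dumps({'key': b'value'})
except TypeError as err:
    print('bytes are not serializable:', err)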

RDF and rdflib

When serializing an rdflib.graph object (e.g., for writing to a file), the encoding can be specified by an argument to the serialize function, which returns bytes:

g.serialize(format='xml', encoding='utf-8')

This can then be written to a file opened in bytes mode (i.e., with the wb arg), e.g.:

with open(file_path, 'wb') as out_file: # Binary mode
    xml_bytes = g.serialize(format='xml', encoding='utf-8')
    out_file.write(xml_bytes)