Working with binary files in Python

Table of Contents

Binary Files & Distinction between r/w and rb/wb
Writing text to a binary file
PEP 263
Summary
Further Reading

Binary Files & Distinction between r/w and rb/wb

A binary file is a computer file that is not a text file.
~ Wikipedia

Not all files are text files. Many, such as images, formatted documents (Microsoft Word), and audio or video files, store their information as raw bytes that are not meant to be interpreted as characters. These cannot be read the same way Python reads text files.

>>> fH = open('image.jpg', 'rb')
>>> fH.read(5)
b'\xff\xd8\xff\xe0\x00'
>>> fH.close()

>>> fH = open('document.docx', 'rb')
>>> fH.read(5)
b'PK\x03\x04\x14'
>>> fH.close()

>>> fH = open('audio.mp3', 'rb')
>>> fH.read(5)
b'ID3\x03\x00'
>>> fH.close()

>>> fH = open('video.wmv', 'rb')
>>> fH.read(5)
b'0&\xb2u\x8e'
>>> fH.close()

>>> fH = open('text.txt', 'rb')
>>> fH.read(5)
b"It's "
>>> fH.close()
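
The same first-few-bytes check can be wrapped in a small helper. The sketch below is only illustrative: `sniff` and the `SIGNATURES` table are hypothetical names, the table covers just three of the signatures seen in the sessions above, and a `with` block is used so that the file is closed automatically.

```python
import os
import tempfile

# Leading bytes seen in the sessions above (illustrative, far from exhaustive).
SIGNATURES = {
    b'\xff\xd8\xff': 'JPEG image',
    b'PK\x03\x04': 'ZIP container (e.g. .docx)',
    b'ID3': 'MP3 with an ID3 tag',
}

def sniff(path):
    """Guess a file type from its first bytes; return None if unknown."""
    with open(path, 'rb') as fh:      # closed automatically on exit
        head = fh.read(8)
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return None

# Demo on a throwaway file that merely starts like a JPEG.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'sample.bin')
    with open(sample, 'wb') as fh:
        fh.write(b'\xff\xd8\xff\xe0\x00rest-of-file')
    kind = sniff(sample)
    unknown = sniff(__file__) if '__file__' in globals() else None

print(kind)
```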

Apart from text.txt, what we have "read" in the above examples makes little sense to us. So why does Python have these binary access modes? Well, this is why:

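Part of the answer is a platform implementation detail. Unix systems such as Linux and macOS make no distinction between text files and binary files at the operating-system level, whereas Windows text files end each line with \r\n rather than \n. Python 3 adds a distinction of its own on every platform: text mode decodes the file's bytes into str objects, while binary mode hands you the raw bytes unchanged.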

When a text file is read or written in Python on Windows, the end-of-line characters are altered implicitly. That is to say, for a file opened in write (w) mode, fileHandler.write('A sentence.\nNew sentence.\n') will produce a file with the following contents.

>>> fileHandler = open('writtenInStandardForm.txt', 'w')
>>> fileHandler.write('A sentence.\nNew sentence.\n')
26
>>> fileHandler.close()

# writtenInStandardForm.txt
A sentence.
New sentence.

This will not be obvious to you when you perform a simple read operation on it.

>>> fileHandler = open('writtenInStandardForm.txt')
>>> fileHandler.read()
'A sentence.\nNew sentence.\n'
>>> fileHandler.close()

This is because \r\n sequences are collapsed into a single \n character by default when reading in text mode.

In text mode, the default when reading is to convert platform-specific line endings (\n on Unix, \r\n on Windows) to just \n. When writing in text mode, the default is to convert occurrences of \n back to platform-specific line endings.

~ Python Documentation

But this alteration of line endings (i.e. replacement of \n with \r\n) IS happening behind the scenes on Windows. To verify this, open the file in a text editor that supports regular expressions, such as Notepad++. Press Ctrl + F to open the Find dialog. In the bottom-left corner of the dialog box, select the radio button labelled Regular expression. In the Find what input box, enter \n. You'll find two occurrences. Now search for \r: you'll again find two occurrences, but no visible characters. They are right there, just before the newline characters. To match each pair together, search for \r\n.
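
You can also check this from Python itself, without a text editor. The following is a minimal sketch, assuming a throwaway temporary directory: it writes the file in text mode and reads it back in binary mode, which shows the bytes exactly as they are stored on disk.

```python
import os
import tempfile

text = 'A sentence.\nNew sentence.\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'writtenInStandardForm.txt')

    # Text mode: '\n' is translated to the platform's line ending on write.
    with open(path, 'w', encoding='ascii') as fh:
        fh.write(text)

    # Binary mode: no translation, so we see the bytes actually on disk.
    with open(path, 'rb') as fh:
        raw = fh.read()

print(repr(os.linesep))  # '\r\n' on Windows, '\n' on Linux and macOS
print(raw)
```

On Windows the printed bytes contain \r\n after each sentence; on Linux and macOS they contain a bare \n, because the platform line ending is already \n there.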

Now, the catch is that this modification is fine for text files, BUT it may corrupt binary data such as that found in image or executable files. That is why it is advised to use the binary access modes when dealing with binary files.

On Unix systems, text mode leaves line endings untouched, but in Python 3 it still decodes the bytes into text, which fails or garbles the data when the file is not text. So on every platform you should use rb/wb for files that are not text files, and r/w for text files.

The takeaway from all this is not to use r/w when you are dealing with binary files such as images and audio or video files, as this may corrupt them.
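
As a small illustration of that takeaway, here is a sketch that round-trips a few bytes through wb and rb. The payload and the file name are made up for the demo; the bytes deliberately include 0x0A and 0x0D 0x0A, which are the values that newline translation would touch.

```python
import os
import tempfile

# Arbitrary bytes that happen to contain '\n' (0x0A) and '\r\n' (0x0D 0x0A).
payload = bytes([0x89, 0x0A, 0x00, 0x0D, 0x0A, 0xFF])
text_mode_error = None

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'payload.bin')

    # In Python 3, text mode refuses bytes outright.
    try:
        with open(path, 'w') as fh:
            fh.write(payload)
    except TypeError as exc:
        text_mode_error = exc

    with open(path, 'wb') as fh:   # binary write: bytes go to disk untouched
        fh.write(payload)

    with open(path, 'rb') as fh:   # binary read: bytes come back untouched
        restored = fh.read()

print(type(text_mode_error).__name__)
print(restored == payload)
```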

The Python documentation on reading and writing files underlines the same point.

Writing text to a binary file

To write text to a binary file, you have to hand Python bytes rather than a string. For a literal made up of ASCII characters, prefixing it with the character 'b' creates a bytes object directly; otherwise, you convert the string into a sequence of bytes yourself.

>>> fH = open('writtenInBinary.txt', 'wb')
>>> fH.write(b'hi')
2
>>> fH.close()

>>> fH = open('writtenInBinary.txt', 'rb')
>>> fH.read()
b'hi'
>>> fH.close()

The 'b' prefix makes the literal a sequence of bytes instead of a string. For ASCII text, this produces the same result as passing the string and an encoding to the built-in bytes type.

>>> bytes('hi', 'utf-8')
b'hi'
>>> bytes('hi', 'utf-8') == b'hi'
True
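
The 'b' prefix only works for literals made of ASCII characters, so text containing anything else has to be encoded explicitly before it is written in binary mode. A minimal sketch, with a made-up sample string and a temporary file:

```python
import os
import tempfile

greeting = 'café'                      # contains a non-ASCII character
data = greeting.encode('utf-8')        # b'caf\xc3\xa9'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'writtenInBinary.txt')

    with open(path, 'wb') as fh:
        written = fh.write(data)       # number of bytes, not characters

    with open(path, 'rb') as fh:
        round_trip = fh.read().decode('utf-8')

print(len(greeting), written)
print(round_trip)
```

The string has four characters but five bytes are written, because 'é' takes two bytes in UTF-8.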

The second argument we supplied in bytes('hi', 'utf-8') is the character encoding used to convert the string. Encoding is the process of transforming a string into a sequence of bytes for storage or transmission; those bytes make sense again only when they are decoded with the same encoding that produced them. A character encoding defines both the set of characters it can represent and the byte sequence that stands for each of them.

For example, let's talk about two encodings: ASCII and Unicode.

ASCII (American Standard Code for Information Interchange) defines 128 characters, numbered 0 to 127: 95 printable ones, which are roughly the characters you can type on a standard US keyboard, plus 33 control characters. Basically, it covers digits, uppercase and lowercase letters, and a handful of other symbols.

Unicode covers almost every character there is. At the time of writing it contained over 128 thousand characters, covering 135 modern and historic scripts as well as multiple symbol sets, as per Wikipedia. Python 3 strings are sequences of Unicode characters. UTF-8 is not Unicode itself but one of several encodings that turn Unicode characters into bytes; it is the default encoding for Python 3 source files and for str.encode().

Now, the Python part. The encode() method acts on a string and produces a sequence of bytes. The decode() method acts on bytes and produces the original string.

>>> "hello".encode(encoding = 'ascii')
b'hello'
>>> b'hello'.decode(encoding = 'ascii')
'hello'
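
The same method helps to separate a character set from an encoding. In this short sketch, one Unicode character has a single code point but a different byte sequence under each encoding (the choice of 'é' and of these three encodings is only for illustration):

```python
ch = 'é'

code_point = ord(ch)                 # 233, written U+00E9 in Unicode notation
as_utf8 = ch.encode('utf-8')         # b'\xc3\xa9'  (two bytes)
as_utf16 = ch.encode('utf-16-le')    # b'\xe9\x00'  (two bytes, different ones)
as_latin1 = ch.encode('latin-1')     # b'\xe9'      (one byte)

print(hex(code_point), as_utf8, as_utf16, as_latin1)
```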

That’s all great. But what happens when you try to encode a string using an encoding that doesn’t have one or more of the characters in the string in its character set?

>>> "é".encode('ascii')  # accented e
Traceback (most recent call last):
  File "<pyshell#75>", line 1, in <module>
    "é".encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)

Python raises a UnicodeEncodeError, telling you that the string you are trying to encode has one or more characters that do not fall in the character set of the encoding you have listed.
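
If raising an error is not what you want, encode() also accepts an errors argument that controls what happens to characters the encoding cannot represent. A brief sketch of three standard handlers follows; note that all of them lose or alter information, so they are no substitute for picking a suitable encoding.

```python
word = 'café'

try:
    word.encode('ascii')                      # default is errors='strict'
except UnicodeEncodeError as exc:
    print('strict:', exc.reason)

replaced = word.encode('ascii', errors='replace')           # b'caf?'
ignored = word.encode('ascii', errors='ignore')             # b'caf'
escaped = word.encode('ascii', errors='backslashreplace')   # b'caf\\xe9'

print(replaced, ignored, escaped)
```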

What happens when you try to decode a byte sequence using an encoding in which those bytes are not valid?

>>> "é".encode('utf-8')
b'\xc3\xa9'
>>> b'\xc3\xa9'.decode('ascii')
Traceback (most recent call last):
  File "<pyshell#82>", line 1, in <module>
    b'\xc3\xa9'.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> b'\xc3\xa9'.decode('utf-8')
'é'

What is imperative to know is that you must decode a byte sequence with the same encoding that was used to encode it.

>>> '阮经天'.encode('utf-16')
b'\xff\xfe.\x96\xcf~)Y'
>>> b'\xff\xfe.\x96\xcf~)Y'.decode('utf-8')
Traceback (most recent call last):
  File "<pyshell#99>", line 1, in <module>
    b'\xff\xfe.\x96\xcf~)Y'.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
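
A mismatch does not always announce itself with an error. Some encodings, such as Latin-1, accept any byte value, so decoding with the wrong one can silently produce garbled text instead. A small sketch of that failure mode, using 'é' as sample data:

```python
original = 'é'
encoded = original.encode('utf-8')        # b'\xc3\xa9'

# Wrong encoding, but no exception: each byte becomes its own character.
garbled = encoded.decode('latin-1')       # 'Ã©'

# The damage is reversible here only because we know both encodings involved.
repaired = garbled.encode('latin-1').decode('utf-8')

print(garbled, repaired)
```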

PEP 263

Marc-André Lemburg and Martin von Löwis, members of the Python community, authored this Python Enhancement Proposal (PEP), asking for an official comment syntax that informs the interpreter about the encoding of a Python script. Editors can use the declaration too: if you introduce a character that the declared encoding cannot represent, IDLE (the editor bundled with Python) tells you on saving that the specified encoding is invalid. You can then either change the declared encoding to one that includes the character, or remove the character.

The Python developers obliged. They said,

If a comment in the first or second line of the Python script matches the regular expression coding[=:]\s*([-\w.]+) , this comment is processed as an encoding declaration; the first group of this expression names the encoding of the source code file. The encoding declaration must appear on a line of its own. If it is the second line, the first line must also be a comment-only line. The recommended forms of an encoding expression are
# -*- coding: <encoding-name> -*-
which is recognized also by GNU Emacs, and
# vim:fileencoding=<encoding-name>
which is recognized by Bram Moolenaar's VIM (Unix text editor).
If no encoding declaration is found, the default encoding is UTF-8.
~ Python Documentation

So, you can declare the encoding of your script by adding this comment to the top of the script:

# -*- coding: encoding_name -*-

Example:

# -*- coding: utf-8 -*-

You can replace the word 'coding' before the colon with 'encoding', and it will still work. This is because the word 'encoding' itself contains 'coding', so # -*- encoding: encoding_name -*- still matches the regular expression coding[=:]\s*([-\w.]+)
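
To see the matching for yourself, you can try the quoted regular expression with the re module. This is only a sketch of the pattern match: `declared_encoding` is a hypothetical helper, and it does not reproduce the other rules from the quote (comment-only lines, first or second line of the file).

```python
import re

# The regular expression quoted from the Python documentation above.
CODING_RE = re.compile(r'coding[=:]\s*([-\w.]+)')

def declared_encoding(line):
    """Return the encoding name found in a comment line, or None."""
    match = CODING_RE.search(line)
    return match.group(1) if match else None

for line in (
    '# -*- coding: utf-8 -*-',
    '# -*- encoding: latin-1 -*-',   # 'encoding' contains 'coding'
    '# vim:fileencoding=utf-8',      # so does 'fileencoding'
    '# just a comment',
):
    print(line, '->', declared_encoding(line))
```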

Example of declaring encoding in a Python file, say encoding.py:

# -*- coding: utf-8 -*-
variableOne = 'Ethan'
print(variableOne)

What happens if the user tries to include an out-of-scope character in the script?

# -*- coding: ascii -*-
varOne = 'é'

Try saving the above as a Python script in IDLE and you'll get a message similar to the one below.

"Invalid encoding 'ascii'.
Saving as UTF-8"

Now, you can either remove the character(s) that fall outside the declared character set or save the script with UTF-8 encoding.
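
The reason the save fails is the same UnicodeEncodeError we met earlier: the script text cannot be turned into bytes using the declared encoding. The sketch below imitates the fallback described in the message; `encode_for_saving` is a hypothetical helper written for this article, not IDLE's actual code.

```python
source = "# -*- coding: ascii -*-\nvarOne = 'é'\n"

def encode_for_saving(text, declared):
    """Encode text with the declared encoding, falling back to UTF-8."""
    try:
        return declared, text.encode(declared)
    except UnicodeEncodeError:
        # 'é' has no ASCII representation, so fall back as the message says.
        return 'utf-8', text.encode('utf-8')

used, data = encode_for_saving(source, 'ascii')
print(used)
print(data)
```

Note that the fallback only changes the bytes written to disk. The declaration line in the script still says ascii, so it should be updated to match the encoding that was actually used.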

Summary of what we learnt today

That's it for this one. In this article, we saw the distinction between normal access modes and binary access modes. We also saw how to write to a binary file, encoding it and decoding it with a particular encoding. To summarize it:

Use r/w access modes while dealing with text files and use rb/wb access modes while dealing with binary files such as images, audio-video files etc.

To get the original text back, always decode bytes using the same encoding that was used to encode them.
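
For text files, both points can be followed at once by passing the encoding argument to open(), so that the same encoding is named when writing and when reading. A minimal sketch, assuming UTF-8 and a made-up file name:

```python
import os
import tempfile

text = 'Crème brûlée\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'dessert.txt')

    # Text mode with an explicit encoding: Python encodes on write...
    with open(path, 'w', encoding='utf-8') as fh:
        fh.write(text)

    # ...and decodes with the same encoding on read.
    with open(path, 'r', encoding='utf-8') as fh:
        same = fh.read()

print(same == text)
```

Without the encoding argument, open() falls back to a platform-dependent default, so a file written on one machine may be decoded differently on another.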

I hope this was of help to you. Cheers!