»recommended« and »mandatory« refer to whether you need to read a section.
The Python modules in the repo complement this guide. Feel free to copy them into your projects and send improvements.
- Write unicode (lowercase) if you mean the Python type.
- Write Unicode (capitalized) if you mean Unicode in general.
- ❃ str vs. unicode vs. bytes and Python 2 vs. Python 3
- When dealing with strings and Unicode in Python, there are two types you have to know. str is a plain list of bytes that just happens to be rendered as a string. unicode is a list of Unicode characters. Python 2 → Python 3: str → bytes, unicode → str.
- ❃ default string type
- The default string type in both Pythons is str, but note that str is different things in Python 2 and Python 3. In Python 3 all string variables inside a program are lists of Unicode characters and we want to have the same in Python 2, because we are forward-looking.
- ❃ the ideal: every string is unicode
- Therefore, we assume all string variables inside our programs to be of type unicode.
- ❃ (nearly) everything outside is str
- When communicating with the outside world and some libraries, we have to convert to or from str.
- ❃ Unicode and UTF-8
- Unicode is different from UTF-8. Read the first paragraph in the blue box at the top of https://pythonhosted.org/kitchen/unicode-frustrations.html.
- ❃ encoding and decoding
- To turn a UTF-8-encoded str (list of bytes) into unicode, use .decode('utf-8'). To turn a unicode into a UTF-8-encoded str, use .encode('utf-8').
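The round trip between text and UTF-8 bytes can be sketched as follows. The sample word is only an illustration, and the unicode_literals import is used so that the snippet should behave the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# A text string (unicode on Python 2, str on Python 3).
text = "Erdkröte"

# Encoding turns text into UTF-8 bytes (str on Python 2, bytes on Python 3).
utf8_text = text.encode('utf-8')

# Decoding turns the bytes back into text.
decoded = utf8_text.decode('utf-8')

# The text has 8 characters, but "ö" takes two bytes in UTF-8,
# so the encoded form is one byte longer.
n_chars = len(text)
n_bytes = len(utf8_text)
```

Comparing n_chars and n_bytes is a quick way to see that Unicode (the characters) and UTF-8 (one way of storing them as bytes) are different things.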
- ❃ unicode_literals
In every Python file, import unicode_literals:
    from __future__ import unicode_literals
If you don't do this, all string literals in your source code will be str, which is against the »every string is unicode« ideal of the Need to Know.
- ❃ str literals
- Use b"bla" to write a str "bla".
- ❃ string conversion
- Use unicode() instead of str() when you want to convert numbers etc. to strings.
- ❃ naming convention
If there is a string variable that needs to be of type str inside your program, prefix it with b_ if you don't know the encoding, or with utf8_ if you know it is UTF-8:
    b_company_name = read_company_name_str()
    utf8_company_name = read_company_name_utf8()
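The literal, conversion and naming rules above might look like this together. The aliases text_type and bytes_type are assumptions of this sketch (they are not part of the guide or its modules); they only serve to make the snippet run on Python 2.7 and Python 3 alike.

```python
from __future__ import unicode_literals

# Hypothetical aliases for the two string types.
text_type = type("")     # unicode on Python 2, str on Python 3
bytes_type = type(b"")   # str on Python 2, bytes on Python 3

# A plain literal is text because of unicode_literals.
company_name = "bla"

# A b"" literal is a byte string; b_ prefix: encoding unknown.
b_company_name = b"bla"

# utf8_ prefix: a byte string that is known to be UTF-8.
utf8_company_name = company_name.encode('utf-8')

# Convert numbers with the text type (unicode() on Python 2), not str().
answer = text_type(42)
```

On Python 2 you would simply write unicode(42). The alias is only there so that the same lines also run on Python 3.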
- ❃ reading and writing files
When you want to read from or write to a file, use codecs.open() instead of the built-in open():
    >>> from __future__ import unicode_literals
    >>> import codecs
    >>> with codecs.open("bla.txt", 'w', 'utf-8') as f:
    ...     f.write("üüü")
    ...
    >>> with codecs.open("bla.txt", 'r', 'utf-8') as f:
    ...     f.read(3)
    ...
    u'\xfc\xfc\xfc'
    >>> 'ü' * 3
    u'\xfc\xfc\xfc'
- ❃ print
Everything that is written to the outside world should be str. This normally includes parameters to print. In order to avoid having to convert your unicodes all the time, write at the top of every file, but after all imports:
    import sys
    import codecs
    # and other imports

    if not isinstance(sys.stdout, codecs.StreamWriter):
        sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

    # main code follows
(Don't forget to add imports for sys and codecs if they aren't there already.) This way you can do print(unicode). Note, however, that now it's dangerous to do print(str). Never pass a str to print unless you're sure it contains only ASCII. In such cases, write a clarifying comment.
- ❃ exceptions and warnings
- When raising exceptions or warnings, only pass str. Think twice whether the thing you're passing really is str!
- ❃ print to sys.stderr
- We don't put a UTF-8 writer in front of sys.stderr, since that would cause even more confusion. So make sure that everything you send there is str.
- ❃ external libraries
- Check whether the library procedures you're calling accept and return str or unicode. If they accept and return str, take care to make the right conversions. Below are notes on which libraries do what.
- ❃ environment variables
- Use unicode_environ.getenv and unicode_environ.environ instead of os.getenv and os.environ. If you need to do anything else with the environment, extend unicode_environ instead of resorting to environment utilities from os.
- ❃ command line arguments
- Command line arguments come as str and you need to convert them. Unfortunately, passing type=unicode to ArgumentParser.add_argument is not enough. Use unicode_argparse.ArgumentParser instead of argparse.ArgumentParser.
- ❃ testing
- In your tests, try to break the system by including non-ASCII characters in strings. If you can't succeed, chances are good that you have done the Unicode thing correctly.
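A test in this spirit could look roughly like the following sketch. save_name and load_name are hypothetical stand-ins for whatever code of yours touches the outside world, and the sample strings are arbitrary non-ASCII text.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import codecs
import os
import tempfile


def save_name(path, name):
    """Write a text string to path as UTF-8 (hypothetical code under test)."""
    with codecs.open(path, 'w', 'utf-8') as f:
        f.write(name)


def load_name(path):
    """Read the whole file at path back as text (hypothetical code under test)."""
    with codecs.open(path, 'r', 'utf-8') as f:
        return f.read()


def roundtrip(name):
    """Store name in a temporary file, read it back and clean up."""
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "name.txt")
    try:
        save_name(path, name)
        return load_name(path)
    finally:
        if os.path.exists(path):
            os.remove(path)
        os.rmdir(directory)


# Deliberately nasty input: umlauts, CJK characters and a symbol.
tricky_names = ["Möhn", "Erdkröte", "tüüls", "日本語", "❃"]
results = [roundtrip(name) for name in tricky_names]
```

If results equals tricky_names, the strings survived the trip to disk and back unchanged.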
- ❃ CONSTANT VIGILANCE!
- When you read data from or write data to somewhere outside your program, make sure it gets converted to the right types.
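One way to keep the conversions at the boundary in one place is a pair of small helpers, sketched below. The names to_text and to_utf8 are made up for this example; they are not the functions from the guide's own modules.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# Hypothetical aliases for the two string types.
text_type = type("")     # unicode on Python 2, str on Python 3
bytes_type = type(b"")   # str on Python 2, bytes on Python 3


def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it if it arrives as bytes."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes_type):
        return value.decode(encoding)
    # str("...") keeps the exception message a native str on both Pythons.
    raise TypeError(str("expected text or bytes, got %s")
                    % type(value).__name__)


def to_utf8(value):
    """Return value as UTF-8 bytes, encoding it if it is text.

    Bytes are passed through unchanged, on the assumption that they
    are already UTF-8.
    """
    if isinstance(value, bytes_type):
        return value
    if isinstance(value, text_type):
        return value.encode('utf-8')
    raise TypeError(str("expected text or bytes, got %s")
                    % type(value).__name__)


incoming = to_text(b"t\xc3\xbc\xc3\xbcls")   # bytes from outside become text
outgoing = to_utf8("tüüls")                   # text becomes bytes on the way out
```

Raising TypeError for anything that is neither text nor bytes is a design choice of this sketch: a wrong type at the boundary then fails loudly instead of being silently stringified.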
You may make project-specific exceptions to these rules if they get annoying. Be sure to document them.
Example for a project that uses Pygit2 often:
- ❃ Git SHA1s
- Git SHA1s as returned by Oid.hex are of type str. Since they never contain non-ASCII characters and it would be annoying to convert them all the time, we leave them as str. Since we know that they are str and it is annoying to write prefixes, it is okay to leave off the b_. (Not so sure if this is good, though.)
- ❃ UTF-8-encoded source
In the first or second line of every Python file, put the following:
# -*- coding: utf-8 -*-
Doing this will allow you to use non-ASCII characters in your Python source.
- ❃ unicodification (stringification)
Implement __unicode__ and __str__ like this (credits):
    def __unicode__(self):
        return …  # create unicode representation of your object

    def __str__(self):
        return unicode(self).encode('utf-8')
- ❃ writing Unicode utilities
- If you want to write utilities like unicode_environ and unicode_argparse, you might find the functions from unicode_tools helpful.
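To make the __unicode__/__str__ pattern from the unicodification item concrete, here is a sketch with a made-up Company class. The version check is an addition of this example so that it also runs on Python 3, where __str__ has to return text. On Python 2 alone, the two-method form shown above is enough.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import sys


class Company(object):
    """Hypothetical class used only to illustrate the pattern."""

    def __init__(self, name):
        self.name = name

    def __unicode__(self):
        # The one place where the text representation is built.
        return "Company: " + self.name

    if sys.version_info[0] >= 3:
        def __str__(self):
            # Python 3: str is already the Unicode text type.
            return self.__unicode__()
    else:
        def __str__(self):
            # Python 2: str must be bytes, so encode as UTF-8.
            return self.__unicode__().encode('utf-8')


company = Company("Erdkröte")
as_text = company.__unicode__()
as_native_str = str(company)
```

Building the representation only in __unicode__ and deriving __str__ from it keeps the two from drifting apart.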
When I write something like »works with unicode arguments«, I mean that it
works with arguments of type unicode which can contain arbitrary
characters, i. e. ASCII as well as non-ASCII.
Feel free to extend, or correct if things have changed.
codecs.open works with unicode as well as str filenames.
datetime.datetime.strftime(unicode): str
httplib2.Http.request works with unicode arguments. However, the
results will all contain or be of type str. Example:
>>> r, c = httplib2.Http(".cache").request("http://de.wikipedia.org/wiki/Erdkröte")
>>> r['content-type']
'text/html; charset=UTF-8'
>>> type(r['content-type'])
<type 'str'>
>>> type(c)
<type 'str'>
Things in os are generally safe to use with unicode. However, note this:
- path.join(unicode, unicode): unicode
- path.relpath(unicode, unicode): str or unicode (!!!) If the result contains non-ASCII characters, it will be unicode, otherwise str. Isn't it sweet?
PyCurl works solely on strs.
- Config values can be unicode.
- Commit.hex: str
- Commit.message: unicode
- Paths are str. However, this is extrapolated from the fact that Patch.delta.{old,new}_file.path is str. The API might be inconsistent, so check the thing you're using and add the data here.
- Reference.name, Reference.shorthand: str
  - However, Repository.lookup_reference(unicode) works.
- Refspecs should be str. Remote.add_fetch doesn't complain when you pass unicode, but Remote.fetch_refspecs throws an exception if you added a refspec with non-ASCII characters. Funny enough, though, Remote.fetch_refspecs is a list of unicode.
- Repository(path) doesn't work with unicodes containing non-ASCII characters. In order to be sure, I'd say that all paths passed to Pygit2 methods or the like should be converted to UTF-8 strs first.
- Signature.name, Signature.email: unicode. If you need str, you can use Signature.raw_name and Signature.raw_email.
Trivia:
>>> no_r = pygit2.Repository("/tmp/tüüls") # throws error
>>> r = pygit2.clone_repository("/tmp/tüüls", "./tüüls") # works
>>> r.remotes[0].url # throws error
re is completely okay with unicode everywhere.
textile.textile returns unicode if you give it unicode.
urllib2 didn't like unicode for URLs and also returned str only. Since
urllib is older, I guess it's the same there.
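If a library only takes str URLs, one possible workaround (an assumption of this sketch, not something tested against urllib2 here) is to UTF-8-encode and percent-escape the non-ASCII part before handing it over. ascii_url_path is a hypothetical helper name, and the import fallback is only there so that the snippet runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

try:
    from urllib.parse import quote   # Python 3
except ImportError:
    from urllib import quote          # Python 2


def ascii_url_path(path):
    """Return a text URL path as a pure-ASCII, percent-escaped string."""
    # Encode to UTF-8 first, then escape every byte that is not URL-safe.
    # quote() leaves "/" alone by default.
    return quote(path.encode('utf-8'))


escaped = ascii_url_path("/wiki/Erdkröte")
```

The result contains only ASCII characters, so it can be passed to libraries that insist on str without risking an encoding error.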
- https://docs.python.org/2.7/howto/unicode.html
- https://pythonhosted.org/kitchen/unicode-frustrations.html
- http://python-future.org/unicode_literals.html
- the documentation of the mentioned modules or libraries
- Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
If you are in an industrious mood, you can help improve this document and the modules.
- I marked up many things as literal text. It would be nice if you could change this to interpreted text, such as :meth:`pygit2.Diff.merge`. But you'd also have to find the right way to convert this to HTML, since rst2html doesn't like meth (as well as the other Python-specific roles, I guess).
- As stated above, the notes on which libraries do what are always happy to be updated and extended.
Copyright (c) 2015 Richard Möhn
This work is licensed under the Creative Commons Attribution 4.0 International License.