C for Yourself, Part I: Using the REPL to Explore Python Internals

15 Mar 2018

Not coming from a traditional CS background, I often felt like I was flying blind during my first year of coding Python. I understood the syntax well enough, but I had no mental model for what was actually going on inside my laptop. Articles about Python internals were helpful, but I never felt fully convinced because it was all hearsay that I couldn’t test out myself. What ended up being the biggest revelation was finding out that I could actually dissect Python objects right inside the REPL! As it turns out, you can take things all the way down to bytes and virtual memory addresses without leaving the interpreter. Rather than digging through old C code, I was able to fiddle with instances of common data types and just see what happened. Much more fun!

The main tool you need to do this is the ctypes library, which (as the name suggests) provides data types that are compatible with C. That way, you can decode the bytes that Python is storing anywhere in memory. The other useful tool is the id() built-in function, which in CPython gives you the memory address of whatever object you pass in. With these two little weapons, you can do quite a lot of damage. Let’s dissect some integers and tuples.
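Before cutting anything open, it's worth a quick sanity check of the two assumptions everything else rests on: that blocks are eight bytes wide, and that id() really is a memory address. The snippet below is only a sketch (the variable names are mine). It assumes CPython, since other Python implementations don't promise that id() is an address.

```python
import ctypes

# Size of a pointer in bytes: 8 on a 64-bit build, 4 on a 32-bit one.
pointer_size = ctypes.sizeof(ctypes.c_void_p)
print(pointer_size)

# If id() is an address, casting it back to a Python object
# should hand us the very same object we started with.
original = ["some", "list"]
recovered = ctypes.cast(id(original), ctypes.py_object).value
print(recovered is original)  # True on CPython
```

If the first line prints 4 rather than 8, every offset in the rest of this post needs to be halved.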

The simplest example to start with is an integer. The diagram below shows how a Python integer is actually stored in memory, where each block represents eight bytes (assuming you’re on a 64-bit machine):

[ Reference count ] [ Type address ] [ Value ]

The first block, the reference count, tracks how many references point to the object. The second block, the type address, is the address of that object’s type in memory. The last block is the actual value of the integer. Now how can I see all of this with my own eyes at the REPL? Let’s start with the reference count. We can read the first block by interpreting the eight bytes at the object’s address as a C long:

>>> import ctypes
>>> ctypes.c_long.from_address(id(1))
c_long(607)

Whoa! Can that really be a reference count? We haven’t assigned any variables to 1, let alone 607 of them. However, we have to keep in mind that Python has already loaded its built-in modules into memory, and CPython shares a single object for each small integer, so every use of 1 in that code counts as a reference. In any case, we can test that this number is a reference count by assigning a variable of our own:

>>> my_int = 1
>>> ctypes.c_long.from_address(id(1))
c_long(608)
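
The same experiment is easier to reason about on an object that nothing else refers to. Here is a minimal sketch, assuming a standard 64-bit CPython build; refcount_at is a helper name I made up. It uses c_ssize_t instead of c_long because c_long is only four bytes on 64-bit Windows. As far as I know, CPython 3.12 and later also mark small integers like 1 as "immortal" and stop updating their counts, which is another reason to use a fresh object here.

```python
import ctypes

def refcount_at(address):
    """Read the first block of the object at `address` as a signed integer."""
    return ctypes.c_ssize_t.from_address(address).value

fresh = object()   # a brand-new object that only we refer to
address = id(fresh)

before = refcount_at(address)
extra_refs = [fresh, fresh, fresh]   # three more references to the same object
after = refcount_at(address)

print(after - before)  # 3 on a standard build
```

Passing the address rather than the object itself keeps the helper from adding a temporary reference of its own while it reads the counter.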

Ok cool! We just watched the internal reference counter at work. Now, let’s check the next block. We can use the same code as before, but just bump up the memory address by 8 bytes. This block is supposed to hold the memory address of the integer type:

>>> ctypes.c_long.from_address(id(1)+8)
c_long(4296527200)
>>> id(type(1))
4296527200

The memory addresses match! So this block is indeed the address of the integer type. Finally, let’s hop forward another 8 bytes and unpack the value:

>>> ctypes.c_long.from_address(id(1)+16)
c_long(1)

Great! Since we were looking at the integer 1, that’s exactly what we’d hope to see. And there you have it! You now know how the number one works in Python, in a very literal sense.
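
Instead of adding offsets by hand, you can describe the three blocks once with a ctypes.Structure and lay it over an object. The sketch below does this for a float rather than an int, because the integer layout depends on the CPython version (as far as I know, Python 3 stores ints as variable-length objects with a size field ahead of the digits). The class names are hypothetical, and the layout assumed here (two header blocks followed by a C double) is for a standard 64-bit CPython build.

```python
import ctypes

class PyObjectHeader(ctypes.Structure):
    # Hypothetical name: mirrors the first two blocks in the diagram.
    _fields_ = [
        ("refcount", ctypes.c_ssize_t),
        ("type_address", ctypes.c_void_p),
    ]

class PyFloatLayout(ctypes.Structure):
    # Assumed float layout: the header, then the value as a C double.
    _fields_ = [
        ("header", PyObjectHeader),
        ("value", ctypes.c_double),
    ]

x = 2.5
view = PyFloatLayout.from_address(id(x))

print(view.header.type_address == id(float))  # True
print(view.value)                             # 2.5
```

The structure is just a window onto the same bytes we read before, so view.header.refcount gives the reference count as well.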

Now let’s look at a tuple. Tuples are stored in memory as follows, where each block is again eight bytes:

[Reference count] [Type address] [Length] [Item Address #1] … [Item Address #N]

The reference count and type address are the same as in the integer case, so I’ll skip those. Let’s start out by testing the length slot:

>>> small_tuple = (1, 2)
>>> ctypes.c_long.from_address(id(small_tuple)+16)
c_long(2)
>>> big_tuple = (1, 2, 3, 4, 5, 6)
>>> ctypes.c_long.from_address(id(big_tuple)+16)
c_long(6)

Ok! We could test this further, but it looks like we’re accurately tracking the length. Now let’s make sure that the remaining slots hold the addresses of the tuple’s elements. The value in each of these slots should match the output of calling id() on the corresponding element:

>>> my_tuple = (1, "two")
>>> ctypes.c_long.from_address(id(my_tuple)+24)
c_long(4298159368)
>>> id(my_tuple[0])
4298159368
>>> ctypes.c_long.from_address(id(my_tuple)+32)
c_long(4337354632)
>>> id(my_tuple[1])
4337354632
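
The same check can be run over every slot at once. This is a rough sketch with helper names of my own. Rather than hard-coding 16 and 24, it asks Python for the offsets: I'm assuming object.__basicsize__ is the size of the two-block header (so the length sits right after it) and tuple.__basicsize__ is the size of everything before the first item slot. Both assumptions hold on the CPython builds I know of, but treat them as assumptions.

```python
import ctypes

WORD = ctypes.sizeof(ctypes.c_void_p)     # 8 bytes on a 64-bit build
LENGTH_OFFSET = object.__basicsize__      # length follows the plain header
ITEMS_OFFSET = tuple.__basicsize__        # item slots follow the fixed part

def tuple_length(t):
    """Read the length slot straight out of the tuple's memory."""
    return ctypes.c_ssize_t.from_address(id(t) + LENGTH_OFFSET).value

def tuple_item_addresses(t):
    """Read each item slot and return the addresses stored there."""
    first_slot = id(t) + ITEMS_OFFSET
    return [
        ctypes.c_void_p.from_address(first_slot + i * WORD).value
        for i in range(len(t))
    ]

my_tuple = (1, "two", 3.0)
print(tuple_length(my_tuple))                                       # 3
print(tuple_item_addresses(my_tuple) == [id(x) for x in my_tuple])  # True
```

On the setup used in this post, LENGTH_OFFSET and ITEMS_OFFSET come out to 16 and 24, the same numbers typed by hand above.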

They both match! So we’ve managed to dissect every part of a Python tuple now as well. In the next installment of this series, we’ll be dissecting two more complicated examples: lists and dictionaries. Stay tuned! As a final note, I have to give credit to a blog post by Christian Perone that tipped me off to the power of the ctypes library. I’ve learned a lot from his blog, and highly recommend checking it out.