Need help with your Django project?

Django ORM Under the Hood - Iterables

Ivaylo Donchev

Nov 1, 2021

Categories:Django

Intro

In our daily work as web developers, we use the Django ORM a lot. It's a great tool that makes it easy to interact with the database by just calling any of the methods for CRUD operations:

Create(INSERT) - QuerySet.create
Read(SELECT) - QuerySet.all, QuerySet.values and QuerySet.values_list
Update - QuerySet.update
Delete - QuerySet.delete

But the all/values/values_list methods are quite different - they don't immediately trigger a query, even if you chain other QuerySet methods to them (.filters / .order_by / etc.), but the data is there when you start iterating over it.

Here're 5 questions that we're going to focus on:

When does the ORM execute the SELECT query?
How do the all, values and values_list affect the queryset ?
What data structure do we need for better performance ?
Can we define a custom form for the DB data?
Does the order of methods chaining affect the order of execution of the QuerySet ?

QuerySet as a generator vs QuerySet as an iterable

Theory

Let's say we have the following model:

class User(Model):
    email = CharField(max_length=255)
    name = CharField(max_length=255)

Let's take a look at the following code snippet:

users = User.objects.all()

hacksoft_users = users.filter(email__icontains='@hacksoft.io')

for user in hacksoft_users:
    pass

for user in hacksoft_users:
    pass

How many queries will the code above do? When? Why?

Here're a few facts about the querysets:

The QuerySet is immutable - chaining methods to our queryset doesn't modify the original queryset - it creates a new one.
The QuerySet is a generator when you iterate over it for the first time - when you start iterating over the queryset, internally it executes a SELECT query and yields the DB rows shaped into the desired Python data structure.
The QuerySet is an iterable - once we've iterated over the queryset once, the queryset puts the DB result into a cache. On every subsequent iteration, we'll use the cached objects. This prevents us from unwanted queries duplication.

As a summary:

The queryset object is immutable and it triggers a SELECT query only when you iterate over it for the first time.

So let's get back on our example:

users = User.objects.all()  # Creates a queryset

hacksoft_users = users.filter(email__icontains='@hacksoft.io') # Creates a new queryset

for user in hacksoft_users:  # Makes SELECT query and yields the result
    pass

for user in hacksoft_users:  # Just yields the cached result
    pass

Based on the unique querysets first iterations, the code above makes 1 SELECT query.

Implementation

In Python, the way to define an iterable class is by using the __iter__ method. That's how the QuerySet class works. Let's take a look at the QuerySet.__iter__:

class QuerySet:
    ...
    def _fetch_all(self):
        if self._result_cache is None:
            self._result_cache = list(self._iterable_class(self))
        # ... more code to handle prefetched relations

    def __iter__(self):
        self._fetch_all()
        return iter(self._result_cache)

As we see in the _fetch_all method we have 2 cases:

the _result_cache is None - that's when we iterate over the queryset for the first time. The queryset now needs to iterate over the _iterable_class (we'll take a closer look there in the next section) that will take care of the DB fetching and will put the DB rows into the desired Python structure.
the _result_cache is not None - the queryset will just iterate over it.

Q: When does the ORM execute the SELECT query?
A: The ORM execute a SELECT query only the first time you iterate over a QuerySet

Iterable classes

Let's focus on the QuerySet._iterable_class and see what it does with the SELECT query's data.

The _iterable_class has two functions:

calls the SQL compiler to execute SELECT query
puts the raw database data (a list of tuples) into ORM objects(.all), dictionaries(.values) or tuples(.values_list) and return it

We have the following types of "iterable classes" that comes from the Django ORM:

ModelIterable (link) - used by .all and yields ORM objects
ValuesIterable (link) - set when .values is called and yields dictionaries
ValuesListIterable (link), NamedValuesListIterable (link) and FlatValuesListIterable (link) - set when .values_list is called (we have 3 iterable classes here since values_list returns different formats depending on the named and flat arguments)

NOTE: The ModelIterable is the default _iterable_class of the QuerySet - that's why we could iterate over User.objects.filter without calling all explicitly.

Implementation

So let's take a look ValuesIterable class:

class ValuesIterable(BaseIterable):
    def __iter__(self):
        queryset = self.queryset
        query = queryset.query
        compiler = query.get_compiler(queryset.db)

        names = [
            *query.extra_select,
            *query.values_select,
            *query.annotation_select,
        ]
        indexes = range(len(names))
        for row in compiler.results_iter(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size):
            yield {names[i]: row[i] for i in indexes}

As we see the dictionary form of the DB data is done by the {names[i]: row[i] for i in indexes} expression.

So why should we care about this? Well, for 2 reasons:

ORM performance
Flexibility to extend the QuerySet interface

Q: How do the all, values and values_list affect the queryset ?
A: By changing the QuerySet's iterable class, they define the data structure for the DB data.

QuerySet Performance - SQL vs Python

The ORM SELECT query performance is a result of two other components - SQL query performance and Django ORM (Python) performance.

SQL query performance

The SQL query performance is a huge topic and we're not going to get into it. Here're a few common sense things that could make our DB work faster:

query only for the DB column you need - use .only and .defer
query only for the DB rows you need - use .filter and .exclude
use DB indexes

ORM performance

As soon as the DB finishes its job, it's time for the ORM's QuerySet._iterable_class to go through each DB row and put it into an ORM data structure. Since this is done for each row, the performance between .all , .values_list and .values_list could be significant (that's visible especially when working with huge data sets).

Although QuerySet.all's ModelIterable allows us to use the entire power of the ORM, it has a price - it makes a model instance for each DB row and attaches all the related objects(also models instances). This is a really expensive operation - both for the CPU and the RAM.

If we fetch too much data and we need it read-only, .values and .values_list are a better choice than .all.

Q: What data structure do we need for better performance ?
A: The rule of thumb here is "The more data you fetch, the lighter data structure you should use.". An expensive structure like the ORM object could seriously affect the QuerySet performance when you work with big volumes of data.

Writing a custom iterable class

Inspired from the functional programming, let's write a map() method on the QuerySet objects

Here is the interface we want to achieve:

users = User.objects.filter(email__icontains='@hacksoft.io')

user_ids = users.map(lambda el: el.id)

print(user_ids)
# >>> <UserQuerySet [1, 2, ...]>

Let's do it step by step:

Define custom queryset with empty map function:

class UserQuerySet(QuerySet):
    def map(self, func):
    	# We need to clone our queryset to keep the immutable behaviour
        clone = self._chain()

        return clone


class User(Model):
    objects = UserQuerySet.as_manager()

    email = CharField(max_length=255)
    name = CharField(max_length=255)

2. Add an iterable class that inherits the current iterable class so we chain our map method to all of the all, values and values_list

class UserQuerySet(QuerySet):
    def map(self, func):
        class MapIterable(self._iterable_class):
            def __iter__(self):
                return (
                    obj
                    for obj in super().__iter__()
                )

        # Keep the queryset immutable behaviour
        clone = self._chain()
        clone._iterable_class = MapIterable

        return clone

3. Apply our custom function in the map

class UserQuerySet(QuerySet):
    def map(self, func):
        class MapIterable(self._iterable_class):
            def __iter__(self):
                return (
                    func(obj)  # <-- apply the function
                    for obj in super().__iter__()
                )

        # Keep the queryset immutable behaviour
        clone = self._chain()
        clone._iterable_class = MapIterable

        return clone

As we use the current data structure of the queryset in our map function we can do the following things:

def attach_verbose_name(user):
	# The user here is a dictionary since we used `.values()` in our queryset
    user['verbose_name'] = f'{user["name"]} ({user["email"]})'
    return user

User.objects \
  .filter(id__gt=2) \
  .values() \  # the data structure will be a `dict`
  .map(attach_verbose_name) \  # Will work with dictionaries (`.values`)
  .map(lambda el: el['verbose_name'])
>>> <UserQuerySet ['User 3 (user3@hacksoft.io)']>

Q: Can we define a custom form for the DB data?
A: Yes - by defining a new QuerySet method with it's own iterable class.

Methods chaining and order of execution

The order of the method chaining is not always the same as the order of execution.
We could categorize the QuerySet methods into 2 categories:

Methods that modify the SQL query - filter/ exclude/ annotate/ only / etc. They are "executed" into the database when it runs the SQL query.
Methods that define the data structure - all/ values / values_list/etc. They're executed in our Django app (by iterating over the iterable class and modifying the data)

The ORM allows us to chain the same methods in almost any order. But, no matter the order of chaining, the order of execution will always be:

Execute the methods that are modifying the SQL query
Run the query in the database
Execute the methods that define the data structure

That's why the following example will fail:

def attach_annotated_id(user):
	# The user here is a dictionary since we used `.values()` in our queryset
    user['annotated_id'] = user['id']
    return user

User.objects \
  .values() \  # Execution order: 2
  .map(attach_annotated_id) \  # Execution order: 3
  .filter(annotated_id__gte=2)  # Execution order: 1

Q: Does the order of methods chaining affects the order of execution of the QuerySet ?
A: It does, but the methods that modify the data structure are ALWAYS executed after the methods that modify the SQL queries by design.

Summary

Django ORM's QuerySet methods that cover each of the CRUD operations are:

Create(INSERT) - QuerySet.create
Read(SELECT) - QuerySet.__iter__
Update - QuerySet.update
Delete - QuerySet.delete

Back to the questions:

1. When does the ORM actually execute the SELECT query?

The ORM execute a SELECT query only the first time you iterate over a QuerySet

2. How do the all, values and values_list affect the queryset ?

By changing the QuerySet's iterable class, they define the data structure for the DB data.

3. What data structure do we need for better performance ?

The more data you fetch, the lighter data structure you should use.

4. Can we define a custom form for the DB data?

Yes - by defining a new QuerySet method with it's own iterable class.

5. Does the order of methods chaining affects the order of execution of the QuerySet ?

It does, but the methods that modify the data structure are ALWAYS executed after the methods that modify the SQL queries by design.

I hope this article helps you learn more about the Django ORM data fetching strategy and how we can resolve possible performance issues.

For more Django related blog posts - check the rest of the Django articles in our blog ✌

Need help with your Django project?

Check our django services

Your development partner beyond code.