In our daily work as web developers, we use the Django ORM a lot. It's a great tool that makes it easy to interact with the database by just calling any of the methods for CRUD operations:
QuerySet.create
QuerySet.all
, QuerySet.values
and QuerySet.values_list
QuerySet.update
QuerySet.delete
But the all/values/values_list
methods are quite different - they don't immediately trigger a query, even if you chain other QuerySet methods to them (.filters
/ .order_by
/ etc.), but the data is there when you start iterating over it.
Here're 5 questions that we're going to focus on:
SELECT
query?all
, values
and values_list
affect the queryset ?Let's say we have the following model:
class User(Model):
email = CharField(max_length=255)
name = CharField(max_length=255)
Let's take a look at the following code snippet:
users = User.objects.all()
hacksoft_users = users.filter(email__icontains='@hacksoft.io')
for user in hacksoft_users:
pass
for user in hacksoft_users:
pass
How many queries will the code above do? When? Why?
Here're a few facts about the querysets:
SELECT
query and yields the DB rows shaped into the desired Python data structure.As a summary:
The queryset object is immutable and it triggers a SELECT query only when you iterate over it for the first time.
So let's get back on our example:
users = User.objects.all() # Creates a queryset
hacksoft_users = users.filter(email__icontains='@hacksoft.io') # Creates a new queryset
for user in hacksoft_users: # Makes SELECT query and yields the result
pass
for user in hacksoft_users: # Just yields the cached result
pass
Based on the unique querysets first iterations, the code above makes 1 SELECT
query.
In Python, the way to define an iterable class is by using the __iter__
method. That's how the QuerySet
class works. Let's take a look at the QuerySet.__iter__
:
class QuerySet:
...
def _fetch_all(self):
if self._result_cache is None:
self._result_cache = list(self._iterable_class(self))
# ... more code to handle prefetched relations
def __iter__(self):
self._fetch_all()
return iter(self._result_cache)
As we see in the _fetch_all
method we have 2 cases:
_result_cache
is None
- that's when we iterate over the queryset for the first time. The queryset now needs to iterate over the _iterable_class
(we'll take a closer look there in the next section) that will take care of the DB fetching and will put the DB rows into the desired Python structure._result_cache
is not None
- the queryset will just iterate over it.Q: When does the ORM execute theSELECT
query?
A: The ORM execute aSELECT
query only the first time you iterate over a QuerySet
Let's focus on the QuerySet._iterable_class
and see what it does with the SELECT
query's data.
The _iterable_class
has two functions:
SELECT
query.all
), dictionaries(.values
) or tuples(.values_list
) and return itWe have the following types of "iterable classes" that comes from the Django ORM:
ModelIterable
(link) - used by .all
and yields ORM objectsValuesIterable
(link) - set when .values
is called and yields dictionaries ValuesListIterable
(link), NamedValuesListIterable
(link) and FlatValuesListIterable
(link) - set when .values_list
is called (we have 3 iterable classes here since values_list
returns different formats depending on the named
and flat
arguments)NOTE: The ModelIterable
is the default _iterable_class
of the QuerySet - that's why we could iterate over User.objects.filter
without calling all
explicitly.
So let's take a look ValuesIterable
class:
class ValuesIterable(BaseIterable):
def __iter__(self):
queryset = self.queryset
query = queryset.query
compiler = query.get_compiler(queryset.db)
names = [
*query.extra_select,
*query.values_select,
*query.annotation_select,
]
indexes = range(len(names))
for row in compiler.results_iter(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size):
yield {names[i]: row[i] for i in indexes}
As we see the dictionary form of the DB data is done by the {names[i]: row[i] for i in indexes}
expression.
So why should we care about this? Well, for 2 reasons:
Q: How do theall
,values
andvalues_list
affect the queryset ?
A: By changing the QuerySet's iterable class, they define the data structure for the DB data.
The ORM SELECT
query performance is a result of two other components - SQL query performance and Django ORM (Python) performance.
The SQL query performance is a huge topic and we're not going to get into it. Here're a few common sense things that could make our DB work faster:
.only
and .defer
.filter
and .exclude
As soon as the DB finishes its job, it's time for the ORM's QuerySet._iterable_class
to go through each DB row and put it into an ORM data structure. Since this is done for each row, the performance between .all
, .values_list
and .values_list
could be significant (that's visible especially when working with huge data sets).
Although QuerySet.all
's ModelIterable
allows us to use the entire power of the ORM, it has a price - it makes a model instance for each DB row and attaches all the related objects(also models instances). This is a really expensive operation - both for the CPU and the RAM.
If we fetch too much data and we need it read-only, .values
and .values_list
are a better choice than .all
.
Q: What data structure do we need for better performance ?
A: The rule of thumb here is "The more data you fetch, the lighter data structure you should use.". An expensive structure like the ORM object could seriously affect the QuerySet performance when you work with big volumes of data.
Inspired from the functional programming, let's write a map()
method on the QuerySet objects
Here is the interface we want to achieve:
users = User.objects.filter(email__icontains='@hacksoft.io')
user_ids = users.map(lambda el: el.id)
print(user_ids)
# >>> <UserQuerySet [1, 2, ...]>
Let's do it step by step:
map
function:class UserQuerySet(QuerySet):
def map(self, func):
# We need to clone our queryset to keep the immutable behaviour
clone = self._chain()
return clone
class User(Model):
objects = UserQuerySet.as_manager()
email = CharField(max_length=255)
name = CharField(max_length=255)
2. Add an iterable class that inherits the current iterable class so we chain our map
method to all of the all
, values
and values_list
class UserQuerySet(QuerySet):
def map(self, func):
class MapIterable(self._iterable_class):
def __iter__(self):
return (
obj
for obj in super().__iter__()
)
# Keep the queryset immutable behaviour
clone = self._chain()
clone._iterable_class = MapIterable
return clone
3. Apply our custom function in the map
class UserQuerySet(QuerySet):
def map(self, func):
class MapIterable(self._iterable_class):
def __iter__(self):
return (
func(obj) # <-- apply the function
for obj in super().__iter__()
)
# Keep the queryset immutable behaviour
clone = self._chain()
clone._iterable_class = MapIterable
return clone
As we use the current data structure of the queryset in our map function we can do the following things:
def attach_verbose_name(user):
# The user here is a dictionary since we used `.values()` in our queryset
user['verbose_name'] = f'{user["name"]} ({user["email"]})'
return user
User.objects \
.filter(id__gt=2) \
.values() \ # the data structure will be a `dict`
.map(attach_verbose_name) \ # Will work with dictionaries (`.values`)
.map(lambda el: el['verbose_name'])
>>> <UserQuerySet ['User 3 (user3@hacksoft.io)']>
Q: Can we define a custom form for the DB data?
A: Yes - by defining a new QuerySet method with it's own iterable class.
The order of the method chaining is not always the same as the order of execution.
We could categorize the QuerySet methods into 2 categories:
filter
/ exclude
/ annotate
/ only
/ etc. They are "executed" into the database when it runs the SQL query.all
/ values
/ values_list
/etc. They're executed in our Django app (by iterating over the iterable class and modifying the data)The ORM allows us to chain the same methods in almost any order. But, no matter the order of chaining, the order of execution will always be:
That's why the following example will fail:
def attach_annotated_id(user):
# The user here is a dictionary since we used `.values()` in our queryset
user['annotated_id'] = user['id']
return user
User.objects \
.values() \ # Execution order: 2
.map(attach_annotated_id) \ # Execution order: 3
.filter(annotated_id__gte=2) # Execution order: 1
Q: Does the order of methods chaining affects the order of execution of the QuerySet ?
A: It does, but the methods that modify the data structure are ALWAYS executed after the methods that modify the SQL queries by design.
Django ORM's QuerySet
methods that cover each of the CRUD operations are:
QuerySet.create
QuerySet.__iter__
QuerySet.update
QuerySet.delete
Back to the questions:
1. When does the ORM actually execute the SELECT
query?
The ORM execute a SELECT
query only the first time you iterate over a QuerySet
2. How do the all
, values
and values_list
affect the queryset ?
By changing the QuerySet's iterable class, they define the data structure for the DB data.
3. What data structure do we need for better performance ?
The more data you fetch, the lighter data structure you should use.
4. Can we define a custom form for the DB data?
Yes - by defining a new QuerySet method with it's own iterable class.
5. Does the order of methods chaining affects the order of execution of the QuerySet ?
It does, but the methods that modify the data structure are ALWAYS executed after the methods that modify the SQL queries by design.
I hope this article helps you learn more about the Django ORM data fetching strategy and how we can resolve possible performance issues.
For more Django related blog posts - check the rest of the Django articles in our blog ✌