Nov 23, 2015 • Scott Stafford
UPDATE: We are available for consulting work. Please reach out to Math and Pencil, our small consulting company, if you are looking for help with performance problems in your website.
The Django REST Framework allows Django developers to build simple yet robust standards-based REST APIs for their applications. We’ve used it successfully on a number of Django web design projects. However, even seemingly simple, straightforward usage of the Django REST Framework and its nested serializers can kill performance of your API endpoints. And that matters: if your web server is wasting its time inefficiently responding to a REST API call, it will drag the rest of the server’s responsiveness down with it.
At its root, the problem is called the “N+1 selects problem”: the database is queried once for data in a table (say, Customers), and then one or more times per customer inside a loop to get, say, customer.country.Name. Using the Django ORM, this mistake is easy to make. Using DRF, it is hard not to make.
Luckily, there is a solution that can be used to fix this common Django REST Framework performance problem without any major restructuring of the code. It requires use of the underutilized select_related and prefetch_related methods on the Django ORM (and the newer Prefetch object as well) to perform what is called “eager loading”.
This approach can have a big effect. On the most recent project we applied this to, important API calls were taking 5-10 seconds to return results. After applying appropriate eager loading, the same calls were well below 1 second. Speedups of 20x or more are typical.
When you build a DRF view, you often want the return to include data from more than one related table. Writing this is straightforward and covered in the DRF docs in depth. Unfortunately, as soon as you use a nested relationship in your serializer, you risk crushing your performance, and like so many performance problems, it often only shows itself in production with larger, real world data sets.
This happens because the Django ORM is lazy; it only fetches the minimum amount of data needed to respond to the current query. It does not know you’re about to ask a hundred (or ten thousand) times for the same or very similar data.
And these days, when talking about database-backed websites, generally, the most important metric when determining site responsiveness is number of trips to the database.
In DRF, we run into trouble whenever a serializer has a nested relationship, such as either of these:
class CustomerSerializer(serializers.ModelSerializer):
    # This can kill performance!
    order_descriptions = serializers.StringRelatedField(many=True)
    # So can this, same exact problem...
    orders = OrderSerializer(many=True, read_only=True)  # This can kill performance!
The code inside DRF that populates either CustomerSerializer does this:
1. Fetch all customers. (Requires a round-trip to the database.)
2. For the first customer, fetch their orders. (Requires another round-trip to the database.)
3. For the second customer, fetch their orders. (Requires another round-trip to the database.)
4. For the third customer, fetch their orders. (Requires another round-trip to the database.)
5. For the fourth customer, fetch their orders. (Requires another round-trip to the database.)
6. For the fifth customer, fetch their orders. (Requires another round-trip to the database.)
7. For the sixth customer, fetch their orders. (Requires another round-trip to the database.)
And it can quickly get worse. If your OrderSerializer itself has a nested relationship, you have a loop-inside-a-loop, and you’re quickly in trouble, even for smallish amounts of data. As a rule of thumb, these days, on a modest traffic website, you can probably afford 50 trips to the database before you start getting into real trouble.
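The steps above can be reproduced outside Django. The following is only a rough sketch using Python's built-in sqlite3 module rather than the Django ORM or DRF; the table names, the made-up data, and the CountingDB and serialize_lazily helpers are assumptions for illustration. It counts how many SELECTs the lazy pattern issues for six customers.

```python
import sqlite3


class CountingDB:
    """Thin wrapper (hypothetical helper) that counts every SELECT it runs."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.queries = 0
        self.conn.executescript(
            """
            CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
            CREATE TABLE orders (
                id INTEGER PRIMARY KEY,
                customer_id INTEGER REFERENCES customers(id),
                description TEXT
            );
            """
        )
        # Six made-up customers with two orders each.
        for cid in range(1, 7):
            self.conn.execute(
                "INSERT INTO customers VALUES (?, ?)", (cid, f"customer-{cid}")
            )
            for n in (1, 2):
                self.conn.execute(
                    "INSERT INTO orders (customer_id, description) VALUES (?, ?)",
                    (cid, f"order-{cid}-{n}"),
                )

    def select(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def serialize_lazily(db):
    """Mimic the lazy pattern: one query for customers, then one per customer."""
    result = []
    for cid, name in db.select("SELECT id, name FROM customers ORDER BY id"):
        orders = db.select(
            "SELECT description FROM orders WHERE customer_id = ? ORDER BY id",
            (cid,),
        )
        result.append({"name": name, "orders": [d for (d,) in orders]})
    return result


db = CountingDB()
data = serialize_lazily(db)
print(db.queries)  # 7: one query for customers plus one per customer
```

The count grows linearly with the number of customers, which is why the problem tends to stay hidden on small development data sets.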
Our approach to fixing this problem is called “eager loading”. Essentially, you warn the Django ORM ahead of time that you’re going to ask it the same inane question over and over, “so get ready”. In the above example, simply do this before DRF starts fetching:
queryset = queryset.prefetch_related('orders')
Then, when DRF makes the same call as above to serialize customers, this happens instead:
1. Fetch all customers. (Makes TWO round-trips to the database. The first is for customers. The second fetches all orders related to any of the fetched customers.)
2. For the first customer, fetch their orders. (Does NOT require a trip to the database; we already fetched the needed data in step 1.)
3. For the second customer, fetch their orders. (Does NOT require a trip to the database.)
4. For the third customer, fetch their orders. (Does NOT require a trip to the database.)
5. For the fourth customer, fetch their orders. (Does NOT require a trip to the database.)
6. For the fifth customer, fetch their orders. (Does NOT require a trip to the database.)
7. For the sixth customer, fetch their orders. (Does NOT require a trip to the database.)
In short, the Django ORM “eagerly” asked for the data in step 1, then could supply the data requested in steps 2+ from its local data cache. Fetching data from the local data cache is essentially instantaneous when compared with the database round-trip, so we just got an enormous performance speedup in conditions when there are many customers.
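To illustrate the idea behind prefetching, here is a minimal sketch in plain sqlite3 (not Django's actual implementation; the schema, data, and helper names are made up for the example). It fetches the customers, then fetches all of their orders with a single WHERE ... IN query, and groups the rows in memory so that each customer's orders can be served without touching the database again.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        description TEXT
    );
    """
)
# Six made-up customers with two orders each.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, f"customer-{i}") for i in range(1, 7)],
)
conn.executemany(
    "INSERT INTO orders (customer_id, description) VALUES (?, ?)",
    [(i, f"order-{i}-{n}") for i in range(1, 7) for n in (1, 2)],
)

stats = {"queries": 0}


def select(sql, params=()):
    """Run a SELECT and count it as one round-trip."""
    stats["queries"] += 1
    return conn.execute(sql, params).fetchall()


def serialize_eagerly():
    # Round-trip 1: all customers.
    customers = select("SELECT id, name FROM customers ORDER BY id")
    ids = [cid for cid, _ in customers]

    # Round-trip 2: every order belonging to any fetched customer.
    placeholders = ", ".join("?" for _ in ids)
    rows = select(
        "SELECT customer_id, description FROM orders "
        f"WHERE customer_id IN ({placeholders}) ORDER BY id",
        ids,
    )

    # Group in memory; this plays the role of the local data cache.
    orders_by_customer = defaultdict(list)
    for cid, description in rows:
        orders_by_customer[cid].append(description)

    return [
        {"name": name, "orders": orders_by_customer[cid]}
        for cid, name in customers
    ]


data = serialize_eagerly()
print(stats["queries"])  # 2, no matter how many customers there are
```

The query count stays at two whether there are six customers or six thousand; only the size of the IN list and the result set grows.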
We have settled on a common pattern to optimize this Django REST Framework performance problem. Whenever a serializer will query nested fields, we add a new @staticmethod called setup_eager_loading to the serializer, like so:
class CustomerSerializer(serializers.ModelSerializer):
    orders = OrderSerializer(many=True, read_only=True)

    @staticmethod
    def setup_eager_loading(queryset):
        """ Perform necessary eager loading of data. """
        queryset = queryset.prefetch_related('orders')
        return queryset
And then, wherever that serializer is going to be used, simply call setup_eager_loading on the queryset before the serializer is invoked, like so:
customer_qs = Customers.objects.all()
customer_qs = CustomerSerializer.setup_eager_loading(customer_qs) # Set up eager loading to avoid N+1 selects
post_data = CustomerSerializer(customer_qs, many=True).data
…or, if you have a generic APIView or ViewSet (one built on GenericAPIView, which provides get_serializer_class), you can call setup_eager_loading in the get_queryset method:
def get_queryset(self):
    queryset = Customers.objects.all()
    # Set up eager loading to avoid N+1 selects
    queryset = self.get_serializer_class().setup_eager_loading(queryset)
    return queryset
So what goes inside setup_eager_loading? The hard part of solving this Django performance problem is understanding how select_related and its friends work. Here, we’ll detail how each is used in the context of the Django ORM and the Django REST Framework.
select_related: The simplest eager loading tool in the Django ORM, for all one-to-one or many-to-one relationships, where you need data from the “one” parent object, such as a customer’s company name. This translates into a SQL join so the parent rows are fetched in the same query as the child rows. (See the official documentation.)
prefetch_related: For more complex relationships where there are multiple rows per result (i.e. many=True), like many-to-many or one-to-many relationships, such as a customer’s orders as above. This translates to a second SQL query on the related table, usually with a long WHERE ... IN clause to select only relevant rows. (See the official documentation.)
Prefetch: Used for complex prefetch_related queries, such as filtered subsets. It can also be used to nest setup_eager_loading calls. (See the official documentation.)
For our example, let’s optimize the Django REST Framework-related performance problems of an imaginary event-planning website (which surprisingly parallels our ongoing project getfetcher.com). We have a simple database structure:
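from django.contrib.auth.models import User
from django.db import models

class Event(models.Model):
    """ A single occasion that has many `attendees` from a number of organizations."""
    # on_delete is required since Django 2.0; CASCADE was the earlier default.
    creator = models.ForeignKey(User, on_delete=models.CASCADE)
    name = models.TextField()
    event_date = models.DateTimeField()

class Attendee(models.Model):
    """ A party-goer who (usually) represents an `organization`, who may attend many `events`."""
    events = models.ManyToManyField(Event, related_name='attendees')
    # Referenced by name because Organization is defined further down.
    organization = models.ForeignKey('Organization', null=True, on_delete=models.CASCADE)

class Organization(models.Model):
    name = models.TextField()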
For this example, to fetch all events, our eager loading code would look like this:
from django.db.models import Prefetch
from rest_framework import serializers

class EventSerializer(serializers.ModelSerializer):
    creator = serializers.StringRelatedField()
    attendees = AttendeeSerializer(many=True)
    unaffiliated_attendees = AttendeeSerializer(many=True)

    @staticmethod
    def setup_eager_loading(queryset):
        """ Perform necessary eager loading of data. """
        # select_related for "to-one" relationships
        queryset = queryset.select_related('creator')

        # prefetch_related for "to-many" relationships
        queryset = queryset.prefetch_related(
            'attendees',
            'attendees__organization')

        # Prefetch for subsets of relationships: the lookup must name the real
        # relationship ('attendees'); to_attr stores the filtered result as a
        # list on each event under the name the serializer field expects.
        queryset = queryset.prefetch_related(
            Prefetch('attendees',
                     queryset=Attendee.objects.filter(organization__isnull=True),
                     to_attr='unaffiliated_attendees')
        )
        return queryset
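When we make sure to invoke setup_eager_loading before using the EventSerializer, we will only have a small, fixed number of larger queries (one for the events and their creators, plus one per prefetched relationship) instead of N+1 smaller queries, and our performance will usually be MUCH better!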
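To make the SQL side of these tools more concrete, here is a rough sketch using sqlite3 directly. It is not the SQL Django actually generates, and the table layout, sample rows, and the attendees_for helper are assumptions for illustration. A JOIN stands in for select_related, a single WHERE ... IN query stands in for prefetch_related, and the same query with an extra filter stands in for a filtered Prefetch.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT, creator_id INTEGER);
    CREATE TABLE attendees (id INTEGER PRIMARY KEY, name TEXT, organization_id INTEGER);
    CREATE TABLE event_attendees (event_id INTEGER, attendee_id INTEGER);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO events VALUES (1, 'Gala', 1), (2, 'Picnic', 2);
    INSERT INTO attendees VALUES (1, 'Ann', 10), (2, 'Ben', NULL), (3, 'Cy', NULL);
    INSERT INTO event_attendees VALUES (1, 1), (1, 2), (2, 2), (2, 3);
    """
)

queries = []  # every SELECT we send, so the round-trips can be counted


def select(sql, params=()):
    queries.append(sql)
    return conn.execute(sql, params).fetchall()


# select_related style: the "to-one" creator arrives in the same query via a JOIN.
events = select(
    "SELECT e.id, e.name, u.username FROM events e "
    "JOIN users u ON u.id = e.creator_id ORDER BY e.id"
)


def attendees_for(event_ids, extra_where=""):
    """prefetch_related style: one IN query for all events, grouped in Python.

    extra_where is a trusted SQL fragment used to mimic a filtered Prefetch.
    """
    marks = ", ".join("?" for _ in event_ids)
    rows = select(
        "SELECT ea.event_id, a.name FROM attendees a "
        "JOIN event_attendees ea ON ea.attendee_id = a.id "
        f"WHERE ea.event_id IN ({marks}) {extra_where} ORDER BY a.id",
        event_ids,
    )
    grouped = defaultdict(list)
    for event_id, name in rows:
        grouped[event_id].append(name)
    return grouped


event_ids = [row[0] for row in events]
attendees = attendees_for(event_ids)
unaffiliated = attendees_for(event_ids, "AND a.organization_id IS NULL")

result = [
    {
        "name": name,
        "creator": creator,
        "attendees": attendees[eid],
        "unaffiliated_attendees": unaffiliated[eid],
    }
    for eid, name, creator in events
]
print(len(queries))  # 3 round-trips, independent of the number of events
```

The point of the sketch is the shape of the work: to-one data rides along in the main query, while each to-many relationship costs one extra query in total rather than one per row.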
Eager loading is a common performance optimization that has application well beyond the Django REST Framework.
Any time you are querying nested relationships via an ORM, you should think about setting up the proper eager loading. In my experience, it is the most commonplace performance-related problem in modern small- and midsize web development.
In a followup blog post, I’ll cover some debugging strategies for figuring out elusive queries spawned by more complex Serializers and some more advanced usages of Prefetch.
Thank you for reading!