Design a rate limiter for an API.
The complete answer guide: what this question really tests, two strong example answers from different angles, the common weak answer rewritten, and the trap most candidates fall into. This is a system design archetype question — see the broader pattern guide for the structural shape.
What this question is really testing
The interviewer isn't actually testing whether you can build a rate limiter—most production systems use existing solutions like Redis or cloud-native offerings. They're testing whether you can navigate ambiguity, make explicit tradeoffs, and think in layers of abstraction. The binary read they're making: can you hold multiple concerns in your head simultaneously (accuracy vs performance, single-machine vs distributed, different rate limiting algorithms) while still driving toward a concrete design? They're watching to see if you ask clarifying questions that reveal system thinking, or if you immediately jump to implementation details like "I'll use a token bucket algorithm" without understanding the problem space.
The deeper signal is about production maturity. When you discuss rate limiting, the interviewer is listening for whether you've actually dealt with the cascading failures that happen when systems don't have proper backpressure. Do you mention what happens to requests that get rate limited? Do you think about monitoring and observability? Do you consider the user experience of getting a 429 response? Weak candidates treat this as an algorithms problem. Strong candidates treat it as a distributed systems problem with business constraints, user experience implications, and operational requirements. The interviewer wants to know if you'll build something that works in theory or something that works when deployed.
Two strong answers, two angles
Angle A: Start with requirements and constraints
"Before I design anything, I need to understand the rate limiting requirements. Are we limiting per user, per API key, per IP address, or some combination? What's the time window—requests per second, per minute, per hour? And critically, what's our scale—are we talking about a single server handling thousands of requests, or a distributed system across multiple regions handling millions? Let me assume we're building for a REST API with 10,000 requests per second, limiting authenticated users to 100 requests per minute. I'd start with a token bucket algorithm running in Redis, with each user having a key that stores their token count and last refill timestamp. This gives us atomic operations, persistence across server restarts, and the ability to share state across multiple API servers."
Angle B: Start with architecture and evolution
"I'd approach this in stages because rate limiting requirements evolve. For an MVP with a single server, I'd implement an in-memory token bucket using a concurrent hash map with periodic cleanup. This handles 95% of cases with microsecond latency. But the real question is distributed rate limiting—when you have multiple API servers, you need shared state. I'd introduce Redis as a centralized counter store, using INCR with EXPIRE for a simple sliding window counter. The tradeoff here is network latency—each request now makes a Redis call—but we gain consistency across servers. For very high throughput, I'd implement a hybrid approach: local in-memory buckets that sync periodically with Redis, accepting some over-limit requests in exchange for reduced latency. The key is making these tradeoffs explicit based on whether we prioritize strict enforcement or performance."
The common weak answer
"I would use a token bucket algorithm where each user gets tokens that refill over time. When a request comes in, we check if they have tokens available. If yes, we decrement and allow the request. If no, we reject it with a 429 status code. We'd store this in Redis for persistence."
This answer fails because it's a Wikipedia summary, not a design. The interviewer learns nothing about your decision-making process or awareness of tradeoffs. You've named an algorithm but haven't explained why token bucket over leaky bucket or sliding window. You've mentioned Redis but not addressed how you handle Redis being down, or whether you're using Redis transactions, or how you prevent race conditions with concurrent requests. The response signals that you've memorized a pattern but haven't built systems that actually enforce rate limits under production load. Reframe it: "I'd start with token bucket because it allows burst traffic while enforcing average rates, which matches most API usage patterns. The implementation details depend on whether we need strict enforcement—if so, Redis with Lua scripts for atomicity—or if we can tolerate slight over-limit scenarios for better latency."
The one trap most candidates fall into
The trap is diving into algorithm implementation details before establishing the system boundaries and failure modes. Candidates spend ten minutes explaining token bucket mechanics with timestamps and refill rates, then realize they haven't discussed what happens when the rate limiter itself becomes the bottleneck or fails. Here's the counterintuitive truth: the algorithm choice matters far less than your strategy for handling rate limiter failures. If your Redis cluster goes down, do you fail open (allow all requests, risking downstream overload) or fail closed (reject all requests, creating a self-inflicted outage)?
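The fail-open versus fail-closed choice can be made explicit in a few lines. A minimal sketch, assuming the underlying limiter check raises `ConnectionError` on a backend outage (the wrapper and its names are hypothetical):

```python
def check_rate_limit(limiter_call, fail_open=True):
    """Wrap a rate-limiter check so that backend failure is an explicit
    policy decision rather than an unhandled exception.
    `limiter_call` is any callable returning True/False that raises
    ConnectionError when the limiter backend (e.g. Redis) is down."""
    try:
        return limiter_call()
    except ConnectionError:
        # Fail open: allow traffic and rely on downstream protections.
        # Fail closed: return False here, rejecting everything instead.
        return fail_open
```

Either choice is defensible; the interview signal is stating which one you pick and why, ideally with an alert firing whenever the fallback path is being taken.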
Most candidates also forget that rate limiting is a user-facing feature with UX implications. When you return a 429, are you including headers that tell the client when to retry? Are you distinguishing between "you're making too many requests" and "the entire system is overloaded"? The strongest candidates mention that rate limiting is part of a broader API contract and discuss response headers like `X-RateLimit-Remaining` and `Retry-After`. They think about client SDK behavior and exponential backoff. This signals you've actually dealt with angry developers whose applications broke because of rate limiting changes, which is the kind of production experience interviewers want to see.
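The API-contract point above can be made concrete by showing what a limited response carries. A sketch only: the `X-RateLimit-*` names are a common convention rather than a standard, and exact header names vary across APIs.

```python
def rate_limit_headers(limit, remaining, retry_after_secs):
    """Headers to attach to responses from a rate-limited endpoint.
    Names follow the widespread X-RateLimit-* convention; Retry-After
    is the standard HTTP header clients use to schedule a retry."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
    }
    if remaining <= 0:
        # Only a 429 response needs to tell the client when to come back.
        headers["Retry-After"] = str(retry_after_secs)
    return headers
```

Well-behaved client SDKs read `Retry-After` and combine it with jittered exponential backoff instead of hammering the endpoint.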
Common questions
How long should my answer to "Design a rate limiter for an API." be?
Aim for 60-120 seconds spoken (250-350 words) for your opening framing. Long enough to land the requirements, the core design, and the key tradeoff; short enough that the interviewer has room to follow up. Anything past two minutes risks losing them.
Should I memorize my answer word-for-word?
No — that reads as canned and falls apart the moment the interviewer asks a follow-up. Memorize the structure (the bones of the story) and the specific numbers/names that anchor it. Let the words come naturally each time.
What if I have a really good story but it was years ago?
Recent is better, but a strong story from 3 years ago beats a vague story from last quarter. If the example is older than 5 years, frame it as the moment that crystallized the lesson, then briefly bridge to how you've applied it since.
Can I use the same story for multiple questions?
Often yes — strong stories tend to demonstrate multiple competencies. The trick is reframing the angle each time. Same situation, different opening sentence: lead with the conflict for conflict questions, lead with the leadership move for leadership questions.
How do I know if my answer is actually good?
Practice it out loud and have it scored. The fastest way is a mock interview where the AI flags exactly what's vague, where you used 'we' when the question asked about 'I,' and rewrites the weakest sentence. Reading example answers helps; getting yours scored is what moves performance.