Supabase RLS Infinite Recursion (42P17): The Hidden Bug
A login redirect loop that was really a database bug. How supabase rls infinite recursion 42P17 hides in multi-tenant policies and the SECURITY DEFINER fix.
By Mike Hodgen
The Symptom: A Black Dashboard and an Endless Login Loop
A user logs into a multi-tenant SaaS app. The screen flickers, then bounces them right back to the login page. They log in again. Bounced again. On the third try the dashboard finally renders, except it's black. Empty. Nothing loads.
Symptom-to-cause layer chain (login loop traced down to database)
This is a postmortem from a multi-tenant product I built. No client, no product name. Just the bug, because the bug is the useful part.
When I first saw this, every instinct pointed at authentication. A login loop is an auth problem, right? That's what it looks like. That's where you go. So that's where I went, and I lost hours there.
I checked the session token. Clean. I checked the middleware redirect guards, the logic that decides whether a request is authenticated and where to send it if not. Clean. I checked the cookie config, the SameSite settings, the domain scoping. All fine. I checked the GoTrue auth callback, the part of Supabase that handles the login handshake. Nothing wrong.
The session was valid. The token was good. The user was authenticated. And the app kept throwing them back to login like they weren't.
Here's the thing I want you to sit with before I tell you the answer: the symptom lived in the auth layer, but the bug lived in the database. The login loop was real. The black dashboard was real. But neither was caused by anything in the authentication stack.
The actual culprit was a Postgres error, supabase rls infinite recursion 42p17, buried in a data query the dashboard fired on load. The auth layer was the messenger. I spent hours arguing with the messenger.
That's the trap. And it's a common one. Let me walk you through exactly how I fell into it and how I climbed out.
Why I Wasted Hours Debugging the Wrong Layer
The auth-shaped red herring
A redirect loop has a shape. When you see one, your brain fills in the cause before you've looked at any evidence. Session expired. Token didn't refresh. Cookie got dropped. Redirect guard misfired.
So I worked the list. I added logging to the middleware to confirm the session was being read correctly. It was. I traced the token refresh flow to make sure it wasn't silently failing. It wasn't. I verified the GoTrue callback was setting cookies with the right flags. It was.
Every single thing in the auth layer checked out. Which should have been my first clue that the problem was somewhere else. Instead I just looked harder at the same code, convinced I'd missed something.
When a layer keeps passing every test you throw at it, stop testing that layer.
The clue I almost missed
The breakthrough came when I stopped reading my own code and opened the browser network tab. Not the auth requests. The data request. The query the dashboard fires the moment it loads to pull the user's tenant data.
That request returned a 500. And the response body wasn't an auth error. It was a Postgres error:
code: 42P17
message: infinite recursion detected in policy for relation "memberships"
There it was. The data fetch failed at the database. The client code saw a failed fetch, couldn't load the user's data, and interpreted "no data" as "not authorized." So it kicked the user to login. The user logged in fine, the dashboard tried to fetch data again, the same recursion error fired again, and the loop closed.
The recursion masqueraded as an auth failure. The login screen was a symptom three layers removed from the cause. I'd been debugging the reflection instead of the object.
What Postgres Error 42P17 Actually Means
Postgres error 42P17 is the database telling you that a row-level security policy references itself in a way that can never terminate. It aborts the query rather than hang forever.
How RLS infinite recursion forms (the self-referencing loop)
You probably know what row-level security (RLS) is conceptually. Here's the failure mode in plain terms.
An RLS policy is a WHERE clause that Postgres silently appends to every query against a table. You write a policy once, and from then on every SELECT, INSERT, UPDATE, and DELETE on that table gets your condition stapled to it automatically. The user never sees rows the policy excludes. That's the whole point, and it's a clean way to enforce tenant isolation at the database level instead of trusting your application code to filter correctly every time.
Now here's where it breaks.
Suppose your policy needs to check something, and to check it, the policy runs a query against the same table the policy is protecting. To evaluate that inner query, Postgres has to apply the policy. To apply the policy, it has to run the inner query. To run the inner query, it has to apply the policy. Forever.
Postgres detects this loop and refuses to play. It throws 42P17, "infinite recursion detected in policy for relation," and aborts. It's choosing to fail loudly instead of hanging your database thread until it times out.
So 42P17 is not a bug in Postgres. It's Postgres protecting you from a logical impossibility you wrote into your security model. The fix is never to make the recursion "work." The fix is to break the self-reference so the policy never has to evaluate against the table it protects.
The Multi-Tenant Trap: A Policy That Queries Its Own Table
The membership-check pattern everyone writes
Here's the scenario, anonymized. You have a memberships table linking users to organizations. Each row says "this user belongs to this org" with a user_id and an org_id.
You want a rule: a user can see a membership row only if they belong to the same org as that row. Sensible. Tenant isolation 101.
So you write the policy in the most natural way possible:
CREATE POLICY "members see their org's memberships"
ON memberships
FOR SELECT
USING (
org_id IN (
SELECT org_id FROM memberships
WHERE user_id = auth.uid()
)
);
Read it out loud. "You can see a membership row if its org_id is one of the orgs you're a member of." To find the orgs you're a member of, you query the memberships table. The policy protecting memberships queries memberships.
Why it recurses
That subquery, SELECT org_id FROM memberships, is itself subject to the policy. So evaluating the policy requires evaluating the policy. That's the recursion. That's 42P17.
The cruel part is how convincing this code looks. It reads correctly. It passes code review. It even works in a quick manual test, because a single-row scenario can sometimes short-circuit before the recursion fully unwinds, or the planner caches its way past the problem on a trivial dataset.
It detonates the moment the policy actually has to evaluate against itself under real conditions, with multiple rows and multiple tenants. Which means it often ships clean and breaks in production, weeks later, when you finally have real customers. By then nobody connects the login loop to a policy nobody's touched since launch.
This is one of the most common multi-tenant RLS mistakes I see, and it's exactly the kind of subtle structural problem the row-level security playbook I use across 50+ tables is built to prevent. The pattern feels right. That's what makes it dangerous. A self-referencing policy on a junction table is the default thing a smart engineer writes, and it's wrong every time.
The Fix: SECURITY DEFINER Functions That Bypass RLS
What SECURITY DEFINER does
The rule is simple: never let an RLS policy query the table it protects. When you need a cross-row check, like "does this user belong to this org," move that check into a Postgres function marked SECURITY DEFINER.
A normal function runs with the privileges of whoever calls it, so it's still subject to RLS. A SECURITY DEFINER function runs with the privileges of the function's owner. If the owner is a role that bypasses RLS, the lookup inside the function doesn't re-trigger the policy. The recursion loop never forms because the inner query is no longer governed by the policy doing the asking.
Moving the cross-row check out of the policy
Here's the pattern, generically:
Broken self-referencing policy vs SECURITY DEFINER fix
CREATE FUNCTION user_belongs_to_org(check_org_id uuid)
RETURNS boolean
LANGUAGE sql
SECURITY DEFINER
SET search_path = public
AS $$
SELECT EXISTS (
SELECT 1 FROM memberships
WHERE user_id = auth.uid()
AND org_id = check_org_id
);
$$;
Then the policy calls the function instead of doing the inline subquery:
CREATE POLICY "members see their org's memberships"
ON memberships
FOR SELECT
USING ( user_belongs_to_org(org_id) );
Now the policy asks a function "does this user belong to this org," the function checks memberships with RLS bypassed, returns true or false, and the policy applies that boolean. No self-reference. No recursion. No 42P17.
A few non-negotiable safety rules, because SECURITY DEFINER is a loaded weapon.
It bypasses RLS by design. That's the entire point, and it's also how you blow a hole in your own security model if you're careless. So keep these functions narrow. This one takes one argument and returns one boolean. It cannot leak rows because it never returns rows.
Always set a fixed search_path, like SET search_path = public. Without it, an attacker who can create objects in a schema earlier in the search path can hijack what your function resolves to and run their code with the owner's elevated privileges. The fixed path closes that door.
And only return what you need. A boolean. Never expose the underlying rows through a SECURITY DEFINER function unless you have deliberately reasoned about exactly who can call it and what they can pass.
This is the difference between an RLS setup that holds and one that either recurses or quietly leaks. It's also why I architect multi-tenancy so every client gets their own isolated data from the first table, not as a patch after the loop shows up.
The Bonus Trap: GoTrue and the NULL Token Columns
While we're below the auth layer, here's a second gotcha that surfaces as the exact same symptom.
Two different causes, one identical symptom
When you insert directly into auth.users, say to seed test accounts, create demo logins, or migrate users from another system, GoTrue has expectations about certain columns that the docs don't shout about. Columns like confirmation_token, recovery_token, and email_change_token need to be empty strings, not NULL.
If you leave them NULL, GoTrue's internal queries can choke when it tries to process that user. The login fails. And how does a failed login present? As a redirect loop or a broken session. The user logs in, something downstream errors on a NULL it expected to be '', and they get bounced right back to where they started.
Same symptom. Completely different cause. Both living below the authentication layer you instinctively blame.
The fix is to set those columns to '' on insert:
INSERT INTO auth.users (id, email, confirmation_token, recovery_token, ...)
VALUES (gen_random_uuid(), 'test@example.com', '', '', ...);
I'm flagging this because it reinforces the whole lesson. Two unrelated failure modes, an RLS recursion and a NULL token column, both produce a login loop. If you only know the symptom, you'll spend hours in the auth code on both, and the auth code is innocent both times.
This is the kind of cross-layer knowledge that separates someone who has shipped multi-tenant Supabase in production from someone who has only read the documentation. The docs tell you what the columns are. They don't tell you that a NULL there looks exactly like a broken redirect guard at 11pm.
Debug the Layer the Error Comes From, Not the One You Suspect
The symptom was an auth loop. The bug was in the database. The whole cost of this debugging session, the hours I'm not getting back, came from one instinct: debugging what I could see instead of what was actually failing.
Debugging discipline: read the failing request first
The login screen was visible. The recursion error was buried in a network response I didn't open until I'd exhausted everything else. I had the answer available the entire time. I just didn't read it because I'd already decided where the problem lived.
The discipline here is boring and it works: read the actual error payload from the failing request before you touch any code you assume is broken. Open the network tab. Find the request that's actually failing. Read the response body. The database told me it was 42P17 in plain text. I just had to look.
This is what production multi-tenancy actually involves. Not the happy-path demo where one user logs in and sees one tenant's data. The non-obvious failure modes that don't appear until you have real tenants, real rows, and policies evaluating against themselves under load. They hide until you ship, then they surface as something that looks like a completely different problem.
If you've read about the same security holes I find in fast AI-built apps, you already know how often the dangerous bug is the one that looks like something else.
So if you're a CTO weighing whether to build multi-tenant isolation in-house or bring me in to audit your multi-tenant setup, this is the work. Finding the recursion before it ships. Knowing that a NULL token reads as a redirect loop. Building the SECURITY DEFINER boundaries so the policy never eats itself. These are the traps I've already hit, so your customers don't have to.
Ready to bring AI leadership into your company?
I work with a small number of companies at a time. If you're serious about AI, apply to work together and I'll review your application personally.
Get AI insights for business leaders
Practical AI strategy from someone who built the systems — not just studied them. No spam, no fluff.
Ready to automate your growth?
Book a free 30-minute strategy call with Hodgen.AI.
Book a Strategy Call