The Real Fix for Streaming, Traefik, and MetalLB

So remember that time I spent months debugging streaming issues with Ollama behind Traefik, only to have it fixed by a routine apt-get upgrade?

Sadly… that wasn’t the fix… and the issue returned.

The streaming worked after that upgrade. But it wasn’t because of the upgrade, and it took me until now to figure that out. Turns out the cluster reboot that came after the upgrade is what actually made it work, not the package update. The timing just made it look like the upgrade did it.

TLDR

kube-vip and MetalLB were both operating at Layer 2, announcing IPs on the same network segment and clobbering each other’s ARP announcements.

Ralph the rubber duck

A few months after that false-positive fix, the issue came back after I performed OS updates and kicked off a cluster reboot.

Bam, all my AI agents and tools were busted, plagued by API stream terminations.

I reverted my AI agent to direct IP access and started using him as a rubber duck and troubleshooting companion. So I gave Ralph, my AI agent, access to the GitOps repo and direct access to the kube API to dig as deep as needed.

Things Ralph thought might be the problem:

  1. Packet loss. Ralph wrote a series of tests to confirm (see the sketch after this list); not the issue.
  2. Bad Traefik Config. Ralph modified the config, no improvement.
  3. Storage too slow… Bro, not sure why storage would impact stateless apps…
  4. Traefik resources too low. Increased resources, no improvement.
  5. Raspberry Pis can’t handle the load… Maybe
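For the curious, the packet-loss checks were roughly this kind of thing (a minimal sketch with made-up addresses; Ralph’s actual tests were more thorough):

```bash
# Fast pings node-to-node; the summary lines show % loss
ping -i 0.2 -c 500 192.168.1.11 | tail -2

# Per-hop loss and latency from a client to the Traefik VIP
mtr --report --report-cycles 100 192.168.1.240
```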

So I moved the cluster off my trusty Raspberry Pi 4 nodes onto a single beefy x86 machine. 36 cores. 256GB of RAM.

The streaming broke again. On better hardware. So it wasn’t the Pis. It wasn’t slow ARM processors. It wasn’t some Raspberry Pi-specific network weirdness.

I bypassed MetalLB by switching Traefik to NodePorts (sketched below), and the problem immediately cleared.
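If you want to try the same bypass, it’s a one-liner against the Traefik Service. A sketch, assuming Traefik lives in a traefik namespace with a Service named traefik (adjust for your install):

```bash
# Take MetalLB out of the data path: expose Traefik on node ports instead
kubectl -n traefik patch svc traefik -p '{"spec":{"type":"NodePort"}}'

# Find the assigned ports, then hit http://<any-node-ip>:<node-port>
kubectl -n traefik get svc traefik
```

With NodePorts, traffic goes straight to kube-proxy on whichever node you hit; no L2 VIP, no ARP announcements.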

I went back to the Pi cluster.

Then I played chaos monkey (roughly the commands sketched after this list):

  • Rebooted individual nodes
  • Cordoned nodes to force rescheduling
  • Manually moved Traefik pods to run on every node
  • Did a full cluster reboot
  • Watched to see if the streaming broke
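Nothing exotic behind that list; it was mostly variations on these (node names made up, and I’m assuming Traefik runs as a Deployment in a traefik namespace):

```bash
# Cordon a node so nothing new lands on it
kubectl cordon pi-node-2

# Drain it to force rescheduling of whatever was running there
kubectl drain pi-node-2 --ignore-daemonsets --delete-emptydir-data

# Bounce Traefik and let the scheduler pick a new home for it
kubectl -n traefik rollout restart deployment/traefik

# Watch where the pod lands, then test streaming again
kubectl -n traefik get pods -o wide --watch
```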

Here’s where it got weird: I noticed the streaming seemed to work whenever Traefik happened to be scheduled on a specific node.

After tons of digging I realized that node was the kube-vip leader.
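Finding the leader is easy once you know to look: kube-vip does leader election through a Kubernetes Lease. A sketch using kube-vip’s default lease name for its services election (yours may differ depending on how it’s deployed):

```bash
# Which node currently holds the kube-vip lease?
kubectl -n kube-system get lease plndr-svcs-lock \
  -o jsonpath='{.spec.holderIdentity}{"\n"}'

# Compare against the node Traefik is scheduled on
kubectl -n traefik get pods -o wide
```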

When Traefik moved off that node — streaming broke again. Always. Like clockwork.

Made no sense. Different IP ranges; they shouldn’t have conflicted at all.

But the correlation was undeniable.

The fix took only minutes: I went to my GitOps repo and disabled kube-vip. Bam, no issues. Confirmed with some more chaos monkey business. Still no issues!
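In a GitOps repo, “disabled” just means deleting the kube-vip manifests and letting the reconciler clean up. The imperative equivalent is something like this, assuming kube-vip runs as a DaemonSet in kube-system (it can also run as static pods, in which case the manifest lives in /etc/kubernetes/manifests on each control-plane node):

```bash
# Remove kube-vip (DaemonSet name varies by install; kube-vip-ds is common)
kubectl -n kube-system delete daemonset kube-vip-ds

# Confirm nothing kube-vip-related is still running
kubectl -n kube-system get pods -o wide | grep -i kube-vip
```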

The Actual Lesson

The original post’s conclusion was wrong. It said:

“I spent weeks assuming it was some complex interaction between Traefik and streaming protocols, when really some package somewhere in my base OS had a bug that got patched in some routine update.”

That’s not what happened. What happened was:

  1. I ran apt-get upgrade
  2. I rebooted the cluster
  3. The reboot caused k8s scheduling chaos.
  4. Traefik happened to get scheduled on the kube-vip leader node after the reboot
  5. Streaming appeared to work
  6. I wrote a blog post claiming package updates were the fix
  7. Months later: the bug came back because Traefik moved nodes
  8. Months later: I finally connected the dots

The real lesson? Don’t settle for “it works, don’t know why”. In this case it was a real issue that took deep troubleshooting: two services fighting over ARP announcements on the same network, even though they were supposed to handle separate IP ranges.
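If you suspect the same fight on your own network, you can watch it happen from any box on the segment. A sketch with a placeholder interface and VIP:

```bash
# Watch who announces the contested address; gratuitous ARPs
# from two different MACs for one IP is the smoking gun
sudo tcpdump -i eth0 -n arp and host 192.168.1.240

# Ask directly: if the replying MAC flips between runs,
# two machines are claiming the same IP
arping -I eth0 -c 4 192.168.1.240
```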

Is it a kube-vip bug? A MetalLB bug? A weird kernel interaction? I don’t know. And honestly, I don’t care anymore. The fix is simple: I only need one of them.

Use one. Not both.

Bonus

It appears kube-vip has configuration to stop handling Service type LoadBalancer and only provide a VIP for the control-plane Kubernetes API, but for now I’m happy to have it disabled entirely and not have broken APIs.
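For reference, kube-vip exposes that split through environment variables on its manifest: cp_enable for the control-plane VIP and svc_enable for Services of type LoadBalancer. A hedged sketch of the imperative version, assuming a DaemonSet install (the same env vars go in the static-pod manifest otherwise):

```bash
# Keep the control-plane VIP, but stop kube-vip from handling
# Services (leave that to MetalLB)
kubectl -n kube-system set env daemonset/kube-vip-ds \
  cp_enable=true svc_enable=false
```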