Requesting a sanity check...help untangle my home network as I expand into more advanced networking?

surfrock66@lemmy.world · 8 months ago

towerful@programming.dev · edit-2 8 months ago

Hmm… A home network with /16 subnets seems insane.
Especially with gateways near, but not at, the start of the subnets.
How many clients per VLAN are you running?
Would boring /24 subnets help?
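
To put the /16 versus /24 question in numbers, here is a minimal sketch using Python's standard `ipaddress` module. The `usable_hosts` helper is a hypothetical name for illustration, and 10.2.0.0/16 is assumed from the VLAN numbering discussed later in the thread.

```python
import ipaddress

def usable_hosts(cidr: str) -> int:
    """Usable host addresses in an IPv4 network (hypothetical helper).

    Subtracts the network and broadcast addresses, so it is only
    meaningful for prefixes of /30 or shorter.
    """
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - 2

for cidr in ("10.2.0.0/16", "10.2.2.0/24"):
    print(cidr, usable_hosts(cidr))
```

A /16 leaves room for 65,534 hosts per VLAN, while a /24 leaves 254, which is usually plenty for a single home VLAN.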

I’ve also never played around with an L3 switch. Routing on a switch seems like a budget hack for when the actual L3 routing device isn’t powerful enough, or for when the L3 work isn’t complex (i.e. switching subnets of IPs, although I imagine that’s more for BGP hardware-accelerated devices). It seems like an easy way to tie yourself in knots and accidentally allow access (or block access) when you shouldn’t.
But I’ve never had a router/firewall that can’t keep up with my demands. Then again, I’ve never had more than 1 Gbps WAN, and internal networking doesn’t need as much processing to keep up.

My only guess from my limited knowledge of L3 switches is…
Can one VLAN access another VLAN? If so, what’s its route?
Are there asymmetrical rules that aren’t stateful? I don’t know if an L3 switch tracks the state of cross-subnet/VLAN connections, allowing packets to return.
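
To illustrate why asymmetric rules matter without state tracking, here is a toy sketch. It is a simplification: real connection tracking works on full address/port/protocol tuples, and the names `stateless_allows`, `StatefulFilter`, `vlan2` and `vlan3` are all made up for the example.

```python
def stateless_allows(rules, src, dst):
    """A stateless filter only checks the rule list for this direction."""
    return (src, dst) in rules

class StatefulFilter:
    """Toy stateful filter: remembers allowed flows and permits their replies."""

    def __init__(self, rules):
        self.rules = set(rules)
        self.flows = set()

    def allows(self, src, dst):
        if (src, dst) in self.rules:
            self.flows.add((src, dst))  # record the flow for return traffic
            return True
        # Not explicitly allowed: only pass it if it is a reply to a known flow.
        return (dst, src) in self.flows

# Asymmetric rule set: vlan2 may talk to vlan3, with no rule the other way.
rules = {("vlan2", "vlan3")}

print(stateless_allows(rules, "vlan2", "vlan3"))  # request passes
print(stateless_allows(rules, "vlan3", "vlan2"))  # reply has no matching rule

fw = StatefulFilter(rules)
print(fw.allows("vlan2", "vlan3"))  # request passes and is remembered
print(fw.allows("vlan3", "vlan2"))  # reply matches the remembered flow
```

With the stateless check the reply is dropped, so the connection looks broken even though the outbound rule exists. The stateful version lets the reply through because it saw the request first.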

As for why your firewall cannot ping out: it sounds like it has an issue with its upstream gateway, doesn’t know its next hop, or you are not letting traffic out of the firewall itself.

Have you tried some Wireshark/tcpdump captures? Can you mirror a port on your switch to help debug?

Honestly, I don’t know why you don’t do router-on-a-stick.
Have OPNsense run the VLANs over 1 physical port (or a 2-port LAG), have the switch distribute the VLANs, and let OPNsense handle L3.
When you set up OPNsense, have the initial config use an otherwise unused port for LAN. If you ever lock yourself out, use that as emergency access.
1 port for WAN, and 1 port (or a 2-port LAG) for the local trunk.

As for config issues I can see…
OPNsense gateways are its upstream gateways.
I don’t know why you would have RFC 1918 addresses set as gateways.
OPNsense’s upstream gateway is normally provided by PPPoE/DHCP from your ISP. That’s where it sends unknown packets to… unless you have a static route for 0.0.0.0/0 set to the ISP’s provided gateway.

I feel like you are misunderstanding gateways. Any gateway for a subnet would be set on its DHCP server, to tell clients where they should send unknown packets. OPNsense doesn’t care about that; the clients do. The clients then know where to send their non-local packets: their DHCP-assigned (or statically assigned) gateway, normally OPNsense’s static IP for the VLAN. Considering you have an L3 switch, I imagine it wants to act as the gateway for its known VLANs so it can do local L3 things, and it would forward non-local packets to its assigned gateway: OPNsense’s static IP for the VLAN. OPNsense then gets the packet and forwards it to ITS gateway, which would be your ISP (likely from PPPoE or DHCP).
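
The client-side decision described above can be sketched in a few lines. This is only an illustration: the `next_hop` function, the client address 10.2.2.50/16 and the gateway 10.2.1.254 are assumptions, not values taken from the actual config.

```python
import ipaddress

def next_hop(iface_cidr: str, gateway: str, dst: str) -> str:
    """Where a host sends a packet: straight to dst if it is on-link,
    otherwise to the host's configured default gateway."""
    iface = ipaddress.ip_interface(iface_cidr)
    dst_ip = ipaddress.ip_address(dst)
    if dst_ip in iface.network:
        return str(dst_ip)   # same subnet: deliver directly at L2
    return gateway           # different subnet: hand off to the gateway

# Hypothetical client on the 10.2.0.0/16 VLAN.
client = "10.2.2.50/16"
gw = "10.2.1.254"

print(next_hop(client, gw, "10.2.2.213"))  # on-link destination
print(next_hop(client, gw, "8.8.8.8"))     # off-link destination
```

The first call returns the destination itself and the second returns the gateway. The client never needs to know what the gateway does with the packet afterwards.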

Honestly, I don’t know where to start with this.
Maybe it’s because I’ve never done anything as complex as this, or because I’ve never complicated things this much.
I’d suggest you draw up your requirements, and think about redesigning towards those.
How many VLANs? How many clients per VLAN? Max bandwidth requirements? Can high-bandwidth connections be solved by multi-homing a service, so L2 deals with it? What’s the actual throughput of your firewall? And so on.

Edit: sorry for the wall of text. I honestly didn’t know what to concentrate on. Dropping an L3 switch into an odd home network just explodes the possibilities.

surfrock66@lemmy.world · 8 months ago

OK, a lot to go over. The /16 thing is just history; before I started this, I actually had a full /16 for my whole house as I thought I’d have hundreds of IoT devices one day, and used that third octet as a logical separator. I’ve kind of got that stuck in my head, so when I moved to a 10.x system, I made the VLANs/subnets the 2nd octet because I have so much IP memory of that third and fourth octet. It’s unnecessary, but tbh I know most of my IPs by heart, and I went into this trying to drive complexity up a bit to further my learning. I don’t think changing them to /24 would necessarily solve the problem, because the complexity wouldn’t really collapse much. It’s things like: the 3 network is for our Minecraft servers/services, and 10.3.2.* represents the main one, 10.3.3.* represents the one my son runs, etc. It’s just muscle memory at this point. The L3 stuff is mostly good I think; I’m mostly concerned about the firewall.

I know that OPNsense can be the L3 device, but 1) part of my learning in this was to use kind of raw CLI switch commands and not some web UI, and 2) I had the original L3 device before adding the OPNsense box (I used to have a Comcast modem as the upstream from the L3; now the L3 has a 0.0.0.0/0 route to OPNsense, and that should upstream to the Comcast device). I have a full VM dedicated as my DHCP/DNS device running bind9 and isc-dhcp-server, which has been maintained for over 10 years; I’m not looking to offload that to another device, and it works flawlessly on the LAN (with an IP helper on each VLAN).

I am definitely confused about how it does gateways. My understanding is that in OPNsense, gateways are part of the route definition, so you define the OPNsense gateways to point to the gateways on the L3 device; they’re not on the OPNsense box itself. When you add an interface, you select the default gateway for that interface from a dropdown consisting of the gateways you defined elsewhere. Where I get goofed up and lock myself out is when I change the “upstream” checkbox or mess with the priority. I don’t know how it selects one or the other as “active” either. I’ve iterated on that a lot; the further I get, the more it feels like the obfuscation of OPNsense is adding to my complexity rather than reducing it.

It seems the only things having routing problems are packets originating from an interface on the OPNsense box itself; things on the LAN reach the WAN, things on the WAN reach the LAN, but WireGuard clients terminating at the OPNsense box can’t hit the WAN, and the OPNsense box can’t hit the LAN (despite passing traffic).

tuxed@sh.itjust.works · 8 months ago

First off, if your firewall can ping 8.8.8.8 it can access the WAN, as 8.8.8.8 (hopefully, or you have bigger issues) is on the WAN. It not being able to do updates etc is probably a DNS issue in that case, probably caused by your firewall not being able to access your DNS server due to improper configuration on either the firewall, the switches or the DNS server itself.

Is your DNS server allowing clients coming from subnets other than its own? Can your Wireguard clients also ping 8.8.8.8? If so, they probably share the DNS issue with your firewall.

I would recommend trying to debug this iteratively, as this sort of problem has a lot of potential error sources that are hard to spot no matter how many screenshots you provide, like the configuration of your switches and DNS server. Try this:

Computer A can’t reach computer B. What is the IP of A? What is the subnet of A? If it is different from the subnet of B, what route should it take to reach B? What is the next step on that route? Can we successfully reach this next step? Does the next hop on the route know where to go to reach the subnet for B? If so, what is the next step? Repeat until we’ve reached B, ideally ensuring each step on the way is acting as it should, either through something like Wireshark or the built-in tools of your firewall/switch/gateway/etc.
Assuming the problem hasn’t been found, repeat from B to A, as responses might not reach us, resulting in a broken connection even if we can reach B.
If the routing makes sense, is there a firewall on the way that doesn’t allow us to reach B from A? Can we instead reach A from B? If not, we’ve found the problem.
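
The hop-by-hop walk above can be sketched as a small simulation. Everything here is hypothetical: the device names, the routing tables and the `pick`/`trace` helpers are stand-ins meant to show the procedure, not a reconstruction of the real network.

```python
import ipaddress

def pick(routes, dst):
    """Next hop of the most specific route matching dst, or None."""
    ip = ipaddress.ip_address(dst)
    best = None
    for cidr, hop in routes:
        net = ipaddress.ip_network(cidr)
        if ip in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, hop)
    return None if best is None else best[1]

def trace(tables, start, dst, max_hops=8):
    """Follow next hops from `start` until a device has dst on-link
    ("direct"), a device has no route, or we give up."""
    path = [start]
    for _ in range(max_hops):
        hop = pick(tables.get(path[-1], []), dst)
        if hop is None:
            return path, "no route"
        if hop == "direct":
            return path, "delivered"
        path.append(hop)
    return path, "too many hops"

# Hypothetical tables: each entry is (destination network, next device).
tables = {
    "opnsense": [("0.0.0.0/0", "isp"), ("10.99.0.0/16", "direct"),
                 ("10.0.0.0/8", "l3switch")],
    "l3switch": [("0.0.0.0/0", "opnsense"), ("10.2.0.0/16", "direct"),
                 ("10.99.0.0/16", "direct")],
}

forward = trace(tables, "opnsense", "10.2.2.213")  # A to B
reverse = trace(tables, "l3switch", "10.99.1.40")  # B's gateway back to A
print(forward)
print(reverse)
```

Running the trace in both directions mirrors the advice: a path that works from A to B says nothing about whether replies can get back. Filling the tables in from the real devices is the actual debugging work; the code only keeps the bookkeeping honest.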

I would strongly recommend drawing your network layout (or at least the route you are trying to debug) in a flowchart tool (diagrams.net being a good option), as it is extremely hard to keep track of everything otherwise.

surfrock66@lemmy.world · 8 months ago

OK, it’s definitely an issue with the firewall not sending traffic from itself to the LAN. It’s weird: it’s passing traffic, but it cannot ping or access anything on the LAN, including things on the 99 VLAN (so its local VLAN). The DNS requests are for sure failing from the firewall…but they work fine for the rest of the LAN. Any client can get a DNS response from the DNS server on the 2 VLAN, and can access the resulting site.

For now, I’m just excluding the WireGuard thing; I think it’s a distraction from the problem that the firewall has some sort of bad routing going on.

I have a diagram, but at this point the problem is pretty local to the firewall itself and I think it’s around the gateway/route configuration. I got some advice on the OPNsense forum that my static routes are wrong; they say to make a single static route of 10.0.0.0/8 instead of one for each VLAN, and to turn off “upstream gateway” on the LAN GW (when I do that I lose all WAN connectivity…which is a concern, but I can revert). When I do the CLI configuration and assign an IP for LAN, it asks if I want to set a gateway; it kind of says “it should be yes for WAN, but no for LAN”, but if I say no, I can’t access the internet from any clients, and if I say yes, it ticks “upstream gateway” on the LAN gateway. Something is awry, but I’m gonna try again after making some static route changes.
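
As a quick check on the single-route advice, this sketch confirms that one 10.0.0.0/8 route covers each per-VLAN /16. The VLAN numbers 2, 3 and 99 are the ones mentioned in the thread; the `covered_by` helper is a made-up name, and the full VLAN list is an assumption.

```python
import ipaddress

def covered_by(supernet: str, networks) -> bool:
    """True if every network in `networks` falls inside `supernet`."""
    parent = ipaddress.ip_network(supernet)
    return all(ipaddress.ip_network(n).subnet_of(parent) for n in networks)

# Per-VLAN subnets following the 10.<vlan>.0.0/16 scheme from the thread.
vlan_nets = [f"10.{vlan}.0.0/16" for vlan in (2, 3, 99)]

print(covered_by("10.0.0.0/8", vlan_nets))
print(covered_by("10.0.0.0/8", ["192.168.1.0/24"]))
```

The first check is true and the second is false. Note that the firewall’s directly connected subnet is more specific than the /8, so it would still be preferred for local traffic.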

tuxed@sh.itjust.works · 8 months ago

If the firewall can’t reach the LAN, either because of a firewall rule or bad routing, it will not be able to access the DNS server even if it works well for the rest of the LAN. I’m assuming that the rest of the LAN talks to the DNS server directly and not through the firewall.

It sounds like you would benefit from reading a bit about how routing and gateways work, as it seems like you’re mostly trying stuff without really knowing what it does. Please save yourself some sanity and do some proper planning on your different subnets, their VLANs and how they should route their traffic, ideally in a diagram of some sort.

Without knowing your exact setup, I’m getting a feeling that your current configuration is both overly permissive and overly restrictive, meaning you can’t access the things you want but any potential attackers can probably get around just fine.

I would seriously consider tearing it down and starting over with a more cohesive plan, but I know that might not be possible for you time-wise. On the other hand, having a well planned network that you understand would almost certainly save you time in the long run, especially if you want to keep doing more advanced and unorthodox stuff to it.

surfrock66@lemmy.world · 8 months ago

I probably need to burn it down and restart, but I need to find a time the family will tolerate an extended outage. I did share some things on the opnsense forum though which might be useful here.

My diagram. At the bottom you will see why I have /16; in truth, it’s from back when I only had a single subnet, and I made it /16 so I could use the third octet to form DHCP scopes. That’s how the network worked in my head and I knew the IP scheme, so when it came time to add VLANS much later, I just made those the 2nd octet, and that’s how we are here today. Maybe one day I’ll re-do that, but it’s not in scope right now: https://nextcloud.surfrock66.com/s/txnZdzxHaiA5t65
I did an experiment with static routes last night. I have the static route in, so I untagged the “LAN_GW” as an upstream gateway, and tagged “WAN_GW” as an upstream gateway. No change in the ability for opnsense to ping anything (it can ping WAN, not LAN), however all my LAN clients lost internet. In this state, from opnsense, I ran a “ping -S 10.99.1.40 10.2.2.213” (that’s my DNS server). This failed, but interestingly enough I was looking at the live logs, and even though the interface is LAN, the source IP was the WAN IP. I’m very confused; I’ve confirmed the LAN and WAN interfaces are correct and they have correctly assigned default gateways. See the attached picture. This would make sense; is opnsense doing something to switch the LAN and WAN somehow? I’m blown away how this is the case; that being said, it makes sense that tagging the LAN interface as upstream allows traffic out.

It feels like somehow OPNsense is treating LAN like WAN or something? I don’t know, the obfuscation feels like it’s hiding things. A “ping -S 10.99.1.40 10.2.2.213” shouldn’t show in the logs with a source of the WAN address, right???

tuxed@sh.itjust.works · 8 months ago

Okay, I think I know (at least one of) the problem(s).

It is sending the ping from the WAN interface because that is your default route, and you either don’t have a route specified for your 10.2.x.x network or you’re overwriting it with a different route (I’m guessing the first option).

I.e. you need to tell your firewall “if you want to reach an IP address in 10.2.x.x you need to go through here”, with “here” probably being either your managed switch if it works as a gateway (10.6.1.254?) or an interface on your router if it works as a switch (10.6.1.41?).
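
A minimal sketch of that hypothesis, assuming the firewall picks routes by longest-prefix match. The tables are invented for illustration (only 10.99.1.254 and 10.2.2.213 come from the thread), and `lookup` is a hypothetical helper, not an OPNsense API.

```python
import ipaddress

def lookup(routes, dst):
    """Longest-prefix match: next hop of the most specific matching route."""
    ip = ipaddress.ip_address(dst)
    matches = [
        (ipaddress.ip_network(cidr), hop)
        for cidr, hop in routes
        if ip in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    # The route with the longest prefix (most specific network) wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Hypothetical firewall table: default route out the WAN plus the connected LAN.
without_route = [("0.0.0.0/0", "WAN_GW"), ("10.99.0.0/16", "on-link LAN")]
# Same table with an explicit route to the 10.2.x.x network via the L3 switch.
with_route = without_route + [("10.2.0.0/16", "10.99.1.254")]

print(lookup(without_route, "10.2.2.213"))
print(lookup(with_route, "10.2.2.213"))
```

Without the specific route, the only match for 10.2.2.213 is the default route, so the packet heads out the WAN. With it, the /16 beats the /0 and the packet goes to the L3 switch instead.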

surfrock66@lemmy.world · 8 months ago

I’m totally with you…and I have that, which is why I think I’m hitting some sort of bug, or a firewall rule that is somehow breaking this:

tuxed@sh.itjust.works · 8 months ago

Have you tried setting the gateway to one of your LAN interfaces? And what happens if you ping 10.99.1.254 from the firewall?

surfrock66@lemmy.world · 8 months ago

If I go to my LAN interface and set the gateway to “LAN_GW” at 10.99.1.254, everything works (but I can’t ping anything on the LAN from the firewall itself, including the client I’m ssh’d from). If I set that to Auto, all LAN clients lose WAN access.

I’ve got a backup, but I think I’m gonna try to rebuild from scratch :/ I just worry I’m gonna end up in the same spot since I don’t understand how it all got here and don’t know what to avoid.