Channel Access to Controls networks ISAC and ARIEL slow

 

Description of the Issue:

EPICS Channel Access requests to the ISAC and ARIEL controls networks were slow or timing out when issued from network segments outside of the controls network.

 

Resolved:

This was resolved at 11:30 Tuesday March 5th.

 

Reason (Gory details, mostly for me but recorded for anyone interested):

The above issue was a known issue under investigation for the new cycepics controls network. A 50% packet loss was being observed for channel-access requests from Ggeng. The issue was identified on only one of the two redundant links and only for channel-access requests (UDP directed broadcasts); ping, SSH and other TCP traffic were not affected. A temporary workaround was to disable one of the two redundant links into the core routers, between the CCS controls switch Bay15 and the core router TR1 (ISAC2).

Two weeks ago, on Sunday February 18th, the two core firewalls were upgraded. It was noted after the upgrade that, to restore 100% connectivity for channel-access to cycepics, the opposite LAG (MoB) had to be disabled. It was also noted that after the upgrade the Master Routing Engine for the firewall cluster had switched from node1 to node0.

Now... two weeks later, the core routers were upgraded on Sunday morning March 4th. When the routers were upgraded, the Master Routing Engine in the HA-Cluster changed from member1 to member0. It was then observed that cycepics channel-access traffic was okay with both LAGs enabled, but the ISAC and ARIEL channel-access requests, previously not a problem, were now slow or timing out.

When the redundant LAG to TR1 (ISAC2) was disabled, full connectivity for channel-access was restored.

Conclusion:

The present Core Router/Firewall architecture and configuration has an impact on the UDP directed broadcast packets used by the EPICS channel-access controls software. Depending on which node in the firewall cluster is the Master and which node in the router cluster is the Master, traffic does not traverse one of the LAGs. This results in a 50% packet loss.
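
For anyone wondering why the loss sits at roughly 50% rather than all or nothing, here is a minimal toy model in Python. It is a sketch under stated assumptions, not a description of how Junos actually distributes traffic: it assumes each channel-access search request leaves from a fresh UDP source port, that the source port alone decides which of the two redundant paths is used, and that one path silently drops the packet. The function name simulate_loss and its parameters are hypothetical, made up for this illustration.

```python
import random


def simulate_loss(num_packets, links=2, dead_links=(1,), seed=0):
    """Return the fraction of packets lost in a toy two-path model.

    Assumptions (not taken from Junos documentation):
    - every request uses a new, effectively random UDP source port;
    - the path is chosen as source_port modulo the number of links;
    - any path listed in dead_links silently drops the packet.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    lost = 0
    for _ in range(num_packets):
        src_port = rng.randint(1024, 65535)
        member = src_port % links  # toy stand-in for the real hash
        if member in dead_links:
            lost += 1
    return lost / num_packets


if __name__ == "__main__":
    # Both redundant paths enabled, one of them not passing the broadcasts.
    print("both links enabled:", simulate_loss(10000))
    # Workaround: only one path left, and it is the working one.
    print("one link disabled: ", simulate_loss(10000, links=1, dead_links=()))
```

With both paths enabled the modelled loss comes out close to one half, and with the redundant path removed it drops to zero, which matches the behaviour seen before and after the workaround. Real hashing uses more header fields and is vendor specific, so treat this only as an intuition aid.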

Under normal operating conditions node0 is the Master of the firewall HA-Cluster and member0 is the Master for the Routing HA-Cluster. The work-around to provide 100% channel-access to all controls networks is to disable the redundant link to TR1 (ISAC2).

This appears to be a bug involving Junos OS and the present Firewall/Routing HA-cluster topology used at TRIUMF. A ticket has been opened with the Vendor to address the issue.

It should be noted that this appears to impact only channel-access requests from outside of the controls networks, i.e. from Ggeng Controls development and from Experimenters on the Main TRIUMF network. It does not affect channel-access requests within the controls network.
