Multicast hanging for multiple PCs

theopenem_admin

The only other thing I could think of is to change the multicast block size. In your sender arguments try using:

--blocksize 700

Eruthon

@theopenem_admin after setting this option in the Com Server Multicast settings, when I click start multicast for a group it says "Could not start the multicast application"

Eruthon

Fixed it - I had to add a space before "--blocksize"; without it the option was glued onto the previous argument in the command:

07-27-22 08:01 Starting Multicast Session With The Following Command:
cmd.exe /c ""C:\Program Files\Theopenem\Toec-API\\private\apps\udp-sender.exe" --file "O:/\images\Win10-EduPro-IDE_Test\hd0\part2.ntfs.lz4" --portbase 9370 --min-receivers 2 --ttl 32 --interface X.X.143.115 --mcast-rdv-address X.X.143.115--blocksize 700"
Session Started And Was Forced To Quit, Try Running The Command Manually
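The log above shows "--blocksize" fused onto the preceding IP address. Here is a minimal Python sketch of that effect. It assumes the extra sender arguments are appended to the generated command without a separator; that is a guess from the log, not Theopenem's actual code, and the function names are hypothetical.

```python
import shlex

# Illustration only: the real command is generated by Theopenem.
# BASE_ARGS copies a few arguments from the log above.
BASE_ARGS = "--portbase 9370 --min-receivers 2 --ttl 32 --mcast-rdv-address X.X.143.115"


def build_sender_args(extra: str) -> list[str]:
    """Append the user-supplied arguments exactly as typed, then tokenize."""
    return shlex.split(BASE_ARGS + extra)


def safe_sender_args(extra: str) -> list[str]:
    """Same, but always put exactly one space in front of the extra arguments."""
    return shlex.split(BASE_ARGS + " " + extra.strip())


broken = build_sender_args("--blocksize 700")
fixed = safe_sender_args("--blocksize 700")

print(broken[-2:])  # the option is fused onto the IP address
print(fixed[-3:])   # the option is a separate token
```

With the leading space the option becomes its own token, which matches what fixed the session start here.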

hodgesc

FWIW, I have the same situation, and the only thing that seemed to work for me was to make sure that all storage settings were set to local, on the main server, and secondary com servers, and to make sure upload/deploy direct to SMB was disabled. Works like a charm after that.

Eruthon

@hodgesc I've got the system as a Proxmox VM with a 50GB C:\ drive and a 1TB O:\ drive, which holds the images. I don't use any secondary servers (although I think having one server per VLAN would work, I don't think it's the right way to do it, as I have around 5 VLANs) and I've never used the SMB option. Also, "deploying via SMB" is a thing?

hodgesc

@eruthon Upload/Deploy Direct to SMB is the last option under Admin Settings --> Imaging Client. To multicast across VLANs, you'd have to have a com server on each vlan and run the multicast session through there.

Eruthon

@hodgesc Does the com server have to be a standalone Windows server with its own installation of TOEM, or could it be a single Windows server with one TOEM install, set up with multiple network interfaces, where I'd just add a separate com server for each of those networks?

hodgesc

@eruthon What I think you'd actually have to do, if you just want to use one actual system to run this on multiple network interfaces, is:

1. Clone the TOEC-API site for each IP.
2. Bind each IP to the site in IIS.
3. Set up a "com server" for each IP in the UI.
4. Modify the web.config for each com server and verifydb.
5. Add them to the default cluster as passive servers with the multicast, TFTP, and imaging options.
6. (edit) Go into HeidiSQL and change the root account to have remote access. (almost forgot that)

That should do it I think.

theopenem_admin

@hodgesc
You are 100% correct. Love seeing people with an understanding of how things work, not to discount others that are still learning.

Eruthon

@hodgesc will try, thanks for steps, will let you know if it works!

hodgesc

@theopenem_admin I wish I could say I was an expert, but I literally just started using this last Thursday. However, I've been in systems management for 16 years, so I've seen enough to figure things out quickly.

Eruthon

@hodgesc ok..

1. I opened the IIS config file (c:\windows\system32\inetsrv\config\ApplicationHost.config) and copied the Toec-API site entry, naming the copy Toec-API-129.
2. Every site until now was bound to "All Unassigned" interfaces; I gave the new one the IP address X.X.129.253.
3. I created a new com server with the new IP address and gave it the same local storage location as the main server.
4. I copied the Toec-API folder as Toec-API-129 and changed the com server ID in web.config to the one the web UI generated for the new com server.
5. I added the new com server to the cluster and gave it those 3 functions.

I know I forgot the steps for database, but I'd like to know what exactly to do, please.

Now I can successfully connect to the web interface on both IPs, but PXE boot for PCs on the new VLAN runs into a PXE-E99 error (it worked for every PC on the main IP; I added "next-server X.X.129.253" in my isc-dhcp config for the VLAN). If I leave the old server IP for all VLANs, it gives a server timeout instead.

hodgesc

@eruthon Couple of quirks I forgot about.

Some permissions don't seem to copy over.

1. You'll need to make sure that IIS_IUSRS has read and execute permissions for the Toec-API-129 folder.
2. Within the Toec-API-129 folder, you'll need to make sure IIS_IUSRS has modify rights to the "private" folder. Be sure to apply permissions to child objects.
3. Open HeidiSQL, enter the root password (located in the connection string in the Toems-API web.config file), and click Open. At the top, click the user manager icon (looks like 2 people). Click root (the one whose host is your server name) and change the "from host" setting to "Access from anywhere".

Eruthon

@hodgesc

added the permissions for the folders (IIS_IUSRS was missing; now they match the original folders);
changed the settings for database, now it shows "%" character

The problem with PXE boot continues though... I can't PXE boot from any VLAN other than the one with the first network interface. It seems like the clients can't read from the TFTP server (which is enabled for both com servers in their configs and in their cluster settings).

hodgesc

@eruthon I don’t use PXE yet, but I believe you need to go into the group settings and assign the machines to the com server on their vlan.

theopenem_admin

You probably need to modify your tftp server.
Program files\Theopenem\tftpd32\tftp64_gui
Open settings and make sure that bind tftp to this address is unchecked

Eruthon

@theopenem_admin it shows the client when it tries to download pxeboot.0, but it shows ERR in progress bar and in the logs, although I didn't do anything:

Peer returns ERROR <User aborted the transfer> -> aborting transfer [09/08 21:37:25.927]

Edit: also, the bind-to-interface option was already disabled

Edit: EFI started working for the other VLANs, except the one with the new com server

hodgesc

@eruthon I don't know if this plays a role in this or not, but my understanding is that PXE settings are pushed to vlans via DHCP, so if machines on that vlan are still trying to only contact the main rig, it may be a DHCP configuration issue.

Eruthon

@hodgesc I played with it a bit
I have ISC DHCP: global next-server set to the main IP and for VLAN 129 I have dedicated next-server on the second IP
Also I have classes in DHCP server to distinguish legacy and UEFI clients and direct them to their respective boot files, and I have proxy DHCP setting turned on in TOEM, ofc

I couldn't look into it today, and I'll probably have time on Friday or next week, but I can't see any problem network-wise. Finding some weird setting in our Cisco 3560G would be the easiest fix, but multicast for campus TV is working flawlessly, and so is FOG multicast deployment. And I don't have enough resources to run a separate Windows server on every VLAN...
I thought about setting up the SMB option, too, but you mentioned it just creates another headache generator to tackle, so that won't help, I guess, but I could try it at least.

Eruthon

@theopenem_admin @hodgesc
Well, I tried different scenarios, none of which worked perfectly...
When I boot PXE, it either:

1. shows a server timeout (this happens when I contact the new VLAN129 server)
   a. the tftpd logs show multiple tries with 0% progress, then ending in ERR
2. downloads the NBP file successfully, but doesn't download the kernel (says "connection reset")
   a. in tftpd this shows as: peer returns error, user aborted the transfer
3. only once did I successfully get to the Linux environment: I set up multicast for 2 clients, joined them through the PXE menu, and got to approximately 67% of a 30GB image. Then it got stuck, and sadly I never managed to get that far again

I tried restarting the tftpd32 service many times after changing the config in the com servers and clusters. Tried disabling the TFTP information server for the second com server. Tried changing the next-server IPs for the given VLAN. Tried different options in the tftpd64 GUI.
Of course there are many more combinations I tried, but none worked, except for that one time, when the settings were exactly as @hodgesc wrote. Even then it worked only once: after restarting the multicast and booting the clients again, it went back to the timeouts and failed kernel downloads...

The worst thing now is that even if I bind TFTP to only the first interface, delete the next-server option for the new com server, remove the com server from all configs, stop the API for the new com server, and disable the second network interface, it doesn't even work as it did before (same timeouts and problems downloading the kernel).
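To check from another machine on the VLAN whether a given com server IP answers TFTP at all, a small probe can send a single read request and report the first reply. This is a rough Python sketch built on the TFTP RFCs (1350, 2347-2349). The function names are made up for illustration, the file name is the one from the tftpd log, and the server IP you pass in is whatever your com server uses.

```python
import socket
import struct

# TFTP opcodes from RFC 1350 / RFC 2347
RRQ, DATA, ERROR, OACK = 1, 3, 5, 6


def build_rrq(filename: str, blksize: int = 1468) -> bytes:
    """Build a read request in octet mode with tsize and blksize options."""
    fields = [filename, "octet", "tsize", "0", "blksize", str(blksize)]
    return struct.pack("!H", RRQ) + b"".join(f.encode("ascii") + b"\0" for f in fields)


def parse_reply(packet: bytes):
    """Classify the first packet the server sends back."""
    (opcode,) = struct.unpack("!H", packet[:2])
    if opcode == OACK:
        parts = packet[2:].split(b"\0")[:-1]  # drop the empty tail after the last NUL
        keys = [p.decode("ascii").lower() for p in parts[0::2]]
        values = [p.decode("ascii") for p in parts[1::2]]
        return "OACK", dict(zip(keys, values))
    if opcode == ERROR:
        (code,) = struct.unpack("!H", packet[2:4])
        return "ERROR", (code, packet[4:].rstrip(b"\0").decode("ascii", "replace"))
    if opcode == DATA:
        return "DATA", len(packet) - 4  # payload size of the first block
    return "UNKNOWN", opcode


def probe(server_ip: str, filename: str, timeout: float = 3.0):
    """Send one RRQ to UDP port 69; return the parsed reply, or None on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_rrq(filename), (server_ip, 69))
        try:
            packet, server_addr = sock.recvfrom(2048)
        except socket.timeout:
            return None
        # End the transfer politely so the server does not wait for an ACK.
        sock.sendto(struct.pack("!HH", ERROR, 0) + b"probe done\0", server_addr)
        return parse_reply(packet)
```

Calling `probe("<com server IP>", "proxy/efi64/pxeboot.0")` and getting `None` would suggest the request or the reply is being dropped somewhere between the client VLAN and the server. An OACK reply would suggest basic TFTP reachability is fine and the problem is later in the boot chain.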

EDIT:
tftpd also shows in some cases: TIMEOUT waiting for Ack block #0 [16/08 12:21:50.206]

Error communicating with VLAN129 com server (everything set up according to guide):

Connection received from X.X.129.238 on port 2027 [16/08 12:50:37.308]
Read request for file <proxy/efi64/pxeboot.0>. Mode octet [16/08 12:50:37.308]
OACK: <tsize=882048,blksize=1468,> [16/08 12:50:37.308]
Using local port 54070 [16/08 12:50:37.308]
Peer returns ERROR <User aborted the transfer> -> aborting transfer [16/08 12:50:37.308]

Error downloading kernel log (TFTP server for the VLAN129 com server disabled in the com server cluster):

Connection received from X.X.129.215 on port 2041 [16/08 12:40:45.683]
Read request for file <proxy/efi64/pxeboot.0>. Mode octet [16/08 12:40:45.683]
OACK: <tsize=882048,blksize=1468,> [16/08 12:40:45.683]
Using local port 52369 [16/08 12:40:45.683]
Peer returns ERROR <User aborted the transfer> -> aborting transfer [16/08 12:40:45.683]
Connection received from X.X.129.215 on port 2042 [16/08 12:40:45.763]
Read request for file <proxy/efi64/pxeboot.0>. Mode octet [16/08 12:40:45.763]
OACK: <blksize=1468,> [16/08 12:40:45.763]
Using local port 52370 [16/08 12:40:45.763]
<proxy\efi64\pxeboot.0>: sent 601 blks, 882048 bytes in 0 s. 0 blk resent [16/08 12:40:45.953]
Connection received from X.X.129.215 on port 59651 [16/08 12:40:49.710]
Read request for file <proxy/efi64/pxelinux.cfg/default.ipxe>. Mode octet [16/08 12:40:49.710]
OACK: <blksize=1432,tsize=1141,> [16/08 12:40:49.710]
Using local port 50690 [16/08 12:40:49.710]
<proxy\efi64\pxelinux.cfg\default.ipxe>: sent 1 blk, 1141 bytes in 0 s. 0 blk resent [16/08 12:40:49.710]
Connection received from X.X.129.215 on port 60599 [16/08 12:40:49.710]
Read request for file <proxy/efi64/pxelinux.cfg/01-30-9c-23-6a-6a-40.ipxe>. Mode octet [16/08 12:40:49.710]
File <proxy\efi64\pxelinux.cfg\01-30-9c-23-6a-6a-40.ipxe> : error 2 in system call CreateFile The system cannot find the file specified. [16/08 12:40:49.710]
Connection received from X.X.129.215 on port 4576 [16/08 12:40:49.710]
Read request for file <proxy/efi64/pxelinux.cfg/01-.ipxe>. Mode octet [16/08 12:40:49.710]
File <proxy\efi64\pxelinux.cfg\01-.ipxe> : error 2 in system call CreateFile The system cannot find the file specified. [16/08 12:40:49.710]

This is shown during every successful PXE boot, but the client then tries to download the kernel from the VLAN129 com server, which resets the connection and ends the PXE boot.
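In the log above, the first request for pxeboot.0 is aborted right after the OACK and the second one completes, so an abort line on its own does not necessarily mean the boot failed. To make that easier to see across a long log, here is a small Python sketch that groups each read request with its outcome per client. The regular expressions are written against the line formats shown above, and other tftpd versions may log differently.

```python
import re
from collections import defaultdict

CONNECT = re.compile(r"Connection received from (\S+) on port (\d+)")
REQUEST = re.compile(r"Read request for file <([^>]+)>")
ABORT = re.compile(r"Peer returns ERROR <([^>]*)>")
SENT = re.compile(r"<[^>]+>: sent \d+ blks?, (\d+) bytes")
MISSING = re.compile(r"error 2 in system call CreateFile")


def summarize(log_text: str) -> dict:
    """Return {client_ip: [(requested_file, outcome), ...]} in log order."""
    result = defaultdict(list)
    client = None   # IP of the most recent "Connection received" line
    current = None  # file named in the most recent unanswered read request
    for line in log_text.splitlines():
        if m := CONNECT.search(line):
            client, current = m.group(1), None
        elif m := REQUEST.search(line):
            current = m.group(1)
        elif client and current:
            if ABORT.search(line):
                outcome = "aborted"
            elif SENT.search(line):
                outcome = "sent"
            elif MISSING.search(line):
                outcome = "missing"
            else:
                continue  # OACK / "Using local port" lines carry no outcome
            result[client].append((current, outcome))
            current = None
    return dict(result)
```

Feeding it the tftpd log text gives one list per client IP. A client whose list ends with "sent" for the boot file and the .ipxe config but never requests a kernel from this server would fit the description above, where the kernel is fetched from the other com server.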