Multicast hanging for multiple PCs


  • @hodgesc I played with it a bit
    I have ISC DHCP: global next-server set to the main IP and for VLAN 129 I have dedicated next-server on the second IP
    Also I have classes in DHCP server to distinguish legacy and UEFI clients and direct them to their respective boot files, and I have proxy DHCP setting turned on in TOEM, ofc

    I couldn't look into it today, and I'll probably have time on Friday or next week. I can't see any problem network-wise, though. Finding some weird setting in our Cisco 3560G would be the easiest way to handle this, but multicast for campus TV is working flawlessly, and so is the FOG multicast deployment. And I don't have enough resources to run Windows servers on every VLAN...
    I thought about setting up the SMB option, too, but you mentioned it just creates another headache generator to tackle, so that won't help, I guess, but I could try it at least.
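    The legacy/UEFI split mentioned above is usually keyed on DHCP option 93 (client system architecture, RFC 4578). A minimal sketch of that decision in Python, just to make the mapping explicit. The efi64 path is the one that appears in the tftpd logs in this thread; the BIOS path and the function name are assumptions for illustration, not taken from the actual dhcpd.conf.

```python
# DHCP option 93 (client system architecture) values as listed in RFC 4578.
LEGACY_BIOS = 0   # Intel x86PC
EFI_IA32 = 6      # 32-bit UEFI
EFI_BC = 7        # EFI byte code, reported by many x64 UEFI firmwares
EFI_X86_64 = 9    # 64-bit UEFI

# Boot file per architecture.
# "proxy/efi64/pxeboot.0" is the file requested in the logs in this thread.
# "proxy/bios/pxeboot.0" is an ASSUMED path, used here only for illustration.
BOOT_FILES = {
    LEGACY_BIOS: "proxy/bios/pxeboot.0",
    EFI_BC: "proxy/efi64/pxeboot.0",
    EFI_X86_64: "proxy/efi64/pxeboot.0",
}


def boot_file_for(arch):
    """Return the boot file for an option 93 value, or None if unmapped."""
    return BOOT_FILES.get(arch)
```

    A client reporting an architecture that is not in the table (for example 6, 32-bit UEFI) gets None here, which corresponds to not matching any class on the DHCP server.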


  • @theopenem_admin @hodgesc
    Well, I tried different scenarios, and none of them worked perfectly...
    When I boot PXE, it either:

    1. shows timeout of the server (this happens when I contact the new VLAN129 server)
      a. the tftpd logs show multiple tries with 0% progress and then ending in ERR
    2. downloads NBP file successfully, but doesn't download kernel (says "connection reset")
      a. this shows in tftpd that peer returns error: user aborted the transfer
    3. only once did I successfully get to the Linux environment: I set up multicast for 2 clients, joined them through the PXE menu and got to approximately 67% of a 30 GB image, then it got stuck, and sadly I never managed to get there again

    I tried restarting the tftpd32 service many times after changing the config in com servers and clusters. Tried disabling TFTP information server for the second com server. Tried changing IPs of next-server for given VLAN. Tried different options in tftpd64 gui.
    Of course there are many more combinations I tried, but none worked, except for that one time when the settings were exactly as @hodgesc wrote. Even then it worked only once; after restarting the multicast and booting the clients again, it went back to the timeouts and not downloading kernels...

    The worst thing now is that even if I bind TFTP to only the first interface, delete the next-server option for the new com server, delete the com server from all configs, stop the API for the new com server and disable the second network interface, it doesn't even work as before (same timeouts and problems downloading the kernel).

    EDIT:
    tftpd also shows in some cases: TIMEOUT waiting for Ack block #0 [16/08 12:21:50.206]

    Error communicating with VLAN129 com server (everything set up according to guide):

    Connection received from X.X.129.238 on port 2027 [16/08 12:50:37.308]
    Read request for file <proxy/efi64/pxeboot.0>. Mode octet [16/08 12:50:37.308]
    OACK: <tsize=882048,blksize=1468,> [16/08 12:50:37.308]
    Using local port 54070 [16/08 12:50:37.308]
    Peer returns ERROR <User aborted the transfer> -> aborting transfer [16/08 12:50:37.308]
    

    Error downloading kernel log (TFTP server for com server of VLAN129 disabled in com server cluster):

    Connection received from X.X.129.215 on port 2041 [16/08 12:40:45.683]
    Read request for file <proxy/efi64/pxeboot.0>. Mode octet [16/08 12:40:45.683]
    OACK: <tsize=882048,blksize=1468,> [16/08 12:40:45.683]
    Using local port 52369 [16/08 12:40:45.683]
    Peer returns ERROR <User aborted the transfer> -> aborting transfer [16/08 12:40:45.683]
    Connection received from X.X.129.215 on port 2042 [16/08 12:40:45.763]
    Read request for file <proxy/efi64/pxeboot.0>. Mode octet [16/08 12:40:45.763]
    OACK: <blksize=1468,> [16/08 12:40:45.763]
    Using local port 52370 [16/08 12:40:45.763]
    <proxy\efi64\pxeboot.0>: sent 601 blks, 882048 bytes in 0 s. 0 blk resent [16/08 12:40:45.953]
    Connection received from X.X.129.215 on port 59651 [16/08 12:40:49.710]
    Read request for file <proxy/efi64/pxelinux.cfg/default.ipxe>. Mode octet [16/08 12:40:49.710]
    OACK: <blksize=1432,tsize=1141,> [16/08 12:40:49.710]
    Using local port 50690 [16/08 12:40:49.710]
    <proxy\efi64\pxelinux.cfg\default.ipxe>: sent 1 blk, 1141 bytes in 0 s. 0 blk resent [16/08 12:40:49.710]
    Connection received from X.X.129.215 on port 60599 [16/08 12:40:49.710]
    Read request for file <proxy/efi64/pxelinux.cfg/01-30-9c-23-6a-6a-40.ipxe>. Mode octet [16/08 12:40:49.710]
    File <proxy\efi64\pxelinux.cfg\01-30-9c-23-6a-6a-40.ipxe> : error 2 in system call CreateFile The system cannot find the file specified. [16/08 12:40:49.710]
    Connection received from X.X.129.215 on port 4576 [16/08 12:40:49.710]
    Read request for file <proxy/efi64/pxelinux.cfg/01-.ipxe>. Mode octet [16/08 12:40:49.710]
    File <proxy\efi64\pxelinux.cfg\01-.ipxe> : error 2 in system call CreateFile The system cannot find the file specified. [16/08 12:40:49.710]
    

    This is shown during every successful PXE boot, but the client then tries to download the kernel from the VLAN129 com server, which resets the connection and ends the PXE boot.
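    To sort through logs like the two above, a rough Python sketch that lists each read request with its outcome. The patterns are built only from the lines pasted in this thread, so other tftpd versions may log differently, and the outcome labels are my own naming. Note that in the second log the same pxeboot.0 request is aborted and then sent successfully 80 ms later, so an abort right after the OACK is not necessarily the failure by itself.

```python
import re

# Patterns derived from the tftpd log lines quoted in this thread.
CONNECT = re.compile(r"Connection received from (\S+) on port \d+")
REQUEST = re.compile(r"Read request for file <([^>]+)>")
SENT = re.compile(r"<[^>]+>: sent \d+ blks?, \d+ bytes")
ABORT = re.compile(r"Peer returns ERROR <[^>]+>")
MISSING = re.compile(r"File <[^>]+> : error 2")


def summarize(log_text):
    """Return one (client, file, outcome) tuple per read request, in log order."""
    results = []
    client = None
    current = None  # the most recent request, still waiting for an outcome
    for line in log_text.splitlines():
        match = CONNECT.search(line)
        if match:
            client = match.group(1)
            continue
        match = REQUEST.search(line)
        if match:
            current = [client, match.group(1), "no result logged"]
            results.append(current)
            continue
        if current is None:
            continue
        if ABORT.search(line):
            current[2] = "aborted by client"
        elif SENT.search(line):
            current[2] = "sent"
        elif MISSING.search(line):
            current[2] = "file not found"
    return [tuple(entry) for entry in results]
```

    Calling summarize() on the full log text and printing the tuples makes it easier to see which files were actually delivered and which request was the last one before the boot stopped.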


  • Well, tried drastic measures and reinstalled the whole server from zero...
    Now I can PXE boot on BIOS machine, but can't download kernel due to connection reset...
    And I can't even PXE boot on UEFI machine due to "PXE-E99: Unexpected network error"

    I found out I didn't copy everything I needed in the Applicationhost.config, so this time I tried to copy everything regarding Toec-API to its duplicate application pool, site and location.


  • Finally got it working, but a new problem appeared:
    Once I have multiple interfaces on the Windows server, tftpd fails to work correctly with UEFI clients. BIOS clients have no problem downloading pxeboot.0 and the kernels, get to the TOEM PXE interface and can image. UEFI clients, however, ignore the incoming acknowledgment of PXE options from the server, and both of them just hang (even with the TFTP server bound to one interface) until I disable all Windows interfaces except one. I tried another TFTP server and it did the same thing, so maybe Windows is getting the response from VLAN143 and sending it to the client through VLAN129 directly, and UEFI PXE doesn't allow that?
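    The guess at the end of this post can be modeled in a few lines of Python. This is only a sketch of the hypothesis, not a diagnosis: it picks the outgoing interface by longest prefix match, the way a multi-homed host generally prefers a directly connected subnet over its default route. The addresses, interface names and function names are made up for illustration, and real Windows routing also weighs interface metrics.

```python
import ipaddress

# Hypothetical addressing for a server with one leg in each VLAN.
INTERFACES = {
    "vlan143": "10.0.143.10/24",
    "vlan129": "10.0.129.10/24",
}


def egress_interface(dest_ip, interfaces, default):
    """Pick the interface a packet to dest_ip would leave through.

    Uses longest prefix match over directly connected networks and falls
    back to the interface holding the default route.
    """
    dest = ipaddress.ip_address(dest_ip)
    best_name, best_len = default, -1
    for name, cidr in interfaces.items():
        network = ipaddress.ip_interface(cidr).network
        if dest in network and network.prefixlen > best_len:
            best_name, best_len = name, network.prefixlen
    return best_name


def is_asymmetric(arrived_on, client_ip, interfaces=INTERFACES, default="vlan143"):
    """True if the reply would leave through a different interface than the request came in on."""
    return egress_interface(client_ip, interfaces, default) != arrived_on
```

    With these made-up addresses, a request from 10.0.129.215 that was routed to the VLAN143 address would be answered out of the VLAN129 interface, so is_asymmetric("vlan143", "10.0.129.215") is True. Disabling all interfaces except one removes that second path, which would match what was observed.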


  • Also, if I set both com servers as TFTP, they share the boot files and TOEM overwrites them, so tftpboot\proxy\bios\pxelinux.cfg\default and similar files contain the IP address of only one server, I guess the last one in the list. I thought about adding a dedicated folder just for the VLAN, but so far I haven't found a way to do that.


  • Since they are on the same server, you would need a different tftp path defined in the com server


  • @theopenem_admin finally found the problem!
    I dug into running udp-sender manually on the server via cmd, with the client in the debug console, and found out that I have to use --mcast-rdv-address 238.X.Y.Z, not the server IP W.X.Y.Z!
    As soon as I set 238.X.Y.Z, the client consoles started to be flooded with sent data and udp-sender started showing download progress.
    My question is how to change this in the TOEM source code, because multicast.log shows that the IP address of the server is always used as the mcast-rdv-address argument, and adding my own argument in the setting just appends it to the command instead of overwriting the wrong IP.
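    To illustrate the difference between appending and overwriting, a small Python sketch of "replace the flag instead of adding a second one", with a check that the new value really is a multicast address. This is not TOEM code and says nothing about how TOEM builds the command; the function name and the example addresses are hypothetical, and only --mcast-rdv-address is taken from this thread.

```python
import ipaddress

FLAG = "--mcast-rdv-address"


def with_rdv_address(args, address):
    """Return a copy of args in which FLAG appears exactly once, set to address.

    Any existing occurrence of the flag (either "FLAG value" or "FLAG=value")
    is dropped first, so the new value replaces the old one instead of being
    appended next to it.
    """
    if not ipaddress.ip_address(address).is_multicast:
        raise ValueError(f"{address} is not a multicast address")

    cleaned = []
    skip_value = False
    for arg in args:
        if skip_value:
            skip_value = False  # this was the value of the old flag
            continue
        if arg == FLAG:
            skip_value = True   # drop the flag and the value that follows
            continue
        if arg.startswith(FLAG + "="):
            continue            # drop the combined form as well
        cleaned.append(arg)
    return cleaned + [FLAG, address]
```

    For example, with a made-up server address, with_rdv_address(["udp-sender", "--mcast-rdv-address", "10.0.143.10"], "238.1.2.3") keeps a single flag pointing at 238.1.2.3, while passing a unicast address such as 10.0.143.10 as the new value raises ValueError.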


  • I'll need to release an update where this information is not prepopulated into the cmd.


  • Can you create a feature request post so I remember? That's generally all I look at when releasing new versions; the support topics are too many to go through.


  • @theopenem_admin ok, I'll post it there. I'll try to survive with unicast for now