What if there was some intermediate object like a simple version of the instance that should spawn there without any AI or animation... and it gets replaced at the moment the real entity would normally pop into existence?
We've had that talk already at some point. Short answer: it's not easy at all, so there's only a slim chance they'll try it. Reasons? A lot, but to name a few:
1.
There's no easy way to switch from block to entity in Unity in real time without feedback problems. If you shoot that Zd block from far away, it can't switch fast enough to receive the damage as an entity. The block has to store the health value: the block takes the damage, and then, through newly programmed variables, you transfer that data to an entity that spawns with that HP value. If its head explodes, the player will notice a clear delay; if it doesn't explode, you'll notice it anyway, because the switch isn't instantaneous and goes through a few checks. It's akin to a Skyrim character dying from fall damage: it falls and then suddenly dies a few seconds later, when the fall damage is calculated.
2.
Aside from point 1, the other apparent problems are visual and dev-related: you'd have to make 1-3 extra models per Zd as blocks in different postures, and at the moment of the switch the entity has to be in that specific posture to make the switch seamless. That needs extra code and quite a bit of modelling time.
3.
You would lose performance when you make the switch, more than you do right now. The performance gain from spawning the blocks first not only gets nullified when they switch, it also doesn't compensate for having to process block-to-entity switches over and over, continuously.
4.
If you make blocks, you would lose the very gameplay you are trying to fix. Isn't a Zd game supposed to wake everyone up when you make noise? If you have blocks, they won't wake up. You would have to make the block convert to an entity with a lot of code and more performance loss than you think, and the gameplay would not only get worse, it would also become predictable on low-end computers (we can "feel" when a horde spawns, so we would also "feel" when a volume designed as you suggest "switches" models).
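To make points 1 and 4 a bit more concrete, here's a rough sketch of the hand-off in Python. This is not Unity code and not how the game actually works: PlaceholderBlock, Entity, SWITCH_DELAY_TICKS and simulate_headshot are all names I made up, and the 3-tick delay is an assumed number, not a measurement. It only shows the shape of the problem: the block stores the HP, a shot or a noise has to request the swap, and nothing visible can happen until the swap has finished.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed value for illustration only: how many ticks a block-to-entity swap takes.
SWITCH_DELAY_TICKS = 3


@dataclass
class Entity:
    """Hypothetical 'real' zombie with AI and animation; only HP is modelled here."""
    hp: int

    @property
    def alive(self) -> bool:
        return self.hp > 0


class PlaceholderBlock:
    """Hypothetical static stand-in with no AI; it can only store health."""

    def __init__(self, hp: int) -> None:
        self.hp = hp
        self.ticks_until_switch: Optional[int] = None  # None = no swap requested

    def request_switch(self) -> None:
        # Start the countdown once; later requests don't restart it.
        if self.ticks_until_switch is None:
            self.ticks_until_switch = SWITCH_DELAY_TICKS

    def take_damage(self, amount: int) -> None:
        # The damage lands on the block, so there is no visible reaction yet.
        self.hp = max(0, self.hp - amount)
        self.request_switch()

    def hear_noise(self) -> None:
        # A block can't wake up on its own; noise has to trigger the swap too.
        self.request_switch()

    def tick(self) -> Optional[Entity]:
        """Advance one tick; return the Entity once the swap completes, else None."""
        if self.ticks_until_switch is None:
            return None
        self.ticks_until_switch -= 1
        if self.ticks_until_switch > 0:
            return None
        return Entity(hp=self.hp)  # the stored health is carried over


def simulate_headshot(block_hp: int, damage: int) -> Tuple[int, Entity]:
    """Shoot a block once; return (ticks until an entity exists, that entity)."""
    block = PlaceholderBlock(block_hp)
    block.take_damage(damage)
    ticks = 0
    entity = None
    while entity is None:
        ticks += 1
        entity = block.tick()
    return ticks, entity
```

With these made-up numbers, simulate_headshot(100, 100) hands back an entity that is already dead, but only after SWITCH_DELAY_TICKS ticks. That gap is the "falls and then dies a few seconds later" effect from point 1, and hear_noise going through the same countdown is the wake-up problem from point 4.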
I hope this helps to clarify why that innocent idea is a HUGE can of worms, and why this won't be easily fixed with optimization tweaks such as swapping entities for blocks. A hypothetical implementation would decrease performance, at best keep the gameplay as it is (probably make it worse because of the switch delay), and need tons of work and dev time.