I can explain why this seems so easy when the two tracks are identical except for the voice.
You start with two tracks: (instrumental) and (instrumental + voice).
Inverting a track is the same as adding a minus sign to it. So when you invert and mix these two tracks together, you get this equation:
-(instrumental) + (instrumental + voice) = voice.
As long as the two instrumental signals are exactly equal, they’ll exactly cancel out and you’ll be left with only voice. The theory really is that simple. The reality is that the only time you’ll encounter two signals that are exactly the same is if they’re from the same source material. If your two recordings aren’t from the same performance, you’ll never get anywhere. I’ll try to explain why after I answer your question.
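If it helps to see that as code, here’s a minimal sketch of the subtraction using NumPy. The array names are just placeholders; in practice you’d load the two files and make sure they’re aligned sample-for-sample first.

```python
import numpy as np

def isolate_voice(instrumental, instrumental_plus_voice):
    """Invert the instrumental and mix it with the full track.

    If the two instrumental signals are sample-for-sample identical,
    everything except the voice cancels out.
    """
    return -instrumental + instrumental_plus_voice

# Toy example: the shared instrumental cancels, the voice survives.
instrumental = np.array([0.1, -0.2, 0.3, 0.05])
voice        = np.array([0.0,  0.1, 0.0, -0.1])
mix = instrumental + voice           # (instrumental + voice)

residual = isolate_voice(instrumental, mix)
print(np.allclose(residual, voice))  # True
```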
I guess my main problem is that I need to have a “cleaner” more EXACT instrumental version to work with, and that is a problem. Is there any way to improve the track, via amplifying it or something to get more “peaks and valleys” that are closer to the original version that I am trying to work with?
Almost certainly not, with one exception that I’ll explain at the bottom. Even if the two instrumental tracks were pulled from the same performance, they might be mixed differently. And if they’re mixed differently, you’ll get the difference between the two mixes when you subtract one from the other. It might be technically possible (though unlikely) to “backtrack” one of the clips so that they line up more closely, but without knowing what the mixing engineer was doing, there’s no way to find a process that will reverse all the differences. And even if you did know what the engineer did, it would still probably be impossible, because many multi-track processes are not reversible once you’ve mixed a signal down to 2-track stereo (compression, panning, individual track volumes, modulation effects, reverb, etc.).
Conceptually, if you have two tracks that are identical, then when you subtract them the difference is zero at every point along the time axis, so the final product is silence. The more the two signals differ, the larger the leftover is at each point, and the more audible the result. The reason even “small” differences won’t cancel out is that the tolerances are so tight. A standard digital signal has 44,100 samples every second, and every one of them has to match for subtraction to cancel the two signals, so each note of the performance needs to land within about 22.7 microseconds (1/44,100 of a second). Even if you built a robot band that could play acoustic instruments within that time tolerance, there would still be enough differences between two performances to make subtraction impossible: acoustic instruments never respond identically to the same input, thanks to the properties of the materials involved and acoustic interactions within the environment.
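To get a feel for how tight that tolerance is, here’s a quick NumPy experiment: take a signal, delay a copy of it by a single sample (about 22.7 microseconds at 44.1 kHz), and subtract. Even a one-sample slip leaves a residue that’s nowhere near silence. The 440 Hz test tone is just a stand-in for a real recording.

```python
import numpy as np

sr = 44100
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)   # one second of a 440 Hz tone

shifted = np.roll(signal, 1)           # the same signal, one sample late
residual = signal - shifted            # what "subtraction" leaves behind

# Peak of the leftover is roughly 0.06 (about -24 dBFS) -- clearly audible,
# even though the two signals are "identical" apart from a 1-sample shift.
print(np.max(np.abs(residual)))
```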
So basically, if you take one instrumental mix and add vocals to it, the subtraction method will work. But if you take one performance and mix it down twice independently (once with vocals, once without), you won’t be able to subtract one version from the other very well. That’s probably what’s happening to you.
On the other hand, if the only difference between the two instrument tracks is a slight volume difference, then it should be possible to get them to exactly cancel out if you can find out what the exact volume difference is. It’s possible that the track with the voice in it was turned down a tiny bit to make room for the voice. That would change the original equation to something like this:
-(instrumental) + (0.95 * instrumental + voice) = (-0.05 * instrumental) + voice
If that’s the case, then you’ll hear a quiet version of the instrumental mixed with the full-volume voice. But if the difference is more than just volume (a slightly different EQ, say, or a small timing offset), then you’ll get a mild filtering effect or a mild echo effect instead.
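If you suspect the only difference really is a flat volume change, you don’t have to guess the gain by ear. A least-squares fit can estimate it, on the assumption that the voice isn’t correlated with the instrumental. Here’s a rough sketch, again with synthetic placeholder signals standing in for real, already-aligned tracks.

```python
import numpy as np

def subtract_with_gain(instrumental, mix):
    """Estimate the gain g that minimizes ||mix - g*instrumental||, then subtract."""
    g = np.dot(mix, instrumental) / np.dot(instrumental, instrumental)
    return mix - g * instrumental, g

# Toy example matching the equation above: mix = 0.95*instrumental + voice
rng = np.random.default_rng(0)
instrumental = rng.standard_normal(44100)
voice = 0.2 * rng.standard_normal(44100)
mix = 0.95 * instrumental + voice

residual, g = subtract_with_gain(instrumental, mix)
print(round(g, 3))                         # very close to 0.95
print(np.corrcoef(residual, voice)[0, 1])  # very close to 1: the residual is mostly voice
```

That only helps with a flat gain difference, though; if the engineer also touched EQ or timing, you’re back to the filtering/echo situation above.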