For example, maybe you have some data where after each occurance of 011 it becomes more likely to repeat. To model that with an FSM you only need a state for 011, and it loops back to itself and increases P. To model it with finite contexts you need an 011 state, an 011011 , 011011011 , etc. But you might also have correlations like : every 72 bytes there is a dword which is equal to dword at -72 bytes xor'ed with the dword at -68 bytes and plus a random number which is usually small.
The point is not that these correlations are impossible to model using finite contexts, but the correct contexts to use at each spot might be infinitely large.
Furthermore, you can obviously model FSM's by hard-coding them into your compressor. That is, you assume a certain structure and make a model of that FSM, and then context code from that hard-coded FSM. What we can't do is learn new FSM's from the data.
For example, say you have data that consists of a random dword, followed by some unknown number of 0's, and then that same dword repeated, like
DEADBEEF0000000000000000DEADBEEFyou can model this perfectly with an FCM if you create a special context where you cut out the run of zeros. So you make a context like
DEADBEEF00and then if you keep seeing zeros you leave the context alone, if it's not a zero you just use normal FCM (which will predict DEADBEEF). What you've done here is to hard-code the finite state structure of the data into your compressor so that you can model it with finite contexts.
In real life we actually do have this kind of weird "finite state" correlation in a lot of data. One common example is "structured data". "Structured data" is data where there is a strong position-based pattern. That is, maybe a sequence of 32 bit floats, so there's strong correlation to (pos&3), or maybe a bunch of N-byte C structures with different types of junk in that.
Note that in this sense, the trivial mode changes of something like HTML or XML or even English text is not really "structured data" in our sense, even though they obviously have structure, because that structure is largely visible through finite contexts. That is, the state transitions of the structure are given to us in actual bytes in the data, so we can find the structure with only finite context modeling. (obviously English text does have a lot of small-scale and large-scale finite-state structure in grammars and context and so on).
General handling of structured data is the big unsolved problem of data compression. There are various heuristic tricks floating around to try to capture a bit of it. Basically they come down to hard coding a specific kind of structure and then using blending or switching to benefit from that structure model when it applies. In particular, 4 or 8 byte aligned patterns are the most common and easy to model structure, so people build specific models for that. But nobody is doing general adaptive structure detection and modeling.