1. It's kind of amazing to me how well LZNib does. (currently 30,986,634 on enwik8 with parallel chunked compress and LRM). I guess it's just the "asymptotic optimality" of LZ77; as the dictionary gets bigger, LZ77 approaches perfect compression (assuming the data source is static, which of course it never is, which is why LZ77 does not in fact approach the best compressor). But anyway, the point is with basic LZ the way matches are encoded becomes less and less important as the window gets bigger (and the average match length thus gets longer).
2. With byte-wise coders you have something funny in the optimal parser than you don't run into much with huffman or arithmetic coders : *ties*. That is, there are frequently many ways to code that have exactly the same code length. (in fact it's not uncommon for *all* the coding choices at a given position to produce the same total length).
You might think ties don't matter but in fact they do. One way you can break a tie is to favor speed; eg. break the tie by picking the encoding that decodes the fastest. But beyond that if your format has some feedback, the tie is important. For example in LZNib the "divider" value could be dynamic and set by feedback from the previous encoding.
In my LZNib I have "last offset" (repeat match), which is affected by ties.
3. My current decoder is around 800 mb/s on my machine. That's almost half the speed of LZ4 (around 1500 mb/s). I think there are a few things I could do to get a little more speed, but it's never going to get all the way. Presumably the main factor is the large window - LZ4 matches mostly come from L1 and if not then they are in L2. LZNib gets a lot of large offsets, thus more cache misses. It might help to do a lagrangian space-speed thing that picks smaller offsets when they don't hurt too much (certainly for breaking ties). (LZNib is also somewhat more branchy than LZ4 which is the other major source of speed loss)
4. One of the nice things about optimal parsing LZNib is that you can strictly pick the set of matches you need to consider. (and there are also enough choices for the optimal parser to make interesting decisions). Offsets can be sent in 12 bits, 20 bits, 28 bits, etc. so for each offset size you just pick the longest match in that window. (this is in contrast to any entropy-coded scheme where reducing to only a few matches is an approximation that hurts compression, or a fixed-size scheme like LZ4 that doesn't give the optimal parser any choices to make)
5. As usual I'm giving up some compression in the optimal parser by not considering all possible lengths for each match. eg. if you find a match of length 10 you should consider only using 3,4,5... ; I don't do that, I only consider lengths that result in a shorter match length code word. That is a small approximation but helps encoder speed a lot.
6. Since LZNib uses "last offset", the optimal parse is only approximate and that is an unsolved problem. Because big groups of offsets code to the same output size, the choice between those offsets should be made by how useful they are in the future as repeat matches, which is something I'm not doing yet.