I was rather curious to see how this handled garden-path sentences[0]. For "The old man the boat.", Stanza interprets "man" as a noun rather than a verb. Similarly, for "The complex houses married and single soldiers and their families.", "houses" is also interpreted as a noun rather than a verb. These sentences are mostly corner cases, but it was an interesting little experiment nonetheless.
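For anyone who wants to poke at this themselves, here's roughly what the experiment looks like with Stanza's pipeline API (a minimal sketch; the processor list and model versions may differ from whatever Stanza currently ships):

    import stanza

    # Fetch the English models once (needs network access the first time).
    stanza.download("en")

    # Tokenization + POS tagging is enough to see how each word gets labelled.
    nlp = stanza.Pipeline(lang="en", processors="tokenize,pos")

    sentences = [
        "The old man the boat.",
        "The complex houses married and single soldiers and their families.",
    ]
    for text in sentences:
        doc = nlp(text)
        for sent in doc.sentences:
            # Here "man" and "houses" come out as NOUN rather than VERB.
            print([(word.text, word.upos) for word in sent.words])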
Most humans struggle when reading garden-path sentences, so I would be quite impressed if an NLP toolkit handled them easily out-of-the-box.
EDIT: On a related note, when I was an undergrad there was a group on campus that was doing research on how humans repair garden-path sentences when their first reading is incorrect. They were measuring ERPs to see if something akin to a backtracking algorithm was used + eye-tracking to see which word/words triggered the repair. I graduated before the work was complete, but I might go digging for it to see if it was ever published.
I think that's too high a bar. I didn't interpret either of those sentences correctly the first time I read them either. It would be obtuse to expect even a "human-level" AI to get these right. Though you could fix it by backtracking to see whether there are alternative analyses that yield a complete parse.
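A toy illustration of that backtracking idea (this is not Stanza; it's just a hand-written CFG fed to NLTK's chart parser, with a made-up lexicon where "old" can be a noun and "man" a verb, so the search over alternative readings is done by the parser):

    import nltk

    # Deliberately tiny, ambiguous grammar for this one sentence.
    grammar = nltk.CFG.fromstring("""
        S   -> NP VP
        NP  -> Det N | Det Adj N
        VP  -> V NP
        Det -> 'the'
        Adj -> 'old'
        N   -> 'old' | 'man' | 'boat'
        V   -> 'man'
    """)

    parser = nltk.ChartParser(grammar)
    tokens = "the old man the boat".split()

    # The chart parser explores the alternatives; the only complete parse is
    # the one where "the old" is the subject and "man" is the verb.
    for tree in parser.parse(tokens):
        tree.pretty_print()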
But still, this kind of analysis (part-of-speech tagging, dependency parsing, etc.) has largely been deemed useless since the arrival of neural transformer models.
The solutions to these low-level problems seem to be unimportant for high-level tasks. Not to mention that errors propagate: an error in part-of-speech tagging will propagate to the dependency parsing that uses that information, and eventually it will affect NER, entity/relationship extraction, and similar downstream tasks.
You can use spaCy's rule-based tokenization for English, or you can use their neural model. The neural model will generally do better, especially at sentence segmentation, but it will be slower.
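Roughly the contrast being described, sketched against the spaCy v3 API (the sentencizer setup is slightly different in v2, and en_core_web_sm has to be downloaded separately):

    import spacy

    text = "Dr. Smith went to Washington. He arrived at 5 p.m. yesterday."

    # Rule-based: a blank English pipeline plus the sentencizer component.
    # Fast, no model download, but boundaries come from punctuation heuristics.
    nlp_rules = spacy.blank("en")
    nlp_rules.add_pipe("sentencizer")
    print([sent.text for sent in nlp_rules(text).sents])

    # Statistical: the small pretrained model (python -m spacy download en_core_web_sm).
    # Slower, but sentence boundaries come from trained components.
    nlp_model = spacy.load("en_core_web_sm")
    print([sent.text for sent in nlp_model(text).sents])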
NLTK, for my uses, has been dethroned by Spacy for years now. I'm very curious to see how Stanza compares. It looks like it's built on PyTorch, so very interested to check it out.
Interestingly, I had trouble using spaCy: it needs an internet connection (to AWS) to download its models, and without the models it can't do much. That connection was blocked by deep packet inspection in my use case.
I looked a bit for a workaround but finally decided to just do the work in NLTK.
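For what it's worth, the usual workaround is to fetch the model archive on a machine that does have access, copy it over, and install it locally; something along these lines (the file name is illustrative, check the spaCy model releases for the actual version):

    # On a machine with internet access, download the model archive
    # (e.g. en_core_web_sm-x.y.z.tar.gz), then on the offline machine:
    #   pip install ./en_core_web_sm-x.y.z.tar.gz
    #
    # After that, loading works without any network access:
    import spacy

    nlp = spacy.load("en_core_web_sm")

    # Alternatively, point spacy.load at an unpacked model directory:
    # nlp = spacy.load("/path/to/en_core_web_sm/en_core_web_sm-x.y.z")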
You can also try out Stanza in spaCy --- Ines updated the spacy-stanfordnlp wrapper to use the new version pretty much immediately: https://github.com/explosion/spacy-stanza
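If anyone wants to try that, this is roughly the usage from the wrapper's README at the time (the API has changed across spacy-stanza releases, so check the repo for the current form):

    import stanza
    from spacy_stanza import StanzaLanguage

    stanza.download("en")

    # Run Stanza's neural pipeline, but expose the results as spaCy Doc objects.
    snlp = stanza.Pipeline(lang="en")
    nlp = StanzaLanguage(snlp)

    doc = nlp("The old man the boat.")
    for token in doc:
        print(token.text, token.pos_, token.dep_)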
I was asked about Stanza by a friend in a private DM; I'll paste the answer here as I think others might find it helpful:
Q: Are Stanza models more accurate and consistent than spaCy, as this tweet claims?

A: Yeah, definitely. Our models are quite a bit behind state-of-the-art atm because we're still optimized for CPU. We're hoping to have a spacy-nightly up soon that builds on the new version of Thinc.

The main thing we want to do differently is having shared encoding layers across the pipeline, with several components backproping to at least some shared layers of that. So that took a fair bit of redesign, especially to make sure that people could customize it well.

We never released models that were built on wide and deep BiLSTM architectures because we see that as an unappealing speed/accuracy trade-off. It also makes the architecture hard to train on few examples, and it's very hyper-parameter intensive, which is bad for Prodigy.

Their experiments do undercount us a bit, especially since they didn't use pretrained vectors, while they did use pretrained vectors for their own and Flair's models. We also perform really poorly on the CoNLL-03 task. I've never understood why --- I hate that dataset. I looked at it and it's like, these soccer match reports, and the dev and test sets don't correlate well. So I've never wanted to figure out why we do poorly on that data specifically.
As an example of what I mean by "undercounting": we can get to 78% on the GermEval data, while their table has us on 68%, and Flair and Stanza are on 85%. So we're still behind, but by less. The thing is, the difference between 85 and 78 is actually quite a lot -- probably more than most people would intuit.
I hope we can get back to them with some updates for specific figures, or perhaps some datasets can be shown as missing values for spaCy. Running experiments with a bunch of different software and making sure it's all 100% compatible is pretty tedious, and it won't add much information. The bottom line anyone should care about is: "Am I likely to see a difference in accuracy between Stanza and spaCy on my problem?" At the moment I think the answer is "yes". (Although spaCy's default models are still cheaper to run on large datasets.)
We're a bit behind the current research atm, and the improvements from that research are definitely real. We're looking forward to releasing new models, but in the meantime you can also use the Stanza models with very little change to your spaCy code, to see if they help on your problem.
[0] https://en.wikipedia.org/wiki/Garden-path_sentence