Tagging & Annotation R&D

Tagging and annotation have long been some of the most important tasks that a news organization undertakes. The tags that we attach to an article enable nearly everything that happens to it after publication: how we recommend related content to readers, how search engines index our site, how ads are targeted, and more.

Currently, at The New York Times, those tags are applied at the article level. Yet when we look at an article we can see that it actually contains many smaller component parts, such as a fact, a person, a recipe or an event. If we could begin to annotate and tag these components, we could do much more with that information. New devices, especially those with smaller screens, could make use of smaller chunks of content. New products could be created by extracting components from their original article context and recombining them to create collections or new kinds of experiences. And rather than being a file cabinet full of articles, the archive would become a corpus of structured news information that could be interrogated and reasoned across.

Fine-grained annotation within an article is a difficult problem that has historically been approached in two ways, both of which have their own challenges. One approach is computational, building rule sets or machine learning processes to take best guesses at where to apply tags. These approaches can be quite successful, but are still not nearly good enough to stand on their own. The other approach is to have people do the tagging. The person writing the article knows the information needed with a high degree of accuracy, but the burden of work required to highlight and annotate every significant phrase is untenable.
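
To make the computational approach concrete, here is a minimal sketch of a rule set that tags spans inside an article rather than the article as a whole. It is an illustration only, not a description of any system in use at The Times: the `Annotation` record, the `RULES` list, the tag names and the `annotate` function are all hypothetical, and the two regular expressions are deliberately naive.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A tag attached to a span of text instead of the whole article."""
    start: int  # character offset where the span begins
    end: int    # character offset just past the end of the span
    tag: str    # hypothetical component type, e.g. "person" or "date"
    text: str   # the tagged text itself


# Hypothetical rule set: each rule pairs a tag with a regular expression.
# Real rules (or a trained model) would need to be far more robust.
RULES = [
    ("person", re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\. [A-Z][a-z]+")),
    ("date", re.compile(
        r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"
        r" \d{1,2}, \d{4}\b")),
]


def annotate(text):
    """Apply every rule and return the matched spans in document order."""
    found = []
    for tag, pattern in RULES:
        for match in pattern.finditer(text):
            found.append(
                Annotation(match.start(), match.end(), tag, match.group()))
    return sorted(found, key=lambda a: a.start)


sample = "Dr. Smith spoke on March 3, 2015 about the archive."
annotations = annotate(sample)
for a in annotations:
    print(a.tag, repr(a.text), a.start, a.end)
```

Rules like these take best guesses and will miss or mislabel plenty of phrases, which is why such output would still need review by someone who knows the story.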

via EDITOR (2015) | nytlabs.