NLP for investment management: quants face a grab bag of words

Training models to interpret text can be dull, but doing it poorly can be costly.

Investors hope machine learning-based natural language processing engines can find alpha signals in text, or help fundamental analysts cut through informational noise.

Training the models to make sense of financial language, however, is labor-intensive. BlackRock has been working on its sentiment model for five years.

Doing the job badly can lead to wasted time and effort; few shortcuts exist.

The participation of investment professionals in training the models is vital, quants say.

When Microsoft launched its chatbot Tay in 2016, the inputs from users notoriously steered the bot into spewing out hate speech. Tay was hurriedly taken offline—and has yet to reappear.

Now, like so many in the vanguard of artificial intelligence, investment managers are busy trying to teach their bots the nuances of interpreting text. Like the tech giant, though, they face a long journey to success—one that must avoid embarrassing missteps.

NLP, or natural language processing, is a buzzy area in investing. Buy-side quants are using this text-interpreting arm of computing to generate sentiment signals from financial news, earnings calls and company filings—or to pick stocks along thematic lines, such as those likely to benefit from a transition to low-carbon economies.

“There’s a lot of R&D,” says Yin Luo, who heads quant research at Wolfe Research, which advises buy-siders on developing NLP models as well as running models of its own.

But the research and development to which Luo refers may be where many firms are parked for a while yet.

On the face of it, NLP seems easy to adopt. Powerful open-source algorithms are freely available. And a new generation of NLP-based models from companies like Google is producing more accurate results.

But even the best-trained models throw up dozens of false positives: text they link to the wrong topic, or label as positive when it’s negative—or vice versa. And, like other quant models, NLP can be overfitted.

There’s a “fragility” in NLP, says Geoffrey Horrell, head of innovation at LSEG Labs, a fintech unit at the market infrastructure and data firm.

“You need to make sure that every single thing that your model is going to see in production, every document, is somehow represented in the corpus that you’ve trained with,” he says.

And algos make mistakes—which risks pushing investors into bad trades. Or into missing good ones.

The off-the-shelf algorithms lack the training to recognize specialist vocabulary, and don’t work so well when applied to investing. So buy-siders are having to do some of this effortful work themselves.

Investors can reduce the errors through extra training, using finance-specific text, says Luo. But it takes time and effort. “It can be very labor-intensive.”

A model trained to interpret text from earnings call transcripts might do a bad job of understanding central bank statements, where words are used differently.

You need to make sure that every single thing that your model is going to see in production, every document, is somehow represented in the corpus that you’ve trained with.

Geoffrey Horrell, LSEG Labs

Language can change over time, making a model out of date. It might not recognize a new cryptocurrency name or a novel phrase like ‘meme stock’, for example. A model built before Covid could fail to operate in the Covid era.
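The kind of drift described above can be monitored in a simple way: track the share of production tokens that never appeared in the model’s training vocabulary. The vocabulary and headline below are hypothetical, purely for illustration:

```python
def oov_rate(vocabulary, text):
    """Fraction of tokens in `text` absent from the training vocabulary."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    unseen = [t for t in tokens if t not in vocabulary]
    return len(unseen) / len(tokens)

# Hypothetical pre-2020 training vocabulary
vocab = {"earnings", "guidance", "dividend", "revenue", "beat", "missed"}

fresh_headline = "meme stock rally lifts earnings guidance"
rate = oov_rate(vocab, fresh_headline)
# A rising out-of-vocabulary rate is one cue that the model needs retraining
```

A check like this flags novel terms such as ‘meme stock’ before the model silently mishandles them in production.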

Annotate training text badly—using a corpus of language that fails to match what the model sees in production—and “you’re not going to get the right results”, says Horrell.

At the same time, data scientists must take care to test models out-of-sample, using text dissimilar from the text used in training.

Some topics prove too ambiguous for NLP to be trained to recognize, Horrell says. “Firms do end up saying: We can’t do it; drop it out of the model.”

Word soup

Not that success is easy for anyone using NLP. The stories of early adopters illustrate the scale of what can be needed.

BlackRock spent five years developing its NLP sentiment engine, according to Haris Chalvatzis, a researcher in the firm’s core portfolio management team.

At Invesco, two data scientists worked for 24 months to build the firm’s NLP capability. “There has to be a financial and human resource commitment if you want to do these things,” says Georg Elsaesser, a quant portfolio manager at the firm. “You don’t get it for free.”

A macro hedge fund wanted to track news about a possible ‘Frexit’ and found generic models didn’t recognize the term. Another buy-side firm built a model to find companies with links to modern slavery. The firm had to train the model from scratch to understand the relevant clues.

Investment houses that have tried to train NLP models cheaply—often recruiting interns for the work—have wound up “seriously disappointed”, Luo says.

Many are grasping only now that their NLP operations will not operate like a “cottage industry, in which one PhD cracks it with an amazing model,” says Horrell.

One global financial firm asked the LSEG Labs team for help setting up an NLP project with only one junior data scientist—and proved clueless about most of the basics of the training required.

“They were going to need much more support,” Horrell says wryly.

In broad terms, investment firms use NLP to gauge the sentiment of text—or its relevance to a particular topic of interest.

Quant firms such as Acadian and PanAgora have used the technology to identify greenwashing. Other investors have used it to pick stocks.

Discretionary investors believe the technology can help their analysts sift the volumes of news and broker reports they receive daily. 

The models have evolved rapidly. Early so-called ‘bag-of-words’ models counted occurrences of given terms to determine the subject matter of text as well as its positive or negative slant.

Newer ‘transformer’ models, of which Google’s BERT (Bidirectional Encoder Representations from Transformers) is the most popular, use neural networks to learn how words relate to the words around them. The models build up a picture of billions of ‘word embeddings’—multidimensional vectors that describe the similarity of a given word in a given context to all other words in all other contexts. It’s “incredibly powerful”, says Luo.

There has to be a financial and human resource commitment if you want to do these things.

Georg Elsaesser, Invesco

But to train the newer models, data scientists must gather text samples that demonstrate the use of specialist language in specialist contexts. These are labeled and fed to the NLP models to learn from. The model trains itself to match the correct labeling to which the topic text relates—or its correct sentiment—as shown in the samples.

The human part is laborious.

“It just takes an extremely long time,” says Acadian’s Andrew Moniz, who together with a colleague has spent “months—every evening and weekend” labeling data for the firm’s NLP engine. For someone like Moniz, who is the firm’s director of responsible investing, with a PhD in information retrieval and NLP, this is not the type of work he might expect to have to do.

Models need perhaps 1,000 to 10,000 samples before they grasp a given detail, says Stefan Jansen, founder of consultancy Applied AI and author of a widely read text on machine learning in finance. And the more nuanced the lesson, the more complex the sample documents. Jansen recently completed a project that required only 1,000 training examples but for which each text source was 10 pages long.

Acadian identifies “tens of thousands” of example sentences from documents to illustrate distinctions in whether text does or doesn’t refer to a given topic, says Moniz.

The more lexical diversity in the subject matter—there are more words to describe a company’s culture than its revenues, for example—the more sentences are necessary for training, and the more labels for each sentence.

Turning negatives into positives

Fine-tuning can also be specific to the type of document a model reads, or for whom it reads them.

When AllianceBernstein sought to override a base model’s acquired negative view of the word ‘question’, which comes up frequently in earnings calls, quants had to reclassify about a hundred sentences containing the word, says Andrew Chin, who heads data science, quant research and risk for the firm.

AB also had to relabel the word ‘disruption’ when using an NLP model for its tech analysts. “Usually, we don’t like to hear the word disruption. But my tech analysts love it,” Chin points out.

Simple elements of training, such as mapping company names to stock tickers, can also go awry.

BlackRock’s Chalvatzis describes automating the mapping process as an “open problem” in NLP and says none of the mechanical solutions is close to perfect.

“It sounds like an easy thing to do, but in fact is not straightforward,” he says. Apple is the obvious example. A computer struggles to work out whether a document refers to the company or the fruit.
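One crude way to attack the disambiguation, short of a full entity-linking model, is to score the surrounding context for domain cue words. A toy sketch—the cue lists and sentences are hypothetical:

```python
# Hypothetical cue lists; a production system would learn these from data
COMPANY_CUES = {"shares", "iphone", "earnings", "nasdaq", "ceo", "revenue"}
FRUIT_CUES = {"orchard", "juice", "harvest", "pie", "fruit", "grower"}

def is_company_mention(sentence):
    """Vote on whether 'Apple' refers to the company, based on nearby cues."""
    tokens = set(sentence.lower().replace(",", " ").split())
    company_votes = len(tokens & COMPANY_CUES)
    fruit_votes = len(tokens & FRUIT_CUES)
    return company_votes > fruit_votes

company = is_company_mention("Apple shares jumped on strong iPhone revenue")
fruit = is_company_mention("Apple juice from the orchard sold well")
```

Even this toy version shows why the problem stays “open”: a sentence with no cue words on either side leaves the tie unresolved.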

I’m convinced you need people that understand finance and investing—and, in our case, ESG as well. It’s a nuanced skill set.

Andrew Moniz, Acadian

Bloomberg news comes ready-tagged. But in other articles, “it’s kind of difficult,” says Chalvatzis; “not impossible, but you do get false positives”.

All of this, buy-siders freely concede, is dull work. “It’s a nightmare to handle,” says one. “It’s like modern slave work. It’s not pretty.”

But you can’t have just anyone doing it. Outsourcing the training seems an obvious idea, and services like Amazon Mechanical Turk and Figure Eight offer data-labeling. Yet few practitioners who spoke to Risk.net, a sibling publication of WatersTechnology, see this as a viable long-term option.

“Garbage in garbage out,” says Moniz. “I’m convinced you need people that understand finance and investing—and, in our case, ESG as well. It’s a nuanced skill set. I don’t want to build something for which we don’t have a full understanding of how it’s been created.”

AB has collaborated with outside NLP vendors in projects that improve the provider’s models as well as AB’s. But Chin expects that to change.

Model army

In fact, NLP models will diverge, as firms incorporate proprietary research into their algorithms in the search for novel insights, says Chin.

Put another way, investment managers will try to train better models than their peers. But when they succeed, “firms will likely want to keep that perceived edge to themselves,” he says.

By implication, a firm’s alpha-seekers need to be involved.

“The process only works if the labelers have insights and are good at their jobs. Otherwise, our proprietary models will not be useful,” says Chin. “At AB, I believe our analysts and investment teams have alpha and therefore their feedback on these documents is crucial to the success of our models.”

Several firms that spoke to Risk.net try to bring together different specialists to work on training.

At T. Rowe Price, project teams of five to seven individuals, including data scientists, data engineers, application developers and investment analysts, collaborate on the firm’s NLP models.

Junior analysts on such a team might spend as much as an hour a day on the work, says Jordan Vinarub, head of T. Rowe’s New York technology development center and data science unit. “There’s a cost to having the business associates spend time with the team, but there’s a cost to not doing it too,” he says. “You can end up with an army of data scientists building something that misses the mark.”

Back at AB, Chin is coaxing front-office colleagues into helping. He asks the firm’s research analysts to highlight positive and negative words and sentences in documents as they read them, and to pass the mark-ups to the data science team for processing.

“The easier you make it for the analysts to get this data in, hopefully the more you get, and you just keep tracking it over time,” he says.

It’s vital to convince this band of casual labelers their time is well spent, he says. “I need to demonstrate that these models can add value. It’s not something that happens in a week. It happens over months, sometimes even longer. It’s difficult.”

Where firms do rely on outside help, one trick is to use NLP itself to evaluate data annotated elsewhere. In this approach, practitioners label a subset of the data themselves and train a classification model on both the subset and the externally labeled data. If the resulting classifications agree, the external labels are judged reliable.
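The agreement check at the heart of this trick can be quantified with a simple score; Cohen’s kappa, which corrects raw agreement for chance, is one common choice. A minimal sketch on hypothetical label lists:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two sets of labels."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both labelers assigned classes at random,
    # in proportion to their own class frequencies
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

internal = ["pos", "pos", "neg", "neg", "pos", "neg"]
external = ["pos", "pos", "neg", "pos", "pos", "neg"]
kappa = cohens_kappa(internal, external)
# High kappa suggests the external labels can be trusted; low kappa flags them
```

In practice the internally labeled subset would be held out and scored by a model trained on the external data, but the acceptance decision reduces to an agreement statistic like this one.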

Elsewhere, firms have found that narrowing the task they ask NLP to perform can make training easier.

For a range of NLP-selected thematic funds, Invesco seeds its model with just a handful of keywords relating to a given topic and lets the algorithm search out synonyms and related terms itself, building a mini-dictionary of around 50 to 100 words.
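The seed-and-expand step can be illustrated with word embeddings: start from a few seed terms and pull in the nearest neighbours by cosine similarity. The tiny three-dimensional vectors below are hypothetical stand-ins for real embeddings, which run to hundreds of dimensions:

```python
import math

# Hypothetical embeddings; real models learn these from large corpora
EMBEDDINGS = {
    "solar":    [0.9, 0.1, 0.0],
    "wind":     [0.8, 0.2, 0.1],
    "turbine":  [0.7, 0.3, 0.1],
    "coal":     [0.1, 0.9, 0.0],
    "dividend": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def expand_seeds(seeds, threshold=0.9):
    """Grow a seed list into a mini-dictionary of similar terms."""
    result = set(seeds)
    for word, vec in EMBEDDINGS.items():
        if any(cosine(vec, EMBEDDINGS[s]) >= threshold for s in seeds):
            result.add(word)
    return result

terms = expand_seeds(["solar"])  # pulls in 'wind' and 'turbine', not 'coal'
```

The threshold plays the role Invesco’s human oversight does at larger scale: set it too loose and the mini-dictionary fills with false friends.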

The process only works if the labelers have insights and are good at their jobs. Otherwise, our proprietary models will not be useful.

Andrew Chin, AllianceBernstein

By sticking to straightforward themes and selecting broad portfolios, the firm mitigates the risk of false positives. Invesco also eyeballs the model’s trades and vetoes any that look nonsensical.

The firm has also found mileage in using NLP to look for dud companies rather than picking stars. Elsaesser gives the example of socially responsible investing. “It’s easier to avoid companies that are bad actors by analyzing news data,” he says. And the potential downside is reduced. “If you exclude too much because your algorithm isn’t perfectly trained, there’s not much of a risk.”

As for fixing false signals in more nuanced tasks, Acadian built a dataset of sentences chosen to catch its model out—mentions of diversity, say, that relate to earnings diversity rather than employee diversity. By doing so, the firm can test its model’s ability to classify potential false positives correctly.
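A trap set of this kind is straightforward to run as a regression test: score each deliberately ambiguous sentence and report the false-positive rate. A toy sketch, with a naive keyword matcher and hypothetical trap sentences:

```python
def naive_diversity_flag(sentence):
    """Naive matcher: flags any sentence containing 'diversity'."""
    return "diversity" in sentence.lower()

# Hypothetical trap sentences: 'diversity' appears, but not in the
# employee-diversity sense the model is meant to detect
TRAP_SENTENCES = [
    "Earnings diversity across segments cushioned the downturn.",
    "Geographic diversity of revenue reduced currency risk.",
]

false_positives = sum(naive_diversity_flag(s) for s in TRAP_SENTENCES)
fp_rate = false_positives / len(TRAP_SENTENCES)
# The naive matcher flags every trap sentence, showing why
# context-aware training on exactly these examples is needed
```

Re-running the trap set after each round of retraining gives a concrete measure of whether the false-positive problem is actually shrinking.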

Onward and upward

Shortcuts are few, however—and costs are many. The abundance of text that gives NLP its appeal can cause the outlay on data alone to balloon: broker research, earnings call transcripts, company descriptions, financial news, social media, web-scraped text from job sites such as Glassdoor. And so on. To be in the game, firms “have to invest a few million or at the very least a few hundred thousand dollars,” says Elsaesser.

Will that cost—and the workload of training NLP models—prove too much? Buy-siders say not. The demands will slow the technology’s onward march, for certain—but not stop it, they say.

So-called ‘domain-adapted’ versions of Google’s BERT that have been pre-trained on financial text can help. In fact, the main such model—FinBERT—was downloaded nearly a quarter of a million times in October from Hugging Face, a central repository of pre-trained open-source models. 

Ultimately, though, how buy-siders train NLP models is one of the many “nuances” of using the technology they will simply have to learn, says Chin. “It’s a multi-year process.”

Maybe by then, Tay will be back too.
