How language models are disrupting equity research

A note from Hudson Labs' CTO, Suhas Pai

A few months ago, we launched the Hudson Labs news feed, a fully automated, real-time, AI-driven news feed that chronicles the worst nightmares of public companies as they happen: management resignations, financial statement restatements, investigations and subpoenas, the dreaded SEC comment letter asking about non-GAAP reconciliation, the ominous non-timely filings, and more. To our knowledge, this is the first AI-generated news feed of its kind.

What we built is really quite impressive. You may have heard of and consumed AI news feeds on large media sites, but the news appearing in those feeds is already pre-generated: a journalist made the effort to publish an article, and the algorithms merely figure out relevance and ordering based on user preferences and behaviour.

In contrast, Hudson Labs ‘generates’ the news itself, by combing through millions of raw SEC filings, reading every single sentence, separating boilerplate from actual content, and determining whether a sentence contains information interesting enough to warrant an appearance on the news feed.
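Conceptually, the pipeline looks something like the sketch below. This is a simplified illustration only, not our production code; the helpers is_boilerplate and newsworthiness are hypothetical stand-ins for the models described in the rest of this post.

```python
# Simplified sketch of an automated filing-to-news-feed pipeline.
# The model helpers passed in are hypothetical stand-ins, not Hudson Labs' code.
from nltk.tokenize import sent_tokenize  # may require nltk.download("punkt") first

def build_feed(filings, is_boilerplate, newsworthiness, threshold=0.9):
    """filings: iterable of dicts with 'ticker' and 'text' keys."""
    feed = []
    for filing in filings:
        for sentence in sent_tokenize(filing["text"]):
            if is_boilerplate(sentence):        # model 1: drop boilerplate
                continue
            score = newsworthiness(sentence)    # model 2: score the sentence
            if score >= threshold:              # surface only the most newsworthy
                feed.append((filing["ticker"], score, sentence))
    # Most newsworthy items first.
    return sorted(feed, key=lambda item: item[1], reverse=True)
```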

Just a decade ago, claims of an automatically generated news feed would either have been an April Fools’ prank or such blatantly obvious snake oil that the startup audacious enough to peddle it would have been lampooned amid widespread cringe. But today, it is not only possible, it has been done. What changed?

The field of artificial intelligence and machine learning has blossomed over the last decade. The year 2012 ushered in the ‘deep learning revolution’, the results of which are widely visible across society today. The impact has been especially visible in applications involving language and text, thanks to deep learning-based language models. Gone are the days when Google Translate’s output was fodder for improv bits due to the absurdity of the translations. Today, you could confidently use it to navigate a foreign country as a tourist.

At Hudson Labs, we have embraced language models in a big way. They are a foundation upon which we build our products, and they play an essential role in our success. We continue active research in this area to advance the state of the art.

What is a language model?

To put it rather simplistically, a language model is a model that has been trained on a large corpus of text such that, when given a sequence of words, it can predict the word that follows the sequence by assigning a probability to each word in the vocabulary of the language.

For example, if we provide a language model with the phrase ‘Mt. Everest is the’ and ask it to predict the next word, the model assigns a probability to each word in its vocabulary based on how likely it is to be a valid continuation. It would give the word ‘the’ a probability close to zero, because it is highly unlikely that the sequence continues with ‘the’. Indeed, ‘Mt. Everest is the the’ is ungrammatical, so it turns out that learning to assign these probabilities incidentally causes models to absorb linguistic capabilities like syntax (grammar) and semantics (meaning). If you go to Google Docs and type the phrase ‘Mt. Everest is the the’, the built-in grammar checker spits out the blue underline faster than the Peloton share price crash. That’s a language model in action.

Even more powerful language models, trained on huge swathes of the web, have seen enough references to Mt. Everest to give the word ‘tallest’ a very high probability, reflecting world knowledge captured by the model. Large language models like GPT-3 have astounded people by answering questions about the world and generating human-like text, from fiction and poetry to credible cover letters and marketing copy, and by carrying on conversations with people.
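To make the idea concrete, here is roughly what asking a model for the next word looks like with an off-the-shelf model (GPT-2 via the Hugging Face transformers library). This is a generic demo, not one of our models, and the exact probabilities will vary by model.

```python
# Next-word prediction with an off-the-shelf model (GPT-2 via Hugging Face).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Mt. Everest is the"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, seq_len, vocab_size)

# Probability distribution over the vocabulary for the next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# The five most likely continuations (typically words like 'highest' or 'world').
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)):>12s}  {prob.item():.4f}")

# The probability of repeating ' the' is vanishingly small.
the_id = tokenizer(" the")["input_ids"][0]
print("P(' the') =", next_token_probs[the_id].item())
```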

Modern language models are characterised by their adherence to the ‘distributional hypothesis’, best summed up by the adage ‘You shall know a word by the company it keeps’. The meaning of a word can be inferred from the words that surround it, in context. This lets a model distinguish between different senses of the same word. For example, consider the following two sentences:

‘The FDA generally expects preliminary clinical evidence to be obtained from clinical investigations specifically conducted to assess the effects of the therapy on a serious condition’

‘In addition, the Company’s Board of Directors recently received notice of an investigation by the DFEH.’

Because the model captures context, it can distinguish between the different senses of the word ‘investigation’. This vastly differs from the keyword-based paradigm, where a search for ‘investigation’ would bring up both sentences. In our case, our models use contextual information to mark the first sentence as not being a red flag.
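To make ‘context’ concrete: with a contextual model such as BERT, the vector for a word depends on the sentence it appears in, so the two uses of ‘investigation(s)’ above end up with noticeably different representations. The sketch below is a generic illustration with bert-base-uncased, not our production models.

```python
# Contextual embeddings: the same word gets different vectors in different contexts.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    """Mean of the hidden states of the subword tokens that make up `word`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, 768)
    word_ids = set(tokenizer(word, add_special_tokens=False)["input_ids"])
    positions = [i for i, tid in enumerate(enc["input_ids"][0].tolist())
                 if tid in word_ids]
    return hidden[positions].mean(dim=0)

clinical = ("The FDA generally expects preliminary clinical evidence to be obtained "
            "from clinical investigations specifically conducted to assess the "
            "effects of the therapy on a serious condition")
regulatory = ("In addition, the Company's Board of Directors recently received "
              "notice of an investigation by the DFEH.")

v1 = word_vector(clinical, "investigations")
v2 = word_vector(regulatory, "investigation")
print(torch.cosine_similarity(v1, v2, dim=0).item())  # noticeably below 1.0
```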

All this is music to the ears of practitioners who would like to build applications for text. To those who would like to build an automated, AI-driven financial news feed, a model that can capture syntax and handle contextual meaning is a boon! But do these models really work well right out of the box? Does a language model understand the meaning of ‘impairment’? Does it understand on its own that having a material weakness is supposed to be a bad thing? Unfortunately, no. We had to teach it ourselves, using a process called fine-tuning, a form of supervised learning (technically, we use a lot of unsupervised, semi-supervised, and self-supervised learning too, but those details are for another day). For an introduction to supervised learning and machine learning in general, read our "Introduction to Financial NLP and Large Language Models".
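Fine-tuning, in its simplest supervised form, means taking a pretrained model and continuing to train it on labelled examples from your own domain. Below is a minimal sketch using the Hugging Face Trainer; the two-sentence toy dataset and the ‘red flag’ label scheme are made up for illustration and are not our training data.

```python
# Minimal supervised fine-tuning sketch: label sentences as red flag vs. not.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)      # 0 = not a red flag, 1 = red flag

# In practice this would be many thousands of carefully labelled sentences.
examples = Dataset.from_dict({
    "text": [
        "The Company's Board of Directors recently received notice of an "
        "investigation by the DFEH.",
        "The FDA generally expects preliminary clinical evidence to be obtained "
        "from clinical investigations.",
    ],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length",
                     max_length=128)

train_set = examples.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="redflag-model", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train_set,
)
trainer.train()
```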

Challenges with state-of-the-art models

While the potential of language models is no doubt immense, they are not easily adaptable to different tasks. At Hudson Labs, we spent a year on research before we launched our product, developing several techniques for addressing commonly seen pitfalls. We continue to spend significant time on research even today.

Domain Adaptation

Large language models are prohibitively expensive to train, with compute costs running into the tens of millions of dollars for the larger models. They are trained by Big Tech behemoths like Microsoft, Meta, and OpenAI, and are released for free or behind a paywall. Because these models are trained largely on web text, they need to be adapted to financial text in order to be usable on it.

Financial text is not easy to adapt to. SEC filings consist of uncharacteristically long sentences that are linguistically complex, with multiple syntactic clauses, written in legalese bereft of any emotion, and containing both legitimate financial jargon and not-so-legit buzzwords.

Here is a typical sentence from the Risk Factors section of a 10-K.

“Any such determination could result in industry investigations, enforcement actions, changes in legislation, regulations, interpretations or regulatory guidance or other legislative or regulatory action or other actions, any of which could have the potential to result in additional limitations or restrictions on our business, cause material disruption to our business or otherwise adversely affect us”

I rest my case.
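One common remedy, and a standard technique in the NLP literature (we are glossing over the specifics of our own approach here), is domain-adaptive or continued pretraining: keep training the model on its original masked-language-modelling objective, but on in-domain text such as sentences pulled from SEC filings. A rough sketch:

```python
# Continued pretraining sketch: masked-language modelling on in-domain text.
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# In practice: millions of sentences extracted from 10-Ks, 10-Qs, 8-Ks, etc.
filing_text = Dataset.from_dict({"text": [
    "Any such determination could result in industry investigations, "
    "enforcement actions, or changes in legislation or regulatory guidance.",
]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = filing_text.map(tokenize, batched=True, remove_columns=["text"])

# Randomly masks 15% of tokens; the model learns to fill them back in.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sec-adapted-model", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

The mechanics look much like the fine-tuning sketch earlier, but the objective is different: no labels are needed, only raw filing text.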

Boilerplate

SEC filings in particular contain boilerplate text that is linguistically indistinguishable from substantive text. Consider the following:

“The existence of any material weakness in our internal control over financial reporting could also result in errors in our financial statements that could require us to restate our financial statements, cause us to fail to meet our reporting obligations, subject us to investigations from regulatory authorities or cause stockholders to lose confidence in our reported financial information, all of which could materially and adversely affect us.”

Does this sentence indicate the existence of a material weakness? This is kinda difficult even for humans. How hard would it be for a machine?

In general, boilerplate differs from useful content in very subtle ways that cannot be specified by rules.
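To see why rules fall short, consider a naive keyword filter. It fires on the hypothetical risk-factor sentence above just as readily as on an actual disclosure; the ‘actual disclosure’ below is a made-up example, included only for contrast.

```python
# Why keyword rules fail: both sentences mention "material weakness",
# but only one actually discloses it. The disclosure below is a made-up example.
boilerplate = ("The existence of any material weakness in our internal control "
               "over financial reporting could also result in errors in our "
               "financial statements that could require us to restate our "
               "financial statements.")
disclosure = ("We identified a material weakness in our internal control over "
              "financial reporting related to revenue recognition.")

def keyword_rule(sentence):
    return "material weakness" in sentence.lower()

print(keyword_rule(boilerplate))  # True -- false positive (hypothetical risk)
print(keyword_rule(disclosure))   # True -- true positive (actual disclosure)
# The rule cannot tell the hypothetical from the actual; a model that reads
# the whole sentence in context is needed to make that call.
```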

Negation

Due to their overreliance on the distributional hypothesis, language models are not great at negation.
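For instance, ‘we identified a material weakness’ and ‘we did not identify any material weakness’ differ by only a few words yet mean the opposite, and off-the-shelf sentence embeddings often place the two very close together. A quick illustration with a generic embedding model (not ours):

```python
# Negation blind spot: a generic sentence-embedding model tends to score a
# sentence and its negation as highly similar. Illustration only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
pair = [
    "We identified a material weakness in our internal control over financial reporting.",
    "We did not identify any material weakness in our internal control over financial reporting.",
]
embeddings = model.encode(pair, convert_to_tensor=True)
# Typically very high, even though the two sentences say opposite things.
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```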

We have made substantial advances in adapting our models to work on financial text, separating out boilerplate, and addressing problems with negation. We have also made advances in handling topic drift, active learning, text ranking, representation learning and few-shot learning. For more details on that, watch out for part 2!

[Note: the explanation above covered generative language models. There are also discriminative language models, but that distinction is for another post.]