Quants turn to machine learning to unlock private data

Replication could allow financial firms to use—and monetize—data that was previously off-limits

  • Rudimentary methods of anonymizing private data, such as masking, can be easily reversed.
  • Synthetic datasets created by machine learning algorithms are completely different from the originals, while still retaining the same statistical properties.
  • Financial firms including American Express, Fidelity and JP Morgan are exploring ways to use the technology to unlock the value of sensitive datasets.
  • Erste Group built a retail banking app using synthetic data, while an Italian bank used synthetic data to validate a third-party credit scoring model.
  • If the technique proves robust, it could make it easier for investment firms to develop novel strategies based on alternative datasets.

When an investment firm wanted to find out how a new breakfast menu at Wendy’s might affect the fast-food chain’s bottom line, it looked for the answer in time-stamped credit card transaction data.

The data was anonymized, of course. Credit card companies remove sensitive information and add statistical ‘noise’ to this type of data before selling it to investors or even sharing it internally. But these anonymization techniques are not foolproof, and nervousness about privacy breaches has held back the use of transaction data in areas such as investment analysis, fraud detection and the development of execution algorithms.

A new idea could change that. Rather than anonymizing datasets, financial firms are looking at replicating them. Machine learning algorithms can synthesize new, artificial datasets that are completely different from the original, while retaining the same statistical characteristics. Because the new data is essentially fake, it can be shared at will.

“The point about synthetic data is that it can remove sensitivities around personal information but preserve the signal,” says Harry Keen, founder of Hazy, a UK-based synthetic data firm that works with financial services organizations.

The approach could make it easier for investment firms to develop strategies using data that was previously off-limits. It could also allow them to test execution algorithms and other third-party services on their own data before signing expensive contracts.

Banks could use the technique to monetize proprietary data from their retail or trading units and to forge partnerships with fintechs.  

Fidelity International is already testing the technology. “There’s appreciation at senior level that this is the way forward,” says Erik Mostenicky, a senior associate in the firm’s strategic ventures group, which invests in businesses that are of strategic importance in asset management. A budget has been agreed and a team is working on a proof-of-concept. The plan is to put data anonymization using synthetic data “into production” by the end of the year.

Other large financial firms are making similar moves. Data scientists from American Express described how synthetic data can be used in risk management at the NeurIPS conference, an annual gathering of machine learning experts, in Vancouver in 2019.

A team from JP Morgan gave a presentation on ways to synthesize data from order books and customer transactions at the same conference. “Financial services generate a huge volume of data that is extremely complex and varied,” the bank’s data scientists wrote in a subsequent paper on their research. “Data sharing within different lines of business as well as outside of the organization is severely limited.” They called for more research into ways of synthesizing financial data to overcome these limitations.

Quants at Standard Chartered have also written and spoken about using synthetic data to anonymize sensitive information.

Firms are cagey about what they’re doing—American Express, JP Morgan and Standard Chartered declined to comment for this article—but anecdotal evidence of interest in fake-data anonymization on both the sell side and buy side is growing.

Fernando Lucini, head of data science at Accenture, says his team is fielding five or six enquiries a month about synthetic data. A year ago, they had none.

Mostenicky is convinced other financial firms are trying out the technology, even if they won’t say so publicly. “I’m 100% sure it’s happening,” he says.

Behind the mask

The standard approach to anonymization involves stripping datasets of fields that could be used to identify individuals. More sophisticated methods add statistical ‘noise’ to data to render any single record meaningless while ensuring the composite retains its value.
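
To make the two legacy steps concrete, here is a minimal sketch of masking followed by noise addition. The field names, the toy records and the Gaussian noise scale are all assumptions made for illustration; they are not drawn from any card company's actual process.

```python
import random

# Hypothetical field names that directly identify a cardholder.
IDENTIFYING_FIELDS = {"name", "card_number"}


def make_records(n=1000, seed=0):
    """Build toy card-transaction records (entirely made up)."""
    rng = random.Random(seed)
    return [
        {
            "name": f"Customer {i}",
            "card_number": f"4000-0000-0000-{i:04d}",
            "merchant": rng.choice(["Wendy's", "Grocer", "Fuel"]),
            "amount": round(rng.uniform(2.0, 60.0), 2),
        }
        for i in range(n)
    ]


def mask(record):
    """Step 1: strip fields that could identify an individual."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}


def add_noise(record, rng, scale=1.0):
    """Step 2: perturb the amount with zero-mean Gaussian noise."""
    noisy = dict(record)
    noisy["amount"] = round(record["amount"] + rng.gauss(0.0, scale), 2)
    return noisy


def anonymize(rows, seed=1, scale=1.0):
    rng = random.Random(seed)
    return [add_noise(mask(r), rng, scale) for r in rows]


def mean_amount(rows):
    return sum(r["amount"] for r in rows) / len(rows)


records = make_records()
anonymized = anonymize(records)
print(mean_amount(records), mean_amount(anonymized))
```

Individual amounts shift, but because the noise averages out to zero, the mean across the whole dataset stays close to the original. That is the sense in which the composite retains its value.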

But these techniques have their limits. Well-known instances of so-called de-anonymization, where hackers have been able to pick out individuals from supposedly anonymized data, include tracking of celebrities’ taxi trips and identification of the authors of Netflix film recommendations. Research has shown that in 90% of cases, the dates and locations of four credit card transactions are enough to reidentify an anonymized cardholder.
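
The intuition behind that kind of result can be sketched on simulated data. The example below is a toy, not a reproduction of the research: the number of cardholders, days and shops are arbitrary assumptions, and the output does not recover the 90% figure. It only shows how quickly a handful of known (day, shop) points can single out one trace among many.

```python
import random


def simulate_traces(n_people=200, n_tx=10, n_days=30, n_shops=20, seed=1):
    """Give each simulated cardholder a set of (day, shop) transactions."""
    rng = random.Random(seed)
    return {
        person: {
            (rng.randrange(n_days), rng.randrange(n_shops))
            for _ in range(n_tx)
        }
        for person in range(n_people)
    }


def fraction_reidentified(traces, k, seed=2):
    """Share of cardholders singled out by k of their own transactions."""
    rng = random.Random(seed)
    unique = 0
    for person, trace in traces.items():
        # The attacker is assumed to know k points from this person's trace.
        known = set(rng.sample(sorted(trace), min(k, len(trace))))
        # Everyone whose trace contains all the known points is a candidate.
        matches = [p for p, t in traces.items() if known <= t]
        if matches == [person]:
            unique += 1
    return unique / len(traces)


traces = simulate_traces()
for k in (1, 2, 4):
    print(k, fraction_reidentified(traces, k))
```

Names never appear in the simulated traces, yet the pattern of where and when someone shops acts as a fingerprint.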

New laws such as the EU’s General Data Protection Regulation and the California Consumer Privacy Act impose stiff penalties for such failures.

Legacy anonymization methods have another problem: they can mask the data too much. “You lose the pattern you need to validate investment strategies,” says Mostenicky. In some cases, a majority of datapoints might need to be removed to protect privacy.

Mostenicky illustrates the point with a trivial example: after masking, the statement “BMW, Mercedes and Audi are German carmakers with billions in revenue”, might become, “A, B and C are German objects of size”.

Even sophisticated methods of protecting data such as homomorphic encryption come up short in finance, he says. Homomorphic encryption encrypts data so that the underlying values stay hidden while still allowing calculations to be performed on it. The downside is that it increases the size of the dataset, making computations thousands of times slower.
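
For readers unfamiliar with the idea, the sketch below uses a toy version of the Paillier scheme, one well-known additively homomorphic cryptosystem. It is a stand-in chosen for brevity, not a scheme named by anyone quoted here. The primes are far too small to be secure and the amounts are invented. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy Paillier keys from two tiny primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse; valid because the generator is n+1
    return n, lam, mu


def encrypt(n, m, rng):
    """Encrypt integer m (0 <= m < n) under public key n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, lam, mu, c):
    """Recover the plaintext from ciphertext c."""
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n) * mu % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the hidden plaintexts."""
    return (c1 * c2) % (n * n)


rng = random.Random(42)
n, lam, mu = keygen()
c1 = encrypt(n, 1250, rng)  # e.g. a payment of 12.50, in cents
c2 = encrypt(n, 899, rng)   # e.g. a payment of 8.99, in cents
total = decrypt(n, lam, mu, add_encrypted(n, c1, c2))
print(total)
```

The party doing the addition never sees 1250 or 899. Note that ciphertexts live modulo n squared, so each one takes roughly twice as many bits as the modulus, and every addition becomes a big-number multiplication. That is a small-scale view of the overhead described above.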

Anonymizing with synthetic data, by contrast, gets away from sharing the original data entirely.

“Our software takes the original training set as learning material,” explains Alexandra Ebert, chief trust officer at Mostly AI, a fintech that has developed anonymization software for financial services firms. “Our deep learning algorithm identifies the patterns, the correlations, and understands how the customers behave, and what’s logical for them. Once the training process is finished, a completely separate synthetic dataset is generated from scratch that has the same characteristics.”

No single record in the synthetic dataset matches an original, but the synthetic dataset is still as useful as the original for training machine learning algorithms or for analytics, claims Ebert.
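
The learn-then-generate loop can be illustrated with a far simpler model than deep learning. The sketch below fits a two-variable Gaussian to a made-up table of income and spending, then samples fresh rows from the fitted model. The Gaussian is a deliberate stand-in and is not Mostly AI's method; the column meanings and parameters are assumptions.

```python
import math
import random
import statistics


def make_original(n=500, seed=0):
    """Toy 'private' table: (income, spend) pairs with a built-in correlation."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        income = rng.gauss(50.0, 10.0)
        spend = 0.4 * income + rng.gauss(0.0, 3.0)
        rows.append((income, spend))
    return rows


def fit(rows):
    """'Training': estimate means, standard deviations and correlation."""
    xs, ys = zip(*rows)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / len(rows)
    return mx, my, sx, sy, cov / (sx * sy)


def sample(params, n, seed=1):
    """'Generation': draw brand-new rows from the fitted model."""
    mx, my, sx, sy, rho = params
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x = mx + sx * z1
        # Mix z1 into y so the synthetic rows carry the learned correlation.
        y = my + sy * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        out.append((x, y))
    return out


original = make_original()
params = fit(original)
synthetic = sample(params, len(original))

print("rows copied verbatim:", len(set(original) & set(synthetic)))
print("original correlation:", round(params[4], 3))
print("synthetic correlation:", round(fit(synthetic)[4], 3))
```

The generator only ever sees the fitted parameters, never the rows themselves, so the synthetic table is built from scratch. A real product would need a much richer model to capture non-linear patterns and categorical fields, and the check here only confirms that no row is copied verbatim.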

There are several machine learning techniques for creating artificial datasets. Standard Chartered prefers so-called Boltzmann machines. Some practitioners advocate using variational autoencoders. A more advanced technique, generative adversarial networks, is the same method used to generate viral deep-fake TikTok videos of Tom Cruise.

Opening doors

For investors, synthetic data opens the door to testing datasets more easily.

“When a fund wants to trial alternative data, usually it takes time,” says Gautier Marti, a quant researcher and developer at the Abu Dhabi Investment Authority, and an expert in ways to replicate complex datasets. “With synthetic data, you can share samples without needing to sign non-disclosure agreements and so on.”

Marti has looked at different ways that synthetic data might be used in finance and sees anonymization as the most obvious application.

Hedge funds making big investments in alternative data want assurances that the supply of data won’t be cut off because customers change their privacy settings, says Lorn Davis, vice-president of corporate and product strategy at Facteus, a company that anonymizes card transaction data.

Investors also face restrictions on the use of existing data that synthetic data might help overcome. “Data from third-party providers like Bloomberg, Markit and others comes with huge restrictions,” says Mostenicky. Because of legal, compliance and operational hurdles set by vendors or their own organizations, data “owners” within buy-side firms are “super anxious” about using data in new ways, he says.

Using data to verify whether a startup seeking investment can deliver on its promises, for example, could be classified as commercial use, which is often prohibited under contracts with vendors. 

“With synthesizing, you can replicate the shape and form and size of the data that you’re trying to use without the hassle of having to ask a vendor like Bloomberg as well as your legal team whether you can use it,” Mostenicky says.  

Investors could also use the technology to share their own data with vendors and external service providers. “A firm might say its technology will generate signals that tell you when your traders need to sell and when to buy,” Mostenicky says. “To test that you want to provide proprietary data sets on the performance of your investment funds. But you want to make sure you don’t disclose everything.

“With synthetic data, you can allow the external firm to use your synthetic securities and investment holding data and provide signals on the data they see.”

In the most advanced cases this could extend to creating a “sandbox” environment where external parties gain access to a range of anonymized data such as investment data or pricing data for use in proof-of-concept exercises.

“You can give them access to the data without revealing any of your investment strategies or your allocations,” Mostenicky says. “That can shorten the time to determine whether a startup or possible partner is relevant or not.”

For banks, too, a fail-safe method of anonymization would grease the wheels of collaboration with outside partners, specifically fintech companies sweeping the industry with offerings such as fraud detection systems or loan-default prediction models. Often those companies employ machine learning and so need access to bank data.

An Italian bank used synthetic data to validate a credit-scoring machine learning product from a third-party startup, Ebert says.   

Synthetic data could also help with internal development projects. Thresholds set for developers in banks to work on data can be prohibitively high. “It’s insane how little access they have,” Hazy’s Keen says. One bank was unable to run an internal hackathon because its technologists could not satisfy the bank’s own data governance requirements.

Erste Group built a retail banking app using Mostly AI’s synthetic data, road-testing how the app would work with customers both in terms of its design and in load testing the app’s capacity.

In a paper on the work done at American Express, the firm’s data scientists say publishing synthetic datasets would help industry innovators develop, train and test machine learning models in areas such as fraud detection.

Land of the hard

Making up data is no panacea, of course. Part of the value of information, especially for investors, is that it’s timely. Training generator models and then synthesizing the data takes time, which can slow the speed at which new information becomes available.

Meanwhile, customer behavior can change quickly, Davis at Facteus points out, as was the case during the Covid pandemic. Forecasting or investment models based on old data can quickly become redundant.

And even fake data generators must give up some fidelity. Ebert says Mostly AI uses 99% of the original data for training to ensure the model doesn’t pick up on extreme outliers and inadvertently learn to reproduce them like-for-like.

That said, practitioners think the idea will catch on.

In future, buy-siders may well work with data that’s not real. And banks, in faking data, could find a way to monetize information they own but might struggle to use.

It’s early days. “This is still in the land of hard,” Lucini at Accenture says. Verifying that synthetic datasets are a good enough statistical match to the original will continue to be a tricky problem, he says.
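
One basic building block for that kind of verification is a two-sample comparison of distributions. The sketch below computes the Kolmogorov-Smirnov statistic, the largest gap between two empirical cumulative distributions, for a single column. The choice of test and the simulated data are assumptions made for illustration; this is not a method attributed to Lucini or Accenture.

```python
import bisect
import random


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        # Fraction of each sample that is <= x.
        fa = bisect.bisect_right(a, x) / len(a)
        fb = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d


def draw(mean, sd, n, seed):
    rng = random.Random(seed)
    return [rng.gauss(mean, sd) for _ in range(n)]


real = draw(100.0, 15.0, 1000, seed=0)
good_synthetic = draw(100.0, 15.0, 1000, seed=1)  # same distribution
bad_synthetic = draw(110.0, 15.0, 1000, seed=2)   # shifted mean

print("faithful generator:", ks_statistic(real, good_synthetic))
print("drifting generator:", ks_statistic(real, bad_synthetic))
```

A value near zero means the two columns are hard to tell apart; a larger value flags a mismatch. A check like this covers only one column at a time. Confirming that the relationships between columns also survive is the harder part of the problem.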

But Lucini thinks the approach will one day become mainstream. Give it three years, he says, and anonymization through synthesizing data will be common practice.
