IHS Markit to Add About 1 Million Analyst Reports to its Data Lake

IHS Markit uses Google’s transformer-based model BERT and a combination of classification and extraction techniques to determine what the documents mean and summarize them.


IHS Markit is adding unstructured data, in the form of research articles and papers, to its proprietary Data Lake.

By the end of Q4, the data service provider aims to upload about one million documents published by internal analysts over the past 10 years. The research reports cover topics related to financial services, the automotive industry, agriculture, chemicals, economics and country risks, energy, life sciences, and more. 

Yaacov Mutnikas, chief technology officer and chief data scientist at IHS Markit, says the documents will be summarized and tagged so that users can understand their gist, and search for articles and reports by topic.

“For example, you can pull up ‘Argentina GDP’ and all the results on anything that was ever published on Argentina’s GDP will come up,” he says.

IHS Markit will generate a synopsis and extract domain-specific entities for each document before it goes into the data lake. 

“We are also running feature engineering through all the documents and extracting specific features such that we can label those articles to facilitate a much easier topic and article discovery,” Mutnikas says. 
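The article does not describe how the labels are stored or queried. As a rough illustration only, label-based discovery of the kind Mutnikas describes can be sketched as an inverted index that maps each tag to the documents carrying it. The document IDs, tags, and function names below are hypothetical and are not taken from IHS Markit's system.

```python
from collections import defaultdict


def build_tag_index(documents):
    """Map each lowercased tag to the set of document ids labeled with it."""
    index = defaultdict(set)
    for doc_id, tags in documents.items():
        for tag in tags:
            index[tag.lower()].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents labeled with every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the first term's documents, then intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents and tags, invented for illustration only.
documents = {
    "report-001": ["Argentina", "GDP", "economics"],
    "report-002": ["Argentina", "agriculture"],
    "report-003": ["Brazil", "GDP"],
}

index = build_tag_index(documents)
matches = sorted(search(index, "Argentina GDP"))
print(matches)
```

In this toy version a query matches only documents that carry every term as a tag, so "Argentina GDP" returns the one report labeled with both. A production system would presumably also handle multi-word tags, synonyms, and ranking.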

IHS Markit used various machine learning and natural language processing techniques for the tagging system, including Google’s transformer-based model BERT.

All this work will make it easier for clients to resolve ambiguities between, for example, ‘Trump, the man’, and ‘Trump Tower, the building’, he explains. “[This is] to understand when you’re talking about an organization, or a location, or when you’re talking about a person versus a publication like a book.” 

Mutnikas adds that development work on the documents started in February and was completed towards the end of July. Currently, the service is going through testing and validation. “When that is done, we will onboard all the documents, all the machinery, indexing, and curating. And we will do it in just under two months—about six weeks,” he says. “It’s a very big step for us, to manage all the unstructured content that we have in the company.” 

Data Lake has been available to clients since May 18, and currently has about 1,000 proprietary datasets from the financial services, energy and resources, and transportation sectors. 

There are a few ways that buy- and sell-side clients can use Data Lake. The first, Mutnikas says, appeals to those who want to monetize their own data. These users can inject their data onto the platform and use IHS Markit’s framework to distribute it to various users.

Other users might want to compare their in-house datasets to the datasets that IHS Markit has. “They might want to merge the breadth of their data and the depth of our data. They can merge two datasets to get the best outputs,” he says. 

A third use case is to research opportunities or additional insights, for example in emerging markets. “If you look at macroeconomic data, for example, we’ve got north of 18 million time series,” Mutnikas says.

The cloud-based platform stores, catalogs, and governs access to structured and unstructured data. Using the catalog, clients can search and explore IHS Markit’s datasets via a standardized taxonomy. Clients can use the tools they want to work with their own as well as IHS Markit’s proprietary data in one place. 

“We cannot tell people where the opportunity is, but we can help people to find those opportunities, because people look for different things,” he says. “For example, somebody will look for opportunities in Latin America, or somebody else will look for opportunities in Southeast Asia. They are both emerging markets and the opportunities there are different. So we enable people to use the tools they’re comfortable with rather than imposing tools on them. Essentially, we support any tool that people want to use in that space.”

IHS Markit has curated its 1,000 datasets into data packages, which include access to metadata, sample data, and data dictionaries, to make browsing easier. 


Sorting Documents

IHS Markit worked closely with the relevant analysts to ensure that the reports and articles were summarized correctly and that the appropriate topics were tagged. 

Mutnikas says the analysts played a vital role in helping to develop the machine that IHS Markit used to summarize the documents. 

“When we sit down and write the initial data science machinery around it, they can validate if our assumptions and how it summarizes documents across their domain are represented correctly. From there, the machine does the work,” he says. 

The benefit for IHS Markit is that it has people in engineering and research, as well as specialists who understand the data. “You’ve got to understand what you’re looking at. … That is why having specialists curating that data and owning the data is key. Otherwise, [it’s a] no go,” he adds.

 
