Making Sense of Raw Data: New Techniques in Product Databasing


The vast amount of data produced today provides business with significant opportunities to improve the consumer experience. For businesses and consumers to benefit fully, however, we need new ways of processing this information in its many forms.


Professor Periklis Andritsos and his co-researchers have developed techniques and algorithms that enable raw information to be prepared for use more efficiently and effectively than current methods.

Periklis Andritsos was a professor in the Department of Information Systems at HEC Lausanne, University of Lausanne.

We live in an information age. According to IBM, 90% of the data in the world has been created in the last few years. Every day we create 2.5 quintillion bytes of data. You might think that having this treasure trove of data at our disposal would help make us a better, healthier, happier, more productive society. And so it should, if only we could make sense of it all.


We have rapidly progressed from a situation where individuals are principally consumers of information to one where they are both creators and consumers. And where information was once predominantly static, produced, prepared and made available in a fixed form, today it is often created and consumed on the fly, in real time, frequently changing and transient in nature.

Potentially, much of this information is beneficial for companies, provided it can be processed in ways that allow its easy use. For some time we have had database technologies that enable information structured in the appropriate way to be utilized effectively. Today, though, data is produced in many ways, from different sources and platforms, and in vast quantities. The old techniques used to corral and constrain data are no longer as effective. Dealing with large datasets, and with semi-structured and unstructured data, can be labor-intensive, cumbersome, costly and slow.

Data analysis, sorting and searching


Fortunately, new technologies and methods are emerging in the battle to sort and analyze the constant deluge of data that threatens to overwhelm us. The work of Periklis Andritsos, an information systems expert, and his co-researchers Fei Chiang and Renée Miller, highlighted in their paper “Data Driven Discovery of Attribute Dictionaries”, is a good example.

Andritsos and his colleagues focus on a data challenge faced by price comparison websites. These websites collect and collate raw product information, storing it in a way that consumers can search. At the moment, the raw product information needs to be manually inspected, sorted and tagged to optimize it for customer searching, a hugely labor-intensive task. However, the research team has created a new framework and algorithms that enable the raw product information to be processed more effectively.

Their work focuses on attribute dictionaries, which provide a reference list of valid attribute features for a product. Cameras might have an attribute titled “manufacturer”, for example, with values that include Canon, Nikon and Sony. The dictionaries are part of the database software solution that makes it easier for consumers to search the product information.
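An attribute dictionary of this kind can be thought of as a mapping from an attribute name to its set of valid values. The sketch below is a minimal illustration of that idea; the attribute names and values are taken from the camera example above and are not the paper's actual data structure.

```python
# A minimal sketch of attribute dictionaries: each attribute maps
# to the set of values considered valid for it. The "resolution"
# entries are illustrative additions, not from the paper.
attribute_dictionaries = {
    "manufacturer": {"Canon", "Nikon", "Sony"},
    "resolution": {"1080p", "720p", "4K"},
}

def lookup(token):
    """Return the attributes whose dictionary contains this token."""
    return [attr for attr, values in attribute_dictionaries.items()
            if token in values]

print(lookup("Sony"))  # ['manufacturer']
```

With dictionaries like this in place, a consumer's search term can be resolved to the attribute it describes, which is what makes faceted product search possible.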


After their initial creation, these dictionaries must be maintained as the price comparison website receives product offers from various sources. The offer data is received in the form of a bulk set of records, usually text descriptions containing a number of “tokens” or “values” separated in some way, by whitespace for example, which look something like this: “Sony XBR 1080 32″ LCD HDTV 120Hz”.

The aim is to construct a smaller representation of the original product dataset that contains the key information, and to do this in a predominantly automated way. The method developed by Andritsos and his colleagues has three main stages. Initially, a dictionary discovery process looks for instances where the same tokens, such as "LCD HDTV", occur together. Frequently recurring groups of tokens, known as segments, are extracted, reducing the size of the dataset.
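The discovery step can be sketched as counting groups of adjacent tokens across the offer records and keeping those that recur often enough. The support threshold and the simple n-gram enumeration below are illustrative assumptions, not the paper's exact algorithm.

```python
from collections import Counter

# Illustrative offer records, modeled on the "Sony XBR ..." example.
offers = [
    'Sony XBR 1080 32" LCD HDTV 120Hz',
    'Samsung LN32 1080 40" LCD HDTV 60Hz',
    'LG 42LB 1080 42" LCD HDTV 120Hz',
]

def candidate_segments(record, max_len=3):
    """Yield every group of 2..max_len adjacent tokens in a record."""
    tokens = record.split()
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

counts = Counter(seg for offer in offers for seg in candidate_segments(offer))
# Keep token groups that occur in at least 3 records (assumed threshold).
frequent = [seg for seg, c in counts.items() if c >= 3]
print(frequent)  # [('LCD', 'HDTV')]
```

Here only "LCD HDTV" recurs in every record, so it becomes a candidate segment while one-off token groups are discarded.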

However, this process may produce segments that contain similar tokens (and therefore similar information) but in a different order, such as “wide screen LED”, “LED wide screen”, and “wide LED screen”. The researchers apply another algorithm to compare the information contained in such segments and, if appropriate, refine segments by removing or adding tokens. This further reduces the number of candidates for the attribute dictionaries.
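One simple way to see this refinement is to treat each segment as an order-insensitive set of tokens, so that word-order variants collapse to one canonical form. This set-based grouping is a simplification of the paper's comparison, used here only to illustrate the idea.

```python
# Order-variant segments from the article's example.
segments = ["wide screen LED", "LED wide screen", "wide LED screen"]

canonical = {}
for seg in segments:
    key = frozenset(seg.split())      # order-insensitive signature
    canonical.setdefault(key, seg)    # keep the first form seen

print(list(canonical.values()))  # ['wide screen LED']
```

All three variants share the token set {wide, screen, LED}, so they reduce to a single candidate, shrinking the pool of segments competing for a place in the attribute dictionaries.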

A further challenge remains: mapping the refined segments to attribute dictionaries. The attributes that need to be covered, such as manufacturer, model, screen size and resolution, are provided by the user. Initially, it is not known which segments belong in which attribute dictionaries. However, once the user has seeded each dictionary by mapping a few of the segments to the appropriate dictionary, the rest of the process is completely automatic. Segments in the dataset that match existing dictionary entries are discounted, and the remaining segments are evaluated for mapping based on their structural similarity to dictionary segments; for example, "win XP pro" and "windows XP prof" are structurally similar.
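The seeding-then-matching step can be sketched as follows. Each unassigned segment is compared against the seed entries and assigned to the dictionary holding its most similar entry. `SequenceMatcher` and the 0.6 threshold are stand-ins for the paper's structural-similarity measure, chosen here only for illustration.

```python
from difflib import SequenceMatcher

# User-provided seeds: a few segments mapped to their dictionaries.
# The attribute names are illustrative assumptions.
seeds = {
    "operating_system": ["win XP pro"],
    "screen_size": ['32"'],
}

def best_match(segment, threshold=0.6):
    """Assign a segment to the dictionary with the most similar seed entry."""
    best_attr, best_score = None, 0.0
    for attr, entries in seeds.items():
        for entry in entries:
            score = SequenceMatcher(None, segment.lower(), entry.lower()).ratio()
            if score > best_score:
                best_attr, best_score = attr, score
    return best_attr if best_score >= threshold else None

print(best_match("windows XP prof"))  # operating_system
```

Because "windows XP prof" shares most of its character structure with the seed "win XP pro", it lands in the operating-system dictionary without any further manual tagging.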

Outperforming established methods

The performance of the team’s new dictionary discovery technique and algorithms has been benchmarked against, and consistently outperformed, established methods. The team has also incorporated their work into a tool that can implement the solution on websites.


The research is an important step forward. Potentially, it has much broader application beyond price comparison websites, being useful for processing and analyzing semi-structured and unstructured data in a variety of situations. And the team continues to refine and improve its data processing and analysis techniques: good news, considering the ever-increasing volume of data businesses have to deal with.

Read the original paper: Data Driven Discovery of Attribute Dictionaries, Fei Chiang, Periklis Andritsos, Renée J. Miller, 2016.
