Novelty Detection in Review Data

The recent boom in e-commerce has created active online
communities where consumers share their thoughts about
products and the companies behind them. These reviews play an
important part in shaping opinion about an item. A popular
product or service may have thousands of reviews, making it
difficult for a customer to make an informed decision. Here,
we try to surface the reviews that are outliers relative to the
general cluster of reviews over a short period of time. This gives us an idea of the “new thing” being talked about during that period.

TL;DR

The algorithm filters the novel reviews in three main steps:
1. The review text is pre-processed and cleaned.
2. A tf-idf matrix is built from the review corpus using an n-gram model.
3. Novel reviews are filtered using two approaches: Isolation Forest and Local Outlier Factor.
The data consists of reviews from the last year from both the Play Store and the App Store. A tf-idf matrix is built from this data. The model is trained on the first 5 months of reviews and tested on the rest.
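The chronological split described above can be sketched roughly as follows. This is only an illustration: the `(date, text)` record layout, the `split_by_month` helper, and the choice to count "months" as distinct calendar months present in the data are all assumptions, not details from the original pipeline.

```python
from datetime import date


def split_by_month(reviews, train_months=5):
    """Split (date, text) pairs so the earliest calendar months form the training set."""
    ordered = sorted(reviews, key=lambda r: r[0])
    # Distinct (year, month) pairs present in the data, oldest first.
    months = sorted({(d.year, d.month) for d, _ in ordered})
    train_keys = set(months[:train_months])
    train = [r for r in ordered if (r[0].year, r[0].month) in train_keys]
    test = [r for r in ordered if (r[0].year, r[0].month) not in train_keys]
    return train, test


# Made-up data: one review per month over a single year.
reviews = [(date(2018, m, 15), f"review from month {m}") for m in range(1, 13)]
train, test = split_by_month(reviews)
print(len(train), len(test))  # 5 7
```

Splitting by time rather than at random keeps every test review later than every training review, which matches the goal of spotting what is new in a given period.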

Preprocessing:

We did basic preprocessing on the text data, removing symbols, punctuation, emojis, stop-words, etc. We then fit the tf-idf vectorizer on all the reviews. The tf-idf weighting scales down the impact of high-frequency words, which are empirically less informative than words that occur rarely. We also used bi-grams and tri-grams as features to boost the accuracy of the model.
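As a rough sketch of this step, the cleaning and vectorization could look like the following with scikit-learn. The `clean_review` helper, the sample reviews, the decision to drop digits, and the exact `ngram_range=(1, 3)` setting (unigrams plus bi-grams and tri-grams) are assumptions for illustration. The post does not state the actual parameters.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer


def clean_review(text):
    """Lowercase the text and keep only letters (drops symbols, digits, emojis)."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(text.split())  # collapse repeated whitespace


# Made-up sample reviews.
reviews = [
    "Love the app!!! 😍 Works great.",
    "The new design update is confusing :(",
    "Great app, works as expected.",
]
cleaned = [clean_review(r) for r in reviews]

# Stop-word removal is delegated to the vectorizer; n-grams up to length 3.
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))
tfidf_matrix = vectorizer.fit_transform(cleaned)
print(tfidf_matrix.shape[0])  # one row per review
```

The result is a sparse matrix with one row per review and one column per n-gram, which can be passed directly to the outlier detectors.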

Finding Novelty

We used an ensemble of Isolation Forest and Local Outlier Factor (LOF). Isolation Forest is sensitive to global outliers but weak at detecting local outliers. LOF performs well at local outlier detection but has high time complexity. To get the benefits of both, the ensemble first runs Isolation Forest, which has low complexity, with contamination = 0.5 to quickly scan the dataset, prune the apparently normal data, and generate an outlier candidate set. LOF with n-grams is then applied to the candidate set to refine it and obtain more accurate outliers.
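A minimal sketch of the two-stage idea with scikit-learn is shown below. The `two_stage_novelty` helper, the synthetic data, `n_neighbors=20`, and the choice to fit LOF in `novelty=True` mode on the training data are assumptions. The post only specifies Isolation Forest with contamination = 0.5 followed by LOF on the candidate set.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor


def two_stage_novelty(X_train, X_test, n_neighbors=20, random_state=0):
    """Return indices of test rows flagged by Isolation Forest and then by LOF."""
    # Stage 1: cheap, coarse pass that prunes apparently normal rows.
    forest = IsolationForest(contamination=0.5, random_state=random_state)
    forest.fit(X_train)
    candidate_idx = np.flatnonzero(forest.predict(X_test) == -1)
    if candidate_idx.size == 0:
        return candidate_idx

    # Stage 2: LOF scores only the candidates (assumed novelty mode).
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, novelty=True)
    lof.fit(X_train)
    is_novel = lof.predict(X_test[candidate_idx]) == -1
    return candidate_idx[is_novel]


# Synthetic stand-in for the tf-idf matrix: the last 5 test rows are far away.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
X_test = np.vstack([rng.normal(size=(50, 5)), rng.normal(loc=8.0, size=(5, 5))])
novel_idx = two_stage_novelty(X_train, X_test)
print(novel_idx)
```

Because LOF only scores the rows that survive the first stage, its neighbor queries run on a fraction of the test set, which is where the time saving comes from.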

Final Results

Novel reviews for 09–2018

Here we see that the reviews tagged as outliers tell us the dip
in ratings occurred due to a new design update.

In November, the ratings increased, and we can see from the filtered review that the team had addressed the design change.

Thus, the algorithm helps us identify new updates or bugs in the product, giving us information about the product during that time period.

References

Sinha, Ankita & Subrahmaniam, Vignesh. (2020). Cerebro: Novelty Detection in Product Reviews. 140–143. 10.1145/3380688.3380701.

Hi I am Ankita. I work at Intuit India. I am passionate about machine learning and artificial intelligence.