Authors - Soheil Esmaeilzadeh, Negin Salajegheh, Amir Ziai, and Jeff Boote
Affiliation - Netflix, CA
Abstract - This work presents a fraud and abuse detection framework for streaming services by modeling user streaming behavior. The goal is to discover anomalous and suspicious incidents and scale the investigation efforts by creating models that characterize the user behavior. We study the use of semi-supervised as well as supervised approaches for anomaly detection. In the semi-supervised approach, by leveraging only a set of authenticated anomaly-free data samples, we show the use of one-class classification algorithms as well as autoencoder deep neural networks for anomaly detection. In the supervised anomaly detection task, we present a so-called heuristic-aware data labeling strategy for creating labeled data samples. We carry out binary classification as well as multi-class multi-label classification tasks for not only detecting the anomalous samples but also identifying the underlying anomaly behavior(s) associated with each one. Finally, using a systematic feature importance study, we provide insights into the underlying set of features that characterize different streaming fraud categories. To the best of our knowledge, this is the first paper to use machine learning methods for fraud and abuse detection in real-world scale streaming services.
Keywords: Heuristic-Aware, Fraud Detection, Machine Learning, Streaming Services
Fig.4. A schematic of Synthetic Minority Over-sampling Technique (SMOTE)
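As a rough illustration of the interpolation idea behind SMOTE, the following is a minimal sketch in Python using only NumPy. It covers the basic single-class case, not the multi-class multi-label variant used in this work. The function name `smote_sketch` and its parameters are hypothetical and chosen only for this example. Each synthetic sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority neighbors.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbor interpolation.

    Hypothetical helper for illustration; assumes X_min holds at least
    two minority-class samples, one per row.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other samples

    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor

    # Indices of the k nearest minority neighbors of each sample.
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbors for each new point.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a random position along the base-to-neighbor segment.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Usage: oversample four minority points in the unit square.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_minority, n_new=10, k=2)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already spanned by the minority class.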
Fig.5. For the three fraud categories before and after carrying out multi-class multi-label SMOTE: (a) number of anomalous tagged accounts and (b) label imbalance ratio.
Table 2. The values of the evaluation metrics for a set of semi-supervised anomaly detection models.
Table 3. The values of the evaluation metrics for a set of supervised binary anomaly detection classifiers.
Table 4. The values of the evaluation metrics for a set of supervised multi-class multi-label anomaly detection approaches. The values in parentheses refer to the performance of the models trained on the original (not upsampled) datasets.