The ability to identify usage patterns from geospatial data has become key to understanding users’ behavior and addressing their individual needs. IoT Venture is a scale-up that provides white-label hardware and software solutions for tracking moving assets, such as bikes and dogs, via IoT smart devices. While the core business at the moment lies in tracking and locating these assets, each of them by design writes its own data signature in terms of usage and movement behavior. This enables IoT Venture to explore new data use cases that make their smart devices even smarter and thereby create added value for their clients.
In collaboration with inovex and the University of Cologne, IoT Venture started an exploratory project to classify dog trackers’ usage patterns, with a strong requirement to use as little data as possible so that IoT smart devices can be profiled quickly and automatically. This would help them to better understand their customers and enable them to tailor their services to individual user needs. Furthermore, this first project would lay the groundwork for further data use cases to come.
Project goal
The initial assumption by IoT Venture was that dog owners show different usage patterns, such as using the device all the time or only during walks. Our task was, first, to identify whether and how these patterns can be detected and, second, to develop a classification model that could accurately predict a device’s usage pattern after only the first few days of use. We aimed to achieve this goal by following a data-driven approach that involved exploratory data analysis, defining ground truth labels for different usage patterns using clustering algorithms, and finally developing a data-efficient classification model.
Pet tracker analytics
Exploratory data analysis
Our team started by analyzing millions of GPS data points provided by IoT Venture. In the first step, we pre-processed the data by cleaning it, removing missing values, and transforming it into a suitable format for analysis. This allowed us to apply descriptive analytics to gain a better understanding of the data at hand and its underlying patterns. This would then help us to identify potential features that could inform our machine-learning models later in the process.
The image below shows part of a visualization that we generated within this process. While we visualized the geospatial data, we additionally developed a rule-based logic to identify the user’s home location (marked with the pin) from the behavioral data. This helped us later in the process, as we were able to use this information as one of the features for our algorithms (transformed into the proportion of time that the device was being used at home). So let’s have a look at what we used this and other features for.
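To make the home-location feature more tangible, here is a minimal sketch in Python. The article does not describe the actual rule, so the heuristic below is purely an assumption: "home" is taken to be the rounded coordinate cell in which a device reports most often at night. The function names, the night-time window, and the rounding precision are all hypothetical.

```python
from collections import Counter
from datetime import datetime


def estimate_home_cell(fixes, precision=3):
    """Return the most frequent rounded (lat, lon) cell among night-time fixes.

    Assumption (not from the project): home is where the device reports
    most often between 22:00 and 06:00.
    """
    night_cells = [
        (round(lat, precision), round(lon, precision))
        for ts, lat, lon in fixes
        if ts.hour >= 22 or ts.hour < 6
    ]
    if not night_cells:
        return None
    return Counter(night_cells).most_common(1)[0][0]


def home_fraction(fixes, home_cell, precision=3):
    """Share of fixes that fall into the home cell."""
    if not fixes or home_cell is None:
        return 0.0
    at_home = sum(
        (round(lat, precision), round(lon, precision)) == home_cell
        for _, lat, lon in fixes
    )
    return at_home / len(fixes)


# Made-up fixes: (timestamp, latitude, longitude)
fixes = [
    (datetime(2023, 5, 1, 23, 0), 50.9371, 6.9601),
    (datetime(2023, 5, 2, 2, 0), 50.9373, 6.9603),
    (datetime(2023, 5, 2, 9, 0), 50.9412, 6.9581),
    (datetime(2023, 5, 2, 18, 0), 50.9372, 6.9602),
]
home = estimate_home_cell(fixes)
share = home_fraction(fixes, home)
```

Note that the share of fixes only approximates the share of time if the device reports at roughly regular intervals; otherwise each fix would have to be weighted by the time until the next one.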
Clustering
Before we can try to predict usage patterns, we first need to generate labels for the different classes of user types, which we can then use for classification. In our case, since all the information about individual usage behaviour is available retrospectively, we can feed this data to an unsupervised learning algorithm to find a clear separation between different classes of usage patterns and thereby create a label for each of them.
To group similar devices based on their usage patterns, we applied K-Means clustering, as it separated the data best among the popular clustering algorithms we tested. Our choice was also motivated by the algorithm’s ability to handle large datasets and create distinct and compact clusters. K-Means requires the number of clusters (K) to be defined beforehand. Using the silhouette score and the elbow method, we selected K = 3. These methods helped us to choose the number of clusters in a way that maximized the homogeneity within clusters and the heterogeneity between them.
To deal with the many features, and therefore the high dimensionality, of the data generated by the devices’ sensors and measurements, we applied Principal Component Analysis (PCA) before K-Means clustering. PCA had two advantages. First, it reduced the dimensionality of our data, which improved computational efficiency and removed multicollinearity among the features. Second, and more importantly for us, projecting the data onto two principal components let us plot the clusters on two axes, see how they differ, and judge how well they separate and how they can be interpreted.
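The combination of PCA, K-Means, and the silhouette score described above can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not the project code: the four feature columns, the cluster centers, and the decision to cluster directly on the two principal components are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for per-device features (hypothetical columns):
# trips per day, active days, daily battery drain (%), fraction of time at home
centers = np.array([
    [6.0, 28.0, 30.0, 0.40],
    [2.0, 15.0, 12.0, 0.70],
    [0.5, 4.0, 5.0, 0.10],
])
noise = np.array([0.5, 2.0, 2.0, 0.05])
X = np.vstack([c + rng.normal(size=(100, 4)) * noise for c in centers])

# Standardize, then project onto two principal components
X_2d = make_pipeline(StandardScaler(), PCA(n_components=2)).fit_transform(X)

# Try several values of K; keep silhouette score and inertia for each
scores, inertias = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_2d)
    scores[k] = silhouette_score(X_2d, km.labels_)
    inertias[k] = km.inertia_

best_k = max(scores, key=scores.get)
final_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_2d)
```

Plotting `inertias` against K gives the curve used for the elbow method, while `scores` holds the silhouette score per K. On real data the two criteria do not always agree, so both are worth inspecting.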
After considering different combinations of our features, testing different values of K, and looking into the silhouette scores, we selected a final clustering model.
The model groups our users into three clusters according to their behavioral data. After determining the final clusters, we validated our results with descriptive statistics that make the characteristics of each class interpretable. The first cluster (colored in purple) is characterized by intensive use of the device, with a higher number of trips, more active days, and a higher daily battery drain. The second (yellow) one represents devices with moderate usage and battery drain. The last cluster (turquoise), which is the smallest group, shows a reduced intensity of usage and speeds of over 70 km/h, which are unrealistic for a dog and suggest that these devices may be used for infrequent car trips. By comparing these key characteristics, we were able to assess the distinctness of our clusters and interpret them. Our final interpretation concludes with the labeling of three behavioral types: intensive users, occasional users, and outliers. These three labels form the basis for our actual goal: classifying users according to their usage patterns early in their usage history.
Classification Model
With the ground truth labels in place, we had the foundation for a supervised classification model. We started with a baseline model that classifies users from a single feature, the average daily battery drain (%), and that any more complex model has to beat to be considered. We chose this feature because it reflects the usage intensity of a device. The baseline’s results underlined the need for a more complex model that incorporates additional features: intensive users in particular posed a challenge, as just 27 % of them were correctly identified.
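A single-feature baseline of this kind, together with the per-class recall used to judge it, might look like the sketch below. The article does not say which algorithm the baseline used, so the shallow decision tree is an assumption, and the battery-drain values are synthetic. The recall values it produces are therefore not comparable to the figures reported here.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
classes = np.array(["intensive", "occasional", "outlier"])

# Synthetic labels and a single, heavily overlapping feature:
# average daily battery drain in percent (made-up distributions)
idx = rng.choice(3, size=600, p=[0.3, 0.6, 0.1])
mean_drain = np.array([22.0, 15.0, 12.0])
battery_drain = rng.normal(mean_drain[idx], 6.0)
X = battery_drain.reshape(-1, 1)
y = classes[idx]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Assumed baseline: a shallow decision tree on the single feature
baseline = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# Recall per class: share of each true class that was correctly identified
recalls = recall_score(y_test, baseline.predict(X_test), labels=classes, average=None)
per_class_recall = {str(c): float(r) for c, r in zip(classes, recalls)}
```

Looking at recall per class rather than overall accuracy is what reveals a weakness such as poorly detected intensive users, which a single accuracy number can hide when one class dominates.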
Through meticulous analysis, we tested nine classification algorithms, measuring and assessing their performance metrics. Three frontrunners emerged: Random Forest, Gradient Boosting, and Linear Discriminant Analysis. Digging into feature importance, we found that Random Forest and Gradient Boosting prioritized variables related to the number of points sent per day and trips. On the other hand, Linear Discriminant Analysis gleaned most insight from factors like speed, movement fraction, and battery drain.
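A comparison of this kind can be sketched with cross-validation in scikit-learn. The example below covers only the three frontrunners, runs on synthetic data from `make_classification`, and uses macro-averaged recall as the metric. The data, the metric, and the model settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic three-class problem standing in for the labeled device features
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "lda": LinearDiscriminantAnalysis(),
}

# Mean macro recall over 5 folds for each candidate
results = {
    name: cross_val_score(model, X, y, cv=5, scoring="recall_macro").mean()
    for name, model in candidates.items()
}

# Tree ensembles expose impurity-based feature importances after fitting
forest = candidates["random_forest"].fit(X, y)
importances = forest.feature_importances_
```

Linear Discriminant Analysis has no `feature_importances_` attribute; its fitted `coef_` array is one way to see which features drive the separation, provided the features are on comparable scales.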
Hyperparameter tuning further improved the accuracy of our classification models. We performed a Randomized Search Cross-Validation, a technique for optimizing a model’s hyperparameters, which are the settings that are not learned from the data but are set by the user before training. Instead of exhaustively evaluating every combination as a grid search does, it samples a fixed number of random combinations from the search space, which makes it computationally much more efficient.
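In scikit-learn, this technique is available as `RandomizedSearchCV`. The following sketch tunes a Random Forest on synthetic data. The search space, the number of sampled combinations, and the scoring metric are hypothetical choices for the example, not the settings used in the project.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic three-class problem standing in for the labeled device features
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# Hypothetical search space: distributions are sampled, lists are picked from
param_distributions = {
    "n_estimators": randint(20, 100),
    "max_depth": [None, 4, 8, 16],
    "min_samples_leaf": randint(1, 10),
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=8,                # only 8 random combinations are evaluated
    cv=3,                    # each with 3-fold cross-validation
    scoring="recall_macro",
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

With `n_iter=8` and `cv=3`, only 24 models are trained, however large the search space is. An exhaustive grid over the same ranges would require thousands of fits.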
This led to improved results: recall for intensive users increased by 11 percentage points. The trade-off, a slight dip of two percentage points in recall for occasional users, was in line with the company’s goal of accurately identifying these two types of users. By correctly identifying them, the company can address its different user types by developing tailored strategies for each class.
While Gradient Boosting and Linear Discriminant Analysis yielded mixed results, the performance scores, the use case, and input from IoT Venture led us to a clear conclusion. The Random Forest Classifier emerged as the best choice, capable of accurately classifying users’ usage patterns and identifying 78 % of intensive users and 91 % of occasional users.
Conclusion
In conclusion, the successful implementation of our Random Forest classification model gives a first glimpse into the potential this rich data offers for IoT Venture. With an impressive 86 % accuracy that could be reached by using just five days of usage data, the model’s early prediction of usage patterns for the crucial first 30 days offers a strategic edge. Its potential for differentiation lies in tailoring distinct strategies for different types of users, an essential capability for any company seeking to meet diverse customer needs.
Optimizing battery performance for those identified as intensive users addresses a key concern and showcases IoT Venture’s commitment to its customer base. By introducing an automatic battery-saving mode and a custom configuration option for intensive users, the company can extend battery life while maintaining tracking effectiveness. Simultaneously, our classification model empowers IoT Venture to foster customer loyalty through proactive engagement and targeted retention strategies. Predictive analytics will identify at-risk users and guide personalized interventions to prevent customer churn. By acknowledging the unique requirements of each user class and addressing them proactively, the company can enhance its market competitiveness and build lasting customer relationships.