A Quick Overview of the Kaggle Competition for Avito
Introduction
I didn’t have much time for this competition, so I didn’t invest much in feature engineering, building ensembles, or other extras. Since I had also participated in the Avazu competition, where tinrtgu’s now-famous FTRL code was widely used, I decided to apply the same approach here.
Background
The overall goal of the competition is to analyze user behavior in order to build a model that recommends which ads to show users, with the success metric being whether or not the user clicks on the ad. There is already a lot of work on this topic, so there is no need to rebuild everything from scratch.
If you haven’t read Google’s paper on FTRL for ad click prediction, “Ad Click Prediction: a View from the Trenches”, I can really recommend it as a first step. It deals with all the topics we find in practice, like sparsity, hardware constraints, constraints on processing time, and so on. Anyone who has developed real-time bidding systems for demand-side platforms, supply-side platforms, or other ad prediction systems will probably find it very familiar.
The data provided by Avito for the competition consisted of a sequence of views, including information on the user, the ads they had seen, any requests to contact the seller, and so on. The data size didn’t present any particular challenges if you use online methods as I did; the main data file of search information is about 10GB uncompressed and contains approximately 400 million records. If you want to look further into ways to speed up the learning process, check out Sublinear Optimization for Machine Learning by Clarkson et al.
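Concretely, “online” here just means streaming the file one record at a time, so memory use stays flat no matter how large the file is. A minimal sketch (the file name comes from the competition data, but treat the details as illustrative):

```python
import csv

def stream_rows(path):
    """Yield records one at a time so memory use stays constant
    regardless of file size."""
    with open(path) as f:
        for row in csv.DictReader(f, delimiter="\t"):
            yield row

# Example: count the records without ever loading the file into memory.
print(sum(1 for _ in stream_rows("trainSearchStream.tsv")))
```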
Implementation
The implementation was fairly quick, as the solution is a known one for which Python code already exists. As is often the case, Abhishek posted a solution to beat the benchmark. That solution took some time to run since it depends on NumPy and Pandas, which precludes my typical use of PyPy for Python code needing a bit more performance. It is easy to get around this with some decoupling, so I split the parts depending on NumPy and Pandas into a separate file, enabling PyPy to run the performance-intensive code while the rest runs separately under CPython.
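Roughly, the split looks like the following two scripts; the file names and the processing shown are illustrative, not my exact code:

```python
# prepare.py -- run with CPython, since it depends on pandas
import pandas as pd

df = pd.read_csv("trainSearchStream.tsv", sep="\t")
# ... any joins or aggregations that genuinely need pandas go here ...
df.to_csv("train_prepared.csv", index=False)

# ------------------------------------------------------------------
# learn.py -- pure Python with no NumPy/Pandas imports, so PyPy can run it
import csv

with open("train_prepared.csv") as f:
    for row in csv.DictReader(f):
        pass  # hash features and feed them to the FTRL learner here
```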
The main code, which can be run by PyPy, consists of an ftrl object with a few methods for fitting, predicting, and updating. Nothing crazy.
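For those who haven’t seen it, here is a minimal sketch of what such an object can look like, following the per-coordinate FTRL-Proximal update from the Google paper; the hyperparameter defaults and the 2**20 hashing space are illustrative, not the exact values I used:

```python
from math import exp, sqrt

class FTRL:
    """Minimal per-coordinate FTRL-Proximal for logistic loss,
    in the style of tinrtgu's code."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0, d=2 ** 20):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.n = [0.0] * d  # per-coordinate sum of squared gradients
        self.z = [0.0] * d  # per-coordinate FTRL-Proximal accumulator

    def _weight(self, i):
        z, n = self.z[i], self.n[i]
        if abs(z) <= self.l1:
            return 0.0  # the L1 penalty keeps this weight at exactly zero
        sign = -1.0 if z < 0 else 1.0
        return (sign * self.l1 - z) / ((self.beta + sqrt(n)) / self.alpha + self.l2)

    def predict(self, x):
        """x: list of hashed feature indices (binary features)."""
        wTx = sum(self._weight(i) for i in x)
        return 1.0 / (1.0 + exp(-max(min(wTx, 35.0), -35.0)))

    def update(self, x, p, y):
        """Gradient of the logistic loss for binary features is (p - y)."""
        g = p - y
        for i in x:
            sigma = (sqrt(self.n[i] + g * g) - sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self._weight(i)
            self.n[i] += g * g
```

Training is then a single pass over the stream: hash each row’s raw features into indices, call predict, and feed the prediction and the observed click label back into update.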
Building a test set
If you want to assemble a test set to benchmark against locally, the following method seems to correlate very well with the score on the public leaderboard.
From the Avito competition:
Take the minimum date that occurs in the test set (May 12?).
Take the LAST session for each user that happens after that date (or a sample of them). The last session is the one with the highest date for that user.
This problem has a time component, so sampling from other periods of time won’t do, since ads are expected to change over time. Random sampling won’t do either, since it would over-represent users with many sessions. And if we take anything but the last (or n last) sessions for a user, we leave that user’s future in the training set, so CV won’t match the leaderboard because of overfitting. Bottom line: the data page says the test set is the last session for each user after May 12, so we just need to do the same (as close as it gets)!
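A pandas sketch of that selection; the column names match the competition’s SearchInfo file, but the cutoff year and the exact filtering are my assumptions:

```python
import pandas as pd

searches = pd.read_csv("SearchInfo.tsv", sep="\t", parse_dates=["SearchDate"])

# Sessions on or after the test period's earliest date (year assumed).
late = searches[searches["SearchDate"] >= "2015-05-12"]

# For each user, keep only the LAST session (the one with the highest date).
last_idx = late.groupby("UserID")["SearchDate"].idxmax()
local_test = searches.loc[last_idx]

# The rest stays in the local training set; for each user, everything
# kept here is earlier than their held-out last session.
local_train = searches.drop(index=last_idx)
```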
Comments from another competitor
Andrew Ostapets, who placed 9th in the competition, had the following to say about the approach:
My single FTRL model with basic features: AdID, UserID, IPID, Position, Price, … plus second- and third-order interactions between some of these features, plus similarity scores between SearchQuery and Title and between SearchParams and Params. It gets a public LB score below 0.044.
Adding just one new feature, PositionFactor, from my teammate Alexander gave an incredible improvement: the model scored ~0.0418 on the public LB.
PositionFactor = hash([Position 1_place:ObjectType 2_place:ObjectType 6_place:ObjectType 7_place:ObjectType 8_place:ObjectType])
where Position is the position of the given context ad on the search results page, and k_place:ObjectType is the position and type of the other ads on that page (if there is data about them). Alexander’s models (vw+xgb and vw+rf) gave him ~0.0424 on the public LB.
After tuning hyperparameters and finding the weights for a linear combination of our solutions, we achieved 0.04137 on the public LB.
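As an illustration, the PositionFactor described above can be read as concatenating the context ad’s position with the place and type of the surrounding ads, then hashing the result into a bucket. This is only my interpretation of the description, not the team’s actual code; the function name and bucket count are made up:

```python
def position_factor(position, other_ads, buckets=2 ** 20):
    """other_ads: (place, object_type) pairs for the other ads shown on
    the same search results page, when that data is available."""
    parts = ["Position:%d" % position]
    parts += ["%d_place:%s" % (place, t) for place, t in sorted(other_ads)]
    return hash(" ".join(parts)) % buckets

# Context ad at position 1; other ads at places 2, 6, 7 and 8 with their types.
f = position_factor(1, [(2, 3), (6, 2), (7, 1), (8, 1)])
```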
From a 5th place finisher
Gilberto Titericz Junior finished in 5th place and used this method, as described by Dmitry Efimov:
I used something similar in the Avazu contest; if you are interested, you can check my presentation (link no longer available) about the solution (there are some slides about Batch FTRL). The principle is to sort the combined dataset (train and test) by some fields and apply the FTRL algorithm. It then works in the following way: the algorithm trains on the first batch from the train set and makes predictions on the first batch from the test set. Then it starts training on the second batch from the train set, starting from the coefficients learned on the first batch, and so on.
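A sketch of that batch scheme, reusing an FTRL learner like the one sketched earlier; how the sorted data gets cut into batches is omitted here:

```python
def batch_ftrl(model, train_batches, test_batches):
    """Alternate training and prediction: coefficients learned on train
    batch k carry over into predictions for test batch k and into
    training on batch k + 1."""
    predictions = []
    for train_batch, test_batch in zip(train_batches, test_batches):
        for x, y in train_batch:   # x: hashed feature indices, y: 0/1 label
            p = model.predict(x)
            model.update(x, p, y)
        predictions.extend(model.predict(x) for x in test_batch)
    return predictions
```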
Conclusion
This was an interesting and fun competition, and a good proving ground for new ideas and approaches. I wrote up an implementation in Go, which I may package as a library and put on GitHub at some point. Until then, here is a forked gist from ceshine of the implementation from another competition: