So far this semester, I have been busy with assignments and the occasional midterm exams, and I apologize for the dearth of content in the past seven weeks. However, I have finally had time to write about one recent assignment that I found quite interesting. It was for CS 410, Text Information Systems, taught by Professor ChengXiang Zhai.
The assignment focused on implementing a retrieval function for a search engine using the MeTA toolkit. The retrieval function scores each query-document pair in a dataset so that documents can be ranked for each query. A better retrieval function yields a higher Normalized Discounted Cumulative Gain, meaning that relevant documents for a range of queries are ranked more highly. There were two parts to this assignment; the first was implementing a basic retrieval function, due on March 12 (later extended to March 16). The second part was a competition in which we tried to improve our retrieval functions to get a higher overall score. The top 50 submissions received extra credit, and the competition deadline was March 27 (later extended to March 28).
This competition used two datasets—the Cranfield dataset and the APNews dataset—and scored each submission according to a weighted arithmetic mean:
Overall Score = 0.3 * Cranfield NDCG@10 + 0.7 * APNews NDCG@10

"NDCG@10" is shorthand for the Normalized Discounted Cumulative Gain measure computed over the top 10 ranked documents. I'll refer to each dataset's NDCG@10 as its "Score" from now on.
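For readers unfamiliar with the metric, NDCG at a cutoff k can be sketched in a few lines. This is a minimal illustration with made-up relevance grades, not the competition's actual evaluation code:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: each grade is discounted by log2 of its rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending-grade) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# A perfectly ordered result list scores 1.0; swaps near the top cost the most.
print(ndcg_at_k([3, 2, 1], 3))           # 1.0
print(ndcg_at_k([3, 2, 3, 0, 1, 2], 6))  # ~0.9608
```

Because the discount grows with rank, a retrieval function is rewarded most for placing the best documents in the first few positions.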
During the competition, the course staff published a leaderboard with each row having these columns:
- Current Overall Score
- Previous Overall Score
- Current Cranfield Score
- Previous Cranfield Score
- Current APNews Score
- Previous APNews Score
- Timestamp When the Submission Was Last Updated
- Total Number of Submissions the Alias Has Made
I ignored the three "Previous" columns (shown in italics on the leaderboard) and based the following analysis on the remaining items. There were 181 submissions at the deadline, 174 of which beat the baseline score of 0.3702 set by the course staff. I removed the seven submissions that didn't beat the baseline from further analysis.
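The filtering step itself is a one-liner. Here is a sketch assuming the leaderboard is loaded into a pandas DataFrame; the column name `overall` and the example values are hypothetical:

```python
import pandas as pd

# Hypothetical leaderboard with one row per submission alias; only the
# current overall score matters for the baseline filter.
df = pd.DataFrame({"overall": [0.4292, 0.3881, 0.3702, 0.3650, 0.3819]})

BASELINE = 0.3702  # baseline overall score set by the course staff

# Keep only submissions that strictly beat the baseline (a score exactly
# equal to the baseline does not beat it).
beat = df[df["overall"] > BASELINE].reset_index(drop=True)
print(len(beat))  # 3 of the 5 example rows beat the baseline
```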
First, I plotted and analyzed the distribution of overall scores.
```
count    174.000000
mean       0.384471
std        0.007665
min        0.371100
25%        0.381700
50%        0.381900
75%        0.389275
max        0.429200
```
Many of the overall scores were between 0.38 (rank 139) and 0.39 (rank 38). The threshold to receive extra credit (rank 50) was 0.3881.
Next, I plotted and analyzed the distribution of Cranfield and APNews scores.
Cranfield:

```
count    174.000000
mean       0.346355
std        0.011011
min        0.307900
25%        0.344475
50%        0.351550
75%        0.351900
max        0.374200
```

APNews:

```
count    174.000000
mean       0.400815
std        0.012757
min        0.382700
25%        0.394500
50%        0.395200
75%        0.406425
max        0.452800
```
The histogram for the Cranfield score distribution seems to have a long left tail and the histogram for the APNews score distribution seems to have a long right tail. Since the overall score was determined by a weighted arithmetic mean, I suspected that there was a stronger positive correlation between the rank and APNews score than between the rank and Cranfield score. So I tested this hypothesis.
Cranfield score vs. rank:

```
slope = 5.9780241349850844e-05
intercept = 0.3415477853061738
Mean Squared Error: 0.00011272078043399971
```

APNews score vs. rank:

```
slope = -0.00023467705782489998
intercept = 0.4196848661840701
Mean Squared Error: 4.1211785714069953e-05
```
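Fits like the ones above can be reproduced with an ordinary least-squares line. A minimal sketch using numpy on toy data (the real analysis's exact tooling may differ):

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line y ≈ slope * x + intercept, plus the mean squared error."""
    slope, intercept = np.polyfit(x, y, 1)
    mse = np.mean((np.polyval([slope, intercept], x) - y) ** 2)
    return slope, intercept, mse

# Toy data: rank 1..5 versus a score that falls by 0.01 per rank.
toy_ranks = np.array([1, 2, 3, 4, 5], dtype=float)
toy_scores = np.array([0.45, 0.44, 0.43, 0.42, 0.41])
slope, intercept, mse = fit_line(toy_ranks, toy_scores)
print(slope, intercept, mse)  # slope ≈ -0.01, intercept ≈ 0.46, mse ≈ 0
```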
Looking at the linear regressions, the correlation between rank and APNews score looks as expected: as rank worsens (and the overall score decreases), the APNews score decreases. However, there was actually a slight opposite correlation between rank and Cranfield score. It seems the dominant strategy for climbing the rankings in this competition was to sacrifice the Cranfield score in pursuit of a higher APNews score. Because the APNews score is weighted more than twice as heavily as the Cranfield score, even a modest increase in the APNews score would usually produce a higher overall score, and therefore a better rank.
Now, what if the overall score formula were fundamentally changed? What if it were based on a geometric mean? How would the rankings change? I used the formula
Overall Score = sqrt(Cranfield Score * APNews Score) to recalculate each submission’s overall score and plotted the original ranks versus the new ranks.
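The recalculation is straightforward. A sketch with three made-up submissions, showing how a submission that leans heavily on its APNews score can drop under the geometric mean:

```python
import numpy as np

# Hypothetical per-submission scores (not real leaderboard data).
cranfield = np.array([0.35, 0.31, 0.37])
apnews = np.array([0.41, 0.45, 0.39])

arith = 0.3 * cranfield + 0.7 * apnews  # the competition's formula
geom = np.sqrt(cranfield * apnews)      # the proposed alternative

def rank_of(scores):
    """Rank 1 = best (highest score)."""
    order = np.argsort(-scores)
    r = np.empty_like(order)
    r[order] = np.arange(1, len(scores) + 1)
    return r

print(rank_of(arith))  # [2 1 3]: the APNews-heavy submission leads
print(rank_of(geom))   # [2 3 1]: under the geometric mean it falls to last
```

The second submission wins under the weighted arithmetic mean purely on its APNews score, but the geometric mean penalizes its lopsided Cranfield score enough to drop it to last place.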
It looks like some submissions would be shuffled around. In particular, some top-10 submissions would drop to around 50th place, many submissions at ranks 15-35 would fall out of the top 50, and some submissions ranked around 130-150 would actually make the top 50 and earn extra credit!
Looking at the data, I see many results where the APNews score is substantially larger than the Cranfield score. The weighted arithmetic mean used in the competition rewards this; a formula based on a geometric mean would punish it. Therefore, I suggest that the course staff change the formula for future search engine competitions to deter overfitting to any one dataset.
Finally, out of curiosity, I plotted some miscellaneous scatterplots involving submission time (the number of hours before the second deadline that the current submission was made) and the number of submissions the alias had made, including the current one.
I considered March 28, 2017, 11:59pm to be the second deadline. For the two submissions that came after the deadline, I counted them as submitted exactly zero hours before the deadline.
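Converting timestamps into hours-before-deadline needs only the standard library. A sketch, with the clamping rule described above (the example timestamps are made up):

```python
from datetime import datetime

DEADLINE = datetime(2017, 3, 28, 23, 59)  # the second (competition) deadline

def hours_before_deadline(timestamp):
    """Hours between a submission time and the deadline, clamped at 0
    for submissions that arrived after the deadline."""
    delta = DEADLINE - timestamp
    return max(delta.total_seconds() / 3600.0, 0.0)

print(hours_before_deadline(datetime(2017, 3, 12, 23, 59)))  # 384.0
print(hours_before_deadline(datetime(2017, 3, 29, 6, 0)))    # 0.0 (late)
```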
```
count    174.000000
mean     290.994411
std      148.249640
min        0.000000
25%      196.077014
50%      347.166667
75%      389.088958
max      607.480556
```
There is a marked spike at around 390 hours before the second deadline, which roughly corresponds to March 12, the first deadline.
Overall score vs. hours before the deadline:

```
slope = -1.046712681622008e-05
intercept = 0.3875171397697222
Mean Squared Error: 5.601417215565846e-05
```
Submitting one's final submission later (fewer hours before the deadline) is correlated with a higher overall score. That makes sense: the later you make your last submission, the more time you've had to improve your retrieval function.
```
count    174.000000
mean      12.454023
std       13.550405
min        1.000000
25%        5.000000
50%        9.000000
75%       15.000000
max      113.000000
```
An overwhelming majority of people made between 1 and 20 submissions. And someone had the time to make 113 (!) submissions.
Overall score vs. number of submissions:

```
slope = 0.00019868231866322257
intercept = 0.3819968702037748
Mean Squared Error: 5.120182994209354e-05
```
Making more submissions is very slightly correlated with a higher overall score.
This analysis surfaced two big insights.
- The first insight is that a submission can still rank highly while having a low Cranfield score because of the weighted arithmetic mean used in the scoring formula. Many of the top 50 submissions overfit to the APNews dataset, the more heavily weighted of the two, usually at the expense of the Cranfield score. Such submissions would fall in the rankings if the overall score formula penalized this strategy in some way (e.g., by using a geometric mean).
- The second insight is that many people simply wanted to beat the baseline and be done with the assignment. Most people didn’t have many submissions, and according to the submission times histogram, there was a cluster of submissions around the first deadline.
I have provided the CSV dump of the results on Pastebin so you can analyze the data in your own time. I found this assignment engaging, and I appreciated the incentive to go above and beyond for extra credit. Props to Professor Zhai and the CS 410 course staff!
I hope you all have enjoyed this analysis! I will try my best to bring you guys more content in coming months.