So far this semester, I have been busy with assignments and the occasional midterm exams, and I apologize for the dearth of content in the past seven weeks. However, I have finally had time to write about one recent assignment that I found quite interesting. It was for CS 410, Text Information Systems, taught by Professor ChengXiang Zhai.

The assignment focused on implementing a retrieval function for a search engine using the MeTA toolkit. The retrieval function scores each query-document pair in a dataset and ranks the documents accordingly. A better retrieval function has a higher Normalized Discounted Cumulative Gain, meaning that relevant documents for a range of queries are ranked more highly. There were two parts to this assignment; the first was implementing a basic retrieval function, due on March 12 (later extended to March 16). The second part was a competition in which we tried to improve our retrieval functions to achieve a higher overall score. The top 50 submissions received extra credit, and the competition deadline was March 27 (later extended to March 28).

This competition used two datasets—the Cranfield dataset and the APNews dataset—and scored each submission according to a weighted arithmetic mean: `Overall Score = 0.3 * Cranfield NDCG@10 + 0.7 * APNews NDCG@10`.

“NDCG@10” is shorthand for the Normalized Discounted Cumulative Gain measure at a rank cutoff of 10 retrieved documents. I’ll refer to that as “Score” from now on.
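For reference, NDCG@10 can be sketched in a few lines of Python. This is a minimal sketch of the standard definition; the exact gain and discount conventions used by MeTA's evaluator may differ:

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top k ranked documents."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of an ideal (descending) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance judgments for one query's top-ranked documents
print(ndcg_at_k([3, 2, 3, 0, 1, 2]))
```

A perfect ranking (relevances already in descending order) scores 1.0; swapping relevant documents further down the list lowers the score.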

During the competition, the course staff published a leaderboard with each row having these columns:

- Rank
- Alias
- Current Overall Score
- *Previous Overall Score*
- Current Cranfield Score
- *Previous Cranfield Score*
- Current APNews Score
- *Previous APNews Score*
- Timestamp When the Submission Was Last Updated
- Total Number of Submissions the Alias Has Made

I ignored the items in italics and based the following analysis on the remaining items. There were 181 submissions at the deadline, 174 of which beat the baseline score of 0.3702 set by the course staff. I removed the seven submissions that didn’t beat the baseline from further analysis.

## The Analysis

First, I plotted and analyzed the distribution of overall scores.

```
count 174.000000
mean 0.384471
std 0.007665
min 0.371100
25% 0.381700
50% 0.381900
75% 0.389275
max 0.429200
```

Many of the overall scores were between 0.38 (rank 139) and 0.39 (rank 38). The threshold to receive extra credit (rank 50) was 0.3881.
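Summary statistics like the ones above come from pandas’ `describe()`. Here is a minimal sketch on a synthetic stand-in for the score column (the real values came from the leaderboard data; the numbers below are just a handful of plausible scores):

```python
import pandas as pd

# Synthetic stand-in for the 174 overall scores; the real values
# came from the leaderboard CSV.
scores = pd.Series([0.3711, 0.3817, 0.3819, 0.3881, 0.3893, 0.4292])

# Prints count, mean, std, min, quartiles, and max in one shot.
print(scores.describe())
```

A histogram of the same series is one `scores.plot.hist(bins=30)` away.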

Next, I plotted and analyzed the distribution of Cranfield and APNews scores.

Cranfield scores:

```
count 174.000000
mean 0.346355
std 0.011011
min 0.307900
25% 0.344475
50% 0.351550
75% 0.351900
max 0.374200
```

APNews scores:

```
count 174.000000
mean 0.400815
std 0.012757
min 0.382700
25% 0.394500
50% 0.395200
75% 0.406425
max 0.452800
```

The histogram for the Cranfield score distribution has a long left tail, while the histogram for the APNews score distribution has a long right tail. Since the overall score was determined by a weighted arithmetic mean, I suspected that rank correlated more strongly with the APNews score than with the Cranfield score, so I tested this hypothesis.

Rank vs. Cranfield score:

```
slope = 5.9780241349850844e-05
intercept = 0.3415477853061738
Mean Squared Error: 0.00011272078043399971
```

Rank vs. APNews score:

```
slope = -0.00023467705782489998
intercept = 0.4196848661840701
Mean Squared Error: 4.1211785714069953e-05
```

Looking at the linear regressions, the correlation between rank and APNews score looks as expected: as rank increases (and the overall score decreases), the APNews score decreases. However, there was actually a slight opposite correlation between rank and Cranfield score. It seems that the dominant strategy for climbing the rankings in this competition was to sacrifice the Cranfield score in pursuit of a higher APNews score. Because the APNews score is weighted more than twice as heavily as the Cranfield score, even a modest increase in the APNews score would usually produce a higher overall score, and therefore a better rank.
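Regressions like these can be run with scipy’s `linregress`. Here is a sketch on synthetic data; the slope, intercept, and noise level are invented, chosen only to resemble the rank-vs-APNews relationship:

```python
import numpy as np
from scipy import stats

# Synthetic leaderboard: ranks 1..174 vs. an APNews-like score.
# The true slope, intercept, and noise level here are made up.
rank = np.arange(1, 175)
rng = np.random.default_rng(0)
apnews = 0.42 - 0.00023 * rank + rng.normal(0, 0.006, rank.size)

# Ordinary least-squares fit of score against rank.
fit = stats.linregress(rank, apnews)
mse = np.mean((apnews - (fit.intercept + fit.slope * rank)) ** 2)
print(fit.slope, fit.intercept, mse)
```

With real leaderboard columns in place of the synthetic arrays, this prints the same three quantities reported above.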

Now what if the overall score formula were *fundamentally changed*? What if it were based on a geometric mean? How would the rankings change? I used the formula `Overall Score = sqrt(Cranfield Score * APNews Score)` to recalculate each submission’s overall score and plotted the original ranks versus the new ranks.
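To illustrate how the two formulas rank submissions differently, here is a toy example; the scores and column names are invented, not taken from the leaderboard:

```python
import pandas as pd

# Three hypothetical submissions: one lopsided toward APNews,
# one middling, one balanced.
df = pd.DataFrame({
    "cranfield": [0.310, 0.352, 0.374],
    "apnews":    [0.452, 0.395, 0.383],
})

# Competition formula: weighted arithmetic mean.
df["arith"] = 0.3 * df["cranfield"] + 0.7 * df["apnews"]
# Alternative formula: geometric mean.
df["geom"] = (df["cranfield"] * df["apnews"]) ** 0.5

# Rank 1 = best under each formula.
df["arith_rank"] = df["arith"].rank(ascending=False).astype(int)
df["geom_rank"] = df["geom"].rank(ascending=False).astype(int)
print(df)
```

The lopsided submission wins under the arithmetic mean but loses its lead under the geometric mean, which rewards balanced performance across both datasets.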

It looks like quite a few submissions would be shuffled around. In particular, some top 10 submissions would drop to around 50th place, many submissions ranked 15-35 would fall out of the top 50, and some submissions ranked around 130-150 would actually make the top 50 and earn extra credit!

Looking at the data, I see many results where the APNews score is unusually large relative to the Cranfield score. Under the formula used in the competition, a weighted arithmetic mean, this is rewarded. But under a formula based on a geometric mean, it would be punished. **Therefore, I suggest that the course staff change the formula for future search engine competitions in order to deter overfitting to any one dataset.**

## Miscellaneous Correlations

Finally, out of curiosity, I plotted some miscellaneous scatterplots involving submission time—the number of hours before the second deadline that the *current* submission was made—and the number of submissions each alias had made, including the current one.

I considered March 28, 2017, 11:59pm to be the second deadline. I counted the two submissions that came in after the deadline as having been submitted exactly zero hours before it.
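The hours-before-deadline computation can be sketched like so, clamping late submissions at zero as described above (the helper name is my own):

```python
from datetime import datetime

DEADLINE = datetime(2017, 3, 28, 23, 59)

def hours_before_deadline(submitted_at):
    """Hours between a submission timestamp and the second deadline,
    clamped at zero for submissions that came in late."""
    delta = (DEADLINE - submitted_at).total_seconds() / 3600
    return max(delta, 0.0)

# A submission around the first deadline is roughly 396 hours early;
# a late submission counts as exactly zero hours early.
print(hours_before_deadline(datetime(2017, 3, 12, 12, 0)))
print(hours_before_deadline(datetime(2017, 3, 29, 1, 0)))
```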

Hours before the second deadline:

```
count 174.000000
mean 290.994411
std 148.249640
min 0.000000
25% 196.077014
50% 347.166667
75% 389.088958
max 607.480556
```

There is a marked spike at around 390 hours before the second deadline. This roughly corresponds to March 12, the first deadline.

Overall score vs. hours before the second deadline:

```
slope = -1.046712681622008e-05
intercept = 0.3875171397697222
Mean Squared Error: 5.601417215565846e-05
```

Submitting one’s final submission later—fewer hours before the deadline—is correlated with a higher overall score. That makes sense: the later you make your last submission, the more time you’ve had to improve your retrieval function.

Number of submissions per alias:

```
count 174.000000
mean 12.454023
std 13.550405
min 1.000000
25% 5.000000
50% 9.000000
75% 15.000000
max 113.000000
```

An overwhelming majority of people made between 1 and 20 submissions. And someone had the time to make 113 (!) submissions.

Overall score vs. number of submissions:

```
slope = 0.00019868231866322257
intercept = 0.3819968702037748
Mean Squared Error: 5.120182994209354e-05
```

Making more submissions is very slightly correlated with a higher overall score.

## Conclusion

In this analysis, I found two big insights in the results.

- The first insight is that a submission can still rank highly while having a low Cranfield score because of the nature of the arithmetic mean used in the scoring formula. Many of the top 50 submissions employed a strategy of overfitting to the APNews dataset—the dataset whose score is weighted more heavily—usually at the expense of the Cranfield score. Such submissions would fall in the rankings if the overall score formula penalized this strategy in some way (e.g. a geometric mean).
- The second insight is that many people simply wanted to beat the baseline and be done with the assignment. Most people didn’t have many submissions, and according to the submission times histogram, there was a cluster of submissions around the first deadline.

I have provided the CSV dump of the results on Pastebin so you can analyze the data in your own time. I found this assignment engaging, and I appreciated the incentive to go above and beyond for extra credit. Props to Professor Zhai and the CS 410 course staff!

I hope you all have enjoyed this analysis! I will try my best to bring you guys more content in coming months.