RIA failure analysis process [Buckley+04, p.6]
1. The topic (or pair of topics) for the day was determined, with a leader assigned to each topic on a rotating basis among all participants.
2. Each participant was assigned one of the six standard runs (or systems) to examine, either individually or as a team.
3. Each participant or team spent 60 to 90 minutes investigating how their assigned system did on the assigned topic: how it did absolutely, how it did compared to the other systems, and how its performance could be improved. A template (see Figure 1) was generally filled out to guide both the investigation and the subsequent discussions.
4. All participants assigned to a topic discussed the topic for 20 to 30 minutes, in separate rooms if there were two topics. The failures of each system were discussed, along with any conclusions about the difficulty of the topic itself.
5. The topic leader summarized the results of the discussion in a short report (a template for this was developed by week 3 of the workshop). If two topics were assigned for the day, each leader gave a short presentation on the results to the workshop as a whole.
RIA failure analysis categorisation [Buckley+04, pp.10–12] Categorisation was done by one person:
1. General success – present systems worked well
2. General technical failure (stemming, tokenization), e.g. “Antarctic” vs “Antarctica”
3. All systems emphasize one aspect; missing another required term
4. All systems emphasize one aspect; missing another aspect
5. Some systems emphasize one aspect; some another; need both
6. All systems emphasize one irrelevant aspect; missing point of topic
7. Need outside expansion of “general” term (“Europe”, for example)
8. Need QA query analysis and relationships
9. Systems missed difficult aspect that would need human help
10. Need proximity relationship between two aspects
Example topics cited: “What disasters have occurred in tunnels used for transportation?”, “How much sugar does Cuba export and which countries import it?”, “What are new methods of producing steel?”, “What countries are experiencing an increase in tourism?”
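Category 2 can be made concrete with a minimal check of the “Antarctic vs Antarctica” case. NLTK’s Porter stemmer is used here purely as a stand-in; whether the mismatch actually occurs depends on the specific stemmer and tokenizer a system uses:

```python
from nltk.stem import PorterStemmer  # one common stemmer; RIA systems varied

stem = PorterStemmer().stem
for term in ("Antarctic", "Antarctica"):
    print(term, "->", stem(term.lower()))
# If the two stems differ, a query mentioning "Antarctic" fails to match
# documents that say "Antarctica": a category-2 "general technical failure".
```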
RIA failure analysis conclusions [Buckley+04, p.12]
• The first conclusion is that the root cause of poor performance on any one topic is likely to be the same for all systems.
• The other major conclusion from these category assignments is that if a system could recognize the problem associated with a given topic, then for well over half the topics studied (at least categories 1 through 5), current technology should be able to improve results significantly. This suggests it may be more important for research to discover which current techniques should be applied to which topics than to come up with new techniques.
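A minimal sketch of that “route topics to existing techniques” idea, assuming a hypothetical upstream classifier that predicts a topic’s failure category. The category names follow the slide above; the remedy mapping is an illustrative assumption, not the RIA systems’ actual behaviour:

```python
from enum import IntEnum

class FailureCategory(IntEnum):
    """The ten RIA failure categories from [Buckley+04]."""
    GENERAL_SUCCESS = 1
    TECHNICAL_FAILURE = 2        # stemming / tokenization
    MISSING_REQUIRED_TERM = 3
    MISSING_ASPECT = 4
    NEED_BOTH_ASPECTS = 5
    IRRELEVANT_ASPECT = 6
    NEED_OUTSIDE_EXPANSION = 7
    NEED_QA_ANALYSIS = 8
    NEED_HUMAN_HELP = 9
    NEED_PROXIMITY = 10

# Hypothetical mapping from category to an *existing* technique; the RIA
# conclusion is that categories 1-5 are addressable with current technology
# once the category is known.
REMEDIES = {
    FailureCategory.TECHNICAL_FAILURE: "fix stemming/tokenization",
    FailureCategory.MISSING_REQUIRED_TERM: "upweight the missing required term",
    FailureCategory.MISSING_ASPECT: "expand the query with the missing aspect",
    FailureCategory.NEED_BOTH_ASPECTS: "require both aspects at ranking time",
    FailureCategory.NEED_PROXIMITY: "apply a proximity operator between aspects",
}

def plan(topic: str, predicted: FailureCategory) -> str:
    """Pick an existing technique, given a (hypothetical) predicted category."""
    remedy = REMEDIES.get(predicted, "no automatic remedy known")
    return f"{topic!r} ({predicted.name}): {remedy}"

print(plan("transportation tunnel disasters", FailureCategory.MISSING_ASPECT))
```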
Improvements that don’t add up [Armstrong+09]
Armstrong et al. analysed 106 papers from SIGIR ’98–’08 and CIKM ’04–’08 that used TREC data, and reported:
• Researchers often use weak baselines
• Researchers claim statistically significant improvements, but the results are often not competitive with the best TREC systems
• IR effectiveness has not really improved over a decade!
[Figure: “What we want” vs. “What we’ve got?”]
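The gap between “statistically significant” and “competitive” can be shown with a toy check. All per-topic AP scores below are made up for illustration; `ttest_rel` is SciPy’s paired t-test, a common choice in IR evaluation:

```python
from scipy.stats import ttest_rel  # paired t-test over per-topic scores

# Made-up per-topic average-precision scores for illustration only.
weak_baseline = [0.10, 0.12, 0.08, 0.15, 0.11, 0.09, 0.14, 0.10]
proposed      = [0.14, 0.15, 0.10, 0.19, 0.13, 0.13, 0.18, 0.12]
best_trec_map = 0.32   # MAP of the best TREC run on this (hypothetical) task

t, p = ttest_rel(proposed, weak_baseline)
map_proposed = sum(proposed) / len(proposed)

print(f"significant over the weak baseline? p = {p:.5f}")
print(f"but MAP {map_proposed:.3f} vs best TREC {best_trec_map:.3f}")
# Both statements can hold at once: a significant gain over a weak baseline
# does not imply competitiveness with the best known systems.
```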
“Running on the spot?” [Armstrong+09]
Each line represents a statistically significant improvement over a baseline.
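A toy rendering of the figure’s point, with made-up scores: each paper improves on its own baseline, but because those baselines sit below the prior state of the art, the best-known score barely moves:

```python
# Hypothetical (year, baseline, improved) score triples mimicking the figure.
results = [
    (1999, 0.20, 0.24), (2001, 0.18, 0.22), (2003, 0.21, 0.25),
    (2005, 0.17, 0.23), (2007, 0.19, 0.24),
]

best = 0.0
for year, base, improved in results:
    best = max(best, improved)
    print(f"{year}: +{improved - base:.2f} over its baseline; best so far = {best:.2f}")
# Every paper beats *its* baseline, yet "best so far" stalls:
# running on the spot.
```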