• Mutual context model of activity, object, and pose [figure]
– A: activity, e.g. tennis forehand, croquet shot, volleyball smash
– O: object, e.g. tennis racket, croquet mallet, volleyball
– H: human pose, with body parts P1 … PN; each part P has location lP, orientation θP, scale sP
– f: shape context features [Belongie et al., 2002]
• Intra-class variations: more than one H for each A; H unobserved during training, only image evidence
Yao & Fei-Fei CVPR 2010
• Build action models from web search results Ikizler-Cinbis, Cinbis, Sclaroff ICCV 2009
• Find repeated poses in a dataset Wang, Jiang, Drew, Li, Mori CVPR 2006
• Person location given • Classify into one of 9 categories: riding horse, reading book, taking photo, riding bike, playing instrument, running, phoning, using computer, walking
• Pose as representation for action recognition – Captures much information about action – Invariance to clothing / lighting effects – Model and exemplar based representations • New direction: action recognition from still images – Image retrieval and analysis – An important cue for video-based action recognition – Pose seems essential
• Describe low-level components – Actions of individual people – Movement of pixels • Identify key objects or locations in scene – Buildings, roads, etc. • Model interactions between people, objects, and locations
• Detect and track moving objects • Manually identify key regions in scene – E.g. road, checkpoint • Scenarios describe relative arrangements of objects in scene – E.g. proximity of car to checkpoint – Notions of scene context Medioni, Cohen, Bremond, Hongeng, Nevatia PAMI 2001
• Detect and track players, ball • Low-level action detectors for individual players • Hand-constructed Bayes net for each activity – Spatial and temporal relations between low-level actions Intille & Bobick CVPR 1999
• Global, frame-level feature – Bag-of-words representation • Detect unusual events by clustering – Isolated, varied clusters are unusual Zhong, Shi & Visontai CVPR 2004
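A minimal sketch of this idea (not the authors' code, and on synthetic data): represent each frame as a bag-of-words histogram, cluster the histograms, and flag frames that land in small, isolated clusters as unusual. The vocabulary size, cluster count, and size threshold below are all illustrative assumptions.

```python
# Sketch, in the spirit of Zhong, Shi & Visontai (2004): cluster per-frame
# bag-of-words histograms; frames in small clusters are candidate unusual events.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_words = 50                                       # hypothetical vocabulary size
# 200 "usual" frames: visual words drawn roughly uniformly.
usual = rng.multinomial(100, np.full(n_words, 1.0 / n_words), size=200)
# 5 "unusual" frames: almost all mass concentrated on two rare words.
odd_dist = np.r_[np.full(2, 0.45), np.full(n_words - 2, 0.1 / (n_words - 2))]
unusual = rng.multinomial(100, odd_dist, size=5)

hists = np.vstack([usual, unusual]).astype(float)
hists /= hists.sum(axis=1, keepdims=True)          # L1-normalize histograms

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(hists)
sizes = np.bincount(km.labels_, minlength=4)

# Frames belonging to clusters below a size threshold are flagged as unusual.
min_size = 10
unusual_frames = np.flatnonzero(sizes[km.labels_] < min_size)
```

The real system clusters features from video segments rather than synthetic draws, but the flagging rule — small, isolated clusters are unusual — is the same.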
• Cheating detection in simulated card game • Real-world highway dataset – Cars pulling off road, backing up, U-turns
• Describe moving pixels by location and motion direction – No object detection • Use as visual words in Latent Dirichlet Allocation (LDA) type model – Infer low-level actions from words Wang, Ma, Grimson PAMI 2009 Blei, Ng, Jordan JMLR 2003
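A small sketch of the word-document analogy, loosely following this line of work: each clip is a "document", each quantized (location, motion-direction) cell is a "word", and LDA topics then play the role of low-level activities. The vocabulary layout and the two synthetic "traffic flow" activities below are assumptions for illustration, not the authors' setup.

```python
# Sketch: LDA over quantized motion words, in the spirit of
# Wang, Ma & Grimson (PAMI 2009), using Blei et al.'s LDA model.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(1)
n_words = 40                       # hypothetical vocabulary: 10 cells x 4 directions
# Two synthetic activities: traffic flowing in two different directions,
# i.e. two disjoint groups of motion words.
flow_a = np.r_[np.full(20, 0.05), np.zeros(20)]
flow_b = np.r_[np.zeros(20), np.full(20, 0.05)]
docs = np.vstack([rng.multinomial(200, flow_a, size=30),
                  rng.multinomial(200, flow_b, size=30)])

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(docs)
theta = lda.transform(docs)        # per-clip mixture over latent activities
```

With real video the topics are learned from co-occurring motion words, with no object detection required; `theta` then gives each clip's distribution over the discovered low-level actions.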
• Higher-level activity analysis – Distribution of low-level actions over entire scene • Applications – Temporal segmentation by activity – Abnormality detection
• Hierarchical Dirichlet Process model – Learn number of activities automatically Kuettel, Breitenstein, van Gool & Ferrari CVPR 2010
• Traffic-light-controlled scene • Continuous video • Annotated with states and history • 3x speed
Loy, Xiang & Gong CVPR, ICCV 2009
• Consider time-delayed correlations between regions – Applications to irregularity detection
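The core idea can be sketched in a few lines: activity in one region (say, a light turning green) predicts activity in another region some frames later, so we search over lags for the shift that maximizes correlation. The signals and the 12-frame delay below are synthetic assumptions; only the lag-search is the point.

```python
# Sketch of time-delayed correlation between two scene regions
# (cf. Loy, Xiang & Gong): find the lag at which region A best predicts region B.
import numpy as np

rng = np.random.default_rng(2)
lag_true = 12                                  # hypothetical delay in frames
a = rng.random(500)                            # activity level of region A over time
b = np.r_[np.zeros(lag_true), a[:-lag_true]] + 0.05 * rng.random(500)

def best_lag(x, y, max_lag=30):
    """Return the lag (in frames) maximizing the correlation of x with shifted y."""
    lags = range(1, max_lag + 1)
    scores = [np.corrcoef(x[:-k], y[k:])[0, 1] for k in lags]
    return max(zip(scores, lags))[1]

print(best_lag(a, b))  # should recover the 12-frame delay
```

Deviations from the expected delay (or an unusually low peak correlation) can then serve as an irregularity cue.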
• Hierarchical model [figure]: scene-level activity class y, per-person action classes h1, h2, …, hn, image evidence x0, x1, x2, …, xn Choi, Shahid, & Savarese VS 2009; Lan, Wang, Yang, & Mori SGA 2010, NIPS 2010
• Captioned baseball videos in training • Build AND-OR graph representation of activities – AND specifies elements of an activity that must occur – OR allows variation in how an element appears • Describe low-level tracks using STIPs • Match tracks to actions in AND-OR graph Gupta, Srinivasan, Shi, Davis CVPR 2009
• Scene modeling to look at the big picture • Feature representations – Holistic: describe entire scene, irrespective of individuals – Local: describe actions of individuals • Structure of activities – Model free: clustering-type approaches – Strong models: grammars, probabilistic models
• Objects: cars, glasses, people, etc. • Actions: drinking, running, door exit, car enter • Scene categories: indoors, outdoors, street scene, etc. • Geometry: street, wall, field, stair, etc. • Constraints between them
• Standardization of datasets for field – Allow comparison of algorithms • E.g. KTH for low-level features, atomic actions – Fair tuning of model parameters • New algorithms compare to baselines – Bag-of-words on densely sampled STIPs – Pose estimation (Ferrari et al. code) – HOG SVM (Dalal & Triggs code, Ramanan code)
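To make the HOG SVM baseline concrete, here is a heavily simplified sketch: per-cell gradient-orientation histograms (not the full blocked, normalized descriptor of Dalal & Triggs) fed to a linear SVM. The stripe images, cell size, and bin count are illustrative assumptions, not any released baseline code.

```python
# Simplified HOG-style descriptor + linear SVM, in the spirit of the
# Dalal & Triggs baseline. Synthetic data: horizontal vs vertical stripes.
import numpy as np
from sklearn.svm import LinearSVC

def grad_orientation_hist(img, n_bins=9, cell=8):
    """Per-cell histograms of gradient orientation, concatenated (simplified HOG)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation in [0, pi)
    feats = []
    for i in range(0, img.shape[0] - cell + 1, cell):
        for j in range(0, img.shape[1] - cell + 1, cell):
            bins = (ang[i:i+cell, j:j+cell] / np.pi * n_bins).astype(int)
            bins = bins.clip(0, n_bins - 1)
            hist = np.bincount(bins.ravel(),
                               weights=mag[i:i+cell, j:j+cell].ravel(),
                               minlength=n_bins)
            feats.append(hist / (np.linalg.norm(hist) + 1e-6))
    return np.concatenate(feats)

rng = np.random.default_rng(3)
def sample(vertical):
    """32x32 striped image plus noise; stripe direction defines the class."""
    stripes = (np.arange(32) // 4 % 2).astype(float)
    img = np.empty((32, 32))
    img[:] = stripes[None, :] if vertical else stripes[:, None]
    return img + 0.1 * rng.random((32, 32))

X = np.array([grad_orientation_hist(sample(v)) for v in [0, 1] * 40])
y = np.array([0, 1] * 40)
clf = LinearSVC(C=1.0).fit(X[:60], y[:60])
acc = clf.score(X[60:], y[60:])
```

The two classes have distinct dominant orientations, so even this stripped-down descriptor separates them; the full detector adds block normalization and a sliding-window scan.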
• Standardization of datasets for field – Don't feel constrained by the existing problem definitions – Do make your new dataset available • Should clearly specify separate training and test sets • New algorithms compare to baselines – Do use reasonable variant of standard baselines for your new problem
• Even atomic low-level actions are very difficult to detect reliably – Far more work needed on representations for the action of a single person – Features – Temporal representation, smoothing – Tracking – …
1. Cameras and bandwidth are cheap 2. Lots of training data is potentially available • Potential for huge progress … if we can get the data
| Aligned with video | Describes visual content | Source
Subtitles | Yes | No | DVD, Internet
Scripts for TV series, movies and sport games | No | Yes | Internet, e.g. www.dailyscript.com
Plot summaries and synopses (e.g. IMDB) | No | Yes, sparsely | Internet
Instruction videos | No | Yes | Internet, e.g. www.videojug.com
Descriptive Video Service | Yes | Yes | DVD, rare
Word tags | No | Yes, sparsely | Internet (e.g. YouTube)
Manual labelling | ?? | ?? | Mechanical Turk, Human Computation (ESP Game), grad/undergrad students
Open questions:
• How to benefit from the structure of the human body in complex situations, e.g. heavy occlusions, uniformly colored clothing?
• Will action classification generalize over different video domains: movies, TV, YouTube, surveillance video?
• What is the useful action vocabulary? Are we trying to solve the right problem?
• How can we visualize/display the results?
Interesting novel directions:
• Use actions for recognizing functional and physical object properties, e.g. "sitable", "eatable", "heavy", "solid" objects…
• Action prediction, i.e. what can happen in the given situation: e.g. is it dangerous to cross this road?
• Explore more sources of strong and weak supervision: manual surveillance, Descriptive Video Service (DVS), YouTube tags; transcripts of sports games; instruction videos.
• P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. In Proc. 9th Int. Conf. Computer Vision, pages 734–741, 2003. • N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2005. • Bo Wu and Ram Nevatia. Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors. In Proc. 10th Int. Conf. Computer Vision, 2005. • Pedro Felzenszwalb, David McAllester, and Deva Ramanan. A discriminatively trained, multiscale, deformable part model. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008. • Chris Stauffer and W.E.L. Grimson. Adaptive background mixture models for real-time tracking. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 1999. • Kentaro Toyama, John Krumm, Barry Brumitt, and Brian Meyers. Wallflower: Principles and practice of background maintenance. In Proc. 7th Int. Conf. Computer Vision, 1999. • J.L. Barron, D.J. Fleet, and S.S. Beauchemin. Performance of optical flow techniques. Int. Journal of Computer Vision, 12(1):43–77, 1994. • T. Brox, C. Bregler, and J. Malik. Large displacement optical flow. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2009. • M. Isard and A. Blake. Condensation – conditional density propagation for visual tracking. Int. Journal of Computer Vision, 29(1):5–28, 1998. • Yuan Li, Chang Huang, and Ram Nevatia. Learning to associate: Hybridboosted multi-target tracker for crowded scene. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2009.
• W. T. Freeman, K. Tanaka, J. Ohta, and K. Kyuma. Computer vision for computer games. In IEEE 2nd Intl. Conf. on Automatic Face and Gesture Recognition, 1996. • J. Sullivan and S. Carlsson. Recognizing and tracking human action. In ECCV 2002. • A.A. Efros, A.C. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In ICCV 2003. • A. Bobick and J. Davis. The recognition of human movement using temporal templates. IEEE Trans. PAMI, 23(3):257–267, 2001. • L. Zelnik-Manor and M. Irani. Event-based video analysis. In CVPR 2001. • E. Shechtman and M. Irani. Space-time behavior based correlation. In CVPR 2005. • O. Boiman and M. Irani. Detecting irregularities in images and in video. In Proc. ICCV, 2005. • M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In Proc. ICCV, 2005. • Y. Ke, R. Sukthankar, and M. Hebert. Efficient visual event detection using volumetric features. In Proc. ICCV 2005. • Y. Ke, R. Sukthankar, and M. Hebert. Event detection in crowded videos. In Proc. ICCV, 2007. • I. Laptev and P. Pérez. Retrieving actions in movies. In Proc. ICCV 2007. • D. Weinland and E. Boyer. Action recognition using exemplar-based embedding. In Proc. CVPR, 2008. • Z. Lin, Z. Jiang, and L. S. Davis. Recognizing actions by shape-motion prototype trees. In Proc. ICCV, 2009.
• I. Laptev and T. Lindeberg. Space-time interest points. In Proc. ICCV 2003. • C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In Proc. ICPR, 2004. • P. Dollar, V. Rabaud, G. Cottrell, and Serge Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, 2005. • H. Jhuang, T. Serre, L. Wolf and T. Poggio. A biologically inspired system for action recognition. In Proc. ICCV 2007. • P. Scovanner, S. Ali, and M. Shah. A 3-dimensional SIFT descriptor and its application to action recognition. ACM MM 2007. • J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. In IJCV 2008. • I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Proc. CVPR 2008. • A. Klaeser, M. Marszałek and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In Proc. BMVC 2008. • G. Willems, T. Tuytelaars and L. Van Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In Proc. ECCV 2008. • H. Wang, M. M. Ullah, A. Kläser, I. Laptev and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In Proc. BMVC 2009. • L. Yeffet and L. Wolf. Local trinary patterns for human action recognition. In Proc. ICCV 2009. • A. Gilbert, J. Illingworth, R. Bowden. Fast realistic multi-action recognition using mined dense spatio-temporal features. In Proc. ICCV 2009. • P. Matikainen, M. Hebert, R. Sukthankar. Trajectons: Action recognition through the motion analysis of tracked features. ICCV Workshop on Video-oriented Object and Event Classification, 2009. • M. M. Ullah, S. N. Parizi, I. Laptev. Improving bag-of-features action recognition with non-local cues. In Proc. BMVC 2010.
• Y. Song, L. Goncalves, and P. Perona. Unsupervised learning of human motion. IEEE Trans. PAMI, 25(7):814–827, 2003. • D. Ramanan and D. A. Forsyth. Automatic annotation of everyday movements. In Advances in Neural Information Processing Systems 16, 2003. • V. Ferrari, M. Marin, and A. Zisserman. Pose search: retrieving people using their pose. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2009. • Yang Wang, Hao Jiang, Mark S. Drew, Ze-Nian Li, and Greg Mori. Unsupervised discovery of action classes. In CVPR, 2006. • Nazli Ikizler-Cinbis, R. Gokberk Cinbis, and Stan Sclaroff. Learning actions from the web. In IEEE International Conference on Computer Vision, 2009. • Weilong Yang, Yang Wang, and Greg Mori. Recognizing human actions from still images with latent poses. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2010. • Bangpeng Yao and Li Fei-Fei. Modeling mutual context of object and human pose in human-object interaction activities. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2010.
• R. Polana and R.C. Nelson. Detection and recognition of periodic, nonrigid motion. In IJCV 1997. • S.M. Seitz and C.R. Dyer. View invariant analysis of cyclic motion. In IJCV 1997. • A. Thangali and S. Sclaroff. Periodic motion detection and estimation via space-time sampling. In IEEE Workshop on Motion and Video Computing, 2005. • I. Laptev, S.J. Belongie, P. Pérez and J. Wills. Periodic motion detection and segmentation via approximate sequence alignment. In Proc. ICCV 2005. • P. Wang, G.D. Abowd and J.M. Rehg. Quasi-periodic event analysis for social game retrieval. In Proc. ICCV 2009. • D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views using 3D exemplars. In Proc. ICCV 2007. • A. Farhadi and M. Tabrizi. Learning to recognize activities from the wrong view point. In Proc. ECCV 2008. • I. Junejo, E. Dexter, I. Laptev and Patrick Pérez. Cross-view action recognition from temporal self-similarities. In Proc. ECCV 2008. • A. Farhadi, M. Kamali, I. Endres, D. Forsyth. A latent model of discriminative aspect. In Proc. ICCV 2009.
• X. Wang, X. Ma, and E. Grimson. Unsupervised activity perception in crowded and complicated scenes using hierarchical bayesian models. IEEE Trans. PAMI, 31(3):539–555, 2009. • Abhinav Gupta, Praveen Srinivasan, Jianbo Shi, and Larry S. Davis. Understanding videos, constructing plots: learning a visually grounded storyline model from annotated videos. In CVPR, 2009. • T. Xiang and S. Gong. Beyond tracking: Modelling activity and understanding behaviour. Int. Journal of Computer Vision, 67(1):21–51, 2006. • G. Medioni, I. Cohen, F. Brémond, S. Hongeng, and R. Nevatia. Event detection and analysis from video streams. IEEE Trans. PAMI, 23(8):873–889, 2001. • Y. A. Ivanov and A. F. Bobick. Recognition of visual activities and interactions by stochastic parsing. IEEE Trans. PAMI, 22(8):852–872, 2000. • D. Moore and I. Essa. Recognizing multitasked activities using stochastic context-free grammar using video. In AAAI, 2002. • Chen Change Loy, Tao Xiang, and Shaogang Gong. Modelling activity global temporal dependencies using time delayed probabilistic graphical model. In ICCV, 2009. • Xiaogang Wang, Keng Teck Ma, Gee Wah Ng, and W. Eric L. Grimson. Trajectory analysis and semantic region modeling using a nonparametric bayesian model. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., 2008. • W. Choi, K. Shahid, and S. Savarese. "What are they doing?: Collective activity classification using spatio-temporal relationship among people". In 9th International Workshop on Visual Surveillance, 2009. • Ramin Mehran, Alexis Oyama, and Mubarak Shah. Abnormal crowd behavior detection using social force model. In CVPR, 2009.