UCB Visual Object and Activity Recognition, Class CS 294-43
Prof. Trevor Darrell, trevor@eecs.berkeley.edu, Spring 2011
This course will cover computer vision techniques for object and category recognition, as well as recognition of human activity from video streams. Recognition of individual objects or activities (the coffee cup on your desk, a particular chair in your office, a video of you riding your bike) or generic categories (any cup, chair, or cycling event) is an essential capability for a variety of robotics and multimedia applications. The advent of standardized datasets and evaluation regimes has spurred considerable innovation in this arena, with performance on benchmark evaluations increasing dramatically in recent years. This course will review methods that have achieved success on such datasets, and will also consider the techniques needed for real-time interactive application on robots or mobile devices, e.g. domestic service robots or mobile phones that can retrieve information about objects in the environment based on visual observation. This class will be based exclusively on readings from the recent literature, including those appearing at the CVPR, ICCV, and NIPS conferences.
The format of the course this year will primarily be discussion based, with each class beginning with a short overview of the topic by the instructor followed by detailed student-led presentations and structured critique of selected papers. All students will be expected to actively discuss each paper each week. Class size will be limited to those who have preregistered, or to 16 students, whichever is greater, to foster an environment conducive to discussion.
Each week will focus on a different subtopic of object and activity recognition, covering three to five papers from the recent literature. These papers will be presented jointly by two or three students, one acting as primary presenter and the other(s) as discussant(s). Each student will be expected to act as presenter once and as discussant once during the term. The presenting students will choose the papers from the list suggested for that subtopic, or they are welcome to suggest other papers.
Students are expected to be involved in a related research project during the term and to be experimenting with a technique covered in the course. (Graduate students who are not actively involved in a research project outside of the course can work on a class project specific to this course or joint with another course; undergraduates who are not actively involved in a related research project are not allowed in the course.) Students will be expected to present their research progress in a ten-minute presentation in the last class. Grades will be based entirely on in-class presentations and participation.
This course will meet once a week, Fridays 10am-12 noon, in the 7th floor conference room (Newton room) of Sutardja Dai Hall.
The first class will be January 28th. The introductory class, which would have been scheduled for January 21st, will happen virtually; please contact the instructor if you are not already on the email list.
Prerequisites: prior Computer Vision and Machine Learning courses, or permission of instructor. Advanced undergraduates allowed only with permission of instructor and if they are actively participating in a related research project. Students should already be familiar with or be willing to learn on their own: basic image processing in MATLAB; Optic Flow; Edge Detection; Support Vector Machines; Gaussian Mixture Models; Hidden Markov Models, etc.; students must be able to read and understand at a basic level recent conference papers in the computer vision literature.
DRAFT Syllabus (class members, please see the Google site for the most up-to-date version):
January 28, 2011 Global Features
Background readings:
1. A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," International Journal of Computer Vision, vol. 42, no. 3, pp. 145-175, May 2001. http://dx.doi.org/10.1023/A:1011139631724
2. A. A. Efros, A. C. Berg, G. Mori, and J. Malik, "Recognizing action at a distance," ICCV 2003, pp. 726-733 vol.2. http://dx.doi.org/10.1109/ICCV.2003.1238420
3. N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 2005, pp. 886-893.
http://dx.doi.org/10.1109/CVPR.2005.177
Contemporary readings:
4. P. F. Felzenszwalb, R. B. Girshick, and D. McAllester, "Cascade Object Detection with Deformable Part Models", CVPR 2010.
http://dx.doi.org/10.1109/CVPR.2010.5539906
5. T. Deselaers and V. Ferrari, "Global and efficient self-similarity for object classification and detection", CVPR 2010.
http://dx.doi.org/10.1109/CVPR.2010.5539775
February 4, 2011 Local Features
Background readings:
6. D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, November 2004.
http://dx.doi.org/10.1023/B:VISI.0000029664.99615.94
7. T. Lindeberg, "Feature detection with automatic scale selection," International Journal of Computer Vision, vol. 30, no. 2, pp. 79-116, November 1998. http://dx.doi.org/10.1023/A:1008045108935
8. J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide baseline stereo from maximally stable extremal regions," in Proceedings of British Machine Vision Conference, vol. 1, London, 2002, pp. 384-393. http://citeseer.ist.psu.edu/608213.html
9. K. Mikolajczyk and C. Schmid, "Scale & affine invariant interest point detectors," Int. J. Comput. Vision, vol. 60, no. 1, pp. 63-86, October 2004.
http://dx.doi.org/10.1023/B:VISI.0000027790.02288.f2
10. I. Laptev, "On space-time interest points," International Journal of Computer Vision, vol. 64, no. 2-3, pp. 107-123, September 2005. http://dx.doi.org/10.1007/s11263-005-1838-7
Contemporary readings:
11. L. Bo, X. Ren, and D. Fox, "Kernel Descriptors for Visual Recognition", NIPS 2010, http://books.nips.cc/papers/files/nips23/NIPS2010_0821.pdf
12. L. Bourdev, S. Maji, T. Brox, and J. Malik, "Detecting People Using Mutually Consistent Poselet Activations", ECCV 2010,
http://dx.doi.org/10.1007/978-3-642-15567-3_13
February 11, 2011 Bag-of-Words and Correspondence Kernels
Background readings:
13. C. Dance, J. Willamowski, L. Fan, C. Bray, and G. Csurka, "Visual categorization with bags of keypoints," in ECCV International Workshop on Statistical Learning in Computer Vision, 2004. http://www.xrce.xerox.com/Publications/Attachments/2004%2D010/2004_010.pdf
14. K. Grauman and T. Darrell, "The pyramid match kernel: discriminative classification with sets of image features," ICCV, vol. 2, 2005, pp. 1458-1465 Vol. 2. http://dx.doi.org/10.1109/ICCV.2005.239
15. S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," CVPR, vol. 2, 2006, pp. 2169-2178. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1641019
Contemporary readings:
16. S. Maji and A. C. Berg, "Max-margin additive classifiers for detection", ICCV 2009, http://dx.doi.org/10.1109/ICCV.2009.5459203
17. A. Vedaldi and A. Zisserman, "Efficient Additive Kernels via Explicit Feature Maps", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5539949
18. A. Kovashka and K. Grauman, "Learning a hierarchy of discriminative space-time neighborhood features for human action recognition", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5539881
February 18, 2011 Segmentation and Region Proposals
Background readings:
19. J. Shotton, M. Johnson, and R. Cipolla, "Semantic texton forests for image categorization and segmentation," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. http://dx.doi.org/10.1109/CVPR.2008.4587503
Contemporary readings:
20. Y. Yang, S. Hallman, D. Ramanan, and C. Fowlkes, "Layered Object Detection for Multi-Class Segmentation", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5540070
21. F. Li, J. Carreira and C. Sminchisescu, "Object Recognition as Ranking Holistic Figure-Ground Hypotheses", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5539839
22. B. Alexe, T. Deselaers, V. Ferrari, "What is an object?", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5540226
23. B. Packer, S. Gould, and D. Koller, "A Unified Contour-Pixel Model for Figure-Ground Segmentation", ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15555-0_25
24. I. Endres and D. Hoiem, "Category Independent Object Proposals", ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15555-0_42
March 4, 2011 Descriptor Sparse Coding and Topic Models
Background reading:
25. Olshausen B. and Field D. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research (1997) vol. 37 (23) pp. 3311-3325 http://www.chaos.gwdg.de/~michael/CNS_course_2004/papers_max/OlshausenField1997.pdf
Contemporary readings:
26. Raina et al. Self-taught learning: Transfer learning from unlabeled data. ICML (2007). http://dx.doi.org/10.1145/1273496.1273592
27. Fritz M., Black M., Bradski G., Karayev S., Darrell T. An Additive Latent Feature Model for Transparent Object Recognition. NIPS (2009) http://books.nips.cc/papers/files/nips22/NIPS2009_0397.pdf
28. Wang et al. Locality-constrained Linear Coding for Image Classification. CVPR (2010) http://dx.doi.org/10.1109/CVPR.2010.5540018
March 11, 2011 Hashing and Metric Learning
Background readings:
29. G. Shakhnarovich, P. Viola, and T. Darrell, "Fast pose estimation with parameter-sensitive hashing," ICCV 2003, http://dx.doi.org/10.1109/ICCV.2003.1238424
30. A. Frome, Y. Singer, F. Sha, and J. Malik, "Learning Globally-Consistent Local Distance Functions for Shape-Based Image Retrieval and Classification", ICCV 2007, http://dx.doi.org/10.1109/ICCV.2007.4408839
Contemporary readings:
31. P. Jain, B. Kulis, and K. Grauman, "Fast Similarity Search for Learned Metrics", CVPR 2008/PAMI 2009, http://doi.ieeecomputersociety.org/10.1109/TPAMI.2009.151
32. B. Kulis and T. Darrell, "Learning to Hash with Binary Reconstructive Embeddings", NIPS 2009, http://books.nips.cc/papers/files/nips22/NIPS2009_0971.pdf
March 18, 2011 Temporal Models
Background readings:
33. J. Niebles, H. Wang, and L. Fei-Fei, "Unsupervised learning of human action categories using spatial-temporal words," International Journal of Computer Vision. 79(3): 299-318. 2008 Available: http://dx.doi.org/10.1007/s11263-007-0122-4
Contemporary readings:
34. K. Prabhakar, S. Oh, P. Wang, G. D. Abowd, J Rehg, "Temporal Causality for the Analysis of Visual Events", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5539871
35. A. Yao, J. Gall, L. Van Gool, "A Hough Transform-Based Voting Framework for Action Recognition", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5539883
36. J.C. Niebles, C. Chen, and L. Fei-Fei, "Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification", ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15552-9_29
37. D. Weinland, M. Ozuysal, and P. Fua, "Making Action Recognition Robust to Occlusions and Viewpoint Changes", ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15558-1_46
38. P. Matikainen, M. Hebert and R. Sukthankar, "Representing Pairwise Spatial and Temporal Relations for Action Recognition", ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15549-9_37
39. T. Lan, Y. Wang, W. Yang and G. Mori, "Beyond Actions: Discriminative Models for Contextual Group Activities", NIPS 2010, http://books.nips.cc/papers/files/nips23/NIPS2010_0115.pdf
April 1, 2011 Image and Text Models
Background readings:
40. K. Barnard and D. Forsyth, "Learning the Semantics of Words and Pictures," International Conference on Computer Vision, vol 2, pp. 408-415, 2001, http://doi.ieeecomputersociety.org/10.1109/ICCV.2001.937654
41. D. Blei and M. Jordan, "Modeling Annotated Data", SIGIR '03: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, http://dx.doi.org/10.1145/860435.860460
42. T. Berg and D. Forsyth, "Animals on the Web", CVPR 2006, http://dx.doi.org/10.1109/CVPR.2006.57
Contemporary readings:
43. C. Wang, D. Blei, and L. Fei-Fei, "Simultaneous image classification and annotation," CVPR 2009, http://doi.ieeecomputersociety.org/10.1109/CVPRW.2009.5206800
44. K. Saenko and T. Darrell, "Filtering Abstract Senses From Image Search Results", NIPS 2009, http://books.nips.cc/papers/files/nips22/NIPS2009_1143.pdf
45. A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, "Every Picture Tells a Story: Generating Sentences from Images", ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15561-1_2
46. B. Siddiquie and A. Gupta, "Beyond Active Noun Tagging: Modeling Contextual Interactions for Multi-Class Active Learning", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5540044
April 8, 2011 Crowdsourcing and Active Learning
Background readings:
47. L. von Ahn and L. Dabbish, "Labeling images with a computer game", SIGCHI 2004, http://dx.doi.org/10.1145/985692.985733
48. A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell, "Active Learning with Gaussian Processes for Object Categorization" ICCV 2007. http://doi.ieeecomputersociety.org/10.1109/ICCV.2007.4408844
Contemporary readings:
49. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. "ImageNet: A Large-Scale Hierarchical Image Database". In CVPR, 2009. http://doi.ieeecomputersociety.org/10.1109/CVPRW.2009.5206848
50. S. Vijayanarasimhan, P. Jain, K. Grauman, "Far-sighted active learning on a budget for image and video recognition", CVPR 2010. http://dx.doi.org/10.1109/CVPR.2010.5540055
51. P. Welinder, S. Branson, S. Belongie, P. Perona, "The Multidimensional Wisdom of Crowds", NIPS 2010. http://books.nips.cc/papers/files/nips23/NIPS2010_0577.pdf
52. S. Branson, C. Wah, B. Babenko, F. Schroff, P. Welinder, P. Perona, S. Belongie, "Visual Recognition with Humans in the Loop", ECCV 2010. http://dx.doi.org/10.1007/978-3-642-15561-1_32
April 15, 2011 Scene and Image Context
Background readings:
53. A. Torralba, K. P. Murphy, and W. T. Freeman, "Contextual models for object detection using boosted random fields," in Advances in Neural Information Processing Systems 17 (NIPS), 2005, pp. 1401-1408. http://dspace.mit.edu/handle/1721.1/6740
54. D. Hoiem, A. A. Efros, and M. Hebert, "Putting objects in perspective," in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2, 2006, pp. 2137-2144. http://dx.doi.org/10.1109/CVPR.2006.232
55. L.-J. Li and L. Fei-Fei, "What, where and who? classifying events by scene and object recognition," in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, 2007, pp. 1-8. http://dx.doi.org/10.1109/ICCV.2007.4408872
Contemporary readings:
56. S. Bao, M. Sun, S. Savarese, "Toward coherent object detection and scene layout understanding", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5540229
57. B. Yao and L. Fei-Fei. "Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities.", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5540235
58. A. Gupta, A. Efros and M. Hebert, "Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics". ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15561-1_35
April 22, 2011 Taxonomies and Sub-category Recognition
Background readings:
59. A. Zweig and D. Weinshall, "Exploiting object hierarchy: Combining models from different category levels," in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, 2007, pp. 1-8. Available: http://dx.doi.org/10.1109/ICCV.2007.4409064
60. G. Griffin and P. Perona, "Learning and using taxonomies for fast visual categorization," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, 2008, pp. 1-8. Available: http://dx.doi.org/10.1109/CVPR.2008.4587410
61. J. Sivic, B. C. Russell, A. Zisserman, W. T. Freeman, and A. A. Efros, "Unsupervised discovery of visual object class hierarchies," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, 2008, pp. 1-8. Available: http://dx.doi.org/10.1109/CVPR.2008.4587622
Contemporary readings:
62. L.-J. Li, C. Wang, Y. Lim, D. Blei and L. Fei-Fei. "Building and Using a Semantivisual Image Hierarchy", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5540027
63. M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, and B. Schiele, "What helps where – and why? Semantic relatedness for knowledge transfer", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5540121
April 29, 2011 Domain Adaptation
64. K. Saenko, B. Kulis, M. Fritz, and T. Darrell, "Adapting Visual Category Models to New Domains", ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15561-1_16
65. A. Bergamo and L. Torresani, "Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach", NIPS 2010, http://books.nips.cc/papers/files/nips23/NIPS2010_0093.pdf
66. L. Cao, Z. Liu, T. Huang, "Cross-dataset action detection", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5539875