Human visual experience contains not only physical properties such as shapes and objects, but also a rich understanding of others' mental states, including intentions, beliefs, and desires. This ability is known as "Theory of Mind" (ToM) in social and developmental psychology. In my research, I synthesize human psychophysics with a computational model that implements ToM using algorithms and techniques from computer vision, robotics, and artificial intelligence. Intentions are represented by a Hierarchical Task Network, a type of temporal parse graph describing the hierarchical structure of actions and plans. Beliefs are represented as a spatial parse tree characterizing the hierarchical structure of objects, actions, and their physical relations. By observing human actions and the 3D world, the model infers intentions and beliefs by reverse-engineering the decision-making and action-planning processes in a human's mind under a Bayesian framework. This cognitively motivated model illustrates (a) how to achieve human-like explanations and predictions of social scenes; (b) how to recognize flexible human actions with little training; and (c) how to produce semantic representations of a social scene that connect vision with other domain-general processes, including language and commonsense reasoning.
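The Bayesian inference step can be illustrated with a minimal sketch of inverse planning: given observed actions, score each candidate intention by how well a noisily rational agent pursuing that intention would explain them, then normalize. The goals, actions, costs, and rationality parameter `BETA` below are illustrative assumptions for exposition, not the actual representations (Hierarchical Task Networks, spatial parse trees) used in the model.

```python
# Sketch of Bayesian inverse planning: infer a hidden intention from
# observed actions, assuming a noisily rational (Boltzmann) agent.
# All goals, actions, costs, and BETA here are hypothetical examples.

import math

GOALS = ["get_coffee", "get_water"]
OBSERVED = ["walk_to_kitchen", "grab_mug", "pour_coffee"]

# Assumed action costs under each goal (lower = more consistent with goal).
COST = {
    "get_coffee": {"walk_to_kitchen": 1.0, "grab_mug": 1.0, "pour_coffee": 1.0},
    "get_water":  {"walk_to_kitchen": 1.0, "grab_mug": 1.0, "pour_coffee": 5.0},
}

BETA = 1.0  # rationality: higher means choices track costs more strictly


def likelihood(actions, goal):
    """P(actions | goal) under a softmax policy over action costs."""
    p = 1.0
    for a in actions:
        weights = {act: math.exp(-BETA * c) for act, c in COST[goal].items()}
        p *= weights[a] / sum(weights.values())
    return p


def posterior(actions, prior=None):
    """P(goal | actions) ∝ P(actions | goal) * P(goal)."""
    prior = prior or {g: 1.0 / len(GOALS) for g in GOALS}
    scores = {g: likelihood(actions, g) * prior[g] for g in GOALS}
    z = sum(scores.values())
    return {g: s / z for g, s in scores.items()}


post = posterior(OBSERVED)
print(post)  # observing "pour_coffee" shifts belief toward "get_coffee"
```

The Boltzmann policy encodes the "rational agent" assumption at the heart of inverse planning: actions that are costly under a goal are unlikely to be chosen by an agent holding that goal, so observing them lowers that goal's posterior.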