Humans are highly accurate at visually understanding natural scenes. By extracting and extrapolating data, we reach the highest stage of scene understanding. In the past few years, scene understanding has proved to be an essential part of computer vision applications. It goes further than object detection by bringing machine perception closer to human perception: it integrates meaningful information and extracts semantic relationships and patterns. Researchers in computer vision have focused on scene understanding algorithms, with the aim of obtaining semantic knowledge from the environment and determining the properties of objects and the relations between them. For applications in robotics, gaming, assisted living, augmented reality, etc., a fundamental task is to be aware of spatial position and to capture depth information. The first part of this paper focuses on deep learning solutions for scene recognition, following two main directions: low-level features and object detection. In the second part, we present in detail the most relevant datasets for visual scene understanding. We consider both directions with future applications in mind.