Although many production systems still use a sliding window as described in this answer, the field of computer vision is moving quickly. Recent advances in this field include R-CNN and YOLO.
Detecting object matches in an image, when you already have an object classifier trained, is usually a matter of brute-force scanning through image patches.
Start with the largest expected patch size. E.g. if your image is 1024 x 768 but is always a distance shot of a road, perhaps you do not expect any car to take up more than 80 x 80 pixels in the image. So take an 80 x 80 block of pixels from one corner of the image, and ask your classifier what the chance is that there is a car in that corner. Then take the next patch - perhaps moving by 20 pixels.
Repeat for all possible positions, and decide which patches are most likely to contain cars.
Next, go down a block size (maybe 60 x 60, moving 15 pixels at a time) and repeat the same exercise. Continue until you reach the smallest block size expected for your goal.
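The multi-scale scan described above could be sketched roughly like this - note the classifier interface, the size list and the step fraction are all assumptions for illustration, not a fixed recipe:

```python
import numpy as np

def sliding_window_scan(image, classifier, sizes=(80, 60, 40), step_frac=0.25):
    """Scan the image at several square patch sizes, largest first.

    `classifier` is assumed to take a pixel block and return the
    probability that it contains a car (hypothetical interface).
    Returns a list of (x, y, size, probability) candidates.
    """
    h, w = image.shape[:2]
    candidates = []
    for size in sizes:                         # largest expected size first
        step = max(1, int(size * step_frac))   # e.g. 80 px block -> 20 px step
        for y in range(0, h - size + 1, step):
            for x in range(0, w - size + 1, step):
                patch = image[y:y + size, x:x + size]
                candidates.append((x, y, size, classifier(patch)))
    return candidates

# Toy demo: a stand-in "classifier" that scores a patch by mean brightness.
demo = np.zeros((768, 1024), dtype=float)
demo[100:180, 200:280] = 1.0                   # a bright 80 x 80 "car"
hits = sliding_window_scan(demo, lambda patch: patch.mean())
best = max(hits, key=lambda c: c[3])           # highest-scoring patch
```

In practice the inner loop dominates the cost, which is why step size matters so much: halving the step quadruples the number of classifier calls per scale.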
Eventually you will have a list of areas within the image, with the probability that each contains a car.
Overlapping blocks that both have high probability are most likely the same car, so the logic needs thresholds for merging blocks - usually keeping the block with the highest probability score in the overlapped area - and declaring there is only one car in that area. This step is commonly known as non-maximum suppression.
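A greedy merge along those lines could look like the following sketch - the probability and overlap thresholds here are made-up defaults you would tune for your data:

```python
def merge_overlapping(boxes, min_prob=0.5, overlap_thresh=0.3):
    """Greedy non-maximum suppression over (x, y, size, prob) boxes.

    Keeps the highest-scoring box, then drops any remaining box whose
    intersection-over-union with a kept box exceeds `overlap_thresh`.
    Both thresholds are illustrative assumptions.
    """
    def iou(a, b):
        ax, ay, asz, _ = a
        bx, by, bsz, _ = b
        x1, y1 = max(ax, bx), max(ay, by)
        x2 = min(ax + asz, bx + bsz)
        y2 = min(ay + asz, by + bsz)
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        union = asz * asz + bsz * bsz - inter
        return inter / union

    kept = []
    for box in sorted((b for b in boxes if b[3] >= min_prob),
                      key=lambda b: b[3], reverse=True):
        if all(iou(box, k) <= overlap_thresh for k in kept):
            kept.append(box)
    return kept

# Two heavily overlapping detections collapse to the stronger one;
# the distant box survives; the low-probability box is filtered out.
result = merge_overlapping([(0, 0, 80, 0.9), (10, 10, 80, 0.8),
                            (300, 300, 60, 0.7), (0, 0, 80, 0.2)])
```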
As usual with ML approaches, you will need to experiment to find the right meta-parameters - in this case block sizes, step sizes, and the rules for merging/splitting areas - in order to get the most accurate results.