The Synthetic Control method pioneered a class of powerful data-driven techniques for estimating the counterfactual outcome of a treated unit from several unexposed donor units. At its core, the technique fits a linear model on the pre-intervention period that combines donor covariates to yield the control. However, combining spatial information at each time step using time-agnostic weights fails to capture important inter-unit and intra-unit temporal context. We instead propose using local spatiotemporal information from before the onset of the intervention to estimate the counterfactual sequence. To this end, we suggest a Transformer model that leverages particular positional embeddings and a modified decoder attention mask to perform spatiotemporal sequence-to-sequence modeling. Our experiments on synthetic data demonstrate that our method is robust to noise and missing data and that it outperforms the baselines. Further, we demonstrate the interpretability of our model through qualitative evaluations on real-life datasets.
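To make the baseline concrete, the following minimal sketch illustrates the kind of time-agnostic linear combination of donors described above. It is an illustration under stated assumptions, not the method proposed here: it uses unconstrained least squares in place of the constrained fit of the original Synthetic Control method, and the function names (`fit_donor_weights`, `predict_counterfactual`), the toy data, and the constant treatment effect of 2.0 are all hypothetical.

```python
import numpy as np


def fit_donor_weights(donors_pre, treated_pre):
    """Fit one weight per donor on the pre-intervention period.

    donors_pre:  array of shape (T_pre, n_donors)
    treated_pre: array of shape (T_pre,)
    Assumption: plain least squares, with no non-negativity or
    sum-to-one constraint on the weights.
    """
    weights, *_ = np.linalg.lstsq(donors_pre, treated_pre, rcond=None)
    return weights


def predict_counterfactual(donors_post, weights):
    """Apply the same weights at every post-intervention time step."""
    return donors_post @ weights


# Hypothetical toy data: 3 donors, 40 pre- and 10 post-intervention steps.
rng = np.random.default_rng(0)
T_pre, T_post, n_donors = 40, 10, 3
donors = rng.normal(size=(T_pre + T_post, n_donors))
true_weights = np.array([0.5, 0.3, 0.2])

# The treated unit is a noiseless mix of the donors, shifted by a
# constant effect of 2.0 once the intervention starts.
treated = donors @ true_weights
treated[T_pre:] += 2.0

weights = fit_donor_weights(donors[:T_pre], treated[:T_pre])
counterfactual = predict_counterfactual(donors[T_pre:], weights)
estimated_effect = treated[T_pre:] - counterfactual
```

Because the same weight vector is applied independently at every time step, the estimate at time t depends only on the donor values at time t. This is the limitation that motivates modeling the temporal context explicitly.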