Many researchers and vendors are exploiting the increasing number of transistors to build chip multiprocessors (CMPs) by partitioning a chip into multiple simple ILP cores. As in traditional multiprocessors, CMPs extract thread-level parallelism (TLP) from programs by running multiple independent program segments, i.e., threads, in parallel. CMPs are now widely used in high-performance servers and even in embedded systems. In this paper, we present an extension of the OpenMP shared directive for performance optimization on Blackfin 561 (ADSP-BF561) dual-core processors. Many architectures have been proposed to support memory consistency between multiple cores. On a dual-core processor such as the ADSP-BF561, each core has its own private L1 cache, and the two cores share an L2 cache. To execute multithreaded parallel programs efficiently, we must therefore consider carefully where to allocate shared variables in the target memory architecture. In our measured benchmarks, the extension improves speedup by up to 107% and reduces energy consumption by up to 108% compared with not using it.