Performance tuning of applications for shared-memory multiprocessors is largely concerned with removing performance bottlenecks caused by communication among the processors. To simplify performance tuning, our approach has been to extend the hardware/software interface with powerful memory-control primitives and to combine them with compiler optimizations that remove communication bottlenecks in distributed shared-memory multiprocessors. Evaluations have shown that this combination can substantially improve application performance. This raises the fundamental question of how the hardware/software interface in future distributed shared-memory machines should be defined so that it serves as a good target for performance tuning of shared-memory programs, whether the tuning is done automatically or by hand. We discuss an approach along those lines.