State estimation of deformable objects such as textiles is notoriously difficult due to their extremely high dimensionality and complexity. A lack of data and benchmarks further impedes progress in robotic cloth manipulation. In this paper, we make a first attempt to solve the problem of semantic state estimation from RGB-D data alone, in an end-to-end manner, using deep neural networks. Since neural networks require large amounts of labeled data, we introduce a novel MuJoCo-based simulator to generate a large-scale, fully annotated robotic textile manipulation dataset that includes bimanual actions. Finally, we provide a set of baseline deep neural networks and benchmark them on the task of semantic state prediction on our proposed dataset. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.