CCRI: Principled Dataset Generation, Sharing and Maintenance Tools for the Wireless Community

Applied machine learning (ML) research in wireless faces challenges due to the inability of domain experts to easily access existing well-curated, well-structured, and open-access datasets. Furthermore, there is a lack of direct access to a software framework that automates dataset creation and distribution based on detailed user requirements. RFDataFactory is a collaborative project that brings together investigators from Northeastern University and Rice University to bridge this gap. RFDataFactory aims to make available categorized datasets suitable for research related to ML in 5G and beyond networks, and to advance fundamental understanding and design tools for accessing, creating, sharing, and storing wireless datasets.

RFDataFactory will enable easy collection and preprocessing of physical-layer to packet-level datasets through high-level directives and application programming interfaces. This will enable dataset generation for several NSF-funded experimentation platforms, such as the Colosseum emulator and NSF Platforms for Advanced Wireless Research. The project will significantly advance autonomous statistical analysis of RF spectrum activity, which will reduce data storage needs. Moreover, it will create preprocessing tools for removing device-identifying information and facilitate generating standards-compliant metadata headers. The project will also result in a searchable, centralized repository of both project-supported and user-contributed datasets, with a focus on reusability.

RFDataFactory will accelerate interdisciplinary research at the intersection of machine learning and the wireless domain, bridge different communities, and train a new generation of professionals for wireless dataset creation and sharing. The project will seek to involve underrepresented students in research and learning activities, support annual dataset gathering challenges, and update advanced course materials with hands-on tutorials and laboratory sessions. Through targeted high-school outreach, the project will increase awareness and excitement in the next generation of researchers. The project will also generate value for other large-scale infrastructure investments already made by the NSF.

Project URL: All datasets, metadata files, software application programming interfaces, tutorial materials, webinar recordings, and other digital outcomes of this project will be maintained and accessible via the project website for 3 years after the completion of the project.

NSF Abstract: https://www.nsf.gov/awardsearch/showAward?AWD_ID=2120447&HistoricalAwards=false