AI & Open Government Data Assets -- Commerce Chief Data Officer requests input (by 7/16)

asked 2 years ago by Andrew Reamer (58.3k points)
edited 2 years ago by Andrew Reamer

Apr 17 -- The U.S. Department of Commerce is committed to advancing transparency, innovation, and the responsible use and dissemination of public data assets, including for use by data-driven AI technologies. To this end, we are pleased to issue this Request for Information (RFI) to seek valuable insights from industry experts, researchers, civil society organizations, and other members of the public on the development of AI-ready open data assets and data dissemination standards. Comments must be received on or before July 16, 2024.

The U.S. Department of Commerce (Commerce) is committed to leading the way in producing and disseminating high-quality public data. Commerce's data assets enable U.S. scientific discovery, innovation, and economic growth, serving as an invaluable asset to the country. In its mission to publish data for the American public and achieve its strategic goal to “expand opportunity and discovery through data,” Commerce is dedicated to continuously refining its processes for creating, curating, and distributing its data as new technologies emerge. This Request for Information (RFI) seeks to understand ways to improve Commerce's creation, curation, and distribution of its open data assets to facilitate the development and advancement of AI technologies such as generative AI.

Commerce, as a premier data provider, has a long history of adapting to technological change. In the past 40 years, Commerce has moved data publication efforts into electronic forms, and in the past 20 years, that has included the provision of both data services and tools to support discovery and exploration of Commerce's data. In the last five years, Title II of the Foundations for Evidence-Based Policymaking Act, commonly known as the OPEN Government Data Act, began Commerce's commitment to the dissemination of open data assets in machine-readable formats, or “data in a format that can be easily processed by a computer without human intervention while ensuring no semantic meaning is lost” (44 U.S.C. 3502(18)).

Today, Commerce is facing a new technological change with the emergence of AI technologies that provide improved information and data access to users. Commerce is specifically interested in generative AI (GenAI) applications, which digest disparate sources of text, images, audio, video, and other types of information to produce new content. GenAI and other AI technologies present opportunities and challenges for both data providers such as Commerce and data users, including other government entities, industry, academia, and the American people.

AI has brought transformative changes to many industries including health, finance, education, and transportation, while GenAI has the promise of democratizing access to data by enabling the average person to engage with data in ways that had not previously been possible. Recent GenAI tools allow users to input simple prompts to engage with content gathered by these tools from a wide range of sources, including Commerce's public data.

The challenge for Commerce, as an authoritative provider of data, is to ensure that these new AI intermediaries can appropriately access its data without losing the integrity, including quality, of said data. AI tools require mass amounts of trustworthy information to accurately respond to the needs of their users. As AI applications become more sophisticated and ingrained in everyday life, the role of high-quality data becomes increasingly critical. Commerce acknowledges, as a key data producer, that in order for AI systems to utilize its data for training and for instant data retrieval, its data may need to be reconfigured in easily consumable formats. AI tools are increasingly used for data analysis and data access, so Commerce hopes to ensure that the data these tools consume is easily accessible and “machine understandable,” versus just “machine readable.” Therefore, this RFI explores how to achieve better data integrity, accessibility, and quality for emerging AI technologies.

The uniqueness of emerging technologies such as GenAI arises from the fact that the interpretation and use of data are no longer solely executed by human experts (e.g., scientists, engineers, software developers) who bring their own knowledge and understanding to working with Commerce's data. This human understanding is grounded in shared disciplinary knowledge and in human-readable documentation that Commerce provides with its published data. AI systems currently lack common knowledge and the ability to use such knowledge in their activity. Although these systems demonstrate fluency and intelligence, their outputs are often driven by contextual prediction rather than higher-order reasoning capabilities. Recent AI systems are trained on tremendous amounts of digital content and generate responses based on the contextual properties of that content. However, these systems do not truly “understand” the texts in a meaningful way. While there is ongoing improvement, today's AI systems are fundamentally limited by their reliance on extensive, unstructured data stores, which depend on the underlying data rather than an ability to reason and make judgments based on comprehension. Knowing this, Commerce seeks to adhere to its strategic mission to “expand opportunity and discovery through data,” by disseminating public data in AI-ready formats while ensuring no semantic meaning is lost.

To respond to the challenge and realize the opportunity offered by these new technologies, it is important that Commerce enables AI systems to access and use its public data assets correctly and responsibly.

This RFI seeks feedback, recommendations, and suggestions from industry experts, researchers, civil society organizations, and the public regarding Commerce's creation, curation, and distribution of data assets that are specifically designed to facilitate the development and advancement of AI technologies such as GenAI.

Thus far, Commerce has made efforts to expose its public data through structured APIs and is developing enriched metadata standards for describing its data assets. To date, Commerce metadata has focused on enabling discovery of data assets rather than the use of those data assets by AI systems, but Commerce sees value in changing this focus. Commerce seeks to further understand how it can make its data assets AI-ready.

In particular, Commerce wishes to explore the following:

-- The use of knowledge graphs for variable-level metadata, allowing systems to better link human terms to data elements;
-- Embracing standardized ontologies such as schema.org or NIEM;
-- Harmonizing and linking our internal ontologies and vocabularies using knowledge graphs grounded in standardized ontologies;
-- Gathering internal and external written documentation of existing data products and:
○ Mining them for terminology to use in metadata harmonization and linking; or
○ Releasing them in raw formats for the training of AI models;
-- Adopting data formats which allow for rich metadata as well as generating metadata “sidecars” for more traditional formats such as CSV or SAS;
-- Using open standards for APIs with the ability to link into knowledge graphs; and
-- Improving guidance and metadata around appropriate data usage and licensing for purposes such as research analytics, text-and-data mining, and AI system ingestion.
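
To make the "sidecar" idea in the list above concrete, here is a minimal sketch of what a machine-readable companion file for a CSV could look like. It is only an illustration, not anything Commerce has proposed: the column names, dataset name, license URL, and the `build_sidecar` helper are all hypothetical. The only borrowed vocabulary is schema.org's `Dataset` type with its `variableMeasured`, `PropertyValue`, `unitText`, `license`, and `encodingFormat` terms.

```python
import csv
import io
import json

# Hypothetical sample data; not a real Commerce dataset.
SAMPLE_CSV = "geo_id,year,median_income\n01001,2022,68315\n01003,2022,71039\n"


def build_sidecar(csv_text, dataset_name, license_url, variable_notes):
    """Return a schema.org-style Dataset description for a CSV file.

    One PropertyValue entry is emitted per column in the CSV header.
    variable_notes maps a column name to optional "description" and
    "unitText" strings; columns without notes get a name only.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    variables = []
    for column in header:
        entry = {"@type": "PropertyValue", "name": column}
        note = variable_notes.get(column, {})
        for key in ("description", "unitText"):
            if key in note:
                entry[key] = note[key]
        variables.append(entry)
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": dataset_name,
        "license": license_url,
        "encodingFormat": "text/csv",
        "variableMeasured": variables,
    }


sidecar = build_sidecar(
    SAMPLE_CSV,
    dataset_name="Example county income table",  # hypothetical
    license_url="https://creativecommons.org/publicdomain/zero/1.0/",
    variable_notes={
        "geo_id": {"description": "County identifier code"},
        "median_income": {
            "description": "Median household income",
            "unitText": "USD",
        },
    },
)

# Serialize the description; it would be saved next to the CSV file.
sidecar_json = json.dumps(sidecar, indent=2)
print(sidecar_json)
```

The point of the sketch is that the variable-level descriptions travel with the file, so a program (or an AI system) reading the CSV does not have to guess what a column such as `median_income` means or which license governs reuse.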

Commerce seeks comment on the topics discussed above and responses to the following questions: . . . .

FRN: https://www.federalregister.gov/d/2024-08168
Commerce press release: https://www.commerce.gov/news/blog/2024/04/request-information-ai-ready-open-government-data-assets
