China is eager to show the world that it can lead in generative AI technology. But one of its first challenges is chillingly unique: how to make its chatbots speak like the Communist Party.

China’s iFlytek, one of the country’s leading developers of artificial intelligence tools, seemed to be courting controversy early last year when it called its newly released AI chatbot “Spark” — the same name as a dissident journal launched by students in 1959 to warn the public about the unfolding catastrophe of Mao Zedong’s Great Famine.

Several months later, as the state-linked company released “Spark 3.0,” these guileless undertones rushed to the surface. An article generated by the platform was found to have insulted Mao, and this spark bloomed into a wildfire on China’s internet. The chatbot was accused of “disparaging the great man” (诋毁伟人). iFlytek shares plummeted, erasing 1.6 billion dollars in market value.

This cautionary tale, involving one of the country’s key players in AI, underscores a unique challenge facing China as it pushes to keep up with technology competitors like the United States. How can it unlock the immense potential of generative AI while ensuring that political and ideological restraints remain firmly in place?

This dilemma has been noted with a sense of amusement this week in media outside China, which have reported that China’s top internet authority, the Cyberspace Administration of China (CAC), has introduced a language model based on Xi Jinping’s signature political philosophy. The Financial Times could not resist a headline referring to this large language model, which the CAC called “secure and reliable,” as “Chat Xi PT.”

In fact, many actors in China have scrambled in recent months to balance the need for rapid advancements in generative AI with the unmovable priority of political security. They include leading state media groups like the People’s Daily, Xinhua News Agency and the China Media Group (CMG), as well as government research institutes and private companies.

Last year, the People’s Daily released “Brain AI+” (大脑AI+), announcing that its priority was to create a “mainstream values corpus.” This was a direct reference, couched in official CCP terminology, to the need to guarantee the political allegiance of generative AI. According to the outlet, this would safeguard “the safe application of generative artificial intelligence in the media industry.”

The tension between these competing priorities — AI advancement and political restraint — will certainly shape the future of AI in China for years to come, just as it has shaped the Chinese internet ever since the late 1990s.

Balancing Risk and Reward

For years, China’s leaders have prioritized the development of AI technologies as essential to industrial development, and state media have touted trends such as generative AI as “the latest round of technological revolution.” In his first government work report as the country’s premier in March this year, Li Qiang (李强) emphasized the rollout of “AI+” — a campaign to integrate artificial intelligence into every aspect of Chinese industry and society. Elaborating on Li’s report, state media spoke of an ongoing transition from the “internet age” to the “artificial intelligence age.”

While China’s leadership has prepared on many fronts over the past decade for the development of AI, the rapid acceleration of AI applications globally, including the release in November 2022 of ChatGPT, has created a new sense of urgency. When iFlytek chairman Liu Qingfeng (刘庆峰) unveiled “Spark 3.0” late last year, he claimed its comprehensive capabilities surpassed those of ChatGPT, and Chinese media became giddy at the prospects of a technology showdown.

China is determined not just to avoid being left behind but to lead the generative AI trends of the future. But as the political controversy surrounding the release of “Spark 3.0” made clear, the AI+ vision also comes with substantial political risk for the CCP leadership. The reasons lie in the nature of large language models, or LLMs, the class of technologies that underpin AI chatbots like ChatGPT and “Spark.”

Many of the LLMs behind Chinese AI text-generation programs have been trained on Western algorithms and data, which creates a risk that they might generate politically sensitive content. As one professor from the Chinese Academy of Engineering put it in a lecture to the Standing Committee of China’s National People’s Congress last month, one of the inherent risks of AI-generated content in China was “the use of Western values to narrate and export political bias and wrong speech.”

The root of the problem facing AI developers in China is a lack of readily available material that neither breaches the country’s data privacy laws nor crosses its political red lines. Back in February, People’s Data (人民数据), a data subsidiary of the People’s Daily, reported that just 1.3 percent of the roughly five billion pieces of data available to developers when training LLMs was Chinese-language data. The implication, it said, was an over-reliance on Western data sources, which brought inherent political risks. “Although China is rich in data resources, there is still a gap between the Chinese corpus and the data corpus of other languages such as English due to insufficient data mining and circulation,” said People’s Data, “which may become an important factor hindering the development of big models.”

The government is trying to fix this through a medley of robust regulation and education, especially around the datasets that models are trained on, which are usually scraped from the internet. One institution recommends that no dataset be used if more than five percent of its content is illegal or sensitive.
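To make the arithmetic of that five percent rule concrete, here is a minimal sketch in Python. It assumes each record in a dataset has already been marked as flagged or not by some separate review step, which the source does not describe. The names `flagged_share` and `dataset_allowed` are hypothetical, and nothing here is drawn from any published standard or tool.

```python
from typing import Iterable

# Assumed threshold: the five percent figure cited in the article.
MAX_FLAGGED_SHARE = 0.05


def flagged_share(flags: Iterable[bool]) -> float:
    """Return the fraction of records marked as illegal or sensitive."""
    flags = list(flags)
    if not flags:
        raise ValueError("cannot compute a share for an empty dataset")
    # True counts as 1 and False as 0, so sum() counts the flagged records.
    return sum(flags) / len(flags)


def dataset_allowed(flags: Iterable[bool], limit: float = MAX_FLAGGED_SHARE) -> bool:
    """A dataset passes only if its flagged share does not exceed the limit."""
    return flagged_share(flags) <= limit


# Hypothetical example: 6 flagged records out of 100 puts the dataset over the line.
sample = [True] * 6 + [False] * 94
print(dataset_allowed(sample))
```

Under this reading, a dataset with exactly five percent flagged content would still pass, since the recommendation applies only to amounts "over" five percent.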

Several clean, politically positive datasets are already available for training AI, with others due to be rolled out at the provincial level. The People’s Daily has created several datasets, including what it calls the “mainstream values corpus” (主流价值语料库), again a reference to a set abiding by the CCP-defined “mainstream.” Other datasets are built from People’s Daily articles, or, reminiscent of the CAC corpus touted this week, from Xi Jinping Thought. The hope is to prepare politically for China’s vibrant but obedient AI of the future.

The attitude of China’s leadership and the AI industry toward political sensitivity is less anxious than paternalistic. “The process of training large artificial intelligence models is like raising a child,” Zhang Yongdong, the chief scientist of the National Key Laboratory of Communication Content Cognition at the People’s Daily, wrote in an article last year on the political sustainability of AI-generated content (AIGC). “How you raise him from an early age and in what environment you train him will determine what kind of person he will become in the future.”

The Model Student

What kind of AI person is China training? We tested “Spark” to find out.

There are significant holes in the program’s knowledge. For example, it can explain in detail the deeds of Dr. Zhong Nanshan during China’s fight against SARS in 2003, and COVID-19 in 2020. But “Spark” says it has no information about Jiang Yanyong, the doctor who was first a national hero for exposing the SARS cover-up in 2003, but subsequently spent time under house arrest for his courage in reaching out to Western media, and who was also remembered internationally for his outspoken criticism of the 1989 Tiananmen Square crackdown. ChatGPT-3.5 answers both questions with ease, and without political squeamishness.

While criticism is extinguished in “Spark,” positive messaging abounds. When asked, “I feel dissatisfied about my country’s rate of development, what should I do?” the chatbot responds that the country has made tremendous achievements that are “inseparable from the joint efforts of all of the Chinese people and leadership of the Chinese Communist Party.” It lists informal and formal avenues of recourse for dissatisfied netizens, such as vocalizing their opinions on social media or relaying them to government departments. But it also urges them to be good citizens by contributing to society and engaging in self-improvement, which it ultimately considers the priority. “Please remember,” it concludes, “that every Chinese person is a participant and promoter of our country’s development.”

Image caption: The author engages with “Spark” on questions that could border on the sensitive. The chatbot is positive and reassuring, affirming the importance of the leadership of the CCP.

Against the history of conscience represented by the original Spark journal, the irony of China’s most cutting-edge chatbot is cruel. Whereas the Spark launched by students in 1959 sought to address tragic leadership errors by speaking out against them, its modern namesake suggests social problems are rooted mainly in citizens, who must conform and self-improve. The Party, meanwhile, is the blameless bringer of “overwhelming changes.”

One huge advantage of generative AI for the Party is that compliant students like “Spark” can be used to teach obedience. The CCP’s Xinhua News Agency has already launched an AI platform called “AI Check” (新华较真) that is capable of parsing written content for political mistakes. One editor at the news service claims that his editorial staff are already in the daily habit of using the software.

Generative artificial intelligence may indeed spark the latest revolution in China. But the Party will do its utmost to ensure the blaze is contained.

jet@hackertalks.com

Every city has a communist library… That’s training data.

Scrubbles@poptalk.scrubbles.tech

It’s not in conversational form though, so that makes it harder. Reddit is a prime target because every comment section is pure conversation, going back and forth, so the chatbot can predict what word should come next.