Hey, it's Bernie!

SovereignAI

What would a truly localized Singaporean LLM look like? Beyond just Singlish, I explore how a 'fellow citizen' AI could navigate our unique cultural nuances.


As I work through the free Stanford Online course CME295, I have been thinking about Natural Language Processing (NLP) and how we currently tokenize English text to feed into LLMs. If we transform words into numbers, how do we add the cultural nuances, slang, and so on that are unique to a country, so that we can create a sort of naturalized citizen? What is Singapore currently doing?
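To make the "words into numbers" step concrete, here is a minimal sketch of a toy word-level tokenizer. Everything in it is an assumption made up for illustration: the tiny `VOCAB`, the helper names `encode` and `extend_vocab`, and the example sentences. Real LLMs generally use subword tokenizers, which split an unfamiliar word into smaller pieces rather than dropping it, but the basic problem is similar: a word the vocabulary was never built around gets a poor representation.

```python
# Toy vocabulary: every entry is a made-up illustration, not a real model's vocab.
VOCAB = {"<unk>": 0, "can": 1, "you": 2, "help": 3, "me": 4, "or": 5, "not": 6}


def encode(text, vocab):
    """Map each lowercased, whitespace-separated word to its ID, or to <unk> if missing."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def extend_vocab(vocab, new_words):
    """Return a copy of vocab with unseen words appended at the next free IDs."""
    extended = dict(vocab)
    for word in new_words:
        if word not in extended:
            extended[word] = len(extended)
    return extended


standard = encode("can you help me", VOCAB)
singlish = encode("can help me or not lah", VOCAB)

# Hypothetical "localized" vocabulary with a few Singlish particles added.
local_vocab = extend_vocab(VOCAB, ["lah", "leh", "shiok"])
singlish_local = encode("can help me or not lah", local_vocab)

print(standard)        # [1, 2, 3, 4]
print(singlish)        # [1, 3, 4, 5, 6, 0]  <- "lah" collapses to <unk>
print(singlish_local)  # [1, 3, 4, 5, 6, 7]  <- "lah" now has its own ID
```

With the original vocabulary, the particle "lah" is indistinguishable from any other unknown word, so whatever tone it carries is lost before the model ever sees it. Giving it an ID is only the first step; the model still has to learn what it means from data.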

So I asked Gemini for some thoughts: "If I am a citizen of a country with four languages, and I want to help build a realistic LLM that is not only balanced across them but also incorporates our country's culture and slang, so that it behaves like a 'fellow citizen' who knows all our nuances and history, how should I start?"

Gemini's answer: “This is a noble and technically fascinating mission. Building a “Sovereign LLM” is the move from a model that knows about your country to a model that is of your country.

To make an AI feel like a “fellow citizen,” you have to move beyond just scraping the web. You need to follow a four-stage blueprint:

  • Collection: Gathering multicultural datasets.
  • Curation: Cleaning data to avoid “Data Poisoning.”
  • Alignment: Teaching the AI nuances and cultural slangs.
  • Evaluation: Testing if the AI actually sounds like a “Fellow Citizen.”

...

Be careful with “Data Poisoning.” If you only scrape one political side or one ethnic group's slang, your “Fellow Citizen” AI might end up being biased or exclusionary. Balance is your hardest job.”

It then went on to suggest some open-source models to start with and actionable next steps.
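
The warning about balance can be made a little more concrete. Here is a minimal sketch of a first-pass check on how a collected corpus is split across languages. The language codes (`en`, `zh`, `ms`, `ta`), the document counts, the 15% threshold, and the helper names `language_shares` and `underrepresented` are all assumptions for illustration, and a real project would also need to look at balance within each language (sources, communities, viewpoints), which a simple count cannot show.

```python
from collections import Counter

# Hypothetical language tags for Singapore's four official languages.
EXPECTED_LANGS = ["en", "zh", "ms", "ta"]


def language_shares(docs):
    """Return the fraction of documents tagged with each language."""
    counts = Counter(doc["lang"] for doc in docs)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def underrepresented(shares, expected_langs, min_share):
    """List the expected languages whose share falls below min_share."""
    return sorted(
        lang for lang in expected_langs if shares.get(lang, 0.0) < min_share
    )


# Made-up corpus: 6 English, 2 Mandarin, 1 Malay, 1 Tamil document.
sample_counts = {"en": 6, "zh": 2, "ms": 1, "ta": 1}
docs = [{"lang": lang} for lang, n in sample_counts.items() for _ in range(n)]

shares = language_shares(docs)
flagged = underrepresented(shares, EXPECTED_LANGS, min_share=0.15)

print(shares)   # {'en': 0.6, 'zh': 0.2, 'ms': 0.1, 'ta': 0.1}
print(flagged)  # ['ms', 'ta']
```

A check like this would sit in the Curation stage: it does not fix an imbalance, it only tells you where to go and collect more.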

[NOTE] I'm using this blog post to document the start of my journey, which will be a multi-year quest. I know Singapore is already embarking on building her own LLM or LLMs, but wouldn't it be great if I could create my very own Singaporean friend, and a super knowledgeable, kind and caring one at that?

Cool idea?

#ai #LLM #NLP #SovereignAI #Singapore #Singlish #AISingapore #CME295 #LearningInPublic