
Artificial Intelligence (AI) existed long before OpenAI launched ChatGPT two years ago and brought it to the attention of mainstream media. ChatGPT’s massive popularity made ordinary people aware of AI, starting a wave of new trends. Entrepreneurs tried to capitalise by creating countless new start-ups. Businesses scrambled to adopt it in some way, fearing they would be left behind. Students used AI in their work, creating whole new challenges in academia. Products began to flaunt AI claims in their marketing, regardless of the actual value the AI added. Anything with “AI” sells, at least for now.
The field of AI has no clear start date. I personally consider Alan Turing’s 1950 paper “Computing Machinery and Intelligence” a significant starting milestone. The term Artificial Intelligence itself was coined in 1955 in a proposal submitted by John McCarthy (Dartmouth College), Marvin Minsky (Harvard University), Nathaniel Rochester (IBM), and Claude Shannon (Bell Telephone Laboratories). Their workshop a year later, in 1956, is considered by some as the birth of this new field of study. Over the following decades, alongside incredible advancements in computing hardware, AI research progressed significantly.
ChatGPT’s ability to generate responses based on prompts belongs to a branch of AI called Generative AI. While this branch introduced AI to the mainstream, it is not the only field within AI. Many other fields exist. A machine’s ability to transform human speech into text is AI. Identifying animals in images is AI. Differentiating between faces or voices is AI. Playing chess against a computer is an implementation of AI. Many AI innovations have been inspired by fiction, attempts to bring imagined machines to reality.
As businesses enter a new race to benefit from AI, many focus on the products offered by the big names. OpenAI’s ChatGPT is considered the current market leader, having become the foundation of Microsoft’s Copilot and a part of Apple’s upcoming Apple Intelligence. People also have the choice of alternatives such as Google Gemini, Amazon Q, Anthropic Claude, Perplexity AI, Jasper and others. While choosing the right model matters for the value a business gets from its AI implementation, many forget one important yet underrated factor: the quality of their data.
An AI model can only be as good as the data it is trained on.

At a very fundamental level, an AI model gets its “intelligence” from prior learning before generating responses to answer prompts. This prior learning occurs when the model is given a large set of data for training. The AI model processes this training data and performs complex calculations to identify patterns and correlations within it. It then uses these patterns to generate responses to future prompts. For instance, if the training data contains many cases where people consuming products from a certain brand X ended up in the hospital, an AI model might conclude that brand X is unsafe and causes sickness. If a different dataset contains information about people’s allergies, the AI model may discern that those hospitalised after consuming brand X are mostly individuals with a specific allergy. The more training data available, the better the chances are that the AI model will meet its creators’ expectations.
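To make the brand X example concrete, here is a toy sketch in Python. It is not an AI model; it is a simple frequency count over made-up records (the field names and numbers are hypothetical), showing how the pattern that emerges from the data changes once the allergy information is available.

```python
# Toy, hard-coded records: did the person consume "brand X", do they have a
# (hypothetical) allergy, and did they end up in hospital?
records = [
    {"consumed_brand_x": True,  "has_allergy": True,  "hospitalised": True},
    {"consumed_brand_x": True,  "has_allergy": True,  "hospitalised": True},
    {"consumed_brand_x": True,  "has_allergy": False, "hospitalised": False},
    {"consumed_brand_x": True,  "has_allergy": False, "hospitalised": False},
    {"consumed_brand_x": True,  "has_allergy": False, "hospitalised": False},
    {"consumed_brand_x": False, "has_allergy": True,  "hospitalised": False},
    {"consumed_brand_x": False, "has_allergy": False, "hospitalised": False},
]

def hospitalisation_rate(rows):
    """Share of rows where the person was hospitalised."""
    return sum(r["hospitalised"] for r in rows) / len(rows) if rows else 0.0

brand_x = [r for r in records if r["consumed_brand_x"]]

# Pattern 1: looking at brand X alone, 40% of consumers were hospitalised.
print("Brand X overall:", hospitalisation_rate(brand_x))

# Pattern 2: adding the allergy field reveals the real driver.
print("Brand X, with allergy:   ",
      hospitalisation_rate([r for r in brand_x if r["has_allergy"]]))
print("Brand X, without allergy:",
      hospitalisation_rate([r for r in brand_x if not r["has_allergy"]]))
```

The calculation is trivial, but the principle carries over: what a model can “conclude” is limited by which fields its training data contains.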
Does this mean that a large amount of training data guarantees the best AI outcomes?
Not necessarily. While the size of training data is indeed an important factor, data quality is equally important. A large volume of poor-quality data will likely cause an AI model to identify incorrect patterns, leading to a phenomenon known as AI hallucinations.
Then surely we could overcome that by using training data from the whole Internet?
Also, no.
First, “the whole Internet” is not well-structured data, and a significant portion of it contains incorrect information. Using the Internet as AI training data requires advanced strategies to shape it into a usable form. Second, the sheer volume of data on the whole Internet would require excessive computing power, beyond what today’s technology can handle. Third, there are serious legal and ethical issues with using people’s data, regardless of whether it is protected by copyright law.
Moreover, the Internet is not static. Billions of user entities (not all of them even human) use the Internet every day, constantly adding new data, and automated bots account for a significant share of online activity. In recent months and years, humans have also begun adding AI-generated results to this massive pool of Internet data. When AI-generated data becomes a significant portion of the available data, an AI model may start treating its own past results (which may or may not make sense) as part of its “statistically significant” inputs. Imagine someone printing our notes from primary school as a book and adding it to the pile of material we need to study for a postgraduate exam. This illustrates the importance of marking AI-generated results, whether they are text, images, audio, video, or other forms.
Then what kind of training data would be best?
It depends on the intended use of the AI. For AI with a specific use case, well-controlled, context-specific training data usually works better than a massive amount of low-quality data.
Not every business needs global-scale training data for its AI implementation. Some businesses may want AI to produce responses tailored to their specific context, using their own dataset. For example, an Italian company running nutrition programs might want to make sure that the new AI chatbot it builds for customers will never recommend pineapple on pizza, a suggestion that could slip through if the AI were trained on data from across the globe. Other reasons for training AI on in-house data include unique business aspects, regulatory compliance, and privacy concerns. When businesses opt to use their own data for training, they must understand that the quality of their historical data is just as important as their choice of AI model or platform.
A Comparison Example
Drawing from my field of expertise in Project and Portfolio Management, here is a simplified example of the importance of data quality, using two fictional companies: A and B. Both companies want to implement AI to predict which projects are likely to bring more benefits and fewer issues. This will help management approve project proposals that maximise investment returns.
Company A has four departments using different tools to manage their projects: Jira, Project Online, Microsoft Project and Excel. The company uses yet other tools to manage its finances and resources. An AI implementation would need to train on historical data from the last 15 years, including project activities, finances (budget vs. actual costs), milestones, deliverables, registers, and benefits. Consolidating tasks from Project Online and Microsoft Project will take some effort. Further consolidation with Jira will require more work because of differences in how projects are managed (Agile methodology in Jira vs. Waterfall in Project Online and Microsoft Project). The most challenging part will be consolidating data from the fourth department, which uses Excel: even though there is a template, each Project Manager is allowed to modify it, producing Excel files with different structures. After consolidating project data, Company A will still need to fold in finance and resource data to get a complete picture.
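For illustration, here is a minimal sketch of the kind of consolidation Company A faces. The record layouts and field names are hypothetical, not the actual Jira, Project Online or Excel formats; the point is simply that every source must be mapped onto one common schema before the data can be used for training.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ProjectTask:
    """Common schema that all four sources are normalised into."""
    project_id: str
    name: str
    start: date
    finish: date
    source: str

def from_jira(issue: dict) -> ProjectTask:
    # Agile source: sprints and issues rather than scheduled tasks.
    return ProjectTask(issue["projectKey"], issue["summary"],
                       date.fromisoformat(issue["sprintStart"]),
                       date.fromisoformat(issue["sprintEnd"]), "jira")

def from_ms_project(task: dict) -> ProjectTask:
    # Waterfall source: tasks with planned start/finish dates.
    return ProjectTask(task["ProjectId"], task["TaskName"],
                       date.fromisoformat(task["Start"]),
                       date.fromisoformat(task["Finish"]), "ms_project")

def from_excel(row: dict) -> ProjectTask:
    # Hardest source: each Project Manager's template differs, so the mapping
    # needs per-file handling (only a single layout is shown here).
    return ProjectTask(row["Project"], row["Activity"],
                       date.fromisoformat(row["From"]),
                       date.fromisoformat(row["To"]), "excel")

# One consolidated list, ready to be joined with finance and resource data.
consolidated = [
    from_jira({"projectKey": "CRM-1", "summary": "Build API",
               "sprintStart": "2024-02-05", "sprintEnd": "2024-02-16"}),
    from_ms_project({"ProjectId": "P-77", "TaskName": "Site survey",
                     "Start": "2024-03-01", "Finish": "2024-03-15"}),
    from_excel({"Project": "HR-Portal", "Activity": "UAT",
                "From": "2024-04-10", "To": "2024-04-20"}),
]
print(len(consolidated), "tasks in the common schema")
```

Every extra source format means another mapping like these to write, test, and maintain, which is exactly the effort Company A would have to spend before any training can begin.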
Company B has three departments managing different kinds of projects, but they all use the same Project and Portfolio Management (PPM) tool. Even though they use other tools to manage finances and resources, automated processes already synchronise that data to the PPM tool every night, giving them a single source of truth with end-to-end views of their projects. When the CTO announces a plan to implement AI to predict the risks and benefits of potential projects, all the data necessary for training is already available in a single system. This allows Company B’s AI model to train more effectively, resulting in better predictions.
By having all PPM data in a single system, Company B can ensure a certain level of data integrity. For example, by making certain fields mandatory, it knows that every record in the training data for its AI will include a project budget and duration.
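As a rough illustration, a mandatory-field rule of this kind could look like the sketch below. The field names are hypothetical; the idea is simply that records failing the checks never make it into the training set.

```python
# Records missing mandatory fields are rejected before training.
MANDATORY_FIELDS = ("project_id", "budget", "duration_days")

def validate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is usable."""
    problems = [f"missing {field}" for field in MANDATORY_FIELDS
                if record.get(field) in (None, "")]
    if isinstance(record.get("budget"), (int, float)) and record["budget"] < 0:
        problems.append("budget cannot be negative")
    return problems

records = [
    {"project_id": "P-01", "budget": 250_000, "duration_days": 120},
    {"project_id": "P-02", "budget": None, "duration_days": 90},   # rejected
]
training_ready = [r for r in records if not validate(r)]
print(f"{len(training_ready)} of {len(records)} records pass the checks")
```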
Implementing Artificial Intelligence in PPM
The company I work for, Altus, develops a tool to manage projects, programs, and portfolios in a unified solution, with integrations to popular project management tools. It has its own scheduling and resource management capabilities, and our implementation partners can set up integrations with various external tools. This allows companies at different levels of process maturity to keep using the tools that work for them while still maintaining a single source of truth for all project data.
If implementing AI in your organisation’s PMO or portfolio management is of interest to you, please get in touch with one of our partners. Remember, consolidating your data into one place is a crucial first step for any AI implementation. Further analysis of your organisation’s business processes will then help improve data hygiene, resulting in even better training data.
What is data hygiene?
In simple terms, data hygiene refers to maintaining a database that is clean, consistent, and free from issues. Clean data is up to date (all updates are recorded correctly), complete (all mandatory fields are filled), structurally correct, and free from duplication or corruption. While some AI models may seem intelligent in their ability to understand unstructured human prompts, they work significantly better when trained on clean, structured data.
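To illustrate, here is a minimal sketch of two routine hygiene checks implied by that definition, applied to hypothetical project records: removing duplicates and flagging records that have not been updated recently.

```python
from datetime import date, timedelta

records = [
    {"project_id": "P-01", "status": "Active", "last_updated": date(2024, 6, 1)},
    {"project_id": "P-01", "status": "Active", "last_updated": date(2024, 6, 1)},  # duplicate
    {"project_id": "P-02", "status": "Active", "last_updated": date(2022, 1, 15)}, # stale
]

# Deduplicate: keep only the first occurrence of each project_id.
seen, deduplicated = set(), []
for r in records:
    if r["project_id"] not in seen:
        seen.add(r["project_id"])
        deduplicated.append(r)

# Flag records not updated within the last 180 days (threshold is arbitrary,
# measured here against a fixed reference date for reproducibility).
cutoff = date(2024, 7, 1) - timedelta(days=180)
stale = [r["project_id"] for r in deduplicated if r["last_updated"] < cutoff]

print(f"{len(deduplicated)} unique records, stale: {stale}")
```

Real hygiene work covers far more than this, but even simple automated checks like these catch problems before they quietly distort a training set.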
Once an AI model is trained, does it still need good training data?
Even after training, AI models benefit from ongoing high-quality data. Most organisations want their AI models to keep improving, and that requires a continuous supply of clean, structured, and relevant data. Preparing high-quality data should be seen as an ongoing requirement rather than a one-off process.
Furthermore, high-quality data is useful beyond training. When users feed generative AI models prompts or attached data, the models perform better when that data is well structured and well maintained. At the end of the day, AI models, no matter how sophisticated, are still machines: machines that thrive on consistency and order.
Conclusions
As Artificial Intelligence continues to evolve and permeate various industries, businesses must not underestimate the critical role of data quality. The best AI models are not only the ones built on cutting-edge technology, but those trained and refined with clean, contextually relevant, and structured data. Whether you are a start-up leveraging AI for brand new products or a multinational corporation embedding AI into internal operations, your model will only be as smart as the data you provide.
In this new AI-driven world, businesses would do well to remember: data quality is not just a factor, it is the foundation. Ensuring the highest standards of data hygiene and structure is the key to unlocking AI’s full potential and long-term value.
Disclaimer: This article was also published on my LinkedIn account. The content of this article reflects my personal views. Microsoft Designer AI was used to generate the illustrations for this article.