Machine Learning System Design Stage: Data Preparation

Paul Deepakraj Retinraj
5 min read · Jun 17, 2023


Earlier in this series:

Machine Learning System Design: Template

Machine Learning System Design Stage: Problem Navigation

Introduction:

Training data collection is a critical stage in the design of machine learning systems. The quality, quantity, and preprocessing of data significantly impact the performance and reliability of machine learning models. This post explores the main aspects of training data collection: collection methods and their constraints, data size and quality, labeling requirements, sampling and splitting techniques, storage and retention policies, preprocessing, and data privacy. Understanding and addressing these factors lays a solid foundation for building robust and accurate machine learning models.

Training Data Collection Methods:

When embarking on training data collection, it is crucial to consider the methods employed. Each method has its constraints and risks. Manual data collection, for example, can be time-consuming and prone to errors. Crowdsourcing data collection may introduce challenges related to data quality and inconsistency. Understanding the limitations and risks associated with each proposed method helps ensure the reliability and representativeness of the collected data.

Data Size and Quality:

The size and quality of the training data play a pivotal role in model performance. Evaluating the available data and assessing its adequacy for training is essential. Considerations include the data size, its representativeness, and potential biases. Additionally, determining whether the data is labeled or requires annotation is crucial. Understanding if the labels are direct or derived can help identify the level of complexity in the annotation process. Furthermore, considering the need for human labellers to establish ground truth and leveraging user feedback data for model improvement are important considerations.

Sampling Techniques:

Sampling techniques are employed to select a subset of data for training, testing, and validation. Random sampling, stratified sampling, time-based partitioning, or K-fold cross-validation are commonly used depending on the problem and data characteristics. Addressing imbalanced data, where the distribution of positive and negative training samples is skewed, is crucial. Techniques such as downsampling the majority class and upsampling the minority class can help balance the dataset, ensuring a more representative and robust model.
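As a rough illustration of the downsampling/upsampling idea, here is a minimal pure-Python sketch; the class counts and labels below are made up for the example:

```python
import random

random.seed(42)

# Toy imbalanced dataset: 90 negatives, 10 positives (made-up counts).
data = [(x, 0) for x in range(90)] + [(x, 1) for x in range(10)]

negatives = [d for d in data if d[1] == 0]
positives = [d for d in data if d[1] == 1]

# Downsample the majority class to the size of the minority class...
downsampled = random.sample(negatives, len(positives)) + positives

# ...or upsample the minority class by sampling with replacement.
upsampled = negatives + random.choices(positives, k=len(negatives))

print(len(downsampled))  # 20 balanced examples
print(len(upsampled))    # 180 balanced examples
```

In practice, libraries such as imbalanced-learn offer more principled variants (e.g., SMOTE), but the core trade-off is the same: downsampling discards majority-class information, while upsampling repeats minority-class examples.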

Splitting Techniques:

Proper data splitting into training, testing, and validation sets is crucial for model evaluation and generalization assessment. Determining the appropriate ratio and ensuring data independence among the splits are key. This allows for reliable estimates of model performance and the ability to identify potential overfitting or underfitting issues.
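A minimal sketch of the split itself, using a placeholder dataset and a common (though by no means universal) 70/15/15 ratio:

```python
import random

random.seed(0)

samples = list(range(1000))  # placeholder dataset of 1,000 examples
random.shuffle(samples)      # shuffle before splitting to avoid ordering bias

# 70% train, 15% validation, 15% test; the splits are disjoint.
n = len(samples)
train = samples[: int(0.70 * n)]
val = samples[int(0.70 * n): int(0.85 * n)]
test = samples[int(0.85 * n):]

print(len(train), len(val), len(test))  # 700 150 150
```

Note that for time-series data a random shuffle would leak future information into training; there, a time-based split is the safer choice.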

Data Storage and Retention Policies:

Establishing data storage and retention policies is crucial for managing training data effectively. Defining expiration policies ensures that outdated or irrelevant data is removed from the dataset, maintaining its relevance and reliability. Additionally, adherence to data privacy regulations and ensuring data security are critical considerations when defining storage policies.

Data Preprocessing:

Data preprocessing is a fundamental step in preparing the training data for modeling. Feature engineering techniques transform raw data into meaningful and informative features, optimizing the model’s performance. Handling imbalanced data is crucial to mitigate biases and improve model accuracy. Techniques such as resampling, ensemble methods, or synthetic data generation can address the imbalance and boost model performance. Furthermore, addressing missing values, outliers, duplication, data inconsistencies, and errors are necessary steps in the preprocessing pipeline. Data normalization helps bring features to a standardized scale, avoiding any undue influence of certain features on the model’s training process.

Data Privacy:

Data privacy is of utmost importance in training data collection. It is essential to ensure compliance with relevant privacy regulations, protect personally identifiable information, and anonymize sensitive data when necessary. Implementing robust data privacy measures instills trust and safeguards user information.
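One common anonymization technique is pseudonymization: replacing an identifier with a stable, non-reversible token. A minimal sketch with a salted hash (the salt value and field are illustrative; real systems must manage salts/keys securely and consider re-identification risk):

```python
import hashlib

# Illustrative salt only; a real deployment would store this in a secrets manager.
SALT = b"example-salt"

def pseudonymize(email: str) -> str:
    """Map an email to a stable 16-hex-char token via a salted SHA-256 hash."""
    return hashlib.sha256(SALT + email.strip().lower().encode()).hexdigest()[:16]

token = pseudonymize("Alice@Example.com")
# The same user always maps to the same token, so joins still work,
# but the raw email never enters the training dataset.
assert token == pseudonymize("alice@example.com")
```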

Conclusion:

Training data collection forms the foundation of successful machine learning system design. By considering the methods, constraints, data size and quality, labeling requirements, sampling techniques, splitting strategies, data storage policies, preprocessing considerations, and data privacy concerns, practitioners can ensure the reliability, performance, and ethical use of their machine learning models. A thorough and thoughtful approach to training data collection empowers businesses to leverage the power of data and build accurate, reliable, and impactful machine learning systems.

Training data collection methods:

- Constraints/risks with a proposed method

Data:

- Size (how much data is available?) and quality of data
- Is it labelled data? Does it need to be annotated?
- Direct labels or derived labels
- Human labellers for ground truth?
- User feedback data to improve the model?

Sampling techniques:

- Random vs. stratified vs. time-based partitioning vs. K-fold cross-validation
- Imbalanced data: balancing positive and negative training samples
  - Downsampling and upsampling

Splitting techniques:

- Train, test, and validation

Data Storage and Retention Policies:

- Expiration policies

Data Preprocessing:

- Features, imbalanced data

Data Quality:

Missing values:

- Missing values are data points that are absent for one or more features.
- They can occur for various reasons, such as data entry errors, equipment malfunctions, or incomplete data collection.
- Identify the missing values and their patterns in the dataset to determine their impact on the analysis and the appropriate handling method.
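A minimal sketch of both steps on a toy set of records (the field names and values are made up; mean imputation is just one of several handling strategies, alongside dropping rows or model-based imputation):

```python
# Toy records where None marks a missing value.
records = [
    {"age": 34, "city": "Austin"},
    {"age": None, "city": "Boston"},
    {"age": 28, "city": None},
    {"age": 45, "city": "Austin"},
]

# 1. Identify missing values and their pattern per field.
missing_counts = {
    field: sum(1 for r in records if r[field] is None)
    for field in records[0]
}
print(missing_counts)  # {'age': 1, 'city': 1}

# 2. One simple handling strategy: impute numeric gaps with the mean
#    of the observed values.
observed = [r["age"] for r in records if r["age"] is not None]
mean_age = sum(observed) / len(observed)
for r in records:
    if r["age"] is None:
        r["age"] = mean_age
```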

Outliers:

- Outliers are data points that deviate significantly from the overall distribution or pattern of the data.
- They can be caused by data entry errors, measurement errors, or genuine extreme observations.
- Identify and analyze outliers to determine their cause and decide whether to retain, correct, or remove them from the analysis.
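One standard way to flag candidates for that analysis is the 1.5×IQR fence; a sketch on made-up measurements with one obvious extreme value:

```python
import statistics

# Made-up measurements with one obvious extreme value (95).
values = [10, 12, 11, 13, 12, 11, 95, 10, 12, 13]

# Quartiles via statistics.quantiles, then the 1.5*IQR fence.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in values if v < lower or v > upper]
print(outliers)  # [95]
```

The fence only flags points; deciding whether a flagged point is an error or a genuine extreme observation still requires domain judgment.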

Duplication:

- Duplicate records are instances that appear more than once in the dataset, either as exact copies or with minor variations.
- Duplicates can occur due to data entry errors, merging datasets, or other data processing issues.
- Identify and remove duplicates to avoid biased estimates and overfitting in the machine learning model.
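For exact duplicates, deduplication is a one-pass scan; a sketch on toy rows (the tuple layout is made up for the example — near-duplicates with minor variations need fuzzy matching, which this does not cover):

```python
# Toy rows: (user_id, city) pairs, layout made up for the example.
rows = [
    ("u1", "Austin"),
    ("u2", "Boston"),
    ("u1", "Austin"),   # exact duplicate
    ("u3", "Denver"),
    ("u2", "Boston"),   # exact duplicate
]

# Drop exact duplicates while preserving first-seen order.
seen = set()
deduped = []
for row in rows:
    if row not in seen:
        seen.add(row)
        deduped.append(row)

print(len(deduped))  # 3
```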

Data inconsistencies and errors:

- Data inconsistencies and errors are discrepancies or inaccuracies in the data that can affect the quality and reliability of the analysis.
- These can include typos, formatting errors, incorrect units of measurement, or inconsistent encoding of categorical features.
- Identify and correct data inconsistencies and errors to ensure the data is accurate, consistent, and suitable for analysis.
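A common case is the same category encoded several ways. A minimal sketch that normalizes casing/whitespace and maps known variants onto one canonical label (the values and mapping are purely illustrative):

```python
# Made-up categorical values with inconsistent casing, whitespace, and spelling.
raw_countries = [" usa", "USA", "U.S.A.", "United States", "usa "]

# Map known variants onto one canonical label (mapping is illustrative).
CANONICAL = {"usa": "US", "u.s.a.": "US", "united states": "US"}

def clean(value: str) -> str:
    key = value.strip().lower()
    return CANONICAL.get(key, key)

cleaned = [clean(v) for v in raw_countries]
print(cleaned)  # ['US', 'US', 'US', 'US', 'US']
```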

Data Privacy:

- Sensitive customer data

Understand the different features and their relationship with the target:
- Is the data balanced? If not, do you need oversampling/undersampling?
- Are there missing values? (Often not an issue for tree-based models.)
- Are there unexpected values in one or more columns? How do you determine whether a value is a typo that can safely be ignored?

Data normalization:

Normalize the data to ensure that it follows a standard distribution (e.g., Gaussian distribution) if required by the chosen machine learning algorithm. Techniques include log transformation, Box-Cox transformation, or Yeo-Johnson transformation.

Note that not all machine learning algorithms require normalized data, and normalization might not always be necessary.
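A minimal sketch of two of these ideas on a made-up right-skewed feature: a log transform to compress the long tail, followed by z-score standardization (Box-Cox and Yeo-Johnson are generalizations of this, available in libraries such as scipy/scikit-learn):

```python
import math
import statistics

# Made-up right-skewed feature values (e.g., purchase amounts).
amounts = [1, 2, 2, 3, 5, 8, 13, 100]

# Log transform: log1p handles zeros and compresses the long right tail.
logged = [math.log1p(x) for x in amounts]

# Z-score standardization: rescale to mean 0, standard deviation 1.
mu = statistics.mean(logged)
sigma = statistics.stdev(logged)
standardized = [(x - mu) / sigma for x in logged]

# After this, the feature has mean ~0 and stdev ~1, so no single
# feature dominates distance- or gradient-based training.
```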

Further in this series:

Machine Learning System Design Stage: Feature Engineering

Machine Learning System Design Stage: Modelling

Machine Learning System Design Stage: Model Evaluation

Machine Learning System Design Stage: Deployment

Machine Learning System Design Stage: Monitoring and Observability


Paul Deepakraj Retinraj

Software Architect at Salesforce - Machine Learning, Deep Learning and Artificial Intelligence. https://www.linkedin.com/in/pauldeepakraj/