CST 383 - Intro to Data Science | Week 5
Learning log 4:
This week we learned more about how important preprocessing is before building a machine learning model. One thing that stood out to me is that missing data is not always obvious. Sometimes it shows up as actual NA values, but other times it can be hidden as values like 0 or “information requested,” depending on the dataset. That made me realize that cleaning data is not just a technical step, but also requires thinking carefully about what the values actually mean.
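A small sketch of the hidden-missing-data idea, using pandas on made-up data. The column names and the choice of 0 and "information requested" as placeholders are assumptions for illustration; in a real dataset, which values count as missing depends on what each column means.

```python
import pandas as pd

# Hypothetical toy data: 0 and "information requested" stand in for missing values.
df = pd.DataFrame({
    "age": [34, 0, 51, 29],
    "race": ["white", "information requested", "asian", "black"],
})

# Before cleaning, pandas reports no missing values at all.
missing_before = df.isna().sum()

# mask() swaps in NaN wherever the condition is true.
# Assumption: an age of 0 is not a plausible value in this dataset.
cleaned = df.copy()
cleaned["age"] = cleaned["age"].mask(cleaned["age"] == 0)
cleaned["race"] = cleaned["race"].mask(cleaned["race"] == "information requested")

# After cleaning, each column shows one missing value.
missing_after = cleaned.isna().sum()
print(missing_before.to_dict())
print(missing_after.to_dict())
```

The point is that `isna()` only finds the gaps once the placeholders have been converted, so the "thinking about what the values mean" step has to come first.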
I also learned why scaling matters, especially for models like KNN. Since KNN uses distance to compare points, features with larger numbers can have too much influence if the data is not scaled. This helped me understand why preprocessing and modeling are connected instead of being separate tasks.
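Here is a minimal sketch of that effect with NumPy. The four rows are invented, and z-score standardization is just one common way to scale. Without scaling, the income column dominates the distance, so the closest match to the first person is someone 35 years older; after scaling, the closest match is the person who is similar in both age and income.

```python
import numpy as np

# Hypothetical rows: [age in years, income in dollars].
X = np.array([
    [25, 50_000],
    [60, 50_500],
    [26, 52_000],
    [45, 80_000],
], dtype=float)


def nearest_to_first(data):
    """Return the index of the row closest to row 0 by Euclidean distance."""
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(distances)) + 1  # +1 because row 0 was skipped


# Z-score each column: subtract its mean, divide by its standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

raw_neighbor = nearest_to_first(X)            # row 1: income gap of only 500
scaled_neighbor = nearest_to_first(X_scaled)  # row 2: close in age and income
print(raw_neighbor, scaled_neighbor)
```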
The train/test split and cross-validation topics were useful too. I understand that the test set should be saved for checking how well the model works on new data, while cross-validation helps compare models or hyperparameters more reliably. One concept I am still trying to get fully comfortable with is how to choose the best value of k in KNN and how much I should trust cross-validation results compared to the final test accuracy.
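One way to sketch how these pieces fit together, assuming scikit-learn and its bundled iris dataset. The candidate values of k, the 5 folds, and the 25% test size are arbitrary choices for illustration, not recommendations. Cross-validation runs only on the training data to pick k, and the test set is scored once at the very end.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out a test set that is not touched while choosing k.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Compare candidate k values by mean 5-fold cross-validation accuracy.
# Scaling sits inside the pipeline so it is refit on each fold.
cv_scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_train, y_train, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)

# Refit on all training data with the chosen k, then score the test set once.
final_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)

print(best_k, round(cv_scores[best_k], 3), round(test_accuracy, 3))
```

The cross-validation score and the test accuracy will usually differ a little. The cross-validation score was used to make a choice, so it tends to be slightly optimistic, while the test accuracy is the more honest estimate as long as it is only looked at once.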
This week helped me see the full workflow better:
- clean the data
- scale it when needed
- train a model
- then evaluate it carefully
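The four steps above can be sketched as a single scikit-learn pipeline. Everything specific here is an assumption for illustration: the iris dataset, the missing values blanked out on purpose, median imputation as the cleaning step, and k = 5.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Blank out a few measurements on purpose so there is something to clean.
rng = np.random.default_rng(0)
X = X.copy()
rows = rng.choice(len(X), size=10, replace=False)
X[rows, 0] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

workflow = make_pipeline(
    SimpleImputer(strategy="median"),     # clean: fill missing values
    StandardScaler(),                     # scale: put features on one footing
    KNeighborsClassifier(n_neighbors=5),  # train: the model itself
)
workflow.fit(X_train, y_train)

# evaluate: accuracy on data the pipeline never saw during fitting
test_accuracy = workflow.score(X_test, y_test)
print(round(test_accuracy, 3))
```

Putting the steps in one pipeline means the imputer and scaler learn their statistics from the training data only, so nothing from the test set leaks into preprocessing.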
I feel like I am starting to understand why each step matters instead of just memorizing code.