CST 383 - Intro to Data Science

This week I learned more about pandas and how it is used to work with data in a more organized way. Last week we used NumPy arrays, but this week pandas Series and DataFrames made the data feel easier to understand because the rows and columns can have labels. I learned that a Series is like one column of data with an index, while a DataFrame is like a full table with rows and columns.
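To make the difference concrete, here is a small sketch of a Series and a DataFrame. The names and numbers are made up for illustration and are not the lab data.

```python
import pandas as pd

# Hypothetical data: one labeled column of values (a Series)
mpg = pd.Series([31.5, 28.0, 35.2], index=["Ana", "Ben", "Cal"], name="mpg")

# Hypothetical data: a full table with row labels and column labels (a DataFrame)
cars = pd.DataFrame(
    {"mpg": [31.5, 28.0, 35.2], "miles": [120, 80, 200]},
    index=["Ana", "Ben", "Cal"],
)

# Pulling one column out of a DataFrame gives back a Series
mpg_column = cars["mpg"]

print(type(mpg_column).__name__)  # Series
print(cars.shape)                 # (3, 2): three rows, two columns
```

The row labels (the index) are shared between the table and each of its columns, which is what makes the labeled lookups possible.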

One topic that made more sense after the labs was indexing. With a pandas Series, I can use dictionary-style indexing like mpg['Ana'], or I can use .loc to get values by label. With DataFrames, I practiced getting columns, rows, and specific values. I also learned that pandas lines up data by index label, not by position. That was important in the Series lab because one student was missing from the distance data, so pandas returned NaN for that student in the calculation.
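A minimal sketch of label-based indexing and index alignment follows. The student names, the distance Series, and the numbers are assumptions for illustration, not the actual lab data.

```python
import pandas as pd

# Hypothetical data: miles per gallon for three students
mpg = pd.Series({"Ana": 30.0, "Ben": 25.0, "Cal": 40.0})

# Hypothetical data: distance driven, with Ben missing
distance = pd.Series({"Ana": 120.0, "Cal": 200.0})

# Dictionary-style indexing and .loc both look up by label
by_key = mpg["Ana"]
by_loc = mpg.loc["Ana"]

# Arithmetic matches values by index label, not by position.
# Ben has no distance, so his result is NaN.
gallons = distance / mpg
print(gallons)
```

If pandas matched by position instead, Cal's distance would have been divided by Ben's mpg, which would be silently wrong. The NaN is a signal that a label was missing on one side.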

I also learned about aggregation, which seems like one of the most important skills so far. Simple aggregation uses functions like .mean(), .median(), .min(), and .max() to summarize data. Grouping with .groupby() is more powerful because it lets me compare groups. For example, I can find the average trip length by age group, or the median age by user type. This helped me see that data science is not just about calculating one number but about comparing patterns between different categories.
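Here is a small sketch of simple aggregation next to grouped aggregation. The DataFrame, its column names (age_group, user_type, age, trip_minutes), and its values are hypothetical stand-ins for the trip data.

```python
import pandas as pd

# Hypothetical trip data, invented for illustration
trips = pd.DataFrame({
    "age_group": ["18-29", "18-29", "30-49", "30-49", "50+"],
    "user_type": ["Subscriber", "Customer", "Subscriber", "Subscriber", "Customer"],
    "age": [22, 27, 35, 41, 63],
    "trip_minutes": [10.0, 20.0, 12.0, 18.0, 30.0],
})

# Simple aggregation: one number for the whole column
overall_mean = trips["trip_minutes"].mean()

# Grouped aggregation: one number per category
mean_by_age_group = trips.groupby("age_group")["trip_minutes"].mean()
median_age_by_type = trips.groupby("user_type")["age"].median()

print(overall_mean)
print(mean_by_age_group)
print(median_age_by_type)
```

The pattern is always the same three steps: pick the grouping column, pick the column to summarize, then pick the summary function.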

Another big topic this week was continuous variables and distributions. I learned that a PDF (probability density function) shows the shape of a distribution, while a CDF (cumulative distribution function) shows the cumulative probability up to a certain value. I am starting to understand how to estimate probabilities from graphs, like estimating the probability that a penguin’s body mass is less than 5000 grams. I also learned that for a continuous variable, the probability of one exact value is 0, which felt strange at first but makes more sense now because probability is the area under the curve, and a single point has no width and therefore no area.
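The ideas above can be sketched with Python's built-in statistics.NormalDist. Treating body mass as a normal distribution with mean 4200 g and standard deviation 800 g is purely an assumption for illustration; real penguin data is not exactly normal and these parameters are made up.

```python
from statistics import NormalDist

# Assumed model (hypothetical parameters, not fitted to real penguin data)
body_mass = NormalDist(mu=4200, sigma=800)

# CDF: probability that body mass is less than 5000 g (about 0.84 here)
p_below_5000 = body_mass.cdf(5000)

# PDF: height of the curve at 5000 g. This is a density, not a probability.
density_at_5000 = body_mass.pdf(5000)

# Probability of landing in ever narrower intervals centered on 5000 g.
# As the width shrinks toward a single point, the probability shrinks toward 0.
widths = [100, 10, 1, 0.01]
p_near_5000 = [
    body_mass.cdf(5000 + w / 2) - body_mass.cdf(5000 - w / 2) for w in widths
]

print(round(p_below_5000, 3))
print(p_near_5000)
```

Only intervals have nonzero probability, so questions about a continuous variable are always phrased as "less than", "greater than", or "between".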

One thing I still find a little confusing is visually estimating probabilities from density plots. I understand that the area under the curve represents probability, but it is hard to know exactly how much area is on one side of a value just by looking. The CDF feels easier to read because the y-axis directly gives the probability. I think I need more practice connecting PDFs and CDFs together.
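One way to connect the two plots is to add up the area under the PDF numerically and compare it with the CDF at the same value. The sketch below does this with a simple trapezoid sum. The normal model and its parameters are assumptions, and area_under_pdf is a helper written just for this example, not a library function.

```python
from statistics import NormalDist

# Assumed model (hypothetical parameters, for illustration only)
body_mass = NormalDist(mu=4200, sigma=800)

def area_under_pdf(dist, lo, hi, steps=10_000):
    """Approximate the area under dist.pdf between lo and hi with trapezoids."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        left = lo + i * width
        right = left + width
        total += 0.5 * (dist.pdf(left) + dist.pdf(right)) * width
    return total

# Start far enough to the left that the remaining tail is negligible
far_left = body_mass.mean - 6 * body_mass.stdev

area_left_of_5000 = area_under_pdf(body_mass, far_left, 5000)
cdf_at_5000 = body_mass.cdf(5000)

print(round(area_left_of_5000, 4), round(cdf_at_5000, 4))
```

The two numbers agree to several decimal places, which is the sense in which the CDF's y-axis is a running total of the PDF's area.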

One question I still have is when it is better to use value_counts() compared to groupby(). I understand that value_counts() is good for counting categories, and groupby() is good for calculating statistics by group, but sometimes they feel like they overlap. Overall, pandas is starting to make sense, especially after doing the labs, but I still need more practice writing the expressions without constantly looking back at examples.
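A small sketch of where the two overlap and where they differ, using a made-up DataFrame (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical trip data, invented for illustration
trips = pd.DataFrame({
    "user_type": ["Subscriber", "Customer", "Subscriber", "Subscriber", "Customer"],
    "trip_minutes": [10.0, 20.0, 12.0, 18.0, 30.0],
})

# Counting rows per category: both approaches give the same counts.
# value_counts() sorts by count (largest first) by default;
# groupby(...).size() sorts by the group label.
counts_vc = trips["user_type"].value_counts()
counts_gb = trips.groupby("user_type").size()

# Summarizing a different column per category: this needs groupby()
mean_minutes_by_type = trips.groupby("user_type")["trip_minutes"].mean()

print(counts_vc)
print(counts_gb)
print(mean_minutes_by_type)
```

So the overlap is real when the only goal is counting. A rough rule of thumb: value_counts() for "how many of each", groupby() whenever another column needs to be summarized within each group.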

Suhaib's CS Journal

CST 383 - Intro to Data Science | Week 2
