November 5, 2024 · Mia Tanaka

Building BearViz: What I Learned Wrangling 2M Grade Records

A postmortem on building a grade distribution visualizer for Berkeley students — the interesting data pipeline problems, the D3.js lessons, and the one query that took 45 seconds before I fixed it.

Data EngineeringPostgreSQLD3.jsProjects

Last semester I scraped, cleaned, and visualized 8 years of UC Berkeley grade distributions. Over 2 million records. It's live at bearviz.netlify.app and about 400 Berkeley students have used it during enrollment season.

Here's what I learned.

The Data Was Messier Than I Expected

Berkeley publishes grade data publicly through their EdX system, but the format changed in 2019, department codes are inconsistently capitalized ("EECS" vs "eecs"), some instructor names appear six different ways, and about 3% of records have obviously wrong grade counts (more A's than enrolled students).

I wrote a normalization pipeline in Python that:

Canonicalized department codes via a hand-curated lookup table (there's no good automated way — "CS" and "COMPSCI" and "L&S CS" are all the same)
Used fuzzy matching on instructor names (rapidfuzz with a threshold of 92)
Flagged statistically impossible records for manual review

The lesson: data cleaning is 60% domain knowledge and 40% code. I couldn't have written those normalization rules without actually knowing how Berkeley departments work.

The Query That Took 45 Seconds

The main query for the course detail view was:

SELECT
  semester,
  AVG(gpa_average) as avg_gpa,
  SUM(enrollment) as total_enrolled,
  json_agg(grade_distribution) as distributions
FROM courses
WHERE dept = $1 AND course_num = $2
GROUP BY semester
ORDER BY semester DESC;

This was running 45 seconds on the full dataset. EXPLAIN ANALYZE showed a sequential scan on 2.3M rows.

The fix was a composite index on (dept, course_num). After adding it, the query ran in 12ms. The json_agg over a small filtered result set is fast; it was the scan that was killing it.

I've never looked at a query the same way since. Always check the query plan.

D3.js Is Hard for the Right Reasons

Most JavaScript libraries abstract away the rendering. D3 doesn't — it gives you a set of primitive transformations and makes you think in terms of data joins: for each datum, what element should exist?

The bar chart was straightforward. The animated transition between semesters was not. D3's enter/update/exit pattern made me think carefully about identity: when a bar "moves" from position 3 to position 1 as data updates, is it the same bar or a new bar? The answer determines whether you animate a transition or destroy-and-create.

I ended up keying on (dept, instructor, semester) and got smooth transitions that actually communicate data change rather than just flickering.

What I'd Do Differently

Use a materialized view for the aggregations. The raw data doesn't change — pre-computing department-level summaries would have cut query time further and simplified the backend.
Schema first, scraper second. I wrote the scraper before finalizing the schema, then had to rewrite parts when I realized I needed instructor-level granularity, not just course-level.
Test with production data volume earlier. I developed on a 10K row sample and discovered the 45-second query only after deploying to the full dataset. Would have been embarrassing if users had hit it first.

The project is open source if you want to dig into the code or the pipeline.