I am pleased to announce the 0.9.0 release of the data algebra. The data algebra is a realization of the Codd relational algebra for data, written in terms of Python method chaining. It allows the concise, clear specification of useful data transforms. Some examples can be found here. Benefits include […]

Estimated reading time: 1 minute

I would like to share another quick tutorial on some aspects of the data algebra, this time using the example of comparing two tables. Please check it out here.

Estimated reading time: 14 seconds

I have a new intermediate introduction on the data algebra up here: Using the data algebra for Statistics and Data Science. The data algebra is a tool for data processing in Python which is implemented on top of any of Pandas, Google BigQuery, PostgreSQL, MySQL, Spark, and SQLite. It allows […]

Estimated reading time: 37 seconds

I’ve thought of Pandas as an in-memory, column-oriented data structure with reasonable performance. If I need high performance or scale, I can move to a database. I like Pandas, and thank the authors and maintainers for their efforts. Now I kind of wonder what Pandas is, or what it wants […]

Estimated reading time: 4 minutes

Back to teaching. For a few years we’ve been running a data science intensive for a really neat FAAMG company. The idea is to give engineers some hands-on live workbook time using methods ranging from linear regression and xgboost to deep neural networks. Learning how participants progress and internalize […]

Estimated reading time: 1 minute

I’ve been tinkering a lot recently with the data_algebra, and just released version 0.7.0 to PyPI. In this note I’ll touch on what the data algebra is, what the new features are, and my plans going forward.

Estimated reading time: 10 minutes

Statistics is the science of relating summaries of observable samples to the unobserved summaries of the populations they are drawn from. I try to explain that with an example in this video. (link)

Estimated reading time: 22 seconds

Nina Zumel and John Mount will be speaking at the online University of San Francisco Seminar Series in Data Science! "How and why to use probability models to outperform decision rules." Friday, April 30, 2021, 12:30pm – 2pm Pacific Time. See here for full details and to RSVP. In this […]

Estimated reading time: 58 seconds