Powerful Things Quickly
As we iterate on TrovBase, one key consideration is the analysis scripts we provide. How should we construct an interface for best-practices analysis, and which choices should we offer as defaults versus as available options?
Our thinking on this is heavily informed by David Robinson's classic blog post. Bringing the techniques the experts use to each user's problem is what we are about as a platform, and that means putting the modern exploratory data analysis workflow in front of users from the start. It also restricts users in some ways we like: our canonical example is that we don't support pie charts, and instead encourage stacked bar graphs for proportions.
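To make that last choice concrete, here is a minimal matplotlib sketch of a stacked bar comparing proportions across groups, the kind of plot we encourage instead of a pie chart. The survey categories and numbers are illustrative, not real data.

```python
import matplotlib.pyplot as plt

# Illustrative proportions for two survey waves (made-up numbers).
# A pie chart would need one circle per wave; a stacked bar puts the
# waves side by side so the proportions can be compared directly.
categories = ["Agree", "Neutral", "Disagree"]
waves = {"2022": [0.45, 0.30, 0.25], "2023": [0.55, 0.25, 0.20]}

fig, ax = plt.subplots()
labels = list(waves)
bottoms = [0.0] * len(labels)
for i, category in enumerate(categories):
    values = [waves[w][i] for w in labels]
    ax.bar(labels, values, bottom=bottoms, label=category)
    bottoms = [b + v for b, v in zip(bottoms, values)]
ax.set_ylabel("Proportion of respondents")
ax.legend()
plt.show()
```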
We love graphical exploratory data analysis (EDA) at TrovBase, and with that come accessibility concerns. Rather than overwhelming the user with the complication of a color picker, we prefer just giving them viridis, which is colorblind-safe, with the option to import a theme for more advanced users. Plotting packages (for understandable reasons) don't serve up alt text when exporting images, but that's no reason for us not to. Finally, we can bring good typefaces, font sizes, and padding choices by default; we don't want users to have to remember or copy boilerplate to get, say, the benefits of pyplot.tight_layout. Some of these choices are made pretty well in existing libraries like ggplot and seaborn, but our awareness of the data being plotted, along with less need for backwards compatibility, means we can do things like have bar plots be sideways by default or ensure that vertical axis labels are readable without having to turn your head.
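A minimal sketch of what bundling these defaults might look like, assuming a matplotlib backend; the wrapper name, signature, and alt-text handling here are hypothetical, not TrovBase's actual API.

```python
import matplotlib.pyplot as plt

def trovbase_barplot(labels, values, title, alt_text, path="plot.png"):
    """Hypothetical wrapper that bakes in the defaults described above."""
    fig, ax = plt.subplots()
    # viridis is colorblind-safe; sample a single color from it.
    ax.barh(labels, values, color=plt.cm.viridis(0.6))  # sideways bars by default
    ax.set_title(title)
    fig.tight_layout()  # applied automatically, so users never have to remember it
    # Exported images can't carry alt text in the usual sense, but we can
    # at least embed a Description field for the platform to surface later.
    fig.savefig(path, metadata={"Description": alt_text})
    return fig

trovbase_barplot(
    ["Kenya", "Brazil", "Norway"],
    [42, 57, 63],
    title="Survey coverage by country (illustrative numbers)",
    alt_text="Horizontal bar chart of survey coverage for three countries.",
)
```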
Once we get beyond the most basic graphical EDA, having the schema be a first-class citizen for data projects means we can be sensible about which kinds of variables should have what functionality. Listing correlations for numeric variables is the obvious example here, but we can go a bit deeper for things like showing an informative warning if a user wants us to serve a script for a t-test where the independent variable takes more than two values. A major inspiration for us here is the dfSummary function in the summarytools R package, which allows users to inspect a dataset with histograms of numeric variables and most common values where those make sense, a vast improvement over the now-antiquated tools that use numeric coding for all kinds of variables by default.
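The t-test warning is easy to sketch. Here is the kind of schema-aware check we mean, written in Python with pandas; the function name and the DataFrame framing are our illustration, not TrovBase's actual internals.

```python
import warnings
import pandas as pd

def t_test_is_sensible(df: pd.DataFrame, group_col: str, outcome_col: str) -> bool:
    """Hypothetical pre-check before serving a t-test script."""
    n_levels = df[group_col].nunique(dropna=True)
    if n_levels != 2:
        warnings.warn(
            f"A t-test compares exactly two groups, but {group_col!r} takes "
            f"{n_levels} distinct values; consider a one-way ANOVA instead."
        )
        return False
    if not pd.api.types.is_numeric_dtype(df[outcome_col]):
        warnings.warn(f"{outcome_col!r} is not numeric, so a t-test does not apply.")
        return False
    return True
```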
One thing that our early users are really excited about, to our initial surprise, is how we can integrate these with our reshape functionality. A reactive reshape that previews the outcome without executing code is plenty useful on its own, but because the choices behind one's data extract are ones TrovBase can see, we can provide EDA scripts for the reshaped form of the data! So if, for example, a political scientist has a TrovBase dataset of issue polling results by country and year, organized according to tidy data principles with each observation being a country, year, issue item, and corresponding value, then they can use TrovBase to filter to a specific year, have each country be its own column, and then generate a script that produces a scatter plot of how two countries' issue polling correlates. Making that workflow easy is what our emphasis on schema allows, and we're excited to bring that to researchers everywhere.
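Here is a sketch of the script that workflow might generate, using pandas; the file name, column names, year, and countries are placeholders for the user's actual extract.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Tidy input: one row per country, year, and issue item (placeholder names).
df = pd.read_csv("issue_polling.csv")  # columns: country, year, issue, value

# Filter to one year, then pivot so each country becomes its own column.
wide = (
    df[df["year"] == 2020]
    .pivot(index="issue", columns="country", values="value")
)

# Scatter plot of how two countries' issue polling correlates.
fig, ax = plt.subplots()
ax.scatter(wide["France"], wide["Germany"])
ax.set_xlabel("France")
ax.set_ylabel("Germany")
ax.set_title("Issue polling by item, 2020")
fig.tight_layout()
plt.show()

print(wide[["France", "Germany"]].corr())
```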
Want to learn more about what we are doing at TrovBase? Check out our Notion page.
If you enjoyed this post, you are probably a great candidate for TrovBase. We are still accepting users into our early cohort. Join the wait list here.