
Hands-On: Statistics Worksheets and Cards

In the last hands-on lesson, we saw the beginnings of how to prepare data with a Prepare recipe. We also saw how tools like the Analyze window can reveal the distribution of a column.

You can also use the Statistics tab of a dataset to perform more in-depth exploratory data analysis (EDA). The Statistics tab allows you to generate statistical reports on your data by creating Worksheets, and Cards within those worksheets.

Interactive Statistics

Let’s create a worksheet with cards that perform common EDA tasks. For example, suppose we want a side-by-side summary of the pages_visited, tshirt_category, and total variables in the orders_prepared dataset.

With the orders_prepared dataset open:

After making a selection, Dataiku DSS automatically selects the statistical “Options” (in the third panel of the window) that are appropriate for the numerical variables (pages_visited and total) and the categorical variable (tshirt_category). You can deselect any of these options if you so choose.

Dataiku DSS creates a card with one section for each variable. The type of statistical chart and descriptive statistic in each section depends on whether the variable is categorical or numerical.

For example, tshirt_category, a categorical variable, has a bar chart (or categorical histogram), while pages_visited and total each have a numerical histogram with a box plot inset. Likewise, the quantile table applies to the numerical variables, while the frequency table applies to the categorical variable.
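To make the difference between the two kinds of summaries concrete, here is a minimal sketch in plain Python of a quantile table for a numerical column and a frequency table for a categorical one. It is not how Dataiku DSS computes its cards; the function names, the sample rows, and the category labels are all invented for illustration.

```python
from collections import Counter
from statistics import mean, quantiles


def summarize_numerical(values):
    """Return a small quantile table (plus the mean) for a numerical column."""
    # method="inclusive" treats the data as covering the full range,
    # so the quartiles always fall between min and max.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": mean(values),
    }


def summarize_categorical(values):
    """Return a frequency table: category -> (count, share of rows)."""
    counts = Counter(values)
    n = len(values)
    return {cat: (count, count / n) for cat, count in counts.most_common()}


# Hypothetical rows standing in for orders_prepared; the values are made up.
rows = [
    {"pages_visited": 3, "tshirt_category": "Category A", "total": 20.0},
    {"pages_visited": 7, "tshirt_category": "Category B", "total": 35.5},
    {"pages_visited": 5, "tshirt_category": "Category A", "total": 18.0},
    {"pages_visited": 9, "tshirt_category": "Category C", "total": 72.0},
    {"pages_visited": 4, "tshirt_category": "Category A", "total": 24.0},
]

for column in ("pages_visited", "total"):
    print(column, summarize_numerical([row[column] for row in rows]))

print("tshirt_category", summarize_categorical([row["tshirt_category"] for row in rows]))
```

The choice of summary follows the variable type, just as in the card: quantiles only make sense for ordered numerical values, while counts and shares are the natural summary for categories.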

By default, Dataiku DSS computes worksheet statistics on a sample of the first records in your dataset. You can configure this setting by clicking the drop-down arrow next to Sampling and filtering.
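Why the sampling setting matters can be seen with a small sketch. The code below is plain Python, not Dataiku's API, and the helper names and data are hypothetical. It assumes a dataset that happens to be sorted by value, in which case a sample of the first records gives a very different mean than a random sample.

```python
import random
from itertools import islice
from statistics import mean


def first_records(records, n):
    """Take the first n records, in stored order."""
    return list(islice(records, n))


def random_records(records, n, seed=0):
    """Take n records uniformly at random (seeded for repeatability)."""
    return random.Random(seed).sample(list(records), n)


# Made-up column of 1,000 values that is sorted in ascending order.
totals = list(range(1, 1001))

head_sample = first_records(totals, 100)
rand_sample = random_records(totals, 100)

print("mean of all records:   ", mean(totals))
print("mean of first 100:     ", mean(head_sample))
print("mean of random 100:    ", mean(rand_sample))
```

If your dataset has a meaningful order (for example, by date or by amount), statistics computed on the first records may not represent the whole dataset, so it is worth reviewing the sampling setting before interpreting a card.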

We may also be interested in checking whether the total variable follows an exponential distribution. The interactive statistics feature allows you to estimate the parameters of univariate probability distributions using the Fit Distribution card.

Dataiku DSS creates a card that shows the exponential distribution fitted to the data. The card also includes a Q-Q plot that compares the quantiles of the data against the quantiles of the fitted distribution. If the data followed the fitted distribution, the points would lie close to the identity line; points far from that line suggest that the data are unlikely to have been drawn from an exponential distribution.
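The idea behind the fit and the Q-Q plot can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Dataiku's implementation: it assumes a maximum-likelihood fit of an exponential distribution with its location fixed at zero (so the estimated rate is 1 / mean), and it uses the common (i - 0.5) / n plotting positions. The function names and the simulated data are hypothetical.

```python
import math
import random
from statistics import mean


def fit_exponential(values):
    """Maximum-likelihood rate of an exponential distribution starting at 0."""
    return 1.0 / mean(values)


def qq_points(values, rate):
    """Pair each sorted observation with the matching fitted quantile.

    Returns a list of (theoretical_quantile, observed_value) tuples.
    """
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i, observed in enumerate(ordered, start=1):
        p = (i - 0.5) / n  # plotting position, strictly between 0 and 1
        theoretical = -math.log(1.0 - p) / rate  # exponential inverse CDF
        points.append((theoretical, observed))
    return points


# Simulated stand-in for the total column; these values are not real data.
rng = random.Random(42)
totals = [rng.expovariate(0.05) for _ in range(500)]

rate = fit_exponential(totals)
points = qq_points(totals, rate)

# Largest gap between a point and the identity line y = x.
max_gap = max(abs(observed - theoretical) for theoretical, observed in points)
print("estimated rate:", rate)
print("largest deviation from the identity line:", max_gap)
```

Because the simulated values really are exponential, the points should track the identity line closely except in the sparse upper tail. Running the same functions on a column that is not exponential would produce points that bend away from the line, which is the pattern to look for in the card's Q-Q plot.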

To learn about the full capabilities in the Statistics tab, see the Interactive statistics section of the reference documentation.

What’s next?

This was just a brief introduction to the kinds of statistical analyses we can easily perform in Dataiku DSS. Now let’s continue building our Flow.


