One of the best ways to improve performance is to minimize the size of your data. During ingestion, Druid gives you the ability to perform aggregations on your data that can significantly reduce its size and improve performance. Your goal is to maintain your ability to answer all of the questions that you want to answer, while reducing the size of your data through aggregation and probabilistic approximation. You may need to iterate to find the tradeoffs that will give you the best speed with the least loss of information.
The Configure Schema ingestion stage gives you the power to make these tradeoffs. During this stage you can:
- Specify what dimensions and metrics are included in or excluded from your dataset
- Aggregate measures to a coarser level of time detail
- Create new measures based on probabilistic approximations. For example, you can create a measure that approximates a count or a unique count of a dimension (a sketch of how these choices map onto the ingestion spec follows this list)
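Behind the scenes, the console turns these choices into an ingestion spec. As a rough, illustrative sketch (placeholder values only, not the exact spec this tutorial produces), the three choices map onto the `dimensionsSpec`, `granularitySpec`, and `metricsSpec` sections of the spec's `dataSchema`:

```json
{
  "dataSchema": {
    "dimensionsSpec": { "dimensions": ["<dimensions to keep>"] },
    "granularitySpec": { "queryGranularity": "hour", "rollup": true },
    "metricsSpec": [ { "type": "count", "name": "count" } ]
  }
}
```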
The remainder of this section will walk you through an example to show you the power of this functionality.
We'll load the same dataset that you loaded in the Using Druid section. To get started, please walk through the steps in that section, up through step 8, which brings you to the Configure Schema stage with the Sample data.
Review the names of the columns. Note that the time column is listed first, followed by a number of dimensions (turquoise background), followed by measures (gold background).
The `count` measure is special and is added by Druid. It represents the number of rows that have been aggregated. Currently there is no aggregation, so you'll see `1` in every row.
The measures that follow `count` represent aggregations of the measures in the original data. For example, the wikipedia data contains five such columns, one of which is `deltaBucket`. Notice that each of these is prefixed by the string `sum_`. This indicates that as aggregation occurs, the value of these measures will be aggregated by summing. If this doesn't make sense yet, hold on!
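For reference, these measures appear in the ingestion spec as aggregators in the `metricsSpec`. A simplified fragment showing the row count and two of the `sum_`-prefixed measures (only part of the full list) might look roughly like this:

```json
{
  "metricsSpec": [
    { "type": "count", "name": "count" },
    { "type": "longSum", "name": "sum_added", "fieldName": "added" },
    { "type": "longSum", "name": "sum_deltaBucket", "fieldName": "deltaBucket" }
  ]
}
```

Here `longSum` is the Druid aggregator that adds values together at ingestion time, which is exactly what the `sum_` prefix is signaling.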
Notice the date column on the far left. This wikipedia data is one day's worth of data, at millisecond granularity. By default, the Query granularity in the right panel is set to `Hour`. Set it to `None` and view the time column, which should now show milliseconds (you can drag the column border to make the column wider). A common method of aggregating is to aggregate to a coarser level of time, such as the hour. However, aggregation can only occur if all other dimension values are the same. For example, if we have one row with a time of 2016-06-27T00:00:01 and another with a time of 2016-06-27T00:00:02 (second one and second two of 2016-06-27), we can aggregate those two rows if all of their other values are the same. In this data set, the `page` and `diffUrl` fields are different for every single row, so in order to achieve aggregation by time, we must remove those columns. Let's do that next.
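Before we do, a quick reference point: the Query granularity setting corresponds to the `queryGranularity` field in the spec's `granularitySpec`. A fragment configured for hourly rollup would look roughly like this (the `segmentGranularity` value here is just an illustrative choice):

```json
{
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "day",
    "queryGranularity": "hour",
    "rollup": true
  }
}
```

Setting the dropdown to `None` corresponds to `"queryGranularity": "none"`, which keeps the original millisecond timestamps.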
With Query granularity set to `None`, click on the `diffUrl` column and then click the red trash can in the lower right. Then do the same thing for the `page` column. We still aren't aggregating because our time dimension is using milliseconds. You can see this by scrolling to the count measure and noticing that it contains `1` in every row. Next, we will aggregate by setting our query granularity to hour.
Select `Hour` for Query granularity, then scroll over and view the count column. Note that it now contains a `5` in row 5, just to the right of user `Kolega2357`. Here you are seeing the effect of our aggregation. Scroll back to the time column and you will see that all rows show hour 0 of the date 2016-06-27. The `5` in the count for `Kolega2357` indicates that in that hour, there were five rows that "rolled up" to create the row with the count of 5. The measures to the right show similar aggregation. For example, `sum_added` now reflects the sum of the `added` field for those 5 rows.
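To make the arithmetic concrete, here is a hand-made illustration of what rollup does. The timestamps and `added` values below are invented (the real data has five such rows for this user in hour 0), but the mechanics are the same: rows that share an hour and all other dimension values collapse into one row, with `count` recording how many rows were combined and `sum_added` recording their summed `added` values.

```json
{
  "rawRows": [
    { "__time": "2016-06-27T00:01:12.000Z", "user": "Kolega2357", "added": 13 },
    { "__time": "2016-06-27T00:05:44.000Z", "user": "Kolega2357", "added": 20 }
  ],
  "rolledUpToHour": [
    { "__time": "2016-06-27T00:00:00.000Z", "user": "Kolega2357", "count": 2, "sum_added": 33 }
  ]
}
```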
Play around with different query granularities and with removing other dimensions, and note how the aggregated measures change. Understanding how to aggregate your data so that it keeps the information that is valuable for your analytics while staying as compact as possible is key to fast query performance.
Sometimes you need to know the count or a unique count of a dimension, but you don't need the detailed values for the dimension. Let's demonstrate this with an example. If you have modified your data, go back to `Start` and 'Next' your way through until you get to `Configure Schema`. Next, remove all dimensions except for `user`. Looking at the display, and in particular the count column and the `countryName` column, we can see that there are 2 unique users that made wiki comments from Argentina (users 22.214.171.124 and 126.96.36.199) and 13 unique users that made comments where the `countryName` was `null`. One user, `Kolega2357`, made 5 comments where all of the attributes (comment string, etc.) were the same, so those were rolled up to a single row.
Let's add a new metric that represents the unique number of users. In the right panel, select `Add metric`. Set the name to `unique_users` and the field to `user`.
Click `Apply` and you should now see your new `unique_users` field at the far right of the measures.
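In ingestion-spec terms, this new measure is a sketch-based aggregator built on the `user` column. The exact aggregator type the console chooses can vary by Druid version; one common option, from the druid-datasketches extension, is a Theta sketch, which would look roughly like this:

```json
{
  "metricsSpec": [
    { "type": "thetaSketch", "name": "unique_users", "fieldName": "user" }
  ]
}
```

Whatever the sketch type, the stored value is a compact probabilistic summary rather than the raw user values, so it yields an approximate (not exact) distinct count.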
Click `Next: Partition`, then `Next: tune`, and then `Next: publish`. Choose the name of your new dataset. If you don't want to overwrite a previous copy, choose a different name. Click `Submit`, and when the job is complete, click `Query` to go to the Druid SQL Console.
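Once the new datasource is available, you can get an approximate unique-user count by aggregating the sketch column. As one hedged example, assuming the datasource is named `wikipedia-rollup` and the `unique_users` column is a Theta sketch, the equivalent native (JSON) query would look roughly like this; in the SQL console you would use the corresponding approximate-count-distinct function instead:

```json
{
  "queryType": "timeseries",
  "dataSource": "wikipedia-rollup",
  "intervals": ["2016-06-27/2016-06-28"],
  "granularity": "all",
  "aggregations": [
    { "type": "thetaSketch", "name": "unique_users", "fieldName": "unique_users" }
  ]
}
```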