MicroStrategy ONE
Additional Examples of Functions in Expressions
Hypothesis Testing Example
The reports and report objects in this example can be found in the following folder: MicroStrategy Tutorial\Public Objects\Reports\ MicroStrategy Platform Capabilities\Advanced Analytics\ Statistics and Forecasting\Hypothesis Testing
Which Call Centers had a Statistically Significant Increase in the Average Daily Sales in Recent Years?
To answer this question, you must first find the average of daily sales for recent years for each call center. You accomplish this using the Average Daily Sales metric, defined as:
Average Daily Sales = Avg(Sum(Revenue) {~+, Day+} ) {~+}
This is a simple metric that contains the nested group-value functions Sum and Avg.
Next, create a report that uses this metric, placing the Call Center attribute on the row axis and the Year attribute on the column axis. A report with this definition is shown below.
In this report it looks like all call centers had a significant increase in the average daily sales in each successive year. However, an average value by itself does not represent the complete picture.
To get more information, you can check the standard deviation of daily sales between each successive year for each call center. You can accomplish this by adding Standard Deviation in Daily Sales to the report. The metric is defined as:
StDev Daily Sales = Stdev(Sum(Revenue) {~+, Day+} ) {~+}
In statistics, standard deviation is a value which shows how widely a set of values differs from the mean.
The resulting report provides the average and the standard deviation of daily sales for all years, for each call center.
This report shows that all call centers have greater average daily sales in each successive year. Additionally, the standard deviation of daily sales is higher for each successive year, with a number of exceptions. A couple of these exceptions include Atlanta between 2009 and 2010, as well as San Francisco between 2008 and 2009. The higher standard deviation values mean that there is more volatility in these results. Therefore, the higher average daily sales in each successive year are not necessarily a progressive increase over previous years; they may have been caused by random fluctuations.
Based on this observation, you next need to find out which call centers have a statistically significant difference in their average daily sales between these successive years. You can get the result by testing the hypothesis that the average daily sales are the same, against the hypothesis that the average daily sales are significantly different. This must be tested for each call center.
The hypothesis testing is accomplished by computing the p-value. In statistics, p-value is the probability of making a decision to reject a fact, given that the fact is correct. In the context of this example, it is the probability of making a mistake by concluding that the average daily sales in 2008 is significantly different from 2009, given that actually they are about the same. In general, you want to restrict this type of error so that it is smaller than a certain tolerance level. This tolerance level is usually set to between 2.5% to 10.0%.
If you assume that the standard deviations of daily sales for all years are the same, then you can use a function called HomoscedasticTTest to compute the p-value. If you assume that the daily standard deviations are different, then you must use the function HeteroscedasticTTest. The following examples consider both of these assumptions and compute the p-value using HomoscedasticTTest and HeteroscedasticTTest.
This report uses simple metrics based on other metrics, because to calculate the p-value, you need to compute the daily sales for each call center for each day. You also need to group the fact Revenue at the levels of Day and Call Center for all years. The fact table DAY_CTR_SLS
is available at the levels of Employee, Order, and Day.
Call Center is related to the fact table via Employee.
You need a table with a structure similar to the following:
Call Center | Day | Daily Sales (last year) | Daily Sales (current year) |
Northwest |
1 |
123456 |
123456 |
... |
123456 |
123456 |
|
N |
123456 |
123456 |
|
Southeast |
1 |
123456 |
123456 |
... |
123456 |
123456 |
|
N |
123456 |
123456 |
You need to create a temporary table with this structure using metrics. These metrics are then used in simple metrics to calculate the p-value.
Follow the steps below to build the necessary metrics for the Hypothesis Testing report.
- Build a metric to create the column for [Daily Sales (current year)] in the temporary table. This metric is defined as:
Daily Sales (current year)= Sum(Revenue) {~+, Day+} <[Current year in sample DB]; @2; ->
The default level notation {~+} is placed on the metric since the default is to group by an attribute on the report. In this example, that report attribute is Call Center. The metric condition
Current year in sample DB
is also applied to calculate the data for the current year available in the MicroStrategy Tutorial data. -
Build a metric to create the column for [Daily Sales (last year)] in the temporary table. This metric is defined as:
Daily Sales (last year) = Sum(Revenue) {~+, Day+} <[Last year in sample DB]; @2; -> | [Last Year's] |
This metric uses a metric condition and a transformation.
- The metric condition
Last year in sample DB
is applied to calculate the data for the previous year available in the MicroStrategy Tutorial data. - The transformation
Last Year's
is necessary because attribute Day is a child of attribute Year. To show Daily Sales in Day 1, 2, …, N of last year together with Daily Sales in Day 1, 2, …, N of the current year, you must use the transformation to supply this data.
- The metric condition
-
Build the simple metrics to compute the p-value for each call center by using the metrics above. The simple metrics are defined as follows:
HeteroscedasticTTest([Daily Sales (last year)], [Daily Sales (current year)]) {~+}
HeteroscedasticTTest([Daily Sales (last year)], [Daily Sales (current year)]) {~+}
Create a report with the two metrics shown above (for p-value) and the attribute Call Center. The Hypothesis Testing report is shown below.
The resulting report, Hypothesis Testing, is shown above. All call centers, except for Atlanta, Salt Lake City, Seattle, and Fargo, have a p-value of less that 5%. This indicates that the probability of making an error in concluding that the sales have significantly increased is low for all but these four call centers.
Additionally, even Salt Lake City and Fargo are at 20% or lower. This is strong evidence that average daily sales for the majority of the call centers can be attributed to a steady increase, rather than random fluctuations.
The SQL generated for the report Hypothesis Testing is displayed below, along with a summary of the actions taken at the end of the SQL statement:
Pass0 - Execution Duration:0:00:00.06 create table ZZTH0FDXLS4MD000 ( DAY_DATE TIMESTAMP, CALL_CTR_ID SHORT, WJXBFS1 DOUBLE) Pass1 - Execution Duration:0:00:00.34 insert into ZZTH0FDXLS4MD000 select a12.[DAY_DATE] AS DAY_DATE, a11.[CALL_CTR_ID] AS CALL_CTR_ID, a11.[TOT_DOLLAR_SALES] AS WJXBFS1 from [DAY_CTR_SLS] a11, [LU_DAY] a12 where a11.[DAY_DATE] = a12.[LY_DAY_DATE] and a12.[YEAR_ID] in (2010) Pass2 - Execution Duration: 0:00:00.00 create table ZZTU1RY0YSKMD001 ( DAY_DATE TIMESTAMP, CALL_CTR_ID SHORT, WJXBFS1 DOUBLE) Pass3 - Execution Duration: 0:00:00.03 insert into ZZTU1RY0YSKMD001 select a11.[DAY_DATE] AS DAY_DATE, a11.[CALL_CTR_ID] AS CALL_CTR_ID, a11.[TOT_DOLLAR_SALES] AS WJXBFS1 from [DAY_CTR_SLS] a11, [LU_DAY] a12 where a11.[DAY_DATE] = a12.[DAY_DATE] and a12.[YEAR_ID] in (2010) Pass4 - Execution Duration: 0:00:00.07 select pa11.[CALL_CTR_ID] AS CALL_CTR_ID, a13.[CENTER_NAME] AS CENTER_NAME, pa11.[DAY_DATE] AS DAY_DATE, pa11.[WJXBFS1] AS WJXBFS1, pa12.[WJXBFS1] AS WJXBFS2 from [ZZTH0FDXLS4MD000] pa11, [ZZTU1RY0YSKMD001] pa12, [LU_CALL_CTR] a13 where pa11.[CALL_CTR_ID] = pa12.[CALL_CTR_ID] and pa11.[DAY_DATE] = pa12.[DAY_DATE] and pa11.[CALL_CTR_ID] = a13.[CALL_CTR_ID] Pass5 - Execution Duration: 0:00:00.00 [Analytical SQL calculated by the Analytical Engine: select CALL_CTR_ID, CENTER_NAME, HomoscedasticTTest(WJXBFS1, WJXBFS2), HeteroscedasticTTest(WJXBFS1, WJXBFS2) from [previous pass] ] Pass6 - Execution Duration: 0:00:00.00 [Populate Report Data] Pass7 - Execution Duration: 0:00:00.06 drop table ZZTH0FDXLS4MD000 Pass8 - Execution Duration: 0:00:00.01 drop table ZZTU1RY0YSKMD001
- Pass0 and Pass1 are issued by Intelligence Server to compute the metric Daily Sales (last year). The Intelligence Server prepares the temporary table with Call Center and Day as its key. Then, it retrieves last year's sales using the transformation Last Year's.
- Pass2 and Pass3 are issued to compute the metric Daily Sales (current year). Attributes Call Center and Day are used as keys to the temporary table. This pass is similar to Pass0 and Pass1 with the key difference being that the metric does not have a transformation.
- In Pass4 and Pass5, the Intelligence Server computes the p-value for each call center, using the HeteroscedasticTTest and the HomoscedasticTTest functions.
- The remaining passes perform final report preparation and drop the temporary tables.