
11.3 Data Sets & Crazy Correlations

A. Types

Now that we have seen how flawed the Data is, let us look at the Data Sets that make up our study.

 

 

 

Data Sets

Characteristics   DS#1      DS#2    DS#3       DS#4      DS#5     DS#6
Began             1976      1977    1981       1981      1989     1989
End               2001      1999    1999       2001      2001     1999
Years             25        22      18         20        13       11
Data Points       306       264     222        248       156      130
Name              Longest   Sleep   Complete   Viewing   Modern   Modern w/Sleep


The Chart above enumerates the salient features of each Data Set. These Data Sets were not chosen arbitrarily. We will discuss the advantages and disadvantages of each in turn.
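The Data Points row can be sanity-checked against the monthly recording described in this section. The sketch below counts months inclusively. The function name is hypothetical, the DS#1 dates (July 1976 through December 2001) are taken from the descriptions of the Old Set and the Modern Set later in this section, and the January start for DS#5 is an assumption, since the text only says "1989".

```python
def months_inclusive(start_year, start_month, end_year, end_month):
    """Count monthly data points from the start month through the end month."""
    return (end_year - start_year) * 12 + (end_month - start_month) + 1

# DS#1: July 1976 through December 2001 (dates stated in this section).
ds1_points = months_inclusive(1976, 7, 2001, 12)

# DS#5: ASSUMED to start in January 1989; the text gives only the year.
ds5_points = months_inclusive(1989, 1, 2001, 12)

print(ds1_points, ds5_points)
```

Under those assumptions the counts agree with the 306 and 156 Data Points listed for DS#1 and DS#5.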

Data Set #1: the Longest Set

Data Set #1, the Longest Set, includes all the longest Data Streams. However, it does not include Sleep, Talking, Viewing and Reading, because the accumulation of their Data Streams began much later. Its advantage is that it covers the longest history of interaction between the activities that make it up. While Kid Time, Management/Consulting Time, and Jewelry Time started much later, they all existed as an exact zero at the start of this Set, so they are included in the set of Actions that make up the Entire Set. The accumulation of each of these Data Streams began when its Action began. This was not true of the 4 Actions that are not included in this Set: Sleep, Talking, Viewing and Reading were all going on long before they began to be recorded.

Data Set #2: the Sleep Set

Data Set #2, the Sleep Set, starts when the recording of Sleep Time began, includes all the Actions that were being recorded at that point, and ends when the recording of Sleep Time ended. The only Actions not included are Talking and Viewing, whose accumulation started later. This Set stops in November 1999, when the accumulation of Sleep Time ceased for reasons mentioned above. Again, one of the advantages of this Data Set is its length, which gives a long-term perspective. The disadvantage is that not all Actions are included. The beauty of this Set is that it starts significantly before the actual non-zero beginning of Kid Time and Management Time, so it is a good set for studying the effect of those Actions upon our Person.

Data Set #4: the Viewing Set

Data Set #4, the Viewing Set, starts when the recording of Viewing began and continues to the end of Data Accumulation, at least for now. Viewing was the last Action to be included in the Data Collection. Therefore this Set includes all of the Actions except Sleep, whose accumulation stopped at the end of 1999. The advantage of this Set is its length combined with nearly all the Actions. One can at last study the relation of Viewing and Talking to the rest, which was impossible before.

Data Set #3: the Complete Set

Data Set #3 has the same beginning as DS#4, i.e. the beginning of Viewing, but ends when the recording of Sleep ended. Therefore this Data Set is the longest Data Set that includes all the Actions.

Data Set #5: the Modern Set

Data Set #5, the Modern Set, begins in 1989 and continues to the Present, or at least until the last Data was accumulated, i.e. December 2001. The advantage of this Data Set is that its Data was accumulated in a more consistent fashion than before. Prior to 1989, some categories merged or split, more guesstimations were made, and the categories weren’t as clear. After this point all the categories are pure. The disadvantages of this Set are that Sleep isn’t included, because of its premature end, and that it isn’t as long. It certainly can tell us nothing about our Subject’s Management Time or the beginning of Kid Time.

Data Set #6: the Modern Sleep Set

Data Set #6, the Modern Sleep Set, begins in 1989 and stops in November of 1999 so as to include Sleep in its calculations. The advantage of this Set is the consistency of the Data Collection along with the fact that all Actions are included. Because we have been discussing Sleep and Creative Time in this paper, this is the Data Set we have chosen to examine. As mentioned, other Data Sets would be more appropriate for other types of examination.

Data Set #7: the 4 Year Set

While the above sets are the primary sets that we will consider, there are two other sets worth mentioning. The first is the 4 Year Set. It includes only the 4 years of monthly Data from 1989 -> 1992. This Data Set is the one that was analyzed thoroughly in the experiments of 1993-4. The Experimenter started with 1989 for the same reasons as above: he felt that this is when the consistent data keeping really began. (Not.) As we shall see, this 4-year period bears some consistency with the other Data Sets and also shows some major differences. We shall explore these supposed anomalies.

Data Set #8: the Old Set

This Set is the exact opposite of the Modern Set. It contains all the Data from before 1989 back to the beginning, i.e. from July 1976 -> December 1988. Again, this Set is interesting for comparison, especially with the Modern Set.

B. Data Set consistency

For the reasons mentioned above, we have 8 different Data Sets that we will be considering. Some interesting results emerged, which we would like to discuss. While most of our ‘results’ have to do with correlations and connections or lack thereof, the ‘results’ of this discussion have to do with the ambiguity of correlation, and the illusion of causality.

Art & Sex: an example of a spurious connection

Let us illustrate through example.

Scientist: “Art has a positive correlation with Sex, +21% over the period from 1989 - Nov. 1999. This correlation is based upon 130 readings. The level of significance is very high: the correlation is significant at the .02 level. This is the highest positive correlation with Art.”

One might think that Sex augments Art. Or that Art augments Sex. Which is the cause and which the effect? Does more Sex inspire Art or does the creation of Art stimulate Sex? And even if it could be determined that having Sex inspired Art, who is it that makes the choice of what to do?

The Experimenter forms theories: “It is almost as if they form a Right Brain Alliance. Perhaps Sex neutralizes the dominance of the Left Brain and allows Art to emerge. Certainly our Subject did no Art before he met his wife.”

Scientist: “You’re making far too much of this piece of Data. This strong positive correlation only exists in DS#6. In fact, not too much should be made of this positive correlation because it is not consistent throughout the Data Sets. For instance, in Data Set #2, the Sleep Set, there is a -11% Correlation between Art and Sex. (See the chart below for the correlations between the two streams in the other Data Sets.)”

 

Data Streams   DS#1   DS#2   DS#3   DS#4   DS#5   DS#6   Ave   StDev
Art vs. Sex    -5     -11    2      -1     10     21     3     11

 

In looking at the Correlations above, one sees that the relation between Art and Sex has no real consistency. The Average reading is 3% with a Standard Deviation of 11. This means that the relation between Art and Sex typically ranges from a little negative to a little positive. In other words, there is no continuous relation between the two that extends throughout the entire Data Stream. While statistically significant for a 10-year chunk of the Data, the correlation does not hold over other chunks of the Data Streams.
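The Ave and StDev columns can be reproduced from the six correlations in the chart. The sketch below uses Python's standard library. That the chart uses the sample (n-1) formula is an inference from the numbers, not something the text states.

```python
from statistics import mean, pstdev, stdev

# Art vs. Sex correlations (in percent) for DS#1 through DS#6, from the chart.
art_vs_sex = [-5, -11, 2, -1, 10, 21]

avg = mean(art_vs_sex)              # about 2.7, which rounds to 3
sample_sd = stdev(art_vs_sex)       # n-1 formula, about 11.4
population_sd = pstdev(art_vs_sex)  # n formula, about 10.4

print(round(avg), round(sample_sd), round(population_sd))
```

The sample formula rounds to the chart's 11, while the population formula rounds to 10. Either way the spread is several times larger than the average, which is the inconsistency described above.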

The Creative 3: consistently negative towards each other

For some of the interactions, the correlations are much more continuous. For instance, the interactions between the Creative 3 (Art, Writing and Science) have been consistently negative no matter which Data Chunk is used. (See the chart below.)

 

Data Streams        DS#1   DS#2   DS#3   DS#4   DS#5   DS#6   Ave   StDev
Art vs. Writing     -20    -16    -19    -22    -36    -34    -25   8
Write vs. Science   -17    -14    -17    -22    -47    -42    -27   14
Art vs. Science     -16    -17    -20    -18    -23    -30    -21   5

 

The average interaction for all three is in the -20% to -30% range. Further, every individual correlation is negative: the weakest is -14%, while the strongest is -47%. The Standard Deviation of each of the series is fairly low. Taking a sample of an even smaller chunk of the Data, we see the same trends continuing. Looking at Data Set #7, for the 4 years between 1989 and 1992, we see that Art & Writing have a -41% correlation, Writing & Science -21% and Art & Science -27%. (All of these correlations are within 2 Standard Deviations of the averages in the above chart.)
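The parenthetical claim about 2 Standard Deviations is simple arithmetic and can be checked directly. The sketch below uses the rounded Ave and StDev values from the chart and the Data Set #7 correlations quoted above. The dictionary layout and the function name are only illustrative.

```python
# Rounded chart values (percent) plus the 4-year (DS#7) correlations from the text.
chart = {
    "Art vs. Writing":   {"ave": -25, "stdev": 8,  "ds7": -41},
    "Write vs. Science": {"ave": -27, "stdev": 14, "ds7": -21},
    "Art vs. Science":   {"ave": -21, "stdev": 5,  "ds7": -27},
}

def sd_distance(row):
    """Distance of the DS#7 correlation from the average, in Standard Deviations."""
    return abs(row["ds7"] - row["ave"]) / row["stdev"]

distances = {name: sd_distance(row) for name, row in chart.items()}

for name, d in distances.items():
    print(f"{name}: {d:.2f} standard deviations from the average")
```

With these rounded inputs Art vs. Writing sits exactly on the 2 Standard Deviation boundary, while the other two pairs fall well inside it.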

Correlations between Data Streams aren’t necessarily continuous

The point is that the correlations between Data Streams aren’t necessarily internally consistent. Just because there is a high correlation over the entire 25 years of the Data Base does not mean that one will find a high correlation over every 4-year chunk of the Data. For instance, Sex and Kids have an incredibly high negative correlation, -83%, over the 25 years of the Longest Data Set #1, while having a very high positive correlation, +35%, over the 4-year chunk of Data from 1989-1992. The 4-year correlation is statistically significant at the .02 level, which again is very high.
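The quoted significance levels can be roughly checked from the correlation and the number of readings alone. The sketch below uses the Fisher z approximation, which is a stand-in chosen here and not necessarily the test the Experimenter used. It also assumes the monthly readings are independent, which time-series data often are not, so treat the p-values as optimistic.

```python
import math

def fisher_p_value(r, n):
    """Two-sided p-value for a Pearson correlation r from n readings,
    using the Fisher z approximation (assumes independent readings)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

# Art vs. Sex in DS#6: +21% over 130 readings.
p_art_sex = fisher_p_value(0.21, 130)

# Sex vs. Kids in the 4 Year Set: +35% over 48 monthly readings.
p_sex_kids_4yr = fisher_p_value(0.35, 48)

print(f"Art vs. Sex:  p = {p_art_sex:.3f}")
print(f"Sex vs. Kids: p = {p_sex_kids_4yr:.3f}")
```

Under these assumptions both p-values fall between .01 and .02, in line with the ".02 level" quoted in the text.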

Before continuing our assault on causality, let us understand this seeming anomaly.

C. Sex vs. Kids, a negative or positive correlation?

Why is the correlation between the Sex and Kid Data Streams so negative in the long-term study of 25 years while very positive in the short-term study of 4 years?

Standard Deviations, related to power of relation

One factor to note is the relative size of the Standard Deviations of the different sets. Below is a chart that shows the Averages and Standard Deviations for the two chunks of time. Notice that the Standard Deviations of the 25-year Data chunk are roughly 2.5 to 3 times greater than those of the 4-year chunk. Because the 4-year variation is so small by comparison, the positive correlation within the four-year chunk is swamped by the negative correlation over the 25-year chunk.

 

 

          4 Year Data Set       Longest Data Set #1
          (1989-1992)           (1976-2001)
          Average   St Dev      Average   St Dev
Sex       0.27      0.06        0.41      0.19
Kids      2.91      0.42        1.82      1.06

Scatter Plots of Kids vs. Sex over 25 years and over 4 years

Let us look at two scatter plots to see what is going on between these Data Streams over the years that would yield such an anomaly. The first chart below shows the scatter plot of the relation between Kids and Sex over the 4 years 1989 -> 1992. The slope is fairly high, 2.31, and distinctly positive. The correlation between the two Data Streams is a high +35% over this period, extremely significant.
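The slope and the correlation are tied together by the Standard Deviations: for a least-squares line, slope = r × (St Dev of y) / (St Dev of x). The sketch below applies this to the rounded 4-year values from the Standard Deviation chart. It assumes Kids is on the vertical axis and Sex on the horizontal axis, which the text does not state explicitly.

```python
# Rounded 4-year values from the Standard Deviation chart and the text.
r = 0.35        # correlation between Kids and Sex, 1989 -> 1992
sd_sex = 0.06   # St Dev of Sex (ASSUMED to be the x axis)
sd_kids = 0.42  # St Dev of Kids (ASSUMED to be the y axis)

# Least-squares slope of y on x implied by r and the two Standard Deviations.
slope = r * sd_kids / sd_sex

# St Dev of Sex that would reproduce the reported slope of 2.31 exactly.
implied_sd_sex = r * sd_kids / 2.31

print(f"slope from rounded values: {slope:.2f}")
print(f"St Dev of Sex implied by slope 2.31: {implied_sd_sex:.4f}")
```

The rounded inputs give a slope of about 2.45, close to the reported 2.31. The gap may come from rounding, since a St Dev of Sex near 0.064 would give 2.31 and would still be displayed as 0.06.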

However, if we look at a scatter plot of the relation between Kids and Sex over the whole 25-year period, the correlation and the direction of the line are completely changed.

Note that the 48 Data Points from the 4-year graph are all included among the over 300 Data Points of the 25-year graph. In the 25-year study the direction of the correlation is quite clearly negative. The 4-year period was an aberration, even though statistically significant in its own right. Viewed in the larger context, the positive correlation would be considered false and misleading. Any predictions made from that 4-year study would have been totally wrong.
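How can a subset of the points correlate positively while the full set correlates negatively? The sketch below reproduces the effect with invented numbers, not the Experimenter's Data. It builds three made-up eras in which the levels of Kids and Sex move in opposite directions, while within each era small wiggles move together.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# INVENTED illustration data: (Kids level, Sex level) for three eras.
eras = [(0.5, 0.80), (3.0, 0.30), (1.5, 0.55)]
wiggle = [-0.4, -0.2, 0.0, 0.2, 0.4]  # small within-era fluctuations

kids, sex = [], []
for kid_level, sex_level in eras:
    for d in wiggle:
        kids.append(kid_level + d)        # wiggles move Kids up and down
        sex.append(sex_level + 0.1 * d)   # and Sex slightly in the same direction

window = slice(5, 10)  # the middle era only, standing in for a short study
r_window = pearson(kids[window], sex[window])
r_overall = pearson(kids, sex)

print(f"short window: {r_window:+.2f}   whole record: {r_overall:+.2f}")
```

Within the short window the small co-movements produce a strongly positive correlation, yet over the whole record the large shifts in level dominate and the correlation is strongly negative.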

Small Variations

One clue that the correlation was bogus is the relatively small variation in the 4-year study versus the 25-year study. In the 4-year study Kid Time varies from 1.5 to 4 while Sex only varies from about 0.2 to 0.4. This contrasts with the 25-year study, where Kid Time varies from 0 to 4 and Sex varies from 0.2 to 0.8. The range of Kid Time is about 1.6 times, and the range of Sex about 3 times, what it was for the 4-year block of time. Both the Variation and the Standard Deviations in Kid Time and Sex Time are quite narrow in the 4-year study compared to the 25-year study.
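The comparison of spreads can be put into numbers. The sketch below takes the Standard Deviations from the chart and the approximate ranges quoted in the paragraph above, then divides the 25-year figure by the 4-year figure. The variable names are only illustrative.

```python
# (4-year value, 25-year value) for each Data Stream.
st_dev = {"Sex": (0.06, 0.19), "Kids": (0.42, 1.06)}

# ((4-year low, high), (25-year low, high)), approximate ranges from the text.
ranges = {"Sex": ((0.2, 0.4), (0.2, 0.8)), "Kids": ((1.5, 4.0), (0.0, 4.0))}

sd_ratio = {name: long / short for name, (short, long) in st_dev.items()}
range_ratio = {
    name: (long[1] - long[0]) / (short[1] - short[0])
    for name, (short, long) in ranges.items()
}

for name in st_dev:
    print(f"{name}: St Dev x{sd_ratio[name]:.2f}, range x{range_ratio[name]:.2f}")
```

On these figures the 25-year Standard Deviations are about 3.2 times (Sex) and 2.5 times (Kids) the 4-year ones, and the ranges are about 3 times and 1.6 times as wide.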

The point of all these numbers is that in certain chunks of time nothing much is going on, so slight changes can translate into high correlations when the actual effect is trivial, as we saw in the 4-year time segment examined above.

An area graph of Kids & Sex 1976 -> 2001

Not to belabor the point, but below is a graph which exhibits the points we’ve been making. Looking at the long-term perspective, it is easy to see that Kids immediately squashes Sex down. Further, it is easy to see that as Kid Time shrinks, Sex rises. However, it is very difficult, if not impossible, to see the positive correlation between Kids and Sex from 1989 to 1992. Even if it can be seen, it is more easily seen that this positive correlation has no significant reality to it.

4 Year Studies, potentially misleading, without greater context

The main reason we’ve spent so much time on this spurious result is not because of the importance of the relationship but because it’s a great example of misleading correlations. Four years is a fairly good amount of time for a study. After all, how many people have 25 years to wait around for a significant result? And yet the four-year period that was chosen was an accidental aberration that had no bearing on long-term trends. If the Experimenter had only these 4 years as a time period, he would have gone to publication with some statistically significant results that were bogus when seen from a long-term perspective.

Correlations must be balanced with other measures

While this is yet another way in which correlations can mislead, the aim is not to throw out correlations as a tool so much as to warn of unexpected mechanisms. This mechanism can be avoided by looking at other measures, such as variation and standard deviation, and by graphing to reveal the actual effect of the interaction. To reiterate, relying on correlation alone is like flying a windowless airplane with only one instrument.

 
