Features and Correctness¶
We want to be able to describe our populations of sequences and compare them. To do this we extract various distributions, which we call features. These features are designed so that they can be used to describe and measure correctness as a distance.
We define features as belonging to one of the following "domains":
- Structural features are designed to check for "structural zeros", i.e. outputs that should not be possible.
    - e.g. whether each sequence starts with a "home" activity
    - e.g. whether each sequence has a total duration of 24 hours
- Participation features check for the occurrence of activity types in a sequence; they can be presented as rates or probabilities.
    - e.g. whether each sequence participates in the "work" activity or not
- Transition features describe the ordering of activities within sequences (see the sketch after this list).
    - e.g. how many times a sequence transitions from "home" to "work"
- Times/scheduling features describe when activities take place.
    - e.g. the start time of all "shop" activities
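For intuition, a transition feature can be computed from a toy sequence as follows (a minimal sketch, not the caveat API):

sequence = ["home", "work", "home", "shop", "home"]
transitions = list(zip(sequence[:-1], sequence[1:]))
# count transitions from "home" to "work"
print(transitions.count(("home", "work")))  # 1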
We also use frequency features to describe the aggregate probability of an activity taking place in a given time bin. For example, "X% of agents are at work between 10am and 11am".
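A rough sketch of such a frequency feature, assuming a toy schedule table with pid, act, start and end columns (times in minutes since midnight; again not the caveat API):

import pandas as pd

schedules = pd.DataFrame({
    "pid": [0, 0, 1, 1],
    "act": ["home", "work", "home", "shop"],
    "start": [0, 540, 0, 600],
    "end": [540, 1020, 600, 660],
})
# share of agents at "work" at any point between 10am (600) and 11am (660)
working = schedules[
    (schedules.act == "work") & (schedules.start < 660) & (schedules.end > 600)
]
print(working.pid.nunique() / schedules.pid.nunique())  # 0.5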
We also use uniqueness as a measure of diversity within a population of sequences.
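Uniqueness can be sketched as the share of sequences that occur exactly once in the population (a toy illustration, not the caveat API):

from collections import Counter

sequences = [
    ("home", "work", "home"),
    ("home", "work", "home"),
    ("home", "shop", "home"),
]
counts = Counter(sequences)
print(sum(1 for c in counts.values() if c == 1) / len(sequences))  # ~0.33

The rest of this section works through these features using the caveat API on a synthetic population: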
import random

import pandas as pd

from caveat.evaluate import ops
from caveat.evaluate.describe.times import (
    joint_time_distributions_plot,
    times_distributions_plot,
)
from caveat.evaluate.describe.transitions import sequence_prob_plot
from caveat.evaluate.distance import emd, mape
from caveat.evaluate.features import participation, times
# create some data
raw = pd.read_csv("data/synthetic_schedules.csv")
def down_sample(df, p):
    n_samples = int(len(df.pid.unique()) * p)
    sample_ids = random.sample(list(df.pid.unique()), n_samples)
    sampled = df[df.pid.isin(sample_ids)]
    return sampled
observed = down_sample(raw, 0.2)
a = down_sample(observed, 0.2)
b = down_sample(raw, 0.2)
synthetic = {"a": a, "b": b}
For example, we can extract the start times of each activity and report the averages:
starts = times.start_times_by_act(observed)
ops.average(starts)
education    0.655208
home         0.345277
leisure      0.368498
shop         0.366820
work         0.365031
dtype: float64
Note that the starts feature is a dictionary of tuples, where the first value describes the 'support' of the feature and the second the frequency of each observation.
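For example, unpacking the "work" segment (a quick structural check; the lengths depend on the sampled data):

support, frequency = starts["work"]
print(len(support), len(frequency))  # one frequency per supported value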
We can take a better look at the distributions by plotting them:
fig = times_distributions_plot(observed, None)
Feature Structure¶
Features are commonly structured as a dictionary of keys and values, where the key describes a segment, typically an activity or transition type:
participation_probability = {
    "home": [1, 1, 1],
    "work": [1, 0, 0],
    "shop": [0, 1, 0],
}
In the above example, each feature segment records whether each of the 3 sequences contained a "home", "work" or "shop" activity.
In practice we compress this representation into frequency counts, represented by a tuple of (i) the possible values and (ii) their frequencies. In this simple case we get:
participation_probability = {
    "home": ([0, 1], [0, 3]),
    "work": ([0, 1], [2, 1]),
    "shop": ([0, 1], [2, 1]),
}
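One way to build this compressed form is numpy.bincount over a fixed support (a sketch; the caveat internals may differ):

import numpy as np

raw_participation = {"home": [1, 1, 1], "work": [1, 0, 0], "shop": [0, 1, 0]}
support = np.array([0, 1])
compressed = {
    act: (support, np.bincount(obs, minlength=2))
    for act, obs in raw_participation.items()
}
print(compressed["home"])  # (array([0, 1]), array([0, 3]))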
Feature Descriptions¶
Features can be described or plotted using functions in the describe module:
participation_rates = participation.participation_rates_by_act(observed)
print(ops.average(participation_rates))
education    0.050
home         2.080
leisure      0.745
shop         0.885
work         0.465
dtype: float64
fig = sequence_prob_plot(observed, synthetic, figsize=(12, 4))
Feature Segmentation¶
We can use more interesting types of segmentation to extract more descriptive features. For example, we can enumerate activity type by its position in the sequence:
participation_rates = participation.participation_rates_by_seq_act(observed)
print(ops.average(participation_rates).head(10))
0home        1.000
10leisure    0.005
11work       0.005
12home       0.005
1leisure     0.155
1shop        0.505
1work        0.340
2education   0.035
2home        0.210
2leisure     0.465
dtype: float64
Or by the enumeration of that type of activity in each sequence:
participation_rates = participation.participation_rates_by_act_enum(observed)
print(ops.average(participation_rates).head(10))
education0    0.050
home0         1.000
home1         1.000
home2         0.065
home3         0.010
home4         0.005
leisure0      0.710
leisure1      0.025
leisure2      0.005
leisure3      0.005
dtype: float64
In these examples we use additional segmentation to get more information about the sequence. For example we can differentiate between the participation in (i) education as the third activity (2education) versus the fourth activity (3education), or (ii) the first education activity (education0) versus the second education activity (education1).
In all cases we use weighted averaging to combine segmented features into single metrics, where the weighting is usually the count of each feature in the observed population of sequences.
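As an illustration of this weighting (the per-segment metric values and counts below are made up):

import numpy as np

segment_metrics = {"home": 0.010, "work": 0.030, "shop": 0.100}  # hypothetical per-segment distances
segment_counts = {"home": 416, "work": 93, "shop": 177}  # hypothetical observed counts
values = np.array(list(segment_metrics.values()))
weights = np.array([segment_counts[act] for act in segment_metrics])
print(np.average(values, weights=weights))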
Dimensions¶
Start times are a one-dimensional feature, but we can also consider multi-dimensional features:
start_durations = times.start_and_duration_by_act_bins(observed, bin_size=10)
# average2d averages each dimension and then sums, so that we can return a float
ops.average2d(start_durations)
education    0.721528
home         0.633998
leisure      0.625886
shop         0.435891
work         0.706467
dtype: float64
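A rough sketch of the reduction described in the comment above, i.e. a weighted mean of each dimension followed by a sum (assumed behaviour, not the caveat internals):

import numpy as np

# 2d support of (start, duration) bins with observation weights
support = np.array([[0.2, 0.5], [0.4, 0.1], [0.6, 0.3]])
weights = np.array([10, 5, 1])
per_dimension = np.average(support, axis=0, weights=weights)  # mean start, mean duration
print(per_dimension.sum())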
fig = joint_time_distributions_plot(observed, None, figsize=(12, 4))
Distances¶
When comparing features we generally see complex distributions:
fig = times_distributions_plot(observed, synthetic)
To make a quantitative comparison between populations of sequences we primarily use the Wasserstein or "earth mover's" distance. This measures the amount of "work" required to make one distribution match another.
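For intuition, scipy's 1d Wasserstein distance shows the idea (a standalone illustration; caveat's emd operates on the (support, frequency) feature tuples described above):

from scipy.stats import wasserstein_distance

# moving each point of the left distribution 5 units right costs 5 units of "work"
print(wasserstein_distance([0, 1, 3], [5, 6, 8]))  # 5.0

Comparing the "home" start time features of each synthetic population: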
x = times.start_times_by_act(observed)
ya = times.start_times_by_act(synthetic["a"])
yb = times.start_times_by_act(synthetic["b"])
print("synthetic population A: ", emd(x["home"], ya["home"]))
print("synthetic population B: ", emd(x["home"], yb["home"]))
synthetic population A:  0.006147486580939153
synthetic population B:  0.004812311348643272
In this case we might judge population B to be better. In practice we will use many more features, and there will generally be trade-offs between them.
For probability features (in particular participation) we also sometimes use mean absolute percentage error (mape). This is particularly useful for highlighting participation in uncommon activities, which are often problematic for generative models.
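The underlying error is simple: for a single pair of probabilities it is the absolute difference relative to the observed value (a sketch of the metric, not the caveat implementation):

# observed vs synthetic participation probability for a rare activity (hypothetical values)
x_prob, y_prob = 0.05, 0.06
print(abs(x_prob - y_prob) / x_prob)  # 0.2, a large relative error for a small absolute one

Comparing the participation probability features of each synthetic population: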
x = participation.participation_prob_by_act(observed)
ya = participation.participation_prob_by_act(synthetic["a"])
yb = participation.participation_prob_by_act(synthetic["b"])
print("synthetic population A: ", mape(x["leisure"], ya["shop"]))
print("synthetic population B: ", mape(x["leisure"], yb["shop"]))
synthetic population A:  0.020689655172413814
synthetic population B:  0.10126582278481021
By this metric, population A appears better.