Feature Derivation, Lineage, and Versioning: A Technical Deep Dive

Published on September 25, 2024 by Gautham Vemulapalli

I’ve spent considerable time thinking about how data representations evolve in machine learning systems. That evolution affects not just model architectures but also the fundamental building blocks of ML systems: features. Today, I want to explore several approaches to feature evolution, derivation, and management in production ML systems.

The Three Core Questions

In designing robust feature stores, we need to answer three fundamental questions:

How might we build features upon previously built features? (compositional architecture)
How might we track changes between features? (lineage and provenance)
How might we update previously built features? (versioning and lifecycle management)

Let me walk through my thinking on these challenges, with a particular focus on two competing paradigms: derived features vs. revised features.

Derived Features: A Graph-Based Approach

A derived feature exists within a directed acyclic graph (DAG) of feature dependencies, where each feature can have one or many parent features.

class DerivedFeature:
    def __init__(self, name, transformation_fn, source_features=None):
        self.name = name
        self.transformation_fn = transformation_fn
        self.source_features = source_features or []
        
    def compute(self, context):
        # First compute every source (parent) feature this one depends on
        source_values = {}
        for feature in self.source_features:
            source_values[feature.name] = feature.compute(context)
            
        # Then apply the transformation function
        return self.transformation_fn(source_values)

This pattern enables powerful compositional architectures where complex features emerge from simpler building blocks. For example:

# Base features (Feature is assumed to expose the same compute(context) interface)
total_miles_driven = Feature("total_miles_driven", lambda ctx: query_odometer_data(ctx))
days_owned = Feature("days_owned", lambda ctx: query_ownership_duration(ctx))

# Derived feature: lifetime miles divided by days owned
# (max(..., 1) guards against division by zero on the day of purchase)
avg_daily_miles = DerivedFeature(
    "avg_daily_miles",
    lambda sources: sources["total_miles_driven"] / max(sources["days_owned"], 1),
    [total_miles_driven, days_owned]
)

In production systems, these relationships would be managed via database references rather than direct Python object relationships, but the concept remains the same.
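
One way to picture the database-reference version is a registry keyed by feature name, where each feature stores only the names of its parents and resolves them at compute time. The sketch below is a minimal in-memory stand-in, not a real feature store API: `FeatureRegistry`, `register`, and the context keys are hypothetical names, and a plain dict plays the role of the database.

```python
class FeatureRegistry:
    """In-memory stand-in for a feature metadata table (hypothetical)."""

    def __init__(self):
        self._defs = {}  # name -> (transformation_fn, tuple of parent names)

    def register(self, name, transformation_fn, source_names=()):
        self._defs[name] = (transformation_fn, tuple(source_names))

    def compute(self, name, context, _cache=None):
        # One cache per top-level call, so a parent shared by several
        # children is computed only once
        cache = {} if _cache is None else _cache
        if name not in cache:
            fn, source_names = self._defs[name]
            if source_names:
                sources = {s: self.compute(s, context, cache) for s in source_names}
                cache[name] = fn(sources)
            else:
                # Base features read straight from the context
                cache[name] = fn(context)
        return cache[name]


registry = FeatureRegistry()
registry.register("total_miles_driven", lambda ctx: ctx["odometer"])
registry.register("days_owned", lambda ctx: ctx["days_owned"])
registry.register(
    "avg_daily_miles",
    lambda s: s["total_miles_driven"] / max(s["days_owned"], 1),
    ["total_miles_driven", "days_owned"],
)

result = registry.compute("avg_daily_miles", {"odometer": 7300, "days_owned": 365})
print(result)  # 20.0
```

Resolving parents by name keeps the definitions serializable, at the cost of a lookup per reference.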

Real-world Examples of Derived Features

The compositional nature of derived features enables increasingly sophisticated abstractions:

Average Daily Miles Driven:

SELECT $features.miles_driven_yearly / 365.0 AS avg_daily_miles
FROM entity_table

Miles Between Maintenance:

SELECT 
  $features.total_miles_driven / 
  NULLIF($features.maintenance_events_count, 0) AS miles_between_maintenance
FROM entity_table

The power here is that $features.miles_driven_yearly and other referenced features can themselves be derived from other features, creating a rich graph of feature dependencies.

Implementation Considerations for Derived Features

For derived features to work effectively in a production setting, we need several core capabilities:

Tag-based reference system - A mechanism to reference existing features in feature definitions
Query expansion - The ability to recursively expand references to their source queries
Dependency tracking - Metadata to understand and validate the feature dependency graph
Cycle detection - Algorithms to prevent circular dependencies

The implementation might look something like:

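import re

# Matches a whole reference such as $features.avg_daily_miles
FEATURE_REF = re.compile(r"\$features\.(\w+)")

def expand_feature_query(query, feature_store, _path=()):
    """Replace $features.X tags with the appropriate subqueries"""
    def expand_reference(match):
        ref = match.group(1)
        if ref in _path:
            # Without this guard a circular definition would recurse forever
            raise ValueError("Circular dependency: " + " -> ".join(_path + (ref,)))

        feature = feature_store.get_feature(ref)
        # Recursively expand nested features before inlining the subquery
        expanded_query = expand_feature_query(feature.source_query, feature_store, _path + (ref,))
        return f"({expanded_query})"

    # Substituting whole identifiers keeps $features.miles from also
    # rewriting the prefix of $features.miles_driven_yearly
    return FEATURE_REF.sub(expand_reference, query)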

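This recursive expansion, while powerful, adds real complexity: every reference becomes a nested subquery, so the generated SQL grows with the depth and fan-out of the dependency graph and can become slow to execute.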
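
Cycle detection, one of the capabilities listed above, can also run up front when a feature is registered, before any query is expanded. The sketch below leans on Python's standard `graphlib` module (3.9+). The `validate_dependencies` helper and the dict-of-names graph format are assumptions made for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def validate_dependencies(dependencies):
    """Return a safe computation order, or raise ValueError on a cycle.

    `dependencies` maps each feature name to the names it depends on.
    """
    try:
        # static_order() yields parents before the features that use them
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # The second element of exc.args lists the nodes forming the cycle
        raise ValueError("Circular dependency: " + " -> ".join(exc.args[1])) from exc


order = validate_dependencies({
    "avg_weekly_miles": ["avg_daily_miles"],
    "avg_daily_miles": ["total_miles_driven", "days_owned"],
})
print(order)
```

The same topological order doubles as a materialization schedule if features are precomputed rather than expanded inline.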

Revised Features: A Simpler Alternative

In contrast, a revised feature represents a new version of an existing feature, with a flattened dependency structure:

class RevisedFeature:
    def __init__(self, name, source_query, parent_feature=None):
        self.name = name
        self.source_query = source_query
        self.parent_feature = parent_feature  # For lineage tracking only
        
    def compute(self, context):
        # Direct execution of the complete query
        return execute_query(self.source_query, context)

The critical difference is that when executing a revised feature, we don’t need to traverse the dependency graph – the query is self-contained.
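
Because `parent_feature` is kept only for lineage, answering "where did this feature come from?" is a walk up a single chain rather than a graph traversal. Here is a minimal sketch of that walk. The `lineage` helper, the `_v1`/`_v2` naming, and the example queries are hypothetical.

```python
class RevisedFeature:
    """Same fields as the class above, minus query execution."""

    def __init__(self, name, source_query, parent_feature=None):
        self.name = name
        self.source_query = source_query
        self.parent_feature = parent_feature  # For lineage tracking only


def lineage(feature):
    """Return feature names from the given revision back to the original."""
    chain = []
    while feature is not None:
        chain.append(feature.name)
        feature = feature.parent_feature
    return chain


v1 = RevisedFeature(
    "avg_daily_miles_v1",
    "SELECT miles_driven_yearly / 365.0 FROM odometer_readings",
)
v2 = RevisedFeature(
    "avg_daily_miles_v2",
    "SELECT miles_driven_yearly / 365.25 FROM odometer_readings",
    parent_feature=v1,
)

history = lineage(v2)
print(history)  # ['avg_daily_miles_v2', 'avg_daily_miles_v1']
```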

The Tradeoff Space

This leads us to the central tension in feature store design: the tradeoff between expressivity and simplicity. Let me analyze this across several dimensions:

1. Development and Maintenance Complexity

Derived features create what is essentially a compiler problem – we need to correctly expand, validate, and optimize a complex graph of feature dependencies. This introduces:

Circular dependency detection requirements
Complex query optimization challenges
Potential for exponential query expansion

The revised feature approach significantly reduces this complexity but requires users to manually manage the composition process.
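
To make the exponential case concrete, consider a chain in which every feature mentions its parent twice. The toy expander below is a sketch with no cycle checks; the `definitions` dict and the `f0` to `f10` feature names are invented for illustration. It counts how often the base column appears after full expansion.

```python
import re

FEATURE_REF = re.compile(r"\$features\.(\w+)")


def expand(query, definitions):
    """Inline every $features.X tag using the `definitions` dict."""
    return FEATURE_REF.sub(
        lambda m: "(" + expand(definitions[m.group(1)], definitions) + ")",
        query,
    )


# f0 is a base column; each later feature mentions its parent twice
definitions = {"f0": "x"}
for i in range(1, 11):
    definitions[f"f{i}"] = f"$features.f{i - 1} + $features.f{i - 1}"

base_column_mentions = [
    expand(f"$features.f{i}", definitions).count("x") for i in range(11)
]
print(base_column_mentions)  # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
```

Ten levels of nesting already turn a one-line definition into a query that reads the base column 1,024 times, which is why an optimizer or materialization layer matters once the graph deepens.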

2. Authorization and Access Control

A particularly thorny issue emerges with derived features: how do we handle authorization? If feature C depends on features A and B, and a user has access to A but not B, should they be able to access C?

def check_derived_feature_authorization(user, feature, feature_store):
    """Recursive authorization check for derived features"""
    # Direct authorization check
    if not user_has_access(user, feature):
        return False

    # Check authorization for all dependencies. source_features holds feature
    # names here; base features without the attribute have nothing to check.
    for source_feature_name in getattr(feature, "source_features", []):
        source_feature = feature_store.get_feature(source_feature_name)
        if not check_derived_feature_authorization(user, source_feature, feature_store):
            return False

    return True

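This recursive check has to run (or be cached and invalidated) on every access, and any code path that skips it could expose data derived from a feature the user cannot read. The revised feature model sidesteps both problems because each feature is authorized on its own.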
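
Part of that overhead comes from shared parents being checked repeatedly. The sketch below memoizes results within a single check, assuming the graph is acyclic and that grants are a simple set of feature names (both assumptions, as is the `is_authorized` helper). It also hard-codes one possible answer to the question above: access to C requires access to every transitive parent.

```python
def is_authorized(user_grants, feature_name, dependencies, _memo=None):
    """True only if the user is granted the feature and all of its ancestors."""
    memo = {} if _memo is None else _memo
    if feature_name not in memo:
        memo[feature_name] = feature_name in user_grants and all(
            is_authorized(user_grants, dep, dependencies, memo)
            for dep in dependencies.get(feature_name, ())
        )
    return memo[feature_name]


# C depends on A and B, as in the example above
dependencies = {"C": ["A", "B"], "A": [], "B": []}

partial_access = is_authorized({"A", "C"}, "C", dependencies)
full_access = is_authorized({"A", "B", "C"}, "C", dependencies)
print(partial_access, full_access)  # False True
```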

3. Performance Considerations

The derived feature approach potentially introduces significant query complexity when expanded:

-- What might start as a simple derived feature query
SELECT $features.avg_daily_miles * 7 AS avg_weekly_miles
FROM entity_table

-- Could expand to
SELECT 
  (
    SELECT miles_driven_yearly / 365.0
    FROM odometer_readings
    WHERE entity_id = entity_table.id
    ORDER BY reading_date DESC
    LIMIT 1
  ) * 7 AS avg_weekly_miles
FROM entity_table

As the dependency graph deepens, these queries become increasingly complex and potentially inefficient without sophisticated query optimization.

4. User Experience

From a user perspective, derived features offer elegant composition but introduce a new syntax and mental model:

-- Derived feature approach
SELECT 
  $features.avg_daily_miles * 7 AS avg_weekly_miles,
  $features.total_fuel_consumption / $features.miles_driven_yearly AS fuel_efficiency
FROM entity_table

The revised feature approach maintains a more familiar, if verbose, pattern:

-- Revised feature approach
SELECT 
  (SELECT miles_driven_yearly / 365.0 FROM odometer_readings WHERE entity_id = e.id) * 7 AS avg_weekly_miles,
  (SELECT fuel_consumption FROM fuel_data WHERE entity_id = e.id) / 
  (SELECT miles_driven_yearly FROM odometer_readings WHERE entity_id = e.id) AS fuel_efficiency
FROM entity_table e

Implementation Pathway Decision

After weighing these considerations, I believe the revised feature approach offers a more pragmatic path forward for most organizations, despite its limitations. The derived feature model is intellectually elegant but introduces significant complexity that may not justify its benefits in many real-world scenarios.

That said, a hybrid approach might be optimal. We could implement the revised feature model first, while designing the metadata schema to support future implementation of derived features if needed:

# Initial implementation - Revised features
class Feature:
    def __init__(self, name, query, parent_feature=None):
        self.name = name
        self.query = query
        self.parent_feature = parent_feature  # For lineage only
        
# Future extension - Derived features
class DerivedFeature(Feature):
    def __init__(self, name, query, source_features=None):
        super().__init__(name, query)
        self.source_features = source_features or []
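
The versioning side of that metadata schema could look something like the following. This is only a sketch: `FeatureVersion`, `VersionedFeatureStore`, and the integer version scheme are hypothetical choices, and an in-memory dict stands in for the metadata table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeatureVersion:
    name: str
    version: int
    query: str
    parent_version: Optional[int] = None  # For lineage only


class VersionedFeatureStore:
    def __init__(self):
        self._versions = {}  # name -> list of FeatureVersion, oldest first

    def publish(self, name, query):
        """Append a new immutable version instead of editing in place."""
        history = self._versions.setdefault(name, [])
        parent = history[-1].version if history else None
        record = FeatureVersion(name, len(history) + 1, query, parent)
        history.append(record)
        return record

    def get(self, name, version=None):
        """Return the latest version, or a pinned one if `version` is given."""
        history = self._versions[name]
        return history[-1] if version is None else history[version - 1]


store = VersionedFeatureStore()
store.publish("avg_daily_miles", "SELECT miles_driven_yearly / 365.0 FROM odometer_readings")
store.publish("avg_daily_miles", "SELECT miles_driven_yearly / 365.25 FROM odometer_readings")

latest = store.get("avg_daily_miles")
pinned = store.get("avg_daily_miles", version=1)
```

Keeping old versions immutable lets a model pin the exact definition it was trained on, while new consumers pick up the latest revision.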

Final Thoughts

The tension between expressivity and simplicity in feature store design mirrors many challenges we face in designing machine learning systems. While powerful abstractions enable sophisticated capabilities, they often come with hidden costs in terms of complexity, performance, and maintainability.

As with many engineering decisions, the right approach depends on your specific context. For teams with sophisticated infrastructure and strong engineering resources, the derived feature model may be worth its complexity. For others, the simplicity and reliability of the revised feature approach may be more appropriate.

The key insight is recognizing that feature engineering, like model development, benefits from clear provenance, versioning, and compositional architecture – regardless of which specific approach you choose to implement these principles.