Jobs
Jobs are procedures composed of daft operations that run on the Eventual platform. They automatically handle scaling, retries, and fault tolerance, so you can focus on your business logic.
What is a Job?
A Job is a Python function decorated with @job.main() that defines the work to be performed. Jobs are:
- Automatically Distributed: Your code runs across multiple machines without configuration
- Fault Tolerant: Built-in retry logic and error handling
- Scalable: Automatically scales based on workload
- Monitored: Full logging and metrics collection
Basic Job Example
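The sketch below shows what a basic job might look like. The @job.main() decorator is taken from the description above, but the eventual import path, the bucket paths, and the column names are assumptions; the DataFrame calls are standard daft operations.

```python
# Hypothetical SDK import path -- adjust to your installation.
from eventual import job

import daft


@job.main()
def filter_large_files():
    # Read file metadata from S3 (bucket and column names are illustrative).
    df = daft.read_parquet("s3://my-bucket/file-metadata/")

    # Keep only rows describing files larger than 1 MiB.
    df = df.where(daft.col("size_bytes") > 1024 * 1024)

    # Write the filtered result back out; the platform distributes
    # this work across the cluster automatically.
    df.write_parquet("s3://my-bucket/large-files/")
```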
Job Lifecycle
1. Submit: The job is submitted to the Eventual platform.
2. Schedule: The platform schedules execution on available compute resources.
3. Execute: The job function runs with automatic scaling and fault tolerance.
4. Monitor: Progress is tracked with full logging and metrics.
5. Complete: Results are returned and resources are cleaned up.
Environments
Environments define the runtime context for your jobs, including Python dependencies and configuration. They ensure your jobs have everything needed to run successfully.
What is an Environment?
An Environment specifies:
- Python Dependencies: Packages required by your job
- Environment Variables: Configuration values loaded into your job at runtime and shown in the UI
- Secrets: Sensitive configuration values, stored in an encrypted vault, accessible only to your job at runtime
- Files: Additional files needed at runtime
Note: Currently, secrets are ephemeral (tied to a single environment), but support for persistent, shareable secrets is planned.
Creating Environments
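A minimal sketch of how an environment covering the fields above might be declared. The Environment class, its keyword arguments, and the import path are assumptions used for illustration, not the exact SDK API.

```python
# Hypothetical SDK import -- the actual module and class names may differ.
from eventual import Environment

env = Environment(
    # Python packages the job needs at runtime, pinned for reproducibility.
    dependencies=["daft==0.3.15", "torch==2.3.1"],
    # Plain configuration values, visible in the UI.
    env_vars={"BATCH_SIZE": "64"},
    # Sensitive values kept in the encrypted vault, injected only at runtime.
    secrets={"HF_TOKEN": "hf_..."},
    # Extra files shipped alongside the job.
    files=["configs/model.yaml"],
)
```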
Environment Best Practices
Pin Dependencies
Always specify exact versions to ensure reproducibility:
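For example, a pinned dependency list (the package versions are illustrative; a requirements.txt with the same pins works equally well):

```python
# Exact pins make the environment reproducible across runs and machines.
dependencies = [
    "daft==0.3.15",
    "numpy==1.26.4",
    "pillow==10.3.0",
]
```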
Use Environment Variables and Secrets
Store configuration in environment variables and sensitive values in secrets. Secrets are not visible in the UI and are intended for sensitive information; persistent, shareable secrets are coming soon.
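A sketch of reading both inside a job. It assumes secrets surface as environment variables at runtime, which may differ from the platform's actual injection mechanism; the variable names are illustrative.

```python
import os

# Plain configuration -- visible in the UI, safe to log.
batch_size = int(os.environ.get("BATCH_SIZE", "64"))

# Secret value -- injected at runtime, never shown in the UI.
api_token = os.environ["HF_TOKEN"]
```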
Include Required Files
Add model files, configs, and other assets:
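For instance, shipping a config file with the environment and opening it from the job. The file paths, and the assumption that files land in the job's working directory, are illustrative.

```python
import yaml

# Declared in the environment's file list (paths are illustrative):
#   files=["configs/inference.yaml", "models/classifier.onnx"]
# Inside the job, the files are assumed to be available relative to the
# working directory.
with open("configs/inference.yaml") as f:
    config = yaml.safe_load(f)
```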
Resources
Resources are referenceable entities that can be used across jobs and shared within your organization. They provide abstractions over infrastructure components like data volumes, ML models, and compute clusters.
What are Resources?
Resources represent:
- Data Volumes: S3 buckets, databases, file systems
- ML Models: Trained models, embeddings, checkpoints
- Compute Resources: GPU clusters, specialized hardware
- External Services: APIs, databases, third-party systems
Using Resources
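A sketch of referencing shared resources from a job by name. The resources module, the get() helper, the .path() accessor, and the resource names are all assumptions illustrating the idea, not the exact SDK API.

```python
# Hypothetical SDK imports -- actual names may differ.
from eventual import job, resources

import daft


@job.main()
def score_reviews():
    # Look up shared resources by name (names are illustrative).
    reviews = resources.get("customer-reviews")    # a data volume
    model = resources.get("sentiment-model-v2")    # an ML model resource

    # Read from the data volume with daft and continue processing.
    df = daft.read_parquet(reviews.path("reviews/"))
    ...
```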
Resource Benefits
- Reusability: Define once, use across multiple jobs
- Versioning: Track versions and metadata
- Sharing: Share resources across teams
- Governance: Control access and permissions
Putting It All Together
Here’s how Jobs, Environments, and Resources work together; a sketch follows the steps below.
How They Work Together
1. Environment Setup: The environment installs PyTorch and sets up the model cache directory.
2. Resource Loading: The job loads images from the data volume and the ML model from the model resource.
3. Distributed Processing: daft automatically distributes the image classification across the cluster.
4. Result Storage: Classified results are saved back to the data volume.
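A combined sketch under the same assumptions as the earlier snippets: the eventual import path, the resources.get() helper, the .path() accessor, the resource names, and model.predict() are hypothetical, while the daft expressions (url.download(), image.decode(), apply()) are real daft operations.

```python
# Hypothetical SDK imports and resource names -- a sketch, not the exact API.
from eventual import job, resources

import daft


@job.main()
def classify_images():
    # Resources defined elsewhere in the organization (names are illustrative).
    images = resources.get("product-images")       # data volume with raw images
    model = resources.get("resnet50-classifier")   # trained ML model

    # Load image metadata, then download and decode the images with daft.
    df = daft.read_parquet(images.path("incoming/"))
    df = df.with_column("image", daft.col("image_url").url.download().image.decode())

    # Apply the classifier to each decoded image; `model.predict` is assumed
    # to take an image and return a string label.
    df = df.with_column(
        "label",
        daft.col("image").apply(model.predict, return_dtype=daft.DataType.string()),
    )

    # Write classified results back to the data volume.
    df.write_parquet(images.path("classified/"))
```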
Next Steps
Now that you understand the core concepts, dive deeper into each area:
- Jobs Deep Dive: Learn advanced job patterns and best practices
- Environments Guide: Master environment configuration and dependency management
- Resources Guide: Understand resource types and sharing patterns
- daft Integration: Learn how to process data with daft
Ready to see these concepts in action? Check out our image processing example to see how Jobs, Environments, and Resources work together in a real-world scenario.