Posted 1 month ago

Senior Data Engineer

Tallinn · On-site · Full-time

About this role

About Us

Dragonfly is building the world's first Automated Solutions Architect. We help businesses navigate the complex landscape of modern tools (SaaS, AI, Infrastructure) by using an AI-powered platform that understands their unique context and recommends the optimal tech stack.

Our platform is powered by a proprietary knowledge graph of 230K+ products, 4M+ companies, and the relationships between them, built from 100+ external data sources, LLM-driven research, and human curation. The architecture is designed, but much of it still runs as manually triggered pipelines. We need someone to turn it into a production data platform.

The Role

We don't have a separate data team with its own stack. The data platform lives in the same monorepo as the product, follows the same engineering practices, and ships through the same CI pipeline. You'll be working alongside product engineers in Python and SQL — the same codebase, the same review process, the same standards.

The best way to describe this role is a software engineer who builds data systems. You'll be as comfortable making a product change as you are writing a pipeline.

We've designed the data architecture: a medallion-layer pipeline in BigQuery via Dataform, a signal-based quality scoring model, an append-only ontology, and an AI-driven research pipeline. We need someone to own it — productionise what exists, extend it as the product evolves, and keep it running day-to-day.

We operate a high-autonomy, high-trust environment. You'll be given a problem and the space to solve it — not a task list. AI is your first port of call for everything: understanding the codebase, exploring data, drafting implementations, debugging. Tickets describe what needs doing, not how. We expect you to think beyond the immediate task — consider knock-on effects, integration points, and how your work fits into the broader product and business strategy. Curiosity matters: you should want to understand how everything connects, not just the bit you're working on.

The Data Pipeline

Dragonfly's core asset is a proprietary knowledge graph — every SaaS product, AI tool, and infrastructure service, what it does, who makes it, how it compares, and how confident we are in that data.

Building this knowledge graph is a multi-stage pipeline: raw data from external sources is normalised, scored for quality, matched and enriched by AI agents, curated by humans, and served to downstream systems (search, recommendations, product catalogue). The architecture follows a medallion pattern (Bronze → Silver → Gold) in BigQuery via Dataform, with an append-only ontology at the centre.

What's missing is the production engineering: scheduling, monitoring, incremental processing, source onboarding automation, and the platform layer that makes it all self-service for product engineers.

What You'll Do

1. Production Data Pipelines

The pipeline exists as ~200 Dataform files across multiple datasets. Today, builds are triggered manually. Incremental processing has been validated in design but not hardened for production. You'll own making this run reliably.

Scheduled builds, retry logic, and failure alerting
Harden incremental processing — source refreshes need to flow through without full rebuilds
Monitoring: volume checks per layer, freshness SLAs, broken source contract detection
Maintain and evolve the data marts that feed search, recommendations, and the product catalogue
Manage versioned dataset transitions with rollback capability
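
To give a feel for the monitoring work, here is a minimal Python sketch of a per-layer freshness and volume check. Everything in it is an assumption made for illustration: the `LayerStats` shape, the 24-hour freshness window, and the 20% drop threshold are hypothetical, and real values would come from BigQuery metadata and the team's actual SLAs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class LayerStats:
    """Hypothetical per-layer snapshot (not a real Dragonfly or BigQuery type)."""
    name: str
    row_count: int
    previous_row_count: int
    last_updated: datetime


def check_layer(stats: LayerStats, now: datetime,
                max_age: timedelta = timedelta(hours=24),
                max_drop_ratio: float = 0.2) -> list[str]:
    """Return human-readable alerts for one layer; an empty list means healthy."""
    alerts = []
    # Freshness SLA (assumed 24h): the layer must have been rebuilt within max_age.
    if now - stats.last_updated > max_age:
        alerts.append(f"{stats.name}: stale, last updated {stats.last_updated.isoformat()}")
    # Volume check (assumed 20%): flag a sharp drop relative to the previous build.
    if stats.previous_row_count > 0:
        drop = 1 - stats.row_count / stats.previous_row_count
        if drop > max_drop_ratio:
            alerts.append(f"{stats.name}: row count dropped {drop:.0%}")
    return alerts


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
silver = LayerStats("silver", 700, 1000, now - timedelta(hours=30))
gold = LayerStats("gold", 990, 1000, now - timedelta(hours=2))
silver_alerts = check_layer(silver, now)  # stale and shrunk: two alerts
gold_alerts = check_layer(gold, now)      # fresh and stable: no alerts
```

In production the alerts would feed whatever alerting channel the scheduled builds use, rather than being returned as strings.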

2. Source Onboarding & Ingestion

We ingest from several sources. Adding a new source today means writing a silver view by hand. You'll build a repeatable process so that onboarding a new vendor is a config change, not a project.

Evaluate and implement managed ingestion tooling (Airbyte, Meltano, or similar) for batch sources
New sources land in Bronze and flow through the existing Silver standardisation layer automatically
Handle the reality of messy source data — self-referential URLs, missing fields, schema drift, API rate limits
Support three ingestion patterns: scheduled batch, ad-hoc triggers, and streaming from internal AI agents
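
One way to picture "a config change, not a project" is a small config object that renders the standardising SELECT for a new source. This is only a sketch: `SourceConfig`, the table name, and the column names are hypothetical, and the real Silver layer is written in Dataform SQLX rather than generated from Python strings.

```python
from dataclasses import dataclass


@dataclass
class SourceConfig:
    """Hypothetical onboarding config for one vendor feed."""
    name: str
    bronze_table: str
    column_map: dict  # standard Silver column -> source column


def silver_select(config: SourceConfig) -> str:
    """Render a SELECT that renames source columns to the standard Silver names."""
    columns = ",\n  ".join(
        f"{source} AS {target}" for target, source in sorted(config.column_map.items())
    )
    return (
        f"SELECT\n  {columns},\n  '{config.name}' AS source_name\n"
        f"FROM `{config.bronze_table}`"
    )


# Illustrative vendor; neither the name nor the columns come from the posting.
acme = SourceConfig(
    name="acme",
    bronze_table="bronze.acme_products",
    column_map={"product_name": "title", "vendor_url": "homepage"},
)
sql = silver_select(acme)
print(sql)
```

With this shape, onboarding a second vendor means adding another `SourceConfig`, not another hand-written view.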

3. Data Governance & Lineage

We've started with Dataplex for source trust metadata and quality scans. You'll extend this into a proper governance layer that's automated, not bureaucratic.

Data contracts between pipeline layers (schema enforcement, freshness SLAs)
Dataplex tagging for lineage tracking across the full pipeline
Quality monitoring — volume checks, anomaly detection, broken source contract alerts
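
As an illustration of what a data contract between layers can mean in practice, the sketch below checks a row against an expected schema and reports missing, mistyped, and unexpected fields. The contract contents and field names are assumptions for the example, not the actual Dragonfly schema, and a real implementation would more likely lean on Dataform assertions or Dataplex scans than on hand-rolled Python.

```python
# Hypothetical contract for a Silver-layer table: column name -> expected Python type.
PRODUCT_CONTRACT = {
    "product_id": str,
    "name": str,
    "vendor_url": str,
    "quality_score": float,
}


def contract_violations(row: dict, contract: dict) -> list[str]:
    """List the ways a single row breaks the contract."""
    problems = []
    for column, expected_type in contract.items():
        if column not in row or row[column] is None:
            problems.append(f"missing: {column}")
        elif not isinstance(row[column], expected_type):
            problems.append(f"wrong type: {column}")
    # Columns the contract does not know about often signal schema drift upstream.
    for column in row:
        if column not in contract:
            problems.append(f"unexpected: {column}")
    return problems


good = {"product_id": "p1", "name": "Example", "vendor_url": "https://example.com", "quality_score": 0.8}
drifted = {"product_id": "p2", "name": None, "quality_score": "high", "pricing": "free"}

good_problems = contract_violations(good, PRODUCT_CONTRACT)
drifted_problems = contract_violations(drifted, PRODUCT_CONTRACT)
```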

4. Platform for Product Engineers

You're an enabler, not a gatekeeper. Product engineers building the webapp, AI agents, and recommender system should be able to iterate on data without needing to understand the plumbing.

Well-documented, stable data marts with clear contracts
Self-service tooling so engineers can query, debug, and trace data through the pipeline
Make it easy to add new derived fields, views, and downstream consumers

5. Feature Engineering & Data Science

The line between data engineering and data science is blurry here. Our scoring model, quality tiers, and recommendation features are the pipeline. You'll work closely with data science (and do some yourself).

Own the quality scoring pipeline — signal computation, score aggregation, tier thresholds
Iterate on signal weights and quality thresholds with empirical validation
Build and maintain entity features that feed the recommender
Exploratory analysis in BigQuery to inform pipeline design — the kind of deep-dive EDA that shapes architecture decisions
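
For a sense of the shape of the scoring work, here is a minimal sketch of signal aggregation and tier assignment. The signal names, weights, and thresholds are invented for illustration; the posting does not describe the real model, which lives in the pipeline itself.

```python
# Hypothetical signal weights and tier cut-offs (assumptions, not the real model).
SIGNAL_WEIGHTS = {"source_trust": 0.5, "completeness": 0.3, "recency": 0.2}
TIER_THRESHOLDS = [(0.8, "high"), (0.5, "medium")]  # checked in order, highest first


def quality_score(signals: dict[str, float]) -> float:
    """Weighted average of signals in [0, 1]; a missing signal counts as 0."""
    total_weight = sum(SIGNAL_WEIGHTS.values())
    weighted = sum(w * signals.get(name, 0.0) for name, w in SIGNAL_WEIGHTS.items())
    return weighted / total_weight


def quality_tier(score: float) -> str:
    """Map a score to the first tier whose threshold it meets, else 'low'."""
    for threshold, tier in TIER_THRESHOLDS:
        if score >= threshold:
            return tier
    return "low"


well_sourced = quality_score({"source_trust": 1.0, "completeness": 1.0, "recency": 0.5})
partial = quality_score({"source_trust": 1.0, "completeness": 0.2})
sparse = quality_score({"source_trust": 0.4})
tiers = [quality_tier(s) for s in (well_sourced, partial, sparse)]
```

Iterating on weights and thresholds then comes down to changing the two constants and validating the resulting tier distribution against real data.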

6. AI-Native Workflow

Our quality scoring model — designed, validated against real data, and documented extensively — was built using Claude Code to interrogate BigQuery directly. Work that would have taken a team weeks of testing took hours. This is how we build data systems: AI does the heavy lifting on exploration and validation, you direct it and make the calls.

AI coding tools (Claude Code, Cursor, or similar) as your primary development environment
Explore data, generate hypotheses, validate with SQL, and ship — at the speed AI tooling enables
Work in the terminal, not in GUI pipeline builders

Boundaries (Soft, Not Hard)

Your primary focus is the data platform, but you're not siloed. If a product change is needed to make the data flow work — a new API endpoint, a schema change in Firestore, a fix to how the webapp consumes a mart — you make it. The codebase is a monorepo for a reason.

You won't be the primary owner of the recommender agent, LLM research pipeline, or curation webapp — but you'll touch all of them when the data platform work requires it.

Tech Stack

Warehouse: BigQuery (EU region)
Transformations: Dataform (SQLX, medallion architecture)
Languages: Python (pipelines, agents), TypeScript (webapp, infrastructure)
Infrastructure: GCP, Pulumi (IaC), Docker
Monorepo tooling: moonrepo (task orchestration), proto (toolchain), pnpm, uv
Governance: Dataplex (source metadata, quality scans, lineage)
AI tooling: Claude Code, ADK agents

Requirements

AI-native. This is the most important requirement. AI writes most of our code — Claude Code, not you, will be producing the SQL, Python, and infrastructure. Your job is to direct it well, understand what it produces, and know when it's wrong. If you're not already using AI coding tools daily to ship real work, this isn't the right role.
Solid foundations. You need to understand Python, SQL (BigQuery dialect — window functions, STRUCT/ARRAY, UDFs), GCP services (BigQuery, Dataform, Dataplex, Cloud Run), and data pipeline patterns well enough to review and course-correct what AI generates. You don't need to write every line yourself, but you need to know when something is wrong.
Data engineering experience. You've built production pipelines before — you understand the difference between "it works when I run it" and "it runs reliably in production." Experience with ingestion tooling (Airbyte, Meltano, Fivetran), data modelling, and medallion/lakehouse patterns.
Reach for off-the-shelf first. Most of what we need is a solved problem — ingestion, orchestration, lineage, quality monitoring. We want someone who assembles the right tools rather than building custom solutions for problems that don't need them. Write code for what's genuinely unique to Dragonfly; use existing tools for everything else.

Nice to Have

Experience with entity resolution, knowledge graphs, or ontology systems
Familiarity with signal/scoring models and feature engineering
Experience building self-service data platforms for product engineering teams

Why Join Us

This is a ground-floor opportunity to shape the product and the codebase from day one. You won’t just be implementing tickets — you’ll help decide what we build, how we build it, and what the user experience should feel like. If you’re someone who thrives in creative, fast-moving environments and loves sweating the details, this is your playground.
Let’s build something people love to use.

What We Offer

The opportunity to define and shape the content narrative of a high-potential startup from day one.
Creative freedom and a high-trust environment focused on outcomes over process.
Direct access to founders and an experienced, mission-driven team.
Competitive salary.
Hybrid work options.
An intellectually stimulating environment where speed, curiosity, and product delivery are celebrated.

We are an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

Skills

Ad-hoc Triggers, Airbyte, Applied ML/AI In Data Pipelines, BigQuery, CI/CD, Cloud Platform, Dataform, Data Governance, Data Lineage, Data Pipelines, Dataplex, ETL, Incremental Processing, Meltano, Monitoring, Python, Scheduling, SQL, SQL Analytics, Streaming Ingestion, Versioning