Semantic Data Engineer / Data Scientist
Are you a talented developer looking for a remote job that lets you show your skills and get decent compensation? Join Upstaff.com, a platform that connects you with hand-picked startups and scale-ups in the US and Europe.
Summary
We are looking for a Semantic Data Engineer to build and train a model from scratch (similar to LLMs) for a data platform with federated learning features and privacy-preserving techniques.
Full-time, remote, long-term.
100% European timezone overlap; B2 or C1 English
Project Description
We are looking for a Semantic Data Engineer with ontology, data architecture, and AI/ML skills to work on developing a data platform with federated learning features and privacy-preserving techniques.
You will work on a platform that automates data ingestion, processing, and sharing with user-friendly, privacy-preserving, and scalable solutions for industrial manufacturing.
The platform will incorporate scalable and dynamic tools for creating and managing data spaces, handling complex data workflows, and ensuring modularity and privacy compliance.
Project:
- The focus is on developing a data management system for industrial manufacturing companies, targeting both IT and OT (Information Technology/Operational Technology) data.
- Use cases: compliance, EU certification, digital product passports, emissions reporting, collaboration between organizations, product development, etc. Our solution will supply the data layer (according to RAMI 4.0, see below).
- Example: BMW, Audi, and other EU auto manufacturers have thousands of suppliers, and a single product certification can mean following up tens of thousands of links across thousands of supply chains (via https://catena-x.net/en/).
- Automation and integration of data will be implemented with metadata only, without central permanent storage (such as data lakes), so machine learning, federated learning, and AI features are key.
- For the platform pilot, the plan is to start with compliance-related use cases (e.g., digital product passports or supply chain inspection).
Team Skills Coverage:
- Ontologies, Semantics, Knowledge Graphs
- Data contextualization & transformation
- Data federation & correlation
- Time-series data & streaming data handling
- RPA - Robotic Process Automation
- MLOps
- Policy engines
- Backend (Go, Python)
- Databases
- Infrastructure and DevOps (presumably Azure but can be changed to AWS or GCP)
- Security & Zero Trust (attribute-based access control, role-based access control, encryption, SSO)
- API & codeless integration agents (Zapier-like functionality)
Domain and Reference Companies
Broader industrial robotics application areas to be covered (additional areas in which to look for experience):
- IT/OT (Information Technology/Operational Technology) integration
- Digital twins, master data, single source of truth
- Data spaces (industrial data traceability)
- Manufacturing & compliance (digital product passports, emissions reporting)
- Predictive maintenance for industrial processes
- Industrial automation & data quality automation
Reference Industrial Data Management Technologies and Companies, including competitors and look-alikes:
- Siemens Digital Thread (proprietary)
- Cognite (data platform for Industry 4.0) https://cognite.com
- Litmus (Industrial data management) https://litmus.io
- Catena-X (EU automotive data space) https://catena-x.net/en/
- RAMI 4.0 (reference architecture model for Industry 4.0) https://ec.europa.eu/futurium/en/system/files/ged/a2-schweichhart-reference_architectural_model_industrie_4.0_rami_4.0.pdf
Core Responsibilities:
- Semantic Data Modeling: Designing and implementing semantic data models that capture the meaning and relationships of data elements.
- Ontology Development: Creating and managing ontologies, which are formal representations of knowledge that define concepts and their relationships.
- Knowledge Graph Construction: Building and maintaining knowledge graphs, which are networks of interconnected data points that allow for complex queries and reasoning.
- Semantic Querying and Analysis: Enabling users to query and analyze data using semantic knowledge, rather than just relying on traditional data structures.
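To make these responsibilities concrete, here is a minimal sketch of semantic data modeling and knowledge graph construction in Python with RDFLib; the `mfg` namespace, class names, and instances are illustrative assumptions, not the project's actual ontology.

```python
# Minimal sketch of semantic data modeling with RDFLib.
# The "mfg" namespace, classes, and instances are illustrative
# assumptions, not the project's actual ontology.
from rdflib import Graph, Namespace, Literal, RDF, RDFS

MFG = Namespace("http://example.org/manufacturing#")
g = Graph()
g.bind("mfg", MFG)

# Define a tiny class hierarchy (the semantic model).
g.add((MFG.Supplier, RDF.type, RDFS.Class))
g.add((MFG.Component, RDF.type, RDFS.Class))
g.add((MFG.supplies, RDF.type, RDF.Property))
g.add((MFG.supplies, RDFS.domain, MFG.Supplier))
g.add((MFG.supplies, RDFS.range, MFG.Component))

# Populate the knowledge graph with instance data.
g.add((MFG.AcmeGmbH, RDF.type, MFG.Supplier))
g.add((MFG.BrakeDisc, RDF.type, MFG.Component))
g.add((MFG.AcmeGmbH, MFG.supplies, MFG.BrakeDisc))
g.add((MFG.AcmeGmbH, RDFS.label, Literal("Acme GmbH")))

print(g.serialize(format="turtle"))
```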
Key Skills:
Machine Learning Fundamentals:
- Understanding of supervised, unsupervised, and semi-supervised learning techniques.
- Knowledge of algorithms relevant to semantic data, such as graph neural networks (GNNs), embeddings (e.g., Word2Vec, BERT), or clustering for entity resolution.
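As a rough illustration of clustering for entity resolution, the sketch below groups noisy supplier-name variants using character n-gram TF-IDF vectors and agglomerative clustering; the names and the distance threshold are illustrative assumptions.

```python
# Hedged sketch: clustering noisy supplier-name variants for
# entity resolution. The names and distance threshold are
# illustrative. Note: scikit-learn >= 1.2 uses `metric`;
# older versions call this parameter `affinity`.
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

names = ["Acme GmbH", "ACME GmbH.", "Acme Gmbh", "Bolt AG", "Bolt A.G."]
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectors = vec.fit_transform(names).toarray()

# Records that land in the same cluster are treated as
# candidate duplicates of one real-world entity.
labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.5,
    metric="cosine", linkage="average",
).fit_predict(vectors)
print(dict(zip(names, labels)))
```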
Semantic Data Processing:
- Ability to preprocess and transform RDF, OWL, or other semantic data formats for model input.
- Expertise in generating embeddings for entities and relations in knowledge graphs (e.g., TransE, DistMult, ComplEx).
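For reference, the core TransE idea fits in a few lines of NumPy: a triple (h, r, t) scores well when the head embedding translated by the relation embedding lands near the tail embedding. The entities, dimensionality, and random initialization below are illustrative; real training would optimize these vectors with a margin-based loss.

```python
# Minimal sketch of TransE scoring: score(h, r, t) is the
# negative distance between (head + relation) and tail.
# Entities, dimension, and random init are illustrative.
import numpy as np

rng = np.random.default_rng(0)
dim = 50
entity_emb = {e: rng.normal(size=dim)
              for e in ["acme", "brake_disc", "bolt"]}
relation_emb = {"supplies": rng.normal(size=dim)}

def transe_score(h, r, t):
    # Higher (less negative) score = more plausible triple.
    return -np.linalg.norm(entity_emb[h] + relation_emb[r] - entity_emb[t])

print(transe_score("acme", "supplies", "brake_disc"))
```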
Natural Language Processing (NLP):
- Skills in NLP techniques for semantic tasks like named entity recognition (NER), entity linking, or text-to-triple extraction.
- Familiarity with transformer models (e.g., BERT, RoBERTa) for semantic understanding.
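Below is a hedged sketch of NER-driven text-to-triple extraction with spaCy; the extraction rule (first ORG as subject, first verb's lemma as predicate, the following noun chunk as object) is a naive illustrative assumption, not a production pipeline.

```python
# Hedged sketch of NER and naive text-to-triple extraction
# with spaCy. Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme GmbH supplies brake discs to BMW.")

orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
print(orgs)  # named entities recognized as organizations

# Naive triple extraction: first ORG as subject, verb lemma
# as predicate, next noun chunk after the verb as object.
for token in doc:
    if token.pos_ == "VERB" and orgs:
        obj = next((nc.text for nc in doc.noun_chunks
                    if nc.start > token.i), None)
        if obj:
            print((orgs[0], token.lemma_, obj))
        break
```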
Graph-Based Machine Learning:
- Proficiency in working with graph-based models for tasks like link prediction, node classification, or knowledge graph completion.
- Understanding of graph algorithms and embeddings for semantic reasoning.
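As one concrete example of graph algorithms for link prediction, the sketch below scores a candidate edge with the classical Adamic-Adar heuristic via networkx; embedding models such as TransE or GNNs would replace this baseline in practice, and the toy graph is illustrative.

```python
# Sketch of a classical link-prediction baseline
# (Adamic-Adar index) on a toy supplier-component graph.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("acme", "brake_disc"), ("acme", "bolt"),
    ("bolt_ag", "bolt"), ("bolt_ag", "sensor"),
])

# Score candidate non-edges; higher = more likely missing link.
for u, v, score in nx.adamic_adar_index(G, [("acme", "bolt_ag")]):
    print(u, v, round(score, 3))
```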
Programming and Data Manipulation:
- Strong programming skills in Python (or similar languages like R or Java) for model development and data preprocessing.
- Experience with libraries for data manipulation (e.g., Pandas, NumPy) and semantic data handling (e.g., RDFLib, Owlready2).
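A minimal sketch of the RDFLib-plus-pandas workflow: parse a small Turtle document and flatten its triples into a DataFrame for downstream preprocessing. The Turtle snippet is illustrative sample data.

```python
# Hedged sketch: load RDF with RDFLib and flatten triples
# into a pandas DataFrame. The Turtle data is illustrative.
import pandas as pd
from rdflib import Graph

ttl = """
@prefix mfg: <http://example.org/manufacturing#> .
mfg:AcmeGmbH mfg:supplies mfg:BrakeDisc .
mfg:BoltAG mfg:supplies mfg:Bolt .
"""
g = Graph()
g.parse(data=ttl, format="turtle")

df = pd.DataFrame(
    [(str(s), str(p), str(o)) for s, p, o in g],
    columns=["subject", "predicate", "object"],
)
print(df)
```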
Model Training and Evaluation:
- Knowledge of training machine learning models, including hyperparameter tuning, cross-validation, and optimization.
- Ability to evaluate models using metrics like precision, recall, F1-score, or Mean Reciprocal Rank (MRR) for knowledge graph tasks.
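For the MRR metric specifically, a minimal sketch: for each test triple the model ranks all candidate entities, and MRR averages the reciprocal rank of the true entity. The ranks below stand in for illustrative model outputs.

```python
# Minimal sketch of Mean Reciprocal Rank (MRR) for knowledge
# graph completion. The ranks are illustrative model outputs:
# the position of the correct entity in each ranked candidate
# list (1 = ranked first).
def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

print(mean_reciprocal_rank([1, 3, 10]))  # -> ~0.478
```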
Semantic Querying and Reasoning:
- Proficiency in SPARQL for querying RDF datasets to prepare training data.
- Understanding of reasoning techniques to augment training datasets with inferred knowledge.
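Below is a short sketch of using SPARQL through RDFLib to pull supplier-component pairs as training data; the graph content and query shape are illustrative assumptions.

```python
# Hedged sketch: preparing training pairs with SPARQL via
# RDFLib. Graph content and query are illustrative.
from rdflib import Graph

ttl = """
@prefix mfg: <http://example.org/manufacturing#> .
mfg:AcmeGmbH mfg:supplies mfg:BrakeDisc .
mfg:BoltAG mfg:supplies mfg:Bolt .
"""
g = Graph()
g.parse(data=ttl, format="turtle")

q = """
PREFIX mfg: <http://example.org/manufacturing#>
SELECT ?supplier ?component
WHERE { ?supplier mfg:supplies ?component . }
"""
for row in g.query(q):
    print(row.supplier, row.component)
```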
Data Pipeline Development:
- Skills in building ETL pipelines to extract semantic data, transform it (e.g., into embeddings), and load it into training environments.
- Familiarity with data augmentation techniques for semantic datasets.
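As an illustration of the transform step of such a pipeline, the sketch below maps string triples to integer indices in the shape an embedding model's data loader typically expects; the triples themselves are illustrative.

```python
# Sketch of the "transform" step of a semantic ETL pipeline:
# mapping string triples to integer indices ready for an
# embedding model. The triples are illustrative.
import numpy as np

triples = [
    ("acme", "supplies", "brake_disc"),
    ("bolt_ag", "supplies", "bolt"),
]
entities = sorted({t[0] for t in triples} | {t[2] for t in triples})
relations = sorted({t[1] for t in triples})
e2i = {e: i for i, e in enumerate(entities)}
r2i = {r: i for i, r in enumerate(relations)}

# (head, relation, tail) index matrix for the training loader.
X = np.array([(e2i[h], r2i[r], e2i[t]) for h, r, t in triples])
print(X)
```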
Cloud and Scalability:
- Experience with AWS (or comparable cloud platforms such as Azure or GCP, per the infrastructure notes above)
- Knowledge of big data tools like Apache Spark for processing large-scale semantic datasets.
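A hedged PySpark sketch of large-scale semantic data processing: counting predicate frequencies over an N-Triples dump. The file path is a placeholder assumption, and the parsing is deliberately naive (in N-Triples the predicate is always the second whitespace-delimited token).

```python
# Hedged sketch: predicate frequency counts over a large
# N-Triples dump with PySpark. "data/dump.nt" is a placeholder
# path; real dumps would live in cloud object storage.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("triple-stats").getOrCreate()

lines = spark.sparkContext.textFile("data/dump.nt")
predicates = (
    lines.filter(lambda l: l.strip().endswith("."))
         .map(lambda l: l.split()[1])  # naive N-Triples split
         .map(lambda p: (p, 1))
         .reduceByKey(lambda a, b: a + b)
)
for pred, count in predicates.take(10):
    print(pred, count)

spark.stop()
```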