Semantic Data Engineer / Data Scientist
Are you a talented developer looking for a remote job that lets you show your skills and get decent compensation? Join Upstaff.com, a platform that connects you with hand-picked startups and scale-ups in the US and Europe.
Summary
We are looking for a Semantic Data Engineer to build and train a model from scratch (similar to LLMs) for a data platform with federated learning features and privacy-preserving techniques.
Full-time, remote, long-term.
100% European timezone overlap; B2 or C1 English
Project Description
We are looking for a Semantic Data Engineer with ontology, data architecture, and AI/ML skills to work on developing a data platform with federated learning features and privacy-preserving techniques.
You will work on a platform that automates data ingestion, processing, and sharing with user-friendly, privacy-preserving, and scalable solutions for industrial manufacturing.
The platform will incorporate scalable and dynamic tools for creating and managing data spaces, handling complex data workflows, and ensuring modularity and privacy compliance.
Project:
- The focus is on developing a data management system for industrial manufacturing companies, targeting both IT and OT (Information Technology/Operational Technology) data.
- Use cases: compliance, EU certification, digital product passports, emissions reporting, collaboration between organizations, product development, etc. Our solution will supply the data layer (according to RAMI 4.0, see below).
- Example: BMW, Audi, and other EU auto manufacturers have thousands of suppliers, and a single product certification can mean following up tens of thousands of links across thousands of supply chains (via https://catena-x.net/en/).
- Automation and integration of data will be implemented with metadata only, without central permanent storage (such as data lakes), so machine learning, federated learning, and AI features are key.
- For the platform pilot, the plan is to start with compliance-related use cases (e.g., digital product passports or supply chain inspection).
Team Skills Coverage:
- Ontologies, Semantics, Knowledge Graphs
- Data contextualization & transformation
- Data federation & correlation
- Time-series data & streaming data handling
- RPA - Robotic Process Automation
- MLOps
- Policy engines
- Backend (Go, Python)
- Databases
- Infrastructure and DevOps (presumably Azure but can be changed to AWS or GCP)
- Security & Zero Trust (attribute-based access control, role-based access control, encryption, SSO)
- API & codeless integration agents (Zapier-like functionality)
Domain and Reference Companies
Broader industrial robotics application areas to be covered (additional areas in which to look for experience):
- IT/OT (Information Technology/Operational Technology) integration
- Digital twins, master data, single source of truth
- Data spaces (industrial data traceability)
- Manufacturing & compliance (digital product passports, emissions reporting)
- Predictive maintenance for industrial processes
- Industrial automation & data quality automation
Reference Industrial Data Management Technologies and Companies, including competitors and look-alikes:
- Siemens Digital Thread (proprietary)
- Cognite (data platform for Industry 4.0) https://cognite.com
- Litmus (Industrial data management) https://litmus.io
- Catena-X (EU automotive data space) https://catena-x.net/en/
- RAMI 4.0 (reference architecture model for Industry 4.0) https://ec.europa.eu/futurium/en/system/files/ged/a2-schweichhart-reference_architectural_model_industrie_4.0_rami_4.0.pdf
Core Responsibilities:
- Semantic Data Modeling: Designing and implementing semantic data models that capture the meaning and relationships of data elements.
- Ontology Development: Creating and managing ontologies, which are formal representations of knowledge that define concepts and their relationships.
- Knowledge Graph Construction: Building and maintaining knowledge graphs, which are networks of interconnected data points that allow for complex queries and reasoning.
- Semantic Querying and Analysis: Enabling users to query and analyze data using semantic knowledge, rather than just relying on traditional data structures.
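To make these responsibilities concrete, here is a minimal sketch of semantic data modeling and knowledge graph construction in Python with RDFLib; the `mfg` namespace, class names, and instances are illustrative assumptions, not the project's actual ontology.

```python
# Minimal sketch of semantic data modeling with RDFLib.
# The "mfg" namespace, classes, and instances are illustrative
# assumptions, not the project's actual ontology.
from rdflib import Graph, Namespace, Literal, RDF, RDFS

MFG = Namespace("http://example.org/manufacturing#")
g = Graph()
g.bind("mfg", MFG)

# Define a tiny class hierarchy (the semantic model).
g.add((MFG.Supplier, RDF.type, RDFS.Class))
g.add((MFG.Component, RDF.type, RDFS.Class))
g.add((MFG.supplies, RDF.type, RDF.Property))
g.add((MFG.supplies, RDFS.domain, MFG.Supplier))
g.add((MFG.supplies, RDFS.range, MFG.Component))

# Populate the knowledge graph with instance data.
g.add((MFG.AcmeGmbH, RDF.type, MFG.Supplier))
g.add((MFG.BrakeDisc, RDF.type, MFG.Component))
g.add((MFG.AcmeGmbH, MFG.supplies, MFG.BrakeDisc))
g.add((MFG.AcmeGmbH, RDFS.label, Literal("Acme GmbH")))

print(g.serialize(format="turtle"))
```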
Key Skills:
Machine Learning Fundamentals:
- Understanding of supervised, unsupervised, and semi-supervised learning techniques.
- Knowledge of algorithms relevant to semantic data, such as graph neural networks (GNNs), embeddings (e.g., Word2Vec, BERT), or clustering for entity resolution.
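As a rough illustration of clustering for entity resolution, the sketch below groups noisy supplier-name variants using character n-gram TF-IDF vectors and agglomerative clustering; the names and the distance threshold are illustrative assumptions.

```python
# Hedged sketch: clustering noisy supplier-name variants for
# entity resolution. The names and distance threshold are
# illustrative. Note: scikit-learn >= 1.2 uses `metric`;
# older versions call this parameter `affinity`.
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

names = ["Acme GmbH", "ACME GmbH.", "Acme Gmbh", "Bolt AG", "Bolt A.G."]
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectors = vec.fit_transform(names).toarray()

# Records that land in the same cluster are treated as
# candidate duplicates of one real-world entity.
labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.5,
    metric="cosine", linkage="average",
).fit_predict(vectors)
print(dict(zip(names, labels)))
```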
Semantic Data Processing:
- Ability to preprocess and transform RDF, OWL, or other semantic data formats for model input.
- Expertise in generating embeddings for entities and relations in knowledge graphs (e.g., TransE, DistMult, ComplEx).
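For reference, the core TransE idea fits in a few lines of NumPy: a triple (h, r, t) scores well when the head embedding translated by the relation embedding lands near the tail embedding. The entities, dimensionality, and random initialization below are illustrative; real training would optimize these vectors with a margin-based loss.

```python
# Minimal sketch of TransE scoring: score(h, r, t) is the
# negative distance between (head + relation) and tail.
# Entities, dimension, and random init are illustrative.
import numpy as np

rng = np.random.default_rng(0)
dim = 50
entity_emb = {e: rng.normal(size=dim)
              for e in ["acme", "brake_disc", "bolt"]}
relation_emb = {"supplies": rng.normal(size=dim)}

def transe_score(h, r, t):
    # Higher (less negative) score = more plausible triple.
    return -np.linalg.norm(entity_emb[h] + relation_emb[r] - entity_emb[t])

print(transe_score("acme", "supplies", "brake_disc"))
```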
Natural Language Processing (NLP):
- Skills in NLP techniques for semantic tasks like named entity recognition (NER), entity linking, or text-to-triple extraction.
- Familiarity with transformer models (e.g., BERT, RoBERTa) for semantic understanding.
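Below is a hedged sketch of NER-driven text-to-triple extraction with spaCy; the extraction rule (first ORG as subject, first verb's lemma as predicate, the following noun chunk as object) is a naive illustrative assumption, not a production pipeline.

```python
# Hedged sketch of NER and naive text-to-triple extraction
# with spaCy. Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme GmbH supplies brake discs to BMW.")

orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
print(orgs)  # named entities recognized as organizations

# Naive triple extraction: first ORG as subject, verb lemma
# as predicate, next noun chunk after the verb as object.
for token in doc:
    if token.pos_ == "VERB" and orgs:
        obj = next((nc.text for nc in doc.noun_chunks
                    if nc.start > token.i), None)
        if obj:
            print((orgs[0], token.lemma_, obj))
        break
```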
Graph-Based Machine Learning:
- Proficiency in working with graph-based models for tasks like link prediction, node classification, or knowledge graph completion.
- Understanding of graph algorithms and embeddings for semantic reasoning.
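As one concrete example of graph algorithms for link prediction, the sketch below scores a candidate edge with the classical Adamic-Adar heuristic via networkx; embedding models such as TransE or GNNs would replace this baseline in practice, and the toy graph is illustrative.

```python
# Sketch of a classical link-prediction baseline
# (Adamic-Adar index) on a toy supplier-component graph.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("acme", "brake_disc"), ("acme", "bolt"),
    ("bolt_ag", "bolt"), ("bolt_ag", "sensor"),
])

# Score candidate non-edges; higher = more likely missing link.
for u, v, score in nx.adamic_adar_index(G, [("acme", "bolt_ag")]):
    print(u, v, round(score, 3))
```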
Programming and Data Manipulation:
- Strong programming skills in Python (or similar languages like R or Java) for model development and data preprocessing.
- Experience with libraries for data manipulation (e.g., Pandas, NumPy) and semantic data handling (e.g., RDFLib, Owlready2).
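A minimal sketch of the RDFLib-plus-pandas workflow: parse a small Turtle document and flatten its triples into a DataFrame for downstream preprocessing. The Turtle snippet is illustrative sample data.

```python
# Hedged sketch: load RDF with RDFLib and flatten triples
# into a pandas DataFrame. The Turtle data is illustrative.
import pandas as pd
from rdflib import Graph

ttl = """
@prefix mfg: <http://example.org/manufacturing#> .
mfg:AcmeGmbH mfg:supplies mfg:BrakeDisc .
mfg:BoltAG mfg:supplies mfg:Bolt .
"""
g = Graph()
g.parse(data=ttl, format="turtle")

df = pd.DataFrame(
    [(str(s), str(p), str(o)) for s, p, o in g],
    columns=["subject", "predicate", "object"],
)
print(df)
```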
Model Training and Evaluation:
- Knowledge of training machine learning models, including hyperparameter tuning, cross-validation, and optimization.
- Ability to evaluate models using metrics like precision, recall, F1-score, or Mean Reciprocal Rank (MRR) for knowledge graph tasks.
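For the MRR metric specifically, a minimal sketch: for each test triple the model ranks all candidate entities, and MRR averages the reciprocal rank of the true entity. The ranks below stand in for illustrative model outputs.

```python
# Minimal sketch of Mean Reciprocal Rank (MRR) for knowledge
# graph completion. The ranks are illustrative model outputs:
# the position of the correct entity in each ranked candidate
# list (1 = ranked first).
def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

print(mean_reciprocal_rank([1, 3, 10]))  # -> ~0.478
```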
Semantic Querying and Reasoning:
- Proficiency in SPARQL for querying RDF datasets to prepare training data.
- Understanding of reasoning techniques to augment training datasets with inferred knowledge.
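Below is a short sketch of using SPARQL through RDFLib to pull supplier-component pairs as training data; the graph content and query shape are illustrative assumptions.

```python
# Hedged sketch: preparing training pairs with SPARQL via
# RDFLib. Graph content and query are illustrative.
from rdflib import Graph

ttl = """
@prefix mfg: <http://example.org/manufacturing#> .
mfg:AcmeGmbH mfg:supplies mfg:BrakeDisc .
mfg:BoltAG mfg:supplies mfg:Bolt .
"""
g = Graph()
g.parse(data=ttl, format="turtle")

q = """
PREFIX mfg: <http://example.org/manufacturing#>
SELECT ?supplier ?component
WHERE { ?supplier mfg:supplies ?component . }
"""
for row in g.query(q):
    print(row.supplier, row.component)
```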
Data Pipeline Development:
- Skills in building ETL pipelines to extract semantic data, transform it (e.g., into embeddings), and load it into training environments.
- Familiarity with data augmentation techniques for semantic datasets.
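As an illustration of the transform step of such a pipeline, the sketch below maps string triples to integer indices in the shape an embedding model's data loader typically expects; the triples themselves are illustrative.

```python
# Sketch of the "transform" step of a semantic ETL pipeline:
# mapping string triples to integer indices ready for an
# embedding model. The triples are illustrative.
import numpy as np

triples = [
    ("acme", "supplies", "brake_disc"),
    ("bolt_ag", "supplies", "bolt"),
]
entities = sorted({t[0] for t in triples} | {t[2] for t in triples})
relations = sorted({t[1] for t in triples})
e2i = {e: i for i, e in enumerate(entities)}
r2i = {r: i for i, r in enumerate(relations)}

# (head, relation, tail) index matrix for the training loader.
X = np.array([(e2i[h], r2i[r], e2i[t]) for h, r, t in triples])
print(X)
```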
Cloud and Scalability:
- Experience with AWS (or comparable cloud platforms such as Azure or GCP, per the infrastructure notes above)
- Knowledge of big data tools like Apache Spark for processing large-scale semantic datasets.
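A hedged PySpark sketch of large-scale semantic data processing: counting predicate frequencies over an N-Triples dump. The file path is a placeholder assumption, and the parsing is deliberately naive (in N-Triples the predicate is always the second whitespace-delimited token).

```python
# Hedged sketch: predicate frequency counts over a large
# N-Triples dump with PySpark. "data/dump.nt" is a placeholder
# path; real dumps would live in cloud object storage.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("triple-stats").getOrCreate()

lines = spark.sparkContext.textFile("data/dump.nt")
predicates = (
    lines.filter(lambda l: l.strip().endswith("."))
         .map(lambda l: l.split()[1])  # naive N-Triples split
         .map(lambda p: (p, 1))
         .reduceByKey(lambda a, b: a + b)
)
for pred, count in predicates.take(10):
    print(pred, count)

spark.stop()
```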