Overview
Company: Mactores
Business Type: Startup
Company Type: Service
Business Model: B2B
Funding Stage: Pre-seed
Industry: Data Analytics
Job Description
Mactores is a trusted leader in providing modern data platform solutions to businesses. Since 2008, Mactores has been enabling businesses to accelerate their value through automation by providing end-to-end data solutions that are automated, agile, and secure. We collaborate with customers to strategize, navigate, and accelerate an ideal path forward through digital transformation via assessments, migration, or modernization.
We are seeking a highly skilled and innovative Spark Engineer to join our team. In this role, you will design, develop, optimize, and operationalize high-performance data pipelines and applications using Apache Spark. The role requires hands-on expertise in distributed data processing, ETL engineering, performance tuning, and cluster management, as well as close collaboration with cross-functional teams to deliver reliable, scalable, and efficient data solutions.
What Will You Do
- Architect, design, and build scalable data pipelines and distributed applications using Apache Spark (Spark SQL, DataFrames, RDDs); a minimal PySpark sketch of such a pipeline follows this list.
- Develop and manage ETL/ELT pipelines to process structured and unstructured data at scale.
- Write high-performance code in Scala or PySpark for distributed data processing workloads.
- Optimize Spark jobs by tuning shuffle, caching, partitioning, memory, executor cores, and cluster resource allocation.
- Monitor and troubleshoot Spark job failures, cluster performance, bottlenecks, and degraded workloads.
- Debug production issues using logs, metrics, and execution plans to maintain SLA-driven pipeline reliability.
- Deploy and manage Spark applications on on-prem or cloud platforms (AWS, Azure, or GCP).
- Collaborate with data scientists, analysts, and engineers to design data models and enable self-serve analytics.
- Implement best practices around data quality, data reliability, security, and observability.
- Support cluster provisioning, configuration, and workload optimization on platforms like Kubernetes, YARN, or EMR/Databricks.
- Maintain version-controlled codebases, CI/CD pipelines, and deployment automation.
- Document architecture, data flows, pipelines, and runbooks for operational excellence.
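For illustration, a minimal PySpark sketch of the kind of pipeline described above: reading raw data, transforming it with the DataFrame API, and writing partitioned output, with a few of the tuning knobs mentioned (shuffle partitions, executor memory and cores, caching, repartitioning). All paths, column names, and configuration values are hypothetical placeholders, not Mactores specifics.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical configuration values; real jobs are tuned per cluster
# size and data volume.
spark = (
    SparkSession.builder
    .appName("orders-daily-etl")                      # placeholder job name
    .config("spark.sql.shuffle.partitions", "200")    # shuffle partition count
    .config("spark.executor.memory", "8g")            # executor memory
    .config("spark.executor.cores", "4")              # cores per executor
    .getOrCreate()
)

# Extract: read raw JSON events (path is a placeholder).
raw = spark.read.json("s3://example-bucket/raw/orders/")

# Transform: drop invalid rows and derive typed columns.
orders = (
    raw.filter(F.col("order_id").isNotNull())
       .withColumn("order_date", F.to_date("created_at"))
       .withColumn("amount", F.col("amount").cast("double"))
)

# Cache only if the DataFrame is reused by multiple downstream actions.
orders.cache()

daily_revenue = (
    orders.groupBy("order_date", "region")
          .agg(F.sum("amount").alias("revenue"),
               F.countDistinct("order_id").alias("order_count"))
)

# Load: write partitioned Parquet for downstream analytics.
(
    daily_revenue
    .repartition("order_date")            # reduce small files per partition
    .write.mode("overwrite")
    .partitionBy("order_date")
    .parquet("s3://example-bucket/curated/daily_revenue/")
)
```

The shuffle-partition, memory, and core settings shown are the kinds of knobs the tuning responsibilities above refer to; appropriate values depend entirely on the workload and cluster.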
What Are We Looking For
- Bachelor’s degree in Computer Science, Engineering, or a related field.
- 4+ years of experience building distributed data processing pipelines, with deep expertise in Apache Spark.
- Strong understanding of Spark internals (Catalyst optimizer, DAG scheduling, shuffle, partitioning, caching); see the plan-inspection sketch after this list.
- Proficiency in Scala and/or PySpark with strong software engineering fundamentals.
- Solid expertise in ETL/ELT, distributed computing, and large-scale data processing.
- Experience with cluster and job orchestration frameworks.
- Strong ability to identify and resolve performance bottlenecks and production issues.
- Familiarity with data security, governance, and data quality frameworks.
- Excellent communication and collaboration skills to work with distributed engineering teams.
- Ability to work independently and deliver scalable solutions in a fast-paced environment.
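As a small illustration of the Spark-internals knowledge called out above, the sketch below builds two hypothetical DataFrames, inspects the Catalyst-generated plans with explain(), and shows how broadcasting the small side of a join removes the shuffle (Exchange) stage. All names and sizes are invented for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("plan-inspection-demo").getOrCreate()

# Hypothetical tables; any two joinable DataFrames would do.
orders = spark.range(1_000_000).withColumn("customer_id", F.col("id") % 1000)
customers = spark.range(1000).withColumnRenamed("id", "customer_id")

# Default join: both sides are shuffled on the join key.
joined = orders.join(customers, "customer_id")

# Print the logical and physical plans produced by the Catalyst optimizer;
# shuffle boundaries appear as Exchange nodes in the physical plan.
joined.explain(mode="extended")

# Broadcasting the small side removes the shuffle for that join.
broadcast_joined = orders.join(F.broadcast(customers), "customer_id")
broadcast_joined.explain()
```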
Good to Have
- Experience with Databricks, AWS EMR, AWS Glue (Spark), or GCP Dataproc.
- Familiarity with workflow orchestration tools like Apache Airflow, Dagster, or Prefect.
- Exposure to streaming platforms such as Kafka, Kinesis, or Pub/Sub.
- Experience running Spark workloads on Kubernetes.
- Familiarity with data warehouse ecosystems (Snowflake, BigQuery, Redshift, Iceberg, Delta Lake, Hudi).
- Understanding of DevOps practices, CI/CD, and IaC (Terraform, CloudFormation).
- Knowledge of distributed logging and monitoring tools (Grafana, Prometheus, CloudWatch, ELK).
- Prior experience in high-scale production environments or data platform teams.