Data Transformation

Data transformation is the process of converting data from a source format into a destination format, often including cleaning, filtering, and joining.

Techniques:
1. Scrubbing
2. Deduplication
3. Format conversion
4. Summarization
5. Integration

Tools: Apache Spark, dbt, Talend, Informatica, SQL.
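
The first four techniques can be sketched on a small in-memory dataset using only the Python standard library. This is a minimal illustration, not how any of the listed tools implement them: the records, field names, accepted date formats, and function names are all hypothetical. Integration (joining data from several sources) is left out for brevity.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw records; field names and values are illustrative only.
RAW_ORDERS = [
    {"id": "1", "customer": "  Alice ", "amount": "10.50", "date": "2024-01-05"},
    {"id": "2", "customer": "bob", "amount": "20.00", "date": "05/01/2024"},
    {"id": "2", "customer": "bob", "amount": "20.00", "date": "05/01/2024"},  # duplicate
    {"id": "3", "customer": "ALICE", "amount": "n/a", "date": "2024-01-07"},  # bad amount
]


def scrub(record):
    """Scrubbing: normalize text fields; return None for unusable rows."""
    try:
        amount = float(record["amount"])
    except ValueError:
        return None
    return {**record, "customer": record["customer"].strip().lower(), "amount": amount}


def deduplicate(records):
    """Deduplication: keep the first record seen for each id."""
    seen = set()
    unique = []
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            unique.append(r)
    return unique


def convert_date(record):
    """Format conversion: accept two assumed input formats, emit ISO 8601."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            parsed = datetime.strptime(record["date"], fmt)
            return {**record, "date": parsed.date().isoformat()}
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {record['date']!r}")


def summarize(records):
    """Summarization: total amount per customer."""
    totals = defaultdict(float)
    for r in records:
        totals[r["customer"]] += r["amount"]
    return dict(totals)


def transform(raw):
    """Run the steps in order: scrub, deduplicate, convert, summarize."""
    scrubbed = [r for r in map(scrub, raw) if r is not None]
    converted = [convert_date(r) for r in deduplicate(scrubbed)]
    return converted, summarize(converted)
```

Calling `transform(RAW_ORDERS)` drops the row with the unparseable amount and the repeated `id`, leaving two cleaned records and one total per customer. Scrubbing runs before deduplication here so that duplicates are compared after normalization; a real pipeline might order the steps differently.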

Related term: data-validation (/terms/data-validation)

🧒 Explain Like I'm 5

Imagine you have a big bucket of mixed LEGO blocks and you want to build a specific red car. Data transformation is like sorting out only the red pieces, making sure they aren't broken, and clicking some together to make the wheels before you even start building the main car. It’s preparing your materials so everything fits perfectly together.

🤓 Expert Deep Dive

Technically, transformation logic has been shifting from 'Imperative' code (scripts written in Python or Java that spell out each step) to 'Declarative' code (SQL models, often managed with a tool such as dbt, that describe the desired result). The rise of 'ELT' (Extract, Load, Transform) has driven much of this change: instead of transforming data on a middle-tier server before loading it, raw data is loaded into a data warehouse or data lake first and then transformed in place with 'SQL-based models'. Because the raw data is preserved, a model can be rebuilt from it at any time, which makes it easier to write 'Idempotent' transformations, where re-running a process on the same input always yields the same result. Common technical operations include 'Casting' (changing data types), 'Flattening' (turning nested JSON into flat tables), and 'Window Functions' (calculations over a set of related rows, such as running totals or moving averages). A critical subset is 'Data Anonymization', where sensitive fields are hashed or masked during the transformation to support privacy compliance.
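
The operations named above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a production pattern: the event structure, the dotted-key naming, the schema, the salt value, and all function names are hypothetical, and in an ELT setup the same steps would more likely be written as SQL models.

```python
import hashlib

# Hypothetical nested event, as it might arrive from a JSON source.
RAW_EVENT = {
    "user": {"email": "alice@example.com", "address": {"city": "Oslo"}},
    "amount": "19.90",
    "items": "3",
}


def flatten(obj, prefix=""):
    """Flattening: turn nested dicts into one flat dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def cast(row, schema):
    """Casting: apply a target type to each column named in the schema."""
    return {k: schema[k](v) if k in schema else v for k, v in row.items()}


def anonymize(row, fields, salt):
    """Anonymization: replace sensitive fields with a salted SHA-256 digest."""
    out = dict(row)
    for field in fields:
        digest = hashlib.sha256((salt + str(row[field])).encode("utf-8"))
        out[field] = digest.hexdigest()
    return out


def moving_average(values, window):
    """Window function: mean of each value and up to window-1 preceding values."""
    result = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]
        result.append(sum(frame) / len(frame))
    return result


def transform(event, salt="demo-salt"):
    """A pure function of its input, so re-running it yields the same output."""
    row = cast(flatten(event), {"amount": float, "items": int})
    return anonymize(row, ["user.email"], salt)
```

Because `transform` depends only on its arguments and writes nothing in place, calling it twice on `RAW_EVENT` returns equal rows, which is the idempotency property in miniature. `moving_average` mirrors what a SQL frame such as `AVG(x) OVER (ORDER BY t ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)` computes for a window of 2. A salted hash is only a simple form of masking and may not be sufficient on its own for a given privacy requirement.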

📚 Sources