Importance of Configuration Files in Python: A Focus on ETL Pipelines
This is one of the most important concepts in Python project development; parts of this page are adapted from an external reference.
Introduction
When building complex Python projects like ETL (Extract, Transform, Load) pipelines, managing configuration effectively becomes crucial. Configuration files are an essential tool in this process, providing a centralized way to manage settings, improve maintainability, and enhance flexibility. Let's dive into what configuration files are, how they are helpful, and how to use them in your ETL projects.
What Are Configuration Files?
Configuration files store settings and parameters that a program needs to run. Instead of hardcoding values into your code, which can lead to maintenance challenges and inflexibility, configuration files allow you to separate configuration from code. These files can be in various formats, including JSON, YAML, INI, TOML, and environment variables.
- JSON: Easy to read and widely used, especially in web applications.
- YAML (YAML Ain't Markup Language): More human-readable than JSON, often used in DevOps and configuration management.
- INI: Simple format for key-value pairs, commonly used in older applications.
- TOML: Like INI files, TOML files define variables via key/value pairs. Unlike INI files, however, TOML stores the values of keys as the data type they are intended to be used as (integers, booleans, dates, and so on).
- Environment Variables: Managed by the operating system, suitable for sensitive information (for more details, check out my blog post on best habits of Python programming environment variables).
Why Use Configuration Files?
- Centralized Management: Configuration files allow you to manage settings in one place, making it easier to update and maintain.
- Flexibility: You can change configurations without modifying the codebase, facilitating easier adjustments across different environments (development, testing, production).
- Security: Sensitive information, like database credentials and API keys, can be managed securely using environment variables or encrypted configuration files.
- Scalability: As projects grow, configuration files help in managing complexity by organizing settings logically.
- Portability: Configuration files make it easier to share and deploy your code across different environments and systems.
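The flexibility point above can be illustrated with a small sketch: the same script serves different environments by switching one environment variable (the APP_ENV name and the settings shown are illustrative assumptions, not a convention from this article):

```python
import os

# Illustrative per-environment settings; in practice these would live in config files
configs = {
    "development": {"db_host": "localhost", "debug": True},
    "production": {"db_host": "db.internal", "debug": False},
}

# Pick the environment from APP_ENV without touching the code
env = os.getenv("APP_ENV", "development")
config = configs[env]
print(config["db_host"])
```

Deploying to production then only requires setting `APP_ENV=production`; no line of code changes.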
How to Use JSON Configuration Files in ETL Projects?
Let's consider an example ETL pipeline where we need to extract data from a source, transform it, and load it into a destination. We'll use a JSON configuration file to manage our settings.
- Step 1: Define the Configuration File: Create a config.json file with the necessary settings, such as source and destination details, transformation rules, and logging configurations (note that JSON does not allow comments):
{
  "source": {
    "type": "database",
    "host": "localhost",
    "port": 5432,
    "database": "source_db",
    "user": "source_user",
    "password": "source_password"
  },
  "destination": {
    "type": "database",
    "host": "localhost",
    "port": 5432,
    "database": "destination_db",
    "user": "dest_user",
    "password": "dest_password"
  },
  "transformations": [
    {"name": "remove_nulls", "fields": ["field1", "field2"]},
    {"name": "convert_types", "fields": {"field3": "int", "field4": "float"}}
  ],
  "logging": {
    "level": "INFO",
    "file": "etl.log"
  }
}
- Step 2: Load the Configuration File in Python: In the ETL script, load the configuration file to access the settings.
# Main python script
import json

def load_config(config_file):
    with open(config_file, 'r') as file:
        config = json.load(file)
    return config

config = load_config('config.json')
- Step 3: Use Configurations in ETL Pipeline: Utilize the loaded configurations in your ETL process.
# Main python script
import logging
import psycopg2
import psycopg2.extras

# Function to set up logging
def setup_logging(log_config):
    logging.basicConfig(filename=log_config['file'], level=log_config['level'])
    logging.info("Logging is set up.")

# Function to extract data from the source
def extract_data(source_config):
    connection = psycopg2.connect(
        host=source_config['host'],
        port=source_config['port'],
        database=source_config['database'],
        user=source_config['user'],
        password=source_config['password']
    )
    # Use a dict cursor so rows can be addressed by field name during transformation
    cursor = connection.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    cursor.execute("SELECT * FROM source_table")
    data = cursor.fetchall()
    connection.close()
    return data

# Function to transform the data
def transform_data(data, transformations):
    for transformation in transformations:
        if transformation['name'] == 'remove_nulls':
            data = [row for row in data if None not in row.values()]
        elif transformation['name'] == 'convert_types':
            for row in data:
                for field, dtype in transformation['fields'].items():
                    if dtype == 'int':
                        row[field] = int(row[field])
                    elif dtype == 'float':
                        row[field] = float(row[field])
    return data

# Function to load data into the destination
def load_data(destination_config, data):
    connection = psycopg2.connect(
        host=destination_config['host'],
        port=destination_config['port'],
        database=destination_config['database'],
        user=destination_config['user'],
        password=destination_config['password']
    )
    cursor = connection.cursor()
    for row in data:
        # Build the column list and placeholders from the row's keys
        columns = ', '.join(row.keys())
        placeholders = ', '.join(['%s'] * len(row))
        cursor.execute(
            f"INSERT INTO destination_table ({columns}) VALUES ({placeholders})",
            list(row.values())
        )
    connection.commit()
    connection.close()

# Main function to run the ETL pipeline
def main():
    config = load_config('config.json')
    setup_logging(config['logging'])
    logging.info("Starting ETL process.")
    data = extract_data(config['source'])
    transformed_data = transform_data(data, config['transformations'])
    load_data(config['destination'], transformed_data)
    logging.info("ETL process completed.")

if __name__ == '__main__':
    main()
- Step 4: Managing Sensitive Information: For sensitive information like passwords, consider using environment variables or secret management tools.
# Example with Environment Variables:
import os

source_password = os.getenv('SOURCE_PASSWORD')
destination_password = os.getenv('DEST_PASSWORD')

# Use these variables in the connection setup
source_config['password'] = source_password
destination_config['password'] = destination_password
Note: If we need to update the main function, then we can do it as follows:
def main():
    config = load_config('config.json')
    # Overwrite passwords with environment variables
    config['source']['password'] = os.getenv('SOURCE_PASSWORD', config['source']['password'])
    config['destination']['password'] = os.getenv('DEST_PASSWORD', config['destination']['password'])
    setup_logging(config['logging'])
    logging.info("Starting ETL process.")
    data = extract_data(config['source'])
    transformed_data = transform_data(data, config['transformations'])
    load_data(config['destination'], transformed_data)
    logging.info("ETL process completed.")
- Conclusion: Configuration files are invaluable in managing complex Python projects, especially ETL pipelines. They provide a centralized, flexible, and secure way to handle settings, making your codebase cleaner and easier to maintain. By separating configuration from code, you enhance the scalability and portability of your projects, ensuring they can adapt to various environments and requirements with minimal effort.
Setting Configuration Using Environment Variables
In the world of data engineering and machine learning, maintaining clean and secure code is paramount. One key practice that enhances both security and efficiency is the use of environment variables stored in a .env file in the project directory. The .env file is a simple text file used to store environment variables. These variables can include sensitive information like API keys, database credentials, and configuration settings.
- Security: Storing sensitive data in `.env` files keeps it out of your main codebase, reducing the risk of exposing secrets when sharing or uploading your code to platforms like GitHub.
- Configuration Management: `.env` files allow for easy management of different configurations for development, testing, and production environments. This makes your code more adaptable and easier to maintain.
- Simplicity: Loading environment variables from a `.env` file is straightforward, using libraries like `python-dotenv` in Python.
Loading the .env file
Create `.env` in the root directory of the project and save the environment variables, such as database URLs, API keys, and secret keys, which are sensitive information that you don't want to expose in your code. For example:
DATABASE_URL=your_database_url
API_KEY=your_api_key
SECRET_KEY=your_secret_key
To avoid sharing your secrets, add `.env` to the `.gitignore` file; this protects your sensitive information from being exposed publicly or shared unintentionally. To load the environment variables in your Python scripts, use:
from dotenv import load_dotenv
import os
load_dotenv()
database_url = os.getenv("DATABASE_URL")
api_key = os.getenv("API_KEY")
secret_key = os.getenv("SECRET_KEY")
where:
load_dotenv() loads the environment variables from the .env file, and
os.getenv() retrieves the value of the specified environment variable.
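Note that os.getenv() returns None when a variable is unset, which can surface as confusing errors much later. A minimal fail-fast sketch (require_env is a hypothetical helper name, not part of python-dotenv):

```python
import os

def require_env(name):
    """Return the environment variable's value, or raise a clear error if unset."""
    value = os.getenv(name)
    if value is None:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

os.environ["DATABASE_URL"] = "postgres://localhost/demo"  # simulated for the example
print(require_env("DATABASE_URL"))
```

Calling the helper for a variable that is not set raises immediately, with the variable's name in the message, instead of failing later with a cryptic NoneType error.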
Configure from .ini
ini files are perhaps the most straightforward configuration files available to us. ini files are highly suitable for smaller projects, mostly because these files only support hierarchies one level deep. ini files are essentially flat files, with the exception that variables can belong to groups. The example below demonstrates how variables sharing a common theme may fall under a common title, such as [DATABASE] or [LOGS] (the config.ini is):
[APP]
ENVIRONMENT = development
DEBUG = False
[DATABASE]
USERNAME = root
PASSWORD = p@ssw0rd
HOST = 127.0.0.1
PORT = 5432
DB = my_database
[LOGS]
ERRORS = logs/errors.log
INFO = data/info.log
[FILES]
STATIC_FOLDER = static
TEMPLATES_FOLDER = templates
This structure surely makes things easier for humans to understand, but the practicality of this structure goes beyond aesthetics. Let's parse this file with Python's configparser library to see what's really happening. We get started by saving the contents of config.ini to a variable called config in a Python script:
"""Load configuration from .ini file."""
import configparser
# Read local file `config.ini`.
config = configparser.ConfigParser()
config.read('settings/config.ini')  # suppose our config.ini is in the 'settings' folder of the home directory
Calling read() on an ini file does much more than store plain data; our config variable is now a unique data structure, offering various methods for reading and writing values. Try running print(config) to see for yourself: <configparser.ConfigParser object at 0x10e58c390>. Now, using the config variable, we can get various values from config.ini (here the get() method returns the value as a string):
# Get values from our .ini file
config.get('DATABASE', 'HOST')     # retrieves the value associated with the 'HOST' key from the 'DATABASE' section
config['DATABASE']['HOST']         # alternative way to access the same value
config.getboolean('APP', 'DEBUG')  # retrieves the value associated with the 'DEBUG' key from the 'APP'
                                   # section and converts it to a boolean
The main difference between the first and second forms is that, if the key doesn't exist, the bracket-style access raises a KeyError instead of a configparser.NoOptionError. The getboolean() method retrieves the value associated with the 'DEBUG' key from the 'APP' section and converts it to a boolean.
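configparser also offers typed getters and fallback values, which keep missing keys from raising at all. A short sketch (it parses an inline string mirroring the config.ini above, rather than reading a file):

```python
import configparser

ini_text = """
[APP]
ENVIRONMENT = development
DEBUG = False

[DATABASE]
PORT = 5432
"""

config = configparser.ConfigParser()
config.read_string(ini_text)  # parse from a string instead of a file

print(config.getint("DATABASE", "PORT"))                # 5432, as an int
print(config.getboolean("APP", "DEBUG"))                # False, as a bool
print(config.get("APP", "LOG_LEVEL", fallback="INFO"))  # INFO; the default when the key is missing
```

The fallback parameter is the configparser equivalent of os.getenv()'s default argument: a way to make optional settings truly optional.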
Configuring Applications from .yaml Files
YAML is a human-readable data serialization standard that can be used in conjunction with all programming languages and is often used to write configuration files. Its syntax is minimal and straightforward, which makes it easier for humans to read and write compared to JSON or XML.
- Why Use YAML for Configuration?
- Readability: YAML’s syntax is clean and easy to understand.
- Hierarchical Data: It naturally represents hierarchical data, making it suitable for complex configurations.
- Language-agnostic: YAML can be used with any programming language, making it versatile.
- Basic Structure of a .yaml File: YAML uses indentation to express nesting and key: value pairs for individual settings (a basic example appears further below).
- Configuring Applications with YAML: Let’s walk through configuring an application using a YAML file. We’ll use a Python application as an example.
- Step 1: Create a YAML Configuration File: First, create a file named config.yaml:
app:
  name: MyApp
  version: 1.0
database:
  host: localhost
  port: 5432
  user: admin
  password: secret
logging:
  level: INFO
  file: /var/log/myapp.log
- Step 2: Load the YAML Configuration in Your Application: To read and use this configuration in your application, you'll need a YAML parser. In Python, PyYAML is a popular library for this purpose. Load the configuration file in your Python application (the loading code is shown further below).
- Step 3: Using the Configuration: With the configuration loaded, you can now use it to set up your application. For example, configuring a database connection and logging:
import logging
import psycopg2

# Configure logging
logging.basicConfig(level=config['logging']['level'],
                    filename=config['logging']['file'],
                    format='%(asctime)s %(levelname)s:%(message)s')

# Connect to the database
conn = psycopg2.connect(
    host=config['database']['host'],
    port=config['database']['port'],
    user=config['database']['user'],
    password=config['database']['password']
)

logging.info('Application started')
logging.info('Connected to the database')
- Advanced Configurations: YAML supports more complex data structures, which can be very useful for advanced configurations.
- Nested Configurations: You can nest configurations to represent complex settings:
server:
  host: localhost
  ports:
    http: 80
    https: 443
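Once a nested configuration is loaded into a Python dict, deeply nested lookups get verbose. A hedged sketch of a small dotted-path helper (get_nested is a hypothetical helper, not a PyYAML function):

```python
def get_nested(config, path, default=None):
    """Walk a nested dict using a dotted path like 'server.ports.https'."""
    node = config
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

# The dict below mirrors the nested server configuration above
config = {"server": {"host": "localhost", "ports": {"http": 80, "https": 443}}}
print(get_nested(config, "server.ports.https"))               # 443
print(get_nested(config, "server.tls.cert", default="none"))  # none
```

Returning a default instead of raising makes the helper convenient for optional settings, at the cost of silently masking typos in the path.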
- Lists: YAML also supports lists, which can be useful for configurations like allowed hosts or user roles (see the example below).
- Anchors and Aliases: YAML supports anchors and aliases to reuse parts of the configuration, which helps avoid redundancy (see the example below).
- Error Handling: When working with configurations, it's essential to handle errors gracefully; for example, checking that required fields are present and logging errors (see the example below).
Basic structure example (config.yaml):
database:
  host: localhost
  port: 5432
  user: admin
  password: secret
features:
  enable_feature_x: true
  max_connections: 100
Loading the YAML configuration (Step 2):
import yaml

def load_config(file_path):
    with open(file_path, 'r') as file:
        config = yaml.safe_load(file)
    return config

config = load_config('config.yaml')

# Accessing configuration values
app_name = config['app']['name']
db_host = config['database']['host']
log_level = config['logging']['level']

print(f"App Name: {app_name}")
print(f"Database Host: {db_host}")
print(f"Log Level: {log_level}")
Lists example:
allowed_hosts:
  - localhost
  - 127.0.0.1
  - example.com
user_roles:
  - admin
  - user
  - guest
Anchors and aliases example:
defaults: &defaults
  adapter: postgres
  host: localhost

development:
  <<: *defaults
  database: dev_db

production:
  <<: *defaults
  database: prod_db
  host: prod_host
Error handling example:
try:
    db_host = config['database']['host']
except KeyError as e:
    logging.error(f'Missing configuration for {e}')
    raise
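Beyond catching a single KeyError, it can help to validate all required settings up front and report everything that is missing at once. A minimal sketch (validate_config is a hypothetical helper, and the config dict is an illustrative stand-in for a loaded YAML file):

```python
import logging

def validate_config(config, required):
    """Return a list of 'section.key' entries that are absent from the config."""
    missing = []
    for section, key in required:
        if section not in config or key not in config[section]:
            missing.append(f"{section}.{key}")
    return missing

config = {"database": {"host": "localhost"}, "logging": {}}
required = [("database", "host"), ("database", "port"), ("logging", "level")]

problems = validate_config(config, required)
if problems:
    logging.error("Missing configuration keys: %s", ", ".join(problems))
print(problems)  # ['database.port', 'logging.level']
```

Collecting every missing key before failing saves the fix-one-error, rerun, hit-the-next-error cycle.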
A more extensive config.yaml example:
appName: appName
logLevel: WARN

AWS:
  Region: us-east-1
  Resources:
    EC2:
      Type: "AWS::EC2::Instance"
      Properties:
        ImageId: "ami-0ff8a91507f77f867"
        InstanceType: t2.micro
        KeyName: testkey
        BlockDeviceMappings:
          - DeviceName: /dev/sdm
            Ebs:
              VolumeType: io1
              Iops: 200
              DeleteOnTermination: false
              VolumeSize: 20
    Lambda:
      Type: "AWS::Lambda::Function"
      Properties:
        Handler: "index.handler"
        Role:
          Fn::GetAtt:
            - "LambdaExecutionRole"
            - "Arn"
        Runtime: "python3.7"
        Timeout: 25
        TracingConfig:
          Mode: "Active"

routes:
  admin:
    url: /admin
    template: admin.html
    assets:
      templates: /templates
      static: /static
  dashboard:
    url: /dashboard
    template: dashboard.html
    assets:
      templates: /templates
      static: /static
  account:
    url: /account
    template: account.html
    assets:
      templates: /templates
      static: /static

databases:
  cassandra:
    host: example.cassandra.db
    username: user
    password: password
  redshift:
    jdbcURL: jdbc:redshift://<IP>:<PORT>/file?user=username&password=pass
    tempS3Dir: s3://path/to/redshift/temp/dir/
  redis:
    host: hostname
    port: port-number
    auth: authentication
    db: database
In this case, we can read the YAML file using the confuse library as:
# yaml_config.py
"""Load configuration from .yaml file."""
import confuse
config = confuse.Configuration('MyApp', __name__)
runtime = config['AWS']['Lambda']['Runtime'].get()
print(runtime)
Confuse also gets into the realm of building CLIs, allowing us to use our YAML file to inform arguments which can be passed to a CLI and their potential values:
# cli_config.py
import argparse
import confuse

config = confuse.Configuration('myapp')
parser = argparse.ArgumentParser()
parser.add_argument('--foo', help='a parameter')
args = parser.parse_args()
config.set_args(args)
print(config['foo'].get())