Data Engineering for Fraud Prevention - Angela Ramirez
Oct 6, 2023
Angela Ramirez, a data engineer with experience in fraud prevention, talks about her career journey, why knowing ML is useful for a data engineer, best practices for system design and data engineering, working with different types of databases including document and graph (network) databases, and how to select the appropriate database type for a given workload. She also discusses the importance of software engineering knowledge in data engineering, data quality check tooling, debugging failed jobs, and working with external data sources.
Working with external data sources requires establishing data contracts, understanding data reliability, and considering batch versus real-time data.
Dealing with failed jobs and debugging requires identifying root causes, relying on experience, documentation, and runbooks.
Recommended resources for learning data engineering include books on data engineering principles, designing data-intensive applications, and PySpark SQL.
Deep dives
Challenges of Working with External Data Sources
Working with external data sources can present challenges in terms of identifying the right teams to work with, obtaining proper documentation, and ensuring the data remains consistent and accessible. Data engineers must establish data contracts and understand the frequency and reliability of the data received. Additionally, considerations such as batch versus real-time data and potential changes to the data source must be taken into account.
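One concrete way to enforce a data contract on an external feed is to validate each incoming record against an agreed schema before it enters the pipeline. The field names, types, and sample records below are illustrative assumptions, not details from the conversation:

```python
# Lightweight data-contract check for records arriving from an external source.
# EXPECTED_SCHEMA stands in for whatever the contract with the upstream team says.
EXPECTED_SCHEMA = {
    "transaction_id": str,
    "amount": float,
    "timestamp": str,
}

def validate_record(record: dict) -> list[str]:
    """Return a list of contract violations for one record (empty if clean)."""
    violations = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return violations

good = {"transaction_id": "t-1", "amount": 9.99, "timestamp": "2023-10-06T12:00:00Z"}
bad = {"transaction_id": "t-2", "amount": "9.99"}

print(validate_record(good))  # → []
print(validate_record(bad))
```

Running such a check at the ingestion boundary surfaces contract breaks (a renamed column, a type change upstream) immediately, instead of letting them fail a job three steps later.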
Handling Failed Jobs and Debugging
One of the most challenging tasks as a data engineer is dealing with failed jobs and debugging issues. It requires identifying the root cause of the failure, whether it's a bug in the code, schema changes, database issues, or problematic data. Experience and familiarity with common errors and solutions can help in troubleshooting these issues. Documentation and runbooks can also aid in resolving future failures efficiently.
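The idea of codifying common errors and their fixes can be sketched as a small triage helper that maps an error message to a likely root-cause category, pointing the on-call engineer at the right runbook section. The categories and keyword lists here are hypothetical examples, not a real taxonomy from the episode:

```python
# Hypothetical triage helper: map a failed job's error message to a likely
# root-cause category for runbook lookup. Keyword lists are illustrative.
ROOT_CAUSES = {
    "schema_change": ["column not found", "cannot resolve", "schema mismatch"],
    "database_issue": ["connection refused", "timeout", "deadlock"],
    "bad_data": ["null value", "parse error", "out of range"],
}

def classify_failure(error_message: str) -> str:
    """Return the first matching root-cause category, or 'unknown'."""
    msg = error_message.lower()
    for cause, keywords in ROOT_CAUSES.items():
        if any(kw in msg for kw in keywords):
            return cause
    return "unknown"  # fall back to manual debugging

print(classify_failure("AnalysisException: cannot resolve 'user_id'"))
print(classify_failure("java.net.ConnectException: Connection refused"))
```

Even a crude classifier like this turns accumulated debugging experience into something the whole team can reuse, which is the same purpose documentation and runbooks serve.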
Recommended Resources for Learning
For those interested in learning more about data engineering, recommended resources include overview books on data engineering principles and on designing data-intensive applications. O'Reilly titles such as 'Fundamentals of Data Engineering' and 'Designing Data-Intensive Applications' are excellent options. Additionally, resources specific to PySpark SQL can be valuable for interview preparation.
Working with Different Databases and Tools
Data engineers frequently work with a variety of databases and tools, depending on their specific use cases. Commonly used tools and technologies include GCP services such as Cloud Dataproc, Cassandra for structured data at scale, PySpark for large-scale data processing, and pandas, whose recent Arrow (pyarrow) backend improves performance. Each tool has its own strengths and use cases based on factors like scalability, fault tolerance, and data structure.
Choosing the Right Database for Use Cases
Selecting the appropriate database for a particular use case depends on the nature of the data and the desired analysis. Wide-column stores like Cassandra handle structured data at scale, while document-based databases such as MongoDB, or the document capabilities in Elasticsearch, are preferable for dynamic data. Key-value stores like Redis suit fast lookups, and graph (network) databases like Neo4j suit graph-related use cases. The schema, scalability, and performance requirements should guide the decision-making process.
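These rules of thumb can be encoded as a small decision helper. The categories and example engines below are an illustrative sketch of the guidance above, not a definitive selection guide:

```python
# Illustrative mapping from rough data characteristics to a database family.
def suggest_database(data_shape: str, access_pattern: str) -> str:
    """Suggest a database family (with an example engine) for a workload."""
    if data_shape == "relational":
        return "relational (e.g. PostgreSQL)"
    if data_shape == "structured" and access_pattern == "high-write":
        return "wide-column (e.g. Cassandra)"
    if data_shape == "dynamic":
        return "document (e.g. MongoDB, Elasticsearch)"
    if data_shape == "graph":
        return "graph (e.g. Neo4j)"
    if access_pattern == "key-lookup":
        return "key-value (e.g. Redis)"
    return "unclear: profile the schema and query patterns first"

print(suggest_database("structured", "high-write"))
print(suggest_database("graph", "traversal"))
```

In practice the decision also weighs operational factors (hosting, team familiarity, consistency requirements), so a helper like this is only a starting point for the conversation.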