Program | Abracadata Summit

All session times refer to Coordinated Universal Time (UTC). Click on the calendar icon for a session to add it to your calendar at your local time.

Wednesday, June 11, 2025

13:50

Welcome

14:00

14:35

15:10

15:45

16:20

16:55

17:30

Open Models: Use cases, fine-tuning and tooling ecosystem

By Omar Sanseviero

06/11/2025 2:00 PM 06/11/2025 2:30 PM UTC AD25: Open Models: Use cases, fine-tuning and tooling ecosystem

Although many models work well out of the box, many times you want to customize them with your own data or for a specific domain, adding new features and boosting performance. In this talk, Omar will explain how to fine-tune LLMs, covering topics such as RLHF (reinforcement learning from human feedback) and quantization. Omar will share some of the latest advances in open models, the ecosystem of tools to build your own AI solutions, and how to evaluate and choose the best models according to your needs.

Although many models work well out of the box, many times you want to customize them with your own data or for a specific domain, adding new features and boosting performance. In this talk, Omar will explain how to fine-tune LLMs, covering topics such as RLHF (reinforcement learning from human feedback) and quantization. Omar will share some of the latest advances in open models, the ecosystem of tools to build your own AI solutions, and how to evaluate and choose the best models according to your needs.

2 packages for ethnicity imputation

By Jesus Ramos

06/11/2025 2:35 PM 06/11/2025 3:05 PM UTC AD25: 2 packages for ethnicity imputation

Imputation/inference of ethnicity is, sometimes contrary to privacy concerns, an important and often executed task in the data/feature engineering space for the public and social sector. We will take a look at 2 python packages to perform this task and look at their pros and cons, and where they work best.

Imputation/inference of ethnicity is, sometimes contrary to privacy concerns, an important and often executed task in the data/feature engineering space for the public and social sector. We will take a look at 2 python packages to perform this task and look at their pros and cons, and where they work best.

Scaling Machine Learning in Enterprises

By Francys Lanza

06/11/2025 3:10 PM 06/11/2025 3:40 PM UTC AD25: Scaling Machine Learning in Enterprises

Today, with the rapid advancements in technology, research, and tools, building machine learning models has become increasingly accessible. Delivering them consistently, reliably, and at scale in a real business environment? That’s where the real challenge begins.

In this talk, I’ll share how we designed and deployed an enterprise-grade, fully automated ML pipeline at Bankaya, built to enable continuous integration, delivery, and automated retraining of machine learning models across multiple business areas like risk, collections, and marketing.

This infrastructure empowers our data science team to retrain and redeploy models with minimal manual effort, significantly reducing the time from experimentation to production. But beyond speed, the real value lies in scalability and reusability: the same pipeline powers dozens of models across the company, all with standardized monitoring, governance, and traceability.

One of the most impactful results of this automation is that our data scientists can now focus on what truly drives model performance: understanding the dynamics behind each use case, designing better features, and improving model explainability. By reducing the manual workload in between, we’ve enabled the team to concentrate on higher-value tasks that enhance both accuracy and transparency.

By building this unified system, we transformed machine learning from isolated experiments into an enterprise capability, turning models into products ready to scale with the business.

Today, with the rapid advancements in technology, research, and tools, building machine learning models has become increasingly accessible. Delivering them consistently, reliably, and at scale in a real business environment? That’s where the real challenge begins.

In this talk, I’ll share how we designed and deployed an enterprise-grade, fully automated ML pipeline at Bankaya, built to enable continuous integration, delivery, and automated retraining of machine learning models across multiple business areas like risk, collections, and marketing.

Responsible Machine Learning with Post Modeling

By Liliana Millan

06/11/2025 3:45 PM 06/11/2025 4:15 PM UTC AD25: Responsible Machine Learning with Post Modeling

In real-world projects, modeling is not the most important aspect—it’s just one component that interacts with others to create a solution that should improve the outcomes of the people who will be affected by it. In this talk, we will review the different components that make up a complete and responsible deployed AI/ML solution and focus on post-modeling, a part that is rarely explored in both academic and industry settings.

In real-world projects, modeling is not the most important aspect—it’s just one component that interacts with others to create a solution that should improve the outcomes of the people who will be affected by it. In this talk, we will review the different components that make up a complete and responsible deployed AI/ML solution and focus on post-modeling, a part that is rarely explored in both academic and industry settings.

Machine Learning Informed Financial Assistance for Homelessness Prevention

By Kasun Amarasinghe

06/11/2025 4:20 PM 06/11/2025 4:50 PM UTC AD25: Machine Learning Informed Financial Assistance for Homelessness Prevention

Local governments implement Rental Assistance Programs to provide temporary financial assistance to renters struggling to pay their rent to keep them stably housed and prevent homelessness. However, as the number of vulnerable households increases and funding for rental assistance programs remains limited, program administrators must prioritize households to help. Although identifying and prioritizing the most vulnerable individuals is a top concern, the typical practice is to prioritize individuals using simple heuristics such as “first-come-first-served”. This talk will describe a collaborative effort between our team at Carnegie Mellon University and the Allegheny County Department of Human Services, which examined the utility of using Machine Learning (ML) models to inform the prioritization of rental assistance resources to minimize eviction-caused homelessness.

Local governments implement Rental Assistance Programs to provide temporary financial assistance to renters struggling to pay their rent to keep them stably housed and prevent homelessness. However, as the number of vulnerable households increases and funding for rental assistance programs remains limited, program administrators must prioritize households to help. Although identifying and prioritizing the most vulnerable individuals is a top concern, the typical practice is to prioritize individuals using simple heuristics such as “first-come-first-served”. This talk will describe a collaborative effort between our team at Carnegie Mellon University and the Allegheny County Department of Human Services, which examined the utility of using Machine Learning (ML) models to inform the prioritization of rental assistance resources to minimize eviction-caused homelessness.

Building LLM-Powered Apps with Apache Airflow

By Ephraim Anierobi

06/11/2025 4:55 PM 06/11/2025 5:25 PM UTC AD25: Building LLM-Powered Apps with Apache Airflow

In this session, we’ll unpack how Apache Airflow orchestrates the end-to-end workflows behind Ask Astro—an LLM-powered assistant designed to answer questions about Airflow itself. We’ll walk through how the system manages data ingestion, embedding generation, vector index updates, and prompt execution. Whether working with open-source models or commercial APIs, you’ll gain practical insights into building and scaling LLM applications with Airflow.

In this session, we’ll unpack how Apache Airflow orchestrates the end-to-end workflows behind Ask Astro—an LLM-powered assistant designed to answer questions about Airflow itself. We’ll walk through how the system manages data ingestion, embedding generation, vector index updates, and prompt execution. Whether working with open-source models or commercial APIs, you’ll gain practical insights into building and scaling LLM applications with Airflow.

Workshop: Power your data orchestration with Apache Airflow 3

By Kenten Danas

06/11/2025 5:30 PM 06/11/2025 7:00 PM UTC AD25: Workshop: Power your data orchestration with Apache Airflow 3

Apache Airflow is the world’s most popular open source data orchestration tool, used by companies all over the world and in every industry to manage their most critical data workflows. Airflow 3.0, released in April 2025, is the most significant release in Airflow’s history, and brings a better user experience, stronger security, and the ability to run tasks anywhere, at any time. In this workshop, you’ll get hands-on experience with the new release and learn how to leverage new features like DAG versioning, backfills, data assets, and a new react-based UI.

Whether you’re writing traditional ELT/ETL pipelines or complex ML and GenAI workflows, you’ll learn how Airflow 3 will make your day-to-day work smoother and your pipelines even more flexible. This workshop is suitable for intermediate to advanced Airflow users. Beginning users should consider taking the Airflow fundamentals course on the Astronomer Academy before attending this workshop.

Apache Airflow is the world’s most popular open source data orchestration tool, used by companies all over the world and in every industry to manage their most critical data workflows. Airflow 3.0, released in April 2025, is the most significant release in Airflow’s history, and brings a better user experience, stronger security, and the ability to run tasks anywhere, at any time. In this workshop, you’ll get hands-on experience with the new release and learn how to leverage new features like DAG versioning, backfills, data assets, and a new react-based UI.

13:50 - 14:00

Welcome

14:00 - 14:30.

Open Models: Use cases, fine-tuning and tooling ecosystem

By Omar Sanseviero

Although many models work well out of the box, many times you want to customize them with your own data or for a specific domain, adding new features and boosting performance. In this talk, Omar will explain how to fine-tune LLMs, covering topics such as RLHF (reinforcement learning from human feedback) and quantization. Omar will share some of the latest advances in open models, the ecosystem of tools to build your own AI solutions, and how to evaluate and choose the best models according to your needs.

14:35 - 15:05.

2 packages for ethnicity imputation

By Jesus Ramos

Imputation/inference of ethnicity is, sometimes contrary to privacy concerns, an important and often executed task in the data/feature engineering space for the public and social sector. We will take a look at 2 python packages to perform this task and look at their pros and cons, and where they work best.

15:10 - 15:40.

Scaling Machine Learning in Enterprises

By Francys Lanza

Today, with the rapid advancements in technology, research, and tools, building machine learning models has become increasingly accessible. Delivering them consistently, reliably, and at scale in a real business environment? That’s where the real challenge begins.

In this talk, I’ll share how we designed and deployed an enterprise-grade, fully automated ML pipeline at Bankaya, built to enable continuous integration, delivery, and automated retraining of machine learning models across multiple business areas like risk, collections, and marketing.

15:45 - 16:15.

Responsible Machine Learning with Post Modeling

By Liliana Millan

In real-world projects, modeling is not the most important aspect—it’s just one component that interacts with others to create a solution that should improve the outcomes of the people who will be affected by it. In this talk, we will review the different components that make up a complete and responsible deployed AI/ML solution and focus on post-modeling, a part that is rarely explored in both academic and industry settings.

16:20 - 16:50.

Machine Learning Informed Financial Assistance for Homelessness Prevention

By Kasun Amarasinghe

Local governments implement Rental Assistance Programs to provide temporary financial assistance to renters struggling to pay their rent to keep them stably housed and prevent homelessness. However, as the number of vulnerable households increases and funding for rental assistance programs remains limited, program administrators must prioritize households to help. Although identifying and prioritizing the most vulnerable individuals is a top concern, the typical practice is to prioritize individuals using simple heuristics such as “first-come-first-served”. This talk will describe a collaborative effort between our team at Carnegie Mellon University and the Allegheny County Department of Human Services, which examined the utility of using Machine Learning (ML) models to inform the prioritization of rental assistance resources to minimize eviction-caused homelessness.

16:55 - 17:25.

Building LLM-Powered Apps with Apache Airflow

By Ephraim Anierobi

In this session, we’ll unpack how Apache Airflow orchestrates the end-to-end workflows behind Ask Astro—an LLM-powered assistant designed to answer questions about Airflow itself. We’ll walk through how the system manages data ingestion, embedding generation, vector index updates, and prompt execution. Whether working with open-source models or commercial APIs, you’ll gain practical insights into building and scaling LLM applications with Airflow.

17:30 - 19:00.

Workshop: Power your data orchestration with Apache Airflow 3

By Kenten Danas

Apache Airflow is the world’s most popular open source data orchestration tool, used by companies all over the world and in every industry to manage their most critical data workflows. Airflow 3.0, released in April 2025, is the most significant release in Airflow’s history, and brings a better user experience, stronger security, and the ability to run tasks anywhere, at any time. In this workshop, you’ll get hands-on experience with the new release and learn how to leverage new features like DAG versioning, backfills, data assets, and a new react-based UI.

Thursday, June 12, 2025

14:00

14:35

15:10

15:45

16:20

Wrap up

17:50

AI in BI: From Lane Assist to Full Self-Driving Analytics

By Maxime Beauchemin

06/12/2025 2:00 PM 06/12/2025 2:30 PM UTC AD25: AI in BI: From Lane Assist to Full Self-Driving Analytics

The promise of AI in business intelligence is enticing—ask a question in natural language and receive an accurate chart or metric, no SQL required. While this works in controlled environments, real-world analytics demand precision where 90% accuracy isn’t sufficient. Drawing parallels to self-driving cars, we’re currently at a “lane assist” stage in BI, where AI can suggest and automate parts of the workflow but still requires human oversight.

In this talk, I’ll delve into the lessons learned from building AI features like text-to-SQL on top of Apache Superset at Preset. We’ll explore where AI excels in BI, where it falls short, and what it takes to build trust in AI-assisted analytics. Key topics include the importance of context-rich interfaces, seamless human-AI handoffs, and robust feedback loops. We’ll also discuss the challenges of real-world data environments and the necessity of transparency in AI outputs.

The promise of AI in business intelligence is enticing—ask a question in natural language and receive an accurate chart or metric, no SQL required. While this works in controlled environments, real-world analytics demand precision where 90% accuracy isn’t sufficient. Drawing parallels to self-driving cars, we’re currently at a “lane assist” stage in BI, where AI can suggest and automate parts of the workflow but still requires human oversight.

In this talk, I’ll delve into the lessons learned from building AI features like text-to-SQL on top of Apache Superset at Preset. We’ll explore where AI excels in BI, where it falls short, and what it takes to build trust in AI-assisted analytics. Key topics include the importance of context-rich interfaces, seamless human-AI handoffs, and robust feedback loops. We’ll also discuss the challenges of real-world data environments and the necessity of transparency in AI outputs.

Airflow 3 is all you need (for your AI/ML flows)

By Jarek Potiuk

06/12/2025 2:35 PM 06/12/2025 3:05 PM UTC AD25: Airflow 3 is all you need (for your AI/ML flows)

Do you consider productionizing your machine learning workflows and integrating them with all the data processing pipelines you already have? Does your team struggle with multiple people developing their own ways of running team and your monitoring and SRE guys scratching their head on how to efficiently monitor and run all those loosely connected pieces of logic? Airflow 3 to the rescue. With the shiny new Airflow 3 released earlier this year, being result of years of learning and listening to the users and invested time and effort of literally 100s of people to make it happen, you now have the most popular workflow engine that processes data available for you to make all your production pipelines manageable and rock solid. If all you want to hear is just humming of the processing of your data rather than panicky screams of your business that need things working - this talk is for you.

Learn - from an Airflow Maintainer how we adapted and reinvented Airflow to respond to the needs of modern Data + AI + ML processing.

Also there will be a glimpse of what’s next. By reinventing Airflow 3, we laid the foundations for more changes coming in the future, some of them aiming for even more seamless integration with all your future needs, that you might not even realise yet.

Do you consider productionizing your machine learning workflows and integrating them with all the data processing pipelines you already have? Does your team struggle with multiple people developing their own ways of running team and your monitoring and SRE guys scratching their head on how to efficiently monitor and run all those loosely connected pieces of logic? Airflow 3 to the rescue. With the shiny new Airflow 3 released earlier this year, being result of years of learning and listening to the users and invested time and effort of literally 100s of people to make it happen, you now have the most popular workflow engine that processes data available for you to make all your production pipelines manageable and rock solid. If all you want to hear is just humming of the processing of your data rather than panicky screams of your business that need things working - this talk is for you.

Event-Driven AI: Building Intelligent Systems with Atlas Stream Processing

By Alejandro Mancilla

06/12/2025 3:10 PM 06/12/2025 3:40 PM UTC AD25: Event-Driven AI: Building Intelligent Systems with Atlas Stream Processing

Discover how to build responsive AI systems that react to real-time data using event-driven architectures. This beginner-friendly session introduces Atlas Stream Processing as your foundation for creating dynamic AI applications that process continuous data streams and adapt to changing conditions. You’ll learn practical approaches to:

Understanding event-driven design principles for AI applications Processing data streams for real-time ML predictions Building resilient pipelines that handle errors gracefully Implementing simple event patterns for common AI use cases Monitoring your AI systems effectively

Through straightforward examples and demonstrations, we’ll cover essential patterns that work for recommendation engines, anomaly detection systems, and personalization features. No advanced streaming knowledge required!

Join us to see how combining event-driven thinking with Atlas Stream Processing can transform your AI implementations from static batch processes to dynamic, responsive systems that deliver immediate value.

Discover how to build responsive AI systems that react to real-time data using event-driven architectures. This beginner-friendly session introduces Atlas Stream Processing as your foundation for creating dynamic AI applications that process continuous data streams and adapt to changing conditions. You’ll learn practical approaches to:

Understanding event-driven design principles for AI applications Processing data streams for real-time ML predictions Building resilient pipelines that handle errors gracefully Implementing simple event patterns for common AI use cases Monitoring your AI systems effectively

Causal Inference for Homelessness Research

By Brian Blackwell

06/12/2025 3:45 PM 06/12/2025 4:15 PM UTC AD25: Causal Inference for Homelessness Research

Homelessness is the number one issue in public policy in Los Angeles, with over 75,000 people experiencing homelessness on any given night in Los Angeles County. Since its inception in 2017, the California Policy Lab (CPL) has been one of the leading academic research organizations focused on homelessness in Los Angeles, focused on the intersection of predictive modeling and causal inference. In this presentation, we will give a brief introduction to causal inference – a set of statistical methods for inferring relations of cause and effect in observational and experimental data – and describe some of CPL’s causal inference projects related to homelessness, including the impacts of time limited rental subsidies and homelessness prevention.

Homelessness is the number one issue in public policy in Los Angeles, with over 75,000 people experiencing homelessness on any given night in Los Angeles County. Since its inception in 2017, the California Policy Lab (CPL) has been one of the leading academic research organizations focused on homelessness in Los Angeles, focused on the intersection of predictive modeling and causal inference. In this presentation, we will give a brief introduction to causal inference – a set of statistical methods for inferring relations of cause and effect in observational and experimental data – and describe some of CPL’s causal inference projects related to homelessness, including the impacts of time limited rental subsidies and homelessness prevention.

Workshop: Your pipeline writes prompts, how to use Gemini for inference with Apache Beam

By Israel Herraiz

06/12/2025 4:20 PM 06/12/2025 5:50 PM UTC AD25: Workshop: Your pipeline writes prompts, how to use Gemini for inference with Apache Beam

In this workshop, you will use Apache Beam, Google Colab and Google AI Studio to use Gemini from our Beam pipeline, to enrich your data processing with the output provided by Gemini. This tutorial will be an example of how to leverage Beam ML and RunInference to easily integrate any AI or ML model into your pipelines.

To join this workshop you will need a GMail account and access to Google Colab and Google AI Studio (available for free with any GMail account)

In this workshop, you will use Apache Beam, Google Colab and Google AI Studio to use Gemini from our Beam pipeline, to enrich your data processing with the output provided by Gemini. This tutorial will be an example of how to leverage Beam ML and RunInference to easily integrate any AI or ML model into your pipelines.

To join this workshop you will need a GMail account and access to Google Colab and Google AI Studio (available for free with any GMail account)

17:50 - 18:00

Wrap up

14:00 - 14:30.

AI in BI: From Lane Assist to Full Self-Driving Analytics

By Maxime Beauchemin

The promise of AI in business intelligence is enticing—ask a question in natural language and receive an accurate chart or metric, no SQL required. While this works in controlled environments, real-world analytics demand precision where 90% accuracy isn’t sufficient. Drawing parallels to self-driving cars, we’re currently at a “lane assist” stage in BI, where AI can suggest and automate parts of the workflow but still requires human oversight.

In this talk, I’ll delve into the lessons learned from building AI features like text-to-SQL on top of Apache Superset at Preset. We’ll explore where AI excels in BI, where it falls short, and what it takes to build trust in AI-assisted analytics. Key topics include the importance of context-rich interfaces, seamless human-AI handoffs, and robust feedback loops. We’ll also discuss the challenges of real-world data environments and the necessity of transparency in AI outputs.

14:35 - 15:05.

Airflow 3 is all you need (for your AI/ML flows)

By Jarek Potiuk

Do you consider productionizing your machine learning workflows and integrating them with all the data processing pipelines you already have? Does your team struggle with multiple people developing their own ways of running team and your monitoring and SRE guys scratching their head on how to efficiently monitor and run all those loosely connected pieces of logic? Airflow 3 to the rescue. With the shiny new Airflow 3 released earlier this year, being result of years of learning and listening to the users and invested time and effort of literally 100s of people to make it happen, you now have the most popular workflow engine that processes data available for you to make all your production pipelines manageable and rock solid. If all you want to hear is just humming of the processing of your data rather than panicky screams of your business that need things working - this talk is for you.

15:10 - 15:40.

Event-Driven AI: Building Intelligent Systems with Atlas Stream Processing

By Alejandro Mancilla

Discover how to build responsive AI systems that react to real-time data using event-driven architectures. This beginner-friendly session introduces Atlas Stream Processing as your foundation for creating dynamic AI applications that process continuous data streams and adapt to changing conditions. You’ll learn practical approaches to:

Understanding event-driven design principles for AI applications Processing data streams for real-time ML predictions Building resilient pipelines that handle errors gracefully Implementing simple event patterns for common AI use cases Monitoring your AI systems effectively

15:45 - 16:15.

Causal Inference for Homelessness Research

By Brian Blackwell

Homelessness is the number one issue in public policy in Los Angeles, with over 75,000 people experiencing homelessness on any given night in Los Angeles County. Since its inception in 2017, the California Policy Lab (CPL) has been one of the leading academic research organizations focused on homelessness in Los Angeles, focused on the intersection of predictive modeling and causal inference. In this presentation, we will give a brief introduction to causal inference – a set of statistical methods for inferring relations of cause and effect in observational and experimental data – and describe some of CPL’s causal inference projects related to homelessness, including the impacts of time limited rental subsidies and homelessness prevention.

16:20 - 17:50.

Workshop: Your pipeline writes prompts, how to use Gemini for inference with Apache Beam

By Israel Herraiz

In this workshop, you will use Apache Beam, Google Colab and Google AI Studio to use Gemini from our Beam pipeline, to enrich your data processing with the output provided by Gemini. This tutorial will be an example of how to leverage Beam ML and RunInference to easily integrate any AI or ML model into your pipelines.

To join this workshop you will need a GMail account and access to Google Colab and Google AI Studio (available for free with any GMail account)