Welcome to my summary of Week 4 at AI_devs. Following our exploration of data organization in Week 3, this week focused on building tools and interfaces for AI agents. We learned how to create modular, reusable components that enable AI agents to interact with external services and perform complex tasks autonomously.

Introduction

This week, we explored how to design and implement tools and interfaces that empower AI agents to operate autonomously. The lessons focused on five key areas:

  • Building modular, reusable tools
  • Document and web content processing
  • External API integrations
  • Managing task queues and async operations
  • Designing scalable AI agent infrastructure

Here’s what I learned throughout the week.

Day 1: Building Tools for AI Agents

In our previous lessons, we learned how to create basic integrations with LLMs. This week takes us to the next level - building a toolkit that lets an LLM work independently to complete assigned tasks.

Our code no longer does most of the work - instead, we’re creating tools that let LLMs make their own decisions. Each tool needs (see the sketch after this list):

  • A clear, unique name that helps the model choose the right tool for the job
  • A brief description of what it can and can’t do
  • Simple instructions in the form of prompts
  • Clear input/output structures so it can work with other tools
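Here’s a minimal sketch of what such a contract could look like in Go. The interface and its names are my own illustration, not code from the course:

```go
package tools

import "context"

// Tool is an illustrative contract for an agent tool.
type Tool interface {
	// Name is a short, unique identifier the model uses to pick a tool.
	Name() string
	// Description tells the model what the tool can and cannot do.
	Description() string
	// Execute accepts structured input and returns structured output,
	// so the result of one tool can feed directly into another.
	Execute(ctx context.Context, input map[string]any) (map[string]any, error)
}
```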

These tools work much like Linux apps - each has a specific purpose, but they can be combined to create more complex workflows. For example, we can build tools for:

  • Managing tasks and projects
  • Handling calendars and emails
  • Translating documents
  • Creating tests or audio content
  • Searching the internet
  • Sending notifications through Slack or SMS

These tools can work independently or together. An AI agent can handle complex commands like “Every morning, check these websites, summarize them, and email me the summary” by breaking them down into individual tool operations. The AI agent itself determines what steps are needed to complete the task - this logic isn’t handled by our code. We simply provide the set of tools.
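To make that concrete, here’s a rough sketch of the decision loop, extending the tools package from the sketch above. planNextStep is a hypothetical stand-in for the LLM call that picks the next action:

```go
package tools

import (
	"context"
	"errors"
	"fmt"
)

// Step is the structured decision we ask the model for at each turn.
type Step struct {
	Done     bool           // the model considers the task complete
	ToolName string         // which tool to run next
	Input    map[string]any // arguments for that tool
}

// planNextStep is a hypothetical stand-in for an LLM call that sees the
// task, the tool descriptions, and results so far, and returns a Step.
func planNextStep(ctx context.Context, state string, tools map[string]Tool) (Step, error) {
	return Step{Done: true}, nil // stubbed out in this sketch
}

// runAgent shows the division of labor: the model decides, our code
// merely executes the chosen tool and feeds the result back.
func runAgent(ctx context.Context, task string, tools map[string]Tool) error {
	state := task
	for i := 0; i < 10; i++ { // hard step limit as a safety net
		step, err := planNextStep(ctx, state, tools)
		if err != nil {
			return err
		}
		if step.Done {
			return nil
		}
		tool, ok := tools[step.ToolName]
		if !ok {
			return fmt.Errorf("model picked unknown tool %q", step.ToolName)
		}
		out, err := tool.Execute(ctx, step.Input)
		if err != nil {
			return err
		}
		state = fmt.Sprintf("%s\nresult of %s: %v", state, step.ToolName, out)
	}
	return errors.New("step limit reached before the task finished")
}
```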

In this lesson we focused on building a todo list manager as our first tool. It can:

  • Get project lists
  • Fetch task lists
  • Add, change, and remove tasks
  • Watch for task updates

Each tool becomes a mini-app that understands natural language. With automatic prompt testing, we can easily make changes and improvements. These tools form a network, handling tasks for us and other AI agents.

Practical Task: Image Analysis and Repair Assistant

Our daily challenge involved building an image processing system. We received a set of photos, many damaged or imperfect, along with an API offering repair tools: REPAIR, DARKEN, and BRIGHTEN. We built a system to:

  1. Download and analyze each photo
  2. Decide if it needed fixing
  3. Apply repairs through the API
  4. Generate descriptions for any people in the images

The solution combined vision models for analysis, decision-making for repairs, and natural language processing for descriptions, creating a workflow where each tool handled its specialized part of the process.
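As a rough illustration of the decision step, here’s how the vision model’s verdict can be mapped onto the API’s commands. askVisionModel is a hypothetical helper, not the actual task code:

```go
package photos

import (
	"context"
	"strings"
)

// RepairAction mirrors the commands the task's API accepted.
type RepairAction string

const (
	Repair   RepairAction = "REPAIR"
	Darken   RepairAction = "DARKEN"
	Brighten RepairAction = "BRIGHTEN"
	None     RepairAction = "NONE" // the photo is fine as-is
)

// askVisionModel is a hypothetical stand-in for the vision-model call;
// a real implementation would send the photo and prompt to the model.
func askVisionModel(ctx context.Context, photoURL, prompt string) (string, error) {
	return "NONE", nil // stubbed out in this sketch
}

// decideAction has the model judge one photo and maps its one-word
// verdict onto a repair command.
func decideAction(ctx context.Context, photoURL string) (RepairAction, error) {
	verdict, err := askVisionModel(ctx, photoURL,
		"Is this photo damaged, too dark, too bright, or fine? "+
			"Answer with exactly one word: REPAIR, DARKEN, BRIGHTEN or NONE.")
	if err != nil {
		return None, err
	}
	return RepairAction(strings.ToUpper(strings.TrimSpace(verdict))), nil
}
```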

Day 2: Building an Advanced Document Processing System

Our earlier lessons covered various aspects of document processing. Today we combined all these concepts to create a unified system that works with multiple data sources.

We built a simple yet powerful interface that lets AI agents perform common document operations:

  • Loading documents from various sources
  • Creating summaries
  • Answering questions about content
  • Translating between languages
  • Extracting specific information

The system handles tasks like:

  • “Go to https://… and list all mentioned tools”
  • “Download this DOCX file and create a summary”
  • “Translate this document from Polish to English”
  • “Answer questions a, b, c using files x, y, z”
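A sketch of such an interface in Go - the method set mirrors the operations above, but the names and the Document type are my own, not the course’s actual API:

```go
package docs

import "context"

// Document is a loaded piece of content, whatever its original source.
type Document struct {
	Source  string // URL, file path, and so on
	Content string // normalized text
}

// DocumentService mirrors the operations the lesson exposes to the agent.
type DocumentService interface {
	Load(ctx context.Context, source string) (Document, error)
	Summarize(ctx context.Context, doc Document) (string, error)
	Answer(ctx context.Context, doc Document, question string) (string, error)
	Translate(ctx context.Context, doc Document, from, to string) (string, error)
	Extract(ctx context.Context, doc Document, instruction string) (string, error)
}
```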

The lesson reinforced our previous knowledge about document formatting, database storage, and data retrieval while showing how to apply these concepts in practice.

Practical Task: Data Classification System

Today’s challenge focused on building a classification system. We received three data sets:

  • correct - examples of properly formatted data
  • incorrect - examples of improperly formatted data
  • verify - data requiring classification

Using Few-Shot Prompting, I created a system to classify entries in the ‘verify’ set as either correct or incorrect. The approach used known examples to teach the model the difference between properly and improperly formatted data, enabling accurate classification of new cases.
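In Go, assembling such a few-shot prompt can be as simple as this sketch (the function and labels are my own illustration):

```go
package classify

import (
	"fmt"
	"strings"
)

// buildFewShotPrompt shows the model labeled examples from both known
// sets, then asks it to label a new entry the same way.
func buildFewShotPrompt(correct, incorrect []string, candidate string) string {
	var b strings.Builder
	b.WriteString("Classify each entry as CORRECT or INCORRECT.\n\n")
	for _, ex := range correct {
		fmt.Fprintf(&b, "Entry: %s\nLabel: CORRECT\n\n", ex)
	}
	for _, ex := range incorrect {
		fmt.Fprintf(&b, "Entry: %s\nLabel: INCORRECT\n\n", ex)
	}
	fmt.Fprintf(&b, "Entry: %s\nLabel:", candidate) // the model completes this line
	return b.String()
}
```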


Day 3: Advanced Web Content Processing

Building on our document processing work from Day 2, we explored more sophisticated ways to handle web content. Instead of just downloading web pages as documents, we built systems that can actively navigate and interact with web content.

Our web processing logic works in two ways:

  • Full search mode: generating search queries and deciding which pages to download
  • Direct mode: fetching content from specific URLs
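A sketch of the dispatch between the two modes; the heuristic here is my own simplification:

```go
package web

import "net/url"

// chooseMode decides between the two modes: a well-formed absolute URL
// goes straight to direct mode, anything else through search first.
func chooseMode(input string) string {
	if u, err := url.ParseRequestURI(input); err == nil && u.Scheme != "" && u.Host != "" {
		return "direct" // fetch exactly this page
	}
	return "search" // generate queries, then decide which pages to download
}
```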

Practical Task: Web Navigation Agent

Today’s challenge involved building an AI agent that could search for information on a specially prepared website. The agent needed to:

  • Download page content
  • Check if the page contained the answer
  • Decide which page to visit next if needed

For implementation, I used:

  • github.com/go-rod/rod for web page interaction
  • github.com/JohannesKaufmann/html-to-markdown/v2/converter to convert HTML to Markdown (smaller and more LLM-friendly)
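Roughly, the fetching step looked like this - a trimmed sketch using the library’s top-level v2 helper rather than the full converter configuration:

```go
package main

import (
	"fmt"

	htmltomarkdown "github.com/JohannesKaufmann/html-to-markdown/v2"
	"github.com/go-rod/rod"
)

// fetchAsMarkdown loads a page in a headless browser and converts the
// rendered HTML to Markdown, which is smaller and easier for an LLM to read.
func fetchAsMarkdown(url string) (string, error) {
	browser := rod.New().MustConnect()
	defer browser.MustClose()

	page := browser.MustPage(url).MustWaitLoad()
	html, err := page.HTML()
	if err != nil {
		return "", err
	}
	return htmltomarkdown.ConvertString(html)
}

func main() {
	md, err := fetchAsMarkdown("https://example.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(md)
}
```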

The agent used a prompt that returned two possible actions:

  • ANSWER: when the required information was found
  • NAVIGATE_PAGE: suggesting the next page to visit
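Concretely, the prompt asked the model to reply with a small JSON object along these lines (the field names are illustrative):

```go
package agent

// Decision is the structured reply the prompt asked the model for.
type Decision struct {
	Action  string `json:"action"`   // "ANSWER" or "NAVIGATE_PAGE"
	Answer  string `json:"answer"`   // set when Action == "ANSWER"
	NextURL string `json:"next_url"` // set when Action == "NAVIGATE_PAGE"
}
```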

AI Agent log

The agent successfully navigated through the website, analyzing content and making decisions at each step until finding the required information.

Day 4: Integrating with External Services

After exploring web content processing, this lesson focused on integrating with external APIs, walking through several example tools:

  • Google Maps for route directions and location information
  • Spotify for music search and playback control
  • Resend for email communication
  • Voice message system using macOS’s ‘say’ command

The discussion centered on handling irreversible actions. When a tool can send emails or post messages, mistakes can’t be undone. The lesson explored programming safeguards to either catch errors or prevent them entirely.

Key points from the lesson:

  • Always limit model permissions to the absolute minimum
  • Include human verification for critical operations
  • Consider using deterministic code instead of LLMs for tasks requiring 100% accuracy
  • Design clear interfaces between models and external APIs
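As one example of the second point, here’s a sketch of a human-in-the-loop gate placed before any irreversible action (my own illustration):

```go
package safeguards

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// confirmAction gates an irreversible operation (sending an email,
// posting to Slack) behind explicit human approval.
func confirmAction(description string) (bool, error) {
	fmt.Printf("Agent wants to: %s\nProceed? [y/N]: ", description)
	line, err := bufio.NewReader(os.Stdin).ReadString('\n')
	if err != nil {
		return false, err
	}
	return strings.TrimSpace(strings.ToLower(line)) == "y", nil
}
```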

Practical Task: Grid Navigation System

Today’s challenge involved building a natural language navigation API. The system received travel descriptions like “move one square right, then all the way down” and needed to:

  • Parse natural language directions
  • Track position on a special grid
  • Return information about the final location

I built the system using prompts to interpret navigation commands while maintaining position tracking.
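The split follows Day 4’s advice: let the LLM parse the description into (direction, steps) pairs, and let deterministic code do the actual movement. A minimal sketch of that deterministic half, assuming a square grid:

```go
package grid

// Pos is a position on the grid (0,0 = top-left).
type Pos struct{ Row, Col int }

// applyMove moves the marker and keeps it inside an n×n grid.
func applyMove(p Pos, dir string, steps, size int) Pos {
	clamp := func(v int) int {
		switch {
		case v < 0:
			return 0
		case v >= size:
			return size - 1
		}
		return v
	}
	switch dir {
	case "up":
		p.Row = clamp(p.Row - steps)
	case "down":
		p.Row = clamp(p.Row + steps)
	case "left":
		p.Col = clamp(p.Col - steps)
	case "right":
		p.Col = clamp(p.Col + steps)
	}
	return p
}
```

A phrase like “all the way down” simply becomes a down move with a large step count, which the clamp turns into the bottom edge.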

Day 5: Building Scalable AI Agent Infrastructure

The final day focused on organizing data and managing complex AI agent operations. The lesson covered a database schema design that includes:

  • User conversation history
  • Message management with document links
  • Task tracking with action lists
  • Tool and document action relationships

The discussion emphasized two key aspects of AI agent systems:

Model Responsibility: The lesson explored how to properly divide work between LLMs and code. LLMs should only handle tasks that can’t be done programmatically, while everything else should be managed by code.

Request Management: Every API has its limits, which can interrupt task execution. For language models, these include:

  • Query count limits
  • Token limits per minute/day
  • Input/output token limits
  • Budget constraints
  • API availability issues
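One standard defense is wrapping every model call in retries with exponential backoff, so a temporary rate limit slows the task down instead of killing it. A generic sketch (my own, not from the lesson):

```go
package llm

import (
	"context"
	"time"
)

// withRetry retries a failed call with exponentially growing delays.
func withRetry[T any](ctx context.Context, attempts int, call func() (T, error)) (T, error) {
	var zero T
	var err error
	delay := time.Second
	for i := 0; i < attempts; i++ {
		var v T
		if v, err = call(); err == nil {
			return v, nil
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-time.After(delay):
			delay *= 2 // back off harder each time
		case <-ctx.Done():
			return zero, ctx.Err()
		}
	}
	return zero, err // last error after exhausting attempts
}
```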

Practical Task: Large Document Analysis System

Today’s challenge involved analyzing a large PDF document to answer specific questions. The main challenges were:

  • The document was too large to process in one LLM request
  • It contained both text and images, requiring special processing

I built a solution using:

  • marker (github.com/VikParuchuri/marker) to convert PDF to Markdown
  • GPT-4o to generate descriptions for extracted images
  • A chunking system to process the document in manageable pieces
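The chunking step itself can stay deliberately simple. A sketch with character-based windows and overlap, so an answer spanning a chunk boundary isn’t lost (parameters are illustrative):

```go
package chunks

// chunkText splits text into pieces of at most size runes, where each
// chunk repeats the last overlap runes of the previous one.
// size must exceed overlap.
func chunkText(text string, size, overlap int) []string {
	runes := []rune(text) // avoid splitting multi-byte characters
	var chunks []string
	for start := 0; start < len(runes); start += size - overlap {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break
		}
	}
	return chunks
}
```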

Wrapping Up

This week was all about empowering AI agents to work autonomously by designing modular tools and interfaces. We explored building reusable components for tasks like document and web content processing, external service integration, and managing asynchronous operations.

By focusing on dividing responsibilities between code and AI models, we ensured efficient and reliable task execution. From integrating APIs like Google Maps and Spotify to handling complex operations on large documents, this week demonstrated the power of thoughtful design in AI systems.

Thanks for reading! Stay tuned for next week’s insights and challenges.

Do you like this post? Share it with your friends!

You can also subscribe to my RSS channel for future posts.