Together AI Showcases Open Agentic Systems at GTC 2026: FlashAttention-4, ThunderAgent, Voice AI, and Production-Grade Inference — Research and Product Updates Highlight Open Source LLMs and AI Factory Capabilities
Category: Industry Trends, Tool Dynamics
Excerpt:
**Together AI**, as a diamond sponsor of **NVIDIA GTC 2026**, is showcasing its latest research and product innovations at Booth #1213 in San Jose from March 16 to 19. Today’s updates focus on open-source LLMs, voice AI capabilities, production-grade inference, and AI factory infrastructure. Key announcements include **FlashAttention-4** (up to 1.3× faster than cuDNN on NVIDIA Blackwell), the open-source **ThunderAgent** for agentic workloads (delivering a 3.6× throughput improvement), the **ATLAS-2** adaptive learning speculator, and a full-featured voice AI stack supporting real-time speech-to-text and text-to-speech. Together AI demonstrates how enterprises can transition from AI experiments to production deployment in minutes using its GPU clusters and inference platform.
San Jose, California — Together AI, a diamond sponsor at NVIDIA GTC 2026, is showcasing its latest research and product work at booth #1213 this week in San Jose. The company's updates emphasize open source LLMs, voice AI capabilities, production-grade inference, and AI factory infrastructure. Key research announcements include FlashAttention-4 (up to 1.3× faster than cuDNN on NVIDIA Blackwell), the open-source ThunderAgent system for agentic workloads (up to 3.6× higher throughput), and the ATLAS-2 adaptive learning speculator. Together AI is also demonstrating how enterprises can move from AI experiments to production deployment in minutes using its GPU clusters and inference platform.
📌 Key Highlights at a Glance
- Event: NVIDIA GTC 2026, San Jose, March 16-19
- Sponsorship: Diamond Sponsor, Booth #1213
- Theme: Open Agentic Systems — From Research to Production
- Research: FlashAttention-4, ThunderAgent, ATLAS-2, Reinforcement Learning API
- FlashAttention-4: Up to 1.3× faster than cuDNN on NVIDIA Blackwell GPUs
- ThunderAgent: Open-source, 3.6× throughput improvement for agentic workloads
- Voice AI: Real-time speech-to-text, text-to-speech, and voice agent capabilities
- Production Inference: Enterprise-grade inference optimization
- AI Factory: GPU clusters from 64 to 10,000+ NVIDIA GPUs
- Partnership: NVIDIA Cloud Partner in NVIDIA Partner Network
🚀 Together AI at GTC 2026: Overview
Together AI arrived at NVIDIA GTC 2026 with a comprehensive showcase of research breakthroughs and platform capabilities. As a diamond sponsor, the company is demonstrating its full-stack AI platform for inference, fine-tuning, and GPU clusters—all powered by cutting-edge research that bridges the gap between academic innovation and production deployment.
GTC 2026 Presence
| Element | Details |
|---|---|
| Sponsorship Level | Diamond Sponsor |
| Booth Location | #1213 |
| Event Dates | March 16-19, 2026 |
| Location | San Jose Convention Center, CA |
| Key Activities | Live demos, side events, technical talks |
"Join us from March 16–19 in San Jose as we showcase the latest research breakthroughs and new platform capabilities across open source LLMs, voice AI, production-grade inference, and AI factory infrastructure."
— Together AI Official Announcement, GTC 2026
Core Themes at GTC 2026
- Open Source LLMs: Support for leading open-source models with optimized inference
- Voice AI: Real-time speech-to-text and text-to-speech for voice agents
- Production Inference: Enterprise-grade inference optimization and deployment
- AI Factory: Scalable GPU infrastructure from experiments to production
⚡ FlashAttention-4: Breaking Performance Barriers
FlashAttention-4 is Together AI's latest attention optimization, reported to run up to 1.3× faster than cuDNN on NVIDIA Blackwell GPUs and aimed at both large language model inference and training.
FlashAttention-4 Performance
- Up to 1.3× faster than cuDNN on NVIDIA Blackwell GPUs
- Built for next-generation NVIDIA hardware architecture
- Efficient attention computation for longer contexts
- Available to the research community
Technical Significance
FlashAttention-4 builds on the lineage of Together AI's attention optimization research, addressing the fundamental computational bottleneck in transformer models. Key improvements include:
- IO-Aware Algorithm: Minimizes memory bandwidth bottlenecks for faster computation
- Blackwell Optimization: Native support for NVIDIA's latest GPU architecture
- Memory Efficiency: Enables processing of longer sequences within memory constraints
- Integration Ready: Compatible with major ML frameworks and model architectures
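The "IO-aware" and "memory efficiency" points above rest on an idea shared across the FlashAttention family: process keys and values one block at a time while keeping a running softmax, so the full score matrix never has to be stored. The following is a minimal pure-Python sketch of that idea for a single query vector. It is an illustration only, not FlashAttention-4's kernel; the function names and the `block_size` parameter are hypothetical.

```python
import math


def _score(q, k):
    # Scaled dot product between one query and one key.
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def naive_attention(q, keys, values):
    """Reference version: computes every score before normalizing."""
    scores = [_score(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, visiting K/V block by block with a running softmax."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0      # softmax denominator, relative to running_max
    acc = [0.0] * dim      # unnormalized output, relative to running_max
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [_score(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same values up to floating-point error. The tiled version only ever holds one block of scores at a time, which is what lets a GPU kernel keep its working set in fast on-chip memory.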
FlashAttention Evolution
| Version | Key Innovation | Performance Gain |
|---|---|---|
| FlashAttention-1 | IO-aware exact attention | 2-4× over baseline |
| FlashAttention-2 | Parallelism optimization | 2× over v1 |
| FlashAttention-3 | H100 optimization | 1.5-2× over v2 |
| FlashAttention-4 | Blackwell architecture native | Up to 1.3× over cuDNN on Blackwell |
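Each row in this table is measured against a different baseline, and each figure describes the attention kernel rather than a whole model. A rough way to translate a kernel-level gain into an end-to-end estimate is Amdahl's law. The sketch below assumes you know what fraction of total runtime is spent in attention; that fraction is workload-dependent and is not given in the announcement, so the example value is purely illustrative.

```python
def end_to_end_speedup(attention_fraction, kernel_speedup):
    """Amdahl's law: overall speedup when only the attention part gets faster.

    attention_fraction: share of total runtime spent in attention (0.0 to 1.0).
    kernel_speedup: how much faster the attention kernel itself becomes.
    """
    if not 0.0 <= attention_fraction <= 1.0:
        raise ValueError("attention_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    untouched = 1.0 - attention_fraction
    accelerated = attention_fraction / kernel_speedup
    return 1.0 / (untouched + accelerated)
```

For example, if attention were 40% of runtime (an assumed figure), `end_to_end_speedup(0.4, 1.3)` gives roughly 1.10×. The full 1.3× would only appear end to end if attention were the entire workload.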
🌩️ ThunderAgent: Open Agentic System
ThunderAgent is Together AI's open-source, program-aware system designed for serving and training agentic workloads—addressing the growing need for high-performance infrastructure to support AI agents.
ThunderAgent Performance
- Up to 3.6× throughput improvement for agentic workloads
- Open-sourced and available to the community
- Program-aware optimization for agent code
- Research foundation for agentic training
Key Capabilities
- 🎯 Program-Aware Execution: Understands and optimizes agent code structure for efficient execution
- 📊 High Throughput Serving: Designed for production-scale agent deployment with minimal latency
- 🔄 Training Integration: Supports both inference and training workflows for agentic systems
- 🤝 Open Source: Fully open-source with paper available for research community
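The announcement does not spell out how ThunderAgent uses program structure. As one hypothetical illustration of what "program-aware" serving could mean, the sketch below keeps every step of the same agent program on the same worker, so cached context from earlier steps stays close to later ones. This is an assumption made for illustration, not ThunderAgent's documented design; `pick_worker`, `route`, and the `(program_id, step)` request shape are all invented here.

```python
import hashlib


def pick_worker(program_id, num_workers):
    """Map a program id to a worker index with a stable hash."""
    digest = hashlib.sha256(program_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def route(requests, num_workers):
    """Group (program_id, step) requests by worker.

    All steps of one program land on the same worker, in arrival order.
    """
    assignment = {worker: [] for worker in range(num_workers)}
    for program_id, step in requests:
        worker = pick_worker(program_id, num_workers)
        assignment[worker].append((program_id, step))
    return assignment
```

A request-level scheduler that ignores program identity could scatter consecutive steps of one agent across workers. Sticky routing like this trades some load balance for locality.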
Significance for the Agent Ecosystem
ThunderAgent addresses a critical gap in the AI agent infrastructure stack. As PR Newswire reported, ThunderAgent is "the research foundation for how high-throughput agentic training will be built." This positions Together AI as a key infrastructure provider for the rapidly growing OpenClaw and agentic AI ecosystem.
🧭 ATLAS-2: Adaptive Learning Speculator
ATLAS-2 (AdapTive-LeArning Speculator System) is Together AI's advanced speculative decoding system that accelerates inference by predicting and pre-computing likely token sequences.
ATLAS-2 Key Features
- Speculative Decoding: Predicts future tokens to parallelize generation and reduce latency
- Adaptive Learning: Continuously improves predictions based on execution patterns
- Inference Acceleration: Significant speedup for autoregressive model inference
- Model Agnostic: Works across different model architectures and sizes
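To make the speculative decoding feature concrete, here is a toy sketch of the general technique with greedy verification: a cheap draft model proposes several tokens, and the large target model checks them and keeps the longest prefix it agrees with. This is not ATLAS-2's implementation. Both "models" are stand-in functions invented for the example, and the adaptive-learning part is not modeled.

```python
def target_next(prefix):
    # Toy "large" model: next token is the sum of the last two tokens mod 10.
    return (prefix[-1] + prefix[-2]) % 10


def draft_next(prefix):
    # Toy "small" model: guesses 0 whenever the last token is 7,
    # otherwise matches the target.
    if prefix[-1] == 7:
        return 0
    return (prefix[-1] + prefix[-2]) % 10


def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq


def speculative_decode(prompt, n_new, k=4):
    """Draft proposes k tokens; target verifies them in one pass."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # 2. The target checks each position. A real system scores all
        #    positions in a single batched forward pass; here it is a loop.
        target_passes += 1
        accepted = []
        for token in proposal:
            expected = target_next(seq + accepted)
            if token == expected:
                accepted.append(token)
            else:
                accepted.append(expected)  # keep the target's correction
                break
        else:
            # Every draft token matched, so the pass yields a bonus token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], target_passes
```

The output matches plain greedy decoding exactly. The saving comes from needing fewer target passes than generated tokens whenever the draft guesses well.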
Evolution from ATLAS
ATLAS-2 builds on the original ATLAS system, which Together AI described as a solution for making "large language models faster, cheaper, and more efficient." The new version introduces enhanced adaptive learning capabilities and improved speculation accuracy for production deployment.
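Speculation accuracy matters because it sets how many tokens each pass of the large model yields. Under the common simplifying assumption that each drafted token is accepted independently with probability `acceptance_rate`, the expected yield per pass has a closed form. The sketch below evaluates it. The rates used in the example are illustrative and are not ATLAS-2 measurements.

```python
def expected_tokens_per_pass(acceptance_rate, draft_len):
    """Expected tokens produced per target-model pass.

    Assumes each drafted token is accepted independently with the same
    probability, and that a pass always yields at least one token.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)
```

With a draft length of 4, raising the acceptance rate from 0.6 to 0.8 moves the expected yield from about 2.31 to about 3.36 tokens per pass, which is why a speculator that adapts to the workload can pay off.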
🎙️ Voice AI: Real-Time Speech Capabilities
Together AI's Voice AI stack provides a comprehensive solution for building real-time voice agents, combining speech-to-text (STT), large language models, and text-to-speech (TTS) on co-located infrastructure.
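Structurally, one conversational turn in such a stack is three stages chained together. The sketch below shows that shape with placeholder functions. It does not call any Together AI API; `make_voice_turn` and the `fake_*` stages are hypothetical stand-ins that a real deployment would replace with calls to hosted STT, LLM, and TTS models.

```python
from typing import Callable


def make_voice_turn(
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> Callable[[bytes], bytes]:
    """Compose three stages into a single audio-in, audio-out turn."""
    def turn(audio_in: bytes) -> bytes:
        transcript = stt(audio_in)   # speech-to-text
        reply = llm(transcript)      # language model response
        return tts(reply)            # text-to-speech
    return turn


# Placeholder stages so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")
```

Calling `make_voice_turn(fake_stt, fake_llm, fake_tts)(b"hello")` returns `b"You said: hello"`. Keeping the stages as swappable callables mirrors the article's point that different STT and TTS models can be chosen per use case.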
Voice AI Platform Components
🎤 Speech-to-Text (STT)
High-performance Whisper APIs for fast, accurate transcription and translation
- Whisper Large v3 support
- Streaming transcription
- Multi-language support
🗣️ Text-to-Speech (TTS)
Serverless open-source TTS with natural-sounding voices
- Orpheus TTS
- Kokoro TTS
- Minimax Speech 2.6 Turbo
- Rime Arcana
🤖 Voice Agents
End-to-end voice agent infrastructure
- Real-time response
- Natural conversation flow
- Low latency architecture
Performance Claims
Together AI has positioned its voice AI stack as "the fastest inference for real-time voice AI agents," addressing the fundamental challenge of speed in voice applications. Key advantages include:
- Streaming Whisper STT: Real-time transcription with minimal latency
- Co-located Infrastructure: STT, LLM, and TTS on same infrastructure for minimal network latency
- Multiple Voice Options: Support for various TTS models to match use case requirements
- Production Ready: Enterprise-grade reliability and scaling
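The co-location argument in this list can be expressed as a simple latency budget: total turn time is the compute time of each stage plus the network time for each hand-off between stages. The sketch below uses made-up millisecond figures purely to show the arithmetic. None of these numbers come from Together AI.

```python
def turn_latency_ms(stage_ms, inter_stage_hop_ms):
    """Total latency for one voice turn.

    stage_ms: ordered mapping of stage name to compute time in ms.
    inter_stage_hop_ms: network time for each hand-off between stages.
    """
    hand_offs = max(len(stage_ms) - 1, 0)
    return sum(stage_ms.values()) + hand_offs * inter_stage_hop_ms


# Illustrative, assumed numbers only.
EXAMPLE_STAGES = {"stt": 150, "llm": 300, "tts": 120}
separate_providers = turn_latency_ms(EXAMPLE_STAGES, inter_stage_hop_ms=40)
co_located = turn_latency_ms(EXAMPLE_STAGES, inter_stage_hop_ms=1)
```

With these assumed figures the turn takes 650 ms across separate providers and 572 ms when co-located. The model ignores streaming overlap between stages, which in practice can hide part of the compute time as well.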
"Through one platform, teams can route audio and text through models like Whisper Large v3, Minimax Speech 2.6 Turbo, Rime Arcana, Kokoro, and more."
— Together AI, "Build Real-Time Voice Agents"
⚡ Production-Grade Inference
Together AI demonstrates its production-grade inference capabilities at GTC 2026, showing how enterprises can deploy AI models at scale with optimized performance and cost efficiency.
Inference Platform Features
- High Throughput: Optimized for serving millions of requests with minimal latency
- Cost Efficient: Advanced batching and optimization reduce inference costs
- Model Variety: Support for 100+ open-source and commercial models
- Easy Integration: API-first design for seamless integration into applications
Inference Optimization Stack
| Technology | Purpose | Benefit |
|---|---|---|
| FlashAttention-4 | Attention optimization | Up to 1.3× faster attention than cuDNN on Blackwell |
| ATLAS-2 | Speculative decoding | Reduced generation latency |
| Custom Kernels | GPU optimization | Maximize hardware utilization |
| Dynamic Batching | Request optimization | Higher throughput |
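Of the techniques in this table, dynamic batching is the easiest to show in a few lines. The sketch below is a generic offline simulation, not Together AI's scheduler. It groups request arrival times into batches, closing a batch when it is full or when a new request arrives after the oldest one has waited longer than a limit. The function and parameter names are hypothetical.

```python
def dynamic_batches(arrivals_ms, max_batch_size, max_wait_ms):
    """Group sorted arrival timestamps (ms) into batches.

    A batch closes when it reaches max_batch_size, or when the next
    request arrives more than max_wait_ms after the batch's first one.
    """
    batches = []
    current = []
    for arrival in arrivals_ms:
        if current and arrival - current[0] > max_wait_ms:
            batches.append(current)
            current = []
        current.append(arrival)
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches
```

For arrivals `[0, 1, 2, 3, 50, 51, 120]` with a batch size of 4 and a 10 ms wait limit, this yields `[[0, 1, 2, 3], [50, 51], [120]]`. Bursts fill whole batches for throughput, while isolated requests are not held back waiting for company.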
🏭 AI Factory: GPU Clusters at Scale
Together AI's AI Factory capabilities are demonstrated at GTC 2026, showcasing how enterprises can deploy scalable AI infrastructure from experimentation to production.
Together GPU Clusters
- Minimum cluster size (entry point): 64 GPUs
- Maximum NVIDIA GPUs supported: 10,000+
- Time from experiment to production: minutes
- Official NVIDIA Cloud Partner
GPU Cluster Capabilities
- 🎯 Flexible Scaling: Scale from 64 GPUs to 10,000+ based on workload requirements
- ⚡ Rapid Deployment: Experiments to production in minutes, not weeks
- 🔒 Enterprise Security: Enterprise-grade security and compliance features
- 🤝 NVIDIA Partnership: Official NVIDIA Cloud Partner with validated architecture
NVIDIA Cloud Partnership
Together AI is an NVIDIA Cloud Partner in the NVIDIA Partner Network, providing validated infrastructure that meets NVIDIA's standards for performance and reliability. This partnership ensures customers receive optimized solutions backed by both companies' expertise.
📅 GTC Sessions and Events
Together AI is presenting multiple sessions and hosting events throughout GTC 2026:
Key Sessions
| Date | Time | Session |
|---|---|---|
| March 16 | 5:00 PM | Together GPU Clusters: Experiments to Production in Minutes! |
| March 17 | 11:00 AM - 1:30 PM | Executive Lunch: Next-Gen AI Factory Infrastructure (with 5C & NVIDIA) |
| March 17 | 2:00 PM PDT | Engineering Real-World LLM Inference: Bridging Open-Source and Production Systems |
| March 17 | 7:30 - 10:30 PM | Tokens After Hours with Together AI & Metronome |
Booth #1213 Activities
- Live Demos: Real-time demonstrations of inference, voice AI, and agentic systems
- Technical Consultations: One-on-one discussions with engineers and researchers
- Product Showcases: Latest platform capabilities and features
- Networking: Connect with Together AI team and community
🏁 Competitive Context
Together AI's GTC 2026 presence highlights its position in the competitive AI infrastructure landscape:
Together AI's Market Position
| Provider | Focus | Key Differentiator |
|---|---|---|
| Together AI | AI Native Cloud | Research-driven optimization, open source focus |
| OpenAI | Foundation Models | Proprietary models, GPT ecosystem |
| Anthropic | Foundation Models | Safety-focused, Claude models |
| Fireworks AI | Inference | Fast inference, model variety |
| Replicate | Model Deployment | Easy deployment, pay-per-use |
Together AI Differentiation
- 🔬 Research-Driven: Core innovations like FlashAttention and ThunderAgent originate from in-house research
- 🔓 Open Source Commitment: Strong support for open-source models and contributions back to community
- ⚡ Performance Focus: Obsessed with making AI faster, cheaper, and more efficient
- 🏭 Full Stack: Complete platform from GPU clusters to inference to fine-tuning
❓ Frequently Asked Questions
What is Together AI showcasing at GTC 2026?
Together AI is showcasing its latest research and product work, including FlashAttention-4 (up to 1.3× faster than cuDNN on NVIDIA Blackwell), the open-source ThunderAgent agentic system (up to 3.6× throughput improvement), the ATLAS-2 adaptive learning speculator, a Voice AI stack with real-time STT/TTS, and production-grade inference capabilities. The company is exhibiting at booth #1213 as a diamond sponsor.
What is FlashAttention-4?
FlashAttention-4 is Together AI's latest attention optimization technology, delivering up to 1.3× faster performance than NVIDIA's cuDNN on Blackwell GPUs. It represents the fourth generation of the company's FlashAttention research, focused on IO-aware algorithms and memory-efficient computation for transformer models.
What is ThunderAgent?
ThunderAgent is Together AI's open-source, program-aware system for serving and training agentic workloads. It delivers up to 3.6× throughput improvements for AI agent applications and is designed as the research foundation for high-throughput agentic training infrastructure.
What Voice AI capabilities does Together AI offer?
Together AI offers a comprehensive Voice AI stack including speech-to-text (Whisper Large v3 with streaming), text-to-speech (Orpheus, Kokoro, Minimax Speech 2.6 Turbo, Rime Arcana), and end-to-end voice agent infrastructure. The platform is designed for real-time, low-latency voice applications.
What is Together AI's AI Factory capability?
Together AI's AI Factory provides scalable GPU cluster infrastructure ranging from 64 to over 10,000 NVIDIA GPUs. As an NVIDIA Cloud Partner, Together AI enables enterprises to move from AI experiments to production deployment in minutes, with validated architecture and enterprise-grade security.
How is Together AI related to NVIDIA?
Together AI is an NVIDIA Cloud Partner in the NVIDIA Partner Network. The company is a diamond sponsor at GTC 2026 and provides NVIDIA-validated GPU infrastructure. Its research innovations, such as FlashAttention-4, are optimized for NVIDIA's latest GPU architectures, including Blackwell.
🎤 Industry Perspectives
"ThunderAgent is the research foundation for how high-throughput agentic training will be built. This open-source, program-aware system delivers up to 3.6× throughput improvements for agentic workloads."
— PR Newswire, March 5, 2026
"Together AI launches the fastest voice AI stack: streaming Whisper STT, serverless open-source TTS (Orpheus & Kokoro), solving the fundamental problem holding back voice applications."
— Together AI Blog
"Together AI is obsessed with performance. Making large language models faster, cheaper, and more efficient is their core mission, with ATLAS representing the Adaptive Learning Speculator System for inference acceleration."
— Together AI Research
👀 What to Watch For
- FlashAttention-4 Adoption: How quickly the research integrates into major ML frameworks
- ThunderAgent Ecosystem: Community adoption and contributions to the open-source project
- Voice AI Growth: Enterprise adoption of real-time voice agents
- GPU Cluster Expansion: Scaling of Together AI's infrastructure offerings
- Research Papers: Upcoming publications on FlashAttention-4 and ThunderAgent
- NVIDIA Partnership: Deeper integration with NVIDIA's AI factory ecosystem
- Competitive Response: How other inference providers respond to Together AI's innovations
The Bottom Line
Together AI's presence at GTC 2026 demonstrates the company's evolution from an inference provider to a comprehensive AI infrastructure company. Its research releases come with concrete performance claims: up to 1.3× faster attention than cuDNN on NVIDIA Blackwell for FlashAttention-4, and up to 3.6× higher throughput on agentic workloads for ThunderAgent.
The emphasis on open agentic systems positions Together AI to capture significant value from the rapidly growing AI agent ecosystem. By open-sourcing ThunderAgent and contributing research back to the community, the company is building goodwill while establishing its technologies as foundational infrastructure.
The Voice AI stack addresses a critical market need: real-time voice agents require low-latency STT, LLM inference, and TTS working together seamlessly. Together AI's co-located infrastructure approach targets the network latency between those stages, which the company describes as the fundamental problem holding back voice applications.
As an NVIDIA Cloud Partner with GPU cluster capabilities scaling to 10,000+ GPUs, Together AI is positioned to serve enterprises at any stage of their AI journey—from experiments to production factories. The GTC 2026 showcase underscores the company's ambition: to be the AI native cloud for the next generation of AI applications.
Stay tuned to our Industry Trends section for continued coverage of GTC 2026 and AI infrastructure innovations.










