Research Series: AI Model Comparison for Jewelry Photography
- ← Part 1: Baseline Capability Test
- Part 2: Head-to-Head Model Comparison (this article)
- Part 3: Studio Shots Comparison →
Abstract
As AI image generation matures, jewelry brands face a practical question: which models produce commercially viable product photography? Despite rapid advances in diffusion architectures, no systematic evaluation exists for this specialized domain. We present a pairwise comparison of 6 frontier models—Nano Banana Pro, Nano Banana, Gemini 2.5 Flash (Google), and FLUX.2 Pro, FLUX.2 Max, FLUX.2 Flex (Black Forest Labs)—across 270 head-to-head evaluations. We measure three dimensions: pairwise preference, ring accuracy relative to reference images, and photorealism. Our findings reveal that model performance is highly workflow-dependent: the leading Google model wins 89% of generation comparisons, while the leading Black Forest Labs model wins 70% of replacement comparisons. We introduce a “production-ready” metric combining accuracy and realism, and calculate cost-efficiency per usable image. These results have direct implications for model selection in commercial jewelry photography pipelines.
1. Introduction
1.1 The Problem
Jewelry e-commerce represents a $300+ billion global market where product photography directly impacts conversion rates. Traditional photography requires physical inventory, studio setups, and skilled photographers—creating bottlenecks for brands with large catalogs or frequent new releases.
AI image generation offers a potential solution: generate product photography from reference images alone. However, jewelry presents unique challenges that general-purpose benchmarks don’t capture:
- Fine detail reproduction: Prongs, pavé settings, and engravings must be accurately rendered
- Material properties: Metal reflectivity and gemstone refraction require precise handling
- Hand realism: On-hand shots demand anatomically correct, photorealistic skin
- Brand accuracy: The generated ring must match the reference exactly—not a similar ring
1.2 The Gap
Existing model evaluations focus on general image quality metrics (FID, CLIP scores) or broad categories (faces, landscapes, objects). No published work systematically evaluates frontier models for jewelry-specific tasks with commercially relevant success criteria.
1.3 Our Contribution
We present the first systematic evaluation of frontier image generation models for commercial jewelry photography. Our study:
- Compares 6 leading models across two distinct workflows (generation and replacement)
- Introduces domain-specific metrics: ring accuracy and production-readiness
- Reveals workflow-dependent performance inversions not visible in aggregate rankings
- Provides cost-efficiency analysis for production deployment decisions
2. Related Work
2.1 Diffusion Model Benchmarks
Standard benchmarks like DrawBench, PartiPrompts, and COCO evaluate general image generation quality. These capture broad capabilities but miss domain-specific requirements like product accuracy and fine detail reproduction.
2.2 Commercial Image Generation
Recent work has explored AI for product photography in fashion and furniture, but jewelry—with its reflective surfaces, intricate details, and strict accuracy requirements—remains understudied.
2.3 Prior Work in This Series
In Part 1 of this research, we evaluated 11 models on basic jewelry generation capability. Five models were eliminated due to fundamental failures (wrong object generation, severe artifacts). The remaining 6 models advanced to this systematic comparison.
3. Methodology
3.1 Models Evaluated
We evaluated 6 frontier models that demonstrated basic jewelry generation capability in Part 1:
| Model | Provider | Architecture | Cost/Image |
|---|---|---|---|
| Nano Banana Pro | Google | Imagen-based | $0.150 |
| Nano Banana | Google | Imagen-based | $0.039 |
| Gemini 2.5 Flash | Google | Multimodal | $0.039 |
| FLUX.2 Pro | Black Forest Labs | FLUX | $0.090 |
| FLUX.2 Max | Black Forest Labs | FLUX | $0.190 |
| FLUX.2 Flex | Black Forest Labs | FLUX | $0.315 |
All models were accessed via the Replicate API with default parameters. Costs reflect actual invoice data from December 2025.
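For illustration, a minimal sketch of the kind of API call behind each image. This assumes the official Replicate Python client; the model identifier and input fields shown are placeholders, not the exact slugs and parameters used in this study:

```python
import replicate  # pip install replicate; reads REPLICATE_API_TOKEN from the environment

# Hypothetical model identifier -- each model has its own Replicate slug.
MODEL = "black-forest-labs/flux-2-pro"

# Default parameters throughout; only the prompt and reference image vary.
output = replicate.run(
    MODEL,
    input={
        "prompt": "photorealistic on-hand shot of the reference ring",
        "image": open("ring_reference.jpg", "rb"),  # reference ring photo
    },
)
print(output)  # URL(s) of the generated image(s)
```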
3.2 Task Design
We evaluated two workflows representing common production use cases:
Workflow A: Generate
- Input: Ring reference image only
- Task: Generate a photorealistic on-hand shot from scratch
- Challenge: Model must create realistic hand anatomy while accurately reproducing the ring
Workflow B: Replace
- Input: Ring reference image + hand photograph
- Task: Replace existing ring in hand photo with reference ring
- Challenge: Model must preserve hand realism while accurately swapping the ring
3.3 Test Set
We selected 9 rings across 3 complexity levels:
| Complexity | Description | Count |
|---|---|---|
| Simple | Solitaire, single stone, minimal setting | 3 |
| Medium | Multiple stones, moderate detail | 3 |
| Complex | Pavé, clusters, intricate settings | 3 |
Each ring was processed by all 6 models in both workflows, yielding 108 total images (6 models × 9 rings × 2 workflows).
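The full run matrix is the Cartesian product of models, rings, and workflows. A minimal sketch of its enumeration (model and ring identifiers are hypothetical labels, not API slugs):

```python
from itertools import product

MODELS = ["nano-banana-pro", "nano-banana", "gemini-2.5-flash",
          "flux-2-pro", "flux-2-max", "flux-2-flex"]
RINGS = [f"ring_{i}" for i in range(1, 10)]  # 3 simple, 3 medium, 3 complex
WORKFLOWS = ["generate", "replace"]

jobs = list(product(MODELS, RINGS, WORKFLOWS))
assert len(jobs) == 108  # 6 models x 9 rings x 2 workflows
```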
3.4 Evaluation Protocol
Pairwise Comparison
We conducted round-robin pairwise comparisons: every model versus every other model for each ring-workflow combination (a code sketch follows the list below).
- 15 unique model pairs × 9 rings × 2 workflows = 270 comparisons
- Images displayed side-by-side with randomized left/right position
- Evaluator selected winner or tie for each pair
- No model labels shown during evaluation
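A minimal sketch of how such a comparison schedule can be constructed, assuming the hypothetical model labels from Section 3.3:

```python
import random
from itertools import combinations

MODELS = ["nano-banana-pro", "nano-banana", "gemini-2.5-flash",
          "flux-2-pro", "flux-2-max", "flux-2-flex"]

comparisons = []
for a, b in combinations(MODELS, 2):              # 15 unique model pairs
    for ring in range(1, 10):                     # 9 rings
        for workflow in ("generate", "replace"):  # 2 workflows
            left, right = random.sample([a, b], 2)  # randomize screen position
            comparisons.append({"left": left, "right": right,
                                "ring": ring, "workflow": workflow})

assert len(comparisons) == 270  # 15 pairs x 9 rings x 2 workflows
random.shuffle(comparisons)  # random presentation order; no model labels shown
```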
Ring Accuracy Rating
Each image was rated independently for ring accuracy:
| Rating | Definition |
|---|---|
| Exact | Perfect match to reference |
| Close | Minor variations, clearly the same ring |
| Similar | Same style but noticeable differences |
| Wrong | Different ring entirely |
Photorealism Rating
Each image was rated for AI-generated appearance:
| Rating | Definition | Commercial Viability |
|---|---|---|
| Photorealistic | Indistinguishable from real photo | Viable |
| Minor tells | Small artifacts, trained eye might detect | Viable |
| Noticeable | Clearly AI-generated | Marginal |
| Obviously AI | Severe artifacts, uncanny appearance | Not viable |
3.5 Production-Ready Metric
We define an image as “production-ready” if it meets both criteria:
Production-Ready = (Ring Accuracy: Exact OR Close) AND (Photorealism: Photorealistic OR Minor Tells)
This captures the minimum bar for commercial use: the ring must be recognizably correct, and the image must not appear obviously artificial.
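A minimal sketch of the metric as a predicate over the two rating scales defined above (the enum names are ours, for illustration):

```python
from enum import Enum

class Accuracy(Enum):
    EXACT = "exact"
    CLOSE = "close"
    SIMILAR = "similar"
    WRONG = "wrong"

class Realism(Enum):
    PHOTOREALISTIC = "photorealistic"
    MINOR_TELLS = "minor tells"
    NOTICEABLE = "noticeable"
    OBVIOUSLY_AI = "obviously AI"

def production_ready(accuracy: Accuracy, realism: Realism) -> bool:
    """Commercial bar: ring recognizably correct AND image not obviously artificial."""
    return (accuracy in (Accuracy.EXACT, Accuracy.CLOSE)
            and realism in (Realism.PHOTOREALISTIC, Realism.MINOR_TELLS))
```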
4. Results
4.1 Aggregate Rankings
Across all 270 comparisons, each model appeared in 90 head-to-head matchups (5 opponents × 9 rings × 2 workflows). The overall rankings were as follows; a sketch of the win-rate computation appears after the table:
| Rank | Model | Wins | Losses | Ties | Win Rate |
|---|---|---|---|---|---|
| 1 | Nano Banana Pro | 58 | 28 | 4 | 66.7% |
| 2 | FLUX.2 Max | 41 | 36 | 13 | 52.8% |
| 3 | Nano Banana | 42 | 40 | 8 | 51.1% |
| 4 | FLUX.2 Pro | 40 | 43 | 7 | 48.3% |
| 5 | FLUX.2 Flex | 34 | 48 | 8 | 42.2% |
| 6 | Gemini 2.5 Flash | 31 | 51 | 8 | 38.9% |
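The win rates above are consistent with counting a tie as half a win. A minimal sketch, assuming that convention:

```python
def win_rate(wins: int, losses: int, ties: int) -> float:
    """Win rate with ties counted as half a win."""
    total = wins + losses + ties  # 90 comparisons per model
    return (wins + 0.5 * ties) / total

print(f"{win_rate(58, 28, 4):.1%}")  # Nano Banana Pro -> 66.7%
```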
However, these aggregate numbers obscure a critical finding.
4.2 Workflow-Dependent Performance
When separated by workflow, the rankings invert dramatically.
Generate Workflow (ring image only):
| Rank | Model | Win Rate |
|---|---|---|
| 1 | Nano Banana Pro | 88.9% |
| 2 | Nano Banana | 58.9% |
| 3 | Gemini 2.5 Flash | 46.7% |
| 4 | FLUX.2 Flex | 42.2% |
| 5 | FLUX.2 Max | 35.6% |
| 6 | FLUX.2 Pro | 27.8% |
Replace Workflow (ring + hand reference):
| Rank | Model | Win Rate |
|---|---|---|
| 1 | FLUX.2 Max | 70.0% |
| 2 | FLUX.2 Pro | 68.9% |
| 3 | Nano Banana Pro | 44.4% |
| 4 | Nano Banana | 43.3% |
| 5 | FLUX.2 Flex | 42.2% |
| 6 | Gemini 2.5 Flash | 31.1% |
FLUX.2 Pro moves from last place (27.8%) in Generate to second place (68.9%) in Replace—a 41 percentage point swing.
4.3 Visual Comparison
The following figures illustrate the workflow-dependent quality differences.
Figure 1: Reference Ring (Medium Complexity)
Figure 2: Generate Workflow Results
- Nano Banana Pro: photorealistic, exact ring
- Nano Banana: minor tells, close ring
- Gemini 2.5 Flash: minor tells, similar ring
- FLUX.2 Pro: noticeable AI artifacts
- FLUX.2 Max: minor tells, close ring
- FLUX.2 Flex: minor tells, close ring
Generate workflow requires creating realistic hands from scratch. Google models produce more naturalistic skin texture and hand poses. Black Forest Labs models exhibit higher rates of visible AI artifacts.
Figure 3: Replace Workflow Results (Same Ring)
- All six models (Nano Banana Pro, Nano Banana, Gemini 2.5 Flash, FLUX.2 Pro, FLUX.2 Max, FLUX.2 Flex): photorealistic
Replace workflow preserves the original hand photograph. All models achieve photorealistic results when not required to generate hands from scratch.
4.4 Ring Accuracy
Ring accuracy measures how faithfully the model reproduces the reference ring.
Table 1: Ring Accuracy by Model and Workflow
| Model | Generate (% Exact or Close) | Replace (% Exact or Close) | Delta |
|---|---|---|---|
| Nano Banana Pro | 89% | 78% | -11% |
| FLUX.2 Flex | 78% | 78% | 0% |
| Nano Banana | 78% | 56% | -22% |
| FLUX.2 Max | 56% | 78% | +22% |
| Gemini 2.5 Flash | 56% | 67% | +11% |
| FLUX.2 Pro | 44% | 89% | +45% |
FLUX.2 Pro exhibits the largest workflow-dependent accuracy shift: 44% in Generate versus 89% in Replace. The model struggles to imagine rings correctly but excels at preserving them during image editing.
4.5 Photorealism (AI Look)
We rated each image for visible AI artifacts.
Table 2: Photorealism Distribution — Generate Workflow
| Model | Photorealistic | Minor Tells | Noticeable | Obviously AI |
|---|---|---|---|---|
| Nano Banana Pro | 100% | 0% | 0% | 0% |
| Nano Banana | 0% | 100% | 0% | 0% |
| Gemini 2.5 Flash | 0% | 100% | 0% | 0% |
| FLUX.2 Flex | 0% | 100% | 0% | 0% |
| FLUX.2 Max | 0% | 67% | 22% | 11% |
| FLUX.2 Pro | 0% | 44% | 33% | 22% |
Nano Banana Pro was the only model rated 100% photorealistic in Generate. FLUX.2 Pro and Max produced “obviously AI” images in 22% and 11% of Generate outputs respectively.
Table 3: Photorealism Distribution — Replace Workflow
| Model | Photorealistic |
|---|---|
| All 6 models | 100% |
All models achieved 100% photorealistic ratings in Replace workflow. Starting with a real hand photograph eliminates the AI-generated hand problem entirely.
4.6 Production-Ready Rates
Combining ring accuracy and photorealism yields the production-ready metric.
Table 4: Production-Ready Rate by Workflow
| Model | Generate | Replace |
|---|---|---|
| Nano Banana Pro | 89% | 78% |
| Nano Banana | 78% | 56% |
| FLUX.2 Flex | 78% | 78% |
| Gemini 2.5 Flash | 56% | 67% |
| FLUX.2 Max | 44% | 78% |
| FLUX.2 Pro | 33% | 89% |
Figure 4: Production-Ready Rate Inversion
- Nano Banana Pro: Generate ████████████████░░░░ 89% | Replace ██████████████░░░░░░ 78%
- FLUX.2 Pro: Generate ██████░░░░░░░░░░░░░░ 33% | Replace ████████████████░░░░ 89%
FLUX.2 Pro’s production-ready rate increases from 33% to 89% when switching from Generate to Replace—a near-complete inversion.
4.7 Failure Analysis
We categorized failure modes across workflows.
Table 5: Failure Distribution by Workflow
| Failure Type | Generate | Replace |
|---|---|---|
| Obviously AI appearance | 30 | 1 |
| Wrong finger placement | 28 | 4 |
| Ring does not match reference | 79 | 88 |
Generate workflow produces 30× more AI-appearance failures due to the hand generation challenge. Replace workflow produces slightly more ring accuracy failures, possibly due to editing artifacts.
Table 6: Failure Patterns by Model
| Model | Primary Failure Mode | Frequency |
|---|---|---|
| Gemini 2.5 Flash | Ring inaccuracy | 46 instances |
| FLUX.2 Pro | AI appearance (Generate) | 22% of outputs |
| FLUX.2 Max | AI appearance (Generate) | 11% of outputs |
| FLUX.2 Flex | Ring inaccuracy | 20 instances |
| Nano Banana | Ring inaccuracy | 28 instances |
| Nano Banana Pro | Ring inaccuracy | 19 instances (lowest) |
Gemini 2.5 Flash exhibited the poorest ring fidelity across both workflows. Nano Banana Pro had the lowest overall failure rate.
4.8 Ring Complexity Effects
We analyzed performance by ring complexity level.
Table 7: Win Rates by Ring Complexity
| Complexity | Top Model | Win Rate |
|---|---|---|
| Simple | FLUX.2 Max | 61.7% |
| Medium | Nano Banana Pro | 76.7% |
| Complex | Nano Banana Pro | 88.3% |
Black Forest Labs models compete effectively on simple rings. Google’s advantage increases with ring complexity—Nano Banana Pro achieves 88.3% win rate on complex multi-stone settings versus 61.7% for the best Black Forest Labs model.
Figure 5: Simple vs Complex Ring Performance
- Left: Nano Banana Pro, simple ring (win rate 61.7%)
- Center: FLUX.2 Max, simple ring (win rate 61.7%)
- Right: Nano Banana Pro, complex ring (win rate 88.3%)
Left/Center: Simple rings show competitive performance across providers. Right: Complex rings reveal Google’s advantage in fine detail reproduction.
5. Analysis
5.1 Why Do Rankings Invert?
The workflow-dependent performance inversion reflects fundamentally different task requirements.
Generate workflow requires the model to synthesize realistic human hands from its training distribution. Google's models, presumably trained on larger and more diverse image data, produce more naturalistic hand anatomy and skin texture. Black Forest Labs models exhibit higher rates of artifacts: uncanny skin rendering, anatomically incorrect finger positions, and obvious digital textures.
Replace workflow requires precise image editing: preserving the hand photograph while seamlessly integrating a new ring. Black Forest Labs’ FLUX architecture, designed for image-to-image transformation, excels at this task. The model can attend to the ring region specifically while leaving the hand largely unchanged.
5.2 The Photorealism Gap
The most striking finding is the photorealism difference between workflows. In Generate, only Nano Banana Pro achieved 100% photorealistic ratings. In Replace, all six models achieved 100%.
This suggests that hand generation—not ring rendering—is the primary source of AI artifacts in jewelry photography. When given a real hand photograph as reference, even models that struggle with generation produce commercially viable results.
5.3 Ring Accuracy Trade-offs
FLUX.2 Pro’s 45-point accuracy improvement in Replace (44% → 89%) indicates that the model’s ring rendering capability is intact—the problem in Generate is not ring reproduction but ring-in-context imagination. When the model must “decide” what ring to place on a generated hand, it often produces a plausible but incorrect ring. When editing an existing image, it can focus on accurate reproduction.
5.4 Cost-Efficiency Analysis
We calculated cost per production-ready image by dividing unit cost by production-ready rate.
Table 8: Cost per Production-Ready Image
| Model | Generate | Replace |
|---|---|---|
| Nano Banana | $0.05 | $0.07 |
| Gemini 2.5 Flash | $0.07 | $0.06 |
| Nano Banana Pro | $0.17 | $0.19 |
| FLUX.2 Pro | $0.27 | $0.10 |
| FLUX.2 Flex | $0.40 | $0.40 |
| FLUX.2 Max | $0.43 | $0.24 |
For Generate workflow, Nano Banana offers the best cost-efficiency at $0.05 per usable image. For Replace workflow, FLUX.2 Pro achieves $0.10 per usable image with the highest accuracy (89%).
FLUX.2 Flex is overpriced in both workflows—same production-ready rate as cheaper alternatives at 4-8× the cost.
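A minimal sketch reproducing this arithmetic, with unit costs from Section 3.1 and production-ready rates from Table 4 (model labels are the same hypothetical ones used earlier):

```python
COST_PER_IMAGE = {"nano-banana": 0.039, "gemini-2.5-flash": 0.039,
                  "nano-banana-pro": 0.150, "flux-2-pro": 0.090,
                  "flux-2-max": 0.190, "flux-2-flex": 0.315}

PRODUCTION_READY = {  # (generate, replace) rates from Table 4
    "nano-banana": (0.78, 0.56), "gemini-2.5-flash": (0.56, 0.67),
    "nano-banana-pro": (0.89, 0.78), "flux-2-pro": (0.33, 0.89),
    "flux-2-max": (0.44, 0.78), "flux-2-flex": (0.78, 0.78),
}

for model, cost in COST_PER_IMAGE.items():
    gen, rep = PRODUCTION_READY[model]
    print(f"{model}: generate ${cost / gen:.2f}, replace ${cost / rep:.2f}")
```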
6. Limitations
6.1 Sample Size
Our evaluation used 9 rings across 3 complexity levels. While sufficient to identify significant trends, a larger test set would provide higher confidence in the findings.
6.2 Single Evaluator
Pairwise comparisons and quality ratings were performed by a single evaluator. Inter-rater reliability studies would strengthen the methodology.
6.3 Prompt Variation
All images were generated using standardized prompts. Different prompting strategies might yield different relative performance.
6.4 Temporal Validity
Model capabilities evolve rapidly. These results reflect December 2025 model versions; future updates may change relative performance.
6.5 Product Category
This study evaluated rings only. Results may not generalize to necklaces, earrings, bracelets, or other jewelry categories with different visual characteristics.
7. Future Work
7.1 Extended Product Categories
Part 3 will evaluate the same models on additional shot types:
- Studio hero shots (white background, product-only)
- Flat lay compositions
- On-model shots for necklaces and earrings
7.2 Alternative Approaches
We will compare the Replace workflow winners (FLUX.2 Pro/Max) against mask-based inpainting approaches (Ideogram) and LoRA fine-tuned models.
7.3 Scale Testing
Production deployment requires consistent performance at scale. We will test batch consistency and failure rate stability across larger generation runs.
8. Conclusion: Two Product Options
We presented the first systematic evaluation of frontier image generation models for commercial jewelry photography. Our central finding—that model performance inverts between generation and replacement workflows—has a direct implication for product design:
A single-workflow platform would underserve half of all use cases. The data supports offering customers two distinct options.
8.1 The Case for Two Options
The performance inversion we observed is not a minor variation—it’s a complete ranking reversal:
| Metric | Generate Winner | Replace Winner |
|---|---|---|
| Win rate | Nano Banana Pro (89%) | FLUX.2 Max (70%) |
| Production-ready | Nano Banana Pro (89%) | FLUX.2 Pro (89%) |
| Photorealism | Nano Banana Pro (100%) | All models (100%) |
| Best value | Nano Banana ($0.05) | FLUX.2 Pro ($0.10) |
A platform built exclusively on Google models would fail at replacement tasks. A platform built exclusively on Black Forest Labs models would produce unacceptable AI artifacts in generation tasks. Neither approach serves all customer needs.
This study provides the empirical foundation for offering both workflows as first-class product options.
8.2 Option A: Generate From Scratch
What it is: AI creates the complete image—hand, ring, background—from a ring reference photo alone.
Pros
- No photography required — Customer needs only product images, not hand models or studio shoots
- Unlimited variety — Each generation produces a unique hand pose and composition
- Lower barrier to entry — Ideal for brands without existing photography assets
- Faster onboarding — Customer can generate images immediately after uploading ring photos
Cons
- Photorealism varies by model — Only Nano Banana Pro achieves 100% photorealistic results; other models show “minor tells” or worse
- Higher failure rate for complex rings — Models struggle to accurately reproduce intricate multi-stone settings
- AI appearance risk — 22% of FLUX.2 Pro and 11% of FLUX.2 Max outputs were rated “obviously AI” in this workflow
Model Recommendations
| Priority | Model | Production-Ready | Cost/Usable |
|---|---|---|---|
| Premium quality | Nano Banana Pro | 89% | $0.17 |
| Best value | Nano Banana | 78% | $0.05 |
| Budget option | Gemini 2.5 Flash | 56% | $0.07 |
| Avoid | FLUX.2 Pro, FLUX.2 Max | 33-44% | $0.27-0.43 |
Ideal Use Cases
- Brands launching new collections without existing lifestyle photography
- High-volume catalog generation where some variation is acceptable
- Simple to medium complexity rings (solitaires, basic multi-stone)
- Cost-sensitive applications where $0.05/usable is the target
8.3 Option B: Replace With Templates
What it is: AI swaps the ring in an existing hand photograph, preserving the original hand, pose, and lighting.
Pros
- 100% photorealistic — All models achieved photorealistic ratings; the real hand photo eliminates AI appearance issues entirely
- Consistent brand aesthetic — Same template produces visually cohesive catalog
- Higher ring accuracy — FLUX.2 Pro achieves 89% accuracy vs 44% in Generate
- Model flexibility — Performance differences between models are smaller; more options available
Cons
- Requires template library — Customer needs hand photography assets or access to a template library
- Less variety — Same template produces similar compositions; variety requires multiple templates
- Higher onboarding friction — Customers must upload or select templates before generating
Model Recommendations
| Priority | Model | Production-Ready | Cost/Usable |
|---|---|---|---|
| Best accuracy | FLUX.2 Pro | 89% | $0.10 |
| Premium alternative | FLUX.2 Max | 78% | $0.24 |
| Budget option | Gemini 2.5 Flash | 67% | $0.06 |
| Avoid | FLUX.2 Flex | 78% | $0.40 (overpriced) |
Ideal Use Cases
- Brands with existing hand photography they want to extend
- High-accuracy requirements where ring fidelity is critical
- Complex rings (pavé, clusters, intricate settings)
- Consistent catalog aesthetics across product lines
8.4 Pricing Implications
The cost structures differ significantly between options:
| Option | Budget Tier | Standard Tier | Premium Tier |
|---|---|---|---|
| Generate | $0.05 (Nano Banana) | $0.07 (Gemini) | $0.17 (Nano Banana Pro) |
| Replace | $0.06 (Gemini) | $0.10 (FLUX.2 Pro) | $0.24 (FLUX.2 Max) |
Generate offers a lower floor ($0.05), but its premium tier costs more than Replace's standard tier ($0.17 vs $0.10) despite matching production-ready rates (89%). Replace offers a tighter cost range with better quality consistency.
8.5 Platform Recommendations
Based on these findings, we recommend:
- Offer both options as distinct product features — Let customers choose based on their assets and needs
- Build a template library for Replace — Unlocks the higher-quality workflow for customers without photography assets
- Implement complexity-based routing — Route simple rings to budget models, complex rings to premium models (a routing sketch follows this list)
- Add ring accuracy validation — Automated quality checks before delivery
- Do not include FLUX.2 Flex — Overpriced in all scenarios
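As an illustration of complexity-based routing, a minimal sketch. The thresholds and model choices follow Tables 4, 7, and 8; the function itself is hypothetical and not part of the study:

```python
def route_model(workflow: str, complexity: str, budget: bool = False) -> str:
    """Pick a model per the study's findings; FLUX.2 Flex is never selected."""
    if workflow == "generate":
        if budget and complexity == "simple":
            return "nano-banana"       # $0.05 per usable image
        return "nano-banana-pro"       # 89% production-ready in Generate
    if workflow == "replace":
        if budget and complexity == "simple":
            return "gemini-2.5-flash"  # $0.06 per usable image
        return "flux-2-pro"            # 89% accuracy at $0.10 per usable image
    raise ValueError(f"unknown workflow: {workflow!r}")
```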
8.6 Open Questions for Future Research
This study tested on-hand ring photography specifically. The Generate vs Replace trade-off may differ for other shot types. Future research should address:
- Studio shots: Is it better to generate a flat lay from scratch or replace products in a template composition?
- Lifestyle scenes: Do the model rankings hold for environmental backgrounds?
- Other jewelry categories: Do necklaces, earrings, and bracelets show the same workflow-dependent patterns?
- Template design: What template characteristics optimize Replace workflow performance?
These questions will guide our next phase of research as we expand the platform’s capabilities.
Appendix: Head-to-Head Matrix
Complete pairwise comparison results. Each cell shows the row model's win rate against the column model across 18 comparisons (9 rings × 2 workflows):
| Model | F2Pro | F2Max | F2Flex | NaBan | NaBPro | Gemini |
|---|---|---|---|---|---|---|
| FLUX.2 Pro | - | 44% | 61% | 44% | 36% | 56% |
| FLUX.2 Max | 56% | - | 61% | 36% | 36% | 75% |
| FLUX.2 Flex | 39% | 39% | - | 50% | 28% | 56% |
| Nano Banana | 56% | 64% | 50% | - | 36% | 50% |
| Nano Banana Pro | 64% | 64% | 72% | 64% | - | 69% |
| Gemini Flash | 44% | 25% | 44% | 50% | 31% | - |
Research conducted December 2025. 270 pairwise comparisons, 108 images, 6 models.
studio formel Research — Advancing AI for commercial content generation.
Related Articles
- Which AI Model Works Best for Jewelry Photography? — Practical takeaways from this research
- Prompt Engineering: Angle Control — How to control camera angles in prompts
- AI Image Studios & Models: 2025 Guide — Overview of all platforms tested
About studio formel
studio formel is an AI-powered creative platform built specifically for jewelry brands. We combine systematic research on AI generation with a flexible asset management system, helping jewelry sellers create professional images, videos, and ads at scale.