

Published: December 22, 2025
Category: Model Comparison
Author: Formel Studio

AI Jewelry Photography: Which Models Can Even Do It? (Part 1)

We tested 11 AI image generators to find which ones can produce usable jewelry photography. This elimination study identifies which models to avoid entirely and which deserve further testing.


Before optimizing prompts or building complex workflows, we needed to answer a foundational question: Which AI models can even produce usable jewelry photography?

Study type: Elimination round — 132 images, 11 models, blind evaluation

Research Series: AI Model Comparison for Jewelry Photography


The Question This Study Answers

This is Part 1 of our AI jewelry photography research. Before comparing the fine details between top models, we first needed to eliminate the models that simply can’t handle jewelry photography at all.

This study answers: Which AI models should you avoid entirely for jewelry product photography?

This study does NOT answer: Which of the capable models is definitively “best”? (That’s Part 2)

Why Elimination First?

Jewelry photography is demanding. It requires:

  • Accurate reproduction of metal colors and finishes
  • Precise stone placement and proportions
  • Realistic reflections and sparkle
  • Natural-looking hands and skin
  • Correct ring placement on fingers

Many AI models that work well for general image generation fail completely when asked to reproduce specific jewelry from a reference image. We needed to identify these failures before investing time in detailed comparisons.


TL;DR: Elimination Results

After testing 11 models across 132 images:

ELIMINATED — Do not use for jewelry photography:

  • Qwen-Image (Alibaba): 100% rejection rate — failed every test
  • Runway Gen-4 Image: 100% rejection rate — failed every test
  • Seedream 4 (ByteDance): 75% rejection rate — too unreliable
  • GPT Image 1.5 (OpenAI): 58% rejection rate — inconsistent
  • Seedream 4.5 (ByteDance): 58% rejection rate — inconsistent

ADVANCING TO PART 2 — Show promise for jewelry:

  • Nano Banana Pro (Google): 0% rejection rate — never failed
  • FLUX.2 Pro (Black Forest Labs): 8% rejection rate — reliable
  • Nano Banana (Google): 8% rejection rate — reliable
  • FLUX.2 Flex (Black Forest Labs): 17% rejection rate — usable
  • FLUX.2 Max (Black Forest Labs): 23% rejection rate — borderline
  • Gemini 2.5 Flash (Google): 25% rejection rate — borderline

Key workflow findings (valid regardless of model ranking):

  • The “Replace” approach works 31% better than generating from scratch
  • Adding text descriptions of jewelry provides virtually no improvement
  • Simple rings succeed about 50% of the time, versus 33-39% for medium and complex designs

Study Design

Reference Images

We tested with three rings at varying complexity levels:

[Image: simple gold signet ring with oval flat face (reference)]
[Image: medium-complexity gold ring with pear-shaped diamond and halo setting (reference)]
[Image: complex 14k gold band with alternating gold bars and diamond clusters (reference)]

| Level | Description |
| --- | --- |
| Simple | Gold signet ring with oval flat face, polished, no stones |
| Medium | Delicate gold ring with pear-shaped diamond center stone and halo setting |
| Complex | 14k gold band with alternating vertical gold bars and diamond clusters |

Hand Reference for Replace Scenario

[Image: woman's hand with placeholder ring, used for the Replace scenario]

Models Tested

| Model | Lab | Cost/Image |
| --- | --- | --- |
| FLUX.2 Pro | Black Forest Labs | ~$0.045 |
| FLUX.2 Max | Black Forest Labs | ~$0.10 |
| FLUX.2 Flex | Black Forest Labs | ~$0.12 |
| Nano Banana | Google | ~$0.039 |
| Nano Banana Pro | Google | ~$0.15 |
| Gemini 2.5 Flash | Google | ~$0.039 |
| GPT Image 1.5 | OpenAI | ~$0.05 |
| Qwen-Image | Alibaba | ~$0.025 |
| Seedream 4 | ByteDance | ~$0.03 |
| Seedream 4.5 | ByteDance | ~$0.04 |
| Runway Gen-4 Image | Runway | ~$0.05 |

Test Scenarios

Each model was tested in 4 scenarios:

| Scenario | Task | Purpose |
| --- | --- | --- |
| A: Generate | Create ring-on-hand from ring image only | Test basic capability |
| B: Generate + Description | Add text description of the ring | Test if words help |
| C: Replace | Swap ring on existing hand photo | Test precision editing |
| D: Replace + Description | Replace with text description | Test combined approach |

Total: 11 models × 3 rings × 4 scenarios = 132 images
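
To make the four scenarios concrete, here is a minimal sketch of what prompt skeletons for them could look like. The wording below is illustrative only, not the study's exact prompts (those are documented in the methodology).

```python
# Illustrative prompt skeletons for the four scenarios above.
# These are NOT the study's exact prompts; the wording is hypothetical.

SCENARIOS = {
    "A_generate": (
        "Photorealistic photo of a woman's hand wearing the ring from "
        "the reference image, soft natural light, neutral background."
    ),
    "B_generate_description": (
        "Photorealistic photo of a woman's hand wearing the ring from "
        "the reference image ({ring_description}), soft natural light, "
        "neutral background."
    ),
    "C_replace": (
        "Replace the ring on the hand in the first image with the ring "
        "from the second image. Keep hand, pose, and lighting unchanged."
    ),
    "D_replace_description": (
        "Replace the ring on the hand in the first image with the ring "
        "from the second image ({ring_description}). Keep hand, pose, "
        "and lighting unchanged."
    ),
}

# Example: scenario B with the simple ring's description filled in
prompt = SCENARIOS["B_generate_description"].format(
    ring_description="gold signet ring with an oval flat face, no stones"
)
print(prompt)
```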


Evaluation Methodology

Blind Evaluation for Elimination

Our goal was to identify which models produce usable results vs. which should be eliminated. The evaluation was designed to surface clear failures:

  1. Anonymization: Each of the 132 images was labeled only with its model's randomly assigned ID (01-11)
  2. Batch presentation: Images shown in randomized batches of 4
  3. Relative ranking: Within each batch, evaluator selected 1st place, 2nd place, or marked as “Reject”
  4. Blind reveal: Model identities mapped back only after all scoring complete
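
As an illustration of this setup, here is a minimal sketch of the anonymization and batching steps in Python. The model names, ID scheme, and data structures are assumptions for illustration, not the study's actual tooling.

```python
# Minimal sketch of the anonymization and batching steps described above.
# Model list, ID scheme, and structures are assumptions, not the study's code.
import random

MODELS = [
    "flux2-pro", "flux2-max", "flux2-flex", "nano-banana",
    "nano-banana-pro", "gemini-2.5-flash", "gpt-image-1.5",
    "qwen-image", "seedream-4", "seedream-4.5", "runway-gen4",
]
RINGS = ("simple", "medium", "complex")
SCENARIOS = ("A", "B", "C", "D")

random.seed(42)  # fixed seed so the ID mapping is reproducible

# 1. Anonymization: each model gets a random ID 01-11
ids = [f"{i:02d}" for i in range(1, len(MODELS) + 1)]
random.shuffle(ids)
model_to_id = dict(zip(MODELS, ids))

# 2. Build the full image list: 11 models x 3 rings x 4 scenarios = 132
images = [
    {"model_id": model_to_id[m], "ring": r, "scenario": s}
    for m in MODELS for r in RINGS for s in SCENARIOS
]

# 3. Batch presentation: shuffle, then split into batches of 4
random.shuffle(images)
batches = [images[i:i + 4] for i in range(0, len(images), 4)]
print(len(batches), "batches of 4 images")  # -> 33 batches
```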

Evaluation Interface

[Screenshot placeholder: Blind evaluation UI showing 4 images side by side with rating controls]

What This Methodology Measures Well

  • Clear failures: A model rejected in 12/12 batches is genuinely unusable
  • Consistent performers: A model never rejected across varied conditions is reliable
  • Workflow comparisons: Within-model comparisons (Replace vs Generate) are valid

What This Methodology Does NOT Measure

  • Precise rankings between top models: The batch composition varied, so “1st place” in one batch isn’t directly comparable to “1st place” in another
  • Head-to-head winner: We cannot say “Nano Banana Pro definitively beats FLUX.2 Pro”

This is why we’re calling this Part 1: Elimination. Part 2 will conduct controlled head-to-head comparisons of the advancing models.

Scoring Criteria

Each image was rated on 5 dimensions:

| Dimension | Options | What We Measured |
| --- | --- | --- |
| Ring Match | Exact / Close / Similar style / Wrong ring | Does it match the reference? |
| Hand Quality | Natural / Minor issues / Major issues | Does the hand look real? |
| Placement | Correct / Wrong finger / Floating | Is the ring properly worn? |
| AI Look | Photorealistic / Slight AI / Obviously AI | Would customers notice? |
| Verdict | 1st / 2nd / OK / Reject | Would you publish this? |
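
The elimination results below are driven by the Verdict column: an image counts against a model only when it is marked "Reject". A minimal sketch of that tally, with a hypothetical record layout and sample values:

```python
# Sketch: tallying per-model rejection rates from the Verdict column.
# The record layout and sample values are hypothetical.
from collections import Counter

scores = [  # (model, verdict) pairs, one per evaluated image
    ("nano-banana-pro", "1st"), ("nano-banana-pro", "OK"),
    ("flux2-pro", "2nd"), ("flux2-pro", "Reject"),
    ("qwen-image", "Reject"), ("qwen-image", "Reject"),
]

totals = Counter(model for model, _ in scores)
rejects = Counter(model for model, verdict in scores if verdict == "Reject")

for model in totals:
    print(f"{model}: {rejects[model] / totals[model]:.0%} rejection rate")
```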

Results: Clear Eliminations

Models That Failed Completely

Qwen-Image and Runway Gen-4 should be avoided entirely for jewelry photography. Both achieved a 100% rejection rate — every single output was unusable.

Common failure patterns:

  • Generated completely different rings, ignoring the reference image
  • In some cases, didn’t generate a ring at all
  • These models do not appear to be optimized for reference-image fidelity

Visual examples of failure — Qwen-Image:

[Images: Qwen-Image outputs for the simple, medium, and complex rings]

Visual examples of failure — Runway Gen-4:

[Images: Runway Gen-4 outputs for the simple, medium, and complex rings]

Models With High Failure Rates

Seedream 4 (75% reject), GPT Image 1.5 (58% reject), and Seedream 4.5 (58% reject) showed partial capability but failed too often to be reliable:

  • Ring style often "inspired by" the reference rather than matching it
  • Correct concept but wrong execution (proportions, stone count, metal color)
  • ByteDance models (Seedream) particularly struggled with fine jewelry details

Elimination Summary

[Chart: Rejection Rate by Model (lower = better) — full data in the table below]
| Status | Model | Rejection Rate | Verdict |
| --- | --- | --- | --- |
| ADVANCING | Nano Banana Pro | 0% | Never failed — reliable |
| ADVANCING | FLUX.2 Pro | 8% | Rarely failed — reliable |
| ADVANCING | Nano Banana | 8% | Rarely failed — reliable |
| ADVANCING | FLUX.2 Flex | 17% | Occasional failures — usable |
| BORDERLINE | FLUX.2 Max | 23% | Frequent failures — test carefully |
| BORDERLINE | Gemini 2.5 Flash | 25% | Frequent failures — test carefully |
| ELIMINATED | Seedream 4.5 | 58% | Too unreliable |
| ELIMINATED | GPT Image 1.5 | 58% | Too unreliable |
| ELIMINATED | Seedream 4 | 75% | Mostly fails |
| ELIMINATED | Qwen-Image | 100% | Complete failure |
| ELIMINATED | Runway Gen-4 | 100% | Complete failure |

Results: Models That Show Promise

These models produced usable jewelry photography and advance to Part 2 for head-to-head comparison.

Nano Banana Pro — Most consistent performer (0% rejection rate):

[Images: Nano Banana Pro outputs for the simple, medium, and complex rings]

FLUX.2 Pro — Strong performer at lower cost (8% rejection rate):

[Images: FLUX.2 Pro outputs for the simple, medium, and complex rings]

Important note: While Nano Banana Pro showed the lowest rejection rate, this study cannot definitively claim it’s “better” than FLUX.2 Pro. The batch composition varied, meaning these models weren’t always compared head-to-head on identical tasks. Part 2 will address this with controlled comparisons.


Workflow Findings

These findings compare performance within models across different approaches, so they’re valid regardless of cross-model ranking questions.

Replace Works Better Than Generate

Generate from scratch vs. Replace on hand:

| Approach | Success Rate |
| --- | --- |
| Generate | 35% |
| Replace | 46% |

Going from 35% to 46% success is a 31% relative improvement in favor of the Replace approach.

Recommendation: When possible, use the Replace workflow. Provide a hand reference image and ask the AI to swap in your ring. This gives the AI more spatial context and produces better results.
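
As a rough sketch of what a Replace-style call could look like with the Replicate Python client (mentioned under Reproducibility below): pass both the hand photo and the ring reference, and keep the prompt focused on the swap. The model slug and input keys below are placeholders, since every model on Replicate defines its own input schema.

```python
# Sketch of a Replace-style call via the Replicate Python client.
# The model slug and input keys are placeholders: every model on
# Replicate defines its own input schema, so check it before use.
import replicate

MODEL = "some-lab/some-image-editing-model"  # placeholder slug

output = replicate.run(
    MODEL,
    input={
        # assumed key names; real models may expect e.g. "image" or
        # "input_images" instead
        "prompt": (
            "Replace the ring on the hand with the ring from the "
            "reference image. Keep hand, pose, and lighting unchanged."
        ),
        "hand_image": open("hand_reference.jpg", "rb"),
        "ring_image": open("ring_reference.jpg", "rb"),
    },
)
print(output)  # typically a URL or file-like object for the result
```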

Text Descriptions Don’t Help

Effect of text description on accuracy:

| Condition | Success Rate |
| --- | --- |
| No description | 40% |
| With description | 41% |

Recommendation: Don’t waste time writing detailed jewelry descriptions. The reference image contains more information than words can convey. Focus your prompts on the scene (pose, lighting, background) instead.

Simple Rings Work Better

Success rate by ring complexity:

| Complexity | Success Rate |
| --- | --- |
| Simple | 50% |
| Medium | 33% |
| Complex | 39% |

Recommendation: Start with your simpler pieces when adopting AI photography. Validate your workflow before attempting complex multi-stone designs.


What Goes Wrong: Failure Mode Analysis

Ring Match Accuracy (All 132 Images)

  • Exact match: 29% — AI perfectly reproduced the reference
  • Close: 30% — Minor differences, clearly the same ring
  • Similar style: 15% — Right category, wrong details
  • Wrong ring entirely: 27% — AI generated a different ring

Common Failure Patterns

  1. Wrong ring entirely (27%) — AI generates a generic ring instead of the reference. Most common with eliminated models.

  2. Wrong finger (21%) — Ring on index/middle finger instead of ring finger. Even good models made this mistake.

  3. Floating/clipping (7%) — Ring merged into hand or appeared to float. More common with complex designs.

  4. Obviously AI (18%) — Uncanny skin, unrealistic lighting. Would reduce customer trust.


Practical Recommendations

For E-commerce Sellers Today

Based on elimination results, here’s what you can do now:

Definitely avoid:

  • Qwen-Image
  • Runway Gen-4 Image
  • Seedream 4
  • GPT Image 1.5
  • Seedream 4.5

Safe to test (pending Part 2 results):

  • Nano Banana Pro ($0.15/image) — most consistent
  • FLUX.2 Pro ($0.045/image) — best value candidate
  • Nano Banana ($0.039/image) — budget option

Workflow tips:

  • Use Replace approach when possible
  • Skip detailed jewelry descriptions
  • Start with simple rings
  • Budget for 1.5-2x generations per final image so you can select the best (see the sketch after this list)
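
A minimal sketch of that best-of-N tip: generate a couple of candidates per shot, save them all, and pick manually. The generate_image wrapper is hypothetical and stands in for whichever model API you use.

```python
# Sketch of the best-of-N tip: generate N candidates, keep them all,
# pick manually. generate_image() is a hypothetical stand-in for your
# model API (e.g. the Replicate call sketched earlier).
from pathlib import Path

N_CANDIDATES = 2  # roughly 2x budget per published image

def generate_image(prompt: str) -> bytes:
    """Hypothetical wrapper; replace with a real API call."""
    return b""  # placeholder bytes

out_dir = Path("candidates")
out_dir.mkdir(exist_ok=True)

prompt = "Replace the ring on the hand with the reference ring ..."
for i in range(N_CANDIDATES):
    (out_dir / f"candidate_{i:02d}.png").write_bytes(generate_image(prompt))
# Review the candidates/ folder and publish only the best result.
```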

Cost Guidance

| Model | Cost | Reliability |
| --- | --- | --- |
| Nano Banana Pro | $0.15/image | Never failed in this study |
| FLUX.2 Pro | $0.045/image | 92% reliable |
| Nano Banana | $0.039/image | 92% reliable |

Study Limitations

What This Study Measured

  • Clear eliminations: Models that fail consistently across varied conditions
  • Reliable performers: Models that rarely produce unusable results
  • Workflow effectiveness: Replace vs Generate, Description impact

What This Study Did NOT Measure

  • Precise quality ranking between top models: Due to varied batch composition, we cannot definitively rank Nano Banana Pro vs FLUX.2 Pro vs others
  • Head-to-head comparisons: Models weren’t always compared on identical tasks

Why This Matters

A “1st place” in a batch with 3 weak competitors isn’t the same as “1st place” against 3 strong competitors. The elimination results are valid — if a model fails 100% of the time, it’s genuinely bad. But the relative ranking of successful models requires controlled head-to-head testing.


Next: Part 2 — Head-to-Head Comparison

This elimination study answered: Which models can do jewelry photography?

Part 2 will answer: Of the capable models, which is actually best?

Planned methodology for Part 2:

  • Show all 6 advancing models’ outputs for identical ring + scenario
  • Rank 1-6 within each controlled comparison
  • Repeat for multiple conditions
  • Produce definitive quality ranking

Additional planned research:

  • Consistency testing (10+ generations per model)
  • Other jewelry types (bracelets, necklaces, earrings)
  • Ideogram inpainting comparison

Reproducibility

Materials:

  • Reference images: 3 rings at simple/medium/complex levels
  • Prompts: Documented in methodology
  • Raw scores: 132 images with full rating data

Cost to reproduce:

  • 132 images × ~$0.06 average = ~$8 total (a sanity check follows this list)
  • Replicate API access required
  • ~1 hour for evaluation
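
A quick sanity check on that figure, using the per-image costs from the Models Tested table and 12 images per model (3 rings × 4 scenarios):

```python
# Sanity check on the ~$8 figure, using the per-image costs from the
# Models Tested table and 12 images per model (3 rings x 4 scenarios).
COST_PER_IMAGE = {
    "FLUX.2 Pro": 0.045, "FLUX.2 Max": 0.10, "FLUX.2 Flex": 0.12,
    "Nano Banana": 0.039, "Nano Banana Pro": 0.15,
    "Gemini 2.5 Flash": 0.039, "GPT Image 1.5": 0.05,
    "Qwen-Image": 0.025, "Seedream 4": 0.03, "Seedream 4.5": 0.04,
    "Runway Gen-4 Image": 0.05,
}

total = sum(cost * 12 for cost in COST_PER_IMAGE.values())
print(f"Total: ${total:.2f}")  # -> Total: $8.26, i.e. ~$8
```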

Conclusion

This elimination study provides clear guidance on which AI models to avoid for jewelry photography:

Do not use: Qwen-Image, Runway Gen-4, Seedream 4, GPT Image 1.5, Seedream 4.5

Safe to use: Nano Banana Pro, FLUX.2 Pro, Nano Banana (with FLUX.2 Flex, Gemini 2.5 Flash, and FLUX.2 Max as borderline options)

The workflow findings are actionable today:

  • Use Replace instead of Generate (+31% improvement)
  • Skip detailed jewelry descriptions (no benefit)
  • Start with simple rings (50% success rate, the highest of the three complexity levels)

Part 2 will determine which of the advancing models produces the best quality. Until then, Nano Banana Pro’s 0% failure rate makes it the safest choice, while FLUX.2 Pro offers strong reliability at 1/3 the cost.


Part 1 of ongoing AI jewelry photography research. December 2025.

Questions? Contact hello@studioformel.com



About studio formel

studio formel is an AI-powered creative platform built specifically for jewelry brands. We combine systematic research on AI generation with a flexible asset management system, helping jewelry sellers create professional images, videos, and ads at scale.


