Why does AI-generated UI look good but feel wrong?

Because AI optimizes for visual plausibility, not product fit. It produces output that looks like a real interface without understanding why each element exists. The result is polished mediocrity: clean surfaces with no structural reasoning underneath.

How long should an evaluation take?

The quick checklist takes 2 minutes per screen. A full six-lens review takes 10 to 15 minutes. For important work, the full review is worth it. For rapid iteration, the checklist is enough to catch the worst problems.

Can I use AI to evaluate AI output?

Yes, for surface checks like consistency, spacing, and accessibility patterns. But AI evaluating AI has a blind spot: it shares the same biases that created the output. Use AI for the mechanical checks and your own judgment for originality, brand fit, and product decisions.

How to Evaluate AI Output Like a Senior Designer

[Figure: Six-lens critique wheel. AI output is evaluated through six lenses before accepting it.]

Lens 01, Hierarchy: Is the important thing prominent?
Lens 02, Consistency: Does every element follow rules?
Lens 03, Accessibility: Can everyone use this?
Lens 04, Originality: Does it have a point of view?
Lens 05, Feasibility: Can it be built and maintained?
Lens 06, Product fit: Does it solve the right problem?

[Figure: What usually changes after senior review (panels: Reviewed, Raw AI).]

The polished mediocrity problem

AI-generated UI has a specific failure mode. It looks professional. The spacing is clean. The typography is reasonable. The colors are not ugly. A junior designer might look at it and say “that is fine.”

A senior designer looks at the same screen and sees:

No clear hierarchy. Everything is the same visual weight.
Generic structure. Hero, three cards, testimonials, footer. Same as every other site.
No brand point of view. This could belong to any company.
Surface polish hiding structural emptiness. The page looks designed but says nothing.
Accessibility theater. It looks accessible without actually being tested.

The problem is not that the output is bad. The problem is that it is not good enough to ship, and it takes experience to see why.

This guide gives you a framework for seeing it.

The six-lens rubric

Use these six lenses to evaluate any AI-generated screen, component, or page. Each lens catches a different category of problem.

1. Hierarchy

What to check: Is the most important thing the most prominent thing?

Red flags:

Every element has similar visual weight
The CTA is the same size as secondary links
Headlines do not stand out from body text
Multiple competing focal points

What good looks like: One clear primary action. One clear headline. Everything else is visually subordinate. A user can tell in 3 seconds what the page is about and what to do.

Ask yourself: If I squint until the page blurs, does the important thing still stand out?

2. Consistency

What to check: Does every element follow the same rules?

Red flags:

Two different button styles on the same page
Inconsistent spacing (32px here, 24px there, 40px somewhere else)
Mixed border-radius values
Font sizes that do not follow a scale
Colors that are close but not identical

What good looks like: Every button is the same. Every card follows the same spacing. Every heading uses the same scale. It looks like one person made it, not a committee.

Ask yourself: Could I write the rules this page follows? If not, there are no rules.
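
One way to make the "could I write the rules?" question concrete is to record the values a page actually uses for each role and see which roles use more than one. The sketch below is a minimal illustration, not a prescribed tool; the function name `inconsistent_roles`, the role names, and the sample values are all hypothetical.

```python
def inconsistent_roles(observed):
    """Return only the roles that use more than one distinct value.

    `observed` maps a role name (hypothetical, e.g. "section-gap-px")
    to every value sampled for that role on the page.
    """
    return {
        role: sorted(set(values))
        for role, values in observed.items()
        if len(set(values)) > 1
    }


# Values sampled by hand from a hypothetical AI-generated page.
observed = {
    "section-gap-px": [32, 24, 40],
    "button-radius-px": [8, 8, 8],
    "card-radius-px": [12, 16],
}

print(inconsistent_roles(observed))
# {'section-gap-px': [24, 32, 40], 'card-radius-px': [12, 16]}
```

A role that shows up in the result is a place where the page has no rule yet. An empty result does not prove the design is good, only that it is consistent.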

3. Accessibility

What to check: Can people with different abilities use this?

Red flags:

Text on backgrounds with insufficient contrast
Buttons or links smaller than 44x44px
No visible focus states
Color as the only indicator of state (red for error, green for success, nothing else)
Images without alt text
Form inputs without labels

What good looks like: Contrast ratios pass WCAG AA (at least 4.5:1 for normal text and 3:1 for large text). Interactive elements are at least 44x44px. States are communicated through more than just color. Focus states are visible. A screen reader can make sense of the structure.

Ask yourself: Could someone use this with a keyboard only? With a screen reader? In direct sunlight?
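
The contrast and target-size checks are mechanical enough to script. The sketch below follows the WCAG 2.x definitions of relative luminance and contrast ratio and uses the AA thresholds of 4.5:1 for normal text and 3:1 for large text. The function names are illustrative, the colors are made-up samples, and the 44px minimum is the threshold used in this guide. It assumes six-digit hex colors with no alpha channel.

```python
def _linear(channel_8bit):
    """Convert one 8-bit sRGB channel to its linear value (WCAG 2.x formula)."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    """Relative luminance of a six-digit hex color such as '#1a2b3c'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(foreground, background):
    """Contrast ratio between two colors, from 1.0 (identical) to 21.0."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground, background, large_text=False):
    """True if the pair meets WCAG AA: 4.5:1, or 3:1 for large text."""
    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)


def target_large_enough(width_px, height_px, minimum_px=44):
    """True if an interactive element meets the minimum size on both axes."""
    return width_px >= minimum_px and height_px >= minimum_px


# Muted gray text on white: close to the threshold, but still a fail.
print(passes_aa("#777777", "#ffffff"))   # False
print(passes_aa("#767676", "#ffffff"))   # True
print(target_large_enough(120, 36))      # False
```

A script like this only covers the measurable part of the lens. Focus order, label quality, and screen reader structure still need a manual pass.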

4. Originality

What to check: Does this have a point of view, or could it belong to anyone?

Red flags:

Default SaaS layout (hero, features, testimonials, pricing, footer)
Inter or system font with no typographic personality
Blue or purple accent with no brand rationale
Illustrations that look like every other AI illustration
Copy that uses words like “streamline,” “leverage,” or “next-generation”

What good looks like: The page has visual decisions that reflect the brand, the audience, and the product. It could not be mistaken for a different company. The style choices are deliberate, not default.

Ask yourself: If I removed the logo, would I still know whose page this is?

5. Feasibility

What to check: Can this actually be built and maintained?

Red flags:

Layouts that work at one breakpoint but will break on mobile
Components that look unique but share no patterns with the rest of the system
Animations that require custom code for marginal visual benefit
Data that is hardcoded in the mockup but dynamic in production
States that are not accounted for (loading, empty, error, overflow)

What good looks like: Components follow patterns that scale. Responsive behavior is considered. Edge cases are handled. A developer could build this without guessing.

Ask yourself: What happens when the headline is twice as long? When there are zero items? When the data is loading?
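
The unaccounted-states red flag can be turned into a quick inventory check. The sketch below compares the states a mockup actually shows against the four named above (loading, empty, error, overflow). The function name and the sample input are hypothetical.

```python
# The four states this lens asks about, in the order listed above.
REQUIRED_STATES = ("loading", "empty", "error", "overflow")


def missing_states(designed_states):
    """Return the required states that the mockup does not cover."""
    designed = {state.strip().lower() for state in designed_states}
    return [state for state in REQUIRED_STATES if state not in designed]


# States found in a hypothetical AI-generated card component.
print(missing_states(["default", "Loading", "error"]))
# ['empty', 'overflow']
```

Each missing state is a question a developer would otherwise have to guess the answer to.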

6. Product fit

What to check: Does this actually solve the right problem for the right user?

Red flags:

Pretty UI with no clear user goal
Features displayed without context for why they matter
No indication of what happens after the CTA
Copy that describes the product instead of the user’s outcome
A page that looks like a brochure instead of a tool

What good looks like: The page is organized around what the user needs to understand and do. Every section has a job. The copy speaks to the user’s problem, not the product’s features. The design serves the conversion goal.

Ask yourself: If I were the target user, would I know exactly what this does and why I should care?

Surface critique vs structural critique

Surface critique catches visual issues. Structural critique catches design issues. Both matter, but they are different skills.

Surface critique (what most people do)

The spacing is off
The colors clash
The font is too small
The button is hard to see

These are real problems. But fixing them does not fix a bad design. It just makes a bad design look cleaner.

Structural critique (what senior designers do)

The page has no clear hierarchy
The information architecture does not match the user’s mental model
The CTA asks for commitment before building trust
The page solves a problem the user does not have
The component is not reusable across other contexts

Structural problems require rethinking, not polishing.

When evaluating AI output, check surface issues first (they are fast to fix), but always do a structural pass. AI is very good at producing clean surfaces with broken structures underneath.

Giving feedback that improves the next iteration

“Make it better” is not feedback. Here is what works:

Be specific about what is wrong

Bad: “The hero does not feel right.”

Better: “The hero has no hierarchy. The headline, subhead, and CTA are all the same visual weight. Make the headline 2x the size of the subhead and move the CTA below a clear value statement.”

Name the lens

“This fails the accessibility lens. The contrast ratio on the muted text is too low and the button is smaller than 44x44px.”

When you name the lens, AI knows what category of improvement to make.

Provide a reference

“Look at the quiet intelligence style from the Style Explorer. The current output is too loud. Reduce to 2 colors, increase whitespace, and drop the gradient.”

Say what to keep

“The section structure is good. Keep the order. Change the visual treatment.”

AI tends to regenerate everything unless you tell it what to preserve.
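
The four habits above can be folded into one reusable note. The sketch below is one possible shape, assuming a plain-text prompt is what you paste back into your AI tool. The `Feedback` class, its field names, and the output wording are illustrative choices, not a required format.

```python
from dataclasses import dataclass

LENSES = (
    "hierarchy", "consistency", "accessibility",
    "originality", "feasibility", "product fit",
)


@dataclass
class Feedback:
    lens: str            # which of the six lenses failed
    problem: str         # the specific thing that is wrong
    fix: str             # the specific change to make
    reference: str = ""  # optional style or example to move toward
    keep: str = ""       # optional note on what must not change

    def to_prompt(self) -> str:
        """Render the note as plain text, rejecting lens names outside the rubric."""
        lens = self.lens.strip().lower()
        if lens not in LENSES:
            raise ValueError(f"unknown lens: {self.lens!r}")
        lines = [
            f"This fails the {lens} lens.",
            f"Problem: {self.problem}",
            f"Fix: {self.fix}",
        ]
        if self.reference:
            lines.append(f"Reference: {self.reference}")
        if self.keep:
            lines.append(f"Keep: {self.keep}")
        return "\n".join(lines)


note = Feedback(
    lens="Hierarchy",
    problem="The headline, subhead, and CTA are all the same visual weight.",
    fix="Make the headline 2x the size of the subhead.",
    keep="The section order.",
)
print(note.to_prompt())
```

Making `problem` and `fix` required fields is deliberate: a note that cannot fill both is still at the "make it better" stage.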

The “earning its place” heuristic

Senior designers carry one question into every review that AI does not: does this element earn its place?

When you scan an AI-generated screen, you’ll see filler. A divider that has no information value. A “trusted by” row of fake logos because the section felt empty. A second CTA that competes with the first. An info icon that links nowhere. Three feature cards because the layout grid was three columns wide.

Filler is a design problem, not a content problem. If a section feels empty, the right move is to fix the layout, not to invent content to fill it. Good design says a thousand no’s for every yes. This is the hardest critique skill to teach because filler stays invisible until you name it.

Walk through the screen element by element. For each one, ask: if I deleted this, would the page lose something the user actually needs? If the answer is no, delete it. The screens that read as “designed” are the ones where every element passed this test.

This is the lens AI lacks by default. It will fill space because it learned that pages have content. Your job is to be the editor who removes it.

The 2-minute checklist

Run this on any AI-generated screen before accepting it.

One clear action. Can I tell in 3 seconds what to do?
Hierarchy exists. Headline is dominant, body is subordinate, CTA is obvious.
Consistency holds. Buttons, spacing, fonts, and colors follow visible rules.
Contrast passes. No text on backgrounds I cannot read easily.
It has a point of view. This could not be mistaken for a generic template.
Edge cases considered. What happens when content is longer, shorter, empty, or loading?
Every element earns its place. Nothing is there just because the layout had room.

If three or more fail, redo. Do not polish.
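
The decision rule is simple enough to write down as a tally. The sketch below encodes the seven checklist items and the three-fail threshold from this section. Returning "accept" for zero fails and "refine" for one or two is an assumption of the example, since the checklist itself only states when to redo.

```python
CHECKLIST = (
    "One clear action",
    "Hierarchy exists",
    "Consistency holds",
    "Contrast passes",
    "It has a point of view",
    "Edge cases considered",
    "Every element earns its place",
)


def checklist_verdict(failed_items, redo_threshold=3):
    """Return 'redo', 'refine', or 'accept' from the failed checklist items."""
    failed = set(failed_items)
    unknown = failed - set(CHECKLIST)
    if unknown:
        raise ValueError(f"not on the checklist: {sorted(unknown)}")
    if len(failed) >= redo_threshold:
        return "redo"
    # Assumed by this sketch: fewer than three fails means refine, none means accept.
    return "refine" if failed else "accept"


print(checklist_verdict(["Contrast passes"]))  # refine
print(checklist_verdict([
    "Hierarchy exists",
    "It has a point of view",
    "Every element earns its place",
]))  # redo
```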

When to accept AI output vs when to redo

Accept and refine when:

The structure is right but the surface needs polish
The hierarchy is clear but the style needs adjustment
The content is good but the copy tone is off

Redo from scratch when:

The structure does not match the user’s goal
The page has no clear hierarchy at all
The design is generic with no brand point of view
The component is not reusable or scalable
You would be embarrassed to show it to a senior colleague

Polishing a bad structure is wasted effort. Redoing with a better prompt is faster.

Exercise

Run the six-lens rubric on the last AI output you accepted too fast

Estimated time: 20 min

Score the output against all six lenses

Pick one AI-generated artifact you used recently: a Figma Make frame, a v0 component, a Lovable page, a Claude-generated hero. Open a blank note. For each of the six lenses (Hierarchy, Consistency, Accessibility, Originality, Feasibility, Product Fit), write pass or fail and one sentence explaining why.

Done when:

Six entries, one per lens, each with a verdict and a one-sentence reason
Reasons cite specific elements (“The primary CTA and the secondary link are the same size”), not vague impressions
At least one lens scored a clear fail, even if you originally thought the output was good enough

Decide refine or redo, then give feedback that names the lens

Count your fails. Three or fewer: refine. Four or more: redo. Either way, write the feedback in the format from the “Name the lens” section: lens name, specific problem, specific fix. Paste it into your AI tool and regenerate.

Done when:

Your feedback prompt names at least one lens explicitly (“This fails the hierarchy lens…”)
The feedback is specific enough that a junior designer could apply it without asking follow-up questions
The regenerated output fixes at least one of the failures you identified, not a different problem the AI picked on its own

