Website Crawling
Website Crawling (2025-12-30): Honestify can ingest public content from your portfolio, blog, or GitHub Pages to enrich your knowledge base without manual copy-paste.
Feature area: Website Crawler
On this page
This release covers Website Crawling Version 1.3.0, shipped 2025-12-30. Status: shipped. No breaking changes.
Summary
Honestify can ingest public content from your portfolio, blog, or GitHub Pages to enrich your knowledge base without manual copy-paste.
Engineers with strong public writing still retyped URLs and summaries into Honestify—duplicate work that discouraged knowledge base adoption. A respectful crawler fetches public pages, extracts main content, and indexes text for AI retrieval with user-approved URL allowlists.
What Changed
URL allowlist
NewUsers specify domains and max crawl depth before indexing starts.
Content extraction
NewReadability-style main-content detection strips nav and boilerplate.
Crawl scheduling
ImprovedPeriodic re-crawl option for sites that update frequently.
KB setup time
Before
~25 min
After
~12 min
Users with 5+ public pages
- URL allowlist — Users specify domains and max crawl depth before indexing starts.
- Content extraction — Readability-style main-content detection strips nav and boilerplate.
- Crawl scheduling — Periodic re-crawl option for sites that update frequently.
Why We Built It
Engineers with strong public writing still retyped URLs and summaries into Honestify—duplicate work that discouraged knowledge base adoption.
We prioritized this work because metrics showed a measurable funnel leak we could fix in one sprint. The fix needed to be durable—not a patch—so we addressed root causes in Website Crawler rather than symptoms alone.
Engineers, recruiters, and hiring managers all benefit when Honestify behaves predictably in production. This release reflects that bar.
User Impact
Knowledge base setup time decreased 50% for users with existing portfolio sites.
| Audience | How you benefit |
|---|---|
| Engineers | Faster profile setup, clearer AI answers, less manual rework |
| Recruiters | More complete profiles and reliable share links when candidates use Honestify |
| Founders / hiring managers | Better signal on candidate preparation and skills alignment |
| Platform engineers | Infrastructure patterns that reduce incident risk |
Relevant skills: rag, python, system design, typescript. Target roles: ai engineer, backend engineer, devops engineer.
Technical Highlights
- Robots.txt compliance with crawl-delay respect
- Per-user rate limits and concurrent crawl caps
- Chunking pipeline for embedding index
- Dedup via content hash
Rollout used feature flags with staged percentage increase and automatic rollback on error budget burn.
Before
Website Crawling: before vs after
Before
Users manually pasted article text or skipped external content entirely.
After
Submit a root URL; Honestify discovers and indexes linked pages within configured depth and domain bounds.
Users moving from the previous experience should notice submit a root URL; Honestify discovers and indexes linked pages within configured depth and domain bounds.
Screenshots
Future Improvements
What we are building next
- GitHub README indexing
- RSS feed subscriptions
- Selective page exclusion UI
Known limitations
- · JavaScript-heavy SPAs may require manual paste fallback
- · Authentication-gated pages not supported
Feedback welcome: Reply via in-app feedback or support—especially if you hit edge cases we did not cover in this release.
Related Features
This update connects to other Honestify work:
- Related updates: knowledge base launch, improved prompt engineering, ai chat improvements
- Guides: how to learn ai engineering, writing better documentation, building a personal brand
- Research: rag adoption, mcp adoption, emerging technologies on honestify
- Practice questions: explain rag, explain embeddings, design ai chatbot
Create your own AI profile
Upload your resume, add expertise, and share a profile link beside LinkedIn so recruiters can ask follow-up questions before the interview.