Commit d7c97aa

Copilot and yanyiwu committed
Add adaptive topN selection for different document sizes
Co-authored-by: yanyiwu <[email protected]>
1 parent 7ed7de1 commit d7c97aa

3 files changed: 102 additions, 0 deletions

README.md

Lines changed: 31 additions & 0 deletions
@@ -54,6 +54,37 @@ simhash值是: 17831459094038722629
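
For details, see the [demo](https://github.com/yanyiwu/simhash-demo).

### Setting the number of keywords (topN)

The `topN` parameter of `make()` controls how many keywords are extracted from the text to take part in the simhash computation. More keywords give the fingerprint broader coverage of the text, at a higher computation cost.

**Texts of different sizes should use different topN values:**

| Text length (bytes) | Suggested topN |
|:---:|:---:|
| ≤ 600 | 5 |
| ~1200 | 10 |
| ~7800 | 65 |
| ≥ 24000 | 200 |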

**Automatic topN selection (recommended):**

Call `Simhasher::getTopN(text)` to get the recommended topN (rule: `max(5, min(200, text.size() / 120))`), or call the `make(text, v64)` overload, which takes no explicit topN and calls `getTopN()` internally:

```cpp
Simhasher shash(DICT_PATH, HMM_PATH, IDF_PATH, STOP_WORDS_PATH);
string text = /* read in the text */;

// Option 1: let make() choose topN automatically
uint64_t simhashValue;
shash.make(text, simhashValue);

// Option 2: query the recommended value, then pass it explicitly
size_t topN = Simhasher::getTopN(text);
shash.make(text, topN, simhashValue);
```

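The rule quoted above can be tried out without the library. The following is a minimal standalone sketch, not the library's code: `recommendedTopN` is a hypothetical helper that takes a byte count directly, whereas the real entry point is `Simhasher::getTopN(text)`.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical standalone copy of the README rule
// max(5, min(200, text.size() / 120)); it takes the byte length directly
// instead of the text itself.
std::size_t recommendedTopN(std::size_t byteLength) {
    const std::size_t topNMin = 5;
    const std::size_t topNMax = 200;
    // Integer division: one keyword per full 120 bytes, clamped to [5, 200].
    return std::max(topNMin, std::min(topNMax, byteLength / 120));
}
```

Calling it with 600, 1200, 7800 and 24000 reproduces the four rows of the table: 5, 10, 65 and 200.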
### Benchmark

```sh
./benchmark/benchmarking
```

include/simhash/Simhasher.hpp

Lines changed: 33 additions & 0 deletions
@@ -106,6 +106,39 @@ namespace simhash
            return v64;
        }

        /**
         * @brief
         * Compute a recommended topN (number of keywords) for the given text.
         *
         * For Chinese UTF-8 text, each character is roughly 3 bytes and each word
         * is roughly 2 characters (~6 bytes). A good rule of thumb is to extract
         * approximately one keyword per 120 bytes of input, clamped to [5, 200].
         *
         * Example expected values:
         *   text.size() <= 600 bytes   → topN = 5
         *   text.size() ~ 1200 bytes   → topN = 10
         *   text.size() ~ 7800 bytes   → topN = 65
         *   text.size() >= 24000 bytes → topN = 200 (cap)
         *
         * Users who need tighter control can still pass an explicit topN to make().
         */
        static size_t getTopN(const string& text)
        {
            const size_t topNMin = 5;
            const size_t topNMax = 200;
            return std::max(topNMin, std::min(topNMax, text.size() / 120));
        }

        /**
         * @brief
         * Adaptive overload: topN is chosen automatically via getTopN().
         * Suitable when the caller does not want to tune topN manually.
         */
        bool make(const string& text, uint64_t& v64) const
        {
            return make(text, getTopN(text), v64);
        }

        static bool isEqual(uint64_t lhs, uint64_t rhs, unsigned short n = 3)
        {
            unsigned short cnt = 0;
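Because `getTopN()` divides `text.size()`, the rule works on bytes rather than characters. The sketch below illustrates the "roughly 3 bytes per character" assumption from the doc comment. The helpers `countCodePoints`, `repeat` and the constant `kZhongWen` are hypothetical names introduced only for this illustration, and the counter assumes valid UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumes the input is valid UTF-8.
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Build a string that repeats `unit` exactly `times` times.
std::string repeat(const std::string& unit, std::size_t times) {
    std::string out;
    out.reserve(unit.size() * times);
    for (std::size_t i = 0; i < times; ++i) {
        out += unit;
    }
    return out;
}

// Two Chinese characters written as explicit UTF-8 bytes, so the example does
// not depend on the source file encoding: U+4E2D = E4 B8 AD, U+6587 = E6 96 87.
const std::string kZhongWen = "\xE4\xB8\xAD\xE6\x96\x87";
```

With these helpers, `repeat(kZhongWen, 200)` holds 400 characters in 1200 bytes, so the rule yields 1200 / 120 = 10. A 400-character ASCII string is only 400 bytes, which falls to the minimum of 5.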

test/unittest/TSimhash.cpp

Lines changed: 38 additions & 0 deletions
@@ -87,3 +87,41 @@ TEST(SimhasherTest, Test2)

}

TEST(SimhasherTest, TestGetTopN)
{
    // Very short text should return the minimum topN of 5.
    string shortText(100, 'a');
    ASSERT_EQ(Simhasher::getTopN(shortText), (size_t)5);

    // Text of exactly 600 bytes: 600/120 = 5 → still 5 (minimum).
    string text600(600, 'a');
    ASSERT_EQ(Simhasher::getTopN(text600), (size_t)5);

    // Text of 1200 bytes: 1200/120 = 10.
    string text1200(1200, 'a');
    ASSERT_EQ(Simhasher::getTopN(text1200), (size_t)10);

    // Text of 7800 bytes: 7800/120 = 65.
    string text7800(7800, 'a');
    ASSERT_EQ(Simhasher::getTopN(text7800), (size_t)65);

    // Very long text should be capped at the maximum topN of 200.
    string longText(30000, 'a');
    ASSERT_EQ(Simhasher::getTopN(longText), (size_t)200);
}

TEST(SimhasherTest, TestAdaptiveMake)
{
    Simhasher shash("../submodules/cppjieba/dict/jieba.dict.utf8", "../submodules/cppjieba/dict/hmm_model.utf8", "../submodules/cppjieba/dict/idf.utf8", "../submodules/cppjieba/dict/stop_words.utf8");

    // Verify that the adaptive make() overload (no explicit topN) produces the same
    // result as calling make() with getTopN() explicitly.
    string s;
    ASSERT_TRUE(loadFile2Str("../test/testdata/news_content", s));

    uint64_t u_adaptive, u_explicit;
    ASSERT_TRUE(shash.make(s, u_adaptive));
    ASSERT_TRUE(shash.make(s, Simhasher::getTopN(s), u_explicit));
    ASSERT_EQ(u_adaptive, u_explicit);
}

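The test values above sit on exact multiples of 120. The edges created by integer division (719 vs. 720 bytes, 23999 vs. 24000 bytes) can be checked the same way. The sketch below is standalone and does not touch the library: `topNMaxMin` restates the commit's rule, `topNClamp` is a hypothetical alternative spelling with `std::clamp` that assumes a C++17 compiler, and `agreeUpTo` is an illustrative helper.

```cpp
#include <algorithm>
#include <cstddef>

// The rule as written in the commit: max(5, min(200, bytes / 120)).
std::size_t topNMaxMin(std::size_t bytes) {
    const std::size_t lo = 5;
    const std::size_t hi = 200;
    return std::max(lo, std::min(hi, bytes / 120));
}

// Hypothetical alternative spelling with std::clamp (C++17, <algorithm>).
std::size_t topNClamp(std::size_t bytes) {
    return std::clamp<std::size_t>(bytes / 120, 5, 200);
}

// Returns true if both spellings agree for every size in [0, limit].
bool agreeUpTo(std::size_t limit) {
    for (std::size_t n = 0; n <= limit; ++n) {
        if (topNMaxMin(n) != topNClamp(n)) {
            return false;
        }
    }
    return true;
}
```

Under this rule the minimum of 5 holds up to 719 bytes, 720 bytes is the first size that yields 6, and the cap of 200 is first reached at exactly 24000 bytes.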
