
wincp.exe, a Windows code page conversion tool

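Last edited by happy886rr on 2017-6-1 18:43

[tvcp has been renamed to wincp; several bugs were fixed and the version number is now 1.1]
wincp is a code page conversion tool. It supports text encoding conversion, BOM modification, BOM forging, BOM removal, automatic parameter correction, and automatic BOM offset.

Download: save the externally linked image as a.zip and unzip it.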


WINCP.EXE  (TEXT CODEPAGE CONVERSION TOOL, BY LEO, VERSION 1.1)

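Summary:
=========================================================================
A code page conversion tool. It supports text encoding conversion, BOM
modification, BOM forging, BOM removal, automatic parameter correction,
automatic BOM offset, and more.

It serves special purposes with excellent results: beyond plain encoding
conversion, it offers code page translation, BOM forging and customization,
encryption, and so on.

Note: the code_page parameter can be a code page number such as 936 or a
code page abbreviation such as GBK. See Notes (common code page
abbreviations) for the mapping.
=========================================================================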


Usage:
-------------------------------------------------------------------------
wincp [input_file] -f [code_page] -t [code_page] -s [skip_number] -b[fill_BOM] -o [out_file]
-------------------------------------------------------------------------
  -f  From the code page
  -t  Translate to the code page
  -s  Skip the number of bytes
  -b  Filling BOM
  -o  Output file name
  -h  Show help information
-------------------------------------------------------------------------


Examples:
-------------------------------------------------------------------------
REM Convert test.txt from BIG5 to UTF8
wincp test.txt -o out.txt -f BIG5 -t UTF8

REM Convert test.txt from GBK (code page 936, the ANSI code page on Simplified Chinese Windows) to UTF8
wincp test.txt -o out.txt -f 936 -t 65001

REM Convert test.txt from UTF8 to UCS-2LE (what Windows usually calls UNICODE) and write 0xFFFE as its BOM
wincp test.txt -f 65001 -t 1200 -s 0 -b 0xFFFE -o out.txt

REM Convert test.txt from big-endian UNICODE to UTF8 (two equivalent spellings)
wincp test.txt -o out.txt -f UCS2BE -t UTF8
wincp test.txt -oout.txt -fUNICODEBE -tUTF8

REM Strip the BOM from test.txt
wincp test.txt -oout.txt

REM Forge a BOM
wincp test.txt -oout.txt -b0xADFF0000
...
-------------------------------------------------------------------------
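
The -b values in the examples are hexadecimal strings such as 0xFFFE or 0xADFF0000. As a rough, portable sketch of how such a string can be turned into BOM bytes (the names hexValue and parseBomHex are made up for illustration, and the 4-byte cap is an assumption taken from the BOMS_SIZE constant in the source):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Value of one hex digit, or -1 if `c` is not a hex digit.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Hypothetical helper: parse "0xFFFE", "xFFFE" or "FFFE" into bytes.
// Parsing stops at the first incomplete or invalid digit pair, or after
// maxBytes bytes, whichever comes first.
std::vector<unsigned char> parseBomHex(const std::string& text,
                                       std::size_t maxBytes = 4) {
    std::size_t pos = 0;
    if (text.size() >= 2 && text[0] == '0' && (text[1] == 'x' || text[1] == 'X')) {
        pos = 2;  // skip "0x"
    } else if (!text.empty() && (text[0] == 'x' || text[0] == 'X')) {
        pos = 1;  // skip a bare "x"
    }
    std::vector<unsigned char> bytes;
    while (bytes.size() < maxBytes && pos + 1 < text.size()) {
        const int hi = hexValue(text[pos]);
        const int lo = hexValue(text[pos + 1]);
        if (hi < 0 || lo < 0) break;
        bytes.push_back(static_cast<unsigned char>((hi << 4) | lo));
        pos += 2;
    }
    return bytes;
}
```

The returned vector holds the bytes in the order they would be written to the file, so "0xFFFE" yields FF FE, the UTF-16 (LE) mark.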


Notes: (common code page abbreviations)
-------------------------------------------------------------------------
ANSI    0
GBK     936
GB18030 54936
BIG5    950

UNICODE   UTF16    UCS2    1200
UNICODEBE UTF16BE  UCS2BE  1201

UTF8    65001
UTF7    65000

UTF32   12000
UTF32BE 12001
-------------------------------------------------------------------------
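
The table above amounts to a case-insensitive alias lookup with a numeric fallback. A minimal portable sketch of that idea follows; lookupCodePage is a hypothetical name, the 5-digit limit is an assumption, and the tool's own parsing differs in details such as whitespace handling.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_map>

// Hypothetical helper: returns the code page for an alias or a decimal
// number, or -1 if the text is neither.
int lookupCodePage(std::string name) {
    static const std::unordered_map<std::string, int> aliases = {
        {"ANSI", 0},        {"GBK", 936},     {"GB18030", 54936},
        {"BIG5", 950},      {"UNICODE", 1200}, {"UTF16", 1200},
        {"UCS2", 1200},     {"UNICODEBE", 1201}, {"UTF16BE", 1201},
        {"UCS2BE", 1201},   {"UTF8", 65001},  {"UTF7", 65000},
        {"UTF32", 12000},   {"UTF32BE", 12001},
    };
    if (name.empty()) return -1;

    // All digits: treat the text as a numeric code page.
    const bool allDigits = std::all_of(name.begin(), name.end(),
        [](unsigned char c) { return std::isdigit(c) != 0; });
    if (allDigits) {
        if (name.size() > 5) return -1;  // assumed limit: 5 digits
        return std::stoi(name);
    }

    // Otherwise compare against the alias table, ignoring case.
    std::transform(name.begin(), name.end(), name.begin(),
        [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    const auto it = aliases.find(name);
    return it == aliases.end() ? -1 : it->second;
}
```

With this sketch, lookupCodePage("gbk") and lookupCodePage("936") both give 936, and an unknown name gives -1.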


Code pages: (general code page reference)
-------------------------------------------------------------------------
  437 - The original IBM PC code page, implementing an extended ASCII character set
  737 - Greek
  850 - Latin-1 (Western European languages)
  852 - Latin-2 (Central and Eastern European languages)
  855 - Cyrillic
  857 - Turkish
  858 - "Multilingual" with the euro sign
  860 - Portuguese
  861 - Icelandic
  863 - French (Canada)
  865 - Nordic
  866 - Cyrillic (Russian)
  869 - Greek
  874 - Thai
  932 - Japanese
  936 - GBK (Simplified Chinese)
  949 - Korean
  950 - BIG5 (Traditional Chinese)
 1200 - UCS-2LE, Unicode little-endian
 1201 - UCS-2BE, Unicode big-endian
 1250 - Central and Eastern European Latin
 1251 - Cyrillic
 1252 - Western European Latin (a superset of ISO-8859-1)
 1253 - Greek
 1254 - Turkish
 1255 - Hebrew
 1256 - Arabic
 1257 - Baltic
 1258 - Vietnamese
10000 - Macintosh Roman encoding (followed by several other Mac character sets)
10007 - Macintosh Cyrillic encoding
10029 - Macintosh Central European encoding
12000 - utf-32 Unicode UTF-32, little endian byte order; available only to managed applications
12001 - utf-32BE Unicode UTF-32, big endian byte order; available only to managed applications
28591 - iso-8859-1 ISO 8859-1 Latin 1; Western European (ISO)
51936 - EUC-CN EUC Simplified Chinese; Chinese Simplified (EUC)
54936 - GB18030
65000 - UTF-7 Unicode
65001 - UTF-8 Unicode
-------------------------------------------------------------------------


BOM: (common byte order marks)
-------------------------------------------------------------------------
UTF-8       EF BB BF
UTF-16 (LE) FF FE
UTF-16 (BE) FE FF
UTF-32 (LE) FF FE 00 00
UTF-32 (BE) 00 00 FE FF
UTF-7       2B 2F 76 +[38|39|2B|2F]
UTF-1       F7 64 4C
UTF-EBCDIC  DD 73 66 73
SCSU        0E FE FF
BOCU-1      FB EE 28 (+FF)
GB-18030    84 31 95 33
-------------------------------------------------------------------------
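
Several marks in this table share a prefix (FF FE begins both the UTF-16 LE and the UTF-32 LE mark), so a detector has to try the longest signatures first. The following portable sketch shows that ordering for a subset of the table; BomInfo and detectBom are illustrative names, not part of wincp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

struct BomInfo {
    const char* name;    // encoding label, "" when no BOM is found
    std::size_t length;  // number of BOM bytes to skip
};

// Hypothetical helper: identify a BOM at the start of `data`.
BomInfo detectBom(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    struct Entry {
        const char* name;
        unsigned char bytes[4];
        std::size_t length;
    };
    // Longest signatures first, so FF FE 00 00 is not reported as UTF-16.
    static const Entry table[] = {
        {"UTF-32 (LE)", {0xFF, 0xFE, 0x00, 0x00}, 4},
        {"UTF-32 (BE)", {0x00, 0x00, 0xFE, 0xFF}, 4},
        {"GB-18030",    {0x84, 0x31, 0x95, 0x33}, 4},
        {"UTF-8",       {0xEF, 0xBB, 0xBF, 0x00}, 3},
        {"UTF-16 (LE)", {0xFF, 0xFE, 0x00, 0x00}, 2},
        {"UTF-16 (BE)", {0xFE, 0xFF, 0x00, 0x00}, 2},
    };
    for (const Entry& e : table) {
        // Only compare when the buffer is at least as long as the mark.
        if (size >= e.length && std::memcmp(data, e.bytes, e.length) == 0) {
            return {e.name, e.length};
        }
    }
    return {"", 0};
}
```

One ambiguity cannot be resolved from the bytes alone: a UTF-16 LE file whose first character is U+0000 starts with FF FE 00 00 and is indistinguishable from UTF-32 LE. Passing the real buffer size keeps a 2-byte file that contains only FF FE from being reported as a 4-byte mark.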

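Version:
VERSION 1.1


The source supports both single-byte (ANSI) and wide-character (Unicode) builds and compiles with the common Windows compilers.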
/*
    TEXT CODEPAGE CONVERSION TOOL, COPYRIGHT@2017~2019 BY LEO, VERSION 1.1
    WINCP.EXE
    UNICODE COMPILATION:
    ==> G++ wincp.cpp -D _UNICODE -D UNICODE -municode -O2 -static
    ==> CL  wincp.cpp /O2 /Oy- /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /MD
    ANSI COMPILATION:
    ==> G++ wincp.cpp -O2 -static
    ==> CL  wincp.cpp /O2 /Oy- /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /MD
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <windows.h>
#include <locale.h>
#include <ctype.h>
#include <tchar.h>
#include <time.h>
#if !defined(_MSC_VER) && !defined(bool)
#include <stdbool.h>
#endif
#if !defined(WIN32) && !defined(_WIN32) && !defined(__WIN32__)
#error Only run on windows system
#endif

/*************** Macro definitions ***************/
// File size limit (bytes); parenthesized so it expands safely in expressions
#define MAX_FILE_SIZE (1024 * 1024)
// Standard line length
#define BUFF_SIZE 1024
// BOM container length
#define BOMS_SIZE 4
// Encoding detection threshold (bytes)
#define CHECK_SIZE 16383
// Help text
#define HELP_INFORMATION _T("\
wincp v1.1 - Console text codepage conv tool - Copyright (C) 2017-2019 by LEO\n\
Usage: wincp [input_file] -f [code_page] -t [code_page] -s [skip_number] -b[fill_BOM] -o [out_file]\n\
\n\
General options:\n\
  -f  From the code page\n\
  -t  Translate to the code page\n\
  -s  Skip the number of bytes\n\
  -b  Filling BOM\n\
  -o  Output file name\n\
  -h  Show help information\n\
\n\
Official website:\n\
      http://www.bathome.net/thread-44343-1-1.html\n\
")
/*
Microsoft code pages:
  897 - IBM-PC SBCS Japanese (JIS X 0201-1976)
  941 - IBM-PC Japanese DBCS for Open environment
  947 - IBM-PC DBCS for (Big5 encoding)
  950 - Traditional Chinese MIX (Big5 encoding) (1114 + 947) (same with euro: 1370)
 1114 - IBM-PC SBCS (Simplified Chinese; GBK; Traditional Chinese; Big5 encoding)
 1126 - IBM-PC Korean SBCS
 1162 - Windows Thai (Extension of 874; but still called that in Windows)
 1169 - Windows Cyrillic Asian
 1250 - Windows Central Europe
 1251 - Windows Cyrillic
 1252 - Windows Western
 1253 - Windows Greek
 1254 - Windows Turkish
 1255 - Windows Hebrew
 1256 - Windows Arabic
 1257 - Windows Baltic
 1258 - Windows Vietnamese
 1361 - Korean (JOHAB)
 1362 - Korean Hangul DBCS
 1363 - Windows Korean (1126 + 1362) (Windows CP 949)
 1372 - IBM-PC MS T Chinese Big5 encoding (Special for DB2)
 1373 - Windows Traditional Chinese (extension of 950)
 1374 - IBM-PC DB Big5 encoding extension for HKSCS
 1375 - Mixed Big5 encoding extension for HKSCS (intended to match 950)
 1385 - IBM-PC Simplified Chinese DBCS (Growing CS for GB18030, also used for GBK PC-DATA.)
 1386 - IBM-PC Simplified Chinese GBK (1114 + 1385) (Windows CP 936)
 1391 - Simplified Chinese 4 Byte (Growing CS for GB18030, also used for GBK PC-DATA.)
 1392 - IBM-PC Simplified Chinese MIX (1252 + 1385 + 1391)
 ...
*/
// Return codes of the option parser
#define _OPT_TEOF -1
#define _OPT_TILL -2
#define _OPT_TERR -3
// State of the option parser
int OPTIND=1, OPTOPT, UNOPTIND=-1;
TCHAR* OPTARG;
#if defined(_UNICODE) || defined(UNICODE)
#define TCHARFORMAT WCHAR
#else
#define TCHARFORMAT CHAR
#endif
// Pack the first four BOM bytes into a UINT (big-endian order).
// Each byte is widened to unsigned int before shifting to avoid signed overflow.
#define BOM2UINT(x) (unsigned int)(((unsigned int)(unsigned char)(x)[0]<<24)|((unsigned int)(unsigned char)(x)[1]<<16)|((unsigned int)(unsigned char)(x)[2]<<8)|((unsigned int)(unsigned char)(x)[3]))

/*************** Helper functions ***************/
// Check for a pure number.
// Returns -1 for an empty or negative value, 0 if every character is a digit, 1 otherwise.
int _istPositiveNumber(TCHAR* instr)
{
    // Skip leading whitespace
    while(_istspace(*instr))
    {
        instr++;
    }
    // Reject empty values and negative numbers
    if(*instr == _T('\0') || *instr == _T('-'))
    {
        return -1;
    }
    // Consume the digits
    while(_istdigit(*(instr)))
    {
        instr++;
    }
    // All digits only if we reached the end
    return (*instr == _T('\0')) ?0 :1;
}
// Resolve a code page number or alias; returns -1 on failure
int _tgetCP(TCHAR* instr)
{
    // Null pointer
    if(instr == NULL)
    {
        return -1;
    }
    // Return value
    int retCP;
    switch(_istPositiveNumber(instr))
    {
    case -1:
        return -1;
    case  0:
        return _ttoi((TCHARFORMAT*)instr);
    case  1:
        break;
    }
    if     (_tcsicmp(instr, _T("ANSI")   ) ==0)
    {
        retCP=CP_ACP;
    }
    else if(_tcsicmp(instr, _T("GBK")    ) ==0)
    {
        retCP=936;
    }
    else if(_tcsicmp(instr, _T("GB18030")) ==0)
    {
        retCP=54936;
    }
    else if(_tcsicmp(instr, _T("BIG5")   ) ==0)
    {
        retCP=950;
    }
    else if(
        _tcsicmp(instr, _T("UNICODE")  ) ==0 ||
        _tcsicmp(instr, _T("UTF16")    ) ==0 ||
        _tcsicmp(instr, _T("UCS2")     ) ==0
    )
    {
        retCP=1200;
    }
    else if(
        _tcsicmp(instr, _T("UNICODEBE")) ==0 ||
        _tcsicmp(instr, _T("UTF16BE")  ) ==0 ||
        _tcsicmp(instr, _T("UCS2BE")   ) ==0
    )
    {
        retCP=1201;
    }
    else if(_tcsicmp(instr, _T("UTF7")   ) ==0)
    {
        retCP=65000;
    }
    else if(_tcsicmp(instr, _T("UTF8")   ) ==0)
    {
        retCP=65001;
    }
    else if(_tcsicmp(instr, _T("UTF32")  ) ==0)
    {
        retCP=12000;
    }
    else if(_tcsicmp(instr, _T("UTF32BE")) ==0)
    {
        retCP=12001;
    }
    else
    {
        retCP=-1;
    }
    return retCP;
}
// Convert one hex digit to its value; returns -1 if it is not a hex digit
int C2HEX(TCHAR intc)
{
    int hret=-1;
    if     (_T('0')<=intc && intc<=_T('9'))
    {
        hret=intc-48;
    }
    else if(_T('A')<=intc && intc<=_T('F'))
    {
        hret=intc-55;
    }
    else if(_T('a')<=intc && intc<=_T('f'))
    {
        hret=intc-87;
    }
    else
    {
        hret=-1;
    }
    return hret;
}
// Parse a BOM hex string ("0xFFFE", "xFFFE" or "FFFE") into bytes.
// Returns the number of complete bytes stored in tainer (at most BOMS_SIZE).
int TCHARRAY2BIN(TCHAR* instr, BYTE* &tainer)
{
    memset(tainer, 0, BOMS_SIZE);
    if(*instr == _T('x') || *instr == _T('X'))
    {
        instr ++;
    }
    if(_tcsnicmp(instr, _T("0x"), 2) ==0)
    {
        instr += 2;
    }
    int i=-1, hexNUM;
    while(++i<BOMS_SIZE)
    {
        // High nibble
        hexNUM=C2HEX(*instr++);
        if(hexNUM != -1)
        {
            tainer[i] |= (hexNUM<<4);
        }
        else
        {
            break;
        }
        // Low nibble
        hexNUM=C2HEX(*instr++);
        if(hexNUM != -1)
        {
            tainer[i] |= hexNUM;
        }
        else
        {
            break;
        }
    }
    return i;
}
// Option parser (getopt-style).
// The first non-option argument is reported as _OPT_TILL, any further one as _OPT_TERR.
int _tgetopt(int nargc, TCHAR* nargv[], TCHAR* ostr)
{
    static TCHAR* place = (TCHAR*)_T("");
    static TCHAR* lastostr = NULL;
    // Initialized so that the test below never reads an indeterminate pointer
    TCHAR* oli = NULL;
    if(ostr!=lastostr)
    {
        lastostr=ostr;
        place=(TCHAR*)_T("");
    }
    if(!*place)
    {
        if(
            (OPTIND>=nargc)                           ||
            (*(place=nargv[OPTIND]) !=(TCHAR)_T('-')) ||
            (!*(++place))
        )
        {
            if(*place !=(TCHAR)_T('-') && OPTIND <nargc)
            {
                place =(TCHAR*)_T("");
                if(UNOPTIND == -1)
                {
                    UNOPTIND = OPTIND++;
                    return _OPT_TILL;
                }
                else
                {
                    return _OPT_TERR;
                }
            }
            place=(TCHAR*)_T("");
            return _OPT_TEOF;
        }
        if (*place == (TCHAR)_T('-') && *(place+1) == (TCHAR)_T('\0'))
        {
            ++OPTIND;
            return _OPT_TEOF;
        }
    }
    if (
        (OPTOPT=*place++) == (TCHAR)_T(':') ||
        !(oli=(TCHAR*)_tcschr((TCHARFORMAT*)ostr, (TCHAR)OPTOPT))
    )
    {
        if(!*place)
        {
            ++OPTIND;
        }
    }
    if (oli != NULL && *(++oli) !=(TCHAR)_T(':'))
    {
        // Option without an argument
        OPTARG=NULL;
        if(!*place)
        {
            ++OPTIND;
        }
    }
    else
    {
        if(*place)
        {
            // Argument attached to the option, e.g. -fUTF8
            OPTARG=place;
        }
        else if(nargc <= ++OPTIND)
        {
            // Argument missing: clear OPTARG so callers do not see a stale value
            OPTARG=NULL;
            place=(TCHAR*)_T("");
        }
        else
        {
            // Argument in the next element, e.g. -f UTF8
            OPTARG=nargv[OPTIND];
        }
        place=(TCHAR*)_T("");
        ++OPTIND;
    }
    return OPTOPT;
}
// Code page conversion.
// The input buffer must be followed by at least two zero bytes (ConveTextFile guarantees four).
void PageTurnAround(const BYTE* input, int inputSIZE, int inPAGE, int outPAGE, BYTE* &outDATA, int &oLEN)
{
    int wLEN;
    char* outCACHE=NULL;
    wchar_t* wcsCACHE=NULL;
    if(inPAGE == outPAGE)
    {
        outDATA=(BYTE*)input, oLEN=inputSIZE;
        return;
    }
    // UCS-2 LE input: already in wide-character form
    if(inPAGE == 1200)
    {
        wcsCACHE=(wchar_t*)input;
        wLEN=inputSIZE/2+1;
        goto TOMCS;
    }
    // UCS-2 BE input: swap the bytes of every code unit.
    // The loop is driven by the length, so embedded U+0000 does not stop it.
    if(inPAGE == 1201)
    {
        wchar_t* wp=(wchar_t*)input;
        int i;
        for(i=0; i<inputSIZE/2; i++)
        {
            wp[i] = (wchar_t)((((wp[i])&0x00FF)<<8)|(((wp[i])&0xFF00)>>8));
        }
        wcsCACHE=(wchar_t*)input;
        wLEN=inputSIZE/2+1;
        goto TOMCS;
    }
    // Input code page -> UNICODE (intermediate form)
    wLEN=MultiByteToWideChar(inPAGE, 0, (char*)input,-1, NULL, 0);
    if(wLEN <1)
    {
        _ftprintf(stderr, _T("Unable to convert code page\n"));
        exit(1);
    }
    wcsCACHE=(wchar_t*)malloc(wLEN * sizeof(wchar_t));
    if(wcsCACHE == NULL)
    {
        _ftprintf(stderr, _T("Out of memory\n"));
        exit(1);
    }
    MultiByteToWideChar(inPAGE, 0, (char*)input, -1, wcsCACHE, wLEN);
TOMCS:
    // UCS-2 LE output: the intermediate form is the result
    if(outPAGE == 1200)
    {
        outDATA=(BYTE*)wcsCACHE, oLEN=(wLEN-1)*2;
        return;
    }
    // UCS-2 BE output: swap the bytes of every code unit
    if(outPAGE == 1201)
    {
        wchar_t* wp=(wchar_t*)wcsCACHE;
        int i;
        for(i=0; i<wLEN-1; i++)
        {
            wp[i] = (wchar_t)((((wp[i])&0x00FF)<<8)|(((wp[i])&0xFF00)>>8));
        }
        outDATA=(BYTE*)wcsCACHE, oLEN=(wLEN-1)*2;
        return;
    }
    // UNICODE (intermediate form) -> output code page
    int uLEN=WideCharToMultiByte(outPAGE, 0, wcsCACHE, -1, NULL, 0, NULL, NULL);
    if(uLEN <1)
    {
        _ftprintf(stderr, _T("Unable to convert code page\n"));
        exit(1);
    }
    outCACHE=(char*)malloc(uLEN);
    if(outCACHE == NULL)
    {
        _ftprintf(stderr, _T("Out of memory\n"));
        exit(1);
    }
    WideCharToMultiByte(outPAGE, 0, wcsCACHE, -1, outCACHE, uLEN, NULL, NULL);
    outDATA=(BYTE*)outCACHE, oLEN=uLEN-1;
    return;
}
// Core text conversion
bool ConveTextFile(TCHAR* inFILE, TCHAR* outFILE, int inPAGE, int outPAGE, int skipNUMBER, int binBOM_SIZE, BYTE* tainerBOM_BIN)
{
    // Open the input file
    FILE* inFP=_tfopen(inFILE, _T("rb"));
    if(inFP == NULL)
    {
        _ftprintf(stderr, _T("Open input file error\n"));
        exit(1);
    }
    // Get the input file size
    fseek(inFP, 0, SEEK_END);
    int fsize = (int)ftell(inFP);
    if(fsize > MAX_FILE_SIZE)
    {
        _ftprintf(stderr, _T("The input file is too large, can not be greater than %dKB\n"), MAX_FILE_SIZE/1024);
        exit(1);
    }
    // Never skip past the end of a file that is shorter than the BOM
    if(skipNUMBER > fsize)
    {
        skipNUMBER = fsize;
    }
    fseek(inFP, (long)skipNUMBER, SEEK_SET);
    // Allocate the text buffer with room for a 4-byte zero terminator,
    // so that it is NUL-terminated for both narrow and wide input
    int readSIZE=fsize-skipNUMBER;
    BYTE* inDATA=(BYTE*)malloc(readSIZE+4);
    if(inDATA == NULL)
    {
        _ftprintf(stderr, _T("Out of memory\n"));
        exit(1);
    }
    // Read the text into memory
    readSIZE=(int)fread(inDATA, sizeof(BYTE), readSIZE, inFP);
    fclose(inFP);
    memset(inDATA+readSIZE, 0, 4);
    // Convert the code page
    int oLEN=0;
    BYTE* outDATA=NULL;
    PageTurnAround(inDATA, readSIZE, inPAGE, outPAGE, outDATA, oLEN);
    if(oLEN <1)
    {
        return false;
    }
    // Open the output file
    FILE* outFP=_tfopen(outFILE, _T("wb"));
    if(outFP == NULL)
    {
        _ftprintf(stderr, _T("Open output file error\n"));
        exit(1);
    }
    // Write the BOM first, then the converted text
    fwrite(tainerBOM_BIN, sizeof(BYTE), binBOM_SIZE, outFP);
    fwrite(outDATA, sizeof(BYTE), oLEN, outFP);
    fclose(outFP);
    free(inDATA);
    return true;
}
#if !defined(_MSC_VER)
extern "C"
#endif
//************* MAIN entry point *************/
int _tmain(int argc, TCHAR** argv)
{
    if(argc<2)
    {
        // No arguments: print the help text and exit
        _ftprintf(stdout, HELP_INFORMATION);
        return 0;
    }
    // Parameters set from the command line
    TCHAR *opeOUTFILE=NULL,  *opeINFILE=NULL;
    int    opeIN_PAGE=CP_ACP, opeOUT_PAGE=CP_ACP, opeSKIP_NUMBER=0, opeBOM_SIZE=0;
    BYTE   opeFLAG=0x00, *pTAINER=NULL, tainerBOM_BIN[BOMS_SIZE]= {0};
    // Parse the switches
    int K=_OPT_TEOF;
    while( (K=_tgetopt(argc, argv, (TCHAR*)_T("f:t:s:b:o:hF:T:S:B:O:H"))) != _OPT_TEOF)
    {
        switch(K)
        {
        case _T('f'):
        case _T('F'):
            opeIN_PAGE =_tgetCP(OPTARG);
            if(opeIN_PAGE == -1)
            {
                _ftprintf(stderr, _T("The switch '-f' needs a code page number or alias\n"));
                exit(1);
            }
            opeFLAG |= 0x01;
            break;
        case _T('t'):
        case _T('T'):
            opeOUT_PAGE =_tgetCP(OPTARG);
            // Check the output page here (the original tested opeIN_PAGE by mistake)
            if(opeOUT_PAGE == -1)
            {
                _ftprintf(stderr, _T("The switch '-t' needs a code page number or alias\n"));
                exit(1);
            }
            opeFLAG |= 0x02;
            break;
        case _T('s'):
        case _T('S'):
            if(OPTARG == NULL)
            {
                _ftprintf(stderr, _T("The switch '-s' needs a positive number\n"));
                exit(1);
            }
            opeSKIP_NUMBER = _ttoi((TCHARFORMAT*)OPTARG);
            if(! (0<= opeSKIP_NUMBER && opeSKIP_NUMBER <=4 ) )
            {
                _ftprintf(stderr, _T("The switch '-s' needs a number between 0 and 4\n"));
                exit(1);
            }
            opeFLAG |= 0x04;
            break;
        case _T('b'):
        case _T('B'):
            // Accept "0x" plus at most 8 hex digits. The original condition was
            // inverted and rejected valid values such as 0xFFFE.
            if(OPTARG == NULL || _tcslen(OPTARG) > 10)
            {
                _ftprintf(stderr, _T("The switch '-b' needs a hex number of at most 4 bytes\n"));
                exit(1);
            }
            pTAINER=(BYTE*)tainerBOM_BIN;
            opeBOM_SIZE = TCHARRAY2BIN(OPTARG, pTAINER);
            opeFLAG |= 0x08;
            break;
        case _T('o'):
        case _T('O'):
            if(OPTARG != NULL)
            {
                opeFLAG |= 0x10;
                opeOUTFILE = OPTARG;
            }
            break;
        case _T('h'):
        case _T('H'):
            _ftprintf(stdout, HELP_INFORMATION);
            return 0;
        case _OPT_TILL:
            // The first non-option argument is the input file name
            opeINFILE = argv[UNOPTIND];
            break;
        case _OPT_TERR:
            _ftprintf(stderr, _T("Extra parameters \"%s\"\n"), argv[OPTIND]);
            exit(1);
        default:
            _ftprintf(stderr, _T("Unknown switch '-%c'\n"), K);
            exit(1);
        }
    }
    // No input file: give up
    if(opeINFILE == NULL)
    {
        _ftprintf(stderr, _T("Needs input file name\n"));
        exit(1);
    }
    // No output file: overwrite the input file
    if(opeOUTFILE == NULL)
    {
        opeOUTFILE=opeINFILE;
    }
    // No -s switch: detect the BOM of the input and skip it automatically
    if((opeFLAG&0x04) == 0)
    {
        // Separate probe buffer, so a BOM given with -b is not overwritten
        BYTE probeBOM[BOMS_SIZE]= {0};
        FILE* inFP=_tfopen(opeINFILE, _T("rb"));
        if(inFP == NULL)
        {
            _ftprintf(stderr, _T("Open input file error\n"));
            exit(1);
        }
        fread(probeBOM, sizeof(BYTE), BOMS_SIZE, inFP);
        fclose(inFP);
        UINT uBOM_VALUE = BOM2UINT(probeBOM);
        // Match the longest BOMs first
        switch(uBOM_VALUE)
        {
        case 0xFFFE0000:
        case 0x0000FEFF:
        case 0x2B2F7638:
        case 0x2B2F7639:
        case 0x2B2F762B:
        case 0x2B2F762F:
        case 0x84319533:
            opeSKIP_NUMBER = 4;
            break;
        default:
            if(
                (uBOM_VALUE>>16) == 0xFFFE ||
                (uBOM_VALUE>>16) == 0xFEFF
            )
            {
                opeSKIP_NUMBER = 2;
            }
            else if((uBOM_VALUE>>8) == 0xEFBBBF)
            {
                opeSKIP_NUMBER = 3;
            }
            else
            {
                opeSKIP_NUMBER = 0;
            }
            break;
        }
    }
    // No -b switch: pick the BOM that matches the output code page
    if((opeFLAG&0x08) == 0)
    {
        TCHAR* tcsBIN =(TCHAR*)_T("");
        switch(opeOUT_PAGE)
        {
        case 1200:
            tcsBIN =(TCHAR*)_T("0xFFFE");
            break;
        case 1201:
            tcsBIN =(TCHAR*)_T("0xFEFF");
            break;
        case 12000:
            tcsBIN =(TCHAR*)_T("0xFFFE0000");
            break;
        case 12001:
            tcsBIN =(TCHAR*)_T("0x0000FEFF");
            break;
        case 65001:
            tcsBIN =(TCHAR*)_T("0xEFBBBF");
            break;
        case 65000:
            // UTF-7 is code page 65000 (the original had 65007)
            tcsBIN =(TCHAR*)_T("0x2B2F7638");
            break;
        case 54936:
            tcsBIN =(TCHAR*)_T("0x84319533");
            break;
        default:
            break;
        }
        // Fill the BOM buffer
        pTAINER = (BYTE*)tainerBOM_BIN;
        opeBOM_SIZE = TCHARRAY2BIN(tcsBIN, pTAINER);
    }
    // Run the code page conversion
    if(! ConveTextFile(opeINFILE, opeOUTFILE, opeIN_PAGE, opeOUT_PAGE, opeSKIP_NUMBER, opeBOM_SIZE, tainerBOM_BIN))
    {
        _ftprintf(stderr, _T("Convert file error\n"));
        return 1;
    }
    return 0;
}
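
Converting between code pages 1200 and 1201 comes down to swapping the two bytes of every UTF-16 code unit. A byte-oriented, portable variant of that step might look like the sketch below. It is driven by an explicit length rather than a terminating zero, so an embedded U+0000 does not end the loop, and it does not depend on the size of wchar_t. The function name swapUtf16ByteOrder is invented for this illustration.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper: swap the two bytes of every UTF-16 code unit in
// place. A trailing odd byte, if any, is left untouched.
void swapUtf16ByteOrder(unsigned char* data, std::size_t sizeInBytes) {
    for (std::size_t i = 0; i + 1 < sizeInBytes; i += 2) {
        std::swap(data[i], data[i + 1]);
    }
}
```

Applying the function twice restores the original bytes, which makes it usable in both directions (LE to BE and BE to LE).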
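Could you show how to convert all the web pages in a local directory named "小说" (novels) to txt?
I would like it to remove line breaks that are not paragraph breaks. For example, when the page is opened in a browser it shows:

脸
  
色当即羞红了起来
In the page source, the two <br/>  <br/> tags and everything between them should be removed:
"脸<br/>  <br/>色当即羞红了起来" should become "脸色当即羞红了起来"

If <br/>  <br/> has Chinese text on its left and a regular paragraph break on its right (several full-width or half-width spaces), it should not be replaced;
If <br/>  <br/> has Chinese text on both sides, it must be replaced;
If the first character to the left of <br/>  <br/> is one of ,、“:;, it must be replaced;
If the right side of <br/>  <br/> is one of ,。、“”:;!?…, it must be replaced;
If the first character to the left of <br/>  <br/> is one of 。?!”…… and the right side is ……, it should not be replaced;
If the left side of <br/>  <br/> is : and the right side is “, it must be replaced;

Also, sometimes the line breaks are not <br/>  <br/> but </P><P>, <br/><br/>, or
<br/>
<br/>

Finally, extract the title and the content between the <br/> tags, apply custom replacements to the advertising text, and clean up runs of blank lines, so that the result is clean ANSI text.

I have a few test pages here.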


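Reply to 2# 3518228042
I suggest using sed directly; this kind of processing is best done with scripts and regular expressions. C can do it too, of course, but the code would be very tedious.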
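
For a sense of what the compiled-language route involves, here is a deliberately small C++ sketch of only the two simplest rules from the request: a pair of <br/> tags separated by whitespace is dropped, unless the text after it starts with a space, a tab, or a full-width space (treated here as a paragraph indent). The function names are invented, UTF-8 input is assumed, and the punctuation rules and the </P><P> variant are not handled.

```cpp
#include <cstddef>
#include <string>

// True if `s` has an ASCII space, a tab, or a full-width space
// (U+3000, bytes E3 80 80 in UTF-8) at position `pos`.
static bool startsWithIndent(const std::string& s, std::size_t pos) {
    if (pos >= s.size()) return false;
    if (s[pos] == ' ' || s[pos] == '\t') return true;
    return s.compare(pos, 3, "\xE3\x80\x80") == 0;
}

// Hypothetical helper: remove "<br/> whitespace <br/>" when it splits a
// sentence, and keep it when it is followed by a paragraph indent.
std::string joinBrokenLines(const std::string& html) {
    const std::string br = "<br/>";
    std::string out;
    std::size_t i = 0;
    while (i < html.size()) {
        if (html.compare(i, br.size(), br) == 0) {
            // Skip whitespace between the first and a possible second tag.
            std::size_t j = i + br.size();
            while (j < html.size() && (html[j] == ' ' || html[j] == '\t' ||
                                       html[j] == '\r' || html[j] == '\n')) {
                ++j;
            }
            if (html.compare(j, br.size(), br) == 0) {
                const std::size_t after = j + br.size();
                if (startsWithIndent(html, after)) {
                    out.append(html, i, after - i);  // real paragraph break
                }
                i = after;  // either way, continue after the second tag
                continue;
            }
        }
        out.push_back(html[i++]);
    }
    return out;
}
```

Even this reduced version needs byte-level handling of the full-width space, which supports the point that a regular-expression tool is the more comfortable fit for the full rule set.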
