27 Commits

Author SHA1 Message Date
ldy
3e78e9f48e Optimization:
1. added new regular expression format for volume
2. added new strip method for msc
3. deleted blank-space author
4. optimized middle name strip method
5. added new matching pattern for author lists without a table
6. added exception storing for AUTHOR SEARCHING ERROR
Bug fix:
1. error record saving
2023-08-11 14:26:59 +08:00
ldy
69b10a9f72 Merge remote-tracking branch 'origin/main' 2023-08-11 12:44:53 +08:00
XCX
e504e73409 Changed the file path for saving data 2023-08-11 12:22:48 +08:00
XCX
8ea31d08f4 Fix the bug of adding duplicate data 2023-08-11 12:19:55 +08:00
ldy
35f5f2ac5e Optimization:
clustered error files into a folder
2023-08-11 11:42:02 +08:00
ldy
7726650eaa Bug fixed:
ignored blank-space elements in the middle name list
2023-08-10 13:40:26 +08:00
ldy
71e613d994 Optimization:
reduced memory usage
data collection for volume HTML format errors
added elapsed-time monitor
2023-08-10 12:57:28 +08:00
ldy
2c25682f81 Bug Fix:
1. restored the previously broken retry function
New Function:
1. reformatted datetime_transform function to handle more month typos
2. split process_article function into 3 functions to make better use of multiple threads and improve running time
3. updated article URL search technique to handle different volume websites
4. more exception handling
5. improved keywords and affiliation strip method
6. added methods for processing author data when no author table exists
7. added code to retry papers that failed processing
8. more detailed error message storage
2023-08-10 01:15:17 +08:00
XCX
a9c753567c Add code for saving data 2023-08-09 12:22:42 +08:00
XCX
9ee9bc4462 Replace the code for merging data 2023-08-08 22:57:29 +08:00
XCX
73cf15980f Fix the bugs 2023-08-08 22:48:55 +08:00
ldy
49746b779b handled 2 typos in month while formatting date 2023-08-08 13:24:51 +08:00
XCX
1e98615778 A new code for same web data merge00_File_merge 2023-08-06 19:42:43 +08:00
ldy
e9bdb9cdff deleted unnecessary retrying commands 2023-08-03 12:06:22 +08:00
XCX
e49e829682 Fix the saving problem 2023-08-03 12:01:51 +08:00
ldy
2d1f2c504d Update EJDE_spider/ejde_main.py
adjust output datetime format
2023-08-02 11:21:37 +08:00
XCX
2fc3b85bab Corrected the loops; the program no longer adds the same data repeatedly 2023-08-01 19:11:24 +08:00
XCX
01c1a7d978 Changed the code to unify the time format 2023-07-31 18:19:11 +08:00
SHL
ee0f956645 Merge branch 'main' of https://git.ecwuuuuu.com/datamining/CST_scrawlCode 2023-07-27 12:05:50 +08:00
SHL
20cf71530a 10 years of articles 2023-07-27 12:04:37 +08:00
XCX
c1e1e59e05 Update EJQTDE_spider/ejqtde_main.py 2023-07-27 10:30:26 +08:00
XCX
07c334a903 Delete file
The file has been moved to another folder, ProjectEuclid_spider, and the original file has been backed up locally

Signed-off-by: XCX <xcx@jack@ecwuuuuu.com>
2023-07-27 10:28:51 +08:00
XCX
26fed37e17 Modified old code 2023-07-27 10:26:02 +08:00
XCX
cfa9345a79 Added new spider code for math.u-szeged.hu/ejqtde. Modified the code of SpringerOpen_spider 2023-07-26 23:25:30 +08:00
SHL
b2c845dc6e the code of projecteuclid_spider 2023-07-14 20:47:47 +08:00
XCX
d8addf5204 Update the code these weeks 2023-07-14 18:50:36 +08:00
XCX
04806fa367 Initial commit 2023-07-14 18:29:27 +08:00