429 Too many requests #733
Comments
Hi, we are currently reworking the abstract extraction.
Any updates on this, or a workaround? The extraction of German Wikipedia worked only once, in April 2022.
Hi, yes, we have some updates on text extraction. This summer we had a Google Summer of Code project in which a student upgraded the text extraction. It improved: the number of 429 errors went down, although the extraction process still sometimes freezes at some point. All related work is in this branch: https://github.com/dbpedia/extraction-framework/tree/celian-gsoc

During the GSoC project, two new MediaWiki connectors were implemented, both based on the previous one:

https://github.com/dbpedia/extraction-framework/blob/celian-gsoc/core/src/main/scala/org/dbpedia/extraction/util/MediawikiConnectorConfigured.scala - this connector uses the current MediaWiki API that we have always used, with some new configuration options that reduced the number of HTTP 429 errors. However, the extraction sometimes does not complete: after maybe 70-95% of the pages from the dump have been extracted, the process just freezes. (I am not completely sure about these numbers, but when we tested it and compared the result with the datasets from previous releases, the number of extracted pages looked almost the same.) I recommend running the extraction for only one language per process (list just one language in the extraction.text.properties file).

https://github.com/dbpedia/extraction-framework/blob/celian-gsoc/core/src/main/scala/org/dbpedia/extraction/util/MediaWikiConnectorRest.scala - this connector uses the new MediaWiki REST API. It still has the same problem with the process freezing during extraction.
Hi,
I have configured https://github.com/dbpedia/marvin-config to extract German Wikipedia. A first run worked for the 20220401 dump.
Today I ran it again to extract the 20220601 dump, but the extraction framework only partly worked: after some time, https://de.wikipedia.org/w/api.php returned only HTTP 429.
Exception; de; Main Extraction at 00:00.957s for 62 datasets; Main Extraction failed for instance http://de.dbpedia.org/resource/Liste_von_Autoren/J: Server returned HTTP response code: 429 for URL: https://de.wikipedia.org/w/api.php
java.io.IOException: Server returned HTTP response code: 429 for URL: https://de.wikipedia.org/w/api.php
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1902)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
    at org.dbpedia.extraction.util.MediaWikiConnector$$anonfun$retrievePage$1.apply$mcVI$sp(MediaWikiConnector.scala:97)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:166)
    ...
I used the following settings in extractionConfiguration/extraction.de.properties
It seems the extraction-framework does not handle this HTTP error properly. It would be great if the Retry-After HTTP header were used to handle such errors. Any suggestions on which properties to adjust for this problem?
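To illustrate what honoring Retry-After could look like, here is a minimal sketch in plain Java. It is not code from the extraction-framework: the class RetryAfter, the method delay, the 5-minute cap and the exponential fallback are all assumptions made up for this example. It only computes how long to wait after a 429, accepting both forms the header may take (a number of seconds or an HTTP date).

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper for illustration only; not part of the extraction-framework.
final class RetryAfter {
    // Assumed upper bound so a huge header value cannot stall the run indefinitely.
    static final Duration MAX_DELAY = Duration.ofMinutes(5);

    private RetryAfter() {}

    /**
     * Returns how long to wait before retrying a request that got HTTP 429.
     *
     * @param headerValue raw Retry-After header value, or null if absent
     * @param attempt     zero-based retry attempt, used only for the fallback backoff
     * @param now         current time, passed in so the function is easy to test
     */
    static Duration delay(String headerValue, int attempt, Instant now) {
        if (headerValue != null) {
            String value = headerValue.trim();
            // Form 1: a non-negative number of seconds, e.g. "120".
            try {
                long seconds = Long.parseLong(value);
                if (seconds >= 0) {
                    return cap(Duration.ofSeconds(seconds));
                }
            } catch (NumberFormatException ignored) {
                // Not a number; try the date form next.
            }
            // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
            try {
                Instant retryAt = ZonedDateTime
                        .parse(value, DateTimeFormatter.RFC_1123_DATE_TIME)
                        .toInstant();
                Duration wait = Duration.between(now, retryAt);
                return cap(wait.isNegative() ? Duration.ZERO : wait);
            } catch (DateTimeParseException ignored) {
                // Unparseable header; fall through to the backoff below.
            }
        }
        // Fallback (assumption): exponential backoff of 1s, 2s, 4s, ... up to 64s.
        return cap(Duration.ofSeconds(1L << Math.min(Math.max(attempt, 0), 6)));
    }

    private static Duration cap(Duration d) {
        return d.compareTo(MAX_DELAY) > 0 ? MAX_DELAY : d;
    }
}
```

A connector could call this when HttpURLConnection.getResponseCode() returns 429, passing conn.getHeaderField("Retry-After") and Instant.now(), then sleep for the returned duration before retrying. Whether the Wikipedia API always sends the header is not something I have verified, hence the fallback.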